inos@bas_ac_uk
Member
Hello all,

I have ported CESM2.1.0 to ARCHER, the national HPC facility in the UK. This is a Cray XC30, and I am using the Intel compiler suite to build the model. I want to run the FXHIST compset (WACCM-X historical), which works in principle, but only if I use up to 10 nodes (240 MPI tasks, 24 PEs per node). If I use 12 nodes or more, I get this kind of error:

Rank 221 [Mon Feb 18 09:58:29 2019] [c4-1c1s0n2] Fatal error in PMPI_Ibsend: Invalid tag, error stack:
PMPI_Ibsend(208): Invalid tag, value is 2120221

The ARCHER documentation states that the maximum tag value available in the Cray version of MPICH installed on ARCHER is 2097151, so the code is clearly exceeding this value, which causes the failure. I assume this only occurs when I use more processors because more messages need to be passed between them. However, on other clusters CESM seems to be run with many more processors than I was attempting to use.

Is there something specific about the compset I am using which means that I can't use more processors? Or does it look like I'm doing something wrong? With just 10 nodes the model is not very fast, and it will take me a minimum of 3 months to complete one full simulation. Is there a way to speed things up and get around this problem with the tag value? Any suggestions are welcome!

Thanks,
Ingrid
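For reference, the numbers in the error can be checked with a minimal sketch. It assumes only the two values quoted above (the tag from the error message and the limit from the ARCHER documentation). The names `REPORTED_TAG`, `DOCUMENTED_TAG_MAX`, and `tag_is_valid` are illustrative and do not come from CESM or MPICH. On a real system the limit would normally be queried at run time through the `MPI_TAG_UB` attribute rather than hard-coded.

```python
# Values copied from the error message and the ARCHER documentation quoted above.
REPORTED_TAG = 2120221        # tag rejected by PMPI_Ibsend
DOCUMENTED_TAG_MAX = 2097151  # documented maximum tag on ARCHER's Cray MPICH


def tag_is_valid(tag, tag_max):
    """Return True if tag lies in the range 0..tag_max that MPI accepts."""
    return 0 <= tag <= tag_max


if __name__ == "__main__":
    # The documented limit corresponds to a 21-bit tag field.
    print(DOCUMENTED_TAG_MAX == 2**21 - 1)                 # True
    # The reported tag falls outside the accepted range.
    print(tag_is_valid(REPORTED_TAG, DOCUMENTED_TAG_MAX))  # False
    # How far the reported tag overshoots the limit.
    print(REPORTED_TAG - DOCUMENTED_TAG_MAX)               # 23070
```

The overshoot is small compared with the limit itself, which would be consistent with a tag computed from the task count crossing the threshold somewhere between 10 and 12 nodes, although that is only a guess about how the tag is built.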