
MPI tag exceeds limit when using >240 MPI tasks

Hello all,

I have ported CESM2.1.0 to ARCHER, the national HPC facility in the UK. This is a Cray XC30, and I am using the Intel compiler suite to build the model. I want to run the FXHIST compset (WACCM-X historical), which works in principle, but only if I use up to 10 nodes (240 MPI tasks, 24 PEs per node). If I use 12 nodes or more, I get this kind of error:

Rank 221 [Mon Feb 18 09:58:29 2019] [c4-1c1s0n2] Fatal error in PMPI_Ibsend: Invalid tag, error stack:
PMPI_Ibsend(208): Invalid tag, value is 2120221

The ARCHER documentation states that the maximum tag value available in the Cray version of MPICH installed on ARCHER is 2097151, so this value is clearly being exceeded by the code, causing the failure. I assume this only occurs when I use more processors because more messages need to be passed between them. However, on other clusters it seems that CESM is being run with many more processors than I was attempting to use.

Is there something specific about the compset I am using which means that I can't use more processors? Or does it look like I'm doing something wrong? With just 10 nodes, the model is not very fast, and it will take me a minimum of 3 months to complete one full simulation. Is there a way to speed things up and get around this problem with the tag value? Any suggestions are welcome!

Thanks,
Ingrid
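As a quick sanity check of the numbers quoted above, the following minimal Python sketch compares the tag from the error message with the ARCHER limit. It only does arithmetic on the two values given in the post; the helper names (`tag_is_valid`, `bits_needed`) are hypothetical and are not part of CESM, ESMF, or MPI. It suggests that the ARCHER limit corresponds to a 21-bit tag field (2^21 - 1), while the rejected tag needs 22 bits.

```python
# Values copied from the error message and the ARCHER documentation quoted in the post.
REPORTED_TAG = 2120221   # tag rejected by PMPI_Ibsend
ARCHER_TAG_UB = 2097151  # maximum tag value stated for Cray MPICH on ARCHER


def tag_is_valid(tag: int, tag_ub: int) -> bool:
    """Return True if the tag lies in the range MPI accepts for a send (0..tag_ub)."""
    return 0 <= tag <= tag_ub


def bits_needed(value: int) -> int:
    """Number of bits required to represent a non-negative integer."""
    return value.bit_length()


if __name__ == "__main__":
    print("tag accepted:", tag_is_valid(REPORTED_TAG, ARCHER_TAG_UB))
    print("limit equals 2**21 - 1:", ARCHER_TAG_UB == 2**21 - 1)
    print("tag exceeds limit by:", REPORTED_TAG - ARCHER_TAG_UB)
    print("bits needed for tag:", bits_needed(REPORTED_TAG))
```

The MPI standard only guarantees that the upper bound on tags (the `MPI_TAG_UB` attribute) is at least 32767, so the actual limit differs between MPI implementations. That would be consistent with the same code running at higher task counts on other clusters.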
 
Just a follow-up comment. When I avoid the use of the ESMF library, by turning off the electrodynamics in the ionosphere (set "-ionosphere wxi" instead of "-ionosphere wxie" and add "-nlev 126" in CAM_CONF_OPTS in env_build.xml) and switching USE_ESMF_LIB from TRUE to FALSE in env_build.xml, I can run successfully with 480 tasks on 20 nodes. So it looks like the issue is arising in the ESMF library.
 