Port problem: model hangs after finishing initialization for B1850

QINKONG

QINQIN KONG
Member
This looks an awful lot like a bug that was in early versions of openmpi but has since been fixed. That reminds me that there was a similar bug in older versions of impi. According to your config_machines.xml file, you have some really old system software. Can you get some newer versions?
Besides the compiler and MPI library, netcdf, hdf5 and pnetcdf are also very old.
You are right! I changed the config_machines.xml file to use the newest versions of the Intel compiler, netcdf and hdf5 (openmpi 3.1.4 and pnetcdf 1.10.0 are already the newest versions in our module system, though).
After that, openmpi works perfectly fine with both the I and B cases. impi works fine with the I case but fails with the B case, with the following error message in the cesm.log file. It still seems to be an MPI problem. Any idea how to resolve this?

The updated config_machines.xml, config_compilers.xml, cesm.log and cpl.log files for the impi case are attached.

Thanks!


Abort(740945679) on node 91 (rank 91 in comm 0): Fatal error in PMPI_Startall: Other MPI error, error stack:
PMPI_Startall(144)........: MPI_Startall(count=3, req_array=0x2ba2a6ebc100) failed
MPID_Startall(78).........:
MPID_Isend(345)...........:
MPIDI_OFI_send_normal(376): Out of memory (unable to allocate a 'Send Pack buffer alloc')
.
.
.

 


ykp990521

Member
Hi Jedwards, thanks for the reply. The resolution is f19_g17, with NTASKS=192 for all components. I also tried increasing NTASKS to 240 for all components with the FHIST compset, but the model still hangs at the same place. On the HPC cluster I'm using, the memory is about 90 GB per 24 cores, so the total available memory should be around 900 GB.
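For reference, here is a rough sketch of the memory arithmetic behind that estimate. It assumes one MPI task per core and fully packed 24-core nodes with about 90 GB each; the function name and defaults are only illustrative and are not part of CESM or CIME.

```python
import math

def total_memory_gb(ntasks, cores_per_node=24, mem_per_node_gb=90):
    """Estimate aggregate memory for a job, assuming one MPI task per core
    and whole nodes allocated (both are assumptions, not CESM settings)."""
    nodes = math.ceil(ntasks / cores_per_node)
    return nodes * mem_per_node_gb

# NTASKS values mentioned in this thread
for ntasks in (192, 240):
    print(ntasks, "tasks ->", total_memory_gb(ntasks), "GB")
```

Under these assumptions the 240-task layout spans 10 nodes, which matches the roughly 900 GB figure, while the 192-task layout spans 8 nodes and has less aggregate memory.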

I will try f45_g37 first (although it is not scientifically supported for B1850) to see if it works.

Thanks for the help!
Does T45_g37 work for any compset in 2.1.3? Thanks a lot!
 