
Port problem: model hangs after finishing initialization for B1850

QINKONG

QINQIN KONG
Member
This looks an awful lot like a bug that was in early versions of openmpi but has since been solved. Which reminds me that there was a similar bug in older versions of impi - according to your config_machines.xml file you have some really old system software. Can you get some newer versions?
Besides the compiler and MPI library, netcdf, hdf5, and pnetcdf are also very old.
You are right! I changed the config_machines.xml file to use the newest versions of the Intel compiler, netcdf, and hdf5 (openmpi 3.1.4 and pnetcdf 1.10.0 are already the newest versions in our module system, though).
After that, openmpi works perfectly fine with both the I and B cases. impi works fine with the I case but fails with the B case, with the following error message in the cesm.log file. It still seems to be an MPI problem. Any idea how to resolve this?

The updated config_machines.xml and config_compilers.xml files, along with the cesm.log and cpl.log files for the impi case, are attached.

Thanks!


Abort(740945679) on node 91 (rank 91 in comm 0): Fatal error in PMPI_Startall: Other MPI error, error stack:
PMPI_Startall(144)........: MPI_Startall(count=3, req_array=0x2ba2a6ebc100) failed
MPID_Startall(78).........:
MPID_Isend(345)...........:
MPIDI_OFI_send_normal(376): Out of memory (unable to allocate a 'Send Pack buffer alloc')
.
.
.
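
When a cesm.log contains many interleaved error stacks, it can help to pull out which ranks aborted and the innermost message of each stack. The snippet below is only a rough sketch: the helper names (`summarize_aborts`, `innermost_message`) are made up for illustration, the regular expressions are fitted to the excerpt quoted above, and real logs may carry extra prefixes or interleaving that this does not handle.

```python
import re

# Sample copied from the error stack quoted above. In practice the text
# would be read from the cesm.log file of the failing run.
LOG_TEXT = """\
Abort(740945679) on node 91 (rank 91 in comm 0): Fatal error in PMPI_Startall: Other MPI error, error stack:
PMPI_Startall(144)........: MPI_Startall(count=3, req_array=0x2ba2a6ebc100) failed
MPID_Startall(78).........:
MPID_Isend(345)...........:
MPIDI_OFI_send_normal(376): Out of memory (unable to allocate a 'Send Pack buffer alloc')
"""

# Patterns are assumptions based on the excerpt, not a documented log format.
ABORT_RE = re.compile(r"Abort\((\d+)\) on node (\d+) \(rank (\d+) in comm (\d+)\): (.*)")
FRAME_RE = re.compile(r"^(\w+)\((\d+)\)\.*:\s*(.*)$")


def summarize_aborts(text):
    """Return one dict per 'Abort(...)' line, with the stack frames that follow it."""
    aborts = []
    for line in text.splitlines():
        abort = ABORT_RE.match(line)
        if abort:
            aborts.append({
                "rank": int(abort.group(3)),
                "headline": abort.group(5),
                "frames": [],
            })
            continue
        frame = FRAME_RE.match(line)
        if frame and aborts:
            # Store (function name, message); the message may be empty.
            aborts[-1]["frames"].append((frame.group(1), frame.group(3)))
    return aborts


def innermost_message(abort):
    """Return the last non-empty frame message, usually the root cause."""
    for name, message in reversed(abort["frames"]):
        if message:
            return name, message
    return None


if __name__ == "__main__":
    for abort in summarize_aborts(LOG_TEXT):
        print("rank", abort["rank"], "->", innermost_message(abort))
```

For the excerpt above, this points at rank 91 and the `MPIDI_OFI_send_normal` frame, where the send pack buffer allocation failed.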

 

Attachments

  • cesm.log.10172394.210219-143816.txt (54 KB)
  • config_compilers.xml.txt (41.5 KB)
  • config_machines.xml.txt (109.5 KB)
  • cpl.log.10172394.210219-143816.txt (65.9 KB)

ykp990521
Member
Hi Jedwards, thanks for the reply. The resolution is f19_g17, with NTASKS=192 for all components. I also increased NTASKS to 240 for all components for another try with the FHIST compset, but the model still hangs at the same place. On the HPC cluster I'm using, there is about 90 GB of memory per 24 cores, so with 240 tasks the total available memory should be around 900 GB.
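
The memory estimate above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes each group of 24 cores is one node with about 90 GB, one MPI task per core, and no threading; the function name `memory_budget` is made up for illustration.

```python
import math


def memory_budget(ntasks, cores_per_node=24, mem_per_node_gb=90.0):
    """Estimate node count, total memory, and memory per task.

    Assumes one MPI task per core and that every node is fully used
    except possibly the last one.
    """
    nodes = math.ceil(ntasks / cores_per_node)
    total_gb = nodes * mem_per_node_gb
    per_task_gb = total_gb / ntasks
    return nodes, total_gb, per_task_gb


if __name__ == "__main__":
    for ntasks in (192, 240):
        nodes, total_gb, per_task_gb = memory_budget(ntasks)
        print(f"NTASKS={ntasks}: {nodes} nodes, {total_gb:.0f} GB total, "
              f"{per_task_gb:.2f} GB per task")
```

Under these assumptions, going from 192 to 240 tasks raises the total from 720 GB to 900 GB, but the memory available to each task stays at 3.75 GB.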

I will try f45_g37 first (although this is not scientifically supported for B1850) to see if it works.

Thanks for the help!
Does T45_g37 work for any compset in 2.1.3? Thanks a lot!
 