
B cases hanging during initialization on Pleiades

jamiller

J Miller
New Member
Hello,

I'm running CESM v2.1.5 on NASA's Pleiades machine. The F1850 compset runs fine as expected, but for the B1850 compset, the run seems to hang near the end of the initialization. None of the log files show an error message. Depending on the number of cores, sometimes it hangs on the first run, and sometimes it hangs when starting a resubmitted run.

There's another forum post describing a similar issue. In that post, the user solved the problem by updating their compiler. I'm running the latest version of my compiler (comp-intel/2023.2.1), but the issue persists.

My guess is that the default configuration for the modules/compiler that Pleiades uses has gotten out of date, but I'm rather new to this. Does anyone have an env_mach_specific.xml file from a successful B run on Pleiades that I can use to check module versions?

Attached are the cesm log (with the middle trimmed out to fit the file size limit), cpl log, and my env_mach_specific.xml file. Thanks!
 

Attachments

  • env_mach_specific.xml.txt (2.1 KB)
  • cpl.log.txt (83.8 KB)
  • cesm.log_part2.txt (634 KB)
  • cesm.log_part1.txt (367.3 KB)

jedwards

CSEG and Liaisons
Staff member
MPT: Received signal 15

I believe this indicates that you've run out of memory. You could try a larger pelayout (more tasks) or a different version of the mpt library, if one is available.
 

jamiller

J Miller
New Member
Thanks for the response. I am trying a few different versions of the mpt library (mpi-hpe/mpt.2.30 and 2.28), but so far they haven't made any difference.

The "MPT: Received signal 15" message comes when the run is killed - the run hangs just before this line.

I've tried setting up the B1850 case (f19_g17 resolution) on Pleiades-ivy with 960 cores (48 nodes), where each node has 64 GB of memory. Surely that should be enough.

Do you have any suggestions for a different pelayout?
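
As a rough check on the memory estimate above, here is a minimal sketch of the per-rank arithmetic. It uses only the figures quoted in this post (960 cores, 48 nodes, 64 GB per node) and assumes one MPI rank per core with no threading, which the post does not state.

```python
# Back-of-the-envelope memory budget, using only the figures quoted in the post.
# Assumption (not stated in the post): one MPI rank per core, no threading.
total_cores = 960
nodes = 48
mem_per_node_gb = 64

ranks_per_node = total_cores // nodes               # ranks sharing one node's memory
mem_per_rank_gb = mem_per_node_gb / ranks_per_node  # memory budget per rank
total_mem_gb = nodes * mem_per_node_gb              # aggregate memory across the job

print(f"{ranks_per_node} ranks/node, {mem_per_rank_gb:.1f} GB/rank, {total_mem_gb} GB total")
```

The aggregate figure is large, but the per-rank budget is the number that matters if a single rank's allocation grows during initialization. Whether it is sufficient here depends on how the components share those ranks, which this sketch does not model.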
 

jamiller

J Miller
New Member
I've used the default layout, which gives every component 960 cores, though NTASKS_WAV had to be capped at 600. This setup hangs on the initial run.
./xmlquery NTASKS
NTASKS: ['CPL:960', 'ATM:960', 'LND:960', 'ICE:960', 'OCN:960', 'ROF:960', 'GLC:960', 'WAV:600', 'ESP:960']
I've also tried 600 cores for all components, which does the initial run fine but hangs on the resubmitted run.
Giving it 1200 cores seems to break the ocean model during initialization, and gives the error:
POP aborting...
(init_moc_ts_transport_arrays) SH is not a regular lat-lon grid. The southern boundary for region 2 ("Atlantic") cannot be specified.
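
When comparing several layouts like the ones above, it can help to turn the xmlquery output into a dictionary. The sketch below is only an illustration: `parse_ntasks` is a hypothetical helper, and the input format is assumed to match the NTASKS line quoted in this post.

```python
import ast

def parse_ntasks(line):
    """Parse a line like "NTASKS: ['CPL:960', ...]" into {component: tasks}.

    Hypothetical helper; assumes the format shown in the post above.
    """
    # Split off the "NTASKS" label at the first colon; the rest is a Python-style list.
    _, _, payload = line.partition(":")
    entries = ast.literal_eval(payload.strip())
    return {name: int(count) for name, count in (e.split(":") for e in entries)}

line = ("NTASKS: ['CPL:960', 'ATM:960', 'LND:960', 'ICE:960', 'OCN:960', "
        "'ROF:960', 'GLC:960', 'WAV:600', 'ESP:960']")
layout = parse_ntasks(line)

# Components whose task count differs from the largest one in the layout.
largest = max(layout.values())
odd_ones = {comp: n for comp, n in layout.items() if n != largest}
print(odd_ones)
```

For the default layout quoted above, the only component that differs from the rest is WAV at 600 tasks.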
 