
CESM livelock when running on multiple nodes

gabriel2029

Gabriel Dengler
New Member
Hello,
I tried to port CESM to Azure CycleCloud, on nodes with Intel Xeon Platinum 8168 processor cores. The software builds fine, and the example case (compset: B1850, res: f19_g17) runs to completion on up to two nodes. Unfortunately, with three or more nodes the application livelocks: the output files (in the scratch directory) are only written up to a specific time and then remain unchanged, although CPU usage stays at 100 percent. This happens with Intel Parallel Studio as well as with GNU/OpenMPI as the toolchain.

All relevant files are in the /mnt/nfs_shares/homes/cesm/ folder. The /mnt/nfs_shares/ directory is an NFS directory that is shared across all nodes. Because I suspected a problem with NFS, I built NetCDF without PnetCDF, but the problem persists. Is there something else I missed?

For completeness, I have attached all important configuration files and scripts as a zip (this time without PnetCDF, because I thought this would help to solve the problem).

Best regards,
Gabriel
 

Attachments

  • CESM-configurations.zip
    5.5 KB · Views: 0

gabriel2029

Gabriel Dengler
New Member
// Update: I moved the scratch folder onto a BeeGFS file system. The same error occurs there, so I think the problem is not NFS-related. Furthermore, the error seems to always happen at the same time for a given configuration (e.g. number of nodes), and it happens earlier when more nodes are used.
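
The symptom described in this thread (files in the scratch directory stop changing while the CPU stays busy) can be checked mechanically. The following is a minimal sketch, not part of CESM: the function names, the `*.log.*` file pattern, and the 30-minute threshold are assumptions chosen for illustration, and the run directory must be supplied by the user.

```python
import time
from pathlib import Path


def seconds_since_last_write(run_dir, pattern="*.log.*", now=None):
    """Return seconds since the newest file matching `pattern` was modified.

    Returns None if no file in `run_dir` matches the pattern.
    """
    now = time.time() if now is None else now
    mtimes = [p.stat().st_mtime for p in Path(run_dir).glob(pattern) if p.is_file()]
    if not mtimes:
        return None
    return now - max(mtimes)


def looks_stalled(run_dir, threshold_s=1800, pattern="*.log.*", now=None):
    """True if no matching file was written within the last `threshold_s` seconds."""
    idle = seconds_since_last_write(run_dir, pattern, now)
    return idle is not None and idle > threshold_s
```

Calling `looks_stalled()` periodically on the run directory while the job is running would show the moment the logs stop advancing, which makes it easier to compare the stall time across different node counts.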
 

gabriel2029

Gabriel Dengler
New Member
// Another update (I would have edited my first post, but unfortunately that is not possible; I hope this doesn't count as spam): I enabled debug output with the ./xmlchange command and found that the log files of glc and cesm contain the following (nearly identical) content:

glc.log.2.201108-033629:
Code:
*******************************************************************************
    Opening file 4-nodes-intel-debug.b.e20.B1850.f19_g17.test.cism.initial_hist.
0001-01-01-00000.nc for output;
      Write output at start of run and every    1.00000000000000       years
   Creating variables internal_time, time, and tstep_count
   Creating variable level
   Creating variable lithoz
   Creating variable staglevel
   Creating variable stagwbndlevel
   Creating variable x0
   Creating variable x1
   Creating variable y0
   Creating variable y1
   Creating variable artm
   Creating variable smb
   Creating variable thk
   Creating variable topg
   Creating variable usurf
*******************************************************************************
    Writing to file 4-nodes-intel-debug.b.e20.B1850.f19_g17.test.cism.initial_hi
st.0001-01-01-00000.nc at time   0.000000000000000E+000

cesm.log.2.201108-033629:
Code:
    Opening file 4-nodes-intel-debug.b.e20.B1850.f19_g17.test.cism.initial_hist.
0001-01-01-00000.nc for output;
      Write output at start of run and every    1.00000000000000       years
   Creating variables internal_time, time, and tstep_count
   Creating variable level
   Creating variable lithoz
   Creating variable staglevel
   Creating variable stagwbndlevel
   Creating variable x0
   Creating variable x1
   Creating variable y0
   Creating variable y1
   Creating variable artm
   Creating variable smb
   Creating variable thk
   Creating variable topg
   Creating variable usurf
    Writing to file 4-nodes-intel-debug.b.e20.B1850.f19_g17.test.cism.initial_hi
st.0001-01-01-00000.nc at time   0.000000000000000E+000
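
To see where each component stopped, as in the two excerpts above, the last non-blank line of every log can be collected in one pass. This is a rough sketch under assumptions: the helper names and the `*.log.*` pattern are illustrative and not part of CESM, and the logs are assumed to be plain text (not compressed).

```python
from pathlib import Path


def last_nonblank_line(path):
    """Return the last non-blank line of a text file, or '' if there is none."""
    last = ""
    with open(path, errors="replace") as fh:
        for line in fh:
            if line.strip():
                last = line.rstrip("\n")
    return last


def where_logs_stopped(run_dir, pattern="*.log.*"):
    """Map each matching log file name to its last non-blank line."""
    return {
        p.name: last_nonblank_line(p)
        for p in sorted(Path(run_dir).glob(pattern))
        if p.is_file()
    }
```

If every component log ends at the same write call, that points at the shared step rather than at one component.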
 

jedwards

CSEG and Liaisons
Staff member
We have no experience with the system you are trying to use. In my opinion, you should start by trying to run a simple case, such as an X or A compset, then move to more complicated cases as you solve issues. The B compset is the most complicated configuration of CESM, so it is not a good place to start.
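
As an illustration of this step-by-step approach, the sketch below prints one `create_newcase` command per compset, from simplest to most complex. It only builds command strings and does not run anything. The case-name prefix `porttest` and the helper name are hypothetical; the resolution f19_g17 and the B1850 compset are taken from the original post. Whether each compset/resolution pair is supported on a given machine is not checked here, and extra options may be needed.

```python
import shlex


def newcase_commands(compsets, res="f19_g17", case_prefix="porttest"):
    """Build one create_newcase command line per compset (nothing is executed)."""
    cmds = []
    for compset in compsets:
        # Hypothetical case naming scheme: <prefix>.<compset>.<resolution>
        case = f"{case_prefix}.{compset}.{res}"
        cmds.append(shlex.join(
            ["./create_newcase", "--case", case, "--compset", compset, "--res", res]
        ))
    return cmds


# Simplest first: X and A as suggested, then the B1850 case that hangs.
for cmd in newcase_commands(["X", "A", "B1850"]):
    print(cmd)
```

Running each case on two, three, and four nodes in turn would show at which level of complexity the multi-node hang first appears.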
 