Run aborts at writing output

jshaman

New Member
Hi,

I have CAM 4.0 built and running on a Linux cluster with the Intel compiler. The model runs and outputs the first month's CICE and CLM history and restart files. It also outputs the first month's CAM restart file, but at the CAM history file the model hangs with the following output:

WSHIST: nhfil( 1 )=camrun.cam2.h0.2000-01.nc
Opening netcdf history file camrun.cam2.h0.2000-01.nc
Opened file camrun.cam2.h0.2000-01.nc to write 27
H_DEFINE: Successfully opened netcdf file
Creating new decomp: 255096144
Creating new decomp: 355026096144
Creating new decomp: 355027096144
Creating new decomp: 354026095144
Creating new decomp: 354026096144
print_memusage iam 0 before write restart. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= 99710 38313 3865 3722 0
Opened file camrun.cam2.r.2000-02-01-00000.nc to write 28
print_memusage iam 0 restart init. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= 99710 38321 3871 3722 0
print_memusage iam 0 restart hycoef. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= 99710 38323 3873 3722 0

At that point the run stalls. Eventually the run is killed when it exceeds my allotted wall time on the cluster.

Does anyone recognize this problem?

Thanks,
Jeff
 

eaton

CSEG and Liaisons
It looks like the CAM history file was written successfully, but writing the restart file is where things hung (the message that the restart file was opened is the last one in the log output). This can be a sign that you've run out of memory. To reduce the memory requirements you could try using more MPI tasks spread over more nodes (using more MPI tasks on the same number of nodes probably won't help).
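
A minimal sketch of what that layout change might look like in a batch script, assuming a SLURM-style scheduler (the directives, task counts, and launch line below are placeholders rather than details from your setup; adjust them to your site's batch system and your actual PE layout):

# Current layout (example): 32 MPI tasks packed onto 4 nodes
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
mpirun -np 32 ./cam

# More tasks on more nodes: 64 MPI tasks across 8 nodes, still 8 per node.
# Each task then holds a smaller piece of the decomposition, so the memory
# used on any one node should drop.
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=8
mpirun -np 64 ./cam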

Do the print_memusage lines help determine whether you've exceeded the memory available on a node? I'm not sure what the units are, but the size value of 99710 times the number of tasks on a node should give the total memory use. If the units are KB then that's about 100 MB per task.
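
As a rough sanity check on that arithmetic (assuming the size column really is in KB and using a placeholder tasks-per-node count; neither is confirmed here), something like this in a shell gives the per-node estimate:

size_kb=99710        # "size" column from print_memusage
tasks_per_node=8     # placeholder; use your actual number of tasks per node
echo "approx $((size_kb * tasks_per_node / 1024)) MB used per node"

With 8 tasks per node that works out to roughly 780 MB, which you can then compare against the physical memory on one of your cluster's nodes.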
 