Run aborts at writing output

jshaman

New Member
Hi,

I have CAM 4.0 built and running on a Linux cluster with the Intel compiler. The model runs and outputs the first month's CICE and CLM history and restart files. It also outputs the first month's CAM restart file, but at the CAM history file the model hangs with the following output:

WSHIST: nhfil( 1 )=camrun.cam2.h0.2000-01.nc
Opening netcdf history file camrun.cam2.h0.2000-01.nc
Opened file camrun.cam2.h0.2000-01.nc to write 27
H_DEFINE: Successfully opened netcdf file
Creating new decomp: 255096144
Creating new decomp: 355026096144
Creating new decomp: 355027096144
Creating new decomp: 354026095144
Creating new decomp: 354026096144
print_memusage iam 0 before write restart. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= 99710 38313 3865 3722 0
Opened file camrun.cam2.r.2000-02-01-00000.nc to write 28
print_memusage iam 0 restart init. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= 99710 38321 3871 3722 0
print_memusage iam 0 restart hycoef. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= 99710 38323 3873 3722 0

At that point the run stalls. Eventually the run is killed when it exceeds my allotted wall time on the cluster.

Does anyone recognize this problem?

Thanks,
Jeff
 

eaton

CSEG and Liaisons
It looks like the CAM history file was written successfully, but writing the restart file is where things hung (the message that the restart file was opened is the last one in the log output). This can be a sign that you've run out of memory. To reduce the memory requirements you could try using more MPI tasks spread over more nodes (using more MPI tasks on the same number of nodes probably won't help).
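
A minimal sketch of what that layout change might look like in a batch script, assuming a SLURM-style scheduler (the directives, task counts, and launch line below are placeholders rather than details from your setup; adjust them to your site's batch system and your actual PE layout):

# Current layout (example): 32 MPI tasks packed onto 4 nodes
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
mpirun -np 32 ./cam

# More tasks on more nodes: 64 MPI tasks across 8 nodes, still 8 per node.
# Each task then holds a smaller piece of the decomposition, so the memory
# used on any one node should drop.
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=8
mpirun -np 64 ./cam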

Do the print_memusage lines help determine whether you've exceeded the memory available on a node? I'm not sure what the units are, but the size value of 99710 times the number of tasks on a node should give the total memory use. If the units are KB then that's about 100 MB per task.
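
As a rough sanity check on that arithmetic (assuming the size column really is in KB and using a placeholder tasks-per-node count; neither is confirmed here), something like this in a shell gives the per-node estimate:

size_kb=99710        # "size" column from print_memusage
tasks_per_node=8     # placeholder; use your actual number of tasks per node
echo "approx $((size_kb * tasks_per_node / 1024)) MB used per node"

With 8 tasks per node that works out to roughly 780 MB, which you can then compare against the physical memory on one of your cluster's nodes.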
 