Memory error on hopper

Wouldn't you know it? I have one more quarter-degree fvCAM5 simulation to make using cesm1_0_3 on hopper.nersc.gov. Since the OS upgrade, I get the following error at more or less random times during the first month of integration:

ccsm.log.150406-024752:[NID 05064] 2015-04-06 04:26:54 Apid 49402235: OOM killer terminated this process.

This means that the code has run out of memory. I never had this error before the OS upgrade, across hundreds of similar jobs. In fact, I have one case, compiled prior to the OS upgrade, that still runs after the upgrade with no such error. I think there are two possibilities:

a) The new compiler has a memory leak.
or
b) The new compiler makes a bigger code.

What are my options?

Regards,
Michael
 

jedwards

CSEG and Liaisons
Staff member
Hi Michael,

See if you can figure out which task or tasks are giving the OOM error. If you are running a B case, giving more tasks to POP might help; if you are running an F case, you might try changing the PIO_STRIDE value to spread out the I/O a little. It might also be a problem that we've already fixed in CESM 1.0.5.

Also, the coupler log prints out the memory usage each time it prints dt (usually once every 24 hours). If there is a memory leak, you should be able to see it in the cpl log.
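As a rough aid for the first suggestion, here is a minimal Python sketch that scans ccsm.log lines for OOM-killer messages and counts them per node. It assumes the message format matches the line quoted in the original post; the function names (`find_oom_events`, `count_by_node`) are made up for illustration and are not part of CESM. Note that it only reports the node ID (NID); mapping a node back to specific MPI tasks depends on the job layout and is not covered here.

```python
import re
from collections import Counter

# Assumed format, taken from the log line quoted in this thread:
# [NID 05064] 2015-04-06 04:26:54 Apid 49402235: OOM killer terminated this process.
OOM_PATTERN = re.compile(
    r"\[NID (\d+)\]\s+(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"Apid (\d+): OOM killer terminated"
)


def find_oom_events(lines):
    """Return a list of (nid, timestamp, apid) for each OOM-killer line."""
    events = []
    for line in lines:
        match = OOM_PATTERN.search(line)
        if match:
            events.append((match.group(1), match.group(2), match.group(3)))
    return events


def count_by_node(events):
    """Count OOM events per node ID."""
    return Counter(nid for nid, _, _ in events)


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): python find_oom.py ccsm.log.150406-024752
    with open(sys.argv[1]) as log_file:
        oom_events = find_oom_events(log_file)
    for nid, count in sorted(count_by_node(oom_events).items()):
        print(f"NID {nid}: {count} OOM event(s)")
```

If the same node or small set of nodes shows up across several failed runs, that would hint at a particular task (for example an I/O task) being the one that runs out of memory.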
 
It does not appear to be a leak. The cpl log prints statements like

memory_write: model date =   170314       0 memory =     569.22 MB (highwater)       4854.35 MB (usage)  (pe=    0 comps= cpl ocn atm lnd ice glc)
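For anyone wanting to check this more systematically, the following is a small Python sketch that pulls the memory_write entries out of a cpl log and reports how much the two numbers change between the first and last entry. It assumes every entry looks like the line quoted above; the function names (`parse_memory_lines`, `memory_growth`) are hypothetical helpers, not CESM tools.

```python
import re

# Assumed format, taken from the cpl log line quoted in this thread.
MEM_PATTERN = re.compile(
    r"memory_write: model date =\s*(\d+)\s+(\d+)\s+memory =\s*"
    r"([\d.]+) MB \(highwater\)\s+([\d.]+) MB \(usage\)"
)


def parse_memory_lines(lines):
    """Return (date, seconds, highwater_mb, usage_mb) for each memory_write line."""
    records = []
    for line in lines:
        match = MEM_PATTERN.search(line)
        if match:
            records.append(
                (
                    int(match.group(1)),
                    int(match.group(2)),
                    float(match.group(3)),
                    float(match.group(4)),
                )
            )
    return records


def memory_growth(records):
    """Return (highwater change, usage change) in MB from first to last record."""
    if len(records) < 2:
        return (0.0, 0.0)
    first, last = records[0], records[-1]
    return (last[2] - first[2], last[3] - first[3])


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): python cpl_memory.py <path to cpl log>
    with open(sys.argv[1]) as log_file:
        mem_records = parse_memory_lines(log_file)
    d_high, d_usage = memory_growth(mem_records)
    print(f"{len(mem_records)} entries; highwater change {d_high:.2f} MB, "
          f"usage change {d_usage:.2f} MB")
```

Steady growth from one model day to the next would point toward a leak, while roughly flat numbers, as reported here, would not. Keep in mind the entry is for pe 0 only, so it says nothing about memory on other tasks.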

It is an F case, i.e.

./create_newcase -case $CASEROOT -compset F_1850_CAM5 -res f02_f02 -mach hopp2 -din_loc_root_csmdata $SCRATCH/cam5_input_netcdf_files

So I upped ATM_PIO_NUMTASKS from 39 to 50. Do you have a recommended value?

m
 