CESM multi-instance run hangs without any error

I am doing a multi-instance CESM run (F_2000_CAM5) with 20 instances.
At first the job runs fine without errors, but then everything suddenly stops and does not move on, and no error is reported. The program is still running, but it produces no more output.
This is the last output of the cam component:

READ_NEXT_TRCDATA ozone
READ_NEXT_TRCDATA BCDEPWET
BCPHODRY BCPHIDRY
OCDEPWET OCPHODRY
OCPHIDRY DSTX01DD
DSTX02DD DSTX03DD
DSTX04DD DSTX01WD
DSTX02WD DSTX03WD
DSTX04WD

I wonder if it is because I chose INTERP_MISSING_MONTHS for aerodep_flx_type in atm_in, and the program cannot handle so many interpolations.
aerodep_flx_file = 'aerosoldep_rcp4.5_monthly_1849-2104_1.9x2.5_c100402.nc'
aerodep_flx_specifier = 'BCDEPWET', 'BCPHODRY', 'BCPHIDRY', 'OCDEPWET', 'OCPHODRY', 'OCPHIDRY', 'DSTX01DD', 'DSTX02DD',
'DSTX03DD', 'DSTX04DD', 'DSTX01WD', 'DSTX02WD', 'DSTX03WD', 'DSTX04WD'
aerodep_flx_type = 'INTERP_MISSING_MONTHS'


The atm log files are attached.
 

eaton

CSEG and Liaisons
Are you using twice as many nodes to run 20 instances as were used in the successful case of 10 instances? 
 
Yes, just twice. For 10 instances, I use 80 tasks on 3 nodes. For 20 instances, I use 160 tasks on 5 nodes. Each node has 32 processors, so in the 10-instance case 16 processors are idle, but in the 20-instance case all processors on the 5 nodes are in use.
 

eaton

CSEG and Liaisons
You could try using 6 nodes for the 20-instance run to see whether it is a memory issue. I don't see anything in the atm logs to indicate a model problem. Since you can run 10 members successfully, the failure with 20 members would seem to be a system issue.
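
To make the node arithmetic in this thread concrete, here is a minimal Python sketch that compares the three layouts discussed (80 tasks on 3 nodes, 160 tasks on 5 nodes, and the suggested 160 tasks on 6 nodes). The helper `node_usage` is hypothetical and not part of CESM. It assumes one task per processor and tasks spread evenly across nodes; the actual placement depends on the batch system.

```python
def node_usage(tasks, nodes, cores_per_node=32):
    """Return capacity, idle cores, and average tasks per node for a layout.

    Assumes one MPI task per core and an even spread of tasks over nodes
    (an assumption; real placement is decided by the batch system).
    """
    capacity = nodes * cores_per_node
    if tasks > capacity:
        raise ValueError("more tasks than available cores")
    return {
        "capacity": capacity,
        "idle": capacity - tasks,
        "tasks_per_node": tasks / nodes,
    }


# Layouts taken from the posts above; 6 nodes is the suggested test.
layouts = {
    "10 instances, 3 nodes": node_usage(80, 3),
    "20 instances, 5 nodes": node_usage(160, 5),
    "20 instances, 6 nodes": node_usage(160, 6),
}

for name, use in layouts.items():
    print(f"{name}: {use['idle']} idle cores, "
          f"{use['tasks_per_node']:.1f} tasks per node")
```

Under these assumptions, the 6-node layout brings the average back to about 26.7 tasks per node, the same as in the working 10-instance case, so each task's share of node memory would be roughly 20% larger than on 5 fully packed nodes. Whether that is enough depends on the memory per node, which is not stated in this thread.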
 