
CESM multi-instance run does NOT move on

I am doing a multi-instance CESM run (F_2000_CAM5) with 20 instances.
At first the job runs without errors, but then everything suddenly stops and does not move on, with no error message. The program is still running, but it produces no output.
This is the last output of the CAM component:

READ_NEXT_TRCDATA ozone
READ_NEXT_TRCDATA BCDEPWET
BCPHODRY BCPHIDRY
OCDEPWET OCPHODRY
OCPHIDRY DSTX01DD
DSTX02DD DSTX03DD
DSTX04DD DSTX01WD
DSTX02WD DSTX03WD
DSTX04WD

I wonder if it is because I chose INTERP_MISSING_MONTHS for aerodep_flx_type in atm_in, and the program cannot handle that many interpolations.
aerodep_flx_file = 'aerosoldep_rcp4.5_monthly_1849-2104_1.9x2.5_c100402.nc'
aerodep_flx_specifier = 'BCDEPWET', 'BCPHODRY', 'BCPHIDRY', 'OCDEPWET', 'OCPHODRY', 'OCPHIDRY', 'DSTX01DD', 'DSTX02DD',
'DSTX03DD', 'DSTX04DD', 'DSTX01WD', 'DSTX02WD', 'DSTX03WD', 'DSTX04WD'
aerodep_flx_type = 'INTERP_MISSING_MONTHS'


The atm log files are attached.
 

eaton

CSEG and Liaisons
Are you using twice as many nodes to run 20 instances as were used in the successful case of 10 instances? 
 
Yes, just twice. For 10 instances, I use 80 tasks on 3 nodes. For 20 instances, I use 160 tasks on 5 nodes. Each node has 32 processors. In the 10-instance case, 16 processors are idle, but in the 20-instance case, all processors on the 5 nodes are in use.
 

eaton

CSEG and Liaisons
You could try using 6 nodes for the 20-instance run to see whether it is a memory issue. I don't see anything in the atm logs to indicate a model problem. Since you can run 10 members successfully, the failure with 20 members would seem to be a system issue.
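
The node arithmetic behind this suggestion can be sketched in a few lines of Python. This is only an illustration using the numbers quoted in the thread (32 processors per node, 80 tasks on 3 nodes, 160 tasks on 5 nodes); the `layout` helper is hypothetical, and it assumes tasks are spread evenly over the nodes and that memory use per task is roughly constant, neither of which is confirmed here.

```python
CORES_PER_NODE = 32  # from the thread: each node has 32 processors


def layout(instances, tasks, nodes, cores_per_node=CORES_PER_NODE):
    """Summarize how MPI tasks fill the allocated nodes (hypothetical helper)."""
    total_cores = nodes * cores_per_node
    return {
        "tasks_per_instance": tasks // instances,
        "idle_cores": total_cores - tasks,
        # Average only; the real placement depends on the batch system.
        "avg_tasks_per_node": tasks / nodes,
    }


ten = layout(instances=10, tasks=80, nodes=3)        # the run that works
twenty_5 = layout(instances=20, tasks=160, nodes=5)  # the run that hangs
twenty_6 = layout(instances=20, tasks=160, nodes=6)  # the suggested test

for name, result in [("10 inst / 3 nodes", ten),
                     ("20 inst / 5 nodes", twenty_5),
                     ("20 inst / 6 nodes", twenty_6)]:
    print(name, result)
```

Under these assumptions, the 5-node layout packs 32 tasks onto every node, while the 6-node layout brings the average back to the same density as the working 10-instance run (about 26.7 tasks per node). If the hang disappears on 6 nodes, that would point toward per-node memory pressure rather than a model problem.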
 