CAM6 Crashes on Derecho

dharmendraks841

Dharmendra Kumar Singh
Member
Dear CESM Support and CAM6 Users,

I am running CAM6 simulations using meteorological data interpolated from MERRA2 (72 levels) to the CAM6 model grid (32 levels) via vertical interpolation scripts (GeoCAT). My simulations for the years 2005 and 2023 complete successfully using the same setup and interpolation workflow. However, the simulation for 2006 consistently crashes during initialization, and 2024 runs intermittently—sometimes crashing, sometimes succeeding.

For 2006, the model terminates with an ESMF stack trace error involving libesmf.so, without producing any detailed CAM error message or traceback. This suggests the issue may relate to initialization, threading, or corrupted input.

Here’s what I’ve checked so far:

Verified the interpolated meteorological files for 2006 are complete and structurally identical to those from 2005 and 2023.

Ran with --threads 1 and on 20 Derecho nodes to rule out threading or memory issues.

Confirmed that time, dimensions, and variable headers are consistent across all years.

Used the same vertical interpolation method successfully applied to other years.

The crash occurs shortly after reading the PS field during initialization:

INFLD_REAL_2D_2D: read field PS
READ_NEXT_PS: Read meteorological data

This is followed by an abrupt ESMF error with no further logging.

My suspicion is that either 2006 has a corrupted or subtly malformed field (e.g., PS, T, Q), or that ESMF is encountering instability due to memory/threading interactions specific to this year’s input.

Questions for the Community:

Has anyone encountered CAM6 crashes like this tied to a specific year or day, despite similar interpolation workflows working for other years?

Could there be hidden issues in the time axis, fill values, or metadata that escape normal ncdump checks but trip up CAM6?

Are there known ESMF/libesmf.so issues that can be triggered by specific forcing file conditions?
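For the time-axis and fill-value question above, one check that goes beyond reading ncdump headers is to test the values themselves. The following is only a minimal sketch under stated assumptions: the helper names `check_time_axis` and `count_fill_values` are hypothetical, and the inputs are assumed to be plain sequences of floats already read from the file (for example the `time` coordinate and one field flattened to a list). It does not reproduce any check that CAM6 itself performs.

```python
import math


def check_time_axis(times):
    """Return a list of problems found in a sequence of time values.

    Flags NaN entries and any entry that does not strictly increase
    relative to the last valid entry.
    """
    problems = []
    previous = None
    for i, t in enumerate(times):
        if math.isnan(t):
            problems.append(f"index {i}: time is NaN")
            continue
        if previous is not None and t <= previous:
            problems.append(
                f"index {i}: time {t} does not increase past {previous}"
            )
        previous = t
    return problems


def count_fill_values(values, fill_value):
    """Count entries equal to the declared fill value.

    A NaN fill value needs special handling because NaN != NaN.
    """
    if math.isnan(fill_value):
        return sum(1 for v in values if math.isnan(v))
    return sum(1 for v in values if v == fill_value)


if __name__ == "__main__":
    # Illustrative values only, not taken from the 2006 files.
    print(check_time_axis([0.0, 0.125, 0.125, 0.375]))
    print(count_fill_values([101325.0, 1.0e36, 99800.0], 1.0e36))
```

Running the same two checks on a year that works (2005) and on 2006 would show whether the files differ in content even though their headers match.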

Technical Setup Summary:

Model: CESM2.2 with CAM6_3_128

Resolution: f19_f19_mg17

Input Forcing: Vertically interpolated MERRA2

Years tested: 2005 ✅, 2006 ❌, 2023 ✅, 2024 ⚠️ (inconsistent)

Machine: Derecho

Run Options: --threads 1, 20 nodes (initially I used the Derecho default of 4 nodes, then 8, then 10, and finally 20)

Any suggestions or experiences with similar issues would be greatly appreciated.
I’d also be happy to share snippets of my input file headers or logs if that helps others debug.

Thanks in advance for your support and insights.


Please look at the following case directory, including the cesm and atm log files from all of its runs, and advise based on your expertise.
dksingh@derecho3:/glade/derecho/scratch/dksingh/06_control_startup/run> tail cesm.log.9587298.desched1.250521-010938

dec2036.hsn.de.hpc.ucar.edu 935: libesmf.so 0000153A637D4792 enter 2321 ESMCI_VMKernel.C

dec1758.hsn.de.hpc.ucar.edu 675: libesmf.so 0000154D8673FBA8 ESMCI_FTableCallE 824 ESMCI_FTable.C

dec2036.hsn.de.hpc.ucar.edu 935: libesmf.so 0000153A637BDE70 enter 1216 ESMCI_VM.C

dec1758.hsn.de.hpc.ucar.edu 675: libesmf.so 0000154D86BC8792 enter 2321 ESMCI_VMKernel.C

dec2036.hsn.de.hpc.ucar.edu 935: libesmf.so 0000153A6334CF4F c_esmc_ftablecall 981 ESMCI_FTable.C

dec1758.hsn.de.hpc.ucar.edu 675:

dec2036.hsn.de.hpc.ucar.edu 935: libesmf.so 0000153A63E233B8 esmf_compmod_mp_e 1223 ESMF_Comp.F90

dec1758.hsn.de.hpc.ucar.edu 675: Stack trace terminated abnormally.

dec2036.hsn.de.hpc.ucar.edu 935:

dec2036.hsn.de.hpc.ucar.edu 935: Stack trace terminated abnormally.





dksingh@derecho3:/glade/derecho/scratch/dksingh/06_control_startup/run> tail atm.log.9587298.desched1.250521-010938

INFLD_REAL_2D_2D: read field QFLX

INFLD_REAL_2D_2D: read field TAUX

INFLD_REAL_2D_2D: read field TAUY

INFLD_REAL_2D_2D: read field TS

INFLD_REAL_2D_2D: read field SST

INFLD_REAL_2D_2D: read field ICEFRAC

READ_NEXT_METDATA: Read meteorological data

INFLD_REAL_2D_2D: read field PS

INFLD_REAL_2D_2D: read field PS

READ_NEXT_PS: Read meteorological data





dksingh@derecho3:/glade/derecho/scratch/dksingh/06_control_startup/run> tail atm.log.9563515.desched1.250519-092049

2.873563218390813E-002

-----------------------------------

do_press_fix_llnl: dpress_g = 269.731764992269

do_press_fix_llnl: dpress_g = 269.731764992269

nstep, te 761 0.21789444314978194E+10 0.21794113481848497E+10 0.25939115270269181E-01 0.98289772824383763E+05 0.22552395239472389E+03

-----------------------------------

photo_timestep_init: diagnostics

calday, last, next, dels = 16.8541666666667 1 2

2.945402298850567E-002

-----------------------------------

dksingh@derecho3:/glade/derecho/scratch/dksingh/06_control_startup/run> tail atm.log.9525465.desched1.250515-065111

2.873563218390813E-002

-----------------------------------

do_press_fix_llnl: dpress_g = 269.731764992269

do_press_fix_llnl: dpress_g = 269.731764992269

nstep, te 761 0.21789444314978194E+10 0.21794113481848497E+10 0.25939115270269181E-01 0.98289772824383763E+05 0.22552395239472389E+03

-----------------------------------

photo_timestep_init: diagnostics

calday, last, next, dels = 16.8541666666667 1 2

2.945402298850567E-002

-----------------------------------

dksingh@derecho3:/glade/derecho/scratch/dksingh/06_control_startup/run> tail atm.log.9498971.desched1.250512-051102

17.nc

open_met_datafile:

/glade/derecho/scratch/dksingh/2006/interpolated//interp_MERRA2_0.9x1.25_200601

17.nc

INFLD_REAL_2D_2D: read field PS

INFLD_REAL_2D_2D: read field PS

READ_NEXT_PS: Read meteorological data

do_press_fix_llnl: dpress_g = 269.670514201235

do_press_fix_llnl: dpress_g = 269.670514201235

nstep, te 762 0.21789410850100784E+10 0.21794077004771843E+10 0.25922381275663660E-01 0.98289772848527806E+05 0.22552395239472389E+03
 

fvitt

CSEG and Liaisons
Staff member
I think your input files are corrupted. I am seeing NaNs in some of the fields:
NaNf, NaNf, NaNf, NaNf, NaNf, NaNf, NaNf, NaNf, NaNf, NaNf, NaNf...
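A scan like the one below can show which fields and files contain the NaNs. It is a minimal sketch, not a tested tool: the function names `summarize_non_finite` and `scan_met_file` are hypothetical, the field names PS, T, and Q are simply the ones mentioned earlier in this thread, and the file reader assumes the `netCDF4` and `numpy` Python packages are available. Automatic masking is turned off so that fill values are not silently hidden.

```python
import math


def summarize_non_finite(values):
    """Return (nan_count, inf_count, total) for a flat iterable of numbers."""
    nan_count = inf_count = total = 0
    for v in values:
        total += 1
        if math.isnan(v):
            nan_count += 1
        elif math.isinf(v):
            inf_count += 1
    return nan_count, inf_count, total


def scan_met_file(path, fields=("PS", "T", "Q")):
    """Report non-finite counts per field for one NetCDF file.

    Assumption: netCDF4 and numpy are installed. Fields missing from the
    file are reported as None.
    """
    import numpy as np
    from netCDF4 import Dataset

    report = {}
    with Dataset(path) as ds:
        for name in fields:
            if name not in ds.variables:
                report[name] = None
                continue
            var = ds.variables[name]
            var.set_auto_mask(False)  # return raw values, not masked arrays
            # Converting to a list is slow for large 3D fields; it keeps the
            # counting logic in plain Python for this sketch.
            flat = np.asarray(var[:]).ravel().tolist()
            report[name] = summarize_non_finite(flat)
    return report


if __name__ == "__main__":
    # Illustrative values only.
    print(summarize_non_finite([101325.0, float("nan"), 99800.0]))
```

Calling `scan_met_file` on each daily file for 2006 and comparing against a year that runs cleanly should narrow down which day and field first contain NaNs.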
 

dharmendraks841

Dharmendra Kumar Singh
Member
What else could be the cause? The same file structure and interpolation method ran successfully for 2005 and 2023, yet the 2006 run crashed.
 