Dear CESM support team,
I am porting CESM2.1.5 to a new machine and am encountering a persistent runtime failure in a BHIST test case. I would appreciate any advice on how to diagnose this further.
Model/case information:
- CESM version: CESM2.1.5
- Compset / grid: BHIST, f09_g17
- Test: SMS_Lm2.f09_g17.BHIST
- Compiler: Intel compiler 2021.4
- MPI: MPICH 4.0.1 built with Intel compiler
- PIO version: PIO1
- PIO_TYPENAME: netcdf for all components
- Run type: hybrid
- Reference case: b.e21.BHIST.f09_g17.CMIP6-historical.001
- Reference date: 1940-01-01
- Custom CLM inputs:
  - fsurdat: custom 0.9x1.25 historical surface dataset for simyr 1940
  - flanduse_timeseries: custom historical land-use timeseries
  - finidat: CLM restart from the 1940-01-01 reference case
- CLM options:
  - check_dynpft_consistency = .true.
  - init_interp_method = 'general'
The model initializes successfully. CLM reads the restart and custom surface/land-use data without reporting a fatal error. POP also reads the ocean restart and overflow restart successfully. The case runs through January 1940 and consistently fails at the first-month boundary, around 1940-02-01.
The fatal error is:
Abort(738894862) on node 97 (rank 97 in comm 0): Fatal error in internal_Wait: Message truncated, error stack:
internal_Wait(89): MPI_Wait(request=0x7ffff549ba1c, status=0x10720ea0) failed
MPIR_Wait(911)...:
(unknown)(): Message truncated
Abort(738894862) on node 49 (rank 49 in comm 0): Fatal error in internal_Wait: Message truncated
Abort(604677134) on node 1 (rank 1 in comm 0): Fatal error in internal_Wait: Message truncated
Abort(537568270) on node 145 (rank 145 in comm 0): Fatal error in internal_Wait: Message truncated
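To make the set of failing ranks explicit, the abort lines above can be parsed with a short script. This is only a sketch: the regular expression is written against the MPICH message format pasted here, and the function name `failing_ranks` is a hypothetical helper, not part of CESM or MPICH.

```python
import re

# Pattern written against the MPICH abort lines pasted above (an assumption
# about the format; other MPI libraries print different messages).
ABORT_RE = re.compile(r"Abort\((\d+)\) on node (\d+) \(rank (\d+) in comm (\d+)\)")

SAMPLE_LOG = """\
Abort(738894862) on node 97 (rank 97 in comm 0): Fatal error in internal_Wait: Message truncated, error stack:
Abort(738894862) on node 49 (rank 49 in comm 0): Fatal error in internal_Wait: Message truncated
Abort(604677134) on node 1 (rank 1 in comm 0): Fatal error in internal_Wait: Message truncated
Abort(537568270) on node 145 (rank 145 in comm 0): Fatal error in internal_Wait: Message truncated
"""


def failing_ranks(log_text):
    """Return the sorted, de-duplicated global ranks named in abort lines."""
    ranks = {int(m.group(3)) for m in ABORT_RE.finditer(log_text)}
    return sorted(ranks)


if __name__ == "__main__":
    print(failing_ranks(SAMPLE_LOG))
```

For the four lines shown, this yields ranks 1, 49, 97 and 145, which are spaced 48 apart. The full log may contain abort messages from additional ranks that are not pasted here.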
At the time of the failure, the component logs show that the model is at the first monthly history output/month transition:
ATM log:
nstep, te 1488 ...
chem_surfvals_set: ncdate=19400201 co2vmr=3.111476480501564E-004
READ_NEXT_TRCDATA ac_CO2
LND log:
clm: completed timestep 1488
hist_htapes_wrapup : Creating history file
...clm2.h0.1940-01.nc at nstep = 1488
htape_create : Successfully defined netcdf history file 1
hist_htapes_wrapup : Writing current time sample to local history file
...clm2.h0.1940-01.nc at nstep = 1488
OCN log:
Local Time- and Space-Averages for Nino Regions: 00:00:00 1940-02-01
(io_pio_init) create file
...pop.h.1940-01.nc
ICE log:
(ice_pio_wopen) create file
...cice.h.1940-01.nc
Finished writing
...cice.h.1940-01.nc
The PE layout is:
GLOBAL: 432 pes
CPL: 384 pes, pelist 0-383
ATM: 384 pes, pelist 0-383
LND: 192 pes, pelist 0-191
ICE: 192 pes, pelist 192-383
OCN: 48 pes, pelist 384-431
ROF: 192 pes, pelist 0-191
GLC: 384 pes, pelist 0-383
WAV: 96 pes, pelist 0-95
ESP: 1 pe, pelist 0
The PIO settings are:
PIO_TYPENAME: ['CPL:netcdf', 'ATM:netcdf', 'LND:netcdf', 'ICE:netcdf', 'OCN:netcdf', 'ROF:netcdf', 'GLC:netcdf', 'WAV:netcdf', 'ESP:netcdf']
PIO_NUMTASKS: ['CPL:-99', 'ATM:-99', 'LND:-99', 'ICE:-99', 'OCN:-99', 'ROF:-99', 'GLC:-99', 'WAV:-99', 'ESP:-99']
PIO_STRIDE: ['CPL:48', 'ATM:48', 'LND:48', 'ICE:48', 'OCN:48', 'ROF:48', 'GLC:48', 'WAV:48', 'ESP:48']
PIO_ROOT: ['CPL:1', 'ATM:1', 'LND:1', 'ICE:1', 'OCN:1', 'ROF:1', 'GLC:1', 'WAV:1', 'ESP:1']
PIO_REARR_COMM_MAX_PEND_REQ_COMP2IO: 0
PIO_REARR_COMM_MAX_PEND_REQ_IO2COMP: 64
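One check that may help narrow this down is to compare the ranks in the abort messages (1, 49, 97, 145) with the ranks that would act as PIO I/O tasks under the layout above. The sketch below assumes that, with PIO_NUMTASKS=-99, each component places its I/O tasks at local ranks root + k*stride for as many strides as fit in the component; this placement rule and the fallback to the first task are my assumptions, and the actual rule in the CESM/PIO source should be confirmed. The helper names are hypothetical.

```python
# Component -> (first global rank, number of tasks), taken from the PE layout above.
PE_LAYOUT = {
    "CPL": (0, 384), "ATM": (0, 384), "LND": (0, 192),
    "ICE": (192, 192), "OCN": (384, 48), "ROF": (0, 192),
    "GLC": (0, 384), "WAV": (0, 96), "ESP": (0, 1),
}
PIO_STRIDE = 48   # same for every component in this case
PIO_ROOT = 1      # same for every component in this case
FAILED_RANKS = {1, 49, 97, 145}  # ranks named in the pasted abort messages


def io_ranks(first, ntasks, stride=PIO_STRIDE, root=PIO_ROOT):
    """Global ranks of the I/O tasks under the ASSUMED placement rule."""
    local = [root + k * stride
             for k in range(max(1, ntasks // stride))
             if root + k * stride < ntasks]
    if not local:
        local = [0]  # assumption: fall back to the component's first task
    return [first + r for r in local]


def components_matching(failed):
    """Components whose assumed I/O task set equals the failed-rank set."""
    return sorted(name for name, (first, n) in PE_LAYOUT.items()
                  if set(io_ranks(first, n)) == set(failed))


if __name__ == "__main__":
    for name, (first, n) in PE_LAYOUT.items():
        print(f"{name}: {io_ranks(first, n)}")
    print("exact match with failed ranks:", components_matching(FAILED_RANKS))
```

Under this assumed rule, the four failed ranks coincide exactly with the I/O tasks of LND and ROF (192 tasks, stride 48, root 1), which would be consistent with the abort happening while CLM writes its first monthly history file. This is only suggestive, since the pasted abort messages may not list every failing rank.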
A startup FHIST case previously ran successfully for 5 years on this machine. However, errors occur whenever I run a hybrid B compset or an I compset. Because the issue arises with different CLM input datasets, I suspect that the abort is triggered during PIO data rearrangement or monthly history output rather than by the CLM input datasets themselves, but I do not know how to confirm or resolve this. I have uploaded the machine configuration file and the log files. Any suggestions would be greatly appreciated.