Retrying a RESUBMIT>0 run crashes

samrabin · Jul 26, 2022

I have a simple test setup on Cheyenne that's supposed to run for 5 days, then resubmit once to run for 5 more. This works fine the first time I call ./case.submit. If I then do ./xmlchange RESUBMIT=1,CONTINUE_RUN=FALSE and submit again, the first segment runs fine, but the second segment crashes. The error is in components/cmeps/cesm/driver/esmApp.F90 at line 148 (the last line of the excerpt below):

Code:

  if (ESMF_LogFoundError(rcToCheck=urc, msg=ESMF_LOGERR_PASSTHRU, &
       line=__LINE__, &
       file=__FILE__)) &
       call ESMF_Finalize(endflag=ESMF_END_ABORT)

Debug mode shows the following:

Code:

 PIO rearranger options:
   comm type     = p2p
   comm fcd      = 2denable
   max pend req (comp2io)  =           64
   enable_hs (comp2io)     =  T
   enable_isend (comp2io)  =  F
   max pend req (io2comp)  =           64
   enable_hs (io2comp)    =  F
   enable_isend (io2comp)  =  T
MPT ERROR: Rank 0(g:0) is aborting with error code 1.
        Process ID: 19655, Host: r6i6n33, Program: /glade/scratch/samrabin/chain_20220722_01/bld/cesm.exe
        MPT Version: HPE MPT 2.22  03/31/20 15:59:10

MPT: --------stack traceback-------
MPT: Attaching to program: /proc/19655/exe, process 19655

[...]

MPT: #9  0x00002b0ca8ff5b80 in esmf_initmod::esmf_finalize (
MPT:     keywordenforcer=<error reading variable: Cannot access memory at address 0x0>, endflag=...,
MPT:     rc=<error reading variable: Cannot access memory at address 0x0>)
MPT:     at /glade/p/cesmdata/cseg/PROGS/build/28560/esmf-8.2.0b23/src/Superstructure/ESMFMod/src/ESMF_Init.F90:1226
MPT: #10 0x0000000000432c1d in esmapp ()
MPT:     at /glade/u/home/samrabin/ctsm/components/cmeps/cime_config/../cesm/driver/esmApp.F90:148
MPT: #11 0x00000000004142a2 in main ()
MPT: #12 0x00002b0caea15a35 in __libc_start_main ()
MPT:    from /glade/u/apps/ch/os/lib64/libc.so.6
MPT: #13 0x00000000004141a9 in _start () at ../sysdeps/x86_64/start.S:118

Is there some extra step I need to perform, aside from ./xmlchange RESUBMIT=1,CONTINUE_RUN=FALSE, in order for this to work? I've tried deleting the rpointer files in the run directory, but that didn't help.

Details:

ctsm5.1.dev092
Case directory: /glade/u/home/samrabin/cases_ctsm/chain_20220722_01
Full log from which I took above excerpt: /glade/scratch/samrabin/chain_20220726.03.01/run/cesm.log.5172199.chadmin1.ib0.cheyenne.ucar.edu.220726-121510

samrabin · Jul 26, 2022

Okay, it looks like doing ./case.setup -r and then rebuilding does the trick. But is there any way to avoid the rebuilding step?

samrabin · Jul 26, 2022

… and I guess this doesn't always work.

oleson · Jul 26, 2022

I replicated your failure. It seems to be due to a missing cpl restart file, as noted in PET0.ESMF_LogFile:

20220726 135256.677 ERROR PET0 (esm_time_read_restart) ERROR: nf90_open: chain_20220722_01.cpl.r.0001-01-06-00000.nc
20220726 135256.677 ERROR PET0 esm_time_mod.F90:168 Failure - Passing error in return code
20220726 135256.677 ERROR PET0 ensemble_driver.F90:266 Failure - Passing error in return code
20220726 135256.677 ERROR PET0 ensemble:src/addon/NUOPC/src/NUOPC_Driver.F90:764 Failure - Passing error in return code
20220726 135256.677 ERROR PET0 ensemble:src/addon/NUOPC/src/NUOPC_Driver.F90:457 Failure - Passing error in return code
20220726 135256.677 ERROR PET0 esmApp.F90:146 Failure - Passing error in return code
20220726 135256.677 INFO PET0 Finalizing ESMF
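
For anyone hitting something similar, here is a minimal sketch of pulling the missing file name out of a PET log instead of reading it by eye. It assumes the error lines look like the ones quoted above; the function name and the regular expression are my own illustration, not part of ESMF or CIME.

```python
import re

# Assumed line shape, taken from the excerpt above:
#   <date> <time> ERROR PET<n> (<routine>) ERROR: nf90_open: <file>
NF90_OPEN_ERROR = re.compile(r"\bERROR\s+PET\d+\s+.*nf90_open:\s*(?P<path>\S+)")

# Three lines copied from the log excerpt in this thread.
SAMPLE_LOG = """\
20220726 135256.677 ERROR PET0 (esm_time_read_restart) ERROR: nf90_open: chain_20220722_01.cpl.r.0001-01-06-00000.nc
20220726 135256.677 ERROR PET0 esm_time_mod.F90:168 Failure - Passing error in return code
20220726 135256.677 INFO PET0 Finalizing ESMF
"""


def files_that_failed_to_open(log_lines):
    """Return file names named in nf90_open errors, in order, without duplicates."""
    seen = []
    for line in log_lines:
        match = NF90_OPEN_ERROR.search(line)
        if match and match.group("path") not in seen:
            seen.append(match.group("path"))
    return seen


if __name__ == "__main__":
    for name in files_that_failed_to_open(SAMPLE_LOG.splitlines()):
        print(name)
```

To run it against a real log, pass the lines of PET0.ESMF_LogFile from the run directory in place of SAMPLE_LOG.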

The first time around, the short-term archiver copies the restart files into the archive directory, leaving a copy of chain_20220722_01.cpl.r.0001-01-06-00000.nc in the run directory. The restart then works.
The second time around, the short-term archiver doesn't leave a copy of the cpl restart file (or any other restart file) in the run directory after the initial run. It simply deletes them, apparently treating them as interim restart files.
I tried ./xmlchange DOUT_S_SAVE_INTERIM_RESTART_FILES=TRUE. In that case the archiver moved the restart files into the archive directory instead of deleting them, so they were still missing from the run directory on the resubmission.
I also tried deleting the archive directory between submissions, but that didn't work either.
What did work was to delete the restart files that were left in the run directory from the end of the first submission:

chain_20220722_01.clm2.r.0001-01-11-00000.nc
chain_20220722_01.clm2.rh0.0001-01-11-00000.nc
chain_20220722_01.cpl.r.0001-01-11-00000.nc
chain_20220722_01.datm.r.0001-01-11-00000.nc

I'm not sure why that would be needed though.
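
In case it helps, the cleanup above could be scripted roughly as follows. This is only a sketch: the function names are hypothetical, it is not a CIME tool, and it assumes restart files are named <case>.<component>.r<suffix>.<date stamp>.nc as in the list above. It defaults to a dry run so nothing is deleted until you have checked the list.

```python
import re
import tempfile
from pathlib import Path


def stale_restart_files(run_dir, case, stamp):
    """List restart files in run_dir for `case` that carry the date stamp `stamp`.

    Assumed naming: <case>.<component>.r<suffix>.<stamp>.nc
    (for example cpl.r, clm2.r, clm2.rh0, datm.r).
    """
    pattern = re.compile(
        rf"^{re.escape(case)}\.[^.]+\.r[^.]*\.{re.escape(stamp)}\.nc$"
    )
    matches = [
        p for p in Path(run_dir).iterdir()
        if p.is_file() and pattern.match(p.name)
    ]
    return sorted(matches, key=lambda p: p.name)


def remove_stale_restart_files(run_dir, case, stamp, dry_run=True):
    """Return the matching file names; delete them only when dry_run is False."""
    stale = stale_restart_files(run_dir, case, stamp)
    if not dry_run:
        for path in stale:
            path.unlink()
    return [p.name for p in stale]


if __name__ == "__main__":
    # Demonstration in a throwaway directory, not a real run directory.
    with tempfile.TemporaryDirectory() as tmp:
        for name in (
            "chain_20220722_01.cpl.r.0001-01-11-00000.nc",
            "chain_20220722_01.clm2.h0.0001-01-11-00000.nc",
        ):
            (Path(tmp) / name).touch()
        print(remove_stale_restart_files(tmp, "chain_20220722_01", "0001-01-11-00000"))
```

For this case the stamp would be 0001-01-11-00000, the end of the previous submission. Call it once with the default dry_run=True to see what would go, then again with dry_run=False before resubmitting.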

samrabin · Jul 26, 2022

Aha! That works, thanks!

samrabin

Sam Rabin

Member

oleson

Keith Oleson

CSEG and Liaisons
