
Retrying a RESUBMIT>0 run crashes

samrabin

Sam Rabin
Member
I have a simple test setup on Cheyenne that's supposed to run for 5 days, then resubmit once to run for 5 more. This works fine the first time I call ./case.submit…. If I then do ./xmlchange RESUBMIT=1,CONTINUE_RUN=FALSE and submit again, the first segment runs fine, but the second segment crashes. The error is raised in components/cmeps/cesm/driver/esmApp.F90 at line 148 (the last line of the excerpt below):

Code:
  if (ESMF_LogFoundError(rcToCheck=urc, msg=ESMF_LOGERR_PASSTHRU, &
       line=__LINE__, &
       file=__FILE__)) &
       call ESMF_Finalize(endflag=ESMF_END_ABORT)

Debug mode shows the following:
Code:
 PIO rearranger options:
   comm type     = p2p
   comm fcd      = 2denable
   max pend req (comp2io)  =           64
   enable_hs (comp2io)     =  T
   enable_isend (comp2io)  =  F
   max pend req (io2comp)  =           64
   enable_hs (io2comp)    =  F
   enable_isend (io2comp)  =  T
MPT ERROR: Rank 0(g:0) is aborting with error code 1.
        Process ID: 19655, Host: r6i6n33, Program: /glade/scratch/samrabin/chain_20220722_01/bld/cesm.exe
        MPT Version: HPE MPT 2.22  03/31/20 15:59:10

MPT: --------stack traceback-------
MPT: Attaching to program: /proc/19655/exe, process 19655

[...]

MPT: #9  0x00002b0ca8ff5b80 in esmf_initmod::esmf_finalize (
MPT:     keywordenforcer=<error reading variable: Cannot access memory at address 0x0>, endflag=...,
MPT:     rc=<error reading variable: Cannot access memory at address 0x0>)
MPT:     at /glade/p/cesmdata/cseg/PROGS/build/28560/esmf-8.2.0b23/src/Superstructure/ESMFMod/src/ESMF_Init.F90:1226
MPT: #10 0x0000000000432c1d in esmapp ()
MPT:     at /glade/u/home/samrabin/ctsm/components/cmeps/cime_config/../cesm/driver/esmApp.F90:148
MPT: #11 0x00000000004142a2 in main ()
MPT: #12 0x00002b0caea15a35 in __libc_start_main ()
MPT:    from /glade/u/apps/ch/os/lib64/libc.so.6
MPT: #13 0x00000000004141a9 in _start () at ../sysdeps/x86_64/start.S:118

Is there some extra step I need to perform, aside from ./xmlchange RESUBMIT=1,CONTINUE_RUN=FALSE, in order for this to work? I've tried deleting the rpointer files in the run directory, but that didn't help.

Details:
  • ctsm5.1.dev092
  • Case directory: /glade/u/home/samrabin/cases_ctsm/chain_20220722_01
  • Full log from which I took above excerpt: /glade/scratch/samrabin/chain_20220726.03.01/run/cesm.log.5172199.chadmin1.ib0.cheyenne.ucar.edu.220726-121510
 

samrabin

Sam Rabin
Member
Okay, it looks like doing ./case.setup -r and then rebuilding does the trick. But is there any way to avoid the rebuilding step?
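
For reference, here is the workaround spelled out as a rough sketch. The function name reset_and_resubmit is made up for illustration, and I'm assuming "rebuilding" means running ./case.build and that ./case.submit is called with no extra arguments:

```shell
# Sketch only: rerun a two-segment case from scratch using the workaround above.
# Assumes the argument is a case directory that contains the usual case scripts.
reset_and_resubmit() (
  casedir=$1
  cd "$casedir" || exit 1
  # Restore the original run settings, then reset the case setup,
  # rebuild, and submit. Stop at the first step that fails.
  ./xmlchange RESUBMIT=1,CONTINUE_RUN=FALSE &&
  ./case.setup -r &&
  ./case.build &&
  ./case.submit
)
```

I would call it as reset_and_resubmit /glade/u/home/samrabin/cases_ctsm/chain_20220722_01. The body runs in a subshell, so the caller's working directory is left alone.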
 

oleson

Keith Oleson
CSEG and Liaisons
Staff member
I replicated your failure. It seems to be due to a missing cpl restart file, as noted in PET0.ESMF_LogFile:

20220726 135256.677 ERROR PET0 (esm_time_read_restart) ERROR: nf90_open: chain_20220722_01.cpl.r.0001-01-06-00000.nc
20220726 135256.677 ERROR PET0 esm_time_mod.F90:168 Failure - Passing error in return code
20220726 135256.677 ERROR PET0 ensemble_driver.F90:266 Failure - Passing error in return code
20220726 135256.677 ERROR PET0 ensemble:src/addon/NUOPC/src/NUOPC_Driver.F90:764 Failure - Passing error in return code
20220726 135256.677 ERROR PET0 ensemble:src/addon/NUOPC/src/NUOPC_Driver.F90:457 Failure - Passing error in return code
20220726 135256.677 ERROR PET0 esmApp.F90:146 Failure - Passing error in return code
20220726 135256.677 INFO PET0 Finalizing ESMF
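
In case it helps anyone hitting something similar, lines like the ones above can be pulled out of the run directory with a small helper. This is only a sketch: the function name show_esmf_errors is made up, and it assumes the logs are named PET*.ESMF_LogFile and mark errors with the word ERROR surrounded by spaces, as in the excerpt.

```shell
# Sketch only: print the ERROR lines from every PET*.ESMF_LogFile in a run directory.
# Exit status is 0 if at least one ERROR line was found, 1 otherwise.
show_esmf_errors() (
  rundir=$1
  status=1
  for f in "$rundir"/PET*.ESMF_LogFile; do
    [ -e "$f" ] || continue          # the glob matched nothing
    # -H prefixes each matching line with the name of the log file
    if grep -H ' ERROR ' "$f"; then
      status=0
    fi
  done
  exit "$status"
)
```

For this case it would be called as show_esmf_errors /glade/scratch/samrabin/chain_20220726.03.01/run.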

On the first go-around, the short-term archiver copies the restart files into the archive directory and leaves a copy of chain_20220722_01.cpl.r.0001-01-06-00000.nc in the run directory, so the restart works.
The second time around, the short-term archiver doesn't leave a copy of the cpl restart file (or any other restart file) in the run directory after the initial run. It simply deletes them, apparently treating them as interim restart files.
I tried ./xmlchange DOUT_S_SAVE_INTERIM_RESTART_FILES=TRUE. In that case the archiver moved the restart files into the archive directory instead of deleting them, so they were still missing from the run directory on the resubmission.
I also tried deleting the archive directory between submissions, but that didn't work either.
What did work was to delete the restart files that were left in the run directory from the end of the first submission:

chain_20220722_01.clm2.r.0001-01-11-00000.nc
chain_20220722_01.clm2.rh0.0001-01-11-00000.nc
chain_20220722_01.cpl.r.0001-01-11-00000.nc
chain_20220722_01.datm.r.0001-01-11-00000.nc

I'm not sure why that would be needed, though.
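
If you want to script that cleanup, something along these lines should do it. Treat it as a sketch: the function name remove_stale_restarts is made up, and the glob assumes restart files are named CASE.COMPONENT.r*.DATE-SECONDS.nc like the four files listed above. It deletes files, so check the pattern against your own run directory first.

```shell
# Sketch only: remove restart files left in the run directory for a given date stamp.
# Arguments: run directory, case name, date stamp (e.g. 0001-01-11).
remove_stale_restarts() (
  rundir=$1
  case_name=$2
  stamp=$3
  if [ -z "$rundir" ] || [ -z "$case_name" ] || [ -z "$stamp" ]; then
    echo "usage: remove_stale_restarts RUNDIR CASE DATE" >&2
    exit 2
  fi
  # Matches e.g. CASE.clm2.r.DATE-00000.nc and CASE.clm2.rh0.DATE-00000.nc,
  # but not history files such as CASE.clm2.h0.DATE-00000.nc.
  for f in "$rundir/$case_name".*.r*."$stamp"-*.nc; do
    [ -e "$f" ] || continue          # the glob matched nothing
    echo "removing $f"
    rm -- "$f"
  done
)
```

For the case in this thread that would be remove_stale_restarts RUNDIR chain_20220722_01 0001-01-11, run before resubmitting, where RUNDIR is the case's run directory. Restart files from other dates, such as the 0001-01-06 ones, are left alone.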