
Model stopped partway through with no archiving or crash error

One of my simulations seems to have stopped partway through last night, but I don't see any particular errors in the logs that would suggest a crash. It runs one year at a time, and in the run directory I have history files for October (2122-10) and restarts for November (2122-11). It never ran past that point or completed short-term archiving, so the output from the mostly finished year is just sitting in /run. These are the ends of all the log files:

CESM (this is the only one with a possible error):
1: Opened file TuneRCP85CoupledExt.cam.r.2122-01-01-00000.nc to write 1769472
1: NetCDF: Invalid dimension ID or name
1: NetCDF: Invalid dimension ID or name
1: NetCDF: Invalid dimension ID or name
1: NetCDF: Invalid dimension ID or name
1: NetCDF: Invalid dimension ID or name
1: NetCDF: Invalid dimension ID or name
1: NetCDF: Invalid dimension ID or name
1: NetCDF: Invalid dimension ID or name
1: NetCDF: Invalid dimension ID or name
1: NetCDF: Invalid dimension ID or name
1: NetCDF: Invalid dimension ID or name
1: NetCDF: Invalid dimension ID or name
1: NetCDF: Invalid dimension ID or name
1: Opened file TuneRCP85CoupledExt.cam.rs.2122-01-01-00000.nc to write 131072

CPL:
Write restart file at 21220101 0
(seq_rest_write) write rpointer file rpointer.drv
(seq_io_wopen) create file TuneRCP85CoupledExt.cpl.r.2122-01-01-00000.nc
tStamp_write: model date = 21220101 0 wall clock = 2022-05-01 19:39:46 avg dt = 16.50 dt = 115.27
memory_write: model date = 21220101 0 memory = 526.72 MB (highwater) -0.00 MB (usage) (pe= 0 comps= cpl ATM LND OCN ICE GLC ROF WAV)

(seq_mct_drv): =============== SUCCESSFUL TERMINATION OF CPL7-CCSM ===============
(seq_mct_drv): =============== at YMD,TOD = 21220101 0 ===============
(seq_mct_drv): =============== # simulated days (this run) = 365.000 ===============
(seq_mct_drv): =============== compute time (hrs) = 1.673 ===============
(seq_mct_drv): =============== # simulated years / cmp-day = 14.345 ===============
(seq_mct_drv): =============== pes min memory highwater (MB) 42.479 ===============
(seq_mct_drv): =============== pes max memory highwater (MB) 756.027 ===============
(seq_mct_drv): =============== pes min memory last usage (MB) -0.001 ===============
(seq_mct_drv): =============== pes max memory last usage (MB) -0.001 ===============

POP:
------------------------------------------------------------------------
===================
completed POP_Final
===================

ATM:
Number of completed timesteps:385440
Time step 385441 partially done to provide convectively adjusted and time filtered values for history tape.
------------------------------------------------------------

Total run time (sec) : 6073.10027311801
Time Step Loop run time(sec) : 6006.92394515709
SYPD : 14.3727176610261

******* END OF MODEL RUN *******

LND:
./TuneRCP85CoupledExt.clm2.r.2122-01-01-00000.nc
------------------------------------------------------------

(OPNFIL): Successfully opened file ./rpointer.lnd on unit= 85
Successfully wrote local restart pointer file
Successfully wrote out restart data at nstep = 385440
------------------------------------------------------------

ICE:
(ice_pio_wopen) create file TuneRCP85CoupledExt.cice.r.2122-01-01-00000.nc
Writing TuneRCP85CoupledExt.cice.r.2122-01-01-00000.nc
Restart written 17520 66919392000.0000 3942000000.00000

ROF:
(OPNFIL): Successfully opened file ./rpointer.rof on unit= 65
Successfully wrote local restart pointer file
Successfully wrote out restart data at nstep = 64240
------------------------------------------------------------



Would it be okay to just set STOP_OPTION to nmonths in env_run.xml and try to run for one month to see if that gets it to the end of the year? Or should I delete the rpointer and 2122 files in /run and replace them with the restarts from the end of the previous completed year? Thanks!
 

sacks

Bill Sacks
CSEG and Liaisons
Staff member
From the logs you posted, it's not clear to me what happened here. But as a general rule: if you have a consistent set of restart files from the same time, then it is safe to restart the model from that time. So in your case: yes, if you have a set of rpointer and restart files that all have the same date stamp, then you can rerun from that point.

According to the log messages you copied, though, it looks like the restart files are from 2122-01-01 – i.e., Jan 1 – not November as you said. But maybe I'm not seeing the complete picture here.
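The "same date stamp" check described above can be sketched in a few lines of Python. This is only an illustration under assumptions: it assumes the run directory holds plain-text `rpointer.*` files that name restart files with a `YYYY-MM-DD-SSSSS` stamp (as in the file names in the logs above), and the function names `restart_stamps` and `stamps_agree` are made up for this example.

```python
import re
from pathlib import Path

# Date stamp as it appears in names such as
# TuneRCP85CoupledExt.cam.r.2122-01-01-00000.nc
STAMP = re.compile(r"\d{4}-\d{2}-\d{2}-\d{5}")


def restart_stamps(run_dir):
    """Map each rpointer file name to the set of date stamps found in it."""
    stamps = {}
    for pointer in sorted(Path(run_dir).glob("rpointer.*")):
        text = pointer.read_text(errors="replace")
        stamps[pointer.name] = set(STAMP.findall(text))
    return stamps


def stamps_agree(run_dir):
    """True only if every rpointer file references one and the same stamp."""
    stamps = restart_stamps(run_dir)
    # No rpointer files, or a pointer with no recognizable stamp: not safe.
    if not stamps or any(not found for found in stamps.values()):
        return False
    return len(set.union(*stamps.values())) == 1
```

Calling `restart_stamps` on the run directory and printing the result would also show which component's pointer disagrees, if any. The sketch does not verify that the restart files themselves exist.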
 
Thanks Bill! I actually went ahead and tried restarting for one month a little after posting this, and it ran successfully. I also thought from the logs that the restarts were for 2122-01-01, but when I looked at the rpointer files they were for 2122-01-11. It ran, archived, and gave me restarts for December, so I ran it one more month, and now I'm about to restart the experiment as normal from 2123 (after changing STOP_OPTION back to nyears). I'm not sure what made it stop, but it seems to be all good now!
 

sacks

Bill Sacks
CSEG and Liaisons
Staff member
Great, glad to hear you have things working. Two frequent causes of mysterious crashes are:
- Hitting the wallclock limit requested in the batch job submission
- Running out of memory
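
In both cases the job is typically ended by the scheduler or operating system rather than by the model, so the coupler log would usually lack the "SUCCESSFUL TERMINATION" banner shown in the CPL excerpt earlier in the thread. A minimal Python sketch of that check follows. It assumes an uncompressed, plain-text coupler log; the function name `finished_cleanly` and any path passed to it are hypothetical.

```python
BANNER = "SUCCESSFUL TERMINATION"


def finished_cleanly(log_path):
    """Return True if the coupler log contains the termination banner."""
    # errors="replace" keeps stray non-UTF-8 bytes from aborting the scan.
    with open(log_path, errors="replace") as log:
        return any(BANNER in line for line in log)
```

A result of `False` for the most recent coupler log would only suggest that the run was cut short. The batch system's own job output is still the place to confirm a wallclock or memory kill.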
 