How to continue a model run if it failed at a time point

Eric
Member
Hello,

I have set each job to run for 2 years and to resubmit 7 times. The initial run is a hybrid run starting in 2015. The initial run (2015-2017) and the first resubmit (2017-2019) succeeded, but the second resubmit (2019-2021) failed. The run directory now contains restart files from 2015 and from 2019. If I want to continue the model run from 2019, can I just run ./case.submit, given that CONTINUE_RUN is already TRUE? I am not sure how to continue a model run that failed partway through. Thanks!
 

katec

CSEG and Liaisons
Staff member
Yes, to continue a run at any point, just make sure the most recent restart files are in your run directory, make sure CONTINUE_RUN is TRUE, and type ./case.submit. Your run will restart the segment that failed.
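The checks described above can be sketched in shell. This is only a sketch under stated assumptions: restart files are assumed to follow the usual `<case>.<comp>.r.<YYYY-MM-DD-SSSSS>.nc` naming, and the helper names `latest_restart_stamp` and `continue_run` are hypothetical. `xmlquery`, `xmlchange`, and `case.submit` are the standard case-directory tools.

```shell
#!/usr/bin/env bash

# Print the newest restart date stamp (YYYY-MM-DD-SSSSS) found in a directory.
# Assumption: restart files are named <case>.<comp>.r.<stamp>.nc.
latest_restart_stamp() {
    local dir="$1"
    ls "$dir" 2>/dev/null \
        | sed -n 's/.*\.r\.\([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}-[0-9]\{5\}\)\.nc$/\1/p' \
        | sort -u | tail -n 1
}

# Hypothetical wrapper; run it from the case directory.
continue_run() {
    local rundir
    rundir="$(./xmlquery RUNDIR --value)" || return 1
    echo "Newest restart set in $rundir: $(latest_restart_stamp "$rundir")"
    ./xmlquery CONTINUE_RUN,RESUBMIT   # show the current settings
    ./xmlchange CONTINUE_RUN=TRUE      # no-op if it is already TRUE
    ./case.submit
}
```

If the printed stamp is not the date you expect to continue from (2019 in this thread), the restart set in the run directory is probably incomplete or stale, and it is worth fixing that before submitting.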
 

Eric
Member
Hi Kate,

I just tried ./case.submit and it failed after several seconds. Here is the content of the log file:

MPT: Launcher network accept (MPI_LAUNCH_TIMEOUT) timed out
MPT: Launcher on r9i1n0 failed to receive connection(s) from: r9i1n0.ib0.cheyenne.ucar.edu r9i1n1.ib0.cheyenne.ucar.edu r9i1n2.ib0.cheyenne.ucar.edu r9i1n3.ib0.cheyenne.ucar.edu r9i1n4.ib0.cheyenne.ucar.edu r9i1n5.ib0.cheyenne.ucar.edu r9i1n6.ib0.cheyenne.ucar.edu r9i1n7.ib0.cheyenne.ucar.edu r9i1nMPT ERROR: could not launch executable
(HPE MPT 2.19 02/23/19 05:31:12)

I have also attached the log file of the failed second resubmit (2019-2021). I am not sure whether this is a problem with Cheyenne or a problem with my experiment. Thanks!!
 

katec

CSEG and Liaisons
Staff member
Hi Eric,

Yeah, a CESM log that starts right away with an error is very likely a Cheyenne problem. I saw somebody else with the exact same thing yesterday. They ran case.submit again and it worked on the second try, so try submitting again. When there's a problem with your actual run (not the machine), you will see the usual CESM start-up output at the beginning of the log, and the error appears further down, at the point where it can't find or load the restart files.
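As a rough illustration of that distinction, the sketch below checks a log for the two MPT launcher messages quoted earlier in this thread. The function name `is_launcher_failure` is hypothetical, and the check is only a heuristic: it recognizes this particular launcher error and says nothing about other machine-side failures.

```shell
#!/usr/bin/env bash

# Return 0 if the log contains the MPT launcher messages quoted in this thread,
# which suggests the job never got as far as starting the model.
is_launcher_failure() {
    local log="$1"
    grep -q -e 'MPT ERROR: could not launch executable' \
            -e 'MPI_LAUNCH_TIMEOUT' "$log"
}

# Example usage from the case directory ($log is the path to the failed log):
#   if is_launcher_failure "$log"; then
#       ./case.submit   # likely a machine hiccup, so just try again
#   fi
```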
 

Eric
Member
Hi Kate,

I just tried another case.submit and it failed again. But after I manually copied the restart files from the archive directory to the run directory and ran case.submit, it worked. I suppose there is some issue with the script that loads the restart files.
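The manual copy can be sketched as follows. It assumes the short-term archive keeps each restart set in its own dated directory under `rest/` in `DOUT_S_ROOT`; the function name `stage_restarts` is hypothetical, and the date stamp in the usage comment is only illustrative.

```shell
#!/usr/bin/env bash

# Copy every file from one archived restart set into the run directory.
stage_restarts() {
    local rest_set="$1" rundir="$2"
    [ -d "$rest_set" ] || { echo "no restart set at $rest_set" >&2; return 1; }
    [ -d "$rundir" ]   || { echo "no run directory at $rundir" >&2; return 1; }
    cp -p "$rest_set"/* "$rundir"/
}

# Example usage from the case directory (date stamp is illustrative):
#   archive="$(./xmlquery DOUT_S_ROOT --value)"
#   rundir="$(./xmlquery RUNDIR --value)"
#   stage_restarts "$archive/rest/2019-01-01-00000" "$rundir"
#   ./case.submit
```

Copying the whole set, rather than individual files, also brings along the rpointer files that tell each component which restart file to read.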
 