
How to continue a model run if it failed at a time point

Eric
Member
Hello,

I have set each job to run for 2 years and to resubmit 7 times. The initial run is a hybrid run that starts in 2015. The initial run (2015-2017) and the first resubmit (2017-2019) succeeded, but the second resubmit (2019-2021) failed. The run directory now contains restart files from 2015 and from 2019. To continue the model run from 2019, can I just run ./case.submit, given that CONTINUE_RUN is now TRUE? I am not sure how to continue a model run that failed partway through. Thanks!
 

katec

CSEG and Liaisons
Staff member
Yes, to continue a run at any point, just make sure the most recent restart files are in your run directory, make sure CONTINUE_RUN is TRUE, and type ./case.submit. Your run will restart the segment that failed.
 

Eric
Member
Hi Kate,

I just tried ./case.submit and it failed after several seconds. Here is the content of the log file:

MPT: Launcher network accept (MPI_LAUNCH_TIMEOUT) timed out
MPT: Launcher on r9i1n0 failed to receive connection(s) from: r9i1n0.ib0.cheyenne.ucar.edu r9i1n1.ib0.cheyenne.ucar.edu r9i1n2.ib0.cheyenne.ucar.edu r9i1n3.ib0.cheyenne.ucar.edu r9i1n4.ib0.cheyenne.ucar.edu r9i1n5.ib0.cheyenne.ucar.edu r9i1n6.ib0.cheyenne.ucar.edu r9i1n7.ib0.cheyenne.ucar.edu r9i1nMPT ERROR: could not launch executable
(HPE MPT 2.19 02/23/19 05:31:12)

I also attached the log file of the failed second resubmit (2019-2021) here. I am not sure whether this is a problem with Cheyenne or with my experiment. Thanks!!
 

Eric
Member
Here is the log file of the failed second resubmit (2019-2021)
 

Attachments

  • cesm.log.9042431.chadmin1.ib0.cheyenne.ucar.edu.210623-133349.txt
    8.6 KB

katec

CSEG and Liaisons
Staff member
Hi Eric,

Yeah, a CESM log that starts right away with an error is very likely a Cheyenne problem. I saw somebody else with the exact same error yesterday; they ran case.submit again and it worked on the second try. Try submitting again. When the problem is with your actual run (not the machine), you will see the usual CESM start-up output at the beginning of the log, and the error shows up further down, at the point where the model can't find or load the restart files.
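The rule of thumb in this reply (an error at the very top of the log points at the machine, an error after the start-up output points at the run) could be sketched as a quick check. This is only a heuristic under stated assumptions: the marker strings "MPT:" and "MPT ERROR" are taken from the log excerpt quoted earlier in this thread, and the function name and the five-line window are hypothetical choices, not part of CESM.

```python
def looks_like_launch_failure(log_text, head_lines=5):
    """Heuristic: True if an MPT launcher message appears near the top of the log.

    Assumption: a machine-side launch failure shows up within the first few
    non-empty lines, before any model start-up output.
    """
    # Keep only the first few non-empty lines of the log.
    head = [ln.strip() for ln in log_text.splitlines() if ln.strip()][:head_lines]
    # Marker strings copied from the log excerpt in this thread.
    return any(ln.startswith("MPT:") or "MPT ERROR" in ln for ln in head)
```

A True result only suggests that resubmitting is worth a try; a False result means the error sits further down and the log needs to be read more closely.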
 

Eric
Member
Hi Kate,

I just tried another case.submit and it failed again. But after I manually copied the restart files from the archive directory to the run directory and ran case.submit, it worked. I suppose there is some issue with the script that loads the restart files.
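For anyone who needs to repeat the manual copy described above, here is a minimal sketch of it. It assumes the archive keeps each restart set in its own subdirectory named by date (yyyy-mm-dd-sssss) under a rest directory; that layout is an assumption, not something stated in this thread, so check your own archive first. The function names are hypothetical.

```python
import re
import shutil
from pathlib import Path

# Assumed naming of restart-set directories, e.g. 2019-01-01-00000.
DATE_DIR = re.compile(r"^\d{4}-\d{2}-\d{2}-\d{5}$")

def latest_restart_dir(rest_root):
    """Return the most recent date-named subdirectory, or None if there is none."""
    rest_root = Path(rest_root)
    candidates = sorted(
        p for p in rest_root.iterdir() if p.is_dir() and DATE_DIR.match(p.name)
    )
    # Zero-padded dates sort chronologically, so the last entry is the newest.
    return candidates[-1] if candidates else None

def stage_restarts(rest_root, run_dir):
    """Copy every file of the newest restart set into run_dir; return the names."""
    src = latest_restart_dir(rest_root)
    if src is None:
        raise FileNotFoundError(f"no restart sets found under {rest_root}")
    run_dir = Path(run_dir)
    copied = []
    for f in sorted(src.iterdir()):
        if f.is_file():
            shutil.copy2(f, run_dir / f.name)  # overwrites a file of the same name
            copied.append(f.name)
    return copied
```

After the copy, confirm that CONTINUE_RUN is TRUE and run ./case.submit, as described earlier in the thread.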
 