
Error in running regional case in resubmitting the job

jiamengl

Jiameng Lai
Member
Hello, I created a regional CTSM case in the Midwest US and want it to run for 21 years. To avoid exceeding the wallclock limit, I set STOP_OPTION=nmonths, STOP_N=4, RESUBMIT=62. The case ran successfully for 4 months and appears to have created restart files, but then failed with an error message that I do not understand. I checked the output file, which looks fine to me.
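
For reference, the arithmetic behind these settings can be checked with a short sketch. It assumes that RESUBMIT counts the resubmissions after the initial submission, so the total number of segments is RESUBMIT + 1; the function name `resubmit_count` is hypothetical and not part of CIME.

```python
def resubmit_count(total_years: int, stop_n_months: int) -> int:
    """Return the RESUBMIT value needed to cover total_years
    when each submission runs stop_n_months months.

    Assumption: RESUBMIT counts resubmissions after the first
    submission, so segments = RESUBMIT + 1.
    """
    total_months = total_years * 12
    if total_months % stop_n_months != 0:
        raise ValueError("run length is not a whole number of segments")
    segments = total_months // stop_n_months
    return segments - 1


# 21 years in 4-month segments: 252 / 4 = 63 segments
print(resubmit_count(21, 4))
```

Under the same assumption, shortening each segment to 3 months would need 84 segments in total, so RESUBMIT would have to be raised to keep the same 21-year run length.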

I attached the cesm.log file here for reference. Thanks!
Jiameng
 

Attachments

  • cesm.log.4115388.chadmin1.ib0.cheyenne.ucar.edu.231108-125253.txt

oleson

Keith Oleson
CSEG and Liaisons
Staff member
In this file in your case directory:

run.Midwest.ctrl.t2.o4115388

I see this:

=>> PBS: job killed: walltime 21613 exceeded limit 21600

So you must have run out of wallclock time right at the end of the run.
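
To put numbers on that message, here is a minimal sketch that parses the PBS line quoted above and reports the limit and the overrun. The regular expression is written against this one line only; other PBS versions may word the message differently, and the helper names are hypothetical.

```python
import re

# Pattern matched against the single PBS message quoted in this thread.
PBS_KILL = re.compile(r"walltime (\d+) exceeded limit (\d+)")


def parse_walltime_kill(line):
    """Return (used_s, limit_s, overrun_s) or None if the line does not match."""
    match = PBS_KILL.search(line)
    if match is None:
        return None
    used, limit = int(match.group(1)), int(match.group(2))
    return used, limit, used - limit


def hms(seconds):
    """Format a number of seconds as HH:MM:SS."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


line = "=>> PBS: job killed: walltime 21613 exceeded limit 21600"
used, limit, overrun = parse_walltime_kill(line)
print(f"limit {hms(limit)}, used {hms(used)}, overrun {overrun} s")
```

The limit of 21600 seconds is 6 hours, and the job was killed 13 seconds past it.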
 

jiamengl

Jiameng Lai
Member
Thanks! I will try running fewer months per submission. One thing confuses me: I used to be able to run a global case for several years within one wallclock limit, so why does a regional case now exceed it after only a few months? Is it because Cheyenne is about to be retired, or is something wrong with my settings?
 

jiamengl

Jiameng Lai
Member
Hello, I tried setting STOP_N=3, but it failed again at the end of the run with the same error. Is it possible that something goes wrong when writing the restart file (although the lnd and atm logs seem to indicate that this step finished successfully), or in some other step at the end of a run cycle? If so, how could I check this?
 

slevis

Moderator
Staff member
I have seen such behavior sometimes, and it is not easy to resolve. If the model gets stuck while writing a restart file, it may be attempting to write bad data. Troubleshooting may involve figuring out which variable it gets stuck on; one way to do this is to add write statements to the code until you isolate the variable. Even once you find the variable, you may still need to understand why it causes the hang in order to fix it. Building the code in debug mode (see env_build.xml) may help in this process.
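
The write-statement idea can be illustrated with a small sketch. It is in Python purely for illustration (the model code itself is Fortran), and the variable names, marker strings, and the `write_one` callback are all hypothetical. The point is to log a marker before and after each variable and flush the log each time, so the last marker still reaches the log if the job is killed.

```python
import io


def write_with_markers(variables, write_one, log):
    """Write each (name, data) pair, logging a marker before and after."""
    for name, data in variables:
        log.write(f"restart: writing {name}\n")
        log.flush()  # flush so the marker survives a killed job
        write_one(name, data)
        log.write(f"restart: finished {name}\n")
        log.flush()


def last_unfinished(log_text):
    """Return the variable with a 'writing' marker but no 'finished' marker."""
    started = None
    for line in log_text.splitlines():
        if line.startswith("restart: writing "):
            started = line[len("restart: writing "):]
        elif line.startswith("restart: finished "):
            started = None
    return started


def fake_write(name, data):
    """Stand-in writer that fails on one hypothetical variable."""
    if name == "VAR_B":
        raise RuntimeError("simulated hang")


log = io.StringIO()
try:
    write_with_markers([("VAR_A", 1), ("VAR_B", 2), ("VAR_C", 3)], fake_write, log)
except RuntimeError:
    pass
print(last_unfinished(log.getvalue()))
```

In the real model the stuck write would hang rather than raise, but the log would look the same: a "writing" marker with no matching "finished" marker.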
 