Question about continuation run crash in CTSM

shahla salimpour

New Member
Hi all,

I am running a CTSM case in 10-year chunks with CONTINUE_RUN=TRUE. The first chunk completed successfully and produced valid restart files. The continuation run also progressed and wrote the next restart files, but the job still ended with FAILED status.

In the logs, I see a column-level water balance warning followed by many SIGABRT messages. Since the restart files for the next date were written successfully, I am unsure what the main cause of the crash is, and it is preventing the next chunks from continuing.

Thank you very much for your guidance.
 

slevis

Moderator
Staff member
I think that the "warning" is unrelated. The SIGABRT messages suggest a bad value such as NaN or Inf or similar.

Troubleshooting ideas:
- Rerun the second segment, but for longer than 10 years (e.g. 11), to see whether the model stops at the same timestep with the same error. I would expect it to fail the same way. This time, though, I suggest writing restart files more frequently by changing the REST_N setting in your case's env_run.xml, so that you can restart closer to the failing timestep and make subsequent troubleshooting more efficient.
- Unless you can trace the crash to one of your custom changes (whether to the input data or to the code), you will need to go into debugging mode, for example by adding "write" statements in key places in the code until they reveal where and why the model is crashing.
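
Before adding "write" statements, it may help to find the earliest place in the logs where a bad value or abort shows up. The following is only a rough sketch, not part of CTSM: the function name `first_hit` and the search pattern are assumptions for illustration, and the pattern should be adjusted to whatever your logs actually print.

```python
import re
import sys
from collections import deque

# Assumed markers of trouble; edit this to match the text in your own logs.
PATTERN = re.compile(r"\b(nan|inf|infinity|sigabrt)\b", re.IGNORECASE)


def first_hit(lines, context=5):
    """Return (line_number, preceding_lines, matching_line) for the first
    line that matches PATTERN, or None if no line matches."""
    recent = deque(maxlen=context)  # rolling window of lines before the hit
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if PATTERN.search(text):
            return number, list(recent), text
        recent.append(text)
    return None


if __name__ == "__main__":
    # Usage: python find_first_hit.py <log file> [<log file> ...]
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            hit = first_hit(handle)
        if hit is None:
            print(f"{path}: no match")
            continue
        number, before, text = hit
        print(f"{path}: first match at line {number}")
        for earlier in before:
            print(f"    {earlier}")
        print(f"--> {text}")
```

Pointing this at the log files in the run directory (for example the cesm.log.* and lnd.log.* files, if those are what your case produced) shows the first suspicious line and the few lines before it, which can narrow down where to put the "write" statements.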
 