

CAM5 run time error


I have an update on the above issue. I tried a simple continuation run starting a few model days before the crash, using the corresponding restart files, just to make sure that the crash has nothing to do with memory issues. The model crashed at exactly the same point as before, with the same messages.

Then I tried a branch run from the same point so that I could modify some parameters and resolve the issue. The model runs successfully this time. I expected the model to reproduce the original run bit for bit and crash at the same point as before, since I did not change anything in the namelist except to provide values for cam_branch_file, nrevsn, ice_ic, restfilm and restart_file, which correspond to the master restart files for atm, clm, cice, docn and cpl.

Further, the output is not the same as before. There are differences.
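
One way to narrow down where two runs stop being bit-for-bit identical is to compare the same field from both runs, sample by sample, and report the first mismatch. The following is only a minimal sketch: the function name `first_divergence` and the sample values are hypothetical, and the values are assumed to have already been read from the two sets of history files into plain sequences.

```python
def first_divergence(reference, candidate):
    """Return (index, ref_value, cand_value) for the first sample that differs.

    Comparison is exact (==), because a bit-for-bit run must reproduce
    every value exactly. Returns None if all compared samples match.
    Only the overlapping part of the two sequences is compared.
    """
    for k, (a, b) in enumerate(zip(reference, candidate)):
        if a != b:
            return k, a, b
    return None


# Hypothetical values of one field at one grid point, one per time sample.
reference_run = [287.15, 287.40, 287.92, 288.10]
branch_run = [287.15, 287.40, 287.93, 288.35]

print(first_divergence(reference_run, branch_run))
```

If the first mismatch is already at the first sample after the branch point, the difference is more likely to come from the restart or namelist setup than from slow drift.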

Additionally, the start time of the model output is shifted. This affects only the h1 history stream (which contains 6-hourly instantaneous fields). The master restart files correspond to the time stamp 1998-07-01-00000, so the h1 files should be written as 1998-07-**-00000, each containing 4 time levels. Instead, I get them as 1998-07-01-21600, with the very first time level skipped. I have cross-checked and confirmed that the time settings in the restart files and namelists match. Maybe this is a trivial issue, but I am not able to trace it. Branch runs in standalone CAM are not well documented. I collected information from different documents to get the run working, so it is possible that I am missing something.

In short, I am wondering about:

1) Why does the model run successfully in the branch mode when no parameters are changed?

2) Why is the output different than the start-up run?

3) What causes the model to skip the first time level?

I have uploaded the namelist settings I used for the start-up run and for the branch run. If anyone has the slightest idea regarding these, please comment.




A branch run should give identical results to the restart as you expected.  If it doesn't then something is wrong.  Sometimes it's a feature of the system that changes and you don't have any control over it.  In your original post there is not much useful information from the debugger output.  This is typical of a production run where the executable was not built with the debug flags.  In that case it is often useful to rebuild the executable with debug options on, and then do the restart run.  Sometimes this will provide more information about the failure.  Other times the run will go right past the original point of failure which indicates a possible problem with the optimized code or perhaps the system had a failure during the original run.  These kinds of problems can be extremely difficult to track down.

I think the output from the branch run in the h1 file is correct.  On a branch you won't get a time sample at 1998-07-01-00000 because that output was part of the run that you branched from.  On the other hand an initial run starting from 1998-07-01-00000 will contain a time sample in the h1 file at hour 0 because the initial conditions (updated by a partially complete timestep) are written to all history file sequences except for the monthly average one.
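
The difference between the two cases can be illustrated with a small sketch that lists the time stamps of the first four 6-hourly samples. This is not CAM code: the helper `h1_sample_stamps` is hypothetical, it assumes the `YYYY-MM-DD-SSSSS` stamp format seen in the file names above, and it uses Python's standard calendar (which agrees with a no-leap calendar for July).

```python
from datetime import datetime, timedelta


def h1_sample_stamps(start, n_samples, interval_s=21600, include_start=True):
    """Return 'YYYY-MM-DD-SSSSS' stamps for n_samples instantaneous samples.

    include_start=True  mimics an initial run (sample written at hour 0).
    include_start=False mimics a branch run (first sample one interval later).
    """
    first = 0 if include_start else 1
    stamps = []
    for k in range(first, first + n_samples):
        t = start + timedelta(seconds=k * interval_s)
        seconds_of_day = t.hour * 3600 + t.minute * 60 + t.second
        stamps.append(f"{t:%Y-%m-%d}-{seconds_of_day:05d}")
    return stamps


start = datetime(1998, 7, 1)
initial_run = h1_sample_stamps(start, 4, include_start=True)
branch_run = h1_sample_stamps(start, 4, include_start=False)

print("initial:", initial_run)
print("branch: ", branch_run)
```

With four time levels per file, the first file of the branch run therefore starts at 1998-07-01-21600 and ends at 1998-07-02-00000, which matches the file name reported above.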






Thank you, Eaton!

I did a simple continuation of the crashed simulation in debug mode. The error message in the log file at the crash time is now "PE RANK 2 exit signal Floating point exception". According to the core dump file (attachment), this comes from an arithmetic operation in the subroutine qsinvert() in the module "uwshcu.F90".

I think this problem is connected to the "pLCL does not converge and is set to psmin in uwshcu.F90" and "mixing ratio violated at......" messages appearing in the log file. How can we resolve this? I noticed in the output history files that at some points lying directly over coastlines, specific humidity has values like 9.9e+36. Could this be the cause of the error? If it is, I wonder how the model ran successfully and reasonably well for 15 years.

One more question: can we change the model time step and parameterization schemes in a branch run, or is that possible only in a hybrid run?





Tracking down this kind of problem is never easy.  My first assumption would be that qsinvert is getting an unrealistic atm state and so I'd try to identify the column which is causing the problem in qsinvert, and then trace back where the bad value is coming from.  It's also possible that qsinvert has a bug and that is triggered by a realistic atm state, but that seems less likely.  When there is an obvious problem like a specific humidity value of 9.9e+36 then tracing back where that is coming from would be the first thing to do.
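
A first step in that trace-back could be to list the grid points where a field holds values outside a plausible range. The sketch below is only an illustration: the function `find_suspect_points`, the upper bound `q_max=0.1` kg/kg and the sample grid are all assumptions, and the field is assumed to have already been read from a history file into a 2-D (lat, lon) array or nested list.

```python
import math


def find_suspect_points(q, lats, lons, q_max=0.1):
    """Return (lat, lon, value) for points where q is implausible.

    A value is flagged if it is NaN/Inf, negative, or larger than q_max.
    q_max is an assumed, generous upper bound for specific humidity (kg/kg).
    """
    suspects = []
    for j, row in enumerate(q):
        for i, value in enumerate(row):
            v = float(value)
            if not math.isfinite(v) or v < 0.0 or v > q_max:
                suspects.append((lats[j], lons[i], v))
    return suspects


# Hypothetical 2 x 3 patch of specific humidity with two bad points.
lats = [10.0, 12.0]
lons = [70.0, 72.5, 75.0]
q = [
    [0.012, 0.015, 9.9e36],
    [0.011, -1.0e-4, 0.014],
]

for lat, lon, value in find_suspect_points(q, lats, lons):
    print(f"lat={lat} lon={lon} q={value:.3e}")
```

Running the same check on successive time samples before the crash would show whether the flagged points are present from the start or only appear shortly before the failure.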

Sometimes reducing the timestep will be a successful way to get a run going which has encountered a stability problem.  This has to be done using a hybrid run.


I wonder how daily SST data is fed into the model. Is there any documentation on using daily rather than monthly data in CAM?
