model crashing - POP "solver not converged"

I've been running CESM1.2 successfully for a while, but recently one of my simulations crashed.  If I try to restart it from the last restart file, it runs for a few years and crashes again at the same spot.  The relevant error in the cesm log file appears to be:

 197:POP Exiting...
 197:POP_SolversChronGear: solver not converged
 197:POP_SolverRun: error in ChronGear
 197:POP_BarotropicDriver: error in solver
 197:Step: error in barotropic

A search of the CESM forum suggests that decreasing the ocean timestep might help (https://bb.cgd.ucar.edu/node/1001945).  I have not tried that yet.  That suggestion is from 5 years ago.  Is it still the best course of action?
 

njn01

Member
Yes, this is still the best course to follow.  See the POP2 FAQ for more information: http://www.cesm.ucar.edu/models/cesm1.2/pop2/doc/faq/#runtime_solver
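For anyone finding this thread later, a minimal sketch of how the timestep change is usually applied (this assumes a standard CESM1.2 case directory and the usual CESM1.x xmlchange syntax; check the FAQ above for the authoritative steps and for the dt_count value appropriate for your grid):

  # --- user_nl_pop2 (in the case directory) ---
  dt_count = 26    ! default is 23; a larger dt_count means more, shorter ocean timesteps per day

  # --- then, from the case directory ---
  ./preview_namelists
  ./xmlchange -file env_run.xml -id CONTINUE_RUN -val TRUE
  ./<casename>.submit    # <casename>.submit is a placeholder for your case's submit script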
 
Great, thanks.  Should I continue using the decreased ocean timestep until the end of the simulation, or return it to normal once I pass the original point of nonconvergence?
Also, is there a good reason why this would happen now?  I wouldn't have expected a problem like this late in the simulation.
 
Thanks.  I've reduced the timestep temporarily to get past a few crashes (although for the most recent crash even dt_count=26 didn't work, so I had to use dt_count=29, which is still within acceptable limits).

One concern: does the use of a reduced ocean timestep have the potential to alter the climatology of a simulation?  For instance, is it possible that a preindustrial simulation run with dt_count=23 (the default) could differ in any important way from one run with dt_count=26?  I'm asking because these ocean non-convergence problems have only started happening in the very last section of my equilibrium simulation.  Since I was going to end the simulation soon anyway, would it be better to just stop the simulation now and use the section before this issue arose for analysis purposes?
 

njn01

Member
We expect the ocean climatology to be the same for these various timestep values. The ocean-model non-convergence that you encountered is not uncommon, particularly if you are running a fully coupled simulation. If the simulation successfully continues with dt_count=29, I would not be concerned.  But if in the future your case encounters convergence problems with dt_count=29, you could reasonably suspect a problem that is bigger than a temporary increase in ocean forcing.
 
I am also having problems with the barotropic solver so I am trying to diagnose it systematically. I need some clarification on the following comment in the FAQ linked above: "Many times when the run fails with an ocean nonconvergence error, particularly early in the model run, the ocean-model solver cannot converge because it has received bad boundary conditions passed from another model. In this situation, the user should investigate what is being passed to the ocean and perhaps write high-frequency output to help diagnose the problem." Could you please elaborate on what it means to have received bad boundary conditions from another model? Maybe an example? I am having some difficulty seeing what model could affect the ocean model's convergence. Could you also please help me understand what high-frequency output to write and how to do it? Thank you,
Deepak
 

njn01

Member
"Bad boundary conditions from another model" can occur when the ocean model receives extreme, unrealistic values from another component in the coupled system. The other component might not trap the error, but if the output from that component (which is input to the ocean) is extreme enough, it may trigger an error condition that is detected by the ocean model.As for high-frequency output, there is information in the POP FAQ that describes how to do this.  Note that it may or may not help you diagnose the problem; this is only a suggestion that can sometimes help you figure out what is happening. But as an example, you could output "nstep" ocean "tavg" files for one day just prior to the nonconvergence.Also, for a short period of time prior to the nonconvergence, you could also set the CESM variable "INFO_DBUG" to 2, which will cause the ocean model to compute its diagnostics every timestep.  If your experiment has run successfully for at least a few months, you might try reducing the ocean timestep and rerunning from the last restart.  If this gets you past the nonconvergence, then you might later try returning the timestep to its original value at a later point in time. Temporarily reducing the timestep is the simplest thing to try, and if you have not already done this, I recommend that you give it a try. 
 
Hi, sorry to disturb you.  I have met the same problem with a fully coupled B1850CN run (f19_g16).  I successfully ran the model for 120 years, but it crashed at year 121 with the "solver not converged" error.  I tried the suggested method of reducing the timestep (dt_count from 23 to 29), and it got through the crashing point.  I ran it for 5 years with the reduced timestep and wrote a restart file every year.  Then I continued with the default timestep, starting from each of those restart files (years 121, 122, 123, 124, and 125).  However, the model still crashed with the same error.  What should I do then?  Any suggestions?  Thanks!
 

njn01

Member
The default 1-hour timestep in POP (dt_count = 23) is known to be fairly close to the edge of numerical stability, particularly in a fully coupled experiment, in which the atmospheric forcing can be larger than the smoother forcing in an ocean-only or ocean-ice simulation.  It is reasonable to try to return to a larger ocean timestep after temporarily reducing it, but if you encounter instability (a non-converged solver) again, then I would recommend simply continuing your experiment with the smaller timestep (dt_count = 29).
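In case it helps, the two options above amount to a one-line difference in user_nl_pop2 (same assumptions as the earlier sketch; the submit script name is a placeholder):

  # Option 1: keep the reduced timestep for the rest of the experiment
  #   user_nl_pop2:  dt_count = 29
  # Option 2: try returning to the default timestep once past the trouble spot
  #   user_nl_pop2:  dt_count = 23
  # Either way, continue from the most recent restart:
  ./xmlchange -file env_run.xml -id CONTINUE_RUN -val TRUE
  ./<casename>.submit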
 