
Advice on getting past WACCM crashes

We have been validating the WACCM compsets in the latest release, CESM1.1.1, running on Yellowstone. We found that our runs occasionally crash, which we did not see in standard compset runs using the CESM1.0 code base on bluefire and other machines. We have found that we can get past a crash by decreasing the dynamical timestep: increase the namelist variable "nsplit" from 8 to 10 for the one-month period in which the crash occurred. "nsplit" may then be returned to its default value of 8 for WACCM, and the run continues.

We will continue to investigate the cause of this increased crash frequency.
 

santos

Member
An update:
The most common cause of these crashes in WACCM is a crossing of the Lagrangian levels used in the FV dycore's vertical advection scheme. Therefore, it is best to set the namelist option "nspltvrm" to "2" rather than increasing nsplit. This doubles the frequency of vertical remapping without impacting the rest of the dynamics.
"nspltvrm=2" will be the default as of CESM 1.2. In CESM 1.2, a crossing of the Lagrangian levels will also trigger an error message advising the user to increase nspltvrm, rather than simply crashing.
 
Hi Sean,

I just ran into this error after my model ran to year 23. It was almost done but came up with the error in the atmosphere log that Lagrangian levels are crossing and that the run will abort. It also suggested increasing nspltvrm. I have version CESM 1.2.1, so given your statement above, it should already be at a value of 2, correct? Should I increase it further? And would I need to start the model over from the beginning since I am changing the namelist? How do I set this in the namelist?
 

santos

Member
It should default to 2, but you can check this by looking in atm_in to see what nspltvrm is set to.

The proper value of this setting can depend on many things, including resolution (especially vertical), the time steps, and/or very strong physics forcings that push the model to the edge of stability (e.g. very high stresses or temperature gradients). The setting of 2 seems appropriate for standard WACCM configurations at 2 degrees.

If you change nspltvrm, you do not have to start the run over; it can be adjusted at any time. But if you change it, you may also have to change nsplit, which controls the bulk dynamics time step. These two settings control nested loops, so nsplit must be a multiple of nspltvrm. Here are the standard settings for WACCM:

nsplit = 8
nspltvrm = 2

Here's one change that you could attempt, by placing the following line in your user_nl_cam:

nspltvrm = 4

This works because 4 divides 8. If you wanted to increase nspltvrm further, you might have to try something like this:

nsplit = 12
nspltvrm = 6

However, only a few runs have required very significant changes to nsplit, and this generally means that either you have some serious bug, or that you are doing something that's substantially different from any supported way of running the model.

There is one more setting for the FV dycore, which is called "nspltrac", and controls the tracer advection (which can be expensive and is done less frequently than bulk dynamics). We generally allow this to be set automatically by the model, but, if set, it must be a multiple of nspltvrm and nsplit must be a multiple of it. To put it differently, the three variables must satisfy

nsplit/nspltrac = m
nspltrac/nspltvrm = n

where m and n are positive integers (either or both can be 1).
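The divisibility rules described above can be checked before editing user_nl_cam. The following is a minimal Python sketch, not part of CESM; the function name split_problems is hypothetical, and it only encodes the rules stated in this post (nsplit a multiple of nspltrac, nspltrac a multiple of nspltvrm, or nsplit a multiple of nspltvrm when nspltrac is left for the model to set).

```python
def split_problems(nsplit, nspltvrm, nspltrac=None):
    """Return a list of rule violations; an empty list means the values are consistent.

    Hypothetical helper: it encodes only the divisibility rules from this thread
    and does not reproduce any other check that CAM itself may perform.
    """
    values = {"nsplit": nsplit, "nspltvrm": nspltvrm}
    if nspltrac is not None:
        values["nspltrac"] = nspltrac

    problems = []
    for name, value in values.items():
        if not isinstance(value, int) or value < 1:
            problems.append(f"{name} must be a positive integer, got {value!r}")
    if problems:
        return problems

    if nspltrac is None:
        # nspltrac left to the model: only the nsplit/nspltvrm rule applies here.
        if nsplit % nspltvrm != 0:
            problems.append("nsplit must be a multiple of nspltvrm")
    else:
        if nsplit % nspltrac != 0:
            problems.append("nsplit must be a multiple of nspltrac")
        if nspltrac % nspltvrm != 0:
            problems.append("nspltrac must be a multiple of nspltvrm")
    return problems


if __name__ == "__main__":
    # The combinations mentioned in this thread, plus one inconsistent example.
    for settings in [(8, 2), (8, 4), (12, 6), (8, 3)]:
        print(settings, split_problems(*settings) or "ok")
```

An empty result only means the three values are mutually consistent; it says nothing about whether they are sufficient to keep a given run stable.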
 
Thanks Sean. Right now I am going to try restarting the model from where the previous restart file was and see if it comes up with the error again. If it does, I will try your suggestion. I can't imagine I have a serious bug or that I am doing something that is quite different in terms of running the model. My other run has been fine so far and the only difference I added in this run is a solar cycle by specifying my solar data and parms file. Can this error happen just by itself?
 

santos

Member
Can this happen by itself? Yes and no.

Yes, in the sense that it can happen intermittently if the model is on the edge of stability. Prior to CESM 1.2, this could show up even in out-of-the-box runs intermittently (on average, maybe every 50-100 years), but this was less common in B1850 runs.

But also no, in the sense that we thought that we had gotten away from "the edge" since setting nspltvrm to 2. That is to say, I don't know of any definitive cases where an out-of-the-box case has encountered this error with nspltvrm set to 2, vs. multiple centuries of successful runs.

I think it's still a frustratingly open question, this matter of how the WACCM physics affects numerical stability of the dycore. The CAM-SE dycore (HOMME) can produce an equivalent error that has proven much harder to conquer. Also, some CARMA cases have encountered errors that may be the result of large heating rates from the radiation interacting with the dycore.
 
Dear Mike,

I tried to run a 3xCO2 experiment on WACCM (CESM1.2.2). I changed just the boundary condition and CO2 concentration. But the model crashed within the first several steps, and reported:

  20: Run will ABORT!
  20: Suggest to increase NSPLTVRM
  20:(shr_sys_abort) WARNING: calling shr_mpi_abort() and stopping

I would greatly appreciate any suggestions, because this error is very confusing to me. Do I need to increase nspltvrm from 2 to 4?

Best regards,
Wanying
 

mmills

CSEG and Liaisons
Staff member
Wanying,

You can try increasing nspltvrm. You may also need to change nspltrac and nsplit. nspltrac needs to be a multiple of nspltvrm. nsplit needs to be a multiple of nspltrac.
 
Hello Mike,

I also ran into the same error. Could you please tell me the reason or physical explanation for this error?

Thanks.
 

mmills

CSEG and Liaisons
Staff member
nspltvrm controls the number of vertical re-mapping timesteps per physics timestep.

http://www.cesm.ucar.edu/cgi-bin/eaton/namelist/nldef2html-cam5

If there is an instability in the vertical levels (generally an issue in the upper atmosphere), nspltvrm must be increased to get through the period of instability.
 

mmills

CSEG and Liaisons
Staff member
Yes. High temperatures, such as can occur near the top of WACCM during high auroral activity, often cause the vertical levels to cross. We are testing a new method for avoiding these crashes in WACCM by limiting the value of Bz (the north-south component of the interplanetary magnetic field) in mag_parms.F90. For example:

      real(r8), parameter :: bzmin = -5.0_r8       ! minimum bz

      call solar_parms_get( kp_s = wkp, f107_s = wf107 )
      if( present( by ) ) then
         by  =  0._r8
      end if
      if( present( bz ) ) then
         bz = .433726_r8 - wkp*(.0849999_r8*wkp + .0810363_r8) &
              + wf107*(.00793738_r8 - .00219316_r8*wkp)
         ! limit bz from below and log the event
         if (bz.lt.bzmin) then
           write(iulog,'(a,f6.2,a,f6.2,a,f6.2,a,f6.1)') 'mag_parms.F90: low bz:',bz, &
                   ' limited to ',bzmin,'; Kp=',wkp,'; F107=',wf107
           bz=bzmin
         end if
      end if
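To see when the limiter in the snippet above would engage, the same arithmetic can be tried outside the model. This is a rough Python transcription of the Bz expression and the lower bound, offered only as a sketch: the function name limited_bz is hypothetical, and the sample Kp and F10.7 values are illustrative inputs, not values from any run discussed here.

```python
BZ_MIN = -5.0  # same lower bound as bzmin in the Fortran snippet


def limited_bz(kp, f107, bz_min=BZ_MIN):
    """Evaluate the Bz expression from the snippet and clamp it from below.

    Hypothetical helper for offline experimentation; it is not CESM code.
    """
    bz = (0.433726
          - kp * (0.0849999 * kp + 0.0810363)
          + f107 * (0.00793738 - 0.00219316 * kp))
    return max(bz, bz_min)


if __name__ == "__main__":
    # Illustrative inputs only: a quiet case and a strongly disturbed case.
    for kp, f107 in [(0.0, 70.0), (9.0, 200.0)]:
        print(f"Kp={kp}, F107={f107}: bz={limited_bz(kp, f107):.3f}")
```

With these inputs, the quiet case leaves Bz unchanged, while the disturbed case falls below the bound and is held at -5.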
 