Model is blowing up: CFL condition likely violated. Help wanted with the CAM module

Redlichia · Nov 19, 2022

Hello, everyone! Thanks for opening this thread. I am running a fully coupled climate simulation (B1850 compset) with CESM1.2.
The model ran successfully for 3000 years. When I switched to a 'branch' run to continue the simulation, it ran successfully for six months and failed in the seventh month. The error messages come mainly from the CAM module.
Here are the last few lines of the cesm.log file:
--------------------------------------------------------------------------
1823 SPHDEP: ****** MODEL IS BLOWING UP: CFL condition likely violated *********
1824 SPHDEP: ****** MODEL IS BLOWING UP: CFL condition likely violated *********
1825
1826
1827 Parcel associated with longitude 85, level 9 and latitude 9 is outside the model domain.
1828 Possible solutions: a) reduce time step
1829 b) if initial run, set "DIVDAMPN = 1." in namelist and r
1830 erun
1831 c) modified code may be in error
1832 (shr_sys_abort) WARNING: calling shr_mpi_abort() and stopping
1833
1834
1835 Parcel associated with longitude 87, level 10 and latitude 8 is outside the model domain.
1856 --------------------------------------------------------------------------
1857 slurmstepd: error: *** STEP 6181587.0 ON h02r2n12 CANCELLED AT 2022-10-26T20:20:37 ***
1858 srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
1859 srun: error: h02r2n12: task 2: Killed
1860 srun: launch/slurm: _step_signal: Terminating StepId=6181587.0
1861 srun: error: h02r2n12: task 8: Exited with exit code 233
1862 srun: error: h02r2n12: tasks 0-1,3-7,9-29: Killed
1863 srun: error: h02r2n13: tasks 30-59: Killed
--------------------------------------------------------------------------
I also checked the atm.log file:
1556 *** Original Courant limit exceeded at k,lat= 1 9 (estimate = 1.174) ***
1557 *** Original Courant limit exceeded at k,lat= 2 9 (estimate = 1.118) ***
1558 *** Original Courant limit exceeded at k,lat= 3 9 (estimate = 1.052) ***
1559 NSTEP =31544704 8.685526565645961E-05 9.114958076179490E-06 275.333 9.87396E+04 5.278570600512381E+01 1.69 0.93
1560 nstep, te 31544705 0.37097820527521362E+10 0.37097951667912240E+10 0.36258312668091194E-03 0.98739623381941856E+05
1561 COURLIM: *** Courant limit exceeded at k,lat= 1 9 (estimate = 1.168), solution has been truncated to wavenumber 26 ***
1562 COURLIM: *** Courant limit exceeded at k,lat= 2 9 (estimate = 1.117), solution has been truncated to wavenumber 27 ***
1563 COURLIM: *** Courant limit exceeded at k,lat= 3 9 (estimate = 1.038), solution has been truncated to wavenumber 29 ***
1564 *** Original Courant limit exceeded at k,lat= 1 9 (estimate = 1.168) ***
1565 *** Original Courant limit exceeded at k,lat= 2 9 (estimate = 1.117) ***
1566 *** Original Courant limit exceeded at k,lat= 3 9 (estimate = 1.038) ***
1567 NSTEP =31544705 8.697680900706860E-05 1.045402640107233E-05 275.331 9.87396E+04 5.278784553213968E+01 1.83 1.18
1568 nstep, te 31544706 0.37097809943776278E+10 0.37097951667912240E+10 0.39184547030246612E-03 0.98739646779716568E+05
--------------------------------------------------------------------------
Attached are my cesm.log and atm.log files.
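
For anyone who wants to see how often and where the limiter fires, the "Courant limit exceeded" lines can be pulled out of atm.log with a short script. This is only a sketch: the regular expression is written against the line format shown in the excerpt above, and the function name `courant_events` is made up for illustration.

```python
import re

# Pattern written against the "Courant limit exceeded" lines in the excerpt above.
COURANT_RE = re.compile(
    r"Courant limit exceeded at k,lat=\s*(\d+)\s+(\d+)\s*\(estimate =\s*([\d.]+)\)"
)

def courant_events(lines):
    """Yield (level, latitude_index, estimate) for each matching log line."""
    for line in lines:
        match = COURANT_RE.search(line)
        if match:
            yield int(match.group(1)), int(match.group(2)), float(match.group(3))

# Lines copied from the atm.log excerpt in this post.
sample = [
    "1556 *** Original Courant limit exceeded at k,lat= 1 9 (estimate = 1.174) ***",
    "1561 COURLIM: *** Courant limit exceeded at k,lat= 1 9 (estimate = 1.168), "
    "solution has been truncated to wavenumber 26 ***",
    "1559 NSTEP =31544704 8.685526565645961E-05 9.114958076179490E-06 275.333 "
    "9.87396E+04 5.278570600512381E+01 1.69 0.93",
]

events = list(courant_events(sample))
for level, lat, estimate in events:
    print(f"level {level}, lat index {lat}: estimate {estimate}")
```

To scan a whole log, the same function can be fed an open file object instead of the `sample` list.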

I believe that, as the message says, I triggered the Courant limiter at the start of the seventh month.
A "Courant limit exceeded" message is issued whenever the limiter is applied.

There are two reasons I am aware of that trigger the limiter:

1) The wind fields can get extremely strong in the middle atmosphere at the winter pole. This is a natural phenomenon and occurs in the real world as well as in the model. The limiter will kick in under these circumstances and reduce the wind speed to maintain stability. This is a perfectly normal occurrence and nothing to worry about.

2) If an instability is generated by any other aspect of the model (for instance, you might have introduced a bug), it can amplify, and the Courant limiter will occasionally begin firing. The model will soon halt (something will go terribly wrong). You may see other manifestations of such an instability (i.e., other warning messages will begin appearing).
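
As background, the textbook advective Courant number is the distance the flow travels in one time step divided by the grid spacing, so it scales linearly with both wind speed and time step. The sketch below uses that generic grid-point definition with made-up numbers. It is not the estimate CAM's Eulerian dycore prints, and the wind speed and grid spacing are assumptions chosen only for illustration.

```python
def courant_number(wind_speed, dt, dx):
    """Generic advective Courant number: |u| * dt / dx."""
    return abs(wind_speed) * dt / dx

# Illustrative values only (assumptions, not taken from the run in this thread).
wind = 150.0        # m/s, a strong polar-night jet
dx = 250_000.0      # m, hypothetical zonal grid spacing at high latitude

for dt in (1800, 1200, 600, 300):
    print(dt, round(courant_number(wind, dt, dx), 3))
```

With the wind held fixed, halving the time step halves the Courant number, which is why reducing the time step is the first remedy the error message suggests.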
--------------------------------------------------------------------------
For information, I tried running with dtime = 1200, 600, and 300 and with DIVDAMPN = 1., but the run still aborted.
Do you have any suggestions for getting past this limit and keeping the model running?
I would really appreciate your help.
Thank you!

Best, Chen

peverley · Nov 21, 2022

Hi Chen,

One idea is to set:

kmxhdc = 0

in user_nl_cam.

kmxhdc defaults to 5, which means that the Courant limit is applied on the top 5 levels of the model. Setting this to 0 will remove the limit. Let me know if this isn't what you're looking for.

Courtney

Redlichia · Nov 30, 2022

Hi Courtney! Thank you for your patient reply; your advice was very useful to me. However, after I added "kmxhdc = 0" to the user_nl_cam file, the model still gives the same error message as before.
--------------------------------------------------------------------------
1819 SPHDEP: ****** MODEL IS BLOWING UP: CFL condition likely violated *********
1820
1821
1822 Parcel associated with longitude 36, level 2 and latitude 47 is outside the model domain.
1823 Possible solutions: a) reduce time step
1824 b) if initial run, set "DIVDAMPN = 1." in namelist and r
1825 erun
1826 c) modified code may be in error
1827 (shr_sys_abort) WARNING: calling shr_mpi_abort() and stopping
--------------------------------------------------------------------------
Attached are my new cesm.log and atm.log files.
I checked atm_in, and the "kmxhdc = 0" setting appears to have taken effect, but I still get the same error message as before, which confuses me. With "kmxhdc = 0" the Courant limiter should not be triggered, yet it still is. Do you know of any other possible causes?
I am also actively looking for the cause, and if I solve it I will share it here.
--------------------------------------------------------------------------
Thank you for your help!

Best, Chen

peverley · Dec 2, 2022

Hi Chen,

Are you still getting the "Courant limit exceeded" messages in the atm.log file?

Overall, it looks like the "SPHDEP: ****** MODEL IS BLOWING UP: CFL condition likely violated *********" check cannot be turned off. You can see what the code is doing in src/dynamics/eul/sphdep.F90. Have you tried the first two possible solutions described in the error message (reduce the time step, set DIVDAMPN=1)? If so, you may have an error in your code modifications or your configuration.

Redlichia · Dec 3, 2022

Hi Courtney. Thank you for your patience in replying.

Yes, I still get the "Courant limit exceeded" message in the atm.log file.

I previously tried running with dtime = 1200, 600, and 300 and with DIVDAMPN = 1, but the run still aborted. With both of these approaches (reducing the time step and setting DIVDAMPN=1), the atm.log file still reports "Courant limit exceeded".

I have read through src/dynamics/eul/sphdep.F90 carefully, but I cannot find the error, and I don't know how to change sphdep.F90 to remove the "Courant limit exceeded" restriction.

Do you have any suggestions for changes to sphdep.F90 that would remove it? Attached is my sphdep.F90 file.
--------------------------------------------------------------------------
Thank you for your help!

Best, Chen

nusbaume · Dec 8, 2022

Hi Chen,

When you created the branch run, did you modify anything so that it was different from the out-of-the-box B1850 compset?

Also, when you say that you are modifying dtime, are you doing that in the namelist, or via ATM_NCPL in env_run.xml? In general it is recommended to use the ATM_NCPL option, as that will ensure that the time step is properly updated throughout the model. Also please note that the "coupling interval" is the same as the model timestep, so a value of 48 intervals per day means it is a timestep of 30 minutes.
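
The relationship between ATM_NCPL and the time step can be sketched in a few lines. This is only an illustration: the function name is made up, and the check that ATM_NCPL divides the day evenly is an assumption made here for tidiness, not a documented CESM rule.

```python
SECONDS_PER_DAY = 86400

def dtime_from_ncpl(atm_ncpl):
    """Return the time step in seconds implied by ATM_NCPL coupling intervals per day."""
    # Assumption for this sketch: only accept values that split a day into whole seconds.
    if atm_ncpl <= 0 or SECONDS_PER_DAY % atm_ncpl != 0:
        raise ValueError("ATM_NCPL should be a positive divisor of 86400")
    return SECONDS_PER_DAY // atm_ncpl

# 48 is the value mentioned above; the others are values tried elsewhere in this thread.
for ncpl in (48, 72, 144, 288):
    print(f"ATM_NCPL={ncpl} -> dtime={dtime_from_ncpl(ncpl)} s")
```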

Once you get back to me I can try and contact the scientist(s) responsible for the Eulerian dycore to see if they have any suggestions, although I should note that technically CESM1.2 is no longer supported (only CESM2 is), so it may be a while before I hear back.

Thanks, and have a great day!

Jesse

Redlichia · Dec 9, 2022

Hi Jesse! Thank you for your patient reply; it was very useful to me.

After 3000 years of running, I switched to the "branch" experiment (B1850 compset; the only difference is a changed aerosol loading file), and the model continued to run for another six months before crashing in the seventh month. I believe that the rapid cooling triggered the Courant limiter.

I changed dtime by modifying ATM_NCPL in the env_run.xml file. The default is ATM_NCPL=48 (dtime = 1800 seconds). I tried the following three values: ATM_NCPL=72 (dtime = 1200 seconds), ATM_NCPL=144 (dtime = 600 seconds), and ATM_NCPL=288 (dtime = 300 seconds).

When I set ATM_NCPL=288 (dtime = 300 seconds), the simulation also stops, and the cesm.log file reports the error "clm dtime 1800 and Eclock dtime 300 never align".
--------------------------------------------------------------------------
1131 clm dtime 1800 and Eclock dtime 300 never align
1132 ENDRUN:lnd_init_mct ERROR: time out of sync
1133 -------------------------------------------------------------------------
1134 MPI_ABORT was invoked on rank 54 in communicator MPI COMMUNICATOR 9 CREATE FROM 0
1135 with errorcode 1.
1136
1137 NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
1138 You may or may not see output from other processes, depending on
1139 exactly when Open MPI kills them.
--------------------------------------------------------------------------

Strangely, when I checked the dtime value in the atm_in and lnd_in files, both showed dtime=300. My changes should have taken effect, but cesm.log still reports "clm dtime 1800 and Eclock dtime 300 never align". I don't know what went wrong. Do you have any suggestions for resolving this error?
Attached is my cesm.log file.

Thank you for your help!

Best, Chen

nusbaume · Dec 9, 2022

Hi Chen,

My mistake, I forgot that you actually can’t change the timestep when doing a “branch” run. Instead you’ll need to do a “hybrid” run. You can see the final post here for an explanation as to why. It could also explain why your previous attempts to change the model timestep weren’t working either.

So, I would recommend trying a hybrid run instead of a branch run (climatologically they should basically be the same). If you still get the CFL failure, use ATM_NCPL again to try to decrease the time step.

Of course if you go through all of those steps and still get a CFL failure even with a very short time step then please let me know.

Thanks, and good luck!

Jesse

Redlichia · Dec 13, 2022

Hi Jesse!
Thank you for your patient reply!
As you said, I believe none of my previous attempts to change dtime took effect.
Following your suggestion, I switched to a "hybrid" run and continued, but I got a new error, this time in the POP module. I'm a little confused because I didn't change the time step of the POP module; maybe it has something to do with the coupler (cpl).

Here are the last few lines of the cesm.log file:
2282 --------------------------------------------------------------------------
2283 pop2 ymd= 30010102 pop2 tod= 0
2284 sync ymd= 10103 sync tod= 0
2285 Internal pop2 clock not in sync with Sync Clock
2286 --------------------------------------------------------------------------
2287 MPI_ABORT was invoked on rank 46 in communicator MPI_COMM_WORLD
2288 with errorcode 1001.
2289
2290 NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
2291 You may or may not see output from other processes, depending on
2292 exactly when Open MPI kills them.
2293 --------------------------------------------------------------------------
I also checked the ocn.log file:
--------------------------------------------------------------------------
5308 spChl_SURF: 5.409049444685417E-002
5309 spC_zint_100m: 3266.25312924189
5310 spCaCO3_zint_100m: 98.0537912374824
5311 diatChl_SURF: 1.300554135823338E-002
5312 diatC_zint_100m: 1218.91115298740
5313 diazChl_SURF: 5.346678639632185E-003
5314 diazC_zint_100m: 500.023103900677
5315 (io_pio_init) create file ./AI_500X.pop.h.ecosys.nday1.3001-01-01.nc
5316
5317 data appended to tavg file: ./AI_500X.pop.h.ecosys.nday1.3001-01-01.nc
5318 pop2 ymd= 30010102 pop2 tod= 0
5319 sync ymd= 10103 sync tod= 0
5320 Internal pop2 clock not in sync with Sync Clock
5321 (shr_sys_abort) ERROR: ocn_run_mct:: Internal pop2 clock not in sync with Sync Clock
5322 (shr_sys_abort) WARNING: calling shr_mpi_abort() and stopping
--------------------------------------------------------------------------

This may happen when a run fails in the middle of writing restarts. What should I do to avoid this? I'm puzzled again.
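
To make the two clock values in the log easier to compare, they can be decoded as dates. This sketch assumes the `ymd` values are integers in YYYYMMDD form with leading zeros dropped; that is an assumption about the log format, and the helper name is made up.

```python
def decode_ymd(ymd):
    """Split an integer date of the assumed form YYYYMMDD into (year, month, day)."""
    year, rest = divmod(ymd, 10000)
    month, day = divmod(rest, 100)
    return year, month, day

pop2_date = decode_ymd(30010102)  # "pop2 ymd" value from the log above
sync_date = decode_ymd(10103)     # "sync ymd" value from the log above

print("pop2 clock (year, month, day):", pop2_date)
print("sync clock (year, month, day):", sync_date)
print("clocks agree:", pop2_date == sync_date)
```

Under that assumption the POP clock is in model year 3001 while the sync clock is in year 1, so the two differ in year as well as in day.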

Attached are my cesm.log and pop.log files.

Thank you for your help!

Best, Chen

nusbaume · Dec 14, 2022

Hi Chen,

Sadly, issues with POP are outside my area of expertise. I would thus recommend posting this error in the POP forum here:

POP (bb.cgd.ucar.edu)

Of course if they are able to fix your issue and afterwards you run into another CAM problem then feel free to let me know in this thread.

Thanks, and good luck with the POP error!

Jesse

Redlichia

Yihui Chen

New Member

peverley

Courtney Peverley

Moderator

nusbaume

Jesse Nusbaumer

CSEG and Liaisons
