
Model Slowdown

Hi CESM2 Users/Developers,

I am encountering slowdowns in model performance that I think are related to MARBL. I am running fully coupled ~1x1 simulations on 40 nodes with 36 cores each (Piz Daint, CSCS Lugano). Regardless of output frequency (sub-daily or monthly), the simulation will suddenly slow down: I typically get about 10-15 minutes of wall clock time per model month, but when the slowdown occurs it takes 60-70 minutes of wall clock time per model month.

I've tried perturbing the ocean TS initial conditions by 1.e-14 and adjusting dt_count up to 40.

This occurs frequently, in roughly every other 5-year chunk of simulation, and it is chewing through my computing time allotment. Does anyone have ideas about how to fix this?

Cheers,
-Jonathan
 

erik

Erik Kluzek
CSEG and Liaisons
Staff member
@mlevy since Jonathan thinks this might be related to MARBL, do you have any advice?
 

erik

Erik Kluzek
CSEG and Liaisons
Staff member
Jonathan

Hmmm. I think you'll need to "catch" the model both when it's fast and when it's slow. The timing files should help you figure out what is going on, if you can get clean simulations with both slow and fast behavior. Is the slowdown consistent for a given section of the simulation (in time)? If so, you can rerun the periods where it's slow and the periods where it's fast. If not, the slowdown might be something more complex having to do with the machine state, or some type of race condition within the model.
 
Hi Erik,

Do the timing files for this investigation need to come from runs that exited successfully, or will timing files from incomplete jobs be useful?

Cheers,
-Jonathan
 

erik

Erik Kluzek
CSEG and Liaisons
Staff member
The jobs need to be completed, as the timing files are only output when the simulation finishes.
 
Thanks Erik. When I catch this happening again, I'll rerun a short simulation so it exits successfully, then try running it again with adjustments to make it execute faster. In the meantime, all of the adjustments I make to speed the simulations up go into user_nl_pop.
For example, here is a current simulation that is running at the 'normal' speed:

user_nl_pop
init_ts_perturb = 1.0e-14
dt_count = 46
 
Hi Erik,

Here are two examples. The ocean seems to be the limiting factor in both cases, and in this case increasing dt_count did not improve performance. I did not perturb the initial conditions of the ocean this time. This is the MARBL warning that prompts me to adjust the model time step:

(Task 277, block 1) Message from (lon, lat) ( 345.671, 83.866), which is global (i,j) (135, 376). Level: 14
(Task 277, block 1) MARBL WARNING (marbl_co2calc_mod:drtsafe): (marbl_co2calc_mod:drtsafe) it = 4
(Task 277, block 1) MARBL WARNING (marbl_co2calc_mod:drtsafe): (marbl_co2calc_mod:drtsafe) x1,f = 0.3455379E-009-0.1578166E-005
(Task 277, block 1) MARBL WARNING (marbl_co2calc_mod:drtsafe): (marbl_co2calc_mod:drtsafe) x2,f = 0.5476407E-006-0.6724740E-005
(Task 277, block 1) MARBL ERROR (marbl_co2calc_mod:drtsafe): bounding bracket for pH solution not found
(Task 277, block 1) MARBL ERROR (marbl_co2calc_mod:drtsafe): (marbl_co2calc_mod:drtsafe) dic = 0.0000000E+000
(Task 277, block 1) MARBL ERROR (marbl_co2calc_mod:drtsafe): (marbl_co2calc_mod:drtsafe) ta = 0.0000000E+000
(Task 277, block 1) MARBL ERROR (marbl_co2calc_mod:drtsafe): (marbl_co2calc_mod:drtsafe) pt = 0.0000000E+000
(Task 277, block 1) MARBL ERROR (marbl_co2calc_mod:drtsafe): (marbl_co2calc_mod:drtsafe) sit = 0.0000000E+000
(Task 277, block 1) MARBL ERROR (marbl_co2calc_mod:drtsafe): (marbl_co2calc_mod:drtsafe) temp = -0.8400911E+002
(Task 277, block 1) MARBL ERROR (marbl_co2calc_mod:drtsafe): (marbl_co2calc_mod:drtsafe) salt = -0.2193564E+002
(Task 277, block 1) MARBL ERROR (marbl_co2calc_mod:comp_htotal): Error reported from drtsafe
(Task 277, block 1) MARBL ERROR (marbl_co2calc_mod:marbl_co2calc_interior): Error reported from comp_htotal()
(Task 277, block 1) MARBL ERROR (marbl_interior_tendency_mod:compute_carbonate_chemistry): Error reported from marbl_co2calc_interior() with dic
(Task 277, block 1) MARBL ERROR (marbl_interior_tendency_mod:marbl_interior_tendency_compute): Error reported from compute_carbonate_chemistry()
(Task 277, block 1) MARBL ERROR (marbl_interface:interior_tendency_compute): Error reported from marbl_interior_tendency_compute()
(Task 277, block 1) MARBL ERROR (ecosys_driver:ecosys_driver_set_interior): Error reported from marbl_instances(1)%set_interior_forcing()
ERROR reported from MARBL library

Timing files after POP time step adjustment:

user_nl_pop
dt_count = 42
Runs Time in total seconds, seconds/model-day, and model-years/wall-day
CPL Run Time represents time in CPL pes alone, not including time associated with data exchange with other components

TOT Run Time: 1200.805 seconds 38.736 seconds/mday 6.11 myears/wday
CPL Run Time: 36.912 seconds 1.191 seconds/mday 198.80 myears/wday
CPL COMM Time: 517.374 seconds 16.689 seconds/mday 14.18 myears/wday
ATM Run Time: 629.256 seconds 20.299 seconds/mday 11.66 myears/wday
CPL COMM Time: 517.374 seconds 16.689 seconds/mday 14.18 myears/wday
LND Run Time: 357.691 seconds 11.538 seconds/mday 20.52 myears/wday
CPL COMM Time: 517.374 seconds 16.689 seconds/mday 14.18 myears/wday
ICE Run Time: 71.036 seconds 2.291 seconds/mday 103.30 myears/wday
CPL COMM Time: 517.374 seconds 16.689 seconds/mday 14.18 myears/wday
OCN Run Time: 972.675 seconds 31.377 seconds/mday 7.54 myears/wday
CPL COMM Time: 517.374 seconds 16.689 seconds/mday 14.18 myears/wday
ROF Run Time: 10.170 seconds 0.328 seconds/mday 721.54 myears/wday
CPL COMM Time: 517.374 seconds 16.689 seconds/mday 14.18 myears/wday
GLC Run Time: 1.201 seconds 0.039 seconds/mday 6109.98 myears/wday
CPL COMM Time: 517.374 seconds 16.689 seconds/mday 14.18 myears/wday
WAV Run Time: 39.255 seconds 1.266 seconds/mday 186.93 myears/wday
CPL COMM Time: 517.374 seconds 16.689 seconds/mday 14.18 myears/wday
ESP Run Time: 0.000 seconds 0.000 seconds/mday 0.00 myears/wday
CPL COMM Time: 517.374 seconds 16.689 seconds/mday 14.18 myears/wday


---------------- DRIVER TIMING FLOWCHART ---------------------

user_nl_pop
dt_count = 50
Runs Time in total seconds, seconds/model-day, and model-years/wall-day
CPL Run Time represents time in CPL pes alone, not including time associated with data exchange with other components

TOT Run Time: 1488.896 seconds 48.029 seconds/mday 4.93 myears/wday
CPL Run Time: 26.001 seconds 0.839 seconds/mday 282.22 myears/wday
CPL COMM Time: 868.972 seconds 28.031 seconds/mday 8.44 myears/wday
ATM Run Time: 574.612 seconds 18.536 seconds/mday 12.77 myears/wday
CPL COMM Time: 868.972 seconds 28.031 seconds/mday 8.44 myears/wday
LND Run Time: 313.910 seconds 10.126 seconds/mday 23.38 myears/wday
CPL COMM Time: 868.972 seconds 28.031 seconds/mday 8.44 myears/wday
ICE Run Time: 81.695 seconds 2.635 seconds/mday 89.82 myears/wday
CPL COMM Time: 868.972 seconds 28.031 seconds/mday 8.44 myears/wday
OCN Run Time: 1453.741 seconds 46.895 seconds/mday 5.05 myears/wday
CPL COMM Time: 868.972 seconds 28.031 seconds/mday 8.44 myears/wday
ROF Run Time: 8.589 seconds 0.277 seconds/mday 854.36 myears/wday
CPL COMM Time: 868.972 seconds 28.031 seconds/mday 8.44 myears/wday
GLC Run Time: 1.376 seconds 0.044 seconds/mday 5332.91 myears/wday
CPL COMM Time: 868.972 seconds 28.031 seconds/mday 8.44 myears/wday
WAV Run Time: 37.203 seconds 1.200 seconds/mday 197.24 myears/wday
CPL COMM Time: 868.972 seconds 28.031 seconds/mday 8.44 myears/wday
ESP Run Time: 0.000 seconds 0.000 seconds/mday 0.00 myears/wday
CPL COMM Time: 868.972 seconds 28.031 seconds/mday 8.44 myears/wday


---------------- DRIVER TIMING FLOWCHART ---------------------
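
As a side note for anyone comparing timing summaries like the two above, the per-component throughput lines can be parsed programmatically to find the bottleneck. A minimal sketch in Python (the helper and regex are my own, based only on the `<COMP> Run Time:` line format shown above):

```python
import re

def slowest_component(timing_text):
    """Parse '<COMP> Run Time: ... myears/wday' lines from a CESM timing
    summary and return the active component with the lowest throughput."""
    rates = {}
    for line in timing_text.splitlines():
        m = re.match(r"\s*(\w+) Run Time:.*?([\d.]+) myears/wday", line)
        if m and m.group(1) != "TOT":
            rates[m.group(1)] = float(m.group(2))
    # Drop idle components that report 0.00 myears/wday (e.g. ESP)
    active = {comp: r for comp, r in rates.items() if r > 0.0}
    return min(active, key=active.get), active

sample = """\
TOT Run Time: 1488.896 seconds 48.029 seconds/mday 4.93 myears/wday
ATM Run Time: 574.612 seconds 18.536 seconds/mday 12.77 myears/wday
OCN Run Time: 1453.741 seconds 46.895 seconds/mday 5.05 myears/wday
ESP Run Time: 0.000 seconds 0.000 seconds/mday 0.00 myears/wday
"""
bottleneck, rates = slowest_component(sample)
print(bottleneck, rates[bottleneck])  # OCN 5.05
```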
 

erik

Erik Kluzek
CSEG and Liaisons
Staff member
OK, so I take it the second one is the one with the slowdown? In the top one, OCN runs the slowest at 7.54 myears/wday, so the overall rate of 6.11 seems about right. ATM is the next slowest at 11.66, then CPL COMM at 14.18.

In the bottom one, OCN slows down to 5.05, and CPL COMM also slows down to 8.44. Some of the other components are actually slightly faster, like ATM at 12.77, but none of that matters with OCN being the bottleneck. So I don't see anything you can do about this, unless someone else can suggest a way to improve the OCN.
 

mlevy

Michael Levy
CSEG and Liaisons
Staff member
Sorry to take so long to join the conversation. The error you are seeing

Code:
(Task 277, block 1) MARBL ERROR (marbl_co2calc_mod:drtsafe): bounding bracket for pH solution not found
(Task 277, block 1) MARBL ERROR (marbl_co2calc_mod:drtsafe): (marbl_co2calc_mod:drtsafe) dic =  0.0000000E+000
(Task 277, block 1) MARBL ERROR (marbl_co2calc_mod:drtsafe): (marbl_co2calc_mod:drtsafe) ta =  0.0000000E+000
(Task 277, block 1) MARBL ERROR (marbl_co2calc_mod:drtsafe): (marbl_co2calc_mod:drtsafe) pt =  0.0000000E+000
(Task 277, block 1) MARBL ERROR (marbl_co2calc_mod:drtsafe): (marbl_co2calc_mod:drtsafe) sit =  0.0000000E+000
(Task 277, block 1) MARBL ERROR (marbl_co2calc_mod:drtsafe): (marbl_co2calc_mod:drtsafe) temp = -0.8400911E+002
(Task 277, block 1) MARBL ERROR (marbl_co2calc_mod:drtsafe): (marbl_co2calc_mod:drtsafe) salt = -0.2193564E+002
(Task 277, block 1) MARBL ERROR (marbl_co2calc_mod:comp_htotal): Error reported from drtsafe
(Task 277, block 1) MARBL ERROR (marbl_co2calc_mod:marbl_co2calc_interior): Error reported from comp_htotal()

is common to see when you are violating the CFL condition -- i.e. your time step is too large. This isn't a problem in MARBL so much as MARBL acting as the canary in the coal mine to tell you the run is going off the rails. Notice that the BGC tracers (DIC, ALK, PO4, SiO3) are all 0, while temperature is -84 C and salinity is -22 psu; when you adjust dt_count are you doing that in a restart around when the model is slowing down? You may need to back up to an earlier restart and change the time step before the physics have gotten, well, unphysical.
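
For reference, if dt_option = 'steps_per_day' is set in POP (an assumption; check the dt_option value in your pop_in), dt_count is the number of baroclinic ocean time steps per model day, so raising dt_count shortens the time step. A quick sketch of the values tried in this thread:

```python
SECONDS_PER_DAY = 86400.0

def ocean_timestep(dt_count):
    """Baroclinic time step in seconds, assuming dt_option = 'steps_per_day'
    so that dt_count is the number of ocean steps per model day."""
    return SECONDS_PER_DAY / dt_count

for n in (42, 46, 50):
    print(f"dt_count = {n}: dt = {ocean_timestep(n):.0f} s")
```

Going from dt_count = 42 to 50 only shrinks the step from roughly 2057 s to 1728 s, which is why backing up to a restart before the fields go unphysical matters more than the exact value.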
 

mlevy

Michael Levy
CSEG and Liaisons
Staff member
I should've asked in the above comment -- what forcing are you using for this run? If you are forcing a warming climate (such as a high-emissions future scenario), this is a common issue to run into. If you are running a paleo simulation with a non-standard land mask, that can also lead to these types of errors. If you are running a pre-industrial / historical run, the out-of-the-box default time step should be sufficient and we should look into whether you successfully ported the model to your machine.
 
Hi Michael Levy,
Thanks for chiming in. Yes, I agree that this is a CFL criterion error. I was adjusting the time step, perturbing the TS initial conditions, and restarting from an earlier restart file when necessary. Sometimes the model would speed back up, sometimes not. I've also noticed that, as the simulation progresses, the increased dt_count needs to stay elevated.

As for model forcing:
I am using a 'leapfrog' approach in which I adjust the carbon emissions file every 5 years to push CESM2 toward the Paris Agreement and other global temperature targets (1.5°C, 2.0°C, and 3.0°C). What is more interesting to me is that these CFL errors seem to occur after the temperature target is reached, while the emissions are being adjusted to try to hold the simulation at the target.

I had no issues with the pre-industrial/historical run. The issue comes about when the 'stable climate' is reached. For example, due to how the model is spun up (using the BPRP settings), CESM2 global mean surface temperature is at 1.5°C by 2026, and the 1.5°C target simulation starts encountering the CFL criteria issues ~2050, whereas the 3.0°C simulation hits 3.0°C at ~2100, and starts having CFL criteria issues at ~2125.
 

mlevy

Michael Levy
CSEG and Liaisons
Staff member
The issue comes about when the 'stable climate' is reached. For example, due to how the model is spun up (using the BPRP settings), CESM2 global mean surface temperature is at 1.5°C by 2026, and the 1.5°C target simulation starts encountering the CFL criteria issues ~2050, whereas the 3.0°C simulation hits 3.0°C at ~2100, and starts having CFL criteria issues at ~2125.
This is very interesting. I'm not sure how familiar you are with our grid, but given that this is a public forum I'll risk telling you something you already know under the guise of potentially helping future users who find this thread :) Our ocean grid uses a displaced pole to hide the grid convergence at the north pole in Greenland; this means the ocean cells get smaller as they approach Greenland, but the vanishing cells themselves are masked out as land.

What we typically see in our future simulations is that warming climates cause sea ice to melt off the coast of Greenland, exposing the smaller grid cells to atmospheric forcing. This leads to large velocities in these newly exposed cells, which drive the non-physical T & S values that eventually trigger the MARBL error (when BGC is enabled). In fact, the error you reported is off the NE coast (Message from (lon, lat) ( 345.671, 83.866)). So I am very surprised that the simulation targeting 3 C of warming doesn't abort until 2125, when the 1.5 degree target hits this error in 2050. Have you looked at sea ice coverage in both of these simulations? And how about temperature? It might be the case that the 3 degree case is showing non-physical T & S values in that region very early in the simulation, but some chaotic process is preventing it from blowing up the CO2 solver in MARBL until much later.

So my current diagnosis is that you're just increasing dt_count too late, but I'll look for similar threads here / also scour my email to see if I can give you a better idea of when you need to change the time step.
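
The advective CFL limit described above scales as dt_max ≈ Δx / |u|, so a small near-Greenland cell that suddenly sees a large velocity demands a much shorter time step than the default. A toy sketch with purely illustrative numbers (not taken from the actual grid):

```python
def cfl_max_timestep(dx_m, speed_ms):
    """Largest stable advective time step, from |u| * dt / dx <= 1."""
    return dx_m / abs(speed_ms)

# Purely illustrative values for a small cell near the displaced pole:
dx = 10_000.0                          # ~10 km cell width
print(cfl_max_timestep(dx, 0.5))       # typical current: dt_max = 20000 s
print(cfl_max_timestep(dx, 10.0))      # spurious velocity: dt_max = 1000 s
```

With these numbers, a default half-hour (~1800 s) step is comfortably stable for the typical current but violates the limit once the spurious velocity appears, which is consistent with needing a larger dt_count only after the blowup begins.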
 
Thanks Michael Levy for all of the information. Yes, I am aware of the warped pole grids. Greenland is not the only location where this occurs; I've seen it around the Indonesian islands, too. Overall, though, the simulations show the characteristic AMOC slowdown response to global warming. I'll look at sea ice more closely; this is what my colleagues and I have suspected, too: some locations are newly exposed to atmospheric fluxes, and that's throwing off the system. But that doesn't explain the occasional Indonesian errors.
 