
SSP585 simulation breaks in year 2095

demetray

Demetra Yancopoulos
New Member
I am using CESM2.1.3 and running a CMIP6 SSP5-85 simulation (2015-2100) with the component set SSP585_CAM60_CLM50%BGC-CROP-CMIP6DECK_CICE%CMIP6_POP2%ECO%ABIO-DIC_MOSART_CISM2%NOEVOLVE_WW3_BGC%BDRD.

I have been submitting the simulation in 3-year chunks. I have successfully run the model through the year 2092 (with restart files for 2093-01-01). From that point, the simulation restarts and runs for a while, but then stops, seemingly at random, partway through the chunk. In the run directory, files seem to be populated up to 2095-06-01, yet nothing after 2092-12-01 exists in the output files, so the run never finishes. I can't find any sign of the error in the log files, and I am totally stumped.

I have resubmitted the simulation a couple of times, and the same thing happens.

Any suggestions for where I can look to diagnose the error? Anybody encounter this error before with an SSP5-85 compset? Are there files I should clear out in the case or run directory before resubmitting the simulation?
 

slevis

Moderator
Staff member
Unless somebody has a different suggestion, I would start by checking whether you have filled some disk space beyond your quota. If not, I would want to know whether the model stops at the same timestep and in the same exact way every time. I might start writing restart files more frequently, so as to restart the simulation closer to the point of failure, and eventually perhaps add write statements in the code that may help reveal where the model crashes.
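
For example, a minimal sketch of how the restart frequency could be increased from the case directory, assuming the standard CIME xmlchange workflow and the REST_OPTION/REST_N settings (worth confirming with ./xmlquery in your own case first):

# From the case directory: write restart files monthly instead of only
# at the end of each submission (sketch; verify for your setup)
./xmlchange REST_OPTION=nmonths
./xmlchange REST_N=1
# Check that the new values took effect
./xmlquery REST_OPTION REST_N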
 

demetray

Demetra Yancopoulos
New Member
Thank you for the advice! I have plenty of space left in scratch (and have some other experiments running and storing output there with no problems). I will start by writing the restart files more frequently. What kind of write statements should I add, and where should I put them?
 

demetray

Demetra Yancopoulos
New Member
I think I was looking in the wrong place for a logged error. In the log files in the run directory (as opposed to the log directory), I was able to find more information:


First:
MARBL ERROR (marbl_co2calc_mod:drtsafe): bounding bracket for pH solution not found



Then a bit later:

(Task 123, block 1) MARBL WARNING (marbl_co2calc_mod:drtsafe): (marbl_co2calc_mod:drtsafe) it = 3

MPICH Notice [Rank 1147] [job id 5c0aa1d0-31f8-4d17-93bb-47b401825b4a] [Sat Dec 27 06:21:34 2025] [dec1332] - Abort(0) (rank 1147 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 0) - process 1147

aborting job:

application called MPI_Abort(MPI_COMM_WORLD, 0) - process 1147


And finally:

forrtl: error (78): process killed (SIGTERM)


One source I found claims the SIGTERM means an outside process is killing the job, whereas other sources suggest it is model instability (though when I look at the outputs, the values seem reasonable). Do you know whether this is an outside process or something internal to CESM2? Do you have any suggestions for how I can fix it?

I am not interested in the marine biogeochemistry, so perhaps I should just turn MARBL off... How can I do that? Is it a problem to do it partway through the experiment (I already have many decades of data I don't want to give up)?
 

slevis

Moderator
Staff member
I do not have experience with this type of simulation, but looking at your compset, maybe "ECO" is the keyword that turned on MARBL. I do not know if turning off MARBL will mess up other things. I guess you could create a new case without MARBL and start it some years before the crash, so as to compare output with and without MARBL. If turning off MARBL resolves the crash, then you may find it best to start the simulation over, especially if you need to show a clean methodology for a paper.
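
If you do try that, here is a minimal sketch of one way it could be set up as a branch run from the existing simulation. The case name, resolution, reference date, and especially the compset long name (your compset with the POP2 %ECO ocean-ecosystem modifier dropped) are placeholders that would need to be checked against what is actually supported:

# Sketch only: new case with the ocean ecosystem (%ECO) removed from the compset
# (run create_newcase from cime/scripts; compset and resolution below are unverified placeholders)
./create_newcase --case ssp585_no_marbl_test \
  --compset SSP585_CAM60_CLM50%BGC-CROP-CMIP6DECK_CICE%CMIP6_POP2%ABIO-DIC_MOSART_CISM2%NOEVOLVE_WW3_BGC%BDRD \
  --res f09_g17
cd ssp585_no_marbl_test
./case.setup
# Branch from the existing case a few years before the crash; the restart files
# for that date also need to be staged in the new run directory
./xmlchange RUN_TYPE=branch
./xmlchange RUN_REFCASE=<your_existing_case_name>
./xmlchange RUN_REFDATE=2090-01-01
./case.build
./case.submit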
 

mlevy

Michael Levy
CSEG and Liaisons
Staff member
This MARBL error is often indicative of other problems in the run. For these future simulations, we frequently see ice melt along the coast of Greenland and, due to the way the grid is defined (the north pole has been displaced into Greenland, so the small near-polar grid cells are close to the island), we get a numerical instability due to increasing velocities in these small cells. Can you provide more of the MARBL ERROR output? Specifically, I'm looking for a line that provides latitude and longitude, as well as lines providing values for dic, ta, pt, sit, temp, and salt. It's not uncommon to see this error associated with temperatures on the order of -100° C, and the solution is to restart the run with a smaller ocean time step (I believe the default value for dt_count is 24, and setting it to 48 in user_nl_pop is often sufficient... but the ocean component will take twice as long to run because it's taking twice as many time steps)
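
For reference, a minimal sketch of how that time-step change might be applied from the case directory (dt_count is a POP namelist variable, so it goes into user_nl_pop rather than through xmlchange):

# Halve the ocean time step by doubling dt_count from its default of 24
echo "dt_count = 48" >> user_nl_pop
# Preview the generated namelists to confirm the setting is picked up
./preview_namelists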
 

demetray

Demetra Yancopoulos
New Member
Thanks for the helpful info! The lines that directly precede the first MARBL ERROR are:

Message from (lon, lat) ( 348.986, 84.352), which is global (i,j) (136, 375). Level: 19

MARBL WARNING (marbl_co2calc_mod:drtsafe): (marbl_co2calc_mod:drtsafe) it = 4

MARBL WARNING (marbl_co2calc_mod:drtsafe): (marbl_co2calc_mod:drtsafe) x1,f = 0.8564766E-009-0.1179223E-003

MARBL WARNING (marbl_co2calc_mod:drtsafe): (marbl_co2calc_mod:drtsafe) x2,f = 0.1357424E-005-0.2006994E-002

MARBL ERROR (marbl_co2calc_mod:drtsafe): bounding bracket for pH solution not found

-------------
However, there are many MARBL WARNING messages before the first MARBL ERROR. These warnings appear at a variety of locations, often very far apart. The first warning says:

(Task 111, block 1) Message from (lon, lat) ( 318.428, 50.883), which is global (i,j) (319, 330). Level: 14

(Task 111, block 1) MARBL WARNING (marbl_interior_tendency_mod:compute_large_detritus_prod): dz*DOP_loss_P_bal= 0.133E-011 exceeds Jint_Ptot_thres= 0.271E-013

-------------
Elsewhere, I see this MARBL WARNING:

(Task 66, block 1) Message from (lon, lat) ( 92.187, 2.009), which is global (i,j) (118, 195). Level: 13

(Task 66, block 1) MARBL WARNING (marbl_interior_tendency_mod:compute_large_detritus_prod): dz*DOP_loss_P_bal= 0.391E-011 exceeds Jint_Ptot_thres= 0.271E-013

Negative conc. in ch4tran. c,j,deficit (mol): 379759 2

1.174526595236352E-003

-------------


When I look at values of ocean temperature and salinity globally, the values seem generally reasonable. For instance, in May 1997 under SSP585 forcing, potential temperature at 500 cm ranges from -1.9 to 36 degrees Celsius, and salinity ranges from 0.006 to 0.045 kg/kg.

For these experiments, I am only concerned with ENSO dynamics (i.e., ENSO period/strength, etc., not BGC), so if there is an instability in ocean biogeochemistry near Greenland, my instinct is that it's not too important. Do you think it would be reasonable to continue "brute forcing" the experiment (usually, when I resubmit the experiment with a different restart file frequency, I can get the simulation to advance)?

Or, can I possibly shut down MARBL in the middle of the experiment? I understand that is not the cleanest methodology, but I'd like to avoid wasting computer resources if there is a good chance this bug isn't affecting the integrity of ENSO dynamics.

Attached are screenshots of ncview plots of TEMP and SALT at 500 cm in May 1997 under SSP585 forcing.
 

Attachments

  • Screenshot 2026-01-01 at 3.07.29 PM.png
  • Screenshot 2026-01-01 at 3.15.59 PM.png

demetray

Demetra Yancopoulos
New Member

I've now narrowed this down to two main warnings/errors in MARBL.

The first is as you suggested: a MARBL ERROR that indicates there are very low temperatures in some locations near the Arctic: ( 348.986, 84.352). Since I am studying the tropics, should I still be concerned with this? Do you think it contaminates the whole simulation?
I can sometimes resubmit the simulation from recent restart files and the error is not thrown. Should I brute-force my way through this or restart the simulation altogether? I have already simulated 82/85 years of the SSP585 simulation, so I am at the very tail end of things when this error arises...

The second is a MARBL WARNING, which occurs at a variety of longitudes near the equator, and throws a message like this:
(Task 66, block 1) Message from (lon, lat) ( 92.187, 2.009), which is global (i,j) (118, 195). Level: 13
(Task 66, block 1) MARBL WARNING (marbl_interior_tendency_mod:compute_large_detritus_prod): dz*DOP_loss_P_bal= 0.391E-011 exceeds Jint_Ptot_thres= 0.271E-013
Negative conc. in ch4tran. c,j,deficit (mol): 379759 2
1.174526595236352E-003

I am not sure how to interpret this warning. Is it something I should be concerned about?
 

mlevy

Michael Levy
CSEG and Liaisons
Staff member
When I look at values of ocean temperature and salinity globally, the values seem generally reasonable

The values you are looking at are monthly averages, and the model is aborting due to a problem at a specific time step... the unrealistic value reported in the error message does not show up in the output from any of the previous time steps.

The first is as you suggested: a MARBL ERROR that indicates there are very low temperatures in some locations near the Arctic: ( 348.986, 84.352). Since I am studying the tropics, should I still be concerned with this? Do you think it contaminates the whole simulation?

I do think unrealistic properties off the coast of Greenland will propagate throughout the model and you should be concerned. Could you share the temperature and salinity values you are seeing in the error message? That should be something like

MARBL ERROR (marbl_co2calc_mod:drtsafe): (marbl_co2calc_mod:drtsafe) temp = ####
MARBL ERROR (marbl_co2calc_mod:drtsafe): (marbl_co2calc_mod:drtsafe) salt = ####

The second is a MARBL WARNING, which occurs at a variety of longitudes near the equator, and throws a message like this:
(Task 66, block 1) Message from (lon, lat) ( 92.187, 2.009), which is global (i,j) (118, 195). Level: 13
(Task 66, block 1) MARBL WARNING (marbl_interior_tendency_mod:compute_large_detritus_prod): dz*DOP_loss_P_bal= 0.391E-011 exceeds Jint_Ptot_thres= 0.271E-013
I don't think this is a huge concern; it is a conservation check, and we expect round-off-level errors from various arithmetic operations to accumulate. The threshold is somewhat arbitrary, and being 2 orders of magnitude above it is different from this check being O(1).


I stand by my earlier recommendation:

the solution is to restart the run with a smaller ocean time step (I believe the default value for dt_count is 24, and setting it to 48 in user_nl_pop is often sufficient... but the ocean component will take twice as long to run because it's taking twice as many time steps)

We can talk about turning off the BGC, but you'll still want to run with the smaller time step, and you're so close to the end of your run that I don't think it's worth the effort to make that change... restarting from 2093 with the smaller time step should let you get through 2100 without issues.
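
Concretely, a minimal sketch of what that could look like from the case directory, assuming the run continues from the most recent restart set via the rpointer files, and with STOP settings that are purely illustrative (adjust them to however you have been chunking the submissions):

# Sketch: continue the run to 2100 with the smaller ocean time step
echo "dt_count = 48" >> user_nl_pop   # skip if already added
./xmlchange CONTINUE_RUN=TRUE
# Cover the remaining years in a single submission (illustrative values)
./xmlchange STOP_OPTION=nyears
./xmlchange STOP_N=8
./case.submit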
 

demetray

Demetra Yancopoulos
New Member
The values you are looking at are monthly averages, and the model is aborting due to a problem at a specific time step... the unrealistic value reported in the error message does not show up in the output from any of the previous time steps.
Thank you so much for the fast response! So, if I restart from an earlier timestep with a larger dt_count, should I go back a few years to ensure I am clipping all the unrealistic values? As you suggested, 2093 will be sufficient if the error is occurring in 2097?

Will changing dt_count in the middle of a run be problematic? Will this somehow change the results? Since I have so little time left, it may not be a big deal -- but I have another ensemble member of the same experiment that failed much earlier (around 2040) -- will changing dt_count at this early stage still be okay, or should I just restart that ensemble member from the beginning?

Also, is it valid to compare experiments with different time integration steps? My pre-industrial control experiment has dt_count=24.

I do think unrealistic properties off the coast of Greenland will propagate throughout the model and you should be concerned. Could you share the temperature and salinity values you are seeing in the error message?
I am getting values at location (lon, lat) ( 348.986, 84.352):
dic = 0.2019165E+004
ta = 0.2083389E+004
pt = 0.1950551E+001
sit = 0.6517283E+002
temp = -0.6779240E+002
salt = 0.2820285E+002
 

mlevy

Michael Levy
CSEG and Liaisons
Staff member
So, if I restart from an earlier timestep with a larger dt_count, should I go back a few years to ensure I am clipping all the unrealistic values? As you suggested, 2093 will be sufficient if the error is occurring in 2097?

Yeah, I think the non-physical values trigger the CFL violation pretty quickly; if your run was crashing on January 3rd, 2093, then maybe the January 1st, 2093, restart would be inappropriate... but given the failure is more than a year into the run you should be fine starting from your most recent restart file.

Will changing dt_count in the middle of a run be problematic? Will this somehow change the results?

Changing the ocean time step will change the results, but from a statistical standpoint the run with the smaller time step can still be thought of as a different representation of the same climate... and in that sense, the fact that the results are different is not problematic.

I have another ensemble member of the same experiment that failed much earlier (around 2040) -- will changing dt_count at this early stage still be okay, or should I just restart that ensemble member from the beginning?

Changing dt_count earlier in the run is also fine; another way to phrase my comment above is that two runs with different time steps will produce different results, but as long as the numerics are stable in both runs they can each be thought of as a different sampling from the same statistical distribution of possible climate states. Any large differences between them indicate a process with particularly high variance, rather than one run being correct and the other being wrong.

Also, is it valid to compare experiments with different time integration steps? My pre-industrial control experiment has dt_count=24.

Yes, this is also okay. I believe there are members of the CESM Large Ensemble where this issue occurred and the time step was changed mid-run.

I am getting values at location (lon, lat) ( 348.986, 84.352):
dic = 0.2019165E+004
ta = 0.2083389E+004
pt = 0.1950551E+001
sit = 0.6517283E+002
temp = -0.6779240E+002
salt = 0.2820285E+002

Thanks! -68° sounds chilly...
 