"Forced exit from Rosenbrock due to step size too small" in CESM 2.3 beta17

Liyl2

Yali Li
New Member
Describe your problem or question:

Hi CESM team,
I am Yali, a PhD student from Southern University of Science and Technology (SUSTech) in China.

I am new to CESM and am currently running CESM with GEOS-Chem chemistry (cesm2_3_beta17). I have encountered a Rosenbrock integrator convergence issue. The model does not stop immediately, but the simulation becomes extremely slow, and the runtime increases significantly as the simulation proceeds. Eventually, the run stops after 2–3 simulated months. For reference, on a single node (128 cores), a default FHIST run with CAM-chem takes about 2 hours, while the same configuration with GEOS-Chem chemistry (FCHIST_GC) takes about 4.5 hours and shows the instability described above.

I would appreciate any suggestions on how to diagnose or fix this issue.

---

In addition, I have a few related questions:
1. Since my research requires GEOS-Chem chemistry, I understand that the model may need to run on a single node (as suggested in Fritz et al. 2022). Do you have recommendations for optimal processor layout (NTASKS/ROOTPE) for different components (CAM, LND, OCN, etc.) in this case?
2. Is there a recommended CESM version for simulations with GEOS-Chem chemistry? Also, when were CESM 2.3 and CESM3 officially released? And in general, are the official releases more stable or faster than the alpha/beta versions?
3. For GEOS-Chem-enabled compsets (e.g., FC2000climo_GC, FC2010climo_GC, FCHIST_GC, and FCnudged_GC), could you briefly explain their typical use cases? Can I define custom compsets for specific applications?

Any suggestions would be greatly appreciated! Thank you very much for your help!


Have you made any changes to files in the source tree?
Yes. Following the known issues documented on the "GEOS-Chem in CESM" page (GEOS-Chem in CESM - Geos-chem), two code modifications were applied:
(1) GEOS-Chem landmap issue: The call to GC_Error in olson_landmap_mod.F90 was commented out to bypass the "State_Met%IUSE is zero" error.
(2) HEMCO vertical grid issue: Pointers BXHEIGHT, PEDGE, and ZSFC in hemco_interface.F90 were initialized to NULL() to prevent errors in CalcVertGrid.

Describe every step you took leading up to the problem:
./create_newcase \
--case /xxx/case.FCHIST_GC_2.3beta_t1 \
--compset FCHIST_GC \
--res f19_f19_mg17 \
--run-unsupported \
--mach sustech-qiming-release \
--output-root /xxx/

./xmlchange NTASKS_CPL=4,NTASKS_ATM=108,NTASKS_LND=8,NTASKS_ICE=2,NTASKS_OCN=2,NTASKS_ROF=1,NTASKS_GLC=1,NTASKS_WAV=1,NTASKS_ESP=1
./xmlchange ROOTPE_CPL=0,ROOTPE_ATM=4,ROOTPE_LND=112,ROOTPE_ICE=120,ROOTPE_OCN=122,ROOTPE_ROF=124,ROOTPE_GLC=125,ROOTPE_WAV=126,ROOTPE_ESP=127
./xmlchange MAX_TASKS_PER_NODE=128,MAX_MPITASKS_PER_NODE=128

./xmlchange STOP_OPTION=nyears,STOP_N=1
./xmlchange RUN_STARTDATE=2000-01-01
./case.setup

vi HEMCO_Config.rc
ROOT: /xxx/cesm_input/inputdata/atm/cam/geoschem/emis/ExtData/HEMCO
METDIR: not_used
--> OFFLINE_SOILNOX : false # 1980-2021

vi geoschem_config.yml
start_date: 20000101 000000
end_date: 20010101 010000
photolysis:
  input_dir: /xxx/cesm_input/inputdata/atm/cam/geoschem/ExtData/CHEM_INPUTS/FAST_JX/v2020-10/

vi user_nl_cam
! Monthly mean output
avgflag_pertape = 'A'
nhtfrq = 0
mfilt = 1
fincl1 = 'T:A', 'TS:A', 'PS:A', 'PSL:A', 'PRECT:A',
'FLNT:A', 'FSNT:A', 'LWCF:A', 'SWCF:A',
'O3:A', 'SST:A','m_CH4_c:A','cb_CH4_c:A'
inithist = 'MONTHLY'
history_amwg = .true.

./case.setup --reset
./case.build

bsub < run_cesm.job (The job file is as follows.)
#!/bin/bash
#BSUB -J CESM_test # Job name
#BSUB -q v3-6t # SELECT QUEUE HERE (e.g., 33, 38, v3-64, 1t75c)
#BSUB -n 128 # Total number of cores
#BSUB -R "span[ptile=128]" # Cores per node (for single node: set equal to -n)
#BSUB -o %J.out # Output file
#BSUB -e %J.err # Error file

source ~/.bashrc
cd /xxx/case.FCHIST_GC_2.3beta_t1/run/
rm -rf *.log.*
export GFORTRAN_ERROR_DUMPCORE=0
export SIGFPE_DUMPCORE=0
mpirun -np 128 /xxx/case.FCHIST_GC_2.3beta_t1/bld/cesm.exe >> cesm.log. 2>&1

I am using a new machine, "sustech-qiming-release", with the GCC 11.2.0 compiler.
I have attached the relevant machine-porting files, including ~/.cime/config_compilers.xml, config_batch.xml, config_machines.xml, and config_pes.xml. The error occurs during the run stage rather than during the build stage. I have attached all available run log files, including cesm.log, cpl.log, and the component log files. The cesm.log and atm.log files are quite large, so I’ve removed some repetitive parts.
 


hplin

Haipeng Lin
Moderator
Staff member
Thanks for writing.

Eventually, the run stops after about 2–3 months of simulation. For reference, on a single node (128 cores), a default FHIST run with CAM-chem takes about ~2 hours, while the same configuration with GEOS-Chem chemistry (FCHIST_GC) takes ~4.5 hours and shows the instability described above.
I think some of the slowness is due to the instability, which makes KPP keep retrying the solve. However, the GEOS-Chem chemical mechanism is considerably more complex than MOZART-TS1, so I would expect 3-4 hours to be normal if CAM-chem takes ~2 hours.
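One quick way to see whether the solver failures grow as the run proceeds is to count the error message in each log file. This is a minimal sketch, assuming the message text matches the thread title and that the logs are plain text (CESM may gzip logs once a run finishes, in which case decompress them first). `count_rosenbrock_exits` is a hypothetical helper, not part of CESM or GEOS-Chem.

```shell
# Hypothetical helper: print "<count> <file>" for every log file given.
# The search string is taken from the error quoted in the thread title.
count_rosenbrock_exits() {
  local f
  for f in "$@"; do
    # grep -c prints 0 (and returns non-zero) when nothing matches,
    # so the count is still printed for clean logs.
    printf '%s %s\n' "$(grep -c 'Forced exit from Rosenbrock' "$f")" "$f"
  done
}

# Example (run directory assumed): count_rosenbrock_exits cesm.log.*
```

A count that rises from one job segment to the next would be consistent with the slowdown described above, although it does not say which grid cells or species are involved.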

1. Since my research requires GEOS-Chem chemistry, I understand that the model may need to run on a single node (as suggested in Fritz et al. 2022). Do you have recommendations for optimal processor layout (NTASKS/ROOTPE) for different components (CAM, LND, OCN, etc.) in this case?
The constraint is not a single node but a single thread per core (./xmlchange NTHRDS=1), i.e. no OpenMP threading. You can run on multiple nodes with GEOS-Chem chemistry enabled.
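For reference, the layout in the original post gives every component its own block of MPI tasks. Below is a minimal sketch of a sanity check for a layout of that kind, assuming you want the blocks to cover the task range exactly once. CESM also allows components to share tasks, so an "overlap" reported here is not necessarily an error. `check_pe_layout` is a hypothetical helper written for this illustration.

```shell
# Hypothetical helper: check_pe_layout TOTAL NAME:NTASKS:ROOTPE ...
# Prints "ok" when the blocks tile tasks 0..TOTAL-1 exactly once,
# otherwise a short description of the first problem found.
check_pe_layout() {
  local total=$1
  shift
  printf '%s\n' "$@" | sort -t: -k3,3n | awk -F: -v total="$total" '
    BEGIN { next_free = 0; status = "ok" }
    status == "ok" {
      if ($3 + 0 < next_free) {
        status = "overlap at " $1
      } else if ($3 + 0 > next_free) {
        status = "gap before " $1
      }
      next_free = $3 + $2      # first task after this block
    }
    END {
      if (status == "ok" && next_free != total + 0)
        status = "covers " next_free " of " total " tasks"
      print status
    }'
}

# The NTASKS/ROOTPE values from the original post, on 128 tasks:
check_pe_layout 128 CPL:4:0 ATM:108:4 LND:8:112 ICE:2:120 OCN:2:122 \
  ROF:1:124 GLC:1:125 WAV:1:126 ESP:1:127    # prints: ok
```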

2. Is there a recommended CESM version for simulations with GEOS-Chem chemistry? Also, when were CESM 2.3 and CESM3 officially released? And in general, are the official releases more stable or faster than the alpha/beta versions?
CESM2.3 was never officially released and CESM3 has yet to be officially released, so any version with GEOS-Chem chemistry is unsupported development code. However, cesm2_3_beta17 is quite old. cesm3_0_beta07 and later include GEOS-Chem version 14.5.

3. For GEOS-Chem-enabled compsets (e.g., FC2000climo_GC, FC2010climo_GC, FCHIST_GC, and FCnudged_GC), could you briefly explain their typical use cases? Can I define custom compsets for specific applications?
The _GC compsets are analogous to the CAM-chem compsets with the same name, only with GEOS-Chem chemistry and HEMCO emissions.
You can define custom compsets using GEOS-Chem chemistry by using the `%GEOSCHEM%HEMCO` modifier on the CAM portion in the compset longname. See config_compsets.xml.

Regarding the issue in your post, I would suspect the initial conditions. This model version is quite old and uses GEOS-Chem 14.1.2. One common recommendation is to provide chemical initial conditions from the same version of GEOS-Chem (i.e., GEOS-Chem 14.1). You may have to regrid output from a GEOS-Chem offline run (e.g., 2x2.5) to the f19 grid (1.9x2.5). Unfortunately, the version you are using is several years old at this point, so it may be difficult to diagnose any specific issues.
 

Liyl2

Yali Li
New Member
Hi Haipeng,
Thanks a lot for your reply and suggestions! The Rosenbrock solver issue was indeed caused by the initial conditions. I replaced the GC-matched species in the FCHIST_GC restart file (f.e20.FC2010.f19_f19.144.GC_vbsext.001.cam.i.0007-01-01-00000.nc) with those from a standalone GEOS-Chem restart file and then remapped them. The model no longer reports errors and is now running stably.

Additionally, I’m curious about how the FCHIST_GC compset simulates the state at a specified date (e.g., RUN_STARTDATE=2010-01-01). The timestamp in this initial file is 1994, so how can it simulate 2010? Is it an equilibrium initial concentration field that is then rapidly adjusted to the 2010 atmospheric state by the 2010 forcing (i.e., the spin-up process)? If so, does that mean the year of the initial concentration field doesn’t matter?
If I want to simulate the period from 2010 to 2020, which year’s initial species concentrations from the standalone GC should I use? The earliest GC restart file I could find is from 2010 (maybe I could ask the GEOS-Chem team for an earlier one). Additionally, will this change disrupt the equilibrium state of the original restart file? Will I need to spin up for a longer period? Thank you in advance for your help.
 

Liyl2

Yali Li
New Member
CESM2.3 was not officially released and CESM3 is yet to be officially released, so any version with GEOS-Chem chemistry is unsupported development code. However, cesm2.3 beta17 is quite old. cesm3_0_beta07 and onwards contain GEOS-Chem version 14.5.
I also tried cesm3_0_beta07, but after submitting the job, I encountered the error “ncd_pio_openfileERROR: Failed to open file.” I resolved this by setting ./xmlchange PIO_NUMTASKS=1,PIO_STRIDE=1,PIO_ROOT=1. However, after running just two more lines of code, a segmentation fault occurred.



I didn't know how to solve this problem, so I ended up sticking with the old 2.3 version. Also, sorry, what I actually wanted to ask was when CESM 2.3 and CESM 3.0 will be released. Thank you so much for your explanations regarding the other issues.
 

hplin

Haipeng Lin
Moderator
Staff member
That spin-up file, I think, was spun up for 6 years from 2010 for the 2016 simulations in Fritz et al., so I am not sure why the timestamp in the file shows 1994.

Note that the spin-up advice below applies only to atmospheric chemistry simulations with nudged or prescribed meteorology, not to climate runs.

I believe `f.e20.FC2010.f19_f19.144.GC_vbsext.001.cam.i.0007-01-01-00000.nc` is an initial conditions file for the particular version of GEOS-Chem used in the Fritz et al. paper, with chemical initial conditions suitable for simulations starting at 2016-01-01. If you want to start at a different time, you would have to spin up the model. For ozone I would recommend at least 6 months, or 12 months to be safe (which might also be easier if you want to start in January).
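As a small worked example of that recommendation: to have usable results from 2010-01-01 with 12 months of spin-up, the run would start on 2009-01-01. Below is a sketch of the date arithmetic in bash, assuming a YYYY-MM-DD date whose day exists in every month (e.g. 01). `spinup_start` is a hypothetical helper; RUN_STARTDATE is the variable already used in the original post.

```shell
# Hypothetical helper (bash): spinup_start YYYY-MM-DD MONTHS
# Prints the date MONTHS calendar months earlier; the day is copied unchanged.
spinup_start() {
  local y=${1%%-*} rest=${1#*-}
  local m=${rest%%-*} d=${rest#*-}
  # Count months since year 0; 10# avoids octal parsing of "08" and "09".
  local total=$(( 10#$y * 12 + 10#$m - 1 - $2 ))
  printf '%04d-%02d-%s\n' $(( total / 12 )) $(( total % 12 + 1 )) "$d"
}

spinup_start 2010-01-01 12   # prints 2009-01-01
spinup_start 2010-01-01 6    # prints 2009-07-01

# Possible use inside a case directory:
# ./xmlchange RUN_STARTDATE="$(spinup_start 2010-01-01 12)"
```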

If you are running for a given period of time, you can always run GEOS-Chem standalone (e.g., at 2x2.5) up to the time you want to start minus the spin-up time needed to accommodate the difference in resolution (I usually allow two weeks), then regrid the chemical species concentrations (noting any unit differences) into the CAM ncdata initial conditions file.
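On the unit differences: a common case is converting a molar mixing ratio to a mass mixing ratio, mmr = vmr * M_species / M_air. The sketch below assumes the source field is in mol/mol dry air and the target expects kg/kg dry air; check the units attributes in both files before relying on that. `vmr_to_mmr` is a hypothetical helper, and the default molar mass of dry air (28.9644 g/mol) is an assumed value; the constant differs slightly between models, so take it from the code you run.

```shell
# Hypothetical helper: vmr_to_mmr VMR MW_SPECIES [MW_AIR]
# VMR in mol/mol dry air, molar masses in g/mol; prints kg/kg dry air.
vmr_to_mmr() {
  awk -v vmr="$1" -v mw="$2" -v air="${3:-28.9644}" \
    'BEGIN { printf "%.4e\n", vmr * mw / air }'
}

# Example: 40 ppbv of ozone (molar mass about 48 g/mol)
vmr_to_mmr 40e-9 48.0
```

In practice the same factor would be applied to the whole regridded field with whatever netCDF tool you use for the remapping; the function only illustrates the arithmetic.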

I didn't know how to solve this problem, so I ended up sticking with the old 2.3 version. Also, sorry—what I actually wanted to ask was when CESM 2.3 and CESM 3.0 will be released? And thank you so much for your explanations regarding the other issues.
There will not be a CESM2.3 release. CESM3.0 is in the final stages of development, but I am not aware of a release date.
 

Liyl2

Yali Li
New Member
Thank you very much for your reply and your advice on spin-up.

As I understand it, if my research focuses on atmospheric chemistry modeling (possibly covering only a few years) rather than climate, similar to your paper comparing tropospheric chemistry in CESM-GC and standalone GC, would it be more appropriate to use a nudged simulation?

Also, unfortunately, when I previously ran the 2010 FCHIST_GC using my own initialization files, it was indeed stable at the very beginning, but after 10 months, I still encountered the error: “Forced exit from Rosenbrock due to step size too small.” Additionally, when running the 2010 FCnudged_GC, I encountered this issue on the very first day. So I’m wondering if this might not just be an initial condition issue, but also related to other inputs or settings in CESM-GC. Do you have any suggestions on what I should check next?
 

hplin

Haipeng Lin
Moderator
Staff member
I think at 10 months the initial conditions would no longer matter; something else in the model is unstable. I would check the outputs to see whether the species distributions look reasonable. It could be an emissions issue, or rates that are abnormally high somewhere. Check the usual suspects (NO/NO2, OH, ozone); I would also check sulfate, which we have had issues with recently. If some grid points show very high values, that could be the reason. Unfortunately I don't have a clear answer on how to solve these issues.
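A rough way to screen for such points, assuming you have already dumped one species field to a text file with one value per line and no blank lines (using a netCDF tool of your choice), is to flag values far above the median. The factor is arbitrary and `flag_hotspots` is a hypothetical helper, so treat this as a first-pass screen rather than a diagnosis.

```shell
# Hypothetical helper: flag_hotspots FACTOR FILE
# Prints "<line number> <value>" for every value above FACTOR * median.
flag_hotspots() {
  local factor=$1 file=$2 n median
  n=$(grep -c . "$file")
  [ "$n" -gt 0 ] || return 0
  # Lower median of the sorted values; sort -g understands 1.5e-9 notation.
  median=$(sort -g "$file" | awk -v n="$n" 'NR == int((n + 1) / 2) { print $1 }')
  awk -v m="$median" -v f="$factor" '$1 + 0 > f * m { print NR, $1 }' "$file"
}

# Example (file name assumed): flag_hotspots 100 so4_surface.txt
```

The line numbers map back to grid indices only if you know the order in which the values were written. If the median is zero, every positive value is flagged, so the screen is not useful for fields that are mostly empty.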
 

Liyl2

Yali Li
New Member
OK, thanks for the suggestion. I’ll take a look at those variables. All the best!
 