Describe your problem or question:
Hi CESM team,
I am Yali, a PhD student at the Southern University of Science and Technology (SUSTech) in China.
I am new to CESM and am currently running CESM with GEOS-Chem chemistry (cesm2_3_beta17). I have encountered a Rosenbrock integrator convergence issue. The model does not stop immediately, but the simulation becomes extremely slow and the runtime keeps increasing as it proceeds; the run eventually stops after about 2–3 simulated months. For reference, on a single node (128 cores), a default FHIST run with CAM-chem takes about 2 hours, while the same configuration with GEOS-Chem chemistry (FCHIST_GC) takes about 4.5 hours and shows the instability described above.
I would appreciate any suggestions on how to diagnose or fix this issue.
---
In addition, I have a few related questions:
1. Since my research requires GEOS-Chem chemistry, I understand that the model may need to run on a single node (as suggested in Fritz et al. 2022). Do you have recommendations for optimal processor layout (NTASKS/ROOTPE) for different components (CAM, LND, OCN, etc.) in this case?
2. Is there a recommended CESM version for simulations with GEOS-Chem chemistry? Also, when were CESM 2.3 and CESM3 officially released? And in general, are the official releases more stable or faster than the alpha/beta versions?
3. For GEOS-Chem-enabled compsets (e.g., FC2000climo_GC, FC2010climo_GC, FCHIST_GC, and FCnudged_GC), could you briefly explain their typical use cases? Can I define custom compsets for specific applications?
Any suggestions would be greatly appreciated! Thank you very much for your help!
Have you made any changes to files in the source tree?
Following the known issues documented on the "GEOS-Chem in CESM" page (GEOS-Chem in CESM - Geos-chem), I applied two code modifications:
(1) GEOS-Chem landmap issue: The call to GC_Error in olson_landmap_mod.F90 was commented out to bypass the "State_Met%IUSE is zero" error.
(2) HEMCO vertical grid issue: Pointers BXHEIGHT, PEDGE, and ZSFC in hemco_interface.F90 were initialized to NULL() to prevent errors in CalcVertGrid.
Describe every step you took leading up to the problem:
./create_newcase \
--case /xxx/case.FCHIST_GC_2.3beta_t1 \
--compset FCHIST_GC \
--res f19_f19_mg17 \
--run-unsupported \
--mach sustech-qiming-release \
--output-root /xxx/
./xmlchange NTASKS_CPL=4,NTASKS_ATM=108,NTASKS_LND=8,NTASKS_ICE=2,NTASKS_OCN=2,NTASKS_ROF=1,NTASKS_GLC=1,NTASKS_WAV=1,NTASKS_ESP=1
./xmlchange ROOTPE_CPL=0,ROOTPE_ATM=4,ROOTPE_LND=112,ROOTPE_ICE=120,ROOTPE_OCN=122,ROOTPE_ROF=124,ROOTPE_GLC=125,ROOTPE_WAV=126,ROOTPE_ESP=127
./xmlchange MAX_TASKS_PER_NODE=128,MAX_MPITASKS_PER_NODE=128
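For reference, the NTASKS/ROOTPE values above are meant to place each component on its own contiguous block of MPI ranks, with the blocks together filling the 128-core node. The following is a minimal bash sketch that checks this arithmetic. It is illustrative only and not part of the CESM tooling; the values are copied by hand from the xmlchange commands above, and the variable names are hypothetical.

```shell
#!/bin/bash
# Illustrative check only: values copied from the xmlchange commands above.
# Each entry is NAME:ROOTPE:NTASKS, listed in order of increasing ROOTPE.
layout="CPL:0:4 ATM:4:108 LND:112:8 ICE:120:2 OCN:122:2 ROF:124:1 GLC:125:1 WAV:126:1 ESP:127:1"

next=0        # first MPI rank not yet assigned to a component
status=ok
for entry in $layout; do
  IFS=: read -r name root ntasks <<< "$entry"
  # A gap or an overlap shows up as ROOTPE differing from the running total.
  if [ "$root" -ne "$next" ]; then
    status="mismatch at $name"
    break
  fi
  next=$((root + ntasks))
done
echo "layout: $status, ranks used: $next"
```

With the values above, every component starts exactly where the previous one ends and the total comes to 128 ranks, matching MAX_MPITASKS_PER_NODE.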
./xmlchange STOP_OPTION=nyears,STOP_N=1
./xmlchange RUN_STARTDATE=2000-01-01
./case.setup
vi HEMCO_Config.rc
ROOT: /xxx/cesm_input/inputdata/atm/cam/geoschem/emis/ExtData/HEMCO
METDIR: not_used
--> OFFLINE_SOILNOX : false # 1980-2021
vi geoschem_config.yml
start_date: 20000101 000000
end_date: 20010101 010000
photolysis:
input_dir: /xxx/cesm_input/inputdata/atm/cam/geoschem/ExtData/CHEM_INPUTS/FAST_JX/v2020-10/
vi user_nl_cam
! Monthly mean output
avgflag_pertape = 'A'
nhtfrq = 0
mfilt = 1
fincl1 = 'T:A', 'TS:A', 'PS:A', 'PSL:A', 'PRECT:A',
'FLNT:A', 'FSNT:A', 'LWCF:A', 'SWCF:A',
'O3:A', 'SST:A','m_CH4_c:A','cb_CH4_c:A'
inithist = 'MONTHLY'
history_amwg = .true.
./case.setup --reset
./case.build
bsub < run_cesm.job (The job file is as follows.)
#!/bin/bash
#BSUB -J CESM_test # Job name
#BSUB -q v3-6t # SELECT QUEUE HERE (e.g., 33, 38, v3-64, 1t75c)
#BSUB -n 128 # Total number of cores
#BSUB -R "span[ptile=128]" # Cores per node (for single node: set equal to -n)
#BSUB -o %J.out # Output file
#BSUB -e %J.err # Error file
source ~/.bashrc
cd /xxx/case.FCHIST_GC_2.3beta_t1/run/
rm -rf *.log.*
export GFORTRAN_ERROR_DUMPCORE=0
export SIGFPE_DUMPCORE=0
mpirun -np 128 /xxx/case.FCHIST_GC_2.3beta_t1/bld/cesm.exe >> cesm.log. 2>&1
I am using a new machine, "sustech-qiming-release", with the GCC 11.2.0 compiler.
I have attached the relevant machine-porting files, including ~/.cime/config_compilers.xml, config_batch.xml, config_machines.xml, and config_pes.xml. The error occurs at the run stage, not the build stage. I have also attached all available run log files, including cesm.log, cpl.log, and the component log files. The cesm.log and atm.log files are quite large, so I have removed some repetitive parts.
Attachments
- rof.log.260402-165826.txt (41.9 KB)
- atm.log.260402-165826.txt (605.1 KB)
- cesm.log.txt (294.1 KB)
- config_batch.xml.txt (4.5 KB)
- config_compilers.xml.txt (1 KB)
- config_machines.xml.txt (1.1 KB)
- glc.log.260402-165826.txt (16.2 KB)
- ice.log.260402-165826.txt (269.4 KB)
- lnd.log.260402-165826.txt (273.5 KB)
- med.log.260402-165826.txt (238.5 KB)