CAM6.3 run on Derecho stopped/crashed; cesm.log shows forrtl: error (78): process killed (SIGTERM)

dharmendraks841

Dharmendra Kumar Singh
Member
Describe your problem or question:
Hi, could you please help me resolve an issue with a CAM6 run that stopped on Derecho? The details are as follows.
Please see the cesm and atm log files.

From cesm.log, I found one error line: dec1910.hsn.de.hpc.ucar.edu 164: forrtl: error (78): process killed (SIGTERM)



In atm.log the issue is here:

dksingh@derecho6:/glade/derecho/scratch/dksingh/control_2006/run> tail atm.log.9455308.desched1.250507-131624

2.873563218390813E-002

-----------------------------------

do_press_fix_llnl: dpress_g = 269.714613640958

do_press_fix_llnl: dpress_g = 269.714613640958

nstep, te 18281 0.21779298155554585E+10 0.21783754201189966E+10 0.24755136061853877E-01 0.98289789975735082E+05 0.22552395239472389E+03

-----------------------------------

photo_timestep_init: diagnostics

calday, last, next, dels = 16.8541666666667 1 2

2.945402298850567E-002

-----------------------------------







2025-05-07 14:30:24: model execution error

ERROR: Command: 'mpiexec --label --line-buffer -n 512 /glade/derecho/scratch/dksingh/control_2006/bld/cesm.exe >> cesm.log.$LID 2>&1 ' failed with error '' from dir '/glade/derecho/scratch/dksingh/control_2006/run'

---------------------------------------------------

2025-05-07 14:30:24: case.run error

ERROR: RUN FAIL: Command 'mpiexec --label --line-buffer -n 512 /glade/derecho/scratch/dksingh/control_2006/bld/cesm.exe >> cesm.log.$LID 2>&1 ' failed

See log file for details: /glade/derecho/scratch/dksingh/control_2006/run/cesm.log.9455308.desched1.250507-131624





Additional details

dksingh@derecho7:/glade/work/dksingh/cesm_tutorial_23/control_2006> ./xmlquery STOP_OPTION STOP_N RESUBMIT CONTINUE_RUN RUN_TYPE
 

dharmendraks841

Dharmendra Kumar Singh
Member
Another case (control_2024): this case stopped after a 12-hour run.





2025-05-07 11:48:48: case.submit starting 9455271.desched1

---------------------------------------------------

2025-05-07 11:48:48: case.submit success 9455271.desched1

---------------------------------------------------

2025-05-07 13:16:23: case.run starting 9455270.desched1

---------------------------------------------------

2025-05-07 13:16:28: model execution starting 9455270.desched1

---------------------------------------------------

2025-05-08 00:14:24: model execution error

ERROR: Command: 'mpiexec --label --line-buffer -n 256 /glade/derecho/scratch/dksingh/control_2024/bld/cesm.exe >> cesm.log.$LID 2>&1 ' failed with error '' from dir '/glade/derecho/scratch/dksingh/control_2024/run'

---------------------------------------------------

2025-05-08 00:14:24: case.run error

ERROR: RUN FAIL: Command 'mpiexec --label --line-buffer -n 256 /glade/derecho/scratch/dksingh/control_2024/bld/cesm.exe >> cesm.log.$LID 2>&1 ' failed

See log file for details: /glade/derecho/scratch/dksingh/control_2024/run/cesm.log.9455270.desched1.250507-131623



cesm.log shows this type of error:

496): application called MPI_Abort(comm=0x84000001, 1) - process 145

dec1827.hsn.de.hpc.ucar.edu 14: MPICH ERROR [Rank 14] [job id 8df6c295-53c9-48a9-b01c-cd0dca997217] [Thu May 8 00:14:20 2025] [dec1827] - Abort(1) (rank 14 in comm 496): application called MPI_Abort(comm=0x84000001, 1) - process 14

dec1831.hsn.de.hpc.ucar.edu 145:

dec1831.hsn.de.hpc.ucar.edu 147: MPICH ERROR [Rank 147] [job id 8df6c295-53c9-48a9-b01c-cd0dca997217] [Thu May 8 00:14:20 2025] [dec1831] - Abort(1) (rank 147 in comm 496): application called MPI_Abort(comm=0x84000001, 1) - process 147

dec1827.hsn.de.hpc.ucar.edu 14:

dec1831.hsn.de.hpc.ucar.edu 147:

dec1828.hsn.de.hpc.ucar.edu 65: MPICH ERROR [Rank 65] [job id 8df6c295-53c9-48a9-b01c-cd0dca997217] [Thu May 8 00:14:20 2025] [dec1828] - Abort(1) (rank 65 in comm 496): application called MPI_Abort(comm=0x84000001, 1) - process 65

dec1827.hsn.de.hpc.ucar.edu 17: MPICH ERROR [Rank 17] [job id 8df6c295-53c9-48a9-b01c-cd0dca997217] [Thu May 8 00:14:20 2025] [dec1827] - Abort(1) (rank 17 in comm 496): application called MPI_Abort(comm=0x84000001, 1) - process 17

dec1831.hsn.de.hpc.ucar.edu 151: MPICH ERROR [Rank 151] [job id 8df6c295-53c9-48a9-b01c-cd0dca997217] [Thu May 8 00:14:20 2025] [dec1831] - Abort(1) (rank 151 in comm

/glade/derecho/scratch/dksingh/control_2024/run/cesm.log.9455270.desched1.250507-131623 lines 106312-106320/111563 95%
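
Because the job is launched with `mpiexec --label`, each cesm.log line starts with the host name and MPI rank, so the log can be summarized by rank instead of paged through by hand. The following is only a minimal sketch under that assumption: the function name `ranks_reporting_errors` and the list of error markers are illustrative choices, not part of CIME, and the two sample lines are copied from the two different cases quoted in this thread.

```python
import re

# Lines written under `mpiexec --label` start with "<hostname> <rank>: ".
LABEL = re.compile(r"^(?P<host>\S+)\s+(?P<rank>\d+):\s?(?P<msg>.*)$")

# Substrings taken from the messages quoted in this thread (an assumption,
# not an exhaustive list of possible error strings).
ERROR_MARKERS = ("forrtl: error", "MPICH ERROR", "MPI_Abort")


def ranks_reporting_errors(lines):
    """Map each rank to (host, first error message), in order of appearance."""
    found = {}
    for line in lines:
        match = LABEL.match(line.strip())
        if not match:
            continue  # wrapped continuation lines or unlabeled output
        msg = match.group("msg")
        if any(marker in msg for marker in ERROR_MARKERS):
            found.setdefault(int(match.group("rank")), (match.group("host"), msg))
    return found


# Sample lines copied from the control_2006 and control_2024 logs above.
sample = [
    "dec1910.hsn.de.hpc.ucar.edu 164: forrtl: error (78): process killed (SIGTERM)",
    "dec1827.hsn.de.hpc.ucar.edu 14: MPICH ERROR [Rank 14] "
    "[job id 8df6c295-53c9-48a9-b01c-cd0dca997217] [Thu May 8 00:14:20 2025] "
    "[dec1827] - Abort(1) (rank 14 in comm 49",
    "dec1827.hsn.de.hpc.ucar.edu 14:",
    "6): application called MPI_Abort(comm=0x84000001, 1) - process 14",
]

for rank, (host, msg) in ranks_reporting_errors(sample).items():
    print(rank, host, msg[:40])
```

To use it on a real log, pass an open file object for the cesm.log file in place of `sample`. The rank that reports an error first is often a better starting point than the many ranks that abort afterwards.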







And no clear error in atm.log

dksingh@derecho7:/glade/derecho/scratch/dksingh/control_2024/run> tail atm.log.9455270.desched1.250507-131623

INFLD_REAL_2D_2D: read field TS

READ_NEXT_METDATA: Read meteorological data

do_press_fix_llnl: dpress_g = 288.603987010986

do_press_fix_llnl: dpress_g = 288.603987010986

nstep, te 20353 0.21828164886341825E+10 0.21832587527945981E+10 0.24569596424926912E-01 0.98289657189247562E+05 0.22552395239472389E+03

-----------------------------------

photo_timestep_init: diagnostics

calday, last, next, dels = 60.0208333333333 2 3

0.484543010752688

-----------------------------------



But in the run directory I found files of this type:

PET000.ESMF_LogFile PET117.ESMF_LogFile PET234.ESMF_LogFile PET351.ESMF_LogFile PET468.ESMF_LogFile

PET001.ESMF_LogFile PET118.ESMF_LogFile PET235.ESMF_LogFile PET352.ESMF_LogFile PET469.ESMF_LogFile

PET002.ESMF_LogFile PET119.ESMF_LogFile PET236.ESMF_LogFile PET353.ESMF_LogFile PET470.ESMF_LogFile
 

hplin

Haipeng Lin
Moderator
Staff member
Hi Dharmendra, is this 12 hours of model time or actual (wall-clock) time? The Derecho queues have a 12-hour wall-clock time limit (NCAR HPC Documentation - Queues and Charging); if the job exceeds that time or JOB_WALLCLOCK_TIME, it will be terminated by the system. If this is the case, consider splitting up your runs by using the Restart feature (Restarting a Run - CESM Tutorial).

dharmendraks841

Dharmendra Kumar Singh
Member
Hi, it was set to 12 hours by default, which I know is also the maximum wall-clock time.
Even when I changed the time, the same error was reproduced.
 

hplin

Haipeng Lin
Moderator
Staff member
Hi, 12 hours is the maximum time. If the run cannot complete in 12 hours, it will not be able to complete with less time. I would suggest using restart files and limiting the length of each run segment so that it finishes in less than 12 hours, to see if it can complete successfully.
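
To compare a run against the wall-clock limit, the elapsed time can be read from the CaseStatus timestamps. This is a small sketch only: the helper name `elapsed_between` is illustrative, it assumes each CaseStatus line starts with a `YYYY-MM-DD HH:MM:SS` stamp as in the excerpts in this thread, and the two lines are copied from the control_2024 excerpt.

```python
from datetime import datetime, timedelta

STAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def elapsed_between(start_line, end_line):
    """Return the time between two CaseStatus lines.

    Assumes each line begins with a 19-character 'YYYY-MM-DD HH:MM:SS' stamp.
    """
    start = datetime.strptime(start_line[:19], STAMP_FORMAT)
    end = datetime.strptime(end_line[:19], STAMP_FORMAT)
    return end - start


# Lines copied from the control_2024 CaseStatus excerpt in this thread.
started = "2025-05-07 13:16:28: model execution starting 9455270.desched1"
failed = "2025-05-08 00:14:24: model execution error"

elapsed = elapsed_between(started, failed)
limit = timedelta(hours=12)  # the queue limit discussed in this thread

print("elapsed:", elapsed)
print("under the limit:", elapsed < limit)
```

The same comparison can be made against the value reported by `./xmlquery JOB_WALLCLOCK_TIME` instead of the fixed 12 hours.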
 

dharmendraks841

Dharmendra Kumar Singh
Member
Hi Haipeng, thanks.
Since I set RUN_TYPE=branch, the run starts from the previous year's restart files; in this case, the 2024 run uses the 2023 restart files.


Results in group run_begin_stop_restart


STOP_OPTION: nmonths

STOP_N: 2

RESUBMIT: 5

CONTINUE_RUN: FALSE

RUN_TYPE: branch


dksingh@derecho1:/glade/work/dksingh/cesm_tutorial_23/control_2024> ./xmlquery REST_OPTION REST_N RESUBMIT CONTINUE_RUN RUN_TYPE

Results in group run_begin_stop_restart

REST_OPTION: nmonths

REST_N: 2

RESUBMIT: 5

CONTINUE_RUN: FALSE

RUN_TYPE: branch

I will run again and let you know.
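
For reference, the settings above can be turned into a total simulated length. This is a rough sketch, assuming the usual CIME behavior that the initial submission runs one STOP_N-long segment and each of the RESUBMIT resubmissions runs one more; the function name `total_simulated` is illustrative.

```python
def total_simulated(stop_n, resubmit):
    """Initial segment plus `resubmit` follow-on segments, each `stop_n` long."""
    return stop_n * (resubmit + 1)


# Values reported by xmlquery above (STOP_OPTION and REST_OPTION are nmonths).
stop_n = 2
resubmit = 5
rest_n = 2

print("segments:", resubmit + 1)
print("total months:", total_simulated(stop_n, resubmit))
# A restart file is needed at the end of each segment so the next one can start.
print("restart written at segment end:", stop_n % rest_n == 0)
```

With these values each batch job only has to finish 2 simulated months within its wall-clock limit, rather than the whole simulation.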
 

dharmendraks841

Dharmendra Kumar Singh
Member
Hi everyone,
I ran the case as a branch run. The run and the archiving completed successfully for 2 months, but there are no output files in "hist", even though model execution and all other steps succeeded, as shown below:


dksingh@derecho1:/glade/work/dksingh/cesm_tutorial_23/metb_control_2024> ./xmlquery JOB_WALLCLOCK_TIME





Results in group case.run


JOB_WALLCLOCK_TIME: 04:00:00





Results in group case.st_archive


JOB_WALLCLOCK_TIME: 00:20:00


dksingh@derecho1:/glade/work/dksingh/cesm_tutorial_23/metb_control_2024> ./xmlquery CAM_CONFIG_OPTS


CAM_CONFIG_OPTS: -phys cam6


dksingh@derecho1:/glade/work/dksingh/cesm_tutorial_23/metb_control_2024> ./xmlchange --append CAM_CONFIG_OPTS='-offline_dyn'


dksingh@derecho1:/glade/work/dksingh/cesm_tutorial_23/metb_control_2024> ./xmlquery CAM_CONFIG_OPTS


CAM_CONFIG_OPTS: -phys cam6 -offline_dyn


dksingh@derecho1:/glade/work/dksingh/cesm_tutorial_23/metb_control_2024> ./xmlquery RUN_STARTDATE


RUN_STARTDATE: 0001-01-01


dksingh@derecho1:/glade/work/dksingh/cesm_tutorial_23/metb_control_2024> ./xmlchange RUN_TYPE=branch


dksingh@derecho1:/glade/work/dksingh/cesm_tutorial_23/metb_control_2024> ./xmlchange RUN_REFCASE=spinup_2023_nov_dec_branch


dksingh@derecho1:/glade/work/dksingh/cesm_tutorial_23/metb_control_2024> ./xmlchange RUN_REFDATE=2024-01-01


dksingh@derecho1:/glade/work/dksingh/cesm_tutorial_23/metb_control_2024> ./xmlquery RUN_TYPE,RUN_REFCASE,RUN_REFDATE,GET_REFCASE





Results in group run_begin_stop_restart


RUN_TYPE: branch


RUN_REFCASE: spinup_2023_nov_dec_branch


RUN_REFDATE: 2024-01-01


GET_REFCASE: FALSE

dksingh@derecho1:/glade/work/dksingh/cesm_tutorial_23/metb_control_2024> ./preview_run


CASE INFO:


nodes: 4


total tasks: 512


tasks per node: 128


thread count: 1


ngpus per node: 0





BATCH INFO:


FOR JOB: case.run


ENV:


Setting Environment ESMF_RUNTIME_PROFILE=ON


Setting Environment ESMF_RUNTIME_PROFILE_OUTPUT=SUMMARY


Setting Environment FI_CXI_RX_MATCH_MODE=hybrid


Setting Environment FI_MR_CACHE_MONITOR=memhooks


Setting Environment OMP_NUM_THREADS=1


Setting Environment OMP_STACKSIZE=64M





SUBMIT CMD:


qsub -q main -l walltime=04:00:00 -A UIUC0044 -v ARGS_FOR_SCRIPT='--resubmit' .case.run





MPIRUN (job=case.run):


mpiexec --label --line-buffer -n 512 /glade/derecho/scratch/dksingh/metb_control_2024/bld/cesm.exe >> cesm.log.$LID 2>&1





FOR JOB: case.st_archive


ENV:


Setting Environment ESMF_RUNTIME_PROFILE=ON


Setting Environment ESMF_RUNTIME_PROFILE_OUTPUT=SUMMARY


Setting Environment FI_CXI_RX_MATCH_MODE=hybrid


Setting Environment FI_MR_CACHE_MONITOR=memhooks


Setting Environment OMP_NUM_THREADS=1


Setting Environment OMP_STACKSIZE=64M





SUBMIT CMD:


qsub -q main -l walltime=00:20:00 -A UIUC0044 -W depend=afterok:0 -v ARGS_FOR_SCRIPT='--resubmit' case.st_archive





dksingh@derecho1:/glade/work/dksingh/cesm_tutorial_23/metb_control_2024> ./xmlquery INFO_DBUG


INFO_DBUG: 1


dksingh@derecho1:/glade/work/dksingh/cesm_tutorial_23/metb_control_2024> ./xmlchange INFO_DBUG =2


usage: xmlchange [-h] [-d] [-v] [-s] [--caseroot CASEROOT] [--append] [--subgroup SUBGROUP] [--id ID] [--val VAL] [--file FILE] [--delimiter DELIMITER] [--dryrun]


[--noecho] [-f] [-N] [-loglevel LOGLEVEL]


[listofsettings]


xmlchange: error: unrecognized arguments: =2
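
The "unrecognized arguments: =2" message follows from how the shell splits the command line at the space before `=2`. The sketch below uses Python's `shlex` only to illustrate that splitting; it does not call xmlchange, and the helper name `shell_words` is illustrative.

```python
import shlex


def shell_words(command):
    """Split a command line into words the way a POSIX shell would."""
    return shlex.split(command)


# As typed above: the space makes "=2" a separate argument.
print(shell_words("./xmlchange INFO_DBUG =2"))

# Without the space the setting stays in one argument, in the same form as
# the "./xmlchange DEBUG=TRUE" call that appears later in this session.
print(shell_words("./xmlchange INFO_DBUG=2"))
```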


dksingh@derecho1:/glade/work/dksingh/cesm_tutorial_23/metb_control_2024> ./xmlquery DOUT_S


DOUT_S: TRUE


dksingh@derecho1:/glade/work/dksingh/cesm_tutorial_23/metb_control_2024> ./xmlchange DEBUG=TRUE


dksingh@derecho1:/glade/work/dksingh/cesm_tutorial_23/metb_control_2024> ./xmlquery DEBUG


DEBUG: TRUE.

2025-05-09 01:23:11: case.run starting 9474428.desched1


---------------------------------------------------


2025-05-09 01:23:15: model execution starting 9474428.desched1


---------------------------------------------------


2025-05-09 01:47:58: model execution success 9474428.desched1


---------------------------------------------------


2025-05-09 01:47:58: case.run success 9474428.desched1


---------------------------------------------------


2025-05-09 01:49:33: st_archive starting 9474429.desched1


---------------------------------------------------


2025-05-09 01:49:33: st_archive success 9474429.desched1.


dksingh@derecho7:/glade/derecho/scratch/dksingh/metb_control_2024/run> ls


atm_in init_generated_files rpointer.ocn spinup_2023_nov_dec_branch.clm2.rh0.2024-01-01-00000.nc


CASEROOT lnd_in rpointer.rof spinup_2023_nov_dec_branch.cpl.r.2024-01-01-00000.nc


docn_in mosart_in spinup_2023_nov_dec_branch.cam.h0.2023-12.nc spinup_2023_nov_dec_branch.docn.r.2024-01-01-00000.nc


docn.streams.xml nuopc.runconfig spinup_2023_nov_dec_branch.cam.i.2024-01-01-00000.nc spinup_2023_nov_dec_branch.mosart.h0.2023-12.nc


drv_flds_in nuopc.runseq spinup_2023_nov_dec_branch.cam.r.2024-01-01-00000.nc spinup_2023_nov_dec_branch.mosart.r.2024-01-01-00000.nc


drv_in rpointer.atm spinup_2023_nov_dec_branch.cam.rs.2024-01-01-00000.nc spinup_2023_nov_dec_branch.mosart.rh0.2024-01-01-00000.nc


ESMF_Profile.summary rpointer.cpl spinup_2023_nov_dec_branch.cice.r.2024-01-01-00000.nc timing


fd.yaml rpointer.ice spinup_2023_nov_dec_branch.clm2.h0.2023-12.nc


ice_in rpointer.lnd spinup_2023_nov_dec_branch.clm2.r.2024-01-01-00000.nc

dksingh@derecho7:/glade/derecho/scratch/dksingh/archive/metb_control_2024> ls


atm cpl esp ice lnd logs ocn rof


dksingh@derecho7:/glade/derecho/scratch/dksingh/archive/metb_control_2024> cd atm/


dksingh@derecho7:/glade/derecho/scratch/dksingh/archive/metb_control_2024/atm> ls


hist


dksingh@derecho7:/glade/derecho/scratch/dksingh/archive/metb_control_2024/atm> cd hist/


dksingh@derecho7:/glade/derecho/scratch/dksingh/archive/metb_control_2024/atm/hist> ls


dksingh@derecho7:/glade/derecho/scratch/dksingh/archive/metb_control_2024/atm/hist> ls -lrt


total 0
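
One quick check is whether any history file in a directory listing carries the name of the new case, as opposed to the reference case. The sketch below assumes the `<case>.<component>.<tape>.<date>.nc` naming pattern seen in the listing above; the function name `history_files` is illustrative, and the sample list is a subset of the run directory listing.

```python
import re


def history_files(filenames, case, component="cam", tape="h0"):
    """Return the names that look like <case>.<component>.<tape>.<date>.nc."""
    pattern = re.compile(
        rf"^{re.escape(case)}\.{re.escape(component)}\.{re.escape(tape)}\..+\.nc$"
    )
    return [name for name in filenames if pattern.match(name)]


# A subset of the run directory listing shown above.
run_dir_listing = [
    "atm_in",
    "rpointer.atm",
    "spinup_2023_nov_dec_branch.cam.h0.2023-12.nc",
    "spinup_2023_nov_dec_branch.cam.r.2024-01-01-00000.nc",
    "spinup_2023_nov_dec_branch.clm2.h0.2023-12.nc",
]

print("new case:", history_files(run_dir_listing, "metb_control_2024"))
print("reference case:", history_files(run_dir_listing, "spinup_2023_nov_dec_branch"))
```

For a real directory, `os.listdir(path)` can supply the file names in place of the sample list.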
 

hplin

Haipeng Lin
Moderator
Staff member
Hi, there is a .h0. file in the run directory -- were you expecting other history tapes? What is the user_nl_cam configuration of the history files? Thanks!
dksingh@derecho7:/glade/derecho/scratch/dksingh/metb_control_2024/run> ls


atm_in init_generated_files rpointer.ocn spinup_2023_nov_dec_branch.clm2.rh0.2024-01-01-00000.nc


CASEROOT lnd_in rpointer.rof spinup_2023_nov_dec_branch.cpl.r.2024-01-01-00000.nc


docn_in mosart_in spinup_2023_nov_dec_branch.cam.h0.2023-12.nc
 

dharmendraks841

Dharmendra Kumar Singh
Member
For more detail, here is the case directory:

dksingh@derecho7:/glade/work/dksingh/cesm_tutorial_23/metb_control_2024> ls -lrt


total 299


-rw-r--r-- 1 dksingh ncar 1276 Sep 28 2023 Macros.cmake


-rw-r--r-- 1 dksingh ncar 1497 Sep 28 2023 Depends.intel


drwxr-xr-x 2 dksingh ncar 16384 Sep 28 2023 cmake_macros


-rw-r--r-- 1 dksingh ncar 1484 Sep 28 2023 user_nl_docn_streams


-rw-r--r-- 1 dksingh ncar 932 Sep 28 2023 user_nl_docn


-rw-r--r-- 1 dksingh ncar 367 Sep 28 2023 user_nl_cice


-rw-r--r-- 1 dksingh ncar 1344 Sep 28 2023 user_nl_clm


-rw-r--r-- 1 dksingh ncar 848 Sep 28 2023 user_nl_cpl


-rw-r--r-- 1 dksingh ncar 414 Sep 28 2023 user_nl_mosart


-rw-r--r-- 1 dksingh ncar 2658 May 8 21:27 README.case


drwxr-xr-x 10 dksingh ncar 4096 May 8 21:27 SourceMods


lrwxrwxrwx 1 dksingh ncar 61 May 8 21:27 case.setup -> /glade/work/dksingh/cam6_3_128/CAM/cime/CIME/Tools/case.setup


lrwxrwxrwx 1 dksingh ncar 61 May 8 21:27 case.build -> /glade/work/dksingh/cam6_3_128/CAM/cime/CIME/Tools/case.build


lrwxrwxrwx 1 dksingh ncar 62 May 8 21:27 case.submit -> /glade/work/dksingh/cam6_3_128/CAM/cime/CIME/Tools/case.submit


lrwxrwxrwx 1 dksingh ncar 63 May 8 21:27 case.qstatus -> /glade/work/dksingh/cam6_3_128/CAM/cime/CIME/Tools/case.qstatus


lrwxrwxrwx 1 dksingh ncar 72 May 8 21:27 case.cmpgen_namelists -> /glade/work/dksingh/cam6_3_128/CAM/cime/CIME/Tools/case.cmpgen_namelists


lrwxrwxrwx 1 dksingh ncar 68 May 8 21:27 preview_namelists -> /glade/work/dksingh/cam6_3_128/CAM/cime/CIME/Tools/preview_namelists


lrwxrwxrwx 1 dksingh ncar 62 May 8 21:27 preview_run -> /glade/work/dksingh/cam6_3_128/CAM/cime/CIME/Tools/preview_run


lrwxrwxrwx 1 dksingh ncar 67 May 8 21:27 check_input_data -> /glade/work/dksingh/cam6_3_128/CAM/cime/CIME/Tools/check_input_data


lrwxrwxrwx 1 dksingh ncar 61 May 8 21:27 check_case -> /glade/work/dksingh/cam6_3_128/CAM/cime/CIME/Tools/check_case


lrwxrwxrwx 1 dksingh ncar 60 May 8 21:27 xmlchange -> /glade/work/dksingh/cam6_3_128/CAM/cime/CIME/Tools/xmlchange


lrwxrwxrwx 1 dksingh ncar 59 May 8 21:27 xmlquery -> /glade/work/dksingh/cam6_3_128/CAM/cime/CIME/Tools/xmlquery


lrwxrwxrwx 1 dksingh ncar 59 May 8 21:27 pelayout -> /glade/work/dksingh/cam6_3_128/CAM/cime/CIME/Tools/pelayout


drwxr-xr-x 2 dksingh ncar 4096 May 8 21:27 Tools


lrwxrwxrwx 1 dksingh ncar 67 May 8 21:27 archive_metadata -> /glade/work/dksingh/cam6_3_128/CAM/cime/CIME/Tools/archive_metadata


-rw-r--r-- 1 dksingh ncar 19699 May 8 21:27 env_case.xml


-rw-r--r-- 1 dksingh ncar 1998 May 8 21:27 env_batch.xml


-rw-r--r-- 1 dksingh ncar 3905 May 8 21:27 env_workflow.xml


-rw-r--r-- 1 dksingh ncar 8219 May 8 21:27 env_archive.xml


-rw-r--r-- 1 dksingh ncar 7067 May 8 22:33 env_mach_pes.xml


-rw-r--r-- 1 dksingh ncar 608 May 8 22:48 user_nl_cam


-rw-r--r-- 1 dksingh ncar 3831 May 8 22:49 env_mach_specific.xml


-rw-r--r-- 1 dksingh ncar 40550 May 8 22:49 software_environment.txt


-rw-r--r-- 1 dksingh ncar 1721 May 8 22:54 Macros.make


-rw-r--r-- 1 dksingh ncar 15614 May 8 23:00 env_build.xml


-rwxr-xr-x 1 dksingh ncar 4387 May 8 23:06 case.st_archive


drwxr-xr-x 2 dksingh ncar 4096 May 8 23:06 LockedFiles


drwxr-xr-x 8 dksingh ncar 4096 May 9 01:23 Buildconf


drwxr-xr-x 2 dksingh ncar 4096 May 9 01:23 CaseDocs


drwxr-xr-x 2 dksingh ncar 4096 May 9 01:47 timing


-rw-r--r-- 1 dksingh ncar 5919 May 9 01:47 run.metb_control_2024.o9474428


-rw-r--r-- 1 dksingh ncar 2966 May 9 01:49 st_archive.metb_control_2024.o9474429


-rw-r--r-- 1 dksingh ncar 62296 May 9 10:11 env_run.xml


-rw-r--r-- 1 dksingh ncar 786 May 9 10:12 replay.sh


-rw-r--r-- 1 dksingh ncar 3993 May 9 10:12 CaseStatus
 

dharmendraks841

Dharmendra Kumar Singh
Member
Hi, there is a .h0. file in the run directory -- were you expecting other history tapes? What is the user_nl_cam configuration of the history files?
I only need monthly files for January and February 2024. No other file type is required. Monthly output is the default, and I did not specify any other file type in my user_nl_cam.
 