
CAM6.3 model run on Derecho stopped/crashed, with "forrtl: error (78): process killed (SIGTERM)" in the cesm.log file

dharmendraks841

Dharmendra Kumar Singh
Member
Please fill in all relevant information below, deleting the red text after you have read it.

Before submitting a help request, please check to see if your question is already answered:
- Search the forums for similar issues
- Check the CIME troubleshooting guide to see if any suggestions there solve your problem
- Check any other relevant CESM documentation



What version of the code are you using?
- CESM staff members will mainly provide answers for supported model versions, as outlined in the CESM support policy, and can only provide limited help for versions that are no longer supported. You may ask questions about unsupported versions, but may need to rely on community answers.
- For CESM2.1.2 onwards, run the script ./describe_version from the top level of your CESM clone to find the version
- For older model versions, provide the output from running the following commands from the top level of your CESM clone:
> git describe
> ./manage_externals/checkout_externals --status --verbose



Have you made any changes to files in the source tree?
- Describe any changes (code, xml files, etc.)


Describe every step you took leading up to the problem:
- Describe every step you took, starting with the create_newcase command and including any changes you made to xml files, user_nl files, etc. Please try to reproduce the problem first using your own instructions.


If this is a port to a new machine: Please attach any files you added or changed for the machine port (e.g., config_compilers.xml, config_machines.xml, and config_batch.xml) and tell us the compiler version you are using on this machine.
Please attach any log files showing error messages or other useful information.

- If the error occurs during the build, please attach the appropriate build log file showing the compilation error message.
- If the error occurs during the run, please attach all log files from the run (cpl.log, cesm.log and all component log files).



Describe your problem or question:
Hi, could you please assist me in resolving an issue with the CAM6 model, which has stopped on Derecho? The details are as follows.
Please see the cesm and atm log files.

From cesm.log, I found this one-line error: dec1910.hsn.de.hpc.ucar.edu 164: forrtl: error (78): process killed (SIGTERM)



In atm.log the issue is here:

dksingh@derecho6:/glade/derecho/scratch/dksingh/control_2006/run> tail atm.log.9455308.desched1.250507-131624
2.873563218390813E-002
-----------------------------------
do_press_fix_llnl: dpress_g = 269.714613640958
do_press_fix_llnl: dpress_g = 269.714613640958
nstep, te 18281 0.21779298155554585E+10 0.21783754201189966E+10 0.24755136061853877E-01 0.98289789975735082E+05 0.22552395239472389E+03
-----------------------------------
photo_timestep_init: diagnostics
calday, last, next, dels = 16.8541666666667 1 2
2.945402298850567E-002
-----------------------------------







2025-05-07 14:30:24: model execution error
ERROR: Command: 'mpiexec --label --line-buffer -n 512 /glade/derecho/scratch/dksingh/control_2006/bld/cesm.exe >> cesm.log.$LID 2>&1 ' failed with error '' from dir '/glade/derecho/scratch/dksingh/control_2006/run'
---------------------------------------------------
2025-05-07 14:30:24: case.run error
ERROR: RUN FAIL: Command 'mpiexec --label --line-buffer -n 512 /glade/derecho/scratch/dksingh/control_2006/bld/cesm.exe >> cesm.log.$LID 2>&1 ' failed
See log file for details: /glade/derecho/scratch/dksingh/control_2006/run/cesm.log.9455308.desched1.250507-131624
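A SIGTERM usually indicates the job was killed externally (for example, by the scheduler at the wall-clock limit) rather than by a model error, so the more informative messages are often earlier in the log than the kill itself. A generic way to scan the log named above for the first error-like lines (standard grep; nothing here is specific to this case beyond the path already shown):

grep -n -i "forrtl\|ERROR\|abort" /glade/derecho/scratch/dksingh/control_2006/run/cesm.log.9455308.desched1.250507-131624 | head -n 20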





Additional details

dksingh@derecho7:/glade/work/dksingh/cesm_tutorial_23/control_2006> ./xmlquery STOP_OPTION STOP_N RESUBMIT CONTINUE_RUN RUN_TYPE
 

dharmendraks841

Dharmendra Kumar Singh
Member
ANOTHER CASE (control_2024): this case stopped after a 12-hour run.





2025-05-07 11:48:48: case.submit starting 9455271.desched1
---------------------------------------------------
2025-05-07 11:48:48: case.submit success 9455271.desched1
---------------------------------------------------
2025-05-07 13:16:23: case.run starting 9455270.desched1
---------------------------------------------------
2025-05-07 13:16:28: model execution starting 9455270.desched1
---------------------------------------------------
2025-05-08 00:14:24: model execution error
ERROR: Command: 'mpiexec --label --line-buffer -n 256 /glade/derecho/scratch/dksingh/control_2024/bld/cesm.exe >> cesm.log.$LID 2>&1 ' failed with error '' from dir '/glade/derecho/scratch/dksingh/control_2024/run'
---------------------------------------------------
2025-05-08 00:14:24: case.run error
ERROR: RUN FAIL: Command 'mpiexec --label --line-buffer -n 256 /glade/derecho/scratch/dksingh/control_2024/bld/cesm.exe >> cesm.log.$LID 2>&1 ' failed
See log file for details: /glade/derecho/scratch/dksingh/control_2024/run/cesm.log.9455270.desched1.250507-131623



It failed with this type of error in cesm.log:

496): application called MPI_Abort(comm=0x84000001, 1) - process 145
dec1827.hsn.de.hpc.ucar.edu 14: MPICH ERROR [Rank 14] [job id 8df6c295-53c9-48a9-b01c-cd0dca997217] [Thu May 8 00:14:20 2025] [dec1827] - Abort(1) (rank 14 in comm 496): application called MPI_Abort(comm=0x84000001, 1) - process 14
dec1831.hsn.de.hpc.ucar.edu 145:
dec1831.hsn.de.hpc.ucar.edu 147: MPICH ERROR [Rank 147] [job id 8df6c295-53c9-48a9-b01c-cd0dca997217] [Thu May 8 00:14:20 2025] [dec1831] - Abort(1) (rank 147 in comm 496): application called MPI_Abort(comm=0x84000001, 1) - process 147
dec1827.hsn.de.hpc.ucar.edu 14:
dec1831.hsn.de.hpc.ucar.edu 147:
dec1828.hsn.de.hpc.ucar.edu 65: MPICH ERROR [Rank 65] [job id 8df6c295-53c9-48a9-b01c-cd0dca997217] [Thu May 8 00:14:20 2025] [dec1828] - Abort(1) (rank 65 in comm 496): application called MPI_Abort(comm=0x84000001, 1) - process 65
dec1827.hsn.de.hpc.ucar.edu 17: MPICH ERROR [Rank 17] [job id 8df6c295-53c9-48a9-b01c-cd0dca997217] [Thu May 8 00:14:20 2025] [dec1827] - Abort(1) (rank 17 in comm 496): application called MPI_Abort(comm=0x84000001, 1) - process 17
dec1831.hsn.de.hpc.ucar.edu 151: MPICH ERROR [Rank 151] [job id 8df6c295-53c9-48a9-b01c-cd0dca997217] [Thu May 8 00:14:20 2025] [dec1831] - Abort(1) (rank 151 in comm
/glade/derecho/scratch/dksingh/control_2024/run/cesm.log.9455270.desched1.250507-131623 lines 106312-106320/111563 95%







And there is no clear error in atm.log:

dksingh@derecho7:/glade/derecho/scratch/dksingh/control_2024/run> tail atm.log.9455270.desched1.250507-131623
INFLD_REAL_2D_2D: read field TS
READ_NEXT_METDATA: Read meteorological data
do_press_fix_llnl: dpress_g = 288.603987010986
do_press_fix_llnl: dpress_g = 288.603987010986
nstep, te 20353 0.21828164886341825E+10 0.21832587527945981E+10 0.24569596424926912E-01 0.98289657189247562E+05 0.22552395239472389E+03
-----------------------------------
photo_timestep_init: diagnostics
calday, last, next, dels = 60.0208333333333 2 3
0.484543010752688
-----------------------------------



BUT in the run directory I found files of this type:
PET000.ESMF_LogFile PET117.ESMF_LogFile PET234.ESMF_LogFile PET351.ESMF_LogFile PET468.ESMF_LogFile
PET001.ESMF_LogFile PET118.ESMF_LogFile PET235.ESMF_LogFile PET352.ESMF_LogFile PET469.ESMF_LogFile
PET002.ESMF_LogFile PET119.ESMF_LogFile PET236.ESMF_LogFile PET353.ESMF_LogFile PET470.ESMF_LogFile
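The PET*.ESMF_LogFile files are per-task ESMF log files; they are written by every task and are not by themselves a sign of a problem. A quick, generic way to check whether any of them actually recorded an error (standard shell commands, run from the run directory):

grep -l ERROR PET*.ESMF_LogFile | head
grep -h ERROR PET*.ESMF_LogFile | sort | uniq -c | sort -rn | head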
 

hplin

Haipeng Lin
Moderator
Staff member
Hi Dharmendra, is this 12 hours of model time or actual (wall-clock) time? The Derecho queues have a 12-hour wall-clock time limit (NCAR HPC Documentation - Queues and Charging); if the job exceeds that limit or JOB_WALLCLOCK_TIME, it will be terminated by the system. If this is the case, consider splitting up your runs by using the restart feature (Restarting a Run — CESM Tutorial).

dharmendraks841

Dharmendra Kumar Singh
Member
Hi, it was set to 12 hr by default, as that is also the maximum wall-clock time, I know. Even when the time was changed, the same error was reproduced.
 

hplin

Haipeng Lin
Moderator
Staff member
Hi, 12 hours is the maximum wall-clock time. If the run cannot complete in 12 hours, it would not be able to complete with less time either. I would suggest using restart files and limiting the run length so that each segment can successfully complete in under 12 hours.
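A minimal sketch of how such a split could be set up from the case directory, using standard CIME commands (the STOP_N/RESUBMIT values below are illustrative, not taken from this thread):

./xmlquery JOB_WALLCLOCK_TIME STOP_OPTION STOP_N RESUBMIT
./xmlchange STOP_OPTION=nmonths,STOP_N=1,RESUBMIT=11
./case.submit

With RESUBMIT greater than zero, CIME resubmits the job automatically and sets CONTINUE_RUN to TRUE after the first segment, so each segment only has to fit within the 12-hour wall-clock limit.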
 

dharmendraks841

Dharmendra Kumar Singh
Member
Hi Haipeng, Thanks.
Since I used RUN_TYPE=branch, it will take the restart file from the previous year; in this case, the 2023 restart files for 2024.


Results in group run_begin_stop_restart
STOP_OPTION: nmonths
STOP_N: 2
RESUBMIT: 5
CONTINUE_RUN: FALSE
RUN_TYPE: branch


dksingh@derecho1:/glade/work/dksingh/cesm_tutorial_23/control_2024> ./xmlquery REST_OPTION REST_N RESUBMIT CONTINUE_RUN RUN_TYPE
Results in group run_begin_stop_restart
REST_OPTION: nmonths
REST_N: 2
RESUBMIT: 5
CONTINUE_RUN: FALSE
RUN_TYPE: branch
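For a branch run, the case that supplies the restart files is controlled by the standard CIME reference-case settings, which can be confirmed the same way (these variable names are standard CIME; their values are not shown in this thread):

./xmlquery RUN_TYPE RUN_REFCASE RUN_REFDATE GET_REFCASE RUN_STARTDATE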

I will run again and let you know.
 