Hi,
I am running CESM2.1.4 on Derecho. I downloaded the model from GitHub using: git checkout release-cesm2.1.4
Then I created the case C6_B1850_Control with the compset B1850 and resolution f19_g17.
The only changes I have made to the model are setting STOP_N = 5 and STOP_OPTION = nyears. Accordingly, I set JOB_WALLCLOCK_TIME = 5:00:00. I also changed NTASKS for OCN from 2 to 4 to speed up the model, but I get the same error, shown below, without these changes:
-------------------------------------------------------------------------
- Prestage required restarts into /glade/derecho/scratch/lizhiy/C6_B1850_Control/run
- Case input data directory (DIN_LOC_ROOT) is /glade/campaign/cesm/cesmdata/inputdata
- Checking for required input datasets in DIN_LOC_ROOT
-------------------------------------------------------------------------
2024-02-27 11:28:36 MODEL EXECUTION BEGINS HERE
run command is mpiexec --label -n 1024 /glade/derecho/scratch/lizhiy/C6_B1850_Control/bld/cesm.exe >> cesm.log.$LID 2>&1
ERROR: RUN FAIL: Command 'mpiexec --label -n 1024 /glade/derecho/scratch/lizhiy/C6_B1850_Control/bld/cesm.exe >> cesm.log.$LID 2>&1 ' failed
See log file for details: /glade/derecho/scratch/lizhiy/C6_B1850_Control/run/cesm.log.3570032.desched1.240227-112827
The cesm.log file is too large to upload, so I have pasted the error lines here:
ice: Vertical thermo error
ERROR: ice: Vertical thermo error
dec0488.hsn.de.hpc.ucar.edu 478: Image PC Routine Line Source
cesm.exe 0000000002EDFA4D shr_abort_mod_mp_ 114 shr_abort_mod.F90
cesm.exe 00000000017CD7E4 ice_exit_mp_abort 46 ice_exit.F90
cesm.exe 0000000001AA4693 ice_step_mod_mp_s 569 ice_step_mod.F90
cesm.exe 00000000018FF7C8 cice_runmod_mp_ci 186 CICE_RunMod.F90
cesm.exe 00000000017BEA8A ice_comp_mct_mp_i 563 ice_comp_mct.F90
cesm.exe 00000000004312BA component_mod_mp_ 728 component_mod.F90
cesm.exe 000000000041681B cime_comp_mod_mp_ 2707 cime_comp_mod.F90
cesm.exe 0000000000430F59 MAIN__ 125 cime_driver.F90
cesm.exe 00000000004145FD Unknown Unknown Unknown
libc-2.31.so 000014A2EBDD329D __libc_start_main Unknown Unknown
cesm.exe 000000000041452A Unknown Unknown Unknown
MPICH ERROR [Rank 478] [job id 49cfa114-bc92-4d0f-9cb7-af9da22e2a0c] [Tue Feb 27 14:28:16 2024] [dec0488] - Abort(1001) (rank 478 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 478
aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 478
dec0488.hsn.de.hpc.ucar.edu: rank 478 exited with code 255
dec0492.hsn.de.hpc.ucar.edu 527: forrtl: error (78): process killed (SIGTERM)
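In case it helps anyone dealing with a similarly large cesm.log, here is a minimal Python sketch of one way to pull out only the error lines plus a little preceding context. The function name `extract_error_lines` and the search patterns are my own assumptions, chosen from the messages quoted above; they are not part of CESM or CIME and may need adjusting for other failures.

```python
import re
from collections import deque

# Assumed patterns, taken from the messages quoted in this post.
# They are illustrative only and are not defined by CESM or CIME.
ERROR_PATTERN = re.compile(r"ERROR|MPI_Abort|forrtl|exited with code")


def extract_error_lines(lines, context=2):
    """Return lines matching ERROR_PATTERN, each preceded by up to
    `context` non-matching lines that came immediately before it."""
    before = deque(maxlen=context)  # rolling buffer of recent lines
    selected = []
    for line in lines:
        line = line.rstrip("\n")
        if ERROR_PATTERN.search(line):
            selected.extend(before)  # emit buffered context first
            before.clear()
            selected.append(line)
        else:
            before.append(line)
    return selected


# Example usage (the path is a placeholder for the real log file):
# with open("cesm.log.<LID>", errors="replace") as log:
#     for hit in extract_error_lines(log, context=2):
#         print(hit)
```

Reading the file line by line keeps memory use low even for very large logs, and the small context buffer keeps the lines just before each abort, which is often where the first sign of trouble shows up.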
I have hit this error several times. It appears at random after several years of model execution. Sometimes it does not happen at all, and sometimes it simply disappears when I rerun the model. Has anyone run into the same problem? Any suggestions would be appreciated!
Thanks in advance.
				
