
B1850 case execution error on derecho

AYANAMY

Hi,

I am running CESM2.1.4 on derecho. I downloaded the model from GitHub and checked out the release with: git checkout release-cesm2.1.4

Then I created the case C6_B1850_Control with the compset B1850 and resolution f19_g17.

The only changes I have made to the model are setting STOP_N = 5 and STOP_OPTION = nyears. Accordingly, I have set JOB_WALLCLOCK_TIME = 5:00:00. I have also changed NTASKS for OCN from 2 to 4 to speed up the model, but I get the same error, shown below, without these changes:

-------------------------------------------------------------------------
- Prestage required restarts into /glade/derecho/scratch/lizhiy/C6_B1850_Control/run
- Case input data directory (DIN_LOC_ROOT) is /glade/campaign/cesm/cesmdata/inputdata
- Checking for required input datasets in DIN_LOC_ROOT
-------------------------------------------------------------------------
2024-02-27 11:28:36 MODEL EXECUTION BEGINS HERE
run command is mpiexec --label -n 1024 /glade/derecho/scratch/lizhiy/C6_B1850_Control/bld/cesm.exe >> cesm.log.$LID 2>&1
ERROR: RUN FAIL: Command 'mpiexec --label -n 1024 /glade/derecho/scratch/lizhiy/C6_B1850_Control/bld/cesm.exe >> cesm.log.$LID 2>&1 ' failed
See log file for details: /glade/derecho/scratch/lizhiy/C6_B1850_Control/run/cesm.log.3570032.desched1.240227-112827


The cesm.log file is too large to upload, so I have pasted the error lines here:


ice: Vertical thermo error
ERROR: ice: Vertical thermo error
dec0488.hsn.de.hpc.ucar.edu 478: Image PC Routine Line Source
cesm.exe 0000000002EDFA4D shr_abort_mod_mp_ 114 shr_abort_mod.F90
cesm.exe 00000000017CD7E4 ice_exit_mp_abort 46 ice_exit.F90
cesm.exe 0000000001AA4693 ice_step_mod_mp_s 569 ice_step_mod.F90
cesm.exe 00000000018FF7C8 cice_runmod_mp_ci 186 CICE_RunMod.F90
cesm.exe 00000000017BEA8A ice_comp_mct_mp_i 563 ice_comp_mct.F90
cesm.exe 00000000004312BA component_mod_mp_ 728 component_mod.F90
cesm.exe 000000000041681B cime_comp_mod_mp_ 2707 cime_comp_mod.F90
cesm.exe 0000000000430F59 MAIN__ 125 cime_driver.F90
cesm.exe 00000000004145FD Unknown Unknown Unknown
libc-2.31.so 000014A2EBDD329D __libc_start_main Unknown Unknown
cesm.exe 000000000041452A Unknown Unknown Unknown
MPICH ERROR [Rank 478] [job id 49cfa114-bc92-4d0f-9cb7-af9da22e2a0c] [Tue Feb 27 14:28:16 2024] [dec0488] - Abort(1001) (rank 478 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 478
aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 478
dec0488.hsn.de.hpc.ucar.edu: rank 478 exited with code 255
dec0492.hsn.de.hpc.ucar.edu 527: forrtl: error (78): process killed (SIGTERM)
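
For anyone who needs to pull similar lines out of a cesm.log that is too large to attach, here is a minimal Python sketch of one way to do it. It is not a CESM or CIME tool: the function names and the search pattern are my own assumptions, chosen only to match the messages quoted above, and may need adjusting for other failures.

```python
import re
from collections import deque

# Hypothetical pattern, chosen to match the messages quoted in this post.
ERROR_PATTERN = re.compile(r"ERROR|MPI_Abort|forrtl: error")


def extract_error_lines(lines, before=2):
    """Yield (line_number, text) for each line matching ERROR_PATTERN,
    preceded by up to `before` non-matching lines of context."""
    context = deque(maxlen=before)  # most recent non-matching lines
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if ERROR_PATTERN.search(text):
            # Emit the buffered context first, then the matching line.
            for item in context:
                yield item
            context.clear()
            yield (number, text)
        else:
            context.append((number, text))


def print_errors(path, before=2):
    """Stream a log file from disk and print matching lines with context."""
    # errors="replace" avoids a crash on stray non-UTF-8 bytes in the log.
    with open(path, errors="replace") as log:
        for number, text in extract_error_lines(log, before=before):
            print(f"{number}: {text}")
```

Because the file is read line by line, the whole log never has to fit in memory. Calling print_errors with the path of the cesm.log file prints each matching line with its line number and the two lines before it.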


I have run into this error several times. It appears at random after several years of model execution. Sometimes it does not happen at all, and sometimes it simply disappears when I rerun the model. Has anyone faced the same problem? Any suggestions would be appreciated!

Thanks in advance.
 