Hi, I am trying to run CESM2.1 on the ARCHER2 supercomputer. I am using a modified version of the code where CLM is replaced by a simple land model (marysa/SimpleLand) and I am using a slab ocean. This version of the code works on Cheyenne. The model successfully runs for ~10 minutes, but then stops with the following error:
ERROR: RUN FAIL: Command 'srun --distribution=block:block --hint=nomultithread /work/n02/n02/emilyvdk/cesm_data/runs/DiscussCESM_Example/bld/cesm.exe >> cesm.log.$LID 2>&1 ' failed
See log file for details: /work/n02/n02/emilyvdk/cesm_data/runs/DiscussCESM_Example/run/cesm.log.309261.210605-111526
The log file is too large for me to attach, but I think the relevant information can be found at the bottom:
NetCDF: Numeric conversion not representable
pio_support::pio_die:: myrank= -1 : ERROR: pionfwrite_mod::write_nfdarray_double: 250 : NetCDF: Numeric conversion not representable
MPICH ERROR [Rank 129] [job id 309261.0] [Sat Jun 5 11:25:52 2021] [unknown] [nid001864] - Abort(1) (rank 129 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 129
aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 129
NetCDF: Numeric conversion not representable
pio_support::pio_die:: myrank= -1 : ERROR: pionfwrite_mod::write_nfdarray_double: 250 : NetCDF: Numeric conversion not representable
MPICH ERROR [Rank 1] [job id 309261.0] [Sat Jun 5 11:25:52 2021] [unknown] [nid001736] - Abort(1) (rank 1 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
srun: error: nid001864: task 129: Exited with exit code 255
srun: Terminating job step 309261.0
slurmstepd: error: *** STEP 309261.0 ON nid001736 CANCELLED AT 2021-06-05T11:25:53 ***
srun: error: nid001736: task 1: Exited with exit code 255
srun: error: nid001864: tasks 128,130-255: Terminated
srun: error: nid001736: tasks 0,2-127: Terminated
srun: Force Terminated job step 309261.0
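In case it helps anyone looking at a similarly large cesm.log, here is a minimal sketch of how the lines above can be condensed into unique error messages and the ranks that aborted. The function name `summarize_log` and the list of patterns are my own assumptions, chosen from the excerpt above; they are not part of CESM or CIME.

```python
import re
from collections import OrderedDict

# Substrings picked from the log excerpt above (an assumption, not an
# exhaustive list of CESM/PIO error markers).
ERROR_PATTERNS = (
    "NetCDF:",
    "pio_die",
    "MPICH ERROR",
    "srun: error",
    "slurmstepd: error",
)

# Matches the rank number in lines such as "MPICH ERROR [Rank 129] ...".
RANK_RE = re.compile(r"MPICH ERROR \[Rank (\d+)\]")


def summarize_log(lines):
    """Return (ordered dict of error line -> count, sorted list of aborting ranks)."""
    counts = OrderedDict()
    ranks = set()
    for line in lines:
        line = line.strip()
        # Skip anything that does not contain one of the known markers.
        if not any(pattern in line for pattern in ERROR_PATTERNS):
            continue
        counts[line] = counts.get(line, 0) + 1
        match = RANK_RE.search(line)
        if match:
            ranks.add(int(match.group(1)))
    return counts, sorted(ranks)
```

It can be called on an open file, for example `with open(path) as f: counts, ranks = summarize_log(f)`, so the whole log never needs to be held in memory.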
I have attached my config_machines.xml, config_batch.xml and config_compilers.xml files, as well as my submission script.
Does anyone know how to solve this problem?
Could it be the same issue as the one described in the section "How do you continue a run after hitting the CLM/PIO error?" of the "Common questions and answers" page of the CESM_WF_DOC 1.0 documentation? That page suggests copying source mods from /glade/u/home/cmip6/PATCHES/clm-pio-bug_07-09-2019, but I don't have access to that directory.
Many thanks,
Emily