NetCDF error 10 minutes into model run

EmilyVanDeKoot · Jun 7, 2021

Hi, I am trying to run CESM2.1 on the ARCHER2 supercomputer. I am using a modified version of the code where CLM is replaced by a simple land model (marysa/SimpleLand) and I am using a slab ocean. This version of the code works on Cheyenne. The model successfully runs for ~10 minutes, but then stops with the following error:

ERROR: RUN FAIL: Command 'srun --distribution=block:block --hint=nomultithread /work/n02/n02/emilyvdk/cesm_data/runs/DiscussCESM_Example/bld/cesm.exe >> cesm.log.$LID 2>&1 ' failed
See log file for details: /work/n02/n02/emilyvdk/cesm_data/runs/DiscussCESM_Example/run/cesm.log.309261.210605-111526

The log file is too large for me to attach, but I think the relevant information can be found at the bottom:

NetCDF: Numeric conversion not representable
pio_support::pio_die:: myrank= -1 : ERROR: pionfwrite_mod::write_nfdarray_double: 250 : NetCDF: Numeric conversion not representable
MPICH ERROR [Rank 129] [job id 309261.0] [Sat Jun 5 11:25:52 2021] [unknown] [nid001864] - Abort(1) (rank 129 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 129

aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 129
NetCDF: Numeric conversion not representable
pio_support::pio_die:: myrank= -1 : ERROR: pionfwrite_mod::write_nfdarray_double: 250 : NetCDF: Numeric conversion not representable
MPICH ERROR [Rank 1] [job id 309261.0] [Sat Jun 5 11:25:52 2021] [unknown] [nid001736] - Abort(1) (rank 1 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1

aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
srun: error: nid001864: task 129: Exited with exit code 255
srun: Terminating job step 309261.0
slurmstepd: error: *** STEP 309261.0 ON nid001736 CANCELLED AT 2021-06-05T11:25:53 ***
srun: error: nid001736: task 1: Exited with exit code 255
srun: error: nid001864: tasks 128,130-255: Terminated
srun: error: nid001736: tasks 0,2-127: Terminated
srun: Force Terminated job step 309261.0

I have attached my config_machines.xml, config_batch.xml and config_compilers.xml files, as well as my submission script.

I was wondering whether anyone knows how to solve this problem?

I was wondering whether it is the same issue as that described in the section "How do you continue a run after hitting the CLM/PIO error?" of this webpage: Common questions and answers — CESM_WF_DOC 1.0 documentation ? The webpage suggests that I copy source mods from /glade/u/home/cmip6/PATCHES/clm-pio-bug_07-09-2019 but I don't have access to this.

Many thanks,
Emily

jedwards · Jun 7, 2021

I suspect that the problem is that you are trying to write a NaN value to a NetCDF file. Since you have replaced the CLM model, a CLM source mod is probably not going to help. I would try to determine which variable has the NaN values and why.
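As a rough illustration of the "determine which variable has the NaN values" step, here is a minimal sketch using only the Python standard library. It assumes the candidate fields have already been read into flat lists of floats (in practice they would probably come from the last history or restart file written before the crash, loaded with a NetCDF reader such as netCDF4 or xarray; that part is not shown). The function name `find_nonfinite` and the field names are hypothetical placeholders, not actual SimpleLand or CAM variables.

```python
import math
from typing import Dict, Iterable, List, Tuple


def find_nonfinite(fields: Dict[str, Iterable[float]]) -> List[Tuple[str, int, int]]:
    """Return (name, first_bad_index, bad_count) for each field holding NaN or Inf."""
    report = []
    for name, values in fields.items():
        # math.isfinite is False for NaN, +Inf and -Inf
        bad = [i for i, v in enumerate(values) if not math.isfinite(v)]
        if bad:
            report.append((name, bad[0], len(bad)))
    return report


# Hypothetical flattened fields; the names and numbers are made up for illustration.
fields = {
    "surface_temp": [287.1, 288.4, float("nan"), 290.0],
    "latent_heat": [55.0, 61.2, 58.9, 60.3],
    "soil_water": [0.31, float("inf"), float("nan"), 0.28],
}

for name, first, count in find_nonfinite(fields):
    print(f"{name}: {count} non-finite value(s), first at index {first}")
```

Infinite values are flagged as well as NaN, since either could plausibly fail a numeric conversion on write. Knowing the first bad index can help map the problem back to a grid cell and from there to the code that computes the field.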

EmilyVanDeKoot · Jun 12, 2021

Thank you for your help!

zjjiang · May 11, 2022

I ran into the same problem when running SLIM. Have you solved it?
