NetCDF error 10 minutes into model run

EmilyVanDeKoot

Emily Van de koot
New Member
Hi, I am trying to run CESM2.1 on the ARCHER2 supercomputer. I am using a modified version of the code in which CLM is replaced by a simple land model (marysa/SimpleLand), together with a slab ocean. This version of the code works on Cheyenne. On ARCHER2 the model runs successfully for ~10 minutes, then stops with the following error:

ERROR: RUN FAIL: Command 'srun --distribution=block:block --hint=nomultithread /work/n02/n02/emilyvdk/cesm_data/runs/DiscussCESM_Example/bld/cesm.exe >> cesm.log.$LID 2>&1 ' failed
See log file for details: /work/n02/n02/emilyvdk/cesm_data/runs/DiscussCESM_Example/run/cesm.log.309261.210605-111526


The log file is too large for me to attach, but I think the relevant information can be found at the bottom:

NetCDF: Numeric conversion not representable
pio_support::pio_die:: myrank= -1 : ERROR: pionfwrite_mod::write_nfdarray_double: 250 : NetCDF: Numeric conversion not representable
MPICH ERROR [Rank 129] [job id 309261.0] [Sat Jun 5 11:25:52 2021] [unknown] [nid001864] - Abort(1) (rank 129 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 129

aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 129
NetCDF: Numeric conversion not representable
pio_support::pio_die:: myrank= -1 : ERROR: pionfwrite_mod::write_nfdarray_double: 250 : NetCDF: Numeric conversion not representable
MPICH ERROR [Rank 1] [job id 309261.0] [Sat Jun 5 11:25:52 2021] [unknown] [nid001736] - Abort(1) (rank 1 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1

aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
srun: error: nid001864: task 129: Exited with exit code 255
srun: Terminating job step 309261.0
slurmstepd: error: *** STEP 309261.0 ON nid001736 CANCELLED AT 2021-06-05T11:25:53 ***
srun: error: nid001736: task 1: Exited with exit code 255
srun: error: nid001864: tasks 128,130-255: Terminated
srun: error: nid001736: tasks 0,2-127: Terminated
srun: Force Terminated job step 309261.0


I have attached my config_machines.xml, config_batch.xml and config_compilers.xml files, as well as my submission script.

Does anyone know how to solve this problem?

Could it be the same issue as the one described in the section "How do you continue a run after hitting the CLM/PIO error?" of this webpage: Common questions and answers — CESM_WF_DOC 1.0 documentation? The webpage suggests copying source mods from /glade/u/home/cmip6/PATCHES/clm-pio-bug_07-09-2019, but I don't have access to this directory.

Many thanks,
Emily
 

Attachments

  • config_batch.xml.txt (22.1 KB)
  • config_compilers.xml.txt (40.2 KB)
  • config_machines.xml.txt (97.7 KB)
  • SLIM_submission_script.txt (973 bytes)

jedwards

CSEG and Liaisons
Staff member
I suspect the problem is that you are trying to write a NaN value to a NetCDF file. Since you have replaced the CLM model, a CLM source mod is probably not going to help. I would try to determine which variable contains the NaN values and why.
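One possible way to narrow down which variable is affected is sketched below. It is only an illustration, not part of CESM or PIO: the function name `find_unrepresentable`, the field names, and the idea of passing fields in as a plain mapping of name to values are all assumptions made for the example. Besides NaN, it also counts infinities and finite values too large for single precision, on the assumption that any of these could trigger a "Numeric conversion not representable" error when doubles are written to a single-precision variable.

```python
import math

# Largest finite IEEE 754 single-precision value.
FLOAT32_MAX = 3.4028235e38


def find_unrepresentable(fields):
    """Report fields containing NaN, Inf, or values beyond float32 range.

    `fields` is a hypothetical mapping of variable name -> flat iterable
    of Python floats (for example, values read from a model output file).
    Returns {name: {"nan": n, "inf": n, "overflow": n}} for affected fields only.
    """
    report = {}
    for name, values in fields.items():
        counts = {"nan": 0, "inf": 0, "overflow": 0}
        for v in values:
            if math.isnan(v):
                counts["nan"] += 1
            elif math.isinf(v):
                counts["inf"] += 1
            elif abs(v) > FLOAT32_MAX:
                counts["overflow"] += 1
        # Keep only fields where at least one problem value was found.
        if any(counts.values()):
            report[name] = counts
    return report


if __name__ == "__main__":
    # Made-up field names and values, for illustration only.
    sample = {
        "field_ok": [287.5, 290.1],
        "field_nan": [float("nan"), 1.0],
        "field_big": [1e40, float("inf")],
    }
    for name, counts in find_unrepresentable(sample).items():
        print(name, counts)
```

In practice the values would have to come from whatever the run managed to write before aborting, read with a NetCDF reader of your choice and flattened to a list of floats per variable. Knowing which field goes bad first should help trace the cause back to the component that produces it.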
 