NetCDF error 10 minutes into model run

EmilyVanDeKoot

Emily Van de koot
New Member
Hi, I am trying to run CESM2.1 on the ARCHER2 supercomputer. I am using a modified version of the code where CLM is replaced by a simple land model (marysa/SimpleLand) and I am using a slab ocean. This version of the code works on Cheyenne. The model successfully runs for ~10 minutes, but then stops with the following error:

ERROR: RUN FAIL: Command 'srun --distribution=block:block --hint=nomultithread /work/n02/n02/emilyvdk/cesm_data/runs/DiscussCESM_Example/bld/cesm.exe >> cesm.log.$LID 2>&1 ' failed
See log file for details: /work/n02/n02/emilyvdk/cesm_data/runs/DiscussCESM_Example/run/cesm.log.309261.210605-111526


The log file is too large for me to attach, but I think the relevant information can be found at the bottom:

NetCDF: Numeric conversion not representable
pio_support::pio_die:: myrank= -1 : ERROR: pionfwrite_mod::write_nfdarray_double: 250 : NetCDF: Numeric conversion not representable
MPICH ERROR [Rank 129] [job id 309261.0] [Sat Jun 5 11:25:52 2021] [unknown] [nid001864] - Abort(1) (rank 129 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 129

aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 129
NetCDF: Numeric conversion not representable
pio_support::pio_die:: myrank= -1 : ERROR: pionfwrite_mod::write_nfdarray_double: 250 : NetCDF: Numeric conversion not representable
MPICH ERROR [Rank 1] [job id 309261.0] [Sat Jun 5 11:25:52 2021] [unknown] [nid001736] - Abort(1) (rank 1 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1

aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
srun: error: nid001864: task 129: Exited with exit code 255
srun: Terminating job step 309261.0
slurmstepd: error: *** STEP 309261.0 ON nid001736 CANCELLED AT 2021-06-05T11:25:53 ***
srun: error: nid001736: task 1: Exited with exit code 255
srun: error: nid001864: tasks 128,130-255: Terminated
srun: error: nid001736: tasks 0,2-127: Terminated
srun: Force Terminated job step 309261.0


I have attached my config_machines.xml, config_batch.xml and config_compilers.xml files, as well as my submission script.

Does anyone know how to solve this problem?

I also wondered whether this is the same issue as the one described in the section "How do you continue a run after hitting the CLM/PIO error?" on the "Common questions and answers" page of the CESM_WF_DOC 1.0 documentation. That page suggests copying source mods from /glade/u/home/cmip6/PATCHES/clm-pio-bug_07-09-2019, but I don't have access to that directory.

Many thanks,
Emily
 

Attachments

  • config_batch.xml.txt (22.1 KB)
  • config_compilers.xml.txt (40.2 KB)
  • config_machines.xml.txt (97.7 KB)
  • SLIM_submission_script.txt (973 bytes)

jedwards

CSEG and Liaisons
Staff member
I suspect the problem is that you are trying to write a NaN value to a NetCDF file. Since you have replaced the CLM model, a CLM source mod is probably not going to help. I would try to determine which variable contains the NaN values and why.
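As a rough illustration of that kind of check, here is a minimal Python sketch that scans named fields for values that cannot be written safely: NaN, infinity, or magnitudes beyond what a 32-bit float can hold. It is not part of CESM or PIO. The function name `find_bad_values`, the input mapping, and the sample variable names and values are all hypothetical, and it assumes you have already dumped the suspect fields into plain Python sequences by whatever means is convenient.

```python
import math

# Largest finite value a 32-bit float can hold; a double with a larger
# magnitude cannot be converted to float32 without overflowing.
FLOAT32_MAX = 3.4028235e38


def find_bad_values(fields):
    """Return {name: (n_nan, n_inf, n_out_of_range)} for suspect fields only.

    `fields` is a hypothetical mapping of variable name -> flat iterable of
    floats, e.g. values dumped shortly before the failing write.
    """
    report = {}
    for name, values in fields.items():
        n_nan = n_inf = n_range = 0
        for v in values:
            if math.isnan(v):
                n_nan += 1
            elif math.isinf(v):
                n_inf += 1
            elif abs(v) > FLOAT32_MAX:
                n_range += 1
        # Only report fields that contain at least one problematic value.
        if n_nan or n_inf or n_range:
            report[name] = (n_nan, n_inf, n_range)
    return report


if __name__ == "__main__":
    # Made-up sample data for illustration, not taken from the failing run.
    sample = {
        "TS": [287.1, 288.4, float("nan")],
        "FSNS": [120.0, 0.0, 35.5],
        "LHFLX": [1.0e40, 12.0, float("inf")],
    }
    for name, counts in find_bad_values(sample).items():
        print(name, counts)
```

Once a field is flagged, the next step would be to trace back through the code that computes it, since in a run that fails after ~10 minutes the bad values may first appear several time steps before the write that aborts.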
 

zjjiang

ZhongjingJiang
New Member
I ran into the same problem when running SLIM. Have you managed to solve it?
 