
Ported CESM2 returns error

James King

Member
Hi folks,

I've run into an error when running the cesm2.1.3 model version (full version info attached) on the ARCHER2 HPC system in Scotland. I'm trying to do a 3-month integration using the F2000 compset. A quick integration of a couple of days worked fine, but we had issues with the batch system on ARCHER2 and had to make a few modifications to case.run and config_batch.xml to get the longer integration off the ground. About 15 minutes into the run, the model crashed. The relevant part of the cesm.log file is:

ERROR: (seq_infodata_Init) :: rpointer file read returns an error condition
#0 0x153bc8c in ???
#1 0x153be0b in ???
#2 0x152c08d in ???
#3 0x41e2eb in ???
#4 0x42242e in ???
#5 0x2b411b29d349 in ???
#6 0x408c29 in ???
at ../sysdeps/x86_64/start.S:120
#7 0xffffffffffffffff in ???
MPICH ERROR [Rank 0] [job id 412079.0] [Mon Jul 26 17:39:49 2021] [unknown] [nid001695] - Abort(1001) (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 0

aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 0
srun: error: nid001695: task 0: Exited with exit code 255
srun: Terminating job step 412079.0
slurmstepd: error: *** STEP 412079.0 ON nid001695 CANCELLED AT 2021-07-26T17:39:49 ***
srun: error: nid001695: tasks 1-7: Terminated
srun: error: nid001699: tasks 32-39: Terminated
srun: error: nid001698: tasks 24-31: Terminated
srun: error: nid001701: tasks 48-55: Terminated
srun: error: nid001696: tasks 8-15: Terminated
srun: error: nid001702: tasks 56-63: Terminated
srun: error: nid001697: tasks 16-23: Terminated
srun: error: nid001700: tasks 40-47: Terminated
srun: Force Terminated job step 412079.0

I've attached the requested config files for a run error, along with case.run. Not sure what this error is telling me I've done wrong.

Cheers,

James
 

Attachments

  • version_info.txt
    5.6 KB · Views: 1
  • .case.run.txt
    2.9 KB · Views: 3
  • config_batch.txt
    3.1 KB · Views: 2
  • config_compilers.txt
    4.5 KB · Views: 2
  • config_machines.txt
    5 KB · Views: 2

jedwards

CSEG and Liaisons
Staff member
The error message indicates a problem reading the rpointer file. It would help if you posted the run logs.
Is your run directory (the location of the rpointer file) on a shared file system, and can you confirm that the file system is available on all of the compute nodes?
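
One way to check this is a small script like the one below. It is only a sketch, not part of CESM or CIME: the function name `check_rpointer_files` is hypothetical, and it assumes the pointer files in the run directory match the pattern `rpointer.*`. It reports whether each such file is readable and non-empty from wherever the script runs, so to test visibility from the compute nodes it would need to be launched through the batch system rather than on a login node.

```python
import os
import sys
from pathlib import Path


def check_rpointer_files(run_dir):
    """Map each rpointer.* file name in run_dir to "ok", "empty" or "not readable".

    Hypothetical helper; the rpointer.* naming pattern is an assumption.
    """
    results = {}
    for path in sorted(Path(run_dir).glob("rpointer.*")):
        if not os.access(path, os.R_OK):
            results[path.name] = "not readable"
        elif path.stat().st_size == 0:
            results[path.name] = "empty"
        else:
            results[path.name] = "ok"
    return results


if __name__ == "__main__":
    # Run directory is taken from the command line, defaulting to the current directory.
    run_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    report = check_rpointer_files(run_dir)
    if not report:
        print(f"no rpointer.* files found in {run_dir}")
    for name, status in report.items():
        print(f"{name}: {status}")
```

An empty report means no pointer files were visible in that directory from the node where the script ran, which is itself worth knowing.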
 

James King

Member
Run logs attached. The run directory is on a shared file system, and as far as I'm aware it is available on all compute nodes. However, both the CESM2 port and the HPC system itself are works in progress, and a couple of file system issues are currently being investigated.
 

Attachments

  • cesm.log.412079.210726-173937.txt
    8.8 KB · Views: 5
  • cpl.log.412079.210726-173937.txt
    4.8 KB · Views: 3