Hi folks,
I've run into an error when running CESM2.1.3 (full version info attached) on the ARCHER2 HPC system in Scotland. I'm trying to do a 3-month integration using the F2000 compset. A quick integration of a couple of days worked fine, but we had issues with the batch system on ARCHER2 and had to make a few modifications to case.run and config_batch.xml to get the longer integration off the ground. About 15 minutes into the run, the model crashed. The relevant part of the cesm.log file is:
ERROR: (seq_infodata_Init) :: rpointer file read returns an error condition
#0 0x153bc8c in ???
#1 0x153be0b in ???
#2 0x152c08d in ???
#3 0x41e2eb in ???
#4 0x42242e in ???
#5 0x2b411b29d349 in ???
#6 0x408c29 in ???
at ../sysdeps/x86_64/start.S:120
#7 0xffffffffffffffff in ???
MPICH ERROR [Rank 0] [job id 412079.0] [Mon Jul 26 17:39:49 2021] [unknown] [nid001695] - Abort(1001) (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 0
aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 0
srun: error: nid001695: task 0: Exited with exit code 255
srun: Terminating job step 412079.0
slurmstepd: error: *** STEP 412079.0 ON nid001695 CANCELLED AT 2021-07-26T17:39:49 ***
srun: error: nid001695: tasks 1-7: Terminated
srun: error: nid001699: tasks 32-39: Terminated
srun: error: nid001698: tasks 24-31: Terminated
srun: error: nid001701: tasks 48-55: Terminated
srun: error: nid001696: tasks 8-15: Terminated
srun: error: nid001702: tasks 56-63: Terminated
srun: error: nid001697: tasks 16-23: Terminated
srun: error: nid001700: tasks 40-47: Terminated
srun: Force Terminated job step 412079.0
I've attached the requested config files for a run error, along with case.run. Not sure what this error is telling me I've done wrong.
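Since the error message mentions an rpointer file read, here is a minimal sketch of how the rpointer files in the run directory could be inspected. It is only an illustration: the function name `check_rpointer_files` and the run directory path are hypothetical, and it assumes that each non-blank line of an `rpointer.*` file names a restart file that should sit in the same run directory.

```python
from pathlib import Path


def check_rpointer_files(run_dir):
    """Return {rpointer filename: status} for every rpointer.* file in run_dir.

    Assumption: each non-blank line of an rpointer file names a restart
    file expected to exist in run_dir (any leading path is ignored).
    """
    run_dir = Path(run_dir)
    report = {}
    for path in sorted(run_dir.glob("rpointer.*")):
        names = [line.strip() for line in path.read_text().splitlines()
                 if line.strip()]
        if not names:
            # An empty pointer file cannot be read successfully.
            report[path.name] = "empty"
            continue
        missing = [n for n in names if not (run_dir / Path(n).name).exists()]
        report[path.name] = "missing: " + ", ".join(missing) if missing else "ok"
    return report


if __name__ == "__main__":
    # Hypothetical path: replace with the case's actual run directory.
    for name, status in check_rpointer_files(".").items():
        print(f"{name}: {status}")
```

An empty result would mean there are no rpointer files in that directory at all, which may itself be worth noting if the run was set up as a continuation.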
Cheers,
James