Dear all,
I am running an E3SM-FATES single-point simulation on Perlmutter. When I recently submitted the case, I got the error below. Strangely, none of my previous simulations reported errors; they all ran and produced output.
ERROR: RUN FAIL: Command 'srun --label -n 1 -N 1 -c 2 --cpu_bind=cores -m plane=128 /pscratch/sd/x/myuser/e3sm_scratch/pm-cpu/Spin_up_1x1_mysite.IELMBGC.ELM_USRDAT.001.2024-07-16/bld/e3sm.exe >> e3sm.log.$LID 2>&1 ' failed
See log file for details: /pscratch/sd/x/myuser/e3sm_scratch/pm-cpu/Spin_up_1x1_mysite.IELMBGC.ELM_USRDAT.001.2024-07-16/run/e3sm.log.28453129.240722-232021
I searched that log file for the ERROR keyword; the main errors are as follows:
PE 0: MPICH_ABORT_ON_ERROR = 0
PE 0: MPICH_MPIIO_ABORT_ON_RW_ERROR= disable
ERROR: Unknown error submitted to shr_abort_abort
MPICH ERROR [Rank 0] [job id 28453133.0] [Mon Jul 22 23:20:42 2024] [nid004682] - Abort(1001) (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 0
srun: error: nid004682: task 0: Exited with exit code 233
srun: Terminating StepId=28453133.0
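For anyone who wants to repeat the search, here is a minimal sketch of how the ERROR lines can be pulled out of a log together with their neighboring lines. The function name `find_error_lines` and the `context` parameter are my own illustration, not part of E3SM or CIME; the script only assumes the log is a plain text file.

```python
import sys
from pathlib import Path


def find_error_lines(log_text, keyword="ERROR", context=1):
    """Return (line_number, line) pairs for every line containing `keyword`,
    plus `context` lines before and after each match (hypothetical helper)."""
    lines = log_text.splitlines()
    keep = set()
    for i, line in enumerate(lines):
        if keyword in line:
            start = max(0, i - context)
            stop = min(len(lines), i + context + 1)
            keep.update(range(start, stop))
    # Line numbers are reported 1-based, as a text editor would show them.
    return [(i + 1, lines[i]) for i in sorted(keep)]


if __name__ == "__main__":
    # Usage: python find_errors.py /path/to/run/e3sm.log.<jobid>.<timestamp>
    if len(sys.argv) > 1:
        text = Path(sys.argv[1]).read_text(errors="replace")
        for number, line in find_error_lines(text):
            print(f"{number}: {line}")
```

Printing a line or two of context helps because the message just before `shr_abort_abort` is often the one that says what actually went wrong.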
I have attached my error log file and the sh script that created the case.
Has anyone encountered a similar issue when submitting a case? Any suggestions and comments would be greatly appreciated!!