Hi everyone,
I am trying to port CESM 2.1.0 to a cluster with SLURM. Building a basic case (--compset X) goes well. After I submit it successfully, "squeue -u username" shows the job in a pending state. Then it fails: some kind of MPI communication error is raised.
Here is the CaseStatus output first:
****************************************************
2020-09-10 05:30:53: case.build success
---------------------------------------------------
2020-09-10 05:33:35: case.submit starting
---------------------------------------------------
2020-09-10 06:43:20: case.submit success case.run:10801889, case.st_archive:10801890
---------------------------------------------------
2020-09-10 06:43:53: case.run starting
---------------------------------------------------
2020-09-10 06:44:07: model execution starting
---------------------------------------------------
2020-09-10 06:44:40: model execution success
---------------------------------------------------
2020-09-10 06:44:40: case.run error
ERROR: RUN FAIL: Command 'mpirun -np 120 /mnt/scratch/nfs_fs02/yangx2/b1850.test/bld/cesm.exe >> cesm.log.$LID 2>&1 ' failed
See log file for details: /mnt/scratch/nfs_fs02/yangx2/b1850.test/run/cesm.log.10801889.200910-064353
****************************************************
The "cesm.log.10801889.200910-064353" is shown below:
***************************************************
[node-0204:25435] *** An error occurred in MPI_Comm_create_keyval
[node-0204:25435] *** reported by process [1384972289,47880295415809]
[node-0204:25435] *** on communicator MPI_COMM_WORLD
[node-0204:25435] *** MPI_ERR_ARG: invalid argument of some other kind
[node-0204:25435] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node-0204:25435] *** and potentially your MPI job)
forrtl: error (78): process killed (SIGTERM)
***************************************************
I am stuck here; any hints would be helpful!
Thanks in advance!
Best,
Skyalr