MPI errors in X compset (dead set)

yangx2

xinyi yang
Member
Hi everyone,
I am trying to port CESM 2.1.0 to a cluster that uses SLURM. Building a basic case (--compset X) goes well, and case.submit succeeds. "squeue -u username" shows the job in a pending state for a while; then it runs and fails with some kind of MPI communication error.
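For reference, here is roughly the sequence of commands I ran (the case path, resolution, and machine name below are placeholders, not my exact setup):

cd cime/scripts
./create_newcase --case ~/cases/x_test --compset X --res f19_g17 --machine mymachine
cd ~/cases/x_test
./case.setup
./case.build
./case.submit
squeue -u username     # job sits in pending (PD) state, then runs and fails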

Here is the CaseStatus first:

****************************************************
2020-09-10 05:30:53: case.build success
---------------------------------------------------
2020-09-10 05:33:35: case.submit starting
---------------------------------------------------
2020-09-10 06:43:20: case.submit success case.run:10801889, case.st_archive:10801890
---------------------------------------------------
2020-09-10 06:43:53: case.run starting
---------------------------------------------------
2020-09-10 06:44:07: model execution starting
---------------------------------------------------
2020-09-10 06:44:40: model execution success
---------------------------------------------------
2020-09-10 06:44:40: case.run error
ERROR: RUN FAIL: Command 'mpirun -np 120 /mnt/scratch/nfs_fs02/yangx2/b1850.test/bld/cesm.exe >> cesm.log.$LID 2>&1 ' failed
See log file for details: /mnt/scratch/nfs_fs02/yangx2/b1850.test/run/cesm.log.10801889.200910-064353

****************************************************

The "cesm.log.10801889.200910-064353" is shown below:

***************************************************
[node-0204:25435] *** An error occurred in MPI_Comm_create_keyval
[node-0204:25435] *** reported by process [1384972289,47880295415809]
[node-0204:25435] *** on communicator MPI_COMM_WORLD
[node-0204:25435] *** MPI_ERR_ARG: invalid argument of some other kind
[node-0204:25435] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node-0204:25435] *** and potentially your MPI job)
forrtl: error (78): process killed (SIGTERM)
***************************************************

I am stuck here; any hints would be helpful!
Thanks in advance!
Best,
Skyalr
 

jedwards

CSEG and Liaisons
Staff member
You can view the slurm submit command and the mpiexec command using the ./preview_run script in the case directory.
You may need to ask your system support folks about the error you are getting; I don't recognize it.
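For example, from your case directory (the path here is just an example):

cd /path/to/your/case
./preview_run

That will print the batch submit line and the MPI run command that case.submit will use.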
 