
CESM 1.2.2 works with MPICH 3.3 but not OpenMPI 3.1.3

heavens

Member
I have been trying to benchmark a new cluster that uses Knights Landing Xeon Phis. My benchmarking case is CESM 1.2.2 with -res 0.9x1.25_gx1v6 -compset B_1850_CN. It runs fine (if slowly, because of the Xeon Phis) with MPICH 3.3 and GCC 4 or GCC 7. The relevant Macros are attached (currently configured for Open MPI, but you can see the various MPICH options, too).

But running with Open MPI is a disaster. It hangs and writes a small CESM log file containing something like:

    (seq_comm_setcomm)  initialize ID (  1 GLOBAL          ) pelist   =     0     7     1 ( npes =     8) ( nthreads =  1)
    (seq_comm_setcomm)  initialize ID (  2 CPL             ) pelist   =     0     7     1 ( npes =     8) ( nthreads =  1)

If I change the BTL selection when launching mpirun (whether submitting within or outside Torque) to exclude vader, I get a few more (seq_comm_setcomm) calls but then a crash with:

    [scyld:61377] *** An error occurred in MPI_Group_range_incl
    [scyld:61377] *** reported by process [2500853761,7]
    [scyld:61377] *** on communicator MPI_COMM_WORLD
    [scyld:61377] *** MPI_ERR_RANK: invalid rank
    [scyld:61377] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
    [scyld:61377] ***    and potentially your MPI job)

and so forth. I have slim hope that anyone has encountered a similar problem, but am documenting it here just in case. Also, MPI Hello World works fine with Open MPI.

Nick
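For reference, the vader exclusion described above uses Open MPI's ^ (exclude) syntax on the btl MCA parameter. A minimal sketch of the launch line, assuming 8 tasks and an executable named cesm.exe (both depend on the actual case setup):

    mpirun -mca btl ^vader -np 8 ./cesm.exe

The MPI_ERR_RANK failure typically means MPI_Group_range_incl was asked for a rank (here up to rank 7, per the pelist) that does not exist in the group, i.e. fewer MPI ranks were visible to the model than the pelist expects.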
 

heavens

Member
I have been able to "resolve" this issue by running mpiexec with -np set to twice the number of processors actually needed, and making the equivalent resource request in Torque. However, I can only run reliably with -mca btl tcp,self; openib can be activated for 8 processors but fails for 128. (My nodes have 256 virtual processors each.) This suggests the problem is at least partly a configuration issue with my current system rather than with CESM, but the 2x processor requirement is odd.
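A minimal sketch of that workaround as a Torque script, assuming a 128-task CESM layout on one 256-thread node and an executable named cesm.exe (names, counts, and walltime are placeholders, not taken from the case above):

    #!/bin/bash
    #PBS -l nodes=1:ppn=256
    #PBS -l walltime=12:00:00
    cd $PBS_O_WORKDIR
    mpiexec -np 256 -mca btl tcp,self ./cesm.exe

With btl set to tcp,self, Open MPI uses only the TCP and loopback transports, bypassing both vader and openib; -np is 256 rather than the 128 tasks actually needed, per the doubling workaround described above.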
 
