
CESM 1.2.2 works with MPICH 3.3 but not OpenMPI 3.1.3

heavens

Member
I have been trying to benchmark a new cluster that uses Knights Landing Xeon Phis. My benchmarking case is CESM 1.2.2 with -res 0.9x1.25_gx1v6 -compset B_1850_CN. It runs fine (if slowly, because of the Xeon Phis) with MPICH 3.3 and GCC 4 or GCC 7. The relevant Macros are attached (currently configured for Open MPI, but you can see the various MPICH options, too).

But running with Open MPI is a disaster. It hangs and writes a small CESM log file containing something like:

    (seq_comm_setcomm)  initialize ID (  1 GLOBAL          ) pelist   =     0     7     1 ( npes =     8) ( nthreads =  1)
    (seq_comm_setcomm)  initialize ID (  2 CPL             ) pelist   =     0     7     1 ( npes =     8) ( nthreads =  1)

If I change the BTL selection when launching mpirun (whether submitting within or outside Torque) to exclude vader, I get a few more (seq_comm_setcomm) calls but then a crash with:

    [scyld:61377] *** An error occurred in MPI_Group_range_incl
    [scyld:61377] *** reported by process [2500853761,7]
    [scyld:61377] *** on communicator MPI_COMM_WORLD
    [scyld:61377] *** MPI_ERR_RANK: invalid rank
    [scyld:61377] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
    [scyld:61377] ***    and potentially your MPI job)

and so forth. I have slim hope that anyone has encountered a similar problem, but am documenting it here just in case. Also, MPI Hello World works fine with Open MPI.

Nick
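For reference, the vader exclusion described above uses Open MPI's ^ (exclude) syntax on the btl MCA parameter. A minimal sketch of the launch line, assuming 8 tasks and an executable named cesm.exe (both depend on the actual case setup):

    mpirun -mca btl ^vader -np 8 ./cesm.exe

The MPI_ERR_RANK failure typically means MPI_Group_range_incl was asked for a rank (here up to rank 7, per the pelist) that does not exist in the group, i.e. fewer MPI ranks were visible to the model than the pelist expects.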
 

heavens

Member
I have been able to "resolve" this issue by running mpiexec with -np set to twice the number of processors actually needed, and making the equivalent resource request in Torque. However, I can only run reliably with -mca btl tcp,self; openib can be activated for 8 processors but fails for 128. (My nodes have 256 virtual processors each.) This suggests the problem is at least partly a configuration issue with my current system rather than with CESM, but the 2x processor requirement is odd.
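A minimal sketch of that workaround as a Torque script, assuming a 128-task CESM layout on one 256-thread node and an executable named cesm.exe (names, counts, and walltime are placeholders, not taken from the case above):

    #!/bin/bash
    #PBS -l nodes=1:ppn=256
    #PBS -l walltime=12:00:00
    cd $PBS_O_WORKDIR
    mpiexec -np 256 -mca btl tcp,self ./cesm.exe

With btl set to tcp,self, Open MPI uses only the TCP and loopback transports, bypassing both vader and openib; -np is 256 rather than the 128 tasks actually needed, per the doubling workaround described above.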
 
