CESM2.1.3 porting slurm job submission solution for the openmpi/3.1.1 build on Niagara (Digital Alliance of Canada: Scinet)

nstant

Noah Stanton
New Member
To whom it may concern,

I am posting this as potential help for people running into similar issues. I have a solution for an error that can occur intermittently with CESM2 built against openmpi/3.1.1 on the Niagara supercomputer; it may occur on other systems too.

Jobs submit just fine and the output is fine, but intermittently a job reports failure after the model has finished running (again, the output is all there and the rpointer files get updated). Because the job exits with a failure status, downstream job dependencies are never triggered, which made submitting chained jobs painful.
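
For context on why this hurts: Slurm cancels an afterok dependency whenever the parent job exits with a non-zero status, even if the model output is complete. A minimal sketch of the kind of chain that breaks (the script names and job ID are hypothetical):

# submit the first run segment; Slurm prints e.g. "Submitted batch job 123456"
sbatch run_segment_1.sh
# the second segment only starts if job 123456 finishes with exit code 0;
# if mpirun fails after the model has completed, this job is never released
sbatch --dependency=afterok:123456 run_segment_2.sh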

This is a snippet of the error that comes up in the cesm.log:
(My error)
[1696053789.688923] [nia2005:103227:0] ib_iface.c:786 UCX ERROR Invalid active_width on mlx5_0:1: 16
[nia1551.scinet.local:30345:6][bcol_ucx_p2p_component.c:584:hmca_bcol_ucx_p2p_init_query] UCXP2P failed to init ucx
[nia1551.scinet.local:30345] Error: coll_hcoll_module.c:311 - mca_coll_hcoll_comm_query() Hcol library init failed

A relevant GitHub issue for this error is here: How to disable ucx during OpenMPI run? · Issue #7388 · open-mpi/ompi

Mellanox's accelerated libraries were built into Niagara's openmpi/3.1.1 installation, and they could not be removed without significantly impacting other users on Niagara.

With help from Niagara support, we came to the following solution.

In config_machines.xml, append the flags below to the mpirun executable under the openmpi section:

<mpirun mpilib="openmpi">
  <!-- name of the executable used to launch mpi jobs -->
  <executable>mpirun --mca pml ^ucx --mca osc ^ucx --mca coll ^hcoll</executable>
  <!-- arguments to the mpiexec command, the name attribute here is ignored-->
  <arguments>
    <arg name="anum_tasks"> -np {{ total_tasks }}</arg>
  </arguments>
</mpirun>

The added MCA flags (--mca pml ^ucx --mca osc ^ucx --mca coll ^hcoll) stop these libraries from being used. The errors disappeared completely.
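
If anyone wants to confirm the exclusions interactively before editing config_machines.xml, a quick test along these lines should show the UCX/HCOLL errors disappearing (hello_mpi is just a placeholder for any small MPI program):

# run a toy MPI program with the same component exclusions
mpirun --mca pml ^ucx --mca osc ^ucx --mca coll ^hcoll -np 4 ./hello_mpi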

Sincerely,

Noah Stanton
 

jedwards

CSEG and Liaisons
Staff member
Great tip, Noah - thank you. Have you tried more recent openmpi versions? 4.1.6 is the latest stable release, and 5.0.0 beta versions are also available.
 

nstant

Noah Stanton
New Member
Thanks @jedwards! It was frustrating me for a while.

I have not used the newest versions of openmpi (4.1.6 or 5.0.0) due to constraints on Niagara. I would likely have to install the necessary modules from the ground up (which is difficult with the permissions given) or use a virtual environment (again, tedious). On Niagara, to my knowledge, the most streamlined approach when using openmpi rather than intelmpi is to load NiaEnv/2018a (a standard environment), intel/2018.3, openmpi/3.1.1, hdf5-mpi/1.8.20, and netcdf-mpi/4.6.1. The necessary Fortran netCDF libraries are installed in this configuration. I'm sure there is a way to use the most recent openmpi version, but this workflow is working, so I am sticking with it for now.
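
For anyone trying to reproduce this environment, the module loads implied above would look roughly like the following (module names are as listed; exact availability depends on the Niagara software stack):

# standard 2018 environment on Niagara
module load NiaEnv/2018a
module load intel/2018.3
module load openmpi/3.1.1
# parallel HDF5 and netCDF (includes the Fortran netCDF interfaces)
module load hdf5-mpi/1.8.20
module load netcdf-mpi/4.6.1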

If you are wondering why I use openmpi rather than intelmpi, it is because for the compsets I am using (BW1850, BWma1850) openmpi runs much more efficiently on Niagara: walltimes are at least halved with openmpi compared to intelmpi for these compsets.
 