To whom it may concern,
I am posting this as potential help for people with similar issues. I have a solution for an error that can occur intermittently with CESM2 built against openmpi/3.1.1 on the Niagara supercomputer; it may occur on other systems too.
Jobs submit fine and the model output is fine, but occasionally a job fails after the model has finished running (again, the output is all there and the rpointer files get updated). Because the job exits with an error, dependent jobs are never triggered, which added a lot of pain to submitting chained runs.
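For context on why the failure mode is so painful, here is a minimal sketch of a dependency chain, assuming Slurm (as on Niagara) and a hypothetical submission script name. These are scheduler commands, not locally runnable code:

```shell
# Hypothetical script name "case.run.sbatch"; assumes Slurm's sbatch.
# --dependency=afterok releases the second job only if the first
# exits 0, so an MPI teardown error after an otherwise successful
# model run still cancels the rest of the chain.
jobid=$(sbatch --parsable case.run.sbatch)
sbatch --dependency=afterok:"${jobid}" case.run.sbatch
```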
This is a snippet of the error that appears in the cesm.log:
[1696053789.688923] [nia2005:103227:0] ib_iface.c:786 UCX ERROR Invalid active_width on mlx5_0:1: 16
[nia1551.scinet.local:30345:6][bcol_ucx_p2p_component.c:584:hmca_bcol_ucx_p2p_init_query] UCXP2P failed to init ucx
[nia1551.scinet.local:30345] Error: coll_hcoll_module.c:311 - mca_coll_hcoll_comm_query() Hcol library init failed
A relevant GitHub issue for this error is: How to disable ucx during OpenMPI run? · Issue #7388 · open-mpi/ompi
Mellanox's accelerated libraries (UCX and HCOLL) were installed with Niagara's build of openmpi/3.1.1, and they could not be removed without majorly impacting other users on Niagara.
With help from Niagara support, we came to the following solution.
In config_machines.xml, modify the mpirun entry under the openmpi section as follows:
<mpirun mpilib="openmpi">
  <!-- name of the executable used to launch mpi jobs -->
  <executable>mpirun --mca pml ^ucx --mca osc ^ucx --mca coll ^hcoll</executable>
  <!-- arguments to the mpiexec command, the name attribute here is ignored -->
  <arguments>
    <arg name="anum_tasks"> -np {{ total_tasks }}</arg>
  </arguments>
</mpirun>
The --mca framework ^component flags exclude the UCX and HCOLL components from being used. With this change, the errors disappeared completely.
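One way to sanity-check the exclusions, assuming shell access to the Open MPI install, is to ask ompi_info which components each framework exposes and to run a trivial job with verbose component selection (output format varies by Open MPI version; these commands only run on the cluster itself):

```shell
# List the pml and coll components Open MPI knows about.
ompi_info --param pml all | grep -i ucx
ompi_info --param coll all | grep -i hcoll
# Verbose selection log for a trivial two-rank run: with the
# ^ucx / ^hcoll exclusions, a different pml (e.g. ob1) should be
# selected and hcoll should not initialize.
mpirun --mca pml ^ucx --mca osc ^ucx --mca coll ^hcoll \
       --mca pml_base_verbose 10 -np 2 hostname
```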
Sincerely,
Noah Stanton