Hi CESM Community,
I'm setting up CESM2.1.3 on a custom HPC cluster and running an initial test with a compset that uses only stub components. My configuration:
CESM Version: CESM2.1.3-rc.01
Compset: 2000_XATM_XLND_XICE_XOCN_XROF_XGLC_XWAV
Grid: a%1.9x2.5_l%1.9x2.5_oi%gx1v6_r%r05_g%gland4_w%ww3a_m%gx1v6
Machine: Custom machine file (mycluster)
MPI: OpenMPI 4.1.5 (compiled with UCX and OpenFabrics support)
PROBLEM:
When I run the test case (even with only stub components), CESM fails with the following MPI errors:
ORTE_ERROR_LOG: Bad parameter in file orted/pmix/pmix_server_gen.c at line 863
pml_ucx.c:176 Error: Failed to receive UCX worker address: Not found (-13)
UCX ERROR Error returned from open in attach. Permission denied. File name is: /proc/...
The model terminates with:
forrtl: error (78): process killed (SIGTERM)
Any suggestions or insights are greatly appreciated.
Thanks in advance,
SJ