
Port problem: model hangs after finishing initialization for B1850

QINKONG

QINQIN KONG
Member
Hello. I'm trying to port cesm2.1.3 to an HPC cluster at Purdue University. The X and A compsets run successfully. However, when I try B1850, the model hangs right after finishing the initialization of all components. The last few lines of the cpl.log messages are shown below. There is no error message in the cesm.log file. I attached the cpl.log, cesm.log, and rof.log files here for reference; the cesm.log file is too large and was split into three parts.

(component_init_cx) : creating gsmap_cx for rof
(seq_mctext_gsmapCreate) created new gsmap decomp_type = 2
(seq_mctext_gsmapCreate) ngseg/gsize = 259200 259200
(seq_mctext_gsmapCreate) mpisize/active_pes = 192 192
(seq_mctext_gsmapCreate) avg seg per pe/ape = 1350 1350
(seq_mctext_gsmapCreate) nlseg/maxnlsegs = 1350 1350
(component_init_cx) : Initializing mapper_Cr2x 1
(seq_map_init_rearrsplit) gsmaps are not identical
(seq_map_init_rearrsplit) mapper counter, strategy, mapfile = 5 rearrange undefined
(component_init_cx) : Initializing mapper_Cx2r 1
(seq_map_init_rearrsplit) gsmaps are not identical


The initial default PE layout is shown below.
Comp NTASKS NTHRDS ROOTPE
CPL : 96/ 1; 0
ATM : 96/ 1; 0
LND : 48/ 1; 0
ICE : 48/ 1; 48
OCN : 48/ 1; 96
ROF : 48/ 1; 0
GLC : 24/ 1; 0
WAV : 24/ 1; 0
ESP : 1/ 1; 0


I suspected that the problem above might be a memory issue for the ROF component, so I increased NTASKS to 192 for each component, but the model still hangs at the same place.
Comp NTASKS NTHRDS ROOTPE
CPL : 192/ 1; 0
ATM : 192/ 1; 0
LND : 192/ 1; 0
ICE : 192/ 1; 0
OCN : 192/ 1; 0
ROF : 192/ 1; 0
GLC : 192/ 1; 0
WAV : 192/ 1; 0
ESP : 24/ 1; 0

Any idea what the problem is?

Thanks a lot!
 

Attachments

  • cpl.log.10076375.210215-173012.txt
    44.4 KB · Views: 2
  • rof.log.10076375.210215-173012.txt
    13.4 KB · Views: 3
  • cesm.log.part1.txt
    866.7 KB · Views: 4
  • cesm.log.part2.txt
    970.1 KB · Views: 1
  • cesm.log.part3.txt
    150.3 KB · Views: 3

QINKONG

QINQIN KONG
Member
I just tested another compset: F2000climo. The model hangs at the same place, with the same message in the log file!
 

jedwards

CSEG and Liaisons
Staff member
What resolution are you trying? It looks like memory may be an issue, but it's not clear. Try a lower resolution or more tasks.
 

QINKONG

QINQIN KONG
Member
jedwards said: What resolution are you trying? It looks like memory may be an issue, but it's not clear. Try a lower resolution or more tasks.
Hi Jedwards, thanks for the reply. The resolution is f19_g17 with NTASKS=192 for all components. I also increased NTASKS to 240 for all components for another try with the FHIST compset, and the model still hangs at the same place. On the HPC cluster I'm using, each 24-core node has about 90 GB of memory, so the total available memory should be around 900 GB.

I will try f45_g37 first (but this is not scientifically supported for B1850) to see if it works.

Thanks for the help!
 

QINKONG

QINQIN KONG
Member
jedwards said: What resolution are you trying? It looks like memory may be an issue, but it's not clear. Try a lower resolution or more tasks.
Currently, I use the same NTASKS for all components (except NTASKS=1 for ESP), with NTHRDS=1 and ROOTPE=0 for all components, so that the components run sequentially on the same processors. This shouldn't cause any problem, right?
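For reference, a layout like that can be set from the case directory with standard CIME commands (a sketch only; the task count is the one used here):

# Set a uniform, sequential layout: all components on the same 192 tasks, ESP on 1.
./xmlchange NTASKS=192,NTHRDS=1,ROOTPE=0
./xmlchange NTASKS_ESP=1
# Re-run setup so the new PE layout takes effect.
./case.setup --reset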
 

jedwards

CSEG and Liaisons
Staff member
You have A and X working, so I think the next step should be an aquaplanet case. f19_f19_mg17 should be fine; the compset is QSC6.
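For example, a minimal sketch of creating that test case (the case name and machine name are placeholders, and --run-unsupported may be required for this compset/grid combination):

./create_newcase --case aquaplanet_test --compset QSC6 --res f19_f19_mg17 --machine <your_machine> --run-unsupported
cd aquaplanet_test
./case.setup
./case.build
./case.submit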
 

QINKONG

QINQIN KONG
Member
jedwards said: Try an I compset: IHistClm50Bgc
Hi Jedwards, thanks for the reply. I just tried IHistClm50Bgc. It hangs at the same place. NTASKS=168 for all components.

The cesm.log, cpl.log, lnd.log, and rof.log files were uploaded for reference.

I also attached my config_machines.xml and config_compilers.xml in case there is something wrong in them.
 

Attachments

  • cesm.log.10134713.210216-201531.txt
    46.6 KB · Views: 1
  • cpl.log.10134713.210216-201531.txt
    44.7 KB · Views: 2
  • lnd.log.10134713.210216-201531.txt
    187.7 KB · Views: 5
  • rof.log.10134713.210216-201531.txt
    13 KB · Views: 2
  • config_machines.xml.txt
    110 KB · Views: 4
  • config_compilers.xml.txt
    41.8 KB · Views: 1
  • config_batch.xml.txt
    23.3 KB · Views: 2

QINKONG

QINQIN KONG
Member
jedwards said: Try an I compset: IHistClm50Bgc
My machine name is brown. I forgot to extract only the configuration section for my machine. Sorry.
I once read another post describing a partly similar issue, in which you suggested that the potential problem might be a mix of MPI libraries. In my config_machines.xml file I include both impi and mpi-serial; could that be the problem?
 

jedwards

CSEG and Liaisons
Staff member
No, I don't think that is the problem - but you could try the I compset with mpi-serial to see if that works.
I'm not seeing anything in the logs that indicates the problem. You can also try rebuilding with DEBUG on to see if that provides any more information.
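For example, a minimal sketch of turning DEBUG on for an existing case (standard CIME commands, run from the case directory):

./xmlchange DEBUG=TRUE
./case.build --clean-all
./case.build
./case.submit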
 

dobbins

Brian Dobbins
CSEG and Liaisons
Staff member
Hi,

I'll double check this with CESM 2.1.3 later, but on CESM 2.2, I've run a B1850 @ f19_g17 resolution on 24 cores and it only takes ~71GB of RAM, so with 90GB nodes, this isn't likely to be a (hardware) memory limit. Scaling out to 192 cores, it grows to ~149GB, but that should be spread out pretty evenly across 8 nodes, so again, not likely an issue. One quick first thing to check is your user limits -- what does 'ulimit -a' show on the compute nodes? If you're being limited to much less, that would help explain things.
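As a quick sketch of how to check that on a compute node rather than on the login node (this assumes a SLURM-style scheduler, which isn't confirmed in this thread), the command can be run through the batch system:

# Run 'ulimit -a' on a compute node via a small interactive allocation (SLURM assumed).
srun -N 1 -n 1 -t 00:05:00 bash -c 'ulimit -a'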

Also, you mentioned running the A, X, C and G compsets - can you confirm that at least one of them used more than one node? (The first few lines of output from './preview_run' in those case directories should tell us.)

If none of them did, I'd recommend trying a B1850 @ f19_g17 on 24 cores (./xmlchange NTASKS=24,NTHRDS=1,ROOTPE=0). This won't be efficient, but it restricts the full run to a single node. If your system has a limit where ~71GB is pushing things, dropping down to f45_g37 should be fine for this test. If it works, we'd want to then do the same run on 2 nodes, which you can do in the same case directory by the following:

./xmlchange MAX_TASKS_PER_NODE=12
./case.setup --reset
./case.build --clean-all
./case.build
./preview_run # (just to check!)
./case.submit

This should basically duplicate the exact same run, but force it to spread the same 24-task count across 2 nodes, with 12 tasks per node.

As above, also building with DEBUG=true will likely help. Hopefully with more info from above, we can figure this out.
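As a side note on the memory question: the coupler log normally includes periodic memory diagnostics, so a quick grep in the run directory (a sketch; the exact wording of those lines varies between versions, and the log file names include the job id and timestamp) can show how much memory the run was using before it hung:

grep -i memory cpl.log.*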
 

QINKONG

QINQIN KONG
Member
jedwards said: No, I don't think that is the problem - but you could try the I compset with mpi-serial... (quoted above)
dobbins said: I'll double check this with CESM 2.1.3 later, but on CESM 2.2, I've run a B1850 @ f19_g17 resolution on 24 cores... (quoted above)
Hi Jedwards and Brian, thanks for your kind help!
Both the C and G compsets ran successfully across multiple nodes.
I made some modifications to my config_machines.xml file and reran the IHistClm50Bgc compset. It no longer hangs and moves past the original hang point, but it now reports an error message that looks more like a science error than a porting error.

The modifications to config_machines.xml are shown below (the first block is the original configuration, the second is the new one). I removed the mpi-serial section (because the module system on our HPC cluster doesn't seem to have an mpi-serial module) and deleted some seemingly redundant module commands.
The old one:
<module_system type="module">
  <init_path lang="perl">/opt/lmod/init/perl</init_path>
  <init_path lang="python">/opt/lmod/init/env_modules_python.py</init_path>
  <init_path lang="sh">/opt/lmod/init/sh</init_path>
  <init_path lang="csh">/opt/lmod/init/csh</init_path>
  <cmd_path lang="perl">/opt/lmod/libexec/lmod perl</cmd_path>
  <cmd_path lang="python">/opt/lmod/libexec/lmod python</cmd_path>
  <cmd_path lang="sh">module</cmd_path>
  <cmd_path lang="csh">module</cmd_path>
  <modules>
    <command name="purge">--force</command>
    <command name="load">rcac</command>
  </modules>
  <modules compiler="intel">
    <command name="load">intel/17.0.1.132</command>
    <command name="load">anaconda/5.1.0-py36</command>
    <command name="load">netcdf/4.5.0</command>
    <command name="load">netcdf-fortran/4.4.4</command>
    <command name="load">parallel-netcdf/1.10.0</command>
    <command name="load">hdf5/1.8.16</command>
    <command name="load">netlib-lapack/3.6.0</command>
    <command name="load">openblas/0.3.7</command>
  </modules>
  <modules mpilib="impi" compiler="intel">
    <command name="load">intel/17.0.1.132</command>
    <command name="load">impi/2017.1.132</command>
    <command name="load">netcdf/4.5.0</command>
    <command name="load">parallel-netcdf/1.10.0</command>
  </modules>
  <modules mpilib="mpi-serial">
    <command name="load">netcdf/4.5.0</command>
    <command name="load">parallel-netcdf/1.10.0</command>
  </modules>
</module_system>

The new one:


<module_system type="module">
  <init_path lang="perl">/opt/lmod/init/perl</init_path>
  <init_path lang="python">/opt/lmod/init/env_modules_python.py</init_path>
  <init_path lang="sh">/opt/lmod/init/sh</init_path>
  <init_path lang="csh">/opt/lmod/init/csh</init_path>
  <cmd_path lang="perl">/opt/lmod/libexec/lmod perl</cmd_path>
  <cmd_path lang="python">/opt/lmod/libexec/lmod python</cmd_path>
  <cmd_path lang="sh">module</cmd_path>
  <cmd_path lang="csh">module</cmd_path>
  <modules>
    <command name="purge">--force</command>
  </modules>
  <modules compiler="intel">
    <command name="load">intel/17.0.1.132</command>
    <command name="load">impi/2017.1.132</command>
    <command name="load">anaconda/5.1.0-py36</command>
    <command name="load">netcdf/4.5.0</command>
    <command name="load">netcdf-fortran/4.4.4</command>
    <command name="load">parallel-netcdf/1.10.0</command>
    <command name="load">hdf5/1.8.16</command>
    <command name="load">netlib-lapack/3.6.0</command>
    <command name="load">openblas/0.3.7</command>
  </modules>
</module_system>
 

QINKONG

QINQIN KONG
Member
After the modifications above, the IHistClm50Bgc compset was rerun and generated the error "carbon or nitrogen state critically negative ERROR in CNPrecisionControl" in the lnd.log file. The cesm.log file also contains many similar error messages, all seemingly science-related. I didn't make any source code modifications or namelist changes. NTASKS=168 for all components; the grid is f19_g17.

I'm confused about the following questions:

(1) It appears that the config_machines.xml modification (mainly removing the mpi-serial section) solved the model-hang issue. Could you please explain why?
(2) I didn't make any source code modifications or namelist changes. Why did the model report errors like "carbon or nitrogen state critically negative"? How should I solve this?

The cesm.log, cpl.log and lnd.log are attached here.
Thanks!
 

Attachments

  • cesm.log.10155980.210217-151555.txt
    445.9 KB · Views: 1
  • cpl.log.10155980.210217-151555.txt
    62.7 KB · Views: 2
  • lnd.log.10155980.210217-151555.txt
    188 KB · Views: 1

jedwards

CSEG and Liaisons
Staff member
The mpi-serial source is provided with cesm - it's not a module on your system.
When you are using mpi-serial you cannot use pnetcdf or a parallel build of netcdf; this is why there is a separate module section with that attribute. I don't see any difference in the module settings except that you removed <command name="load">rcac</command>, and I don't know what that is.
 

QINKONG

QINQIN KONG
Member
jedwards said: The mpi-serial source is provided with cesm - it's not a module on your system... (quoted above)
rcac is a shortcut module on our HPC cluster that loads some modules by default. I think it's irrelevant here.
Do you mean my initial config_machines.xml was wrong because I included pnetcdf under mpi-serial?
Should I change:

<modules mpilib="mpi-serial">
  <command name="load">netcdf/4.5.0</command>
  <command name="load">parallel-netcdf/1.10.0</command>
</modules>

to:

<modules mpilib="mpi-serial">
  <command name="load">netcdf/4.5.0</command>
</modules>

?

If I want to test with mpi-serial, what should I do? Also, does it make sense to include multiple nodes when using mpi-serial?
 

QINKONG

QINQIN KONG
Member
dobbins said: I'll double check this with CESM 2.1.3 later... One quick first thing to check is your user limits -- what does 'ulimit -a' show on the compute nodes? ... (quoted above)
This is the output of 'ulimit -a':

core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 379989
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 20480
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

I tried to run B1850 at f45_g37, but case.build fails with the following error message. Maybe this grid is not supported for B1850?


Building case in directory /depot/huberm/apps/kong/cesm2.1.3/cime/scripts/B1850_8th

sharedlib_only is False
model_only is False
Setting resource.RLIMIT_STACK to -1 from (-1, -1)
Generating component namelists as part of build
Creating component namelists
Calling /depot/huberm/apps/kong/cesm2.1.3/components/cam//cime_config/buildnml
...calling cam buildcpp to set build time options
ERROR: Command /depot/huberm/apps/kong/cesm2.1.3/components/cam/bld/build-namelist -ntasks 24 -csmdata /depot/huberm/data/kong/cesm_input_data -infile /depot/huberm/apps/kong/cesm2.1.3/cime/scripts/B1850_8th/Buildconf/camconf/namelist -ignore_ic_year -use_case 1850_cam6 -inputdata /depot/huberm/apps/kong/cesm2.1.3/cime/scripts/B1850_8th/Buildconf/cam.input_data_list -namelist " &atmexp co2_cycle_rad_passive=.true. /" failed rc=255
out=CAM build-namelist - ERROR: No default value found for ncdata
user defined attributes:
key=ic_md val=00010101
err=Died at /depot/huberm/apps/kong/cesm2.1.3/components/cam/bld/build-namelist line 4054.
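For what it's worth, the "No default value found for ncdata" error means CAM has no out-of-the-box initial-condition file for this compset at the f45 resolution. If that test were really needed, one possible workaround (a sketch only; the path below is a placeholder, not a real dataset name) would be to point CAM at a suitable initial file in user_nl_cam:

! In user_nl_cam; the path is a placeholder for an f45 CAM initial-condition file.
ncdata = '/path/to/a/suitable/f45_cam_initial_file.nc'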
 

jedwards

CSEG and Liaisons
Staff member
With mpi-serial you do not want to load pnetcdf.
Use the create_newcase argument --mpilib mpi-serial.
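For example, a minimal sketch of creating an mpi-serial test case (the case name is a placeholder, the compset and resolution are the ones used earlier in the thread, and --run-unsupported may be required):

./create_newcase --case I_mpiserial_test --compset IHistClm50Bgc --res f19_g17 --machine brown --mpilib mpi-serial --run-unsupported
cd I_mpiserial_test
./case.setup
./case.build
./case.submit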
 

QINKONG

QINQIN KONG
Member
jedwards said: With mpi-serial you do not want to load pnetcdf. Use the create_newcase argument --mpilib mpi-serial.
Hi. I just took a break from porting, and now I'm back. I changed my config_machines.xml and config_compilers.xml files. Basically, I switched from impi to openmpi (if I switch back to impi, the model still hangs). Now I can run B1850 and IHistClm50Bgc successfully, but not every time. After a successful initial run, when I resubmit it, the model sometimes fails. The cpl.log file still shows successful termination, and the error message appears at the end of the cesm.log file as below:


Open MPI failed an OFI Libfabric library call (fi_close). This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

Local host: brown-a415
Location: mtl_ofi_component.c:657
Error: Device or resource busy (16)


mpirun has exited due to process rank 71 with PID 40449 on
node brown-a415 exiting improperly. There are three reasons this could occur:
1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.
2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"
3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
orte_create_session_dirs is set to false. In this case, the run-time cannot
detect that the abort call was an abnormal termination. Hence, the only
error message you will receive is this one.
This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
You can avoid this message by specifying -quiet on the mpirun command line.

Since the cpl.log file shows success, I suspect the error happens at the very end of the run, maybe during the writing of restart files. It seems to be associated with MPI.

How should I solve this?

The cesm.log and cpl.log files of the IHistClm50Bgc case, along with the config_compilers.xml and config_machines.xml files, are attached.

Thanks!!!
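For reference, a quick way to confirm which Open MPI version is in use and whether the OFI/libfabric components are active (a sketch; ompi_info ships with a standard Open MPI installation, but its availability here depends on the loaded module):

mpirun --version
ompi_info | grep -i ofi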
 

Attachments

  • cesm.log.10172394.210219-143816.txt
    54 KB · Views: 1
  • config_compilers.xml.txt
    41.5 KB · Views: 6
  • config_machines.xml.txt
    109.5 KB · Views: 7
  • cpl.log.10172394.210219-143816.txt
    65.9 KB · Views: 0

jedwards

CSEG and Liaisons
Staff member
This looks an awful lot like a bug that was in early versions of openmpi but has since been fixed. That reminds me that there was a similar bug in older versions of impi - according to your config_machines.xml file you have some really old system software. Can you get some newer versions? Besides the compiler and MPI library, your netcdf, hdf5, and pnetcdf versions are also very old.
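A quick way to see whether newer versions are already installed on the cluster (a sketch using standard Lmod commands, since the machine configuration points at Lmod; the module names are the ones from this thread):

module spider intel
module spider impi
module spider openmpi
module spider netcdf
module spider hdf5
module spider parallel-netcdf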
 