MPI errors on high resolution grid.


Jonathan R. Buzan
$ ./describe_version
            ccs_config at tag ccs_config_cesm0.0.109
M                      HEAD detached at 797acd7
                      Changes not staged for commit:
                        (use "git add <file>..." to update what will be committed)
                        (use "git restore <file>..." to discard changes in working directory)
                          modified:   machines/config_batch.xml
                          modified:   machines/config_machines.xml

                      no changes added to commit (use "git add" and/or "git commit -a")

                 share at tag share1.0.19
                  cime at tag cime6.0.246
                   mct at tag MCT_2.11.0
            mpi-serial at tag MPIserial_2.5.0
                   cam at tag cam6_3_162
M                      HEAD detached at ab476f9b
                      Changes not staged for commit:
                        (use "git add/rm <file>..." to update what will be committed)
                        (use "git restore <file>..." to discard changes in working directory)
                          deleted:    cime_config/testdefs/testmods_dirs/cam/outfrq9s_waccm_ma_mam4/shell_commands
                          deleted:    cime_config/testdefs/testmods_dirs/cam/outfrq9s_waccm_ma_mam4/user_nl_cam
                          deleted:    cime_config/testdefs/testmods_dirs/cam/outfrq9s_waccm_ma_mam4/user_nl_clm

                      no changes added to commit (use "git add" and/or "git commit -a")

                   ww3 at tag ww3i_0.0.2
                   rtm at tag rtm1_0_79
                pysect at tag 3.2.2
                mosart at tag mosart1_0_49
             mizuroute at tag cesm-coupling.n02_v2.1.2
                   fms at tag fi_240516
            parallelio at tag pio2_6_2
                 cdeps at tag cdeps1.0.37
                 cmeps at tag cmeps0.14.63
                  cice at tag cesm_cice6_5_0_9
                  cism at tag cismwrap_2_2_001
                   clm at tag ctsm5.2.007
                   mom at tag mi_240522
    testfails = 0, local mods = 2, needs updates 0

    The submodules labeled with 'M' above are not in a clean state.
    The following are options for how to proceed:
    (1) Go into each submodule which is not in a clean state and issue a 'git status'
        Either revert or commit your changes so that the submodule is in a clean state.
    (2) use the --force option to git-fleximod
    (3) you can name the particular submodules to update using the git-fleximod command line
    (4) As a last resort you can remove the submodule (via 'rm -fr [directory]')
        then rerun git-fleximod update.

./create_newcase --case /capstor/scratch/cscs/jbuzan/cesm3_0_beta01/cases/intel_cesm3_0_beta01_F2000climo_x025_O6144_01 --compiler intel --compset F2000climo --res ne120pg3_ne120pg3_mt13 --mach eiger --driver nuopc --mpilib mpich --run-unsupported

env_mach_pes is attached. 48 nodes x 128 cores per node (Eiger is almost the same machine as Derecho).

I set up the core distribution as attached.

Describe your problem or question:

The simulation always fails at startup. I've tried fewer nodes, but then I run into wallclock limits for a 20-day test. I used the following reasoning to arrive at 48 nodes.

The ne30pg3_ne30pg3_mg17 grid runs successfully with 3 nodes (384 cores).
ne120pg3_ne120pg3_mt13 is approximately 4x4 higher resolution, so I multiplied the 3 nodes by 16 to get 48 nodes.
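
As a rough sketch of that estimate (the variable names are only for illustration, and it assumes the cost scales purely with the number of horizontal grid cells, ignoring any change in time step):

```python
# Rough node-count estimate for ne120 based on a working ne30 layout.
# Assumption: cost scales only with the number of horizontal grid cells.
CORES_PER_NODE = 128        # cores per Eiger node, as stated in the post
base_nodes = 3              # ne30pg3 runs successfully on 3 nodes (384 cores)

refinement = 120 // 30      # ne120 vs ne30: 4x finer in each horizontal direction
scale = refinement ** 2     # 4 x 4 = 16

nodes = base_nodes * scale
cores = nodes * CORES_PER_NODE
print(nodes, cores)         # 48 6144
```

The 6144 cores match the `O6144` in the case name and the 48 x 128 layout in env_mach_pes.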

I get the error below.


jbuzan@eiger-ln002:/capstor/scratch/cscs/jbuzan/cesm3_0_beta01/cases/intel_cesm3_0_beta01_F2000climo_x025_O6144_01 [19:18:46] $ cat /capstor/scratch/cscs/jbuzan/cesm3_0_beta01/output/intel_cesm3_0_beta01_F2000climo_x025_O6144_01/run/cesm.log.3303936.240902-185339
Mon Sep  2 18:56:58 2024: [PE_3896]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=94, pes_this_node=128, timeout=180 secs
Mon Sep  2 18:56:59 2024: [PE_5872]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=107, pes_this_node=128, timeout=180 secs
Mon Sep  2 18:56:59 2024: [PE_2169]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=104, pes_this_node=128, timeout=180 secs
Mon Sep  2 18:56:59 2024: [PE_624]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=103, pes_this_node=128, timeout=180 secs
Mon Sep  2 18:56:59 2024: [PE_1592]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=105, pes_this_node=128, timeout=180 secs
Mon Sep  2 18:56:59 2024: [PE_5936]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=97, pes_this_node=128, timeout=180 secs
Mon Sep  2 18:57:00 2024: [PE_3696]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=103, pes_this_node=128, timeout=180 secs
Mon Sep  2 18:57:00 2024: [PE_378]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=98, pes_this_node=128, timeout=180 secs
Mon Sep  2 18:57:00 2024: [PE_3706]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=103, pes_this_node=128, timeout=180 secs
Mon Sep  2 18:57:00 2024: [PE_3706]:_pmi_mmap_init:Failed to setup PMI mmap.
Mon Sep  2 18:57:00 2024: [PE_3706]:globals_init:_pmi_mmap_init returned -1
MPICH ERROR [Rank 0] [job id unknown] [Mon Sep  2 18:57:00 2024] [nid001420] - Abort(1092879) (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_pmi_init(110)...: PMI_Init returned 1

aborting job:
Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_pmi_init(110)...: PMI_Init returned 1
Mon Sep  2 18:57:00 2024: [PE_4922]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=91, pes_this_node=128, timeout=180 secs
srun: error: nid001420: task 3706: Exited with exit code 255
srun: Terminating StepId=3303936.0
slurmstepd: error: *** STEP 3303936.0 ON nid001117 CANCELLED AT 2024-09-02T18:57:00 ***
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
libpthread-2.31.s  00001468BCEBE910  Unknown               Unknown  Unknown
                   00001468B51A7755  Unknown               Unknown  Unknown
                   00001468B51A7855  Unknown               Unknown  Unknown
                   00001468B51A7C14  _pmi_mmap_init        Unknown  Unknown
                   00001468B51A252C  _pmi_init             Unknown  Unknown
                   00001468B51AF706  PMI2_Init             Unknown  Unknown
                   00001468B9632A11  Unknown               Unknown  Unknown
                   00001468B96384DD  Unknown               Unknown  Unknown
                   00001468B80C3D7E  Unknown               Unknown  Unknown
                   00001468B80C4304  PMPI_Init_thread      Unknown  Unknown
libmpifort_intel.  00001468B9FD392F  MPI_INIT_THREAD       Unknown  Unknown
cesm.exe           0000000000436E1B  MAIN__                     40  esmApp.F90
cesm.exe           0000000000425DCD  Unknown               Unknown  Unknown
                   00001468B73C724D  __libc_start_main     Unknown  Unknown
cesm.exe           0000000000425CFA  Unknown               Unknown  Unknown
forrtl: error (78): process killed (SIGTERM)


CSEG and Liaisons
Staff member
We do not currently support the ne120 grid in cesm3_0 - it's in the plans but not ready yet.


Jonathan R. Buzan
Hi Jim,

Thanks for the quick reply. Is there a 0.5 degree grid that might work?



CSEG and Liaisons
Staff member
At this time we are only supporting the ne30pg3 1 degree grid - higher resolution grids will follow.