MPI errors on high-resolution grid

Jbuzan

Jonathan R. Buzan
Member
Code:
$ ./describe_version
            ccs_config at tag ccs_config_cesm0.0.109
M                      HEAD detached at 797acd7
                      Changes not staged for commit:
                        (use "git add <file>..." to update what will be committed)
                        (use "git restore <file>..." to discard changes in working directory)
                          modified:   machines/config_batch.xml
                          modified:   machines/config_machines.xml

                      no changes added to commit (use "git add" and/or "git commit -a")

                 share at tag share1.0.19
                  cime at tag cime6.0.246
                   mct at tag MCT_2.11.0
            mpi-serial at tag MPIserial_2.5.0
                   cam at tag cam6_3_162
M                      HEAD detached at ab476f9b
                      Changes not staged for commit:
                        (use "git add/rm <file>..." to update what will be committed)
                        (use "git restore <file>..." to discard changes in working directory)
                          deleted:    cime_config/testdefs/testmods_dirs/cam/outfrq9s_waccm_ma_mam4/shell_commands
                          deleted:    cime_config/testdefs/testmods_dirs/cam/outfrq9s_waccm_ma_mam4/user_nl_cam
                          deleted:    cime_config/testdefs/testmods_dirs/cam/outfrq9s_waccm_ma_mam4/user_nl_clm

                      no changes added to commit (use "git add" and/or "git commit -a")

                   ww3 at tag ww3i_0.0.2
                   rtm at tag rtm1_0_79
                pysect at tag 3.2.2
                mosart at tag mosart1_0_49
             mizuroute at tag cesm-coupling.n02_v2.1.2
                   fms at tag fi_240516
            parallelio at tag pio2_6_2
                 cdeps at tag cdeps1.0.37
                 cmeps at tag cmeps0.14.63
                  cice at tag cesm_cice6_5_0_9
                  cism at tag cismwrap_2_2_001
                   clm at tag ctsm5.2.007
                   mom at tag mi_240522
    testfails = 0, local mods = 2, needs updates 0

    The submodules labeled with 'M' above are not in a clean state.
    The following are options for how to proceed:
    (1) Go into each submodule which is not in a clean state and issue a 'git status'
        Either revert or commit your changes so that the submodule is in a clean state.
    (2) use the --force option to git-fleximod
    (3) you can name the particular submodules to update using the git-fleximod command line
    (4) As a last resort you can remove the submodule (via 'rm -fr [directory]')
        then rerun git-fleximod update.
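

If the local edits are intentional (the machines/*.xml changes look like the Eiger port), option (1) above can be satisfied by committing them inside each flagged submodule rather than discarding them. A minimal sketch, assuming the standard CESM checkout layout (ccs_config at the top level, cam under components/) and the git-fleximod wrapper shipped in bin/:

Code:
# option (1): inspect the submodules flagged 'M'
git -C ccs_config status
git -C components/cam status

# keep the local machine-port edits by committing them inside ccs_config
git -C ccs_config add machines/config_batch.xml machines/config_machines.xml
git -C ccs_config commit -m "local machine port for eiger"

# undo the deletions in cam's testdefs tree (restores the three deleted files)
git -C components/cam restore cime_config/testdefs

# or, option (2): force git-fleximod to reset everything (this discards the
# local edits, so back up the machine port first); the bin/ path is an assumption
./bin/git-fleximod update --force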



./create_newcase --case /capstor/scratch/cscs/jbuzan/cesm3_0_beta01/cases/intel_cesm3_0_beta01_F2000climo_x025_O6144_01 --compiler intel --compset F2000climo --res ne120pg3_ne120pg3_mt13 --mach eiger --driver nuopc --mpilib mpich --run-unsupported

env_mach_pes is attached: 48 nodes x 128 cores per node (Eiger is almost the same machine as Derecho). I set up the core distribution as shown in the attached 48_nodes_eiger.txt.
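
For reference, not the exact attached layout, but a minimal sketch of how a uniform 48 x 128 = 6144-task decomposition could be set from the case directory with CIME's xmlchange, assuming every component uses all tasks with one thread each:

Code:
# give every component the full 6144 MPI tasks (48 nodes x 128 cores/node),
# all rooted at PE 0 with a single thread -- a uniform layout, not necessarily
# the one in the attached env_mach_pes
./xmlchange NTASKS=6144
./xmlchange ROOTPE=0
./xmlchange NTHRDS=1

# rebuild the decomposition, then inspect the layout and the launch command
./case.setup --reset
./pelayout
./preview_run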



Describe your problem or question:

The simulation consistently fails to execute. I've tried fewer nodes, but then I run into wall-clock limits for a 20-day test. I used the following scaling estimate to settle on 48 nodes.

The ne30pg3_ne30pg3_mg17 grid executes successfully on 3 nodes (384 cores).
ne120pg3_ne120pg3_mt13 is roughly 4x higher resolution in each horizontal direction (16x more columns), so I multiplied the 3 nodes by 16 to get 48 nodes.

I get the error below.

Thanks,
-Jonathan


Code:
jbuzan@eiger-ln002:/capstor/scratch/cscs/jbuzan/cesm3_0_beta01/cases/intel_cesm3_0_beta01_F2000climo_x025_O6144_01 [19:18:46] $ cat /capstor/scratch/cscs/jbuzan/cesm3_0_beta01/output/intel_cesm3_0_beta01_F2000climo_x025_O6144_01/run/cesm.log.3303936.240902-185339
Mon Sep  2 18:56:58 2024: [PE_3896]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=94, pes_this_node=128, timeout=180 secs
Mon Sep  2 18:56:59 2024: [PE_5872]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=107, pes_this_node=128, timeout=180 secs
Mon Sep  2 18:56:59 2024: [PE_2169]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=104, pes_this_node=128, timeout=180 secs
Mon Sep  2 18:56:59 2024: [PE_624]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=103, pes_this_node=128, timeout=180 secs
Mon Sep  2 18:56:59 2024: [PE_1592]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=105, pes_this_node=128, timeout=180 secs
Mon Sep  2 18:56:59 2024: [PE_5936]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=97, pes_this_node=128, timeout=180 secs
Mon Sep  2 18:57:00 2024: [PE_3696]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=103, pes_this_node=128, timeout=180 secs
Mon Sep  2 18:57:00 2024: [PE_378]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=98, pes_this_node=128, timeout=180 secs
Mon Sep  2 18:57:00 2024: [PE_3706]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=103, pes_this_node=128, timeout=180 secs
Mon Sep  2 18:57:00 2024: [PE_3706]:_pmi_mmap_init:Failed to setup PMI mmap.Mon Sep  2 18:57:00 2024: [PE_3706]:globals_init:_pmi_mmap_init returned -1
MPICH ERROR [Rank 0] [job id unknown] [Mon Sep  2 18:57:00 2024] [nid001420] - Abort(1092879) (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(170):
MPID_Init(441).......:
MPIR_pmi_init(110)...: PMI_Init returned 1

aborting job:
Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(170):
MPID_Init(441).......:
MPIR_pmi_init(110)...: PMI_Init returned 1
Mon Sep  2 18:57:00 2024: [PE_4922]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=91, pes_this_node=128, timeout=180 secs
srun: error: nid001420: task 3706: Exited with exit code 255
srun: Terminating StepId=3303936.0
slurmstepd: error: *** STEP 3303936.0 ON nid001117 CANCELLED AT 2024-09-02T18:57:00 ***
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source            
libpthread-2.31.s  00001468BCEBE910  Unknown               Unknown  Unknown
libpmi.so.0.6.0    00001468B51A7755  Unknown               Unknown  Unknown
libpmi.so.0.6.0    00001468B51A7855  Unknown               Unknown  Unknown
libpmi.so.0        00001468B51A7C14  _pmi_mmap_init        Unknown  Unknown
libpmi.so.0        00001468B51A252C  _pmi_init             Unknown  Unknown
libpmi.so.0        00001468B51AF706  PMI2_Init             Unknown  Unknown
libmpi_intel.so.1  00001468B9632A11  Unknown               Unknown  Unknown
libmpi_intel.so.1  00001468B96384DD  Unknown               Unknown  Unknown
libmpi_intel.so.1  00001468B80C3D7E  Unknown               Unknown  Unknown
libmpi_intel.so.1  00001468B80C4304  PMPI_Init_thread      Unknown  Unknown
libmpifort_intel.  00001468B9FD392F  MPI_INIT_THREAD       Unknown  Unknown
cesm.exe           0000000000436E1B  MAIN__                     40  esmApp.F90
cesm.exe           0000000000425DCD  Unknown               Unknown  Unknown
libc-2.31.so       00001468B73C724D  __libc_start_main     Unknown  Unknown
cesm.exe           0000000000425CFA  Unknown               Unknown  Unknown
forrtl: error (78): process killed (SIGTERM)
 

Attachments

  • 48_nodes_eiger.txt
    7.9 KB · Views: 1

jedwards

CSEG and Liaisons
Staff member
We do not currently support the ne120 grid in cesm3_0 - it's in the plans but not ready yet.
 

Jbuzan

Jonathan R. Buzan
Member
Hi Jim,

Thanks for the quick reply. Is there a 0.5-degree grid that might work?

Cheers,
-Jonathan
 

jedwards

CSEG and Liaisons
Staff member
At this time we are only supporting the ne30pg3 1-degree grid; higher-resolution grids will follow.
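
For anyone following along, a minimal sketch of the same case at the currently supported 1-degree grid, reusing the options from the original create_newcase command; the case name is hypothetical, and --run-unsupported is carried over from that command (it may not be required at this resolution):

Code:
# hypothetical case name; compset, machine, and driver options carried over
# from the original ne120 command above
./create_newcase \
  --case /capstor/scratch/cscs/jbuzan/cesm3_0_beta01/cases/intel_cesm3_0_beta01_F2000climo_ne30_01 \
  --compiler intel --compset F2000climo --res ne30pg3_ne30pg3_mg17 \
  --mach eiger --driver nuopc --mpilib mpich --run-unsupported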
 