Code:
$ ./describe_version
ccs_config at tag ccs_config_cesm0.0.109
M HEAD detached at 797acd7
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
modified: machines/config_batch.xml
modified: machines/config_machines.xml
no changes added to commit (use "git add" and/or "git commit -a")
share at tag share1.0.19
cime at tag cime6.0.246
mct at tag MCT_2.11.0
mpi-serial at tag MPIserial_2.5.0
cam at tag cam6_3_162
M HEAD detached at ab476f9b
Changes not staged for commit:
(use "git add/rm <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
deleted: cime_config/testdefs/testmods_dirs/cam/outfrq9s_waccm_ma_mam4/shell_commands
deleted: cime_config/testdefs/testmods_dirs/cam/outfrq9s_waccm_ma_mam4/user_nl_cam
deleted: cime_config/testdefs/testmods_dirs/cam/outfrq9s_waccm_ma_mam4/user_nl_clm
no changes added to commit (use "git add" and/or "git commit -a")
ww3 at tag ww3i_0.0.2
rtm at tag rtm1_0_79
pysect at tag 3.2.2
mosart at tag mosart1_0_49
mizuroute at tag cesm-coupling.n02_v2.1.2
fms at tag fi_240516
parallelio at tag pio2_6_2
cdeps at tag cdeps1.0.37
cmeps at tag cmeps0.14.63
cice at tag cesm_cice6_5_0_9
cism at tag cismwrap_2_2_001
clm at tag ctsm5.2.007
mom at tag mi_240522
testfails = 0, local mods = 2, needs updates 0
The submodules labeled with 'M' above are not in a clean state.
The following are options for how to proceed:
(1) Go into each submodule which is not in a clean state and issue a 'git status'
Either revert or commit your changes so that the submodule is in a clean state.
(2) use the --force option to git-fleximod
(3) you can name the particular submodules to update using the git-fleximod command line
(4) As a last resort you can remove the submodule (via 'rm -fr [directory]')
then rerun git-fleximod update.
./create_newcase --case /capstor/scratch/cscs/jbuzan/cesm3_0_beta01/cases/intel_cesm3_0_beta01_F2000climo_x025_O6144_01 --compiler intel --compset F2000climo --res ne120pg3_ne120pg3_mt13 --mach eiger --driver nuopc --mpilib mpich --run-unsupported
env_mach_pes is attached and shows the core distribution I set up: 48 nodes x 128 cores per node (Eiger is almost the same machine as Derecho).
Describe your problem or question:
The simulation always fails to execute. I've tried fewer nodes, but then I run into wallclock limits for a 20-day test. I arrived at 48 nodes using the following reasoning:
The ne30pg3_ne30pg3_mg17 grid runs successfully on 3 nodes (384 cores).
ne120pg3_ne120pg3_mt13 is approximately 4x4 higher resolution, so I multiplied the 3 nodes by 16 to get 48 nodes (6144 cores).
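The node estimate above can be written out as a short calculation. This is only a sketch of the arithmetic in this post. The function name `scaled_nodes` is hypothetical, and it assumes the cost scales purely with the number of horizontal grid columns, which may not hold in practice.

```python
# Sketch of the node-count estimate described above.
# Assumption: node count scales with the square of the horizontal refinement.
# `scaled_nodes` is a hypothetical helper, not part of CIME or CESM.

CORES_PER_NODE = 128  # Eiger: 128 cores per node (from the post)


def scaled_nodes(base_nodes: int, refinement_per_dim: int) -> int:
    """Scale a known-good node count by the 2-D grid refinement factor."""
    return base_nodes * refinement_per_dim ** 2


base_nodes = 3            # ne30pg3_ne30pg3_mg17 runs on 3 nodes
refinement = 120 // 30    # ne120 vs ne30: 4x finer in each horizontal direction

nodes = scaled_nodes(base_nodes, refinement)
total_cores = nodes * CORES_PER_NODE

print(f"nodes = {nodes}, total cores = {total_cores}")
```

The result matches the 48 nodes and 6144 cores used for this case.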
I get the error below.
Thanks,
-Jonathan
Code:
jbuzan@eiger-ln002:/capstor/scratch/cscs/jbuzan/cesm3_0_beta01/cases/intel_cesm3_0_beta01_F2000climo_x025_O6144_01 [19:18:46] $ cat /capstor/scratch/cscs/jbuzan/cesm3_0_beta01/output/intel_cesm3_0_beta01_F2000climo_x025_O6144_01/run/cesm.log.3303936.240902-185339
Mon Sep 2 18:56:58 2024: [PE_3896]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=94, pes_this_node=128, timeout=180 secs
Mon Sep 2 18:56:59 2024: [PE_5872]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=107, pes_this_node=128, timeout=180 secs
Mon Sep 2 18:56:59 2024: [PE_2169]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=104, pes_this_node=128, timeout=180 secs
Mon Sep 2 18:56:59 2024: [PE_624]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=103, pes_this_node=128, timeout=180 secs
Mon Sep 2 18:56:59 2024: [PE_1592]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=105, pes_this_node=128, timeout=180 secs
Mon Sep 2 18:56:59 2024: [PE_5936]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=97, pes_this_node=128, timeout=180 secs
Mon Sep 2 18:57:00 2024: [PE_3696]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=103, pes_this_node=128, timeout=180 secs
Mon Sep 2 18:57:00 2024: [PE_378]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=98, pes_this_node=128, timeout=180 secs
Mon Sep 2 18:57:00 2024: [PE_3706]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=103, pes_this_node=128, timeout=180 secs
Mon Sep 2 18:57:00 2024: [PE_3706]:_pmi_mmap_init:Failed to setup PMI mmap.
Mon Sep 2 18:57:00 2024: [PE_3706]:globals_init:_pmi_mmap_init returned -1
MPICH ERROR [Rank 0] [job id unknown] [Mon Sep 2 18:57:00 2024] [nid001420] - Abort(1092879) (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(170):
MPID_Init(441).......:
MPIR_pmi_init(110)...: PMI_Init returned 1
aborting job:
Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(170):
MPID_Init(441).......:
MPIR_pmi_init(110)...: PMI_Init returned 1
Mon Sep 2 18:57:00 2024: [PE_4922]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=91, pes_this_node=128, timeout=180 secs
srun: error: nid001420: task 3706: Exited with exit code 255
srun: Terminating StepId=3303936.0
slurmstepd: error: *** STEP 3303936.0 ON nid001117 CANCELLED AT 2024-09-02T18:57:00 ***
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libpthread-2.31.s 00001468BCEBE910 Unknown Unknown Unknown
libpmi.so.0.6.0 00001468B51A7755 Unknown Unknown Unknown
libpmi.so.0.6.0 00001468B51A7855 Unknown Unknown Unknown
libpmi.so.0 00001468B51A7C14 _pmi_mmap_init Unknown Unknown
libpmi.so.0 00001468B51A252C _pmi_init Unknown Unknown
libpmi.so.0 00001468B51AF706 PMI2_Init Unknown Unknown
libmpi_intel.so.1 00001468B9632A11 Unknown Unknown Unknown
libmpi_intel.so.1 00001468B96384DD Unknown Unknown Unknown
libmpi_intel.so.1 00001468B80C3D7E Unknown Unknown Unknown
libmpi_intel.so.1 00001468B80C4304 PMPI_Init_thread Unknown Unknown
libmpifort_intel. 00001468B9FD392F MPI_INIT_THREAD Unknown Unknown
cesm.exe 0000000000436E1B MAIN__ 40 esmApp.F90
cesm.exe 0000000000425DCD Unknown Unknown Unknown
libc-2.31.so 00001468B73C724D __libc_start_main Unknown Unknown
cesm.exe 0000000000425CFA Unknown Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
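To make the barrier warnings easier to compare, here is a minimal sketch that tallies them per PE. The sample lines are copied from the log above. The function name `summarize` is hypothetical, and the `node_index` value assumes ranks are placed on nodes in contiguous blocks of `pes_this_node`, which I have not verified for this srun layout.

```python
import re

# A few warning lines copied verbatim from the cesm.log above.
LOG = """\
Mon Sep 2 18:56:58 2024: [PE_3896]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=94, pes_this_node=128, timeout=180 secs
Mon Sep 2 18:56:59 2024: [PE_5872]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=107, pes_this_node=128, timeout=180 secs
Mon Sep 2 18:57:00 2024: [PE_3696]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=103, pes_this_node=128, timeout=180 secs
Mon Sep 2 18:57:00 2024: [PE_3706]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=103, pes_this_node=128, timeout=180 secs
"""

PATTERN = re.compile(
    r"\[PE_(\d+)\]:_pmi_mmap_tmp: Warning bootstrap barrier failed: "
    r"num_syncd=(\d+), pes_this_node=(\d+)"
)


def summarize(log_text):
    """Return one record per warning: PE, assumed node index, missing ranks."""
    rows = []
    for match in PATTERN.finditer(log_text):
        pe, synced, per_node = (int(g) for g in match.groups())
        rows.append({
            "pe": pe,
            # Assumption: block placement of ranks, per_node ranks per node.
            "node_index": pe // per_node,
            # Ranks on this node that had not reached the barrier at timeout.
            "missing": per_node - synced,
        })
    return rows


rows = summarize(LOG)
for row in rows:
    print(row)
```

Each warning reports fewer than 128 ranks synchronized on its node when the 180-second timeout expired. The `missing` column shows how many ranks were still outstanding.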