Sorry, I should've replied earlier.
Could you maybe elaborate a bit on your last line? Any idea how we could diagnose/address that issue?
For reading data, POP typically reads a file on a single task and then broadcasts the contents of the file out to the other tasks. You can see this in the source code, where only one task should enter the if (my_task == master_task) then block. The error I commented on above:
Code:
forrtl: No such file or directory
forrtl: severe (29): file not found, unit 97, file /mmfs1/data/heinriea/cime_box/scratch/test_BCO2/run/fort.97
It looks like there was a call to open unit 97 without a file name; when you call open() without a file name, the default name used is fort.N (where N is the unit number being opened). The ocean log shows
Code:
ovf_read_restart unit (mu) = 97 file name = ./b.e21.B1850.f09_g17.CMIP6-piControl.001_v2.pop.ro.0501-01-01-0000
which is printed at line 2198 (one of the highlighted lines in the link above); it makes me think that one of the tasks that was not master_task was calling open(mu), although even if there was confusion about which tasks should call open(), I don't see how that would be possible because the call to open() would still pass the file argument...
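In case it helps, here is a rough sketch of the read-on-one-task-then-broadcast pattern I'm describing. This is not the actual POP code -- the file name and variable names are made up, and POP uses its own broadcast wrappers (the broadcast.F90 routines in the traceback below) rather than calling MPI_Bcast directly:
Code:
program read_and_broadcast
   use mpi
   implicit none
   integer, parameter :: master_task = 0
   integer :: my_task, ierr, mu
   integer :: some_value

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, my_task, ierr)

   some_value = 0
   if (my_task == master_task) then
      mu = 97
      ! If the file= argument were missing here (i.e. just open(mu)),
      ! Fortran would fall back to the default name fort.97 -- the name
      ! that shows up in your "file not found" error
      open(mu, file='./some_restart_file', form='formatted', status='old')
      read(mu, *) some_value
      close(mu)
   endif

   ! Every task, not just master_task, has to reach the broadcast
   call MPI_Bcast(some_value, 1, MPI_INTEGER, master_task, MPI_COMM_WORLD, ierr)

   call MPI_Finalize(ierr)
end program read_and_broadcast
The point is just that master_task should be the only task touching the file, while every task has to participate in the broadcast afterwards.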
Even if that particular path didn't lead to anything useful, there are a few other errors in your cesm.log file. Some tasks are reporting an error with MPI_Bcast in the CIME stack:
Code:
libmpi.so.40.30.1 000015554C3A8597 MPI_Bcast Unknown Unknown
libmpi_mpifh.so.4 000015554C6CEF04 pmpi_bcast Unknown Unknown
cesm.exe 0000000002D4556F shr_mpi_mod_mp_sh 568 shr_mpi_mod.F90
cesm.exe 0000000002C3C9D5 seq_infodata_mod_ 2549 seq_infodata_mod.F90
cesm.exe 0000000000439971 component_mod_mp_ 285 component_mod.F90
cesm.exe 0000000000427E96 cime_comp_mod_mp_ 1249 cime_comp_mod.F90
cesm.exe 00000000004365A9 MAIN__ 114 cime_driver.F90
cesm.exe 0000000000418AA2 Unknown Unknown Unknown
libc-2.28.so 000015554B9F7493 __libc_start_main Unknown Unknown
cesm.exe 00000000004189AE Unknown Unknown Unknown
while others are reporting issues with MPI_Bcast but coming through the POP code:
Code:
libmpi.so.40.30.1 000015554C3A8597 MPI_Bcast Unknown Unknown
libmpi_mpifh.so.4 000015554C6CEF04 pmpi_bcast Unknown Unknown
cesm.exe 0000000002377D06 broadcast_mp_broa 204 broadcast.F90
cesm.exe 00000000022240EF overflows_mp_init 2384 overflows.F90
cesm.exe 00000000024B9069 initial_mp_pop_in 249 initial.F90
cesm.exe 0000000002317C97 pop_initmod_mp_po 102 POP_InitMod.F90
cesm.exe 00000000021F92CC ocn_comp_mct_mp_o 255 ocn_comp_mct.F90
cesm.exe 0000000000439584 component_mod_mp_ 267 component_mod.F90
cesm.exe 0000000000427E96 cime_comp_mod_mp_ 1249 cime_comp_mod.F90
cesm.exe 00000000004365A9 MAIN__ 114 cime_driver.F90
cesm.exe 0000000000418AA2 Unknown Unknown Unknown
libc-2.28.so 000015554B9F7493 __libc_start_main Unknown Unknown
cesm.exe 00000000004189AE Unknown Unknown Unknown
So I don't have anything specific to recommend, but this still really looks like a system issue rather than a POP problem. You've mentioned being able to run other simple cases -- one possibility is that you aren't giving this case enough nodes and are running into memory / resource allocation issues. Can you run the following commands and share the output? They should tell you how CESM is distributed across the machine.
Code:
$ cd [case root]
$ ./pelayout
For example, on the NCAR supercomputer the results are
Code:
$ ./pelayout
Comp NTASKS NTHRDS ROOTPE
CPL : 576/ 1; 0
ATM : 576/ 1; 0
LND : 468/ 1; 0
ICE : 72/ 1; 468
OCN : 144/ 1; 576
ROF : 468/ 1; 0
GLC : 576/ 1; 0
WAV : 36/ 1; 540
ESP : 1/ 1; 0
Each node is given 36 tasks, so 16 nodes are dedicated to the coupler, atmosphere, land, sea-ice, runoff, glacier, and wave models (the land and sea-ice run concurrently, with the land using 13 nodes and sea-ice another 2; runoff runs on the same nodes as the land, and the wave model uses the remaining node because it does not scale well). Meanwhile, the ocean has 4 nodes to itself.
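And if it does turn out that the case simply needs more resources, the usual way to change the layout is with xmlchange from the case directory before rebuilding -- something like the following, where the task counts are just placeholders and not a specific recommendation for your machine:
Code:
$ cd [case root]
$ ./xmlchange NTASKS_OCN=288,ROOTPE_OCN=576
$ ./case.setup --reset
$ ./case.build --clean-all
$ ./case.build
But I'd hold off on changing anything until we see the pelayout output from your case.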