
Error during run for BCO2x4cmip6 case

Eliot Heinrich
New Member
Hello,

I am testing a new port of CESM2.1.3/CIME and am running into some issues with some compsets. One of our users would like to use the BCO2x4cmip6 compset, but the case is failing to run. I see several warnings and errors in the output, but I'm not sure which are relevant or how to address them; I have attached the cesm.log.

I created the case with ./create_newcase --case test_BCO2 --compset BCO2x4cmip6 --res f09_g17 and did not make any modifications before building and running it.

Thanks in advance for any insight!

Best,
Eliot
 

Attachments

  • cesm.log.1438586.230915-214031.txt
    550.3 KB · Views: 9
  • config_batch.txt
    23.2 KB · Views: 0
  • config_compilers.txt
    41.3 KB · Views: 1
  • config_inputdata.txt
    1.4 KB · Views: 0
  • config_machines.txt
    108.7 KB · Views: 2
  • version_info.txt
    461.5 KB · Views: 0

jedwards

CSEG and Liaisons
Staff member
I think that the POP ocean model is expecting a file it cannot find - check the end of the ocn.log file.
 

Eliot Heinrich
New Member
Here is the ocn.log file; it looks like the b.e21.B1850.f09_g17.CMIP6-piControl.001_v2.pop.ro.0501-01-01-00000 file is present, and the log doesn't seem to complain about anything else.
 

Attachments

  • ocn.log.1438586.230915-214031.txt
    3.7 KB · Views: 5

jedwards

CSEG and Liaisons
Staff member
The error from the cesm log is:
forrtl: No such file or directory
forrtl: severe (29): file not found, unit 97, file /mmfs1/data/heinriea/cime_box/scratch/test_BCO2/run/fort.97

In the ocn log you have:
ovf_read_restart unit (mu) = 97 file name = ./b.e21.B1850.f09_g17.CMIP6-piControl.001_v2.pop.ro.0501-01-01-00000

Just going on a hunch here - is the filesystem available to all of the compute tasks?
 

Eliot Heinrich
New Member
Thanks for your help; it looks like the restart file can be read from each of our compute nodes where the tasks were run, so it doesn't seem to be that.
 

mlevy

Michael Levy
CSEG and Liaisons
Staff member
I'm seeing a lot of tracebacks in your cesm.log file that look like

Code:
cesm.exe           0000000002377D06  broadcast_mp_broa         204  broadcast.F90
cesm.exe           00000000022240EF  overflows_mp_init        2384  overflows.F90
cesm.exe           00000000024B9069  initial_mp_pop_in         249  initial.F90
cesm.exe           0000000002317C97  pop_initmod_mp_po         102  POP_InitMod.F90
cesm.exe           00000000021F92CC  ocn_comp_mct_mp_o         255  ocn_comp_mct.F90
cesm.exe           0000000000439584  component_mod_mp_         267  component_mod.F90
cesm.exe           0000000000427E96  cime_comp_mod_mp_        1249  cime_comp_mod.F90
cesm.exe           00000000004365A9  MAIN__                    114  cime_driver.F90

and line 204 in broadcast.F90 is a straightforward MPI_BCAST() of an integer.

Code:
call MPI_BCAST(scalar, 1, MPI_INTEGER, root_pe, MPI_COMM_OCN,ierr)

I see you added support for andromeda in the config files, and I have a few questions:

1. have you successfully run other parallel codes with the compiler / mpi libraries you used to build CESM?
2. why do you load both gcc/9.2.0 and intel/2020? (and which specific version of the intel compiler is being used?)
3. are you using mpich for this run? It looks like config_machines supports it, but config_compilers hard-codes paths to openmpi

Nothing jumps out as obviously wrong, but at first read it seems like the issue is in the MPI library. I'm also confused by the following line in your cesm.log file:

Code:
forrtl: severe (29): file not found, unit 97, file /mmfs1/data/heinriea/cime_box/scratch/test_BCO2/run/fort.97

in your ocean log file, it looks like unit 97 is the restart file:

Code:
 ovf_read_restart  unit (mu) =    97 file name = ./b.e21.B1850.f09_g17.CMIP6-piControl.001_v2.pop.ro.0501-01-01-0000

but maybe a non-master task is trying to access unit 97 and getting the default fort.97 file name?
 

Eliot Heinrich
New Member
Hi Michael,

Thanks for making those points; our port is definitely still a work in progress haha.

1. Yes, we are able to run some other simple cases; for instance, we can run and validate the moist Held-Suarez case following Moist Held-Suarez | Community Earth System Model. It runs without issue and produces sensible data, even when split across several nodes.
2. gcc is a dependency of our CMake module; I don't believe it is actually being used during case building. We are using icc 19.1.2.275.
3. We are not using mpich and haven't tried it yet, so that's currently a placeholder.

Could you maybe elaborate a bit on your last line? Any idea how we could diagnose/address that issue?
 

Eliot Heinrich
New Member
Just wanted to check and see if there was any update; I built a new, clean version of openMPI but ran into the same issue.
 

jedwards

CSEG and Liaisons
Staff member
Wild guess - the ocean .ro file is binary - is the endian setting correct for your system?
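For reference, endianness for unformatted Fortran reads is usually set either at build time or per open() statement; here is a minimal sketch assuming the Intel compiler (the file name and unit number are placeholders, and whether POP's build relies on a compiler flag or the non-standard convert= specifier depends on how the machine is configured):

Code:
! Ways to force big-endian unformatted I/O with ifort (sketch):
!   * build-time flag:       ifort -convert big_endian ...
!   * run-time environment:  export F_UFMTENDIAN=big
!   * per-file (non-standard) convert= specifier, shown below
program endian_check
  implicit none
  integer :: ival
  ! 'restart.bin' is a placeholder name, not the actual POP restart file
  open(unit=97, file='restart.bin', form='unformatted', status='old', &
       convert='big_endian')
  read(97) ival
  print *, 'first integer in file =', ival
  close(97)
end program endian_check

If the file was written with the wrong endianness for how the executable was built, the first read typically fails or returns garbage values rather than a missing-file error, so this may or may not match the symptom above.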
 

mlevy

Michael Levy
CSEG and Liaisons
Staff member
Sorry, I should've replied earlier.

Could you maybe elaborate a bit on your last line? Any idea how we could diagnose/address that issue?

For reading data, POP typically reads a file on a single task and then broadcasts the content of the file out. You can see this in the source code, where only one task should enter the if (my_task == master_task) then block. The error I commented on above:

Code:
forrtl: No such file or directory
forrtl: severe (29): file not found, unit 97, file /mmfs1/data/heinriea/cime_box/scratch/test_BCO2/run/fort.97

It looks like there was a call to open unit 97 without a file name; when you call open() without a file name, the default name used is fort.N (where N is the unit number being opened). The ocean log shows

Code:
ovf_read_restart  unit (mu) =    97 file name = ./b.e21.B1850.f09_g17.CMIP6-piControl.001_v2.pop.ro.0501-01-01-0000

which is printed at line 2198 (one of the highlighted lines in the link above); this makes me think that one of the tasks that was not master_task was calling open(mu), although even if there was confusion about which tasks should call open(), I don't see how that would be possible because the call to open() would still pass the file argument...
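To make both pieces of that concrete (the read-on-master-then-broadcast pattern, and the fort.N default file name), here is a minimal standalone sketch; the unit number matches the log, but the file name, variable names, and communicator are illustrative rather than POP's:

Code:
program master_read_bcast
  use mpi
  implicit none
  integer, parameter :: mu = 97            ! unit number from the log
  integer, parameter :: master_task = 0
  integer :: ierr, my_task, scalar

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, my_task, ierr)

  if (my_task == master_task) then
     ! Only the master task opens and reads the (hypothetical) file.
     open(unit=mu, file='overflow_restart.bin', form='unformatted', status='old')
     read(mu) scalar
     close(mu)
  end if

  ! All other tasks get the value from the broadcast, not from the file.
  call MPI_Bcast(scalar, 1, MPI_INTEGER, master_task, MPI_COMM_WORLD, ierr)

  ! If open() were ever called WITHOUT a file= argument, e.g.
  !    open(unit=mu, form='unformatted', status='old')
  ! the runtime would look for the default name fort.97, which is exactly
  ! the path reported in the cesm.log error above.

  call MPI_Finalize(ierr)
end program master_read_bcast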



Even if that particular path didn't lead to anything useful, there are a few other errors in your cesm.log file. Some tasks are reporting an error with MPI_Bcast in the CIME stack:

Code:
libmpi.so.40.30.1  000015554C3A8597  MPI_Bcast             Unknown  Unknown
libmpi_mpifh.so.4  000015554C6CEF04  pmpi_bcast            Unknown  Unknown
cesm.exe           0000000002D4556F  shr_mpi_mod_mp_sh         568  shr_mpi_mod.F90
cesm.exe           0000000002C3C9D5  seq_infodata_mod_        2549  seq_infodata_mod.F90
cesm.exe           0000000000439971  component_mod_mp_         285  component_mod.F90
cesm.exe           0000000000427E96  cime_comp_mod_mp_        1249  cime_comp_mod.F90
cesm.exe           00000000004365A9  MAIN__                    114  cime_driver.F90
cesm.exe           0000000000418AA2  Unknown               Unknown  Unknown
libc-2.28.so       000015554B9F7493  __libc_start_main     Unknown  Unknown
cesm.exe           00000000004189AE  Unknown               Unknown  Unknown

while others are reporting issues with MPI_Bcast but coming through the POP code:

Code:
libmpi.so.40.30.1  000015554C3A8597  MPI_Bcast             Unknown  Unknown
libmpi_mpifh.so.4  000015554C6CEF04  pmpi_bcast            Unknown  Unknown
cesm.exe           0000000002377D06  broadcast_mp_broa         204  broadcast.F90
cesm.exe           00000000022240EF  overflows_mp_init        2384  overflows.F90
cesm.exe           00000000024B9069  initial_mp_pop_in         249  initial.F90
cesm.exe           0000000002317C97  pop_initmod_mp_po         102  POP_InitMod.F90
cesm.exe           00000000021F92CC  ocn_comp_mct_mp_o         255  ocn_comp_mct.F90
cesm.exe           0000000000439584  component_mod_mp_         267  component_mod.F90
cesm.exe           0000000000427E96  cime_comp_mod_mp_        1249  cime_comp_mod.F90
cesm.exe           00000000004365A9  MAIN__                    114  cime_driver.F90
cesm.exe           0000000000418AA2  Unknown               Unknown  Unknown
libc-2.28.so       000015554B9F7493  __libc_start_main     Unknown  Unknown
cesm.exe           00000000004189AE  Unknown               Unknown  Unknown

So I don't have anything specific to recommend, but this still really looks like a system issue rather than a POP problem. You've mentioned being able to run other simple cases -- one possibility is that the case needs more nodes than you are currently giving it and you are running into memory / resource allocation issues. Can you run the following commands and share the output? It should tell you how CESM is distributed across the machine.

Code:
$ cd [case root]
$ ./pelayout

For example, on the NCAR supercomputer the results are

Code:
$ ./pelayout
Comp  NTASKS  NTHRDS  ROOTPE
CPL :    576/     1;      0
ATM :    576/     1;      0
LND :    468/     1;      0
ICE :     72/     1;    468
OCN :    144/     1;    576
ROF :    468/     1;      0
GLC :    576/     1;      0
WAV :     36/     1;    540
ESP :      1/     1;      0

Each node is given 36 tasks, so 16 nodes are dedicated to the coupler, atmosphere, land, sea-ice, runoff, glacier, and wave models (the land and sea-ice run concurrently, with the land using 13 nodes (468 tasks / 36 tasks per node) while sea-ice uses the next 2; runoff runs on the same nodes as the land, and the wave model uses the single remaining node because it does not scale well). Meanwhile, the ocean has 4 nodes to itself.
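If it helps to double-check that arithmetic, here is a small standalone sketch with the example layout above hard-coded (the 36 tasks-per-node value is the node size stated above):

Code:
program node_layout
  implicit none
  integer, parameter :: tasks_per_node = 36   ! node size assumed in the example
  integer, parameter :: ncomp = 9
  character(len=3), parameter :: comp(ncomp) = &
       [character(len=3) :: 'CPL','ATM','LND','ICE','OCN','ROF','GLC','WAV','ESP']
  integer, parameter :: ntasks(ncomp) = [576,576,468,72,144,468,576,36,1]
  integer, parameter :: rootpe(ncomp) = [0,0,0,468,576,0,0,540,0]
  integer :: i, first_node, last_node

  do i = 1, ncomp
     first_node = rootpe(i) / tasks_per_node
     last_node  = (rootpe(i) + ntasks(i) - 1) / tasks_per_node
     write(*,'(a,a,i3,a,i3,a,i3,a)') comp(i), ' : nodes ', first_node, ' -', &
          last_node, '  (', last_node - first_node + 1, ' nodes)'
  end do
end program node_layout

For the layout above this prints, for example, LND on nodes 0 through 12 (13 nodes), ICE on nodes 13-14, WAV on node 15, and OCN on nodes 16-19.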
 

Eliot Heinrich
New Member
Adding additional nodes to the ocean model seems to fix the issue; the default output of pelayout was
Code:
$ ./pelayout 
Comp  NTASKS  NTHRDS  ROOTPE
CPL :    384/     1;      0
ATM :    384/     1;      0
LND :    192/     1;      0
ICE :    192/     1;    192
OCN :     48/     1;    384
ROF :    192/     1;      0
GLC :    384/     1;      0
WAV :    384/     1;      0
ESP :      1/     1;      0
After doing ./xmlchange NTASKS_OCN=-4 (a negative NTASKS value is interpreted as a number of whole nodes), the output is
Code:
$ ./pelayout 
Comp  NTASKS  NTHRDS  ROOTPE
CPL :    384/     1;      0
ATM :    384/     1;      0
LND :    192/     1;      0
ICE :    192/     1;    192
OCN :    192/     1;    384
ROF :    192/     1;      0
GLC :    384/     1;      0
WAV :    384/     1;      0
ESP :      1/     1;      0
and the job runs without issue, so it does seem to have been a resource-allocation problem. How should we avoid this in the future?
 

jedwards

CSEG and Liaisons
Staff member
Now that you have it running, you should load balance it. Once you have completed a run, look in the timing directory for the file
cesm_timing.*

In that file you will find a table similar to:

TOT Run Time: 25884.178 seconds 37.030 seconds/mday 6.39 myears/wday
CPL Run Time: 606.974 seconds 0.868 seconds/mday 272.60 myears/wday
CPL COMM Time: 5253.999 seconds 7.516 seconds/mday 31.49 myears/wday
ATM Run Time: 20082.937 seconds 28.731 seconds/mday 8.24 myears/wday
LND Run Time: 4976.426 seconds 7.119 seconds/mday 33.25 myears/wday
ICE Run Time: 2974.062 seconds 4.255 seconds/mday 55.63 myears/wday
OCN Run Time: 16479.826 seconds 23.576 seconds/mday 10.04 myears/wday
ROF Run Time: 287.386 seconds 0.411 seconds/mday 575.75 myears/wday
GLC Run Time: 1.013 seconds 0.001 seconds/mday 163338.52 myears/wday
WAV Run Time: 928.616 seconds 1.328 seconds/mday 178.18 myears/wday
ESP Run Time: 0.000 seconds 0.000 seconds/mday 0.00 myears/wday


Ideally you want the ice and lnd+rof to use about the same amount of time, so in the example here you might want to shift some of the tasks from the lnd model to the ice model, e.g.:
./xmlchange NTASKS_LND=128,NTASKS_ROF=128,NTASKS_ICE=256,ROOTPE_ICE=128

And you want the ocn time to be about the same as atm+lnd+cpl, so again referring to the example you could either reduce the number of ocn tasks or increase the tasks used by all the other components.

Once you are satisfied with a particular pelayout, you can save it to the file cesm/cime_config/config_pes.xml in the source tree, following the regex pattern used in that file to identify your compset, resolution, and machine. Future cases using the same compset, resolution, and machine will then start with the tuned pelayout.
 