
Error during run for BCO2x4cmip6 case

Eliot Heinrich
New Member
Hello,

I am testing a new port of CESM2.1.3/CIME and am running into some issues with some compsets. One of our users would like to use the BCO2x4cmip6 compset, but the case is failing to run. I see several warnings and errors in the output, but I'm not sure which are relevant or how to address them; I have attached the cesm.log.

I created the case with ./create_newcase --case test_BCO2 --compset BCO2x4cmip6 --res f09_g17 and did not make any modifications before building and running it.

Thanks in advance for any insight!

Best,
Eliot
 

Attachments

  • cesm.log.1438586.230915-214031.txt
    550.3 KB · Views: 9
  • config_batch.txt
    23.2 KB · Views: 0
  • config_compilers.txt
    41.3 KB · Views: 1
  • config_inputdata.txt
    1.4 KB · Views: 0
  • config_machines.txt
    108.7 KB · Views: 2
  • version_info.txt
    461.5 KB · Views: 0

jedwards

CSEG and Liaisons
Staff member
I think that the POP ocean model is expecting a file it cannot find - check the end of the ocn.log file.
 

Eliot Heinrich
New Member
Here is the ocn.log file; it looks like the b.e21.B1850.f09_g17.CMIP6-piControl.001_v2.pop.ro.0501-01-01-00000 file is present, and the log doesn't seem to complain about anything else.
 

Attachments

  • ocn.log.1438586.230915-214031.txt
    3.7 KB · Views: 5

jedwards

CSEG and Liaisons
Staff member
The error from the cesm log is:
forrtl: No such file or directory
forrtl: severe (29): file not found, unit 97, file /mmfs1/data/heinriea/cime_box/scratch/test_BCO2/run/fort.97

In the ocn log you have:
ovf_read_restart unit (mu) = 97 file name = ./b.e21.B1850.f09_g17.CMIP6-piControl.001_v2.pop.ro.0501-01-01-00000

Just going on a hunch here - is the filesystem available to all of the compute tasks?
 

Eliot Heinrich
New Member
Thanks for your help; it looks like the restart file can be read from each of our compute nodes where the tasks were run, so it doesn't seem to be that.
 

mlevy

Michael Levy
CSEG and Liaisons
Staff member
I'm seeing a lot of tracebacks in your cesm.log file that look like

Code:
cesm.exe           0000000002377D06  broadcast_mp_broa         204  broadcast.F90
cesm.exe           00000000022240EF  overflows_mp_init        2384  overflows.F90
cesm.exe           00000000024B9069  initial_mp_pop_in         249  initial.F90
cesm.exe           0000000002317C97  pop_initmod_mp_po         102  POP_InitMod.F90
cesm.exe           00000000021F92CC  ocn_comp_mct_mp_o         255  ocn_comp_mct.F90
cesm.exe           0000000000439584  component_mod_mp_         267  component_mod.F90
cesm.exe           0000000000427E96  cime_comp_mod_mp_        1249  cime_comp_mod.F90
cesm.exe           00000000004365A9  MAIN__                    114  cime_driver.F90

and line 204 in broadcast.F90 is a straightforward MPI_BCAST() of an integer.

Code:
call MPI_BCAST(scalar, 1, MPI_INTEGER, root_pe, MPI_COMM_OCN,ierr)

I see you added support for andromeda in the config files, and I have a few questions:

1. have you successfully run other parallel codes with the compiler / mpi libraries you used to build CESM?
2. why do you load both gcc/9.2.0 and intel/2020? (and which specific version of the intel compiler is being used?)
3. are you using mpich for this run? It looks like config_machines supports it, but config_compilers hard-codes paths to openmpi

Nothing jumps out as obviously wrong, but at first read it seems like the issue is in the MPI library. I'm also confused by the following line in your cesm.log file:

Code:
forrtl: severe (29): file not found, unit 97, file /mmfs1/data/heinriea/cime_box/scratch/test_BCO2/run/fort.97

in your ocean log file, it looks like unit 97 is the restart file:

Code:
 ovf_read_restart  unit (mu) =    97 file name = ./b.e21.B1850.f09_g17.CMIP6-piControl.001_v2.pop.ro.0501-01-01-0000

but maybe a non-master task is trying to access unit 97 and getting the default fort.97 file name?
 

Eliot Heinrich
New Member
Hi Michael,

Thanks for making those points; our port is definitely still a work in progress haha.

1. Yes, we are able to run some other simple cases; for instance, we can run and validate the moist Held-Suarez case following Moist Held-Suarez | Community Earth System Model. It runs without issue and produces sensible data, even when split across several nodes.
2. gcc is a dependency of our CMake module; I don't believe it is actually being used during case building. We are using icc 19.1.2.275.
3. We are not using mpich and haven't tried it yet, so that's currently a placeholder.

Could you maybe elaborate a bit on your last line? Any idea how we could diagnose/address that issue?
 

Eliot Heinrich
New Member
Just wanted to check and see if there was any update; I built a new, clean version of openMPI but ran into the same issue.
 

jedwards

CSEG and Liaisons
Staff member
Wild guess - the ocean .ro file is binary - is the endian setting correct for your system?
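For reference, endianness for unformatted Fortran reads is usually set either at build time or per open() statement; here is a minimal sketch assuming the Intel compiler (the file name and unit number are placeholders, and whether POP's build relies on a compiler flag or the non-standard convert= specifier depends on how the machine is configured):

Code:
! Ways to force big-endian unformatted I/O with ifort (sketch):
!   * build-time flag:       ifort -convert big_endian ...
!   * run-time environment:  export F_UFMTENDIAN=big
!   * per-file (non-standard) convert= specifier, shown below
program endian_check
  implicit none
  integer :: ival
  ! 'restart.bin' is a placeholder name, not the actual POP restart file
  open(unit=97, file='restart.bin', form='unformatted', status='old', &
       convert='big_endian')
  read(97) ival
  print *, 'first integer in file =', ival
  close(97)
end program endian_check

If the file was written with the wrong endianness for how the executable was built, the first read typically fails or returns garbage values rather than a missing-file error, so this may or may not match the symptom above.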
 

mlevy

Michael Levy
CSEG and Liaisons
Staff member
Sorry, I should've replied earlier.

Could you maybe elaborate a bit on your last line? Any idea how we could diagnose/address that issue?

For reading data, POP typically reads a file on a single task and then broadcasts the content of the file out. You can see this in the source code, where only one task should enter the if (my_task == master_task) then block. The error I commented on above:

Code:
forrtl: No such file or directory
forrtl: severe (29): file not found, unit 97, file /mmfs1/data/heinriea/cime_box/scratch/test_BCO2/run/fort.97

It looks like there was a call to open unit 97 without a file name; when you call open() without a file name, the default name used is fort.N (where N is the unit number being opened). The ocean log shows

Code:
ovf_read_restart  unit (mu) =    97 file name = ./b.e21.B1850.f09_g17.CMIP6-piControl.001_v2.pop.ro.0501-01-01-0000

which is printed at line 2198 (one of the highlighted lines in the link above); this makes me think that one of the tasks that was not master_task was calling open(mu), although even if there was confusion about which tasks should call open(), I don't see how that would be possible because the call to open() would still pass the file argument...
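To make both pieces of that concrete (the read-on-master-then-broadcast pattern, and the fort.N default file name), here is a minimal standalone sketch; the unit number matches the log, but the file name, variable names, and communicator are illustrative rather than POP's:

Code:
program master_read_bcast
  use mpi
  implicit none
  integer, parameter :: mu = 97            ! unit number from the log
  integer, parameter :: master_task = 0
  integer :: ierr, my_task, scalar

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, my_task, ierr)

  if (my_task == master_task) then
     ! Only the master task opens and reads the (hypothetical) file.
     open(unit=mu, file='overflow_restart.bin', form='unformatted', status='old')
     read(mu) scalar
     close(mu)
  end if

  ! All other tasks get the value from the broadcast, not from the file.
  call MPI_Bcast(scalar, 1, MPI_INTEGER, master_task, MPI_COMM_WORLD, ierr)

  ! If open() were ever called WITHOUT a file= argument, e.g.
  !    open(unit=mu, form='unformatted', status='old')
  ! the runtime would look for the default name fort.97, which is exactly
  ! the path reported in the cesm.log error above.

  call MPI_Finalize(ierr)
end program master_read_bcast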



Even if that particular path didn't lead to anything useful, there are a few other errors in your cesm.log file. Some tasks are reporting an error with MPI_Bcast in the CIME stack:

Code:
libmpi.so.40.30.1  000015554C3A8597  MPI_Bcast             Unknown  Unknown
libmpi_mpifh.so.4  000015554C6CEF04  pmpi_bcast            Unknown  Unknown
cesm.exe           0000000002D4556F  shr_mpi_mod_mp_sh         568  shr_mpi_mod.F90
cesm.exe           0000000002C3C9D5  seq_infodata_mod_        2549  seq_infodata_mod.F90
cesm.exe           0000000000439971  component_mod_mp_         285  component_mod.F90
cesm.exe           0000000000427E96  cime_comp_mod_mp_        1249  cime_comp_mod.F90
cesm.exe           00000000004365A9  MAIN__                    114  cime_driver.F90
cesm.exe           0000000000418AA2  Unknown               Unknown  Unknown
libc-2.28.so       000015554B9F7493  __libc_start_main     Unknown  Unknown
cesm.exe           00000000004189AE  Unknown               Unknown  Unknown

while others are reporting issues with MPI_Bcast but coming through the POP code:

Code:
libmpi.so.40.30.1  000015554C3A8597  MPI_Bcast             Unknown  Unknown
libmpi_mpifh.so.4  000015554C6CEF04  pmpi_bcast            Unknown  Unknown
cesm.exe           0000000002377D06  broadcast_mp_broa         204  broadcast.F90
cesm.exe           00000000022240EF  overflows_mp_init        2384  overflows.F90
cesm.exe           00000000024B9069  initial_mp_pop_in         249  initial.F90
cesm.exe           0000000002317C97  pop_initmod_mp_po         102  POP_InitMod.F90
cesm.exe           00000000021F92CC  ocn_comp_mct_mp_o         255  ocn_comp_mct.F90
cesm.exe           0000000000439584  component_mod_mp_         267  component_mod.F90
cesm.exe           0000000000427E96  cime_comp_mod_mp_        1249  cime_comp_mod.F90
cesm.exe           00000000004365A9  MAIN__                    114  cime_driver.F90
cesm.exe           0000000000418AA2  Unknown               Unknown  Unknown
libc-2.28.so       000015554B9F7493  __libc_start_main     Unknown  Unknown
cesm.exe           00000000004189AE  Unknown               Unknown  Unknown

So I don't have anything specific to recommend, but this still really looks like a system issue rather than a POP problem. You've mentioned being able to run other simple cases -- one possibility is that the case needs more nodes than you are currently giving it and you are running into memory / resource allocation issues. Can you run the following commands and share the output? It should tell you how CESM is distributed across the machine.

Code:
$ cd [case root]
$ ./pelayout

For example, on the NCAR supercomputer the results are

Code:
$ ./pelayout
Comp  NTASKS  NTHRDS  ROOTPE
CPL :    576/     1;      0
ATM :    576/     1;      0
LND :    468/     1;      0
ICE :     72/     1;    468
OCN :    144/     1;    576
ROF :    468/     1;      0
GLC :    576/     1;      0
WAV :     36/     1;    540
ESP :      1/     1;      0

Each node is given 36 tasks, so 16 nodes are dedicated to the coupler, atmosphere, land, sea-ice, runoff, glacier, and wave models (the land and sea-ice run concurrently, with the land using 13 nodes (468 tasks / 36 tasks per node) while sea-ice uses the next 2; runoff runs on the same nodes as the land, and the wave model uses the single remaining node because it does not scale well). Meanwhile, the ocean has 4 nodes to itself.
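If it helps to double-check that arithmetic, here is a small standalone sketch with the example layout above hard-coded (the 36 tasks-per-node value is the node size stated above):

Code:
program node_layout
  implicit none
  integer, parameter :: tasks_per_node = 36   ! node size assumed in the example
  integer, parameter :: ncomp = 9
  character(len=3), parameter :: comp(ncomp) = &
       [character(len=3) :: 'CPL','ATM','LND','ICE','OCN','ROF','GLC','WAV','ESP']
  integer, parameter :: ntasks(ncomp) = [576,576,468,72,144,468,576,36,1]
  integer, parameter :: rootpe(ncomp) = [0,0,0,468,576,0,0,540,0]
  integer :: i, first_node, last_node

  do i = 1, ncomp
     first_node = rootpe(i) / tasks_per_node
     last_node  = (rootpe(i) + ntasks(i) - 1) / tasks_per_node
     write(*,'(a,a,i3,a,i3,a,i3,a)') comp(i), ' : nodes ', first_node, ' -', &
          last_node, '  (', last_node - first_node + 1, ' nodes)'
  end do
end program node_layout

For the layout above this prints, for example, LND on nodes 0 through 12 (13 nodes), ICE on nodes 13-14, WAV on node 15, and OCN on nodes 16-19.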
 

Eliot Heinrich
New Member
Adding additional nodes to the ocean model seems to fix the issue; the default output of pelayout was
Code:
$ ./pelayout 
Comp  NTASKS  NTHRDS  ROOTPE
CPL :    384/     1;      0
ATM :    384/     1;      0
LND :    192/     1;      0
ICE :    192/     1;    192
OCN :     48/     1;    384
ROF :    192/     1;      0
GLC :    384/     1;      0
WAV :    384/     1;      0
ESP :      1/     1;      0
After doing ./xmlchange NTASKS_OCN=-4 (a negative NTASKS value is interpreted as a number of whole nodes), the output is
Code:
$ ./pelayout 
Comp  NTASKS  NTHRDS  ROOTPE
CPL :    384/     1;      0
ATM :    384/     1;      0
LND :    192/     1;      0
ICE :    192/     1;    192
OCN :    192/     1;    384
ROF :    192/     1;      0
GLC :    384/     1;      0
WAV :    384/     1;      0
ESP :      1/     1;      0
and the job runs without issue, so it does seem to have been a resource-allocation problem. How should we avoid this in the future?
 

jedwards

CSEG and Liaisons
Staff member
Now that you have it running, you should load balance it. Once you have completed a run, look in the timing directory for the file
cesm_timing.*

In that file you will find a table similar to:

TOT Run Time: 25884.178 seconds 37.030 seconds/mday 6.39 myears/wday
CPL Run Time: 606.974 seconds 0.868 seconds/mday 272.60 myears/wday
CPL COMM Time: 5253.999 seconds 7.516 seconds/mday 31.49 myears/wday
ATM Run Time: 20082.937 seconds 28.731 seconds/mday 8.24 myears/wday
LND Run Time: 4976.426 seconds 7.119 seconds/mday 33.25 myears/wday
ICE Run Time: 2974.062 seconds 4.255 seconds/mday 55.63 myears/wday
OCN Run Time: 16479.826 seconds 23.576 seconds/mday 10.04 myears/wday
ROF Run Time: 287.386 seconds 0.411 seconds/mday 575.75 myears/wday
GLC Run Time: 1.013 seconds 0.001 seconds/mday 163338.52 myears/wday
WAV Run Time: 928.616 seconds 1.328 seconds/mday 178.18 myears/wday
ESP Run Time: 0.000 seconds 0.000 seconds/mday 0.00 myears/wday


Ideally you want the ice and lnd+rof to use about the same amount of time, so in the example here you might want to shift some of the tasks from the lnd model to the ice model, e.g.:
./xmlchange NTASKS_LND=128,NTASKS_ROF=128,NTASKS_ICE=256,ROOTPE_ICE=128

And you want the ocn time to be about the same as atm+lnd+cpl, so again referring to the example you could either reduce the number of ocn tasks or increase the tasks used by all the other components.

Once you are satisfied with a particular pelayout, you can save it to the file cesm/cime_config/config_pes.xml in the source tree, following the regex pattern used in that file to identify your compset, resolution, and machine. Future cases using the same compset, resolution, and machine will then start with the tuned pelayout.
 