
MOM6 (with ocean only benchmark) gets stuck

puneet336

Puneet
New Member
Hi,
I am trying to run the MOM6 ocean-only benchmark, and the simulation seems to be getting stuck.
Compiler: intel2019u5
State of the directory where the simulation got stuck:

Code:
ocean_only_benchmark1]$ ls test/
available_diags.000000 input.nml MOM6 MOM_parameter_doc.all ocean.stats time_stamp.out
change.sh input.nml.bak MOM_input MOM_parameter_doc.debugging ocean.stats.nc Vertical_coordinate.nc
CPU_stats input.nml.bk MOM_memory.h MOM_parameter_doc.layout RESTART
diag_table libnuma.so MOM_override MOM_parameter_doc.short run.sh
env.src logfile.000000.out MOM_override.bak mon run.sh.bak
GOLD_IC.nc lstopo.info MOM_override.bk ocean_geometry.nc slurm-86603.out



Contents of the MOM_override file:
Code:
ocean_only_benchmark1]$ cat test/MOM_override
! Blank file in which we can put "overrides" for parameters
#override NIGLOBAL = 720
#override NJGLOBAL = 360


Contents of input.nml:

Code:
ocean_only_benchmark1]$ cat test/input.nml
&MOM_input_nml
output_directory = './',
input_filename = 'n'
restart_input_dir = 'INPUT/',
restart_output_dir = 'RESTART/',
parameter_filename = 'MOM_input',
'MOM_override' /

&diag_manager_nml
/

&fms_nml
clock_grain='ROUTINE'
clock_flags='SYNC'
domains_stack_size = 955296
stack_size =0 /

&ocean_solo_nml
months = 0
days = 20 /



With 56 ranks (I have 56 cores per node), the simulation seems to get stuck at this state (stdout):

Code:
MOM Date 1/01/01 00:00:00 0: En 5.483091E-01, MaxCFL 0.00000, Mass 7.909719100499E+19, Salt 35.00000000000, Temp 5.06383782258
Total Energy: 4402CF01460DFD5A 4.3369709224187347E+19
Total Mass: 7.9097191004994732E+19, Change: 0.0000000000000000E+00 Error: 0.00000E+00 ( 0.0E+00)
Total Salt: 2.7684016851748152E+18, Change: 0.0000000000000000E+00 Error: 0.00000E+00 ( 0.0E+00)
Total Heat: 1.5721012388222692E+24, Change: 0.0000000000000000E+00 Error: 0.00000E+00 ( 0.0E+00)
Total age: 0.0000000000000000E+00 yr kg

I am attaching the stdout herewith. All other files are the same as in the source code. The same setup runs to completion with a lower number of ranks (28). Do I need to modify some settings in the input file to make the simulation work with a higher number of ranks (56)?

I tested a few launch configurations:
a) 14 ranks x 1 thread per process
b) 14 ranks x 4 threads per process
c) more than 50 ranks per node
With all of the above, the simulation gets stuck at the first timestep. With 28 or 50 ranks per node, the simulation works.

Please let me know if any more information is required from my end on this issue.
 

Attachments

  • slurm.out.txt
    13.2 KB

marshallward

Marshall Ward
New Member
Hi Puneet, it seems that this was an issue with the code. We have discussed this on GitHub, but I'll repeat the explanation here for others.

There are certain timers which are synced via `MPI_Barrier`. A few of these are embedded in j-loops.

If the ranks have different-sized domains in j, then the ranks will be unable to sync on the iterations which exceed the j-extent of the smaller domains.
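
To make the failure mode concrete, here is a minimal, hypothetical sketch (plain MPI Fortran, not actual MOM6/FMS code) of a synchronized timer inside a j-loop. For instance, with the NJGLOBAL = 360 from the MOM_override above and a hypothetical layout that puts 7 ranks along j, some ranks get 51 rows and others 52, so the 52-row ranks eventually wait at a barrier the 51-row ranks never reach:

Code:
! Minimal sketch (hypothetical, not actual MOM6/FMS code) of a barrier
! inside a loop whose trip count differs across ranks.
program barrier_in_jloop
  use mpi
  implicit none
  integer :: ierr, rank, je, j
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  ! Emulate an uneven j-decomposition: odd ranks get one extra row.
  je = 10 + mod(rank, 2)
  do j = 1, je
    ! ... timed work on row j ...
    ! A synchronized timer effectively does this on every iteration.
    ! Ranks with je = 10 exit after 10 iterations, so the 11th barrier
    ! on the je = 11 ranks is never matched and the run hangs.
    call MPI_Barrier(MPI_COMM_WORLD, ierr)
  end do
  call MPI_Finalize(ierr)
end program barrier_in_jloop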

This can be resolved by removing (or commenting out) the `clock_flags` argument in input.nml:

Code:
&fms_nml
    !clock_flags = 'SYNC'
/
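
Note that this only removes the synchronization of the timers via `MPI_Barrier`; the timers themselves (clock_grain = 'ROUTINE') remain active, so you still get per-routine timings, just without the extra barriers that cause the hang.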
 