
MOM6 dies silently in call to read_data in FMS1/mosaic/grid.F90

ezaron

Ed Zaron
New Member
Hello MOM6 enthusiasts:
I am trying to run MOM6+SIS2 on NCAR Cheyenne in a configuration which is known to run on NCCS Discover.
I am working with a stock checkout of MOM6-examples, and I have been able to run some ocean_only test cases as well as the ice_ocean_SIS2/OM4_05 configuration.
I compile the code with "make NETCDF=3 DEBUG=1" flags using the Intel MPI compilers with these modules loaded:
1) ncarenv/1.3
2) intel/19.1.1
3) ncarcompilers/0.5.0
4) netcdf/4.8.1
5) mpt/2.25
6) ncview/2.1.7
7) julia/1.9.1

The code appears to start up correctly and writes the OUTPUT/SIS_parameter files. The stdout and stderr log files look normal compared to the same files when the code is run on NCCS Discover.
With some basic sleuthing, I found that the code is silently failing inside FMS1/mosaic/grid.F90 in a call to read_data in the grid_mod/get_grid_cell_vertices section. It is trying to read coordinates out of a regional.mom6.nc file, which exists and appears to be well-formed.

Does this failure mode sound familiar to anyone? I wonder if you have suggestions for how to debug this or get it working.

All the best,
Ed
 

marshallward

Marshall Ward
New Member
Hi Ed, were you able to get any info at all about the failure? Was there a backtrace or even an error from the model? Or did it truly just exit immediately with no more information? If so, then this might need to go through a debugger.
 

klindsay

CSEG and Liaisons
Staff member
Ed,
If you point me to a directory on Cheyenne where the problem is showing up, I'll take a look.
Keith
 

ezaron

Ed Zaron
New Member
Hi Marshall and Keith:
There is no backtrace.
The code is at /glade/u/home/ezaron/MOM6-examples-flat/config_08/UNFORCED
The runtime behavior changes depending on whether the type of the variables in regional.mom6.nc is float or double:
float: mom6.o1988835
double: mom6.o2002186
(I put print statements in grid.F90 to identify the code paths and failure points.)
I'll start looking into the debuggers.
Thank you for your attention.
-Ed
 

ezaron

Ed Zaron
New Member
I was a little intimidated by the Arm Forge documentation, so I just tried gdb with a single process on the login node. I should have done this earlier:

Program received signal SIGSEGV, Segmentation fault.
0x00000000047c9bbd in fms_io_mod::read_data_2d_new (filename=..., fieldname=...,
data=<error reading variable: value requires 474892760 bytes, which is more than max-value-size>,
domain=<error reading variable: Cannot access memory at address 0xe0>, timelevel=<error reading variable: Cannot access memory at address 0x0>,
no_domain=4294967295, position=<error reading variable: Cannot access memory at address 0x0>,
tile_count=<error reading variable: Cannot access memory at address 0x0>, .tmp.FILENAME.len_V$b5e8=1024, .tmp.FIELDNAME.len_V$b5eb=1)
at ../../src/FMS/fms/fms_io.F90:6013
 

milicak

Mehmet Ilicak
New Member
Hi Ed,

Is it possible to try the string with 5 characters instead of 256?
I remember that a long time ago FMS was giving me an error with that.
Unfortunately, FMS is not very informative when it comes to error messages.
 

ezaron

Ed Zaron
New Member
Two experiments:
(1) Add the following print statements to fms_io.F90 at line 6013:
write(0,*)"EDZ: read_data_2d_new : filename = ",trim(filename)
write(0,*)"EDZ: read_data_2d_new : data sizes = ",size(data,1),size(data,2)
write(0,*)"EDZ: read_data_2d_new : data3d sizes = ",size(data_3d,1),size(data_3d,2),size(data_3d,3)

(2) Shorten the string variable in regional.mom6.nc from 256 characters to 5 characters.

Result:
Both approaches allow the serial code to execute successfully through this call in gdb. I am waiting in the queue to test the parallel version.

Question:
Is it possible that "make clean" under build/fms does not wipe the old object files? Maybe fms_io.o was stale from my original compilation (when I used some incompatible modules) and it was not recompiled until I touched the file?
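
One way to check the stale-object question is to list whatever object and module files are still present after running "make clean". The sketch below is only an illustration: it uses a scratch directory as a stand-in for build/fms, and the file names it creates are hypothetical, not taken from the real build tree.

```shell
# Scratch directory standing in for build/fms (assumption: the real tree
# would be checked the same way, right after "make clean").
build_dir=$(mktemp -d)

# Hypothetical leftovers from an earlier compilation, plus a file that
# should not be reported.
touch "$build_dir/fms_io.o" "$build_dir/fms_io_mod.mod" "$build_dir/Makefile"

# List any object (.o) or Fortran module (.mod) files still present.
leftovers=$(find "$build_dir" -type f \( -name '*.o' -o -name '*.mod' \) | sort)

# Count the non-empty lines of the listing.
leftover_count=$(printf '%s\n' "$leftovers" | grep -c .)

echo "$leftovers"
echo "leftover files: $leftover_count"
```

If the same find command prints anything when pointed at the real build directory after "make clean", those files were not removed and a manual cleanup (or a fresh build directory) would rule out stale objects.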
 

ezaron

Ed Zaron
New Member
... the build/fms/Makefile uses "-rm" (minus rm) in the clean target. I've never seen that command before.
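
For reference, a leading "-" on a recipe line is standard make syntax: it tells make to ignore a non-zero exit status from that command and carry on. The sketch below demonstrates this with a throwaway Makefile in a scratch directory; the target contents and file names are hypothetical and are not copied from build/fms/Makefile. It assumes GNU make is installed.

```shell
# Throwaway Makefile in a scratch directory (not the real build/fms/Makefile).
workdir=$(mktemp -d)
printf 'clean:\n\t-rm no_such_file.o\n\t@echo "clean finished"\n' > "$workdir/Makefile"

# rm fails because the file does not exist, but the leading "-" makes
# make ignore that error and run the next recipe line.
clean_output=$(make -C "$workdir" clean 2>&1)
clean_status=$?

echo "$clean_output"
echo "make exit status: $clean_status"
```

So "-rm" is just rm with errors ignored, which keeps "make clean" from aborting when some of the files are already gone.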
 

ezaron

Ed Zaron
New Member
SOLVED:

Symptoms: early exit with no error message or backtrace, not 100% reproducible.

Possible problem/solution: the job needs more memory than is available on the node. Use the "mem" constraint in the PBS script to run on large-memory nodes, and use qhist -n -j <jobid> to see memory usage stats.

I got lucky and the debugger once emitted SIGBUS(7) on exit. NCAR support suggested a memory issue.
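
A minimal sketch of what the memory constraint might look like in a PBS job script follows. The core counts and the 109GB value are assumptions (they reflect how large-memory nodes on Cheyenne were commonly requested, as far as I know) and should be checked against the current NCAR documentation; only the qhist command comes from this thread.

```shell
#!/bin/bash
# Assumed resource request: one node, all cores, with a memory request
# large enough that PBS places the job on a large-memory node.
#PBS -l select=1:ncpus=36:mpiprocs=36:mem=109GB

# After the job finishes, check how much memory it actually used
# (replace <jobid> with the PBS job ID):
# qhist -n -j <jobid>
```

Comparing the memory reported by qhist with the memory available on the node type is a quick way to confirm whether a silent exit like this one is memory related.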
 