
MOM6 Fatal Checksum Error: does IO flush before MPI_Finalize?

ezaron

Ed Zaron
New Member
Hi MOM6 experts:

Two questions:
(1) When the code exits with a Fatal error, is there normally an MPI_Barrier before the MPI_Finalize?
I am a little concerned about whether I am seeing all the output in my mom6.o* log file. Is it possible that the lead proc detects the Fatal error signal and shuts down before all the output has been flushed to the mom6.o* log file?

(2) The Fatal error is due to a disagreement of checksums in an input file. What could cause a corrupt checksum?
86593:FATAL from PE 0: SIS_restart(restore_state): Checksum of input field part_size AFF344408441C19A does not match value 231344408441C19A stored in RESTART/ice_model.res.nc

There are two oddities about this: (a) the checksums differ only in the first three hex digits, and (b) the md5sum of the entire file matches the value of the same file on another computer where the code is known to work.
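
To make oddity (a) concrete, here is a minimal Python sketch that XORs the two checksums quoted in the error message and reports where they disagree. It only compares the two hex strings; it makes no assumption about how the checksum itself is computed.

```python
# The two checksums quoted in the FATAL message above.
computed = "AFF344408441C19A"   # checksum of the field as read in
stored = "231344408441C19A"     # value stored in RESTART/ice_model.res.nc

# XOR the 64-bit values: set bits mark where the checksums disagree.
diff = int(computed, 16) ^ int(stored, 16)

# Positions (0 = leftmost) of the hex digits that differ.
differing_digits = [i for i, (a, b) in enumerate(zip(computed, stored)) if a != b]

print(f"xor              = {diff:016X}")
print(f"differing digits = {differing_digits}")
print(f"differing bits   = {bin(diff).count('1')}")
```

The remaining thirteen hex digits are identical, so the two values agree in their low 52 bits.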

-Ed
 
You can turn off the checking of the sums in the restart files:
#override RESTART_CHECKSUMS_REQUIRED = False

For me, the checksum failed to match when I went from using a mask_table to not (for debugging on fewer cores).
 

adcroft

Alistair Adcroft
Member
1) I think errors are meant to be issued by each process that encounters them. I normally see hundreds of copies of the same FMS fatal error message. Whether the root PE does something different, I am unsure. If the error bypasses FMS, such as a SEGV, then I think it's hard to define what happens.
2) Kate is right that you can turn off the checksum. The idea is that the checksums should only be over the part of the model that matters (not land). However, we've found the filling-in of missing processors (switching from mask_table to non-mask_table) adds values on land that change the bitcount and can lead to subsequent bad values because NaNs propagate even when masked. It is generally very hard to change the number of cores if using masked processors (mask_table). When something needs to be debugged with fewer cores, I usually re-run without a mask_table, and then change the core count.
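
As a rough illustration of point 2, the Python sketch below uses a toy checksum that sums the 64-bit patterns of the field values. This is an assumption made for illustration, not the actual FMS/MOM6 routine, and the field values are hypothetical. It shows how values filled in on land can change a checksum even though every ocean value is unchanged.

```python
import struct

def bit_pattern_checksum(values):
    """Toy checksum: sum the 64-bit patterns of doubles, modulo 2**64.

    Illustration only; this is not the FMS/MOM6 checksum routine.
    """
    total = 0
    for v in values:
        # Reinterpret the double as an unsigned 64-bit integer.
        (bits,) = struct.unpack("<Q", struct.pack("<d", v))
        total = (total + bits) % 2**64
    return total

# Hypothetical 1-D field: three ocean points followed by two land points.
ocean = [0.25, 0.5, 0.125]
land_when_masked = [0.0, 0.0]   # land left at zero (masked-out tiles)
land_when_filled = [1.0, 1.0]   # land given values by an unmasked run

masked_run = bit_pattern_checksum(ocean + land_when_masked)
filled_run = bit_pattern_checksum(ocean + land_when_filled)
ocean_only = bit_pattern_checksum(ocean)

print(f"masked run: {masked_run:016X}")
print(f"filled run: {filled_run:016X}")
print(f"ocean only: {ocean_only:016X}")
```

In this toy case the masked run matches the ocean-only checksum, while the filled run does not, even though the ocean values are bit-for-bit identical.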
 

ezaron

Ed Zaron
New Member
Aha. Thanks a lot.
In order to "re-run without a mask_table", do I simply move my MOM_layout and SIS_layout files out of the way, or do I need to change other input files? Or, do I just comment out the MASKTABLE in the *_layout files?

Presumably, I will need to increase the processor count if I do not use the mask_table. Is that correct?
 
"Presumably, I will need to increase the processor count if I do not use the mask_table. Is that correct?"
For my grids, a higher processor count adds to the cells that would be masked out, so that's when I want a mask_table. I can debug on only four cores, no mask_table.
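
The arithmetic behind this can be sketched in Python. The sketch assumes that a mask_table removes whole land-only tiles from an NIPROC x NJPROC layout, and that the file name follows the pattern mask_table.<masked>.<NIPROC>x<NJPROC>. Both the helper names and the example file name are hypothetical.

```python
import re

def pes_required(niproc, njproc, masked_tiles=0):
    """PEs needed for a layout, minus any tiles masked out as all-land."""
    return niproc * njproc - masked_tiles

def parse_mask_table_name(name):
    """Split a name like 'mask_table.6.4x4' into (masked, niproc, njproc).

    The naming pattern is an assumption; check your own mask_table file.
    """
    m = re.fullmatch(r"mask_table\.(\d+)\.(\d+)x(\d+)", name)
    if m is None:
        raise ValueError(f"unrecognized mask_table name: {name}")
    masked, niproc, njproc = (int(g) for g in m.groups())
    return masked, niproc, njproc

# Hypothetical production layout: 4x4 tiles with 6 land-only tiles masked.
masked, ni, nj = parse_mask_table_name("mask_table.6.4x4")
with_mask = pes_required(ni, nj, masked)   # PEs used with the mask_table
same_layout_no_mask = pes_required(ni, nj) # same layout, no mask_table
debug_layout = pes_required(2, 2)          # small debug layout, no mask_table

print(with_mask, same_layout_no_mask, debug_layout)
```

So keeping the same layout without the mask_table does need more PEs, but choosing a smaller layout for debugging avoids that.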
 