
MPT ERROR: in cheyenne

milicak

Mehmet Ilicak
New Member
Hi,

I am trying to run a MOM6 regional simulation on cheyenne.
I successfully ran one whole year on another HPC system using the GNU compiler, and I am now trying different Intel compiler versions on cheyenne.

However, when I started the run, the first 6 months completed without any problem. Then I had to resubmit the simulation because of the wallclock limit.
After that, the simulation kept failing at the same model time (2 months after the restart, 1996/08/16, 00-00-00) with the following error:
"MPT ERROR: MPI_COMM_WORLD rank 14 has terminated without calling MPI_Finalize() aborting job "
There was no other information.

After trying multiple MOM_input options and different Intel versions, the error log eventually produced this:
"MPT ERROR: Rank 30(g:30) is aborting with error code 1.
MPT Version: HPE MPT 2.19 02/23/19 05:30:09
MPT: --------stack traceback-------
FATAL from PE 1: NETCDF ERROR: NetCDF: HDF error File=INPUT/seawifs-clim-1997-2010.smoothed.nc Field=chlor_a"

That file was the same on both machines. Nevertheless, since the chlor_a field contained NaN on the land points, I decided, just in case, to create a new file with the land points flooded.
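For readers unfamiliar with the term, "flooding" here means replacing the missing values on land with values extrapolated from nearby ocean points. Below is a minimal sketch of one way to do that, assuming the field has already been read into a 2-D list of floats. The function name `flood_nans` and the neighbour-averaging scheme are illustrative assumptions, not the tool actually used for the file above, and reading or writing the NetCDF file is left out.

```python
import math


def flood_nans(grid, max_passes=1000):
    """Return a copy of `grid` with NaN cells filled from valid neighbours.

    Hypothetical helper: each pass assigns every NaN cell the mean of its
    valid 4-neighbours, so values spread inland one cell per pass.
    """
    rows, cols = len(grid), len(grid[0])
    filled = [row[:] for row in grid]  # leave the input untouched
    for _ in range(max_passes):
        missing = [(i, j) for i in range(rows) for j in range(cols)
                   if math.isnan(filled[i][j])]
        if not missing:
            break
        updates = {}
        for i, j in missing:
            neighbours = []
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < rows and 0 <= nj < cols \
                        and not math.isnan(filled[ni][nj]):
                    neighbours.append(filled[ni][nj])
            if neighbours:
                updates[(i, j)] = sum(neighbours) / len(neighbours)
        if not updates:
            break  # nothing valid to spread from (all-NaN grid)
        # Apply after the sweep so a pass only uses values from earlier passes.
        for (i, j), value in updates.items():
            filled[i][j] = value
    return filled
```

Original ocean values are left unchanged; only the NaN cells are overwritten.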

With the new file I managed to run the second 6 months. I then tried to continue the simulation, but the model stopped again with no error information at all!
The simulation was in month 4 of the second year (1997/04/16, 00-00-00).

Then I resubmitted the simulation, this time with VERBOSITY = 6, and it stopped at a different time (1997/ 3/24 12:40: 0), with no fatal error in the error log.
The last part of the output file was the following:

NOTE from PE 0: callTree: o done with find_uv_at_h (diabatic)
NOTE from PE 0: callTree: ---> set_diffusivity(), MOM_set_diffusivity.F90
NOTE from PE 0: callTree: o done with calculate_kappa_shear (set_diffusivity)

Does anybody have a suggestion?

P.S. These are the modules I currently load on cheyenne:
module load ncarenv
module load intel/19.1.1
module load netcdf/4.7.4
module load mpt/2.22

Thanks in advance,

Mehmet
 

adcroft

Alistair Adcroft
Member
If you roll back to a previous restart, can you reproduce the last interval? This sounds like either a corrupted restart file on disk or some non-reproducible code (usually uninitialized variables).
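One way to carry out the reproducibility check is to run the same interval twice from the same restart and compare the restart files each run writes. The sketch below assumes the two runs wrote into separate directories. The helper names and the `*.nc` pattern are hypothetical. A byte-level hash is a strict test: files can differ only in metadata (for example a creation timestamp), so a reported difference is a prompt to compare the fields themselves, not proof of non-reproducibility.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large restart files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def differing_files(dir_a, dir_b, pattern="*.nc"):
    """Return sorted names of files that are missing from one directory
    or whose contents differ between the two directories."""
    names_a = {p.name for p in Path(dir_a).glob(pattern)}
    names_b = {p.name for p in Path(dir_b).glob(pattern)}
    differing = set(names_a ^ names_b)  # present in only one directory
    for name in names_a & names_b:
        if sha256_of(Path(dir_a) / name) != sha256_of(Path(dir_b) / name):
            differing.add(name)
    return sorted(differing)
```

For example, `differing_files("run1/RESTART", "run2/RESTART")` (placeholder paths) returns an empty list when both runs produced byte-identical files. The same `sha256_of` helper can also confirm that an input file copied between machines arrived intact.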
 