0.10 degree Global simulation not running with initial data specified

akhtert · May 31, 2023

Hello,

I am trying to run a 0.1 degree global simulation in CLM5 with an SP compset. The case runs with a cold start. However, I need to use an initial data file from a previous 0.25 degree global simulation. When I specify that file in the CLM namelist, the model aborts without reporting any error that I can identify. I tried changing the PE layout and increasing the number of processors, but it aborts every time. I am running on Cheyenne, and the cases are:
coldstart- /glade/work/akhtert/cases/Spinup1979/0.10degree/0.10d_gwlat_GSWP3_I2000Clm50SpGs_test_coldstrt
and /glade/work/akhtert/cases/Spinup1979/0.10degree/0.10d_gwlat_GSWP3_I2000Clm50SpGs_sp

with initial data file: /glade/work/akhtert/cases/Spinup1979/0.10degree/0.10d_gwlat_GSWP3_I2000Clm50SpGs_spin_new
and /glade/work/akhtert/cases/Spinup1979/0.10degree/0.10d_gwlat_GSWP3_I2000Clm50SpGs_s3

Any help regarding the issue will be highly appreciated.

Thank you in advance- Tanjila
(reposting in this forum from high/variable resolution)

oleson · May 31, 2023

I'd like to take a look at your cases, but your permissions are set such that I can't:

drwx------ 29 akhtert ncar 4096 Apr 4 11:21 cases
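As a minimal sketch of how to read that listing: the mode string `drwx------` grants read, write, and traverse rights to the owner only, so neither the group nor other users can list the directory. The helper name `can_browse` below is hypothetical and only interprets an `ls -l` style mode string; it does not touch the file system.

```python
def can_browse(mode_string: str, who: str) -> bool:
    """Return True if `who` ("group" or "other") can list and enter a directory.

    `mode_string` is the 10-character field printed by `ls -l`,
    for example "drwx------" or "drwxr-xr-x".
    """
    start = {"group": 4, "other": 7}[who]
    bits = mode_string[start:start + 3]
    can_read = bits[0] == "r"
    # Lowercase "s" (setgid) and "t" (sticky) also imply the execute bit is set.
    can_traverse = bits[2] in ("x", "s", "t")
    return can_read and can_traverse


print(can_browse("drwx------", "group"))  # owner-only directory from the listing
print(can_browse("drwxr-xr-x", "group"))  # a commonly used shared-read mode
```

Granting read and execute permission on each directory along the path is what lets another user inspect the case.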

akhtert · May 31, 2023

oleson said:
I'd like to take a look at your cases, but your permissions are set such that I can't:

drwx------ 29 akhtert ncar 4096 Apr 4 11:21

Sorry for the inconvenience. I have changed the permissions.

oleson · May 31, 2023

Is this the case that's failing?

/glade/work/akhtert/cases/Spinup1979/0.10degree/0.10d_gwlat_GSWP3_I2000Clm50SpGs_s3

It looks like it might be failing during the interpolation of the finidat file or perhaps immediately following that when it tries to read the data back in. There is a finidat_interp_dest.nc file that has the same time stamp as the time stamp of the cesm log file. However, I don't see any output about interpolation in the lnd log. You could try compiling and running with DEBUG=TRUE to see if you get more log output.

The finidat file that is being interpolated is:

/glade/scratch/akhtert/archive/0.10d_gwlat_GSWP3_I2000Clm50SpGs_spin/rest/0008-01-01-00000/0.10d_gwlat_GSWP3_I2000Clm50SpGs_spin.clm2.r.0008-01-01-00000.nc

It looks like that is a 0.1deg file? If so, is setting use_init_interp=.true. necessary?

oleson · May 31, 2023

OK, I do see that the interpolation completed successfully after all, although I don't see the need for it:

input gridcells = 1537983 output gridcells = 1537983
input landuntis = 2898154 output landunits = 2898154
input columns = 4451483 output columns = 4451483
input pfts = 8829229 output pfts = 8829229
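The comparison above can be sketched in a few lines of Python. This is an illustration only: the function names `parse_counts` and `counts_match` are hypothetical, and the regular expression assumes the log lines look exactly like the excerpt quoted here. Identical counts suggest, but do not prove, that the finidat file is already on the target grid.

```python
import re

# Excerpt copied from the lnd log output quoted above (spelling as in the log).
LOG_EXCERPT = """\
input gridcells = 1537983 output gridcells = 1537983
input landuntis = 2898154 output landunits = 2898154
input columns = 4451483 output columns = 4451483
input pfts = 8829229 output pfts = 8829229
"""

# Assumed line layout: "input <name> = <n> output <name> = <n>".
PATTERN = re.compile(r"input\s+(\w+)\s*=\s*(\d+)\s+output\s+(\w+)\s*=\s*(\d+)")


def parse_counts(text):
    """Map each output subgrid level name to its (input, output) counts."""
    counts = {}
    for match in PATTERN.finditer(text):
        _in_name, in_n, out_name, out_n = match.groups()
        counts[out_name] = (int(in_n), int(out_n))
    return counts


def counts_match(text):
    """True if at least one level was found and every input count equals its output count."""
    counts = parse_counts(text)
    return bool(counts) and all(a == b for a, b in counts.values())


if counts_match(LOG_EXCERPT):
    print("Input and output counts are identical; interpolation may be unnecessary.")
```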

akhtert · May 31, 2023

For another simulation in the same directory (/glade/work/akhtert/cases/Spinup1979/0.10degree/0.10d_gwlat_GSWP3_I2000Clm50SpGs_spin), I used a finidat file from a 0.25 degree simulation and set use_init_interp=.true. That simulation ran for 9 years and then stopped, and after that I started having this problem. For the new runs I tried to continue from the 10th year using a 0.10 degree finidat file, so yes, I do not need use_init_interp=.true. I will try another run without interpolation and with DEBUG=TRUE. Thank you for your suggestion.

I will update here.

Thanks again
Tanjila

akhtert · Jun 2, 2023

oleson said:
Ok, I do see that interpolation has been completed successfully after all. Although I don't see the need to do that:

input gridcells = 1537983 output gridcells = 1537983
input landuntis = 2898154 output landunits = 2898154
input columns = 4451483 output columns = 4451483
input pfts = 8829229 output pfts = 8829229

Hello, I tried running with DEBUG=TRUE and no interpolation. Unfortunately, this time the model keeps running (STOP_N=2, STOP_OPTION=nmonths) until the maximum walltime is exceeded. I used a total of 2196 PEs. There is no error in the log file, and a walltime of 12 hours should be enough for 2 months or even more. I would really appreciate any suggestions.

case: /glade/work/akhtert/cases/Spinup1979/0.10degree/0.10d_gwlat_GSWP3_I2000Clm50SpGs_s4

Best Tanjila

oleson · Jun 4, 2023

It looks like it ran about a day and a half and then started getting errors like this:

580:MPT: rank 580 dropping unexpected RC packet from 548 (647:1542), presumed failover - 18 0 1

This feels like a potential hardware issue. For example, see this recent post:

Run fails after 5 months with unfamiliar error (bb.cgd.ucar.edu)

Also, I've seen these errors recently in some CESM3 development simulations that are being run.
As suggested in the post, I would contact CISL to see if they can help. In the meantime, you could try running that case again to see if you can complete a short run (maybe just a few days) successfully. Just limit your wallclock time in env_workflow.xml so you don't burn a lot of core-hours if there is another failure.

oleson · Jun 6, 2023

You are probably aware of this since you are in contact with CISL, but I thought I would post this daily bulletin communication from CISL:

June 5, 2023

The Cheyenne Infiniband high-speed network has suffered two failed switches as a result of cooling system problems that will require a system outage in order for the vendor to repair and replace. The dates for this outage are currently being planned. We expect the outage will require 3-5 days effort and will occur no earlier than 3 weeks from now. If users have any flexibility to defer running large node count jobs until after this outage, we recommend deferring jobs when practical.

In the meantime, users will likely experience a higher rate of job failures than typical, especially at large node counts. Error messages such as

ERROR: Extracting flags from IB packet of unknown length
Transport retry count exceeded on mlx5_0:1/IB
MPT: rank XXX dropping unexpected RC packet from YYY …, presumed failover
(no error message, but no output, either.)

Are all likely related to this network path error.

Unfortunately at the moment the remediations are limited. Users are encouraged to resubmit failed jobs, and optionally include the PBS directive “#PBS -l place=group=rack” in their batch scripts when requiring 250 nodes or less. This will request PBS to select nodes from the same rack, perhaps reducing but likely not eliminating the impact of the failed switches.
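To check whether a failed run shows any of the messages listed in the bulletin, a log file can be scanned with a short script. The following is a minimal sketch, not a CISL or CESM tool: the name `find_network_errors` is hypothetical, and the patterns are taken only from the messages quoted in this thread, so real logs may word them differently.

```python
import re

# Patterns based on the three error messages quoted in the CISL bulletin.
NETWORK_ERROR_PATTERNS = [
    re.compile(r"Extracting flags from IB packet of unknown length"),
    re.compile(r"Transport retry count exceeded on \S+"),
    re.compile(r"MPT: rank \S+ dropping unexpected RC packet from \S+"),
]


def find_network_errors(lines):
    """Return (line_number, line) pairs for lines matching any known pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(pattern.search(line) for pattern in NETWORK_ERROR_PATTERNS):
            hits.append((number, line.rstrip("\n")))
    return hits


# Example using the message reported earlier in this thread.
sample = [
    "some ordinary model output\n",
    "580:MPT: rank 580 dropping unexpected RC packet from 548 (647:1542), presumed failover - 18 0 1\n",
]
for number, line in find_network_errors(sample):
    print(f"line {number}: {line}")
```

To scan a real log, the open file object can be passed in directly, for example `find_network_errors(open(path))`, where `path` is the cesm log of the failed run. A run that hangs with no output at all, the fourth symptom in the bulletin, will not be caught by this kind of search.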
