
0.10 degree Global simulation not running with initial data specified

akhtert

Tanjila Akhter
New Member
Hello,

I am trying to run a 0.1 degree global simulation in CLM5 with an SP compset. The same configuration runs as a cold start. However, I need to use an initial data file from a previous 0.25 degree global simulation. When I specify that file in the CLM namelist, the model aborts without reporting any error that I can identify. I tried changing the PE layout and increasing the number of processors, but it aborts every time. I am running on Cheyenne and the cases are:
coldstart- /glade/work/akhtert/cases/Spinup1979/0.10degree/0.10d_gwlat_GSWP3_I2000Clm50SpGs_test_coldstrt
and /glade/work/akhtert/cases/Spinup1979/0.10degree/0.10d_gwlat_GSWP3_I2000Clm50SpGs_sp

with initial data file: /glade/work/akhtert/cases/Spinup1979/0.10degree/0.10d_gwlat_GSWP3_I2000Clm50SpGs_spin_new
and /glade/work/akhtert/cases/Spinup1979/0.10degree/0.10d_gwlat_GSWP3_I2000Clm50SpGs_s3

Any help regarding the issue will be highly appreciated.

Thank you in advance- Tanjila
(reposting in this forum for high/variable resolution)
 

oleson

Keith Oleson
CSEG and Liaisons
Staff member
I'd like to take a look at your cases, but your permissions are set such that I can't:

drwx------ 29 akhtert ncar 4096 Apr 4 11:21 cases
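As an illustration of what that mode string means, here is a minimal Python sketch that reports whether group and other users can list and enter a directory, and that opens up a single directory for them. The function names describe_access and open_for_reading are hypothetical, and the sketch only changes the one directory it is given; parent directories and the files inside would need suitable permissions as well.

```python
import os
import stat


def describe_access(path):
    """Return (ls-style mode string, group can read+enter, others can read+enter)."""
    mode = os.stat(path).st_mode
    group_ok = bool(mode & stat.S_IRGRP) and bool(mode & stat.S_IXGRP)
    others_ok = bool(mode & stat.S_IROTH) and bool(mode & stat.S_IXOTH)
    return stat.filemode(mode), group_ok, others_ok


def open_for_reading(path):
    """Add read and traverse permission for group and others on one directory.

    Hypothetical helper: it does not recurse into the directory.
    """
    current = stat.S_IMODE(os.stat(path).st_mode)
    extra = stat.S_IRGRP | stat.S_IXGRP | stat.S_IROTH | stat.S_IXOTH
    os.chmod(path, current | extra)
```

A mode of drwx------ means only the owner can list or enter the directory, so describe_access would report False for both group and others until the permissions are widened.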
 

oleson

Keith Oleson
CSEG and Liaisons
Staff member
Is this the case that's failing?

/glade/work/akhtert/cases/Spinup1979/0.10degree/0.10d_gwlat_GSWP3_I2000Clm50SpGs_s3

It looks like it might be failing during the interpolation of the finidat file or perhaps immediately following that when it tries to read the data back in. There is a finidat_interp_dest.nc file that has the same time stamp as the time stamp of the cesm log file. However, I don't see any output about interpolation in the lnd log. You could try compiling and running with DEBUG=TRUE to see if you get more log output.

The finidat file that is being interpolated is:

/glade/scratch/akhtert/archive/0.10d_gwlat_GSWP3_I2000Clm50SpGs_spin/rest/0008-01-01-00000/0.10d_gwlat_GSWP3_I2000Clm50SpGs_spin.clm2.r.0008-01-01-00000.nc

It looks like that is a 0.1deg file? If so, is setting use_init_interp=.true. necessary?
 

oleson

Keith Oleson
CSEG and Liaisons
Staff member
Ok, I do see that interpolation has been completed successfully after all. Although I don't see the need to do that:

input gridcells = 1537983 output gridcells = 1537983
input landuntis = 2898154 output landunits = 2898154
input columns = 4451483 output columns = 4451483
input pfts = 8829229 output pfts = 8829229
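A quick way to make that comparison for other cases is to parse those lines from the lnd log. The following is a minimal Python sketch, assuming the log lines have the "input X = N output X = M" form shown above; the function name grid_counts_match is hypothetical. Identical counts suggest the finidat file is already on the target grid, but they are not proof of it.

```python
import re

# Matches lines such as "input gridcells = 1537983 output gridcells = 1537983"
COUNT_LINE = re.compile(
    r"input\s+(\w+)\s*=\s*(\d+)\s+output\s+(\w+)\s*=\s*(\d+)"
)


def grid_counts_match(log_text):
    """Return True if every input/output count pair in log_text is equal."""
    pairs = [
        (int(m.group(2)), int(m.group(4)))
        for m in COUNT_LINE.finditer(log_text)
    ]
    if not pairs:
        raise ValueError("no interpolation count lines found")
    return all(n_in == n_out for n_in, n_out in pairs)


SAMPLE = """\
input gridcells = 1537983 output gridcells = 1537983
input landuntis = 2898154 output landunits = 2898154
input columns = 4451483 output columns = 4451483
input pfts = 8829229 output pfts = 8829229
"""

if __name__ == "__main__":
    print(grid_counts_match(SAMPLE))
```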
 

akhtert

Tanjila Akhter
New Member
For another simulation in the same directory (/glade/work/akhtert/cases/Spinup1979/0.10degree/0.10d_gwlat_GSWP3_I2000Clm50SpGs_spin), I used finidat from a 0.25 degree simulation with use_init_interp=.true. That simulation ran for 9 years and then stopped, and after that I started having this problem. For the new runs I tried to continue from the 10th year using the 0.10 degree finidat, so yes, I do not need use_init_interp=.true. I will try another run without interpolation and with DEBUG=TRUE to check. Thank you for your suggestion.

I will update here.

Thanks again
Tanjila
 

akhtert

Tanjila Akhter
New Member
Hello, I tried running with DEBUG=TRUE and no interpolation. Unfortunately, this time the model keeps running (STOP_N=2, STOP_OPTION=nmonths) until the maximum walltime is exceeded. I used a total of 2196 PEs. There is no error in the log file, and a walltime of 12 hrs should be enough for 2 months or even more. I would really appreciate any suggestions.

case: /glade/work/akhtert/cases/Spinup1979/0.10degree/0.10d_gwlat_GSWP3_I2000Clm50SpGs_s4

Best Tanjila
 

oleson

Keith Oleson
CSEG and Liaisons
Staff member
It looks like it ran for about a day and a half and then started getting errors like this:

580:MPT: rank 580 dropping unexpected RC packet from 548 (647:1542), presumed failover - 18 0 1

This feels like a potential hardware issue. For example, see this recent post:


Also, I've seen these errors recently in some CESM3 development simulations that are being run.
As suggested in the post, I would contact CISL to see if they can help. In the meantime, you could try running that case again to see if you can complete a short run (maybe just a few days) successfully. Just limit your wallclock time in env_workflow.xml so you don't burn a lot of core-hours if there is another failure.
 

oleson

Keith Oleson
CSEG and Liaisons
Staff member
You are probably aware of this since you are in contact with CISL, but I thought I would post this daily bulletin communication from CISL:

June 5, 2023
The Cheyenne Infiniband high-speed network has suffered two failed switches as a result of cooling system problems that will require a system outage in order for the vendor to repair and replace. The dates for this outage are currently being planned. We expect the outage will require 3-5 days effort and will occur no earlier than 3 weeks from now. If users have any flexibility to defer running large node count jobs until after this outage, we recommend deferring jobs when practical.

In the meantime, users will likely experience a higher rate of job failures than typical, especially at large node counts. Error messages such as
  • ERROR: Extracting flags from IB packet of unknown length
  • Transport retry count exceeded on mlx5_0:1/IB
  • MPT: rank XXX dropping unexpected RC packet from YYY …, presumed failover
  • (no error message, but no output, either.)
are all likely related to this network path error.

Unfortunately at the moment the remediations are limited. Users are encouraged to resubmit failed jobs, and optionally include the PBS directive “#PBS -l place=group=rack” in their batch scripts when requiring 250 nodes or less. This will request PBS to select nodes from the same rack, perhaps reducing but likely not eliminating the impact of the failed switches.
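To check whether a failed job shows any of the error signatures listed in the bulletin, a log file can be scanned for them. This is a minimal Python sketch, assuming the messages appear in the log as quoted above; the name find_network_errors and the exact regular expressions are illustrative, and the "no error message, no output" case cannot be detected this way.

```python
import re

# Signatures taken from the CISL bulletin text; XXX/YYY are rank numbers.
NETWORK_ERROR_PATTERNS = [
    re.compile(r"ERROR: Extracting flags from IB packet of unknown length"),
    re.compile(r"Transport retry count exceeded on mlx5_0:1/IB"),
    re.compile(
        r"MPT: rank \d+ dropping unexpected RC packet from \d+ "
        r".*presumed failover"
    ),
]


def find_network_errors(lines):
    """Return the lines that match any of the known network error signatures."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(p.search(line) for p in NETWORK_ERROR_PATTERNS)
    ]


if __name__ == "__main__":
    import sys

    # Usage: python scan_log.py <path to a cesm log file>
    with open(sys.argv[1], errors="replace") as log:
        for hit in find_network_errors(log):
            print(hit)
```

If the scan finds matches, the failure is more likely tied to the network problem than to the case setup, and resubmitting as the bulletin suggests is a reasonable next step.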