
Run fails when RESUBMIT > 0

brendanclark

Brendan Clark
New Member
Hello,

I am using tag alpha-ctsm5.2.mksrf.23_ctsm5.1.dev171. I have successfully run a simulation in CPLHIST mode, but once I set RESUBMIT > 0 with STOP_OPTION=nyears, the run fails before the first resubmission. There are no useful errors. If I do the same simulation but with RESUBMIT=2, STOP_OPTION=nmonths, STOP_N=1, it runs successfully. Is there something I am doing wrong that is causing the run to fail?

Case: /glade/campaign/univ/urtg0006/Brendan/CTSMcases/210324.b.e21.BWSSP245cmip6.f09_g17.CMIP6-SSP2-4.5-WACCM.006.CLMf09_g17.alpha-ctsm5.2.mksrf.23_ctsm5.1.dev171.GGCMI_landuse_2016-2069_resub53_V2

Scratch output: /glade/derecho/scratch/brendanc/210324.b.e21.BWSSP245cmip6.f09_g17.CMIP6-SSP2-4.5-WACCM.006.CLMf09_g17.alpha-ctsm5.2.mksrf.23_ctsm5.1.dev171.GGCMI_landuse_2016-2069_resub53_V2/run

Thanks,
Brendan
 

oleson

Keith Oleson
CSEG and Liaisons
Staff member
I don't think there's anything unreasonable about what you are doing. Are you saying that you are changing RESUBMIT to > 0 while the model is running, and then the model fails immediately?
I don't seem to have permission to access your case directory. Maybe you can change permissions so that we can look at it. In that directory, there should be a run.* file that might contain some information.
 

brendanclark

Brendan Clark
New Member
Sorry, I am not changing RESUBMIT while the model is running; those are two different runs I tried. It is only once I do a new run and set RESUBMIT > 0 with STOP_OPTION=nyears (instead of nmonths, which ran successfully) that the run fails. Specifically, I am doing RESUBMIT=53,STOP_N=1,STOP_OPTION=nyears for DATM_YR_START --val 2016 and DATM_YR_END --val 2069. It runs for one year and fails before resubmitting. If I run the same simulation for multiple years without resubmitting, it also runs successfully. I copied the run.* and env_run.xml files to the scratch directory I posted above.
 

oleson

Keith Oleson
CSEG and Liaisons
Staff member
Ok, thanks. I see in the atm log that it is trying to write a restart file at the end of the run:

210324.b.e21.BWSSP245cmip6.f09_g17.CMIP6-SSP2-4.5-WACCM.006.CLMf09_g17.alpha-ctsm5.2.mksrf.23_ctsm5.1.dev171.GGCMI_landuse_2016-2069_resub53_V2.datm.r.2017-01-01-00000.nc20170101

I don't know why it is prepending 210324. to the beginning of the file. That might be a clue but I'm not sure to what.
I guess you could try compiling and running with DEBUG set to TRUE to see if you get a better traceback.
I assume you aren't out of disk space....
 

oleson

Keith Oleson
CSEG and Liaisons
Staff member
Oops, sorry, I guess that's part of your case name, I'm not paying close enough attention!
 

oleson

Keith Oleson
CSEG and Liaisons
Staff member
I was looking at your case and see that you did get some tracebacks. But it doesn't look like they are very useful as it is dying at different places in the code depending probably on the processor.
I do see that the atm log indicates that it is trying to write a restart file and that file ends up with zero length:

-rw-r--r-- 1 brendanc ncar 0 Mar 21 15:51 b.e21.BWSSP245cmip6.f09_g17.CMIP6-SSP2-4.5-WACCM.006.alpha-ctsm5.2.mksrf.23_ctsm5.1.dev171.03212024.DATM.53RESUB.2016-2069.datm.r.2017-01-01-00000.nc

I also saw earlier in one of your log files this message:

Abort with message NetCDF: One or more variable sizes violate format constraints

The datm restart header should look something like this:

dimensions:
    strlen = 256 ;
    nt = 2004 ;
    nfiles = 240 ;
    nstreams = 7 ;
variables:
    int ymdLB(nstreams) ;
    int ymdUB(nstreams) ;
    int todLB(nstreams) ;
    int todUB(nstreams) ;
    int nfiles(nstreams) ;
    int offset(nstreams) ;
    int k_lvd(nstreams) ;
    int n_lvd(nstreams) ;
    int k_gvd(nstreams) ;
    int n_gvd(nstreams) ;
    int nt(nstreams, nfiles) ;
    int haveData(nstreams, nfiles) ;
    char filename(nstreams, nfiles, strlen) ;
    int date(nstreams, nfiles, nt) ;
    int timeofday(nstreams, nfiles, nt) ;

So I'm wondering if maybe one of these variables is too large. Maybe filename, which would contain all of the filenames for your entire forcing dataset (2016-2069) for all of your streams.
Have you tried running with yearly restarts but restricting the forcing file streams to just be a few years instead of the entire time series?
When you ran with monthly restarts, did you use the entire forcing dataset or did you use a subset?
Grasping at straws here.
 

brendanclark

Brendan Clark
New Member
I think the problem is not with resubmit, but with the creation of restart files in the run I'm doing. I did a longer simulation without resubmit (where I am modifying PCT_CROP, PCT_CFT, and FERTNITRO_CFT in landuse.timeseries) and it ran and created output until the very end, where it failed for a reason I can't understand. It made a restart file that I tried using to continue the run, but that run fails with the error:
ERROR initInterp set_mindist: Cannot find any input points matching output point
Consider rerunning with the following in user_nl_clm:
init_interp_fill_missing_with_natveg = .true.

I tried setting init_interp_fill_missing_with_natveg = .true. but it fails with the same error. I'm not sure what might be wrong with the restart file that is being generated.

First run:
/glade/work/brendanc/CTSMcases/220324.b.e21.BWSSP245cmip6.f09_g17.CMIP6-SSP2-4.5-WACCM.006.alpha-ctsm5.2.mksrf.23_ctsm5.1.dev171.GGCMI_LU.2016-2042
/glade/derecho/scratch/brendanc/220324.b.e21.BWSSP245cmip6.f09_g17.CMIP6-SSP2-4.5-WACCM.006.alpha-ctsm5.2.mksrf.23_ctsm5.1.dev171.GGCMI_LU.2016-2042/run

Second run using the restart file generated from first:
/glade/work/brendanc/CTSMcases/250324.b.e21.BWSSP245cmip6.f09_g17.CMIP6-SSP2-4.5-WACCM.006.alpha-ctsm5.2.mksrf.23_ctsm5.1.dev171.GGCMI_LU.2043_test_restart
/glade/derecho/scratch/brendanc/250324.b.e21.BWSSP245cmip6.f09_g17.CMIP6-SSP2-4.5-WACCM.006.alpha-ctsm5.2.mksrf.23_ctsm5.1.dev171.GGCMI_LU.2043_test_restart/run

thanks,
Brendan
 

brendanclark

Brendan Clark
New Member
It is also worth noting that I have done the exact same run with the same modifications which ran successfully and created a restart file that worked. The only difference is that for that run I used ctsm5.1.dev157 with a modified 5.1 landuse file instead of alpha-ctsm5.2.mksrf.23_ctsm5.1.dev171 with a modified 5.2 landuse file.
 

slevis

Moderator
@brendanclark the fact that the same setup works for you in ctsm5.1 but fails in ctsm5.2 may suggest a bug in ctsm5.2.

First I wanted to clarify: Does your ctsm5.2 experiment also fail when you use the default landuse file rather than your modified one? If so, pls point me to that case, and I will try replicating and then debugging the error. If the answer is no, then we will need to think through other troubleshooting options.
 

brendanclark

Brendan Clark
New Member
@slevis @oleson It actually seems that Keith’s earlier thought was correct: listing too many forcing file names in the stream files (for example, ~30 years' worth) makes the filename variable too large, so the run cannot write a restart file and therefore cannot resubmit or finish successfully. Is there a workaround for this, or does this restrict the length of runs in CPLHIST mode? I know that 30 years does not work and 10 years does, but I'm not sure exactly where the cutoff is. Thanks.
 

oleson

Keith Oleson
CSEG and Liaisons
Staff member
I'll ping our software group here. I'm not sure if there is a workaround for this, other than maybe concatenating the CPLHIST files into monthly files, thus reducing the number of filenames, or maybe there is something in the code that can be changed. @slevis @erik
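To get a feel for how much concatenation would help, here is a rough Python sketch of the filename count per stream. It assumes one forcing file per day on a 365-day calendar, which is my guess and not something confirmed in this thread, although it does reproduce the nfiles = 19710 quoted later for the 2016-2069 case. The helper name files_per_stream is made up for illustration.

```python
def files_per_stream(start_year, end_year, files_per_year):
    """Number of forcing files one stream lists for an inclusive year range."""
    return (end_year - start_year + 1) * files_per_year

STRLEN = 256  # characters reserved per filename in the datm restart header

# ASSUMPTION: daily files on a 365-day (noleap) calendar.
daily = files_per_stream(2016, 2069, 365)
# Hypothetical alternative: the same period concatenated into monthly files.
monthly = files_per_stream(2016, 2069, 12)

for label, count in (("daily", daily), ("monthly", monthly)):
    print(f"{label:8s} {count:6d} files/stream, "
          f"{count * STRLEN:,d} bytes of filename storage per stream")
```

Under those assumptions, monthly files would cut the per-stream filename list from 19710 entries to 648.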
 

erik

Erik Kluzek
CSEG and Liaisons
Staff member
The filename length issue has a fix in cdeps1.0.30 with this PR: Increase stream filename length to CX (512) from CL (256) by ekluzek · Pull Request #265 · ESCOMP/CDEPS. So you could try updating the cdeps external to that version and see if that helps.

The number of filenames is NOT fixed as far as I can tell. The array is allocatable in FORTRAN so it shouldn't have an arbitrary limit. It sounds like the failure is in the FORTRAN and not in the Python script that creates it, right? I can't think of why the Python would fail, but maybe there is some limit in that code? So does the datm.streams.xml file it creates look fine or is it truncated?
 

oleson

Keith Oleson
CSEG and Liaisons
Staff member
Our hypothesis is that it is having trouble writing the datm restart file. It aborts with

Abort with message NetCDF: One or more variable sizes violate format constraints

and writes a zero-length restart file, then crashes.

The file contains a variable char filename(nstreams, nfiles, strlen). In this particular case it would be dimensioned as filename(9, 19710, 256).
We're thinking that this might violate format constraints? But I guess you are saying that there shouldn't be a limit?
@erik
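As a back-of-the-envelope check, the sizes of the three largest variables in that header can be estimated in Python. The nstreams, nfiles, and strlen values are the ones quoted above. The nt value for the failing case is not given in this thread, so the sketch borrows nt = 2004 from the example header purely as a placeholder. The 2 GiB figure is the well-known threshold that comes from the 32-bit offsets of the classic format; this sketch only compares raw variable sizes against it and does not reproduce the library's full constraint check.

```python
from math import prod

# Byte widths of the two netCDF types that appear in the datm restart header.
TYPE_BYTES = {"int": 4, "char": 1}

def variable_bytes(var_type, shape):
    """Unpadded size in bytes of a fixed-size variable with the given shape."""
    return TYPE_BYTES[var_type] * prod(shape)

# Dimensions quoted in the thread for the failing case.
nstreams, nfiles, strlen = 9, 19710, 256
# ASSUMPTION: nt is unknown for the failing case; 2004 is only a placeholder
# taken from the example header.
nt = 2004

sizes = {
    "filename": variable_bytes("char", (nstreams, nfiles, strlen)),
    "date": variable_bytes("int", (nstreams, nfiles, nt)),
    "timeofday": variable_bytes("int", (nstreams, nfiles, nt)),
}

TWO_GIB = 2**31
for name, size in sizes.items():
    status = "exceeds" if size > TWO_GIB else "below"
    print(f"{name:10s} {size:>14,d} bytes ({status} 2 GiB)")
```

With these numbers filename comes to about 45 MB, while date and timeofday scale with nt and are much larger, so it may be worth checking the actual nt before settling on filename as the culprit.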
 

erik

Erik Kluzek
CSEG and Liaisons
Staff member
Ahh, NetCDF format constraints. I was thinking of the CDEPS FORTRAN code and that it had limits on number of filenames or other such limits. OK I read more of the thread and it does sound like a NetCDF format limit.

So then the question is what format is the restart file in (i.e. use ncdump -k <filename>). If it's classic for example we should be able to change it to one of the larger ones by just changing the open statement. And the CDF5 format should be unlimited according to this chart...
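If ncdump is not handy, the format can also be guessed from the first bytes of the file. This is a minimal Python sketch based on the documented netCDF magic numbers (CDF followed by a version byte of 1, 2, or 5, or the HDF5 signature for netCDF-4). The function name netcdf_kind is hypothetical, and unlike ncdump -k it cannot tell netCDF-4 from the netCDF-4 classic model.

```python
def netcdf_kind(path):
    """Guess the netCDF on-disk format from the file's leading magic bytes."""
    with open(path, "rb") as f:
        magic = f.read(8)
    if magic[:3] == b"CDF":
        kinds = {b"\x01": "classic", b"\x02": "64-bit offset", b"\x05": "cdf5"}
        return kinds.get(magic[3:4], "unknown")
    if magic == b"\x89HDF\r\n\x1a\n":
        return "netCDF-4 (HDF5)"
    # Covers empty files too, such as a zero-length restart file.
    return "unknown"
```

Calling netcdf_kind on a restart file that was written successfully should give the same answer as ncdump -k for the three CDF variants; a zero-length file like the one in this thread would simply report unknown.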

 

brendanclark

Brendan Clark
New Member
The datm restart file is classic so maybe changing it to CDF5 would fix this issue.
 