
restart file error: error reading variable, more than max-value-size

kmcmonigal

Kay
New Member
I am attempting to branch a hybrid BSSP370smbb-forced run of cesm2.1.4-rc.08 from restart files of a historically forced run that included modifications to the ocean-atmosphere coupling. The run will restart from the CMIP6 restart files, but not from the restart files of my modified BHISTsmbb run. I swapped out each restart file individually and determined that my cpl.r.2015-01-01-00000.nc restart file is the problematic one.

I altered the user_nl_clm file to include:
use_init_interp = .true.
use_c13 = .false.
use_c14 = .false.
glacier_region_behavior = 'single_at_atm_topo', 'virtual', 'virtual', 'multiple'

Those changes fixed other problems that had occurred, but I am unsure what to do now. The error message suggests that some variables in the restart file are too large to be read. The model initializes and produces some POP output, but fails during a 5-day run.

Section of cesm.log error message:
271:MPT: #5 <signal handler called>
271:MPT: #6 0x0000000006dec40f in soiltemperaturemod::soilthermprop (bounds=...,
271:MPT: num_nolakec=311, filter_nolakec=...,
271:MPT: tk=<error reading variable: value requires 327968 bytes, which is more than max-value-size>,
271:MPT: cv=<error reading variable: value requires 327968 bytes, which is more than max-value-size>, tk_h2osfc=..., urbanparams_inst=..., temperature_inst=...,
271:MPT: waterstate_inst=..., soilstate_inst=...)
 

Attachments

  • version_info.txt (4.5 KB)
  • user_nl_clm.txt (1.8 KB)
  • atm.log.5905895.chadmin1.ib0.cheyenne.ucar.edu.220825-135021.txt (373 KB)
  • glc.log.5905895.chadmin1.ib0.cheyenne.ucar.edu.220825-135021.txt (18.9 KB)
  • ice.log.5905895.chadmin1.ib0.cheyenne.ucar.edu.220825-135021.txt (37.2 KB)
  • lnd.log.5905895.chadmin1.ib0.cheyenne.ucar.edu.220825-135021.txt (191.9 KB)
  • ocn.log.5905895.chadmin1.ib0.cheyenne.ucar.edu.220825-135021.txt (684.7 KB)
  • rof.log.5905895.chadmin1.ib0.cheyenne.ucar.edu.220825-135021.txt (13.4 KB)
  • wav.log.5905895.chadmin1.ib0.cheyenne.ucar.edu.220825-135021.txt (2.3 KB)
  • cesm.log.5905895.chadmin1.ib0.cheyenne.ucar.edu.220825-135021.tail.txt (92.8 KB)

oleson

Keith Oleson
CSEG and Liaisons
Staff member
One of our software engineers points out that:

"The portion of the log file that the user posted is not the relevant part: I think that just indicates a problem printing all of the variable information in the traceback. If you look at the full attached cesm log file, you'll see a floating point exception at /glade/work/kmcmonigal/tmp_aug2022/cesm2.1.4-rc.08/components/clm/src/biogeophys/SoilTemperatureMod.F90:718

which is

bw(c,j) = (h2osoi_ice(c,j)+h2osoi_liq(c,j))/(frac_sno(c)*dz(c,j))"

Presumably either frac_sno or dz is zero here.
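If it would help to confirm which column is involved, you could drop a temporary diagnostic just above that line so the offending indices are printed before the floating point exception. This is only a rough sketch using the names as they appear in the expression above; the exact associate names and kind parameter may differ slightly in your checkout:

! temporary diagnostic just above SoilTemperatureMod.F90:718 (sketch)
if (frac_sno(c)*dz(c,j) <= 0._r8) then
   write(iulog,*)'soilthermprop: zero denominator at c = ',c,' j = ',j
   write(iulog,*)'  frac_sno = ',frac_sno(c),' dz = ',dz(c,j)
   write(iulog,*)'  h2osoi_ice = ',h2osoi_ice(c,j),' h2osoi_liq = ',h2osoi_liq(c,j)
end if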
I also see a very large snow balance error at nstep=0 in the lnd log file:

WARNING: snow balance error
nstep= 0 local indexc= 1092 col%itype= 401
lun%itype= 4 errh2osno= 9894.09545224555

These balance errors don't stop the model in the first few time steps, but we don't usually see such a large balance error in the log file even at the beginning of the model run.
col%itype = 401 is a landice multiple elevation class. The size of the snow balance error might indicate that there is some inconsistency between the snow variables in the restart file (e.g., frac_sno is zero but h2osno is some large value) or in the interpolated file.
I guess you could start by looking at some of the variables that go into the balance check (in BalanceCheckMod.F90) to see if there is anything unusual there:

write(iulog,*)'errh2osno = ',errh2osno(indexc)
write(iulog,*)'snl = ',col%snl(indexc)
write(iulog,*)'snow_depth = ',snow_depth(indexc)
write(iulog,*)'frac_sno_eff = ',frac_sno_eff(indexc)
write(iulog,*)'h2osno = ',h2osno(indexc)
write(iulog,*)'h2osno_old = ',h2osno_old(indexc)
write(iulog,*)'snow_sources = ',snow_sources(indexc)*dtime
write(iulog,*)'snow_sinks = ',snow_sinks(indexc)*dtime
write(iulog,*)'qflx_prec_grnd = ',qflx_prec_grnd(indexc)*dtime
write(iulog,*)'qflx_snow_grnd_col = ',qflx_snow_grnd_col(indexc)*dtime
write(iulog,*)'qflx_rain_grnd_col = ',qflx_rain_grnd_col(indexc)*dtime
write(iulog,*)'qflx_sub_snow = ',qflx_sub_snow(indexc)*dtime
write(iulog,*)'qflx_snow_drain = ',qflx_snow_drain(indexc)*dtime
write(iulog,*)'qflx_evap_grnd = ',qflx_evap_grnd(indexc)*dtime
write(iulog,*)'qflx_top_soil = ',qflx_top_soil(indexc)*dtime
write(iulog,*)'qflx_dew_snow = ',qflx_dew_snow(indexc)*dtime
write(iulog,*)'qflx_dew_grnd = ',qflx_dew_grnd(indexc)*dtime
write(iulog,*)'qflx_snwcp_ice = ',qflx_snwcp_ice(indexc)*dtime
write(iulog,*)'qflx_snwcp_liq = ',qflx_snwcp_liq(indexc)*dtime
write(iulog,*)'qflx_snwcp_discarded_ice = ',qflx_snwcp_discarded_ice(indexc)*dtime
write(iulog,*)'qflx_snwcp_discarded_liq = ',qflx_snwcp_discarded_liq(indexc)*dtime
write(iulog,*)'qflx_sl_top_soil = ',qflx_sl_top_soil(indexc)*dtime
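Another option is to look at the snow fields directly in the interpolated initial condition file (finidat_interp_dest.nc in the run directory, if use_init_interp produced one) and check for columns where, for example, frac_sno is zero but H2OSNO is large. Here is a rough standalone sketch using netCDF-Fortran; the file name, dimension name, variable names, and the threshold are my guesses, so confirm them against ncdump -h first:

program check_snow
  use netcdf
  implicit none
  integer :: ncid, dimid, varid, ncol, c, ierr
  real(8), allocatable :: frac_sno(:), h2osno(:)
  ! open the interpolated initial condition file (name/path assumed)
  ierr = nf90_open('finidat_interp_dest.nc', nf90_nowrite, ncid)
  ierr = nf90_inq_dimid(ncid, 'column', dimid)
  ierr = nf90_inquire_dimension(ncid, dimid, len=ncol)
  allocate(frac_sno(ncol), h2osno(ncol))
  ! variable names assumed; check with ncdump -h
  ierr = nf90_inq_varid(ncid, 'frac_sno', varid)
  ierr = nf90_get_var(ncid, varid, frac_sno)
  ierr = nf90_inq_varid(ncid, 'H2OSNO', varid)
  ierr = nf90_get_var(ncid, varid, h2osno)
  ! flag columns with a large snow mass but no snow-covered fraction
  do c = 1, ncol
     if (h2osno(c) > 1000.d0 .and. frac_sno(c) <= 0.d0) then
        write(*,*)'suspicious column ',c,' h2osno = ',h2osno(c),' frac_sno = ',frac_sno(c)
     end if
  end do
  ierr = nf90_close(ncid)
end program check_snow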

I'm not sure why the coupler restart file would be involved here. Maybe try a different restart file for the clm initial conditions.
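If you want to test a different land initial condition quickly, it can be pointed at in user_nl_clm without changing the rest of the hybrid setup; something like the following, where the path is just a placeholder for one of your other ensemble members' restart files:

finidat = '/path/to/other/BHISTsmbb.clm2.r.2015-01-01-00000.nc'
use_init_interp = .true.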
 

oleson

Keith Oleson
CSEG and Liaisons
Staff member
I guess you've tried a different initial condition, from the CMIP6 historical, which worked.
I see that the initial file you are using appears to have been generated from a simulation which used a later version of CLM than the version you are using for the current simulation, so maybe there is some backward incompatibility. Otherwise, your setup looks ok to me.
I'll take a closer look later this week.
 

kmcmonigal

Kay
New Member
Thanks Keith. I will try a different initial condition from our altered historical runs (this is for an ensemble, so we have several). Agreed, backward incompatibility could be an issue.
 