
negative h2osoi_ice about 300 years into a ctsm standalone spin-up

aherring

Adam
Member
I'm spinning up the snowpack in an I compset, cycling through 20-year forcing streams from a coupled F-compset simulation. About 290 years into the simulation, CTSM errors out due to a negative h2osoi_ice value. From the cesm.log:

Code:
281: ERROR: In UpdateState_TopLayerFluxes, h2osoi_ice has gone significantly negativ
281: e
281: Bulk/tracer name = bulk
281: c, lev_top(c) =        36026           0
281: h2osoi_ice_top_orig =   3.707650617013752E-006
281: h2osoi_ice          =  -5.285485590866834E-019
281: frac_sno_eff        =    1.00000000000000
281: qflx_dew_snow*dtime =   0.000000000000000E+000
281: qflx_sub_snow*dtime =   3.707650617014281E-006
281: ENDRUN:
281: ERROR:
281: In UpdateState_TopLayerFluxes, h2osoi_ice has gone significantly negative
281:Image              PC                Routine            Line        Source
281:cesm.exe           0000000001AADF9A  Unknown               Unknown  Unknown
281:cesm.exe           00000000011AAA00  shr_abort_mod_mp_         114  shr_abort_mod.F90
281:cesm.exe           00000000005032AF  abortutils_mp_end          50  abortutils.F90
281:cesm.exe           00000000008C2522  snowhydrologymod_        1190  SnowHydrologyMod.F90
281:cesm.exe           00000000008BF03D  snowhydrologymod_         986  SnowHydrologyMod.F90
281:cesm.exe           000000000081ABD5  hydrologynodraina         287  HydrologyNoDrainageMod.F90
281:cesm.exe           000000000050AD68  clm_driver_mp_clm         778  clm_driver.F90
281:cesm.exe           00000000004F843B  lnd_comp_mct_mp_l         458  lnd_comp_mct.F90
281:cesm.exe           00000000004282A4  component_mod_mp_         737  component_mod.F90
281:cesm.exe           0000000000409DCB  cime_comp_mod_mp_        2615  cime_comp_mod.F90
281:cesm.exe           0000000000427EDC  MAIN__                    133  cime_driver.F90
281:cesm.exe           0000000000407CA2  Unknown               Unknown  Unknown
281:libc.so.6          00002BA1FD7BF6E5  __libc_start_main     Unknown  Unknown
281:cesm.exe           0000000000407BA9  Unknown               Unknown  Unknown
281:MPT ERROR: Rank 281(g:281) is aborting with error code 1001.
281:    Process ID: 43616, Host: r4i5n11, Program: /glade/scratch/aherring/ctsm1.0.dev079.se_grids.n02_I2000Clm50SpSpinup_ne0np4.ARCTIC.ne30x4_mt12_1800pes_200412_1979bc-Nx10yrs/bld/cesm.exe
281:    MPT Version: HPE MPT 2.19  02/23/19 05:30:09
281:
281:MPT: --------stack traceback-------
281:MPT: Attaching to program: /proc/43616/exe, process 43616
281:MPT: done.

The code base is a branch off of ctsm1.0.dev079. The log files are here:

/glade/scratch/aherring/ctsm1.0.dev079.se_grids.n02_I2000Clm50SpSpinup_ne0np4.ARCTIC.ne30x4_mt12_1800pes_200412_1979bc-Nx10yrs/run

Ideally, I would like to branch off this run from well before the crash, with some type of workaround in the code, so I can complete the spin-up.

Spin-up progress is attached, for context.
 

Attachments

  • clm1d_yr2269.png (142.6 KB)

sacks

Bill Sacks
CSEG and Liaisons
Staff member
@oleson pointed me to this. He suggested:

I'm not familiar with this code but it looks like a negative h2osoi_ice error is being triggered because it was a bit too negative to get zeroed-out.

The negative value is:

-5.285485590866834E-019

and the absolute value threshold that the value would be zeroed-out at appears that it would be:

1.e-13 × 3.707650617013752e-06 = 3.707e-19

So it doesn't get zeroed-out.

My suggestion would be to set rel_epsilon = 1.e-12_r8 just to be able to continue the simulation.
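Keith's arithmetic can be double-checked with a small sketch of the truncation logic (a hypothetical Python rendering; the actual check lives in Fortran in src/utils/NumericsMod.F90, and the function name here is illustrative):

```python
# Hypothetical sketch of CTSM's relative-tolerance truncation check.
# The real code is Fortran (src/utils/NumericsMod.F90); names here are
# illustrative only.
def would_be_zeroed(value, reference, rel_epsilon):
    """A value is zeroed out if it is within rel_epsilon * |reference| of zero."""
    return abs(value) <= rel_epsilon * abs(reference)

h2osoi_ice = -5.285485590866834e-19
h2osoi_ice_top_orig = 3.707650617013752e-06

# With the default rel_epsilon = 1e-13 the threshold is ~3.7e-19, which the
# negative value exceeds in magnitude, so the model aborts:
print(would_be_zeroed(h2osoi_ice, h2osoi_ice_top_orig, 1e-13))  # False
# With rel_epsilon = 1e-12 the threshold is ~3.7e-18 and the value is zeroed:
print(would_be_zeroed(h2osoi_ice, h2osoi_ice_top_orig, 1e-12))  # True
```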

I agree with his assessment. This error check was introduced in ctsm1.0.dev060. As I noted in the ChangeLog entry for that tag:

It's possible that the extra error checks I have added (to ensure we
don't have greater-than-roundoff-level negative residuals) will be
triggered in rare circumstances in a production run, even though they
were never triggered in the test suite.

And now that seems to have happened.

Changing this to 1e-12 seems safe here, and it could possibly even go a bit larger. For now, @aherring, I suggest simply changing rel_epsilon in src/utils/NumericsMod.F90 from 1e-13 to 1e-12. Note that this will change behavior slightly in some other parts of the code, too, but it shouldn't be a big deal. Slightly longer term, I can put in an argument to the truncate function that allows different tolerance values in different circumstances.
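The longer-term fix described above, a tolerance argument on the truncate function, might look roughly like this (a Python sketch of the optional-argument pattern; the actual CTSM routine is Fortran and its signature may differ):

```python
# Sketch of a truncate routine accepting an optional per-call tolerance.
# Names and defaults are assumptions, not the actual CTSM code.
DEFAULT_REL_EPSILON = 1e-13

def truncate_small_values(values, references, custom_rel_epsilon=None):
    """Zero out entries whose magnitude is within rel_epsilon * |reference| of zero."""
    rel_epsilon = DEFAULT_REL_EPSILON if custom_rel_epsilon is None else custom_rel_epsilon
    return [0.0 if abs(v) <= rel_epsilon * abs(r) else v
            for v, r in zip(values, references)]

# Callers that need a looser tolerance (e.g. the snow-flux update) could pass
# their own value without changing behavior elsewhere:
ice = truncate_small_values([-5.285485590866834e-19],
                            [3.707650617013752e-06],
                            custom_rel_epsilon=1e-12)
print(ice)  # [0.0]
```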
 

aherring

Adam
Member
Thanks Keith. Bill, I hope you feel vindicated by your ChangeLog comment ... you should!

I branched off with the relaxed tolerance and am now well past the point where the error occurred. I will probably keep this tolerance at 1.e-12 for variable-resolution runs, even after the spin-up, in case the error is a symptom of higher-resolution grids. If this tolerance is still too small and the run errors out for the same reason, I'll make sure to update this thread.
 
Dear Sacks,
I am running an E compset of CESM2.2.0 (long name: 1850_CAM60_CLM45%SP_CICE_DOCN%SOM_MOSART_SGLC_SWAV_TEST). The simulation crashed with the error:
Code:
ERROR: In UpdateState_TopLayerFluxes, h2osoi_ice has gone significantly negative
Bulk/tracer name = bulk
c, lev_top(c) = 7398 0
h2osoi_ice_top_orig = 1.464606154954986E-002
h2osoi_ice = -2.014664392947504E-003
frac_sno_eff = 0.238922227275374
qflx_soliddew_to_top_layer*dtime = 0.000000000000000E+000
qflx_solidevap_from_top_layer*dtime = 6.973284207372957E-002

ENDRUN:
ERROR:
In UpdateState_TopLayerFluxes, h2osoi_ice has gone significantly negative
max rss=384151552.0 MB
memory_write: model date = 00010131 28800 memory = -0.00 MB (highwater) 366.36 MB (usage) (pe= 480 comps= LND ICE GLC WAV IAC ESP)
As discussed above, I also changed the value of rel_epsilon in NumericsMod.F90 from 1e-13 to 1e-12, but I am still facing the same error.
Could you please suggest something to solve this error?
Thanking you
Pawan Vats
 

oleson

Keith Oleson
CSEG and Liaisons
Staff member
I'm not sure what to suggest other than to set the value of rel_epsilon to something slightly larger than the negative h2osoi_ice and see if you can run past the point at which the error occurs.
Also, I see you are using CLM4.5, not CLM5 (the default version of CLM in CESM2.2). Unless you have a specific reason for using CLM4.5, I would recommend using CLM5. We haven't done long coupled simulations with CLM4.5 and it may not be as robust to errors as CLM5.
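Checking the numbers from the log shows how far beyond roundoff this case is (plain arithmetic, no CTSM code), which is why bumping rel_epsilon from 1e-13 to 1e-12 made no difference here:

```python
# Numbers from the error log above: the deficit is ~14% of the original
# top-layer ice, so a roundoff-level relative tolerance cannot absorb it.
h2osoi_ice = -2.014664392947504e-03
h2osoi_ice_top_orig = 1.464606154954986e-02

ratio = abs(h2osoi_ice) / h2osoi_ice_top_orig
print(f"rel_epsilon would need to exceed {ratio:.3f}")  # ~0.138
```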
 

sacks

Bill Sacks
CSEG and Liaisons
Staff member
Hi Pawan,

Thank you for reporting this error. I am sorry you are running into this issue.

As Keith says, if you're able to do this with CLM5, that could be best.

However, this does seem like a real issue, not just one related to the tolerance of the error check. I was able to reproduce it in a simpler land-only configuration, and I will take some time to look into it further and try to track down the cause.
 

sacks

Bill Sacks
CSEG and Liaisons
Staff member
I have been looking into this and have some ideas about a solution, but it's going to take some time to get it in place. I have opened the issue "h2osoi_ice can go significantly negative" (ESCOMP/CTSM #1253) so that we can track this; further discussion will happen there, so feel free to subscribe to that issue for updates.

Let us know if this issue is preventing you from making progress on your work - for example, because updating to CLM5 is not an option for you. If so, I can suggest a workaround. (It will probably involve reverting the changes in ctsm1.0.dev060, but I'd want to look into that a bit more before actually suggesting it, so please let me know if that would be helpful to you.)
 

sacks

Bill Sacks
CSEG and Liaisons
Staff member
Hi Pawan,

We believe we have fixed this issue in ctsm5.1.dev028. If you're still running into this problem, please try updating to that tag and see if it fixes your issue – we'd be interested to know if it does. Let me know if you need detailed instructions for how to do this update.
 