
segmentation fault in xtpv of tp_core.F90

wadewei

Wade Wei
New Member
Hello everyone,

I frequently encounter the errors below in some of my slightly customized CESM 1.2.2 runs. The same errors recur every few to every few tens of model years, even though I have switched to nearly default settings with only minor changes such as the CO2 mixing ratio. Does anyone have ideas about possible causes or solutions?

Thanks a lot!


BalanceCheck: soil balance error nstep = 3496 point = 3792 imbalance = -0.000001 W/m2
QNEG4 WARNING from TPHYSAC Max possible LH flx exceeded at 1 points. , Worst excess = -3.3422E-06, lchnk = 723, i = 8, same as indices lat = 150, lon = 207
BalanceCheck: soil balance error nstep = 3499 point = 11878 imbalance = -0.000000 W/m2
QNEG4 WARNING from TPHYSAC Max possible LH flx exceeded at 1 points. , Worst excess = -8.3540E-05, lchnk = ***, i = 8, same as indices lat = 146, lon = 217
BalanceCheck: soil balance error nstep = 3500 point = 11878 imbalance = -0.000000 W/m2
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
cesm.exe 0000000000C9C2D7 tp_core_mp_xtpv_ 469 tp_core.F90
cesm.exe 0000000000C956B4 tp_core_mp_tp2c_ 119 tp_core.F90
cesm.exe 0000000000C7B144 sw_core_mp_c_sw_ 309 sw_core.F90
cesm.exe 0000000000AE09CB cd_core_ 732 cd_core.F90
cesm.exe 00000000008242DB dyn_comp_mp_dyn_r 1822 dyn_comp.F90
cesm.exe 000000000061B2E9 stepon_mp_stepon_ 422 stepon.F90
cesm.exe 00000000004AA3F2 cam_comp_mp_cam_r 250 cam_comp.F90
cesm.exe 00000000004985D1 atm_comp_mct_mp_a 561 atm_comp_mct.F90
cesm.exe 00000000004176BD ccsm_comp_mod_mp_ 4079 ccsm_comp_mod.F90
cesm.exe 0000000000434CF3 MAIN__ 91 ccsm_driver.F90
cesm.exe 000000000041175C Unknown Unknown Unknown
libc.so.6 00002ACBF7BCA445 Unknown Unknown Unknown
cesm.exe 0000000000411659 Unknown Unknown Unknown

user_nl_cam:
co2vmr = 185e-6
ch4vmr = 350e-9
n2ovmr = 200e-9
f11vmr = 0.0
f12vmr = 0.0
 

nusbaume

Jesse Nusbaumer
CSEG and Liaisons
Staff member
Hi Wade Wei,

The error appears to be occurring in the tracer advection, which is often caused by numerical errors generated by a time step that is too large.
So I would recommend simply reducing the model time step, which can be done by adding dtime to user_nl_cam and setting it to a smaller number (I believe your dtime is defaulting to 1800, so try 1200 or 900 instead). You can also increase nsplit and nspltrac in the namelist to similarly reduce the dynamics time step used for tracer advection, although make sure that nsplit remains evenly divisible by nspltrac.
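For example, the additions to user_nl_cam could look something like this (the values are only illustrative, not a recommendation for your particular case):

dtime = 1200
nsplit = 4
nspltrac = 2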

Also, just a heads up that reducing the time-step will cause your model run to be slower.

Hope that helps, and if for some reason this doesn't fix your issue then feel free to reply to this thread letting us know.

Thanks, and have a great day!

Jesse
 

wadewei

Wade Wei
New Member
Hi Jesse,

I went ahead and set dtime=450, nsplit=4, and nspltrac=2 (or something close to that), and it still blows up with the same error. Interestingly, though, the problem seems to go away after I changed the PE layout from 384 cores on 16 nodes, with the components running sequentially, to 192 cores. I will keep watching it, but is there a plausible explanation for this behavior if it holds up?

Thanks and be safe!

Wade
 

nusbaume

Jesse Nusbaumer
CSEG and Liaisons
Staff member
Hi Wade,

Hmm, I am not sure why changing the processor count would cause a seg fault to happen. Maybe @eaton has an idea?
 

eaton

CSEG and Liaisons
Unless the BFBFLAG is set TRUE (via an xmlchange command) the change in processor count will result in a perturbation to the solution. Whether that could be the difference between a run that seg faults and one that doesn't is hard to say.

Also note that when running from the CESM scripts, the model time step is set at a system-wide level via the ATM_NCPL variable in the env_run.xml file. I'm pretty sure that the dtime value in user_nl_cam will be ignored.
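For example (the exact xmlchange syntax can differ slightly between CESM versions, and 72 atmosphere couplings per day corresponds to a 1200 s time step):

./xmlchange BFBFLAG=TRUE
./xmlchange ATM_NCPL=72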
 

tomaslovato

Tomas Lovato
New Member
Hi,
I'm taking advantage of this open thread, as I'm having the same issue reported by @wadewei.

I'm currently using CESM 1.2.2 as well, but with NEMO as the ocean component in place of POP.
The run that generates the segfault is a 4xCO2 simulation at 1 degree (f09_n13) with the compset 1850_CAM5%4xCO2_CLM45%BGC_CICE_NEMO_RTM_SGLC_SWAV (compiled with ifort 19.5).

The error produced by the model is the following (as in the initial post)
==== backtrace (tid: 206006) ====
0 0x00000000014432b0 tp_core_mp_xtpv_() /users_home/csp/cmip01/EXP/segfault.4xCO2/SourceMods/src.cam/tp_core.F90:476
1 0x000000000143a181 tp_core_mp_tp2c_() /users_home/csp/cmip01/EXP/segfault.4xCO2/SourceMods/src.cam/tp_core.F90:231
2 0x000000000142187c sw_core_mp_c_sw_() /users_home/csp/cmip01/EXP/segfault.4xCO2/SourceMods/src.cam/sw_core.F90:309
3 0x000000000136a784 cd_core_() /users_home/csp/cmip01/GIT/cesm/models/atm/cam/src/dynamics/fv/cd_core.F90:732
4 0x0000000001144064 dyn_comp_mp_dyn_run_() /users_home/csp/cmip01/GIT/cesm/models/atm/cam/src/dynamics/fv/dyn_comp.F90:1822
5 0x0000000000f5ca5f stepon_mp_stepon_run1_() /users_home/csp/cmip01/GIT/cesm/models/atm/cam/src/dynamics/fv/stepon.F90:422
6 0x00000000004c816b cam_comp_mp_cam_run1_() /users_home/csp/cmip01/GIT/cesm/models/atm/cam/src/control/cam_comp.F90:250
7 0x00000000004b8c3b atm_comp_mct_mp_atm_run_mct_() /users_home/csp/cmip01/GIT/cesm/models/atm/cam/src/cpl_mct/atm_comp_mct.F90:561
8 0x00000000004336f4 ccsm_comp_mod_mp_ccsm_run_() /users_home/csp/cmip01/GIT/cesm/models/drv/driver/ccsm_comp_mod.F90:4082
9 0x0000000000450192 MAIN__() /users_home/csp/cmip01/GIT/cesm/models/drv/driver/ccsm_driver.F90:91
10 0x000000000042d4e2 main() ???:0
11 0x00000000000223d5 __libc_start_main() ???:0
12 0x000000000042d3e9 _start() ???:0
=================================


The traceback above points into the loop below in tp_core.F90, at the qtmpv(iu,j) access:

if(iord == 1 .or. cosav(j) < cos_upw) then
   do i=1,im
      iu = real(i,r8) - cv(i,j)
      fxv(i,j) = mfxv(i,j)*qtmpv(iu,j)
   enddo


Here the index iu turns out to be -2147483648, since the value of cv(i,j) is NaN, which clearly triggers the segfault.
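As a minimal standalone sketch (not CESM code) of what happens when a NaN Courant number is converted to an integer index:

program nan_index_demo
   ! Mimics the iu = real(i,r8) - cv(i,j) pattern above with cv = NaN.
   use, intrinsic :: ieee_arithmetic, only: ieee_value, ieee_quiet_nan
   implicit none
   integer, parameter :: r8 = selected_real_kind(12)
   real(r8) :: c
   integer  :: i, iu
   i = 1
   c = ieee_value(c, ieee_quiet_nan)   ! stand-in for the NaN cv(i,j)
   iu = real(i, r8) - c                ! real-to-integer conversion of NaN is
                                       ! compiler-dependent; ifort typically
                                       ! yields -2147483648
   print *, 'iu =', iu                 ! indexing qtmpv with this value reads
                                       ! far out of bounds and can segfault
end program nan_index_demo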

After digging through the code over the last week, I found that cv(i,j) is NaN as a consequence of the velocities going to NaN in the function d2a2c_winds (in sw_core.F90).

At this point I enabled CAM's state variable checks by adding state_debug_checks = .true. to user_nl_cam.

In the next simulation the code stopped much earlier, in the check of the physical state variables, due to negative T values:
ERROR: shr_assert_in_domain: state%t has invalid value -7145.49615152464
at location: 12 7
Expected value to be greater than 0.000000000000000E+000
(shr_sys_abort) ERROR: Invalid value produced in physics_state by package radheat.
(shr_sys_abort) WARNING: calling shr_mpi_abort() and stopping


Clearly something goes wrong in the computation of the net radiative heating tendency in radheat.F90, which is simply the difference between the shortwave heating (qsr) and the longwave heating (qsl), which in turn are computed within radiation.F90.

This is how far I have gotten so far, so any suggestions or help on this would be very much appreciated!

Bests, Tomas
 

eaton

CSEG and Liaisons
Thanks for this analysis. I have a vague recollection that this type of behavior (bad heating rates causing negative temperatures) has been observed before in simulations where the atmospheric state gets too far outside historical climate values for Earth.

The problem may be traced to assumptions in the RRTMG code, which contains tables of optical properties of gases that are intended to be interpolated for efficient calculations. Atmospheric conditions outside the expected ranges can cause the interpolation to instead become an extrapolation, and the checks in the code are not sufficient in all cases to detect the problem.

With enough effort it should be possible to locate the problem in the RRTMG code and add clamps to keep the optical properties within reasonable bounds. But it should be realized that this type of "fix" allows RRTMG to continue running outside the range of conditions for which it has been validated. I believe this issue is being addressed in newer versions of the RRTMG package (the newer package is called RRTMGP), but I don't know the status of that work.
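As a purely illustrative sketch of the kind of clamp being described (none of these names or bounds come from RRTMG), the idea is simply to force the table input back inside the range the table was built for, so the lookup stays an interpolation:

program table_clamp_demo
   implicit none
   integer,  parameter :: r8 = selected_real_kind(12)
   integer,  parameter :: n = 5
   real(r8), parameter :: x_lo = 0.0_r8, x_hi = 4.0_r8
   real(r8) :: tbl(n) = [1.0_r8, 2.0_r8, 4.0_r8, 8.0_r8, 16.0_r8]
   real(r8) :: x, xc, dx, frac, val
   integer  :: k
   x  = 9.3_r8                          ! an input far outside the tabulated range
   xc = min(max(x, x_lo), x_hi)         ! clamp before indexing the table
   dx = (x_hi - x_lo)/real(n-1, r8)
   k  = min(int((xc - x_lo)/dx) + 1, n-1)
   frac = (xc - x_lo - real(k-1, r8)*dx)/dx
   val  = tbl(k) + frac*(tbl(k+1) - tbl(k))   ! stays within the validated table
   print *, 'clamped lookup value =', val
end program table_clamp_demo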
 

tomaslovato

Tomas Lovato
New Member
@eaton Thanks for your prompt reply and the information about the RRTMG package!
I assume the approach suggested in the messages above, reducing the time step, could be an 'alternative' way around this problem, but we will investigate the issue a bit longer before giving up! Bests
 

tomaslovato

Tomas Lovato
New Member
Hi @eaton,
I worked a bit more on this issue, following your comment on the RRTMG parameterizations of the optical properties.

Since the problem arises in the computation of the net radiative heating tendency, which is the sum (not the difference, as I wrote above) of the shortwave heating (qsr) and the longwave heating (qsl), I added a few prints of these two variables just before the call to radheat in radiation.F90, at the grid cell that produces the inconsistent temperature values, and it turned out that the longwave heating had wrong values:
...
node 597 qrl (12,:) -5.422876281458102E-002
-3.227881420759256E-002 -3.323525058131436E-002 -2.351795708826856E-002
-3.021039565765879E-002 -364.567550589085 -4396.27186609277
-269.657259509662 -329.449491848069 -441.409852974744
-499.676759792232 -469.655990917300 -1193.57729444409
-1169.91795996251 -792.417403851771 -376.517360493769
-165.676389702708 146.731046224264 176.761579772432
270.518350028806 357.623656624195 411.104688399236
389.624512407969 352.529180998520 263.094102768977
206.315086960337 142.151338825983 95.8399030303118
14.2314636167282
-7.341730804796196E-004


Afterwards I went through the computation of the longwave heating terms and found that in physics/rrtmg/ext/rrtmg_mcica/rrtmg_lw_rtrnmc.f90 a check to constrain the computation of the optical properties was already there.

I compared this routine with the corresponding one in CAM6 (from CESM 2.1.1), and there the constraints on the secdiff computation are applied in a slightly different way.

I modified the secdiff constraints in rrtmg_lw_rtrnmc.f90 to match CAM6, namely:

@@ -232,10 +233,10 @@
          secdiff(ibnd) = 1.66_r8
       else
          secdiff(ibnd) = a0(ibnd) + a1(ibnd)*exp(a2(ibnd)*pwvcm)
+         if (secdiff(ibnd) .gt. 1.80_r8) secdiff(ibnd) = 1.80_r8
+         if (secdiff(ibnd) .lt. 1.50_r8) secdiff(ibnd) = 1.50_r8
       endif
    enddo
-   if (pwvcm.lt.1.0) secdiff(6) = 1.80_r8
-   if (pwvcm.gt.7.1) secdiff(7) = 1.50_r8

    urad(0) = 0.0_r8
    drad(0) = 0.0_r8


and the model finally completed one year of simulation without further errors (I also kept state_debug_checks = .true.)!

I guess the check on the secant values was updated in CAM6 to be more general, since the original one was activated only for certain bands and precipitable water vapor thresholds (this is just a guess, as I don't have access to the CAM development repository to see the revision history).

It would be interesting to see whether the changes above also work for @wadewei's simulation with low prescribed CO2 mixing ratios.

Finally, I also found useful hints in this presentation: http://cesm.ucar.edu/events/wg-meetings/2017/presentations/pwg/ottobliesner.pdf, and in this repository with modified radiative schemes for PALEO experiments: rrtmg in palm/trunk/LIB – PALM.

Bests
 

eaton

CSEG and Liaisons
Thanks for this feedback, and sorry you needed to rediscover this fix, which is in the CESM2 release. FYI, I was able to find where this change was made in the development cycle by looking in the doc/ChangeLog file that is in the CESM2 release code.
 

tomaslovato

Tomas Lovato
New Member
Hi @eaton, I guess this kind of going in circles happens because it is quite difficult to keep up with the development of each model component. Btw, could you please indicate which ChangeLog entry (or CAM development tag) refers to this bugfix?
Thanks
 

eaton

CSEG and Liaisons
From any CESM2 release, have a look at the file components/cam/doc/ChangeLog. If you search for "rrtmg_lw_rtrnmc.f90" you'll find that the fix was applied in CAM tag cam5_4_83 (on 2016-09-15). The entries in this log go back over 20 years. However, the development code is only available since CAM tag cam6_0_000, which is when we moved to a GitHub repo (github.com/ESCOMP/CAM). The ChangeLog file in the ESCOMP/CAM repo on the cam_development branch contains entries for all development tags up to the present. Hope that helps.
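For example, from the top directory of a CESM2 release checkout:

grep -n "rrtmg_lw_rtrnmc.f90" components/cam/doc/ChangeLog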
 