Repeated WACCM crash in solar maximum time slice employing NRLSSI2

Hi all!
I am running CESM1(WACCM) as part of the CESM 1.0.6 suite for a set of time-slice simulations. The common basis for all experiments is the F_2000_WACCM compset. However, I use a different SST/ICE field as lower boundary forcing (1995-2004 mean annual cycle of HadISST1.1) and (this is the important point) different constant spectral solar irradiance forcings.

All my simulations run smoothly and are stable. All except one, which is forced by solar maximum conditions of Nov 1989 according to the new NRLSSI2 (data made available by Judith Lean via personal communication).

I am confident that all the forcing files are prepared correctly. I simply copied the default files used by the F_2000_WACCM compset and replaced the data they contain.
For SST/ICE the original file was: .../inputdata/ocn/docn7/SSTDATA/sst_HadOIBl_bc_1.9x2.5_clim_c061031.nc
For spectral solar irradiance: .../inputdata/atm/cam/solar/spectral_irradiance_Lean_1610-2009_ann_c100405.nc
And for F10.7, kp, and ap: .../inputdata/atm/waccm/phot/wa_smax_c100517.nc
[Here and in the following I shorten the absolute paths to the respective files, hopefully still indicating which file I mean.]

As stated above, I am confident that the forcing files are correct. This is supported by the fact that a total of 10 time-slice experiments (differing from the crashing simulation only in the SSI forcing) ran completely fine for 50 years.

The simulation in question crashes repeatedly after at most 5.5 years, and the .out file always tells me:
Model did not complete - see .../cpl.log.XXXXXX-XXXXXX
However, the coupler log contains no crash-related info at all. Only the .../cesm.log.XXXXXX-XXXXXX contained some info for my first try:
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source            
ccsm.exe           00000000014C1A29  Unknown               Unknown  Unknown
ccsm.exe           00000000014C03A0  Unknown               Unknown  Unknown
ccsm.exe           000000000147DC02  Unknown               Unknown  Unknown
ccsm.exe           000000000140ABA3  Unknown               Unknown  Unknown
ccsm.exe           000000000141354B  Unknown               Unknown  Unknown
libpthread.so.0    00002B602B990500  Unknown               Unknown  Unknown
ccsm.exe           00000000008A6C3D  gw_drag_mp_gw_dra        2096  gw_drag.F90
ccsm.exe           000000000089BF8F  gw_drag_mp_gw_int         809  gw_drag.F90
ccsm.exe           0000000000639362  tphysac_                  263  tphysac.F90
ccsm.exe           0000000000572BF4  physpkg_mp_phys_r         849  physpkg.F90
ccsm.exe           000000000048ED1E  cam_comp_mp_cam_r         279  cam_comp.F90
ccsm.exe           000000000047D67A  atm_comp_mct_mp_a         528  atm_comp_mct.F90
ccsm.exe           000000000040FEBB  ccsm_comp_mod_mp_        2166  ccsm_comp_mod.F90
ccsm.exe           0000000000422F7B  MAIN__                     91  ccsm_driver.F90
ccsm.exe           000000000040E296  Unknown               Unknown  Unknown
libc.so.6          00002B602BBBDCDD  Unknown               Unknown  Unknown
ccsm.exe           000000000040E189  Unknown               Unknown  Unknown
forrtl: error (69): process interrupted (SIGINT)
+ hundreds of similar lines

After that I recompiled the model, setting DEBUG=TRUE in env_build.xml and INFO_DBUG=3 in env_run.xml. As expected, the model crashed again, and the message in .../.out again pointed at the coupler log. This time the coupler log was full of messages. The last lines for the timestep where the model crashed were:
comm_diag xxx sorr  35-7.5035924229171823754E+03 send atm Fall_flxdst1
comm_diag xxx sorr  36-4.0277447075338284776E+04 send atm Fall_flxdst2
comm_diag xxx sorr  37-9.4446187010784124141E+04 send atm Fall_flxdst3
comm_diag xxx sorr  38-8.8965352332325681346E+04 send atm Fall_flxdst4

Comparing this to other timesteps, I have the impression that some data receiving should follow at this point, something like this (taken from the timestep before):
comm_diag xxx sorr   1 3.2501741968186116000E+16 recv atm Sa_z
comm_diag xxx sorr   2-4.5541183399918668750E+14 recv atm Sa_u
comm_diag xxx sorr   3 2.7827167220025068750E+14 recv atm Sa_v
...
but this no longer happens.

Following the advice on getting past WACCM crashes, I then tried increasing nspltvrm, nspltrac and nsplit in various combinations, but none of this helped. In fact, when running the experiment from the beginning instead of restarting, the crashes occurred even earlier for increased nspltvrm and especially nsplit. The behaviour of the various log files is always as described above.

Only in one of my tries did I get a core dump file with the crash. Trying to debug this (the first time I have ever done so), a backtrace gave me the following message (no idea whether this is helpful):
0x000000000040841b at .../ccsm.exe section .text offset 33819

So, does anyone have further hints or ideas on how to overcome this crash? I am starting to fear that WACCM4 is not meant to deal with constant solar-max forcing from NRLSSI2 :-/

Thanks in advance,
Tim
 

jedwards

CSEG and Liaisons
Staff member
Here is what you are looking for: the model crashed at line 2096 of gw_drag.F90. Look there for the field or fields that are out of spec.
gw_drag_mp_gw_dra        2096  gw_drag.F90
You probably need to reduce the timestep or increase the dynamics substeps to continue.
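For a long cesm.log, the same reading of the traceback can be sketched in a few lines of Python: skip the frames reported as Unknown and return the first frame that carries a source file and line number. This is only an illustrative sketch. The function name `first_known_frame` and the regular expression are assumptions based on the column layout of the traceback quoted in this thread; they are not part of CESM or of the Fortran runtime.

```python
import re

# Matches traceback rows such as
#   ccsm.exe  00000000008A6C3D  gw_drag_mp_gw_dra  2096  gw_drag.F90
# Assumed columns: image, PC (hex), routine, line number, source file.
FRAME_RE = re.compile(
    r"^(?P<image>\S+)\s+(?P<pc>[0-9A-Fa-f]{8,16})\s+"
    r"(?P<routine>\S+)\s+(?P<line>\d+)\s+(?P<source>\S+)\s*$"
)

def first_known_frame(log_text):
    """Return (routine, line, source) of the first frame with a numeric line, or None."""
    for row in log_text.splitlines():
        m = FRAME_RE.match(row.strip())
        if m:
            return m.group("routine"), int(m.group("line")), m.group("source")
    return None

# Excerpt of the traceback posted above.
sample = """\
ccsm.exe           000000000141354B  Unknown               Unknown  Unknown
libpthread.so.0    00002B602B990500  Unknown               Unknown  Unknown
ccsm.exe           00000000008A6C3D  gw_drag_mp_gw_dra        2096  gw_drag.F90
ccsm.exe           000000000089BF8F  gw_drag_mp_gw_int         809  gw_drag.F90
"""

print(first_known_frame(sample))
```

Frames without debug information have "Unknown" in the line column, so they fail the numeric match and are skipped. The innermost frame with a known source is then the place to start looking.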
 
Thanks for the very quick reply!

Just to make sure that I got you right: increasing the number of dynamics substeps would mean increasing "nsplit", wouldn't it? I tried this already (value of 12 instead of 8). Reducing the timestep would mean decreasing "dtime", right? As far as I understand, this has to be set identically in cam.buildnml.csh and clm.buildnml.csh.

Edit: OK, I realized that "ATM_NCPL" in env_conf.xml has to be adjusted as well. Do you have any recommendation on which "dtime" to use? I would try 1200 (20 minutes) now, but maybe you have different advice.

Thanks a lot,
Tim
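As a rough sketch of the arithmetic behind the ATM_NCPL remark above: ATM_NCPL is the number of atmosphere coupling intervals per day, so, assuming the atmosphere takes one physics step per coupling interval, the two settings are tied by dtime = 86400 / ATM_NCPL. The helper name `atm_ncpl_for_dtime` is hypothetical, and the sketch says nothing about which dtime is actually stable for this run.

```python
SECONDS_PER_DAY = 86400

def atm_ncpl_for_dtime(dtime_seconds):
    """Coupling intervals per day for a given timestep (assumes dtime = 86400 / ATM_NCPL)."""
    if dtime_seconds <= 0 or SECONDS_PER_DAY % dtime_seconds != 0:
        raise ValueError("dtime must be a positive divisor of 86400 s")
    return SECONDS_PER_DAY // dtime_seconds

# Candidate timesteps in seconds and the matching ATM_NCPL values.
for dtime in (1800, 1200, 900):
    print(dtime, atm_ncpl_for_dtime(dtime))
```

Under this assumption a dtime of 1200 s pairs with ATM_NCPL = 72. A timestep that does not divide the day evenly is rejected rather than rounded.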
 