WACCM crashing after 95 years

lantao@ucar_edu · Jan 15, 2014

Hello, I am running CESM1.1.1 (WACCM4) with prescribed sea ice and SST forcing (F_2000_WACCM). It crashes after 95 years. Right before the model crashes, the error message is as follows:

78: findsp not converging at point i, k            2          54
78: t, q, p, enin    202.013868948721 5.886565323329356E-002
78:   23194.1639514279        369817.680521060
78: tsp, qsp, enout    652.677208061405       -3.77411252009257
78: -5985690.77341165
78: ENDRUN:cldwat::FINDSP -- not converging
78:Abort(1) on node 78 (rank 78 in comm 1140850688): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 78
78:INFO: 0031-306 pm_atexit: pm_exit_value is 1.
INFO: 0031-251 task 26 exited: rc=1
ERROR: 0031-300 Forcing all remote tasks to exit due to exit code 1 in task 26

Could anyone give some suggestions on how to deal with this? Thanks, Lantao

santos · Jan 21, 2014

Sorry for the slow response, Lantao. I've just gotten back from vacation.

Until recently, the water vapor saturation routines, such as findsp, could behave badly when given cold inputs; they were rewritten for CESM 1.2. Additionally, WACCM operates on the edge of numerical stability for the FV dycore; this means that in rare cases it can encounter specific combinations of temperature, pressure, and humidity that cause a crash in these routines.

The simplest thing that you could try would be to use the namelist option "nspltvrm = 2". This should improve model stability without changing the climate, and hopefully will prevent this error from occurring again.

lantao@ucar_edu · May 1, 2014

Hi Sean,

My WACCM experiment still crashes intermittently, this time for a different reason. The error message is as follows:

39: imp_sol: Time step 1.1250000000000E+01 failed to converge @ (lchnk,lev,col,nstep) = 420 53 2******
39: imp_sol: Failed to converge @ (lchnk,lev,col,nstep,dt,time) = 420 53 2****** 1.1250000000000E+01 1.8000000000000E+03
39: CL 1.000E+00
39: CLO 1.000E+00
39: OCLO 1.000E+00
39: HOCL 1.000E+00
39: CLONO2 1.000E+00
39: imp_sol : @ (lchnk,lev,col) = 420 53 2 failed
39: 165 times
INFO: 0031-251 task 149 exited: rc=-11
INFO: 0031-251 task 59 exited: rc=-11
34:forrtl: error (78): process killed (SIGTERM)

Following your advice, I got past the crashes by switching the parameter "nspltvrm" between 2 and 1 (the default in CESM1.1.1). This works temporarily, but the crashes still occur intermittently, roughly every 20 years, for either value of nspltvrm. Is there any other solution for this problem? Thanks a lot for the help.

lantao@ucar_edu · May 1, 2014

And one more question, about the size of the WACCM log files. I noticed that each cesm.log.??????-??????.gz is very large (300 MB), so I have to move these log files out of my home directory frequently; otherwise it will be full within a few weeks. After checking a log file, I found that most of it is filled with warning messages like the ones below:

4: filew failed, worst i, j, qtmp, q = 1 13
10: filew failed, worst i, j, qtmp, q = 1 33
10: -1.384606648208407E-195 -1.310203105267379E-195

These warning messages occupy over 90% of the log file. Is this common in WACCM experiments? Thanks,
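For anyone who wants to check the same thing on their own logs, here is a minimal Python sketch that estimates how much of a log is taken up by these warnings. It is not part of CESM; the function names `warning_share` and `scan_log` are made up for this illustration. It assumes the warnings can be recognized by the string "filew failed", and it counts only the lines containing that string, so continuation lines that hold just the numeric values are not included and the true share is somewhat higher.

```python
import gzip


def warning_share(lines, marker="filew failed"):
    """Count lines containing `marker` in an iterable of text lines.

    Returns (matching_lines, total_lines, byte_share), where byte_share is
    the fraction of all bytes that belong to matching lines.
    """
    total = hits = total_bytes = hit_bytes = 0
    for line in lines:
        size = len(line.encode("utf-8", errors="replace"))
        total += 1
        total_bytes += size
        if marker in line:
            hits += 1
            hit_bytes += size
    share = hit_bytes / total_bytes if total_bytes else 0.0
    return hits, total, share


def scan_log(path, marker="filew failed"):
    """Scan a plain or gzip-compressed log file, streaming line by line."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", errors="replace") as fh:
        return warning_share(fh, marker)
```

Calling `scan_log("cesm.log.example.gz")` (a hypothetical file name) returns the three numbers for that file. Because the file is read as a stream, a 300 MB log does not need to fit in memory.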
