CESM/CAM future run terminates after running for 83 years: is there any problem in the BC?

Dear Users,
I am running CAM via the CESM scripts, using CESM1.2.0, for a future period under RCP8.5 scenario forcings. I prepared the boundary conditions by adding the present-day climatological SST and SIC bias of the AOGCM (HadGEM2-ES, realization r4i1p1) to its future output, in the following steps:
1.) Calculated the present-day climatological bias for ts and sic from the coupled model output.
2.) Added that present-day climatological bias to the future ts and sic from the coupled model output for RCP4.5 and RCP8.5.
3.) Saved the bias-corrected ts and sic as separate future SST and SEAICE files and created the model-readable boundary condition through the time-diddling process using the icesst tool.
4.) Used the regrid utility to regrid the SST and sea ice onto the model grid.
5.) Used the bcgen utility to interpolate the SST and sea ice and write the output in the model-readable input file format.

I created the future case as follows:
create_newcase -case HadGEM2_RCP85_r4i1p1_9x125_2006-2099_1 -res f09_f09 -user_compset RCP8_CAM5_CLM40%SP_CICE%PRES_DOCN%DOM_RTM_SGLC_SWAV -mach IITD -mpilib mpich -compiler pgi
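(Going back to steps 1.) and 2.) above, here is a minimal sketch of the bias arithmetic using NCO; all file names are hypothetical, and it assumes monthly-mean ts/sic fields with a record time dimension. The corrected fields then go through the icesst/regrid/bcgen chain of steps 3.)-5.).)
 # Step 1: present-day climatological bias (reference climatology minus the AOGCM present-day climatology)
 ncdiff -O ref_presentday_clim.nc hadgem2es_presentday_clim.nc clim_bias.nc
 # Step 2: add the bias to a future monthly mean; loop over every month of the RCP8.5 series,
 #         pairing each calendar month with the matching record of the 12-month bias, e.g. for December:
 ncks -O -d time,11 clim_bias.nc clim_bias_12.nc
 ncbo -O --op_typ=add hadgem2es_rcp85_208312.nc clim_bias_12.nc ts_sic_corrected_208312.nc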

After running for 83 years, the run got stuck at 18 December 2083 and I got the following error:
----------------------------------------------------------------------------------------------------------------------------------------------------
 QNEG3 from convect_deep/Q:m=  1 lat/lchnk=   2940 Min. mixing ratio violated at    1 points.  Reset to  1.0E-12 Worst =-1.9E-05 at i,k=   2  8
 BalanceCheck: soil balance error nstep =   1365901 point = 17281 imbalance =   -0.000003 W/m2
 imp_sol: Time step   1.8000000000000E+03 failed to converge @ (lchnk,lev,col,nstep) =   3719    15     2******
 imp_sol: Time step   9.0000000000000E+02 failed to converge @ (lchnk,lev,col,nstep) =   3719    15     2******
 imp_sol: Time step   4.5000000000000E+02 failed to converge @ (lchnk,lev,col,nstep) =   3719    15     2******
 imp_sol: Time step   2.2500000000000E+02 failed to converge @ (lchnk,lev,col,nstep) =   3719    15     2******
 imp_sol: Time step   1.1250000000000E+02 failed to converge @ (lchnk,lev,col,nstep) =   3719    15     2******
 imp_sol: Time step   4.5000000000000E+01 failed to converge @ (lchnk,lev,col,nstep) =   3719    15     2******
 imp_sol: Failed to converge @ (lchnk,lev,col,nstep,dt,time) =   3719    15     2******  4.5000000000000E+01  1.1250000000000E+02
 DMS       1.000E+00
 imp_sol : @ (lchnk,lev,col) =          3719           15            2  failed
             6  times
 CALEDDY: Warning, CL with zero TKE, i, kt, kb             2            6
           16
 CALEDDY: Warning, CL with zero TKE, i, kt, kb             2            6
           16
 CALEDDY: Warning, CL with zero TKE, i, kt, kb             2            6
           16
 CALEDDY: Warning, CL with zero TKE, i, kt, kb             2            6
           16
 CALEDDY: Warning, CL with zero TKE, i, kt, kb             2            6
           16
 Lagrangian levels are crossing
 Run will ABORT!
 Suggest to increase NSPLTVRM
(shr_sys_abort) WARNING: calling shr_mpi_abort() and stopping
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 970 in communicator MPI_COMM_WORLD
with errorcode 1001.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
--------------------------------------------------------------------------------------------------------------------------------------------------
I checked, and NSPLTVRM = 2.
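(For reference, one way to confirm the value CAM is actually using is to regenerate and inspect the resolved namelists from the case directory; a minimal sketch, assuming the standard CESM1.2 preview_namelists/CaseDocs workflow. If nspltvrm does not appear in atm_in, the dycore's built-in default is being used.)
 ./preview_namelists
 grep -i nspltvrm CaseDocs/atm_in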

I am attaching an animation of one year (mid-2083 to mid-2084) of the SST and SIC boundary conditions here. "Model TS" refers to the AOGCM ts and sic data, and "BC" refers to the generated boundary condition.

Please take a look: is there anything suspicious in the BC (especially December 2083)? Any suggestions would be welcome.

I would really appreciate your help.

Thank You.
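(A quick way to sanity-check the generated BC file around December 2083 is to dump the minimum and maximum of the SST and ice-cover fields for that month. A minimal sketch with NCO, assuming the usual bcgen variable names SST_cpl and ice_cov and a UDUnits-readable time axis; otherwise select the record by index. Here bc_file.nc stands for the actual boundary-condition file.)
 ncks -O -v SST_cpl,ice_cov -d time,"2083-12-01","2083-12-31" bc_file.nc dec2083.nc
 ncwa -O -y min dec2083.nc dec2083_min.nc && ncdump dec2083_min.nc
 ncwa -O -y max dec2083.nc dec2083_max.nc && ncdump dec2083_max.nc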
 
Hi,
I tried to run the model with an increased NSPLTVRM namelist variable, as suggested in the post https://bb.cgd.ucar.edu/node/1002415. I used the following settings:
 nspltvrm = 4
 state_debug_checks = .true.

The model terminated at the initialization step with the following error message:
(shr_stream_set) filename = /home/cas/phd/asz118159/inputdata/atm/cam/sst/HadGEM2-ES_rcp85_r4i1p1_RCo2_9x125_2006-2099_full.nc
MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
(shr_sys_abort) ERROR: Bad namelist settings for FV subcycling.
(shr_sys_abort) WARNING: calling shr_mpi_abort() and stopping
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1001.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
Is there any mistake in how I set nspltvrm? Please advise. The log is attached here. Thanking you in anticipation.
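(For context, the "Bad namelist settings for FV subcycling" abort appears to come from a consistency check on the three splitting parameters; as far as I can tell, nspltvrm needs to divide nspltrac, which in turn needs to divide nsplit, so raising nspltvrm alone can trip it. A sketch of a consistent combination in user_nl_cam, using the illustrative values that eventually worked later in this thread:)
 ! user_nl_cam -- nsplit/nspltrac/nspltvrm nest evenly
 nsplit   = 8
 nspltrac = 8
 nspltvrm = 4
 state_debug_checks = .true.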
 
Hi,
I tried two small runs starting from 2083-12-01 for five months, with the following settings of nsplit, nspltrac, and nspltvrm.

1.) I passed the following namelist settings in user_nl_cam:
 nsplit = 12
 nspltrac = 12
 nspltvrm = 6
 state_debug_checks = .true.

This test ran up to 2084-01-12 and then terminated with the following error message:
---------------------------------------------------------
 QNEG3 from TPHYSBCb:m=  5 lat/lchnk=   3264 Min. mixing ratio violated at    1 points.  Reset to  0.0E+00 Worst =-4.6E-07 at i,k=   2  1
 ERROR: shr_assert_in_domain: state%t has invalid value
   -75723315846.43571       at location:             2            9
 Expected value to be greater than     0.000000000000000
(shr_sys_abort) ERROR: Invalid value produced in physics_state by package radheat.
(shr_sys_abort) WARNING: calling shr_mpi_abort() and stopping
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 72 in communicator MPI_COMM_WORLD
with errorcode 1001.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
 QNEG3 from TPHYSBCb:m=  5 lat/lchnk=   1146 Min. mixing ratio violated at    1 points.  Reset to  0.0E+00 Worst =-1.0E-08 at i,k=   2  1
--------------------------------------------------------------------------
--------------------------------------------------------------------------------
Now I again have doubts about the boundary conditions: something in the radiation module is causing the model to crash/blow up. I have attached the monthly GIF file in the first post; please have a look at it.

2.) I tried another namelist setting:
 nsplit = 8
 nspltrac = 8
 nspltvrm = 4
 state_debug_checks = .true.

Now it gets past all the crash points and has run successfully into model day 2084-02-11. Still, I am not confident whether it is the boundary conditions or something else under RCP8.5 that is causing the issue. Please help. Thanking you in anticipation.
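(For completeness, a short test run starting at 2083-12-01, as above, would typically be set up from the existing case with xmlchange plus the restart files from the long run. A rough sketch using CESM1.2 xmlchange syntax; whether a branch or a startup-from-restart is used, and the exact restart-file handling, are assumptions here.)
 ./xmlchange -file env_run.xml -id RUN_TYPE    -val branch
 ./xmlchange -file env_run.xml -id RUN_REFCASE -val HadGEM2_RCP85_r4i1p1_9x125_2006-2099_1
 ./xmlchange -file env_run.xml -id RUN_REFDATE -val 2083-12-01
 ./xmlchange -file env_run.xml -id STOP_OPTION -val nmonths
 ./xmlchange -file env_run.xml -id STOP_N      -val 5
 # copy the 2083-12-01 restart files into the run directory, then build and submit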
 

Erik Kluzek
CSEG and Liaisons
Can you see the traceback in the cesm.log file to find exactly where it is aborting? If it doesn't show, try running with DEBUG=TRUE. The CESM and CAM user's guides have hints about how to debug problems; it would be good to look there.
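(A minimal sketch of turning DEBUG on for an existing CESM1.2 case; the clean/build scripts follow the usual $CASE.build naming, with the case name from this thread substituted:)
 ./xmlchange -file env_build.xml -id DEBUG -val TRUE
 ./HadGEM2_RCP85_r4i1p1_9x125_2006-2099_1.clean_build
 ./HadGEM2_RCP85_r4i1p1_9x125_2006-2099_1.build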
 

Zhibo Li
New Member
Bug solved in a strange way.
I changed the number of CPUs from 320 (10 nodes) to 192 (6 nodes), and the high-resolution model then ran pretty well.
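(For reference, the per-component task counts for a CESM1.2 case live in env_mach_pes.xml; changing them is followed by a clean setup and rebuild. A minimal sketch, with the component list abbreviated and the case name left as a placeholder:)
 ./xmlchange -file env_mach_pes.xml -id NTASKS_ATM -val 192
 ./xmlchange -file env_mach_pes.xml -id NTASKS_LND -val 192
 # ...repeat for ICE, OCN, CPL, ROF, GLC, WAV as needed
 ./cesm_setup -clean
 ./cesm_setup
 ./<casename>.clean_build
 ./<casename>.build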
 