CAM5 run time error

murali@uni_no

New Member
Hi, I am running CAM5.1.1 in standalone mode at 0.47x0.63 resolution with 30 levels. It runs in data ocean mode with prescribed daily SST and ice concentration fields from NOAA. The experiment was set up so that the model runs for 20 years, from 1983 to 2002. It runs successfully up to 1998 but crashes there. The crash message in the log file is "_pmiu_daemon(SIGCHLD): PE RANK 2 exit signal Aborted". Just before the crash, there is a message saying "QNEG3 from TPHYSBCb:m= 5 lat/lchnk= 2466 Min. mixing ratio violated at 1 points. Reset to 0.0E+00 Worst =-3.4E-06 at i,k= 4 1". But I don't think this is the cause of the crash, as many similar mixing ratio violations appear earlier in the log file without incident. There are also warnings like "pLCL does not converge and is set to psmin in uwshcu.F90".
The error appears to be at the end of this loop (as indicated by the core dump file):
Code:
        if (a(icol,icol).eq.0.) then
            write(iulog,*) 'singular matrix in gaussj 2'
            do ii = 1, np
            do jj = 1, np
               write(iulog,*) ii, jj, aa(ii,jj), bb(ii,1)
            end do
            end do
            call endrun

Printing the relevant variables at the point of failure gives:

Code:
(gdb) p a(icol, icol)
 = 0
(gdb) p ii
 = -35040
(gdb) p jj
 = 0
(gdb) p aa(ii,jj)
no such vector element
(gdb) p bb(ii,1)
no such vector element
(gdb)
Here is the backtrace from the core dump file:

Code:
(gdb) bt
#0  0x0000000001414dab in raise (sig=) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:41
#1  0x00000000014e4041 in abort () at abort.c:92
#2  0x000000000108a4a2 in MPID_Abort ()
#3  0x000000000106abcc in PMPI_Abort ()
#4  0x000000000103e7cd in pmpi_abort__ ()
#5  0x00000000004b3bef in abortutils::endrun (msg='') at /home/bjerknes/mad042/cesm1_0_4-cam-standalone/models/atm/cam/src/utils/abortutils.F90:36
#6  0x000000000058ba11 in cldwat2m_macro::gaussj (a=..., n=2, np=0, b=..., m=Cannot access memory at address 0x0
)
    at /home/bjerknes/mad042/cesm1_0_4-cam-standalone/models/atm/cam/src/physics/cam/cldwat2m_macro.F90:3535
#7  0x0000000000584ef2 in cldwat2m_macro::mmacro_pcond (lchnk=1053, ncol=-29584, dt=6.9533558063733605e-310, p=..., dp=..., t0=..., qv0=..., ql0=..., qi0=..., nl0=...,
    ni0=..., a_t=..., a_qv=..., a_ql=..., a_qi=..., a_nl=..., a_ni=..., c_t=..., c_qv=..., c_ql=..., c_qi=..., c_nl=..., c_ni=..., c_qlst=..., d_t=..., d_qv=..., d_ql=...,
    d_qi=..., d_nl=..., d_ni=..., a_cud=..., a_cu0=..., landfrac=..., snowh=..., s_tendout=..., qv_tendout=..., ql_tendout=..., qi_tendout=..., nl_tendout=...,
    ni_tendout=..., qme=..., qvadj=..., qladj=..., qiadj=..., qllim=..., qilim=..., cld=..., al_st_star=..., ai_st_star=..., ql_st_star=..., qi_st_star=...)
    at /home/bjerknes/mad042/cesm1_0_4-cam-standalone/models/atm/cam/src/physics/cam/cldwat2m_macro.F90:773
#8  0x00000000008c70ed in macrop_driver::macrop_driver_tend (state=..., ptend_all=..., dtime=1800, landfrac=..., ocnfrac=..., snowh=..., dlf=..., dlf2=..., cmfmc=...,
    cmfmc2=..., ts=..., sst=..., zdu=..., pbuf=Cannot access memory at address 0x7fffffff8c70
) at /home/bjerknes/mad042/cesm1_0_4-cam-standalone/models/atm/cam/src/physics/cam/macrop_driver.F90:788
#9  0x0000000000fccd4b in tphysbc (ztodt=1800, pblht=..., tpert=..., qpert=..., fsns=..., fsnt=..., flns=..., flnt=..., state=Cannot access memory at address 0x7fffffff8d70
)
    at /home/bjerknes/mad042/cesm1_0_4-cam-standalone/models/atm/cam/src/physics/cam/tphysbc.F90:382
#10 0x0000000000a72a54 in physpkg::phys_run1 (phys_state=Cannot access memory at address 0x7fffffff8da0
) at /home/bjerknes/mad042/cesm1_0_4-cam-standalone/models/atm/cam/src/physics/cam/physpkg.F90:665
#11 0x000000000050b65a in cam_comp::cam_run1 (cam_in=Asked for position 0 of stack, stack only has 0 elements on it.
) at /home/bjerknes/mad042/cesm1_0_4-cam-standalone/models/atm/cam/src/control/cam_comp.F90:218
#12 0x00000000004dfaf9 in atm_comp_mct::atm_run_mct (eclock=..., cdata_a=..., x2a_a=..., a2x_a=...)
    at /home/bjerknes/mad042/cesm1_0_4-cam-standalone/models/atm/cam/src/cpl_mct/atm_comp_mct.F90:523
#13 0x0000000000565eec in ccsm_comp_mod::ccsm_run () at /home/bjerknes/mad042/cesm1_0_4-cam-standalone/models/drv/driver/ccsm_comp_mod.F90:2165
#14 0x0000000000569959 in ccsm_driver () at /home/bjerknes/mad042/cesm1_0_4-cam-standalone/models/drv/driver/ccsm_driver.F90:47
#15 0x00000000004008d0 in main ()
#16 0x00000000014dea54 in __libc_start_main (main=0x400890 , argc=1, ubp_av=0x7fffffffa198, init=0x3, fini=0xfcaad68, rtld_fini=0, stack_end=0x7fffffffa188)
    at libc-start.c:226
#17 0x00000000004007a5 in _start () at ../sysdeps/x86_64/elf/start.S:113
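Individual frames can also be selected and their arguments printed with standard gdb commands, for example:

Code:
(gdb) frame 7
(gdb) print lchnk
(gdb) print ncol

With this optimized production build, though, the values printed this way may not be reliable, which is presumably why ncol shows up as -29584 above.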
Could anyone comment on this issue? Thanks!

murali@uni_no

New Member
I have an update on the above issue. I tried a simple continuation run from a few model days before the crash, using the corresponding restart files, just to make sure the crash had nothing to do with memory issues. The model crashed at exactly the same point as before, with the same messages. Then I tried a branch run from the same point, so that I could modify some parameters and resolve the issue. This time the model ran successfully. I expected the branch run to be bit-for-bit identical and to crash at the same point as before, since I did not change anything in the namelists except providing values for cam_branch_file, nrevsn, ice_ic, restfilm and restart_file, which correspond to the master restart files for atm, clm, cice, docn and cpl (sketched at the end of this post). Furthermore, the output is not the same as before; there are differences.

Additionally, there is a shift in time at the starting point of the model run. This holds only for the history stream h1 (which contains 6-hourly instantaneous fields). The master restart files correspond to the time 1998-07-01-00000, so the model should write the h1 files as 1998-07-**-00000 (each file containing 4 time levels). Instead, I get them as 1998-07-01-21600, with the very first time level skipped. I have cross-checked and confirmed that the time settings in the restart files and the namelists match. Maybe this is a trivial issue, but I am not able to trace it. Doing branch runs in standalone CAM is not well documented; I collected information from different documents and got the run working, so it is possible that I am missing something.

In short, I am wondering:
1) Why does the model run successfully in branch mode when no parameters were changed?
2) Why is the output different from the start-up run?
3) What causes the model to skip the first time level?

I have uploaded the namelist settings I used for the start-up run and for the branch run. If anyone has the slightest idea about these, please comment.

Regards,
Muralidhar
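P.S. A minimal sketch of the branch-run restart settings referred to above (namelist group names omitted; every path is a placeholder, not the actual value I used):

Code:
! CAM (atm):
cam_branch_file = '/path/to/atm_restart.nc'
! CLM (lnd):
nrevsn          = '/path/to/lnd_restart.nc'
! CICE (ice):
ice_ic          = '/path/to/ice_restart.nc'
! DOCN (ocn):
restfilm        = '/path/to/ocn_restart.nc'
! Driver (cpl):
restart_file    = '/path/to/cpl_restart.nc'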
 

eaton

CSEG and Liaisons
A branch run should give results identical to the restart, as you expected. If it doesn't, then something is wrong. Sometimes it's a feature of the system that changed and that you don't have any control over. In your original post there is not much useful information in the debugger output. This is typical of a production run where the executable was not built with debug flags. In that case it is often useful to rebuild the executable with debug options on and then do the restart run. Sometimes this will provide more information about the failure. Other times the run will go right past the original point of failure, which indicates a possible problem with the optimized code, or perhaps the system had a failure during the original run. These kinds of problems can be extremely difficult to track down.

I think the output from the branch run in the h1 file is correct. On a branch you won't get a time sample at 1998-07-01-00000 because that output was part of the run that you branched from. On the other hand, an initial run starting from 1998-07-01-00000 will contain a time sample in the h1 file at hour 0, because the initial conditions (updated by a partially complete timestep) are written to all history file sequences except the monthly average one.
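With the standalone CAM build this typically just means re-running the configure utility with its debug switch and rebuilding, along these lines (check configure --help for the exact option name in your version; the switch below is from memory):

Code:
./configure -debug [same options as the original build]
gmake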
  
 

murali@uni_no

New Member
Thank you, Eaton! I did a simple continuation of the crashed simulation in debug mode. The error message in the log file now is "PE RANK 2 exit signal Floating point exception" at the crash time. It appears to come from an arithmetic operation in the subroutine qsinvert() in the module uwshcu.F90, as indicated by the core dump file (attached).

I think this problem is connected to the "pLCL does not converge and is set to psmin in uwshcu.F90" and "mixing ratio violated at ..." messages appearing in the log file. How can we resolve this? I noticed in the output history files that over some points lying directly over coastlines, specific humidity has values like 9.9e+36. Can this be the cause of the error? If it is, I wonder how the model managed to run successfully, and reasonably well, for 15 years.

One more question: can we change the model time step and parametrization schemes in a branch run, or is that possible only in a hybrid run?
 

eaton

CSEG and Liaisons
Tracking down this kind of problem is never easy. My first assumption would be that qsinvert is getting an unrealistic atm state, so I'd try to identify the column that is causing the problem in qsinvert and then trace back where the bad value is coming from. It's also possible that qsinvert has a bug that is triggered by a realistic atm state, but that seems less likely. When there is an obvious problem like a specific humidity value of 9.9e+36, tracing back where that value is coming from would be the first thing to do; a sketch of the kind of check I mean follows.
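Something along these lines (a sketch only, not CAM code; the qv/iulog/lchnk names follow CAM conventions and the bounds are arbitrary) could be called on the state just before qsinvert to flag the offending column:

Code:
subroutine check_qv(lchnk, ncol, pver, qv, iulog)
   ! Sketch only: flag unphysical or fill-value specific humidities and
   ! report chunk/column/level so the bad state can be traced upstream.
   integer, intent(in) :: lchnk, ncol, pver, iulog
   real(8), intent(in) :: qv(ncol,pver)   ! specific humidity [kg/kg]
   integer :: i, k

   do k = 1, pver
      do i = 1, ncol
         ! qv /= qv is true only for NaN; 0.1 kg/kg is a generous physical
         ! upper bound, so fill values like 9.9e+36 are caught as well.
         if (qv(i,k) < 0._8 .or. qv(i,k) > 0.1_8 .or. qv(i,k) /= qv(i,k)) then
            write(iulog,*) 'check_qv: bad qv at lchnk,i,k =', lchnk, i, k, qv(i,k)
         end if
      end do
   end do
end subroutine check_qv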
Sometimes reducing the timestep will be a successful way to get a run going that has encountered a stability problem. This has to be done using a hybrid run.
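For the timestep itself, the change is just the dtime setting in the CAM namelist of the hybrid run; for example, halving the 1800 s step seen in the backtrace (900 is only an illustration):

Code:
&cam_inparm
 dtime = 900   ! model timestep in seconds; the original run used 1800
/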
 