
Model crashes, but powers through when restarted (CONTINUE_RUN=TRUE)?

wsy

siyuan wang
New Member
Hello folks, I'm running CAM-chem with some updates to the chemistry. The model runs fine in short tests (e.g., a few days to a month) and produces reasonable results, but it sometimes crashes in long tests (say, 1 year). One example is the run with the "failed to converge" messages shown in the cesm.log excerpt below, which crashed about 8 months into the simulation. I happened to have restart files written 1-2 weeks before this failure, so I restarted from that point (CONTINUE_RUN=TRUE), hoping to reproduce the error and debug it. Oddly, the restarted run powers through the point of the original crash and keeps running. I vaguely remember someone mentioning that an occasional "failed to converge" message is probably not the end of the world, so long as the model keeps running.

I wonder whether anyone has an idea of what might be happening. Thank you in advance.

---------------------
...
dec0417.hsn.de.hpc.ucar.edu 43: isw= 13 specrefindex(isw)= (1.56000000000000,5.500000000000000E-003)
dec0417.hsn.de.hpc.ucar.edu 43: specdens= 2600.00000000000
dec0417.hsn.de.hpc.ucar.edu 43: l= 14 vol(l)= 4.344120569520975E-040
dec0417.hsn.de.hpc.ucar.edu 43: isw= 13 specrefindex(isw)= (1.56000000000000,5.500000000000000E-003)
dec0417.hsn.de.hpc.ucar.edu 43: specdens= 2600.00000000000
dec0417.hsn.de.hpc.ucar.edu 43: l= 15 vol(l)= 8.140346183739101E-011
dec0417.hsn.de.hpc.ucar.edu 43: isw= 13 specrefindex(isw)= (1.48399996757507,9.999999939225290E-009)
dec0417.hsn.de.hpc.ucar.edu 43: specdens= 1770.00000000000
dec0437.hsn.de.hpc.ucar.edu 1407: imp_sol: time step 1800.000 failed to converge @ (lchnk,vctrpos,nstep) = 4864 166 5115
dec0417.hsn.de.hpc.ucar.edu 48: imp_sol: time step 1800.000 failed to converge @ (lchnk,vctrpos,nstep) = 2146 166 5115
dec0426.hsn.de.hpc.ucar.edu 689: imp_sol: time step 1800.000 failed to converge @ (lchnk,vctrpos,nstep) = 3427 155 6217
dec0417.hsn.de.hpc.ucar.edu 84: imp_sol: time step 1800.000 failed to converge @ (lchnk,vctrpos,nstep) = 2218 155 6217
dec0430.hsn.de.hpc.ucar.edu 969: imp_sol: time step 1800.000 failed to converge @ (lchnk,vctrpos,nstep) = 3988 167 6239
dec0445.hsn.de.hpc.ucar.edu 1802: imp_sol: time step 1800.000 failed to converge @ (lchnk,vctrpos,nstep) = 5654 155 7056
dec0445.hsn.de.hpc.ucar.edu 1906: imp_sol: time step 1800.000 failed to converge @ (lchnk,vctrpos,nstep) = 5862 155 7056
dec0421.hsn.de.hpc.ucar.edu 289: imp_sol: time step 1800.000 failed to converge @ (lchnk,vctrpos,nstep) = 2628 155 7056
dec0430.hsn.de.hpc.ucar.edu 954: imp_sol: time step 1800.000 failed to converge @ (lchnk,vctrpos,nstep) = 3958 165 9561
dec0445.hsn.de.hpc.ucar.edu 1875: imp_sol: time step 1800.000 failed to converge @ (lchnk,vctrpos,nstep) = 5799 165 10473
dec0421.hsn.de.hpc.ucar.edu 289: imp_sol: time step 1800.000 failed to converge @ (lchnk,vctrpos,nstep) = 2628 152 10665
dec0446.hsn.de.hpc.ucar.edu 1940: imp_sol: time step 1800.000 failed to converge @ (lchnk,vctrpos,nstep) = 5930 152 12056
dec0426.hsn.de.hpc.ucar.edu 665: forrtl: error (78): process killed (SIGTERM)
dec0426.hsn.de.hpc.ucar.edu 665: Image PC Routine Line Source
dec0426.hsn.de.hpc.ucar.edu 665: libpthread-2.31.s 00001517CC6828C0 Unknown Unknown Unknown
dec0426.hsn.de.hpc.ucar.edu 665: libmpi_intel.so.1 00001517CBBE4976 Unknown Unknown Unknown
dec0426.hsn.de.hpc.ucar.edu 657: forrtl: error (78): process killed (SIGTERM)
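
To see whether the "failed to converge" messages cluster at particular chunks, vector positions, or time steps before the SIGTERM, the log can be tallied with a small script. This is only a rough sketch: it assumes the message format is exactly as in the excerpt above, the function names are made up for illustration, and it simply counts the three integers as printed without interpreting them.

```python
import re
import sys
from collections import Counter

# Matches lines like (format assumed from the excerpt above):
# "... imp_sol: time step 1800.000 failed to converge @ (lchnk,vctrpos,nstep) = 4864 166 5115"
PATTERN = re.compile(
    r"imp_sol: time step\s+([\d.]+)\s+failed to converge @\s*"
    r"\(lchnk,vctrpos,nstep\)\s*=\s*(\d+)\s+(\d+)\s+(\d+)"
)


def parse_convergence_failures(lines):
    """Return (dt, lchnk, vctrpos, nstep) tuples for each matching log line."""
    failures = []
    for line in lines:
        m = PATTERN.search(line)
        if m:
            failures.append(
                (float(m.group(1)), int(m.group(2)), int(m.group(3)), int(m.group(4)))
            )
    return failures


def summarize(failures):
    """Count how often each vctrpos and each nstep value appears."""
    by_vctrpos = Counter(f[2] for f in failures)
    by_nstep = Counter(f[3] for f in failures)
    return by_vctrpos, by_nstep


if __name__ == "__main__":
    # Hypothetical default path; pass the real log file as the first argument.
    path = sys.argv[1] if len(sys.argv) > 1 else "cesm.log"
    with open(path, errors="replace") as fh:
        failures = parse_convergence_failures(fh)
    by_vctrpos, by_nstep = summarize(failures)
    print(f"{len(failures)} convergence failures")
    print("most common vctrpos:", by_vctrpos.most_common(5))
    print("most common nstep:", by_nstep.most_common(5))
    if failures:
        print("last failing nstep:", max(f[3] for f in failures))
```

Running the same tally on the log from the restarted run would show whether the same (lchnk, vctrpos, nstep) combinations appear there as well.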
 