ice: Vertical thermo error

Hi all,

I am running a CESM 1.0.6 B1850CN case, which crashed at model year 391. Below is the error message I got just before the line "359:forrtl: error (78): process killed (SIGTERM)". I found someone else on the forum who encountered a similar error and solved it by reducing the timestep of the ocean component. Should I do the same? Why not the timestep of the ice component?

Thanks a lot,
Shineng

210: Thermo iteration does not converge,istep1, my_task, i, j:      183277
210:         210           5          71
210: Ice thickness:  0.304990231284917
210: Snow thickness:  0.000000000000000E+000
210: dTsf, Tsf_errmax:  8.681028118573408E-012  5.000000000000000E-004
210: Tsf:  0.000000000000000E+000
210: fsurf:   7.97436120047061
210: fcondtop, fcondbot, fswint   7.97436120080571        16.8896766960386
210:   19.2628045380274
210: fswsfc, fswthrun   28.9156069993825        27.6167851718069
210: Flux conservation error =  3.551186011918617E-010
210: Internal snow absorption:
210:  0.000000000000000E+000
210: Internal ice absorption:
210:   10.9082961390655        4.73599459455542        2.30521393613914
210:   1.31329986826734
210: Initial snow temperatures:
210:  0.000000000000000E+000
210: Initial ice temperatures:
210: -0.191802146393597      -0.737987439304102       -1.09846369089355
210:  -1.42274118270238
210: Final snow temperatures:
210:  0.000000000000000E+000
210: Final ice temperatures:
210: -0.191205075489690      -0.735301673292061       -1.09701825916023
210:  -1.42964223580686
210: istep1, my_task, iblk =      183277         210           1
210: Global block:         211
210: Global i and j:         204         326
210: Lat, Lon:   61.2282675113457       -166.461789478145
210:(shr_sys_abort) ERROR: ice: Vertical thermo error
210:(shr_sys_abort) WARNING: calling shr_mpi_abort() and stopping
INFO: 0031-251  task 210 exited: rc=-11
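For anyone scripting the diagnosis, the location of the failing point can be pulled out of an abort message like the one quoted in this post. This is only a minimal sketch: it assumes the "Global i and j:" and "Lat, Lon:" labels appear exactly as in this log, and the function name crash_location is made up for illustration rather than taken from any CESM tool.

```python
import re

# Excerpt of the abort message quoted in this post.
LOG_EXCERPT = """\
210: istep1, my_task, iblk =      183277         210           1
210: Global block:         211
210: Global i and j:         204         326
210: Lat, Lon:   61.2282675113457       -166.461789478145
210:(shr_sys_abort) ERROR: ice: Vertical thermo error
"""

# Signed decimal number, optionally with an exponent.
NUMBER = r"[-+]?\d+(?:\.\d*)?(?:[eE][-+]?\d+)?"


def crash_location(log_text):
    """Return the failing point's global indices and coordinates.

    Hypothetical helper: it relies only on the two labels seen in this
    particular log and raises if either one is missing.
    """
    ij = re.search(r"Global i and j:\s*(\d+)\s+(\d+)", log_text)
    ll = re.search(rf"Lat, Lon:\s*({NUMBER})\s+({NUMBER})", log_text)
    if ij is None or ll is None:
        raise ValueError("no CICE crash location found in log text")
    return {
        "i": int(ij.group(1)),
        "j": int(ij.group(2)),
        "lat": float(ll.group(1)),
        "lon": float(ll.group(2)),
    }


location = crash_location(LOG_EXCERPT)
print(location)
```

The indices identify the grid cell to inspect in gridded output, and the latitude and longitude are the values that would go into a point diagnostic.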
 

dbailey

CSEG and Liaisons
Staff member
This is normally indicative of a problem somewhere else in the system, but it is probably worth adding an FAQ on this. Here are the steps I would suggest:

1. Turn on frequent history output from the coupler starting from the last restart. This is HIST_OPTION and HIST_N depending on the version of the code. Look carefully at all of the fields going into the CICE model.

2. If everything makes physical sense going into the ice, then you can see if everything makes physical sense within the ice using the following CICE namelist changes:

print_points = .true.
latpnt = latn, lats
lonpnt = lonn, lons
diagfreq = 1

where latn/lonn and lats/lons are the latitudes and longitudes of two points, one in the northern hemisphere and one in the southern. Change one set of these values to correspond to the values from your error output. Rerun the model from the last restart.

3. If everything there looks ok, you can attempt to increase the iterations in the thermodynamics. Copy the source code module ice_therm_vertical.F90 into SourceMods/src.cice, increase nitermax to 200, and rerun from the last restart. This does not usually help.

4. The final thing to try is decreasing the thermodynamic timestep in the CICE model. This can only be done by changing the coupling interval with the atmosphere (ATM_NCPL/ICE_NCPL). Increase these values; more coupling intervals per day means a shorter timestep. Note that you cannot change these values in a 'branch' or 'continue' run with CAM, so it will require a new run with a 'hybrid' start. If you are using the DATM, you can change these in all types of runs.

Dave
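To make steps 2 and 4 concrete, here is a rough Python sketch. It fills the print_points namelist lines with the crash point from the error output, and it shows the arithmetic behind the coupling interval, assuming ATM_NCPL counts coupling intervals per day. The helper names, the southern point (-65, -45), and the values 48 and 96 are illustrative assumptions, not settings taken from this case.

```python
SECONDS_PER_DAY = 86400


def print_points_namelist(north, south):
    """Format the step 2 namelist lines for two (lat, lon) points.

    Hypothetical helper: `north` is the northern-hemisphere point and
    `south` the southern one, matching the latn/lats ordering above.
    """
    (latn, lonn), (lats, lons) = north, south
    return "\n".join([
        "print_points = .true.",
        f"latpnt = {latn:.4f}, {lats:.4f}",
        f"lonpnt = {lonn:.4f}, {lons:.4f}",
        "diagfreq = 1",
    ])


def coupling_interval_seconds(ncpl_per_day):
    """Length of one coupling interval, assuming NCPL is a per-day count."""
    if ncpl_per_day <= 0:
        raise ValueError("coupling count must be positive")
    return SECONDS_PER_DAY / ncpl_per_day


# Crash point from the error output; the southern point is a placeholder.
namelist_text = print_points_namelist(
    north=(61.2282675113457, -166.461789478145),
    south=(-65.0, -45.0),
)
print(namelist_text)

# Illustrative values only: doubling the count halves the interval.
for ncpl in (48, 96):
    print(ncpl, coupling_interval_seconds(ncpl))
```

Under that assumption, going from 48 to 96 intervals per day takes the interval from 1800 s to 900 s, which is the sense in which increasing the values shortens the ice thermodynamic timestep.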
 

duvivier

CSEG and Liaisons
Staff member
Hi,

These errors can result from many things, including weird fluxes passed through the coupler, so changing the ocean timestep may not help in your case. I'd suggest outputting frequent coupler history files immediately before the error occurs to see if there are any weird fluxes at the location of the crash; from there you may be better able to determine how to fix this. The point you want to look at is identified as:

210: Global i and j:         204         326

Alice
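One way to look for weird values at that point, once the coupler history fields have been read into memory, is sketched below. This is a hedged illustration, not a CESM utility: the field names, the plausibility limits, and the tiny demo grid are all invented, and the sketch assumes each field is stored as a [j][i] array so that the 1-based global i and j from the log map to [j - 1][i - 1]. Reading the actual history files is left out.

```python
import math


def suspicious_fields(fields, global_i, global_j, limits):
    """Report fields that are non-finite or out of range at one grid point.

    fields: dict of name -> nested list indexed [j][i], zero-based (assumed)
    global_i, global_j: 1-based indices as printed in the error output
    limits: dict of name -> (low, high) plausibility bounds (assumed)
    """
    j, i = global_j - 1, global_i - 1
    report = {}
    for name, grid in fields.items():
        value = grid[j][i]
        low, high = limits.get(name, (-math.inf, math.inf))
        if not math.isfinite(value):
            report[name] = (value, "non-finite")
        elif not low <= value <= high:
            report[name] = (value, f"outside [{low}, {high}]")
    return report


def blank(ni, nj, fill):
    """Build an nj-by-ni grid filled with one value."""
    return [[fill] * ni for _ in range(nj)]


# Made-up 4 x 3 demo grid with two bad values planted at i=2, j=3.
demo_fields = {
    "air_temperature_K": blank(4, 3, 270.0),
    "shortwave_down_Wm2": blank(4, 3, 120.0),
}
demo_fields["air_temperature_K"][2][1] = 150.0
demo_fields["shortwave_down_Wm2"][2][1] = float("nan")

demo_limits = {
    "air_temperature_K": (180.0, 330.0),
    "shortwave_down_Wm2": (0.0, 1500.0),
}

report = suspicious_fields(demo_fields, global_i=2, global_j=3,
                           limits=demo_limits)
for name, (value, reason) in report.items():
    print(name, value, reason)
```

For the crash in this thread the call would use global_i=204 and global_j=326, with limits chosen to suit each field that is actually passed to the ice model.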
 
