An issue that can cause POP2 to crash on restart

santos

Member
There are a few cases where POP2 crashes during a restart due to what appears to be a convergence issue, but is actually a problem caused by reading corrupt data from a restart file. The apparent error in the restart log:
Code:
POP Exiting...
POP_SolversChronGear: solver not converged
POP_SolverRun: error in ChronGear
POP_BarotropicDriver: error in solver
Step: error in barotropic
The problem is due to a POP2 namelist setting that is not updated correctly. Some compsets have a binary init_ts_file, with the matching format setting:
init_ts_file_fmt = 'bin'
Upon restarting, the new file is a netCDF file, and this setting should be used instead:
init_ts_file_fmt = 'nc'
In some early CESM1.1 betas, this problem happened all the time (?). After it was fixed, the problem would still occur if there was a problem in preview_namelists. There are two known cases where this has happened:
  1. Until recently (i.e. until later CESM1.2 betas), the chemistry preprocessor would cause CAM's configure script to fail on yellowstone batch nodes, which would indirectly prevent POP's build-namelist from running. The solution in this case is to simply run preview_namelists after the initial run and before the first restart.
  2. Ryan Neely has encountered this problem on Zeus in CESM1.0.5. It is not yet clear whether this is a bug in CESM1.0.5 itself, or a problem with the port to Zeus.
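To see whether a case is in the mismatched state described above, one option is to compare init_ts_file_fmt with what the file named by init_ts_file actually looks like on disk. The following is a minimal sketch, not part of CESM: the function names are hypothetical, the namelist parsing is a simple regex that assumes quoted string values, and the format test assumes that a netCDF file starts with either the classic "CDF" signature or the HDF5 signature (netCDF-4), with anything else treated as binary.

```python
import re

NETCDF_CLASSIC_MAGIC = b"CDF"          # classic, 64-bit offset and CDF-5 files
HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"      # netCDF-4 files are HDF5 containers


def detect_file_format(path):
    """Return 'nc' if the file starts with a netCDF signature, otherwise 'bin'."""
    with open(path, "rb") as f:
        header = f.read(8)
    if header.startswith(NETCDF_CLASSIC_MAGIC) or header.startswith(HDF5_MAGIC):
        return "nc"
    return "bin"


def read_setting(namelist_text, name):
    """Return the quoted string assigned to `name`, or None if it is absent."""
    pattern = r"^\s*" + re.escape(name) + r"\s*=\s*['\"]([^'\"]*)['\"]"
    match = re.search(pattern, namelist_text, flags=re.MULTILINE)
    return match.group(1) if match else None


def check_init_ts(namelist_text):
    """Return (consistent, declared_fmt, detected_fmt) for the init_ts settings."""
    path = read_setting(namelist_text, "init_ts_file")
    declared = read_setting(namelist_text, "init_ts_file_fmt")
    if path is None or declared is None:
        raise ValueError("init_ts_file or init_ts_file_fmt not found in namelist")
    detected = detect_file_format(path)
    return declared == detected, declared, detected
```

Passing the text of the generated POP namelist to check_init_ts before the restart is submitted would flag a 'bin' setting that points at a netCDF file, or the reverse.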
 

mlevy

Michael Levy
CSEG and Liaisons
Staff member
I have recreated the issue in CESM 1.0.5 on yellowstone:
  1. Create a B1850WCN case (f19_g16 resolution)
  2. Build
  3. Submit - the first run will finish successfully
  4. Submit again - the second run will error out (it will find an rpointer file and set init_ts_file_fmt = 'nc', but because CONTINUE_RUN is false it will set init_ts_file to the initial file, which is binary format)
The fix we put in CESM 1.1 isn't directly applicable (POP uses the build-namelist script to generate pop2_in for CESM 1.1 and later), but I'll port it over to the old build system.
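The mismatch in step 4 can be illustrated with a toy model of the decision logic. This is only a sketch of the behaviour described above, not the actual CESM 1.0.5 script code and not necessarily the fix that went into CESM 1.1; the function names, arguments, and file names are all hypothetical.

```python
def buggy_settings(continue_run, rpointer_exists,
                   initial_file, initial_fmt, restart_file):
    """Choose the file from CONTINUE_RUN but the format from the rpointer file."""
    init_ts_file = restart_file if continue_run else initial_file
    init_ts_file_fmt = "nc" if rpointer_exists else initial_fmt
    return init_ts_file, init_ts_file_fmt


def consistent_settings(continue_run, initial_file, initial_fmt, restart_file):
    """Choose both the file and its format from the same condition."""
    if continue_run:
        return restart_file, "nc"
    return initial_file, initial_fmt


# Second submission with CONTINUE_RUN still FALSE: an rpointer file now exists.
print(buggy_settings(False, True, "initial.bin", "bin", "restart.nc"))
print(consistent_settings(False, "initial.bin", "bin", "restart.nc"))
```

In the first call the binary initial file ends up paired with the 'nc' format, which is the combination that leads to the crash; keying both values on the same condition avoids that pairing.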
 

santos

Member
This post was originally in the WACCM forums, since a WACCM user encountered this problem, but I have moved it here since it seems more appropriate. 
 

rneely

New Member
Hey all, I actually ran into this problem running CESM 1.0.5 on NOAA's Zeus computer after I created a new compset for running a coupled waccm-sc model with 1850 conditions. After working around the problem by manually changing init_ts_file_fmt as above, I found that I could run the model for a day or two in startup mode (CONTINUE_RUN = FALSE) and save a restart file. I could then change CONTINUE_RUN to TRUE and rebuild, after which the model would have init_ts_file_fmt set to 'nc' and would restart and resubmit automatically. It seems that the model just needs to know you want to continue the run before it will set the correct variables.
 

santos

Member
Hi, Ryan.
Setting CONTINUE_RUN to TRUE is the intended way to continue a run; the model is designed to start at the beginning every time you submit a job unless you set this (or set RESUBMIT > 0, which will set CONTINUE_RUN to TRUE for you at the end of the first job). Furthermore, this is a runtime option, so rebuilding should not have been necessary.
Can you let us know what you were doing before, when the crash actually happened? What (if anything) were you changing between the startup and restart runs?
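The interaction between CONTINUE_RUN and RESUBMIT described above can be sketched as a small simulation. This is a toy model only, not the CESM scripts: the function name and the "initial"/"continue" labels are made up for illustration, and it assumes that RESUBMIT acts as a counter that is decremented each time a job resubmits itself.

```python
def job_sequence(continue_run, resubmit):
    """List the kind of run performed by each job in one submission chain."""
    jobs = []
    while True:
        jobs.append("continue" if continue_run else "initial")
        if resubmit <= 0:
            break
        # At the end of the job the scripts resubmit and switch to continuation.
        resubmit -= 1
        continue_run = True
    return jobs


# Submitting twice by hand with CONTINUE_RUN = FALSE starts from the beginning both times.
print(job_sequence(False, 0) + job_sequence(False, 0))
# With RESUBMIT = 2 the first job is an initial run and the rest are continuations.
print(job_sequence(False, 2))
```

The first case corresponds to the reproduction earlier in the thread, where the second manual submission is not a continuation even though restart files from the first run exist.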
 

rneely

New Member
Yes, the only thing I changed between the startup and restart runs was to set CONTINUE_RUN to TRUE and rebuild. This works every time for me, though it seems like the rebuild should be unnecessary. If I did not rebuild, the model would look for the bin files instead of the nc restart files.
 

hannay

Cecile Hannay
AMWG Liaison
Staff member
Is there a fix for this issue? I am running into the problem when starting from a binary file; when I then try to restart, it crashes. Should I try to restart with init_ts_file_fmt = 'nc' instead? Thanks
 

mlevy

Michael Levy
CSEG and Liaisons
Staff member
Can you point me to a case directory? I want to check a couple of things:
1) Make sure your case is using a netCDF restart file with init_ts_file_fmt = 'bin'
2) See what version of CESM you are running, because I think different versions had different fixes (though I might be thinking of a different issue...)
Thanks!
~Mike
 