An issue that can cause POP2 to crash on restart

santos · Apr 11, 2013

There are a few cases where POP2 crashes during a restart due to what appears to be a convergence issue, but is actually a problem caused by reading corrupt data from a restart file. The apparent error in the restart log:

Code:

POP Exiting...
POP_SolversChronGear: solver not converged
POP_SolverRun: error in ChronGear
POP_BarotropicDriver: error in solver
Step: error in barotropic

The problem is due to a POP2 namelist setting that is not updated correctly. Some compsets have a binary init_ts_file, with the appropriate format setting:

init_ts_file_fmt = 'bin'

Upon restarting, the new file is a netCDF file, and this setting should be used:

init_ts_file_fmt = 'nc'

In some early CESM1.1 betas, this problem happened all the time (?). After it was fixed, the problem would still occur if there was a problem in preview_namelists. There are two known cases where this has happened:

Until recently (i.e. until later CESM1.2 betas), the chemistry preprocessor would cause CAM's configure script to fail on yellowstone batch nodes, which would indirectly prevent POP's build-namelist from running. The solution in this case is to simply run preview_namelists after the initial run and before the first restart.
Ryan Neely has encountered this problem on Zeus in CESM1.0.5. It is not yet clear whether this is a bug in CESM1.0.5 itself, or a problem with the port to Zeus.
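
To check whether a case has hit this mismatch, something like the following could be run against the generated pop2_in. This is only a rough sketch, not part of CESM: the helper names (namelist_value, detect_format, check_init_ts) are made up for illustration, it assumes init_ts_file holds a direct path to the data file rather than a pointer file, and it assumes single-quoted namelist strings. It relies only on the fact that netCDF files begin with "CDF" (classic formats) or the HDF5 signature (netCDF-4); anything else is treated as binary.

```python
import re

# Leading bytes of netCDF classic (CDF-1, CDF-2, CDF-5) and HDF5-based netCDF-4 files.
NETCDF_MAGICS = (b"CDF\x01", b"CDF\x02", b"CDF\x05", b"\x89HDF")


def namelist_value(text, name):
    """Return the single-quoted string assigned to `name`, or None if absent."""
    pattern = r"^\s*" + re.escape(name) + r"\s*=\s*'([^']*)'"
    match = re.search(pattern, text, re.MULTILINE)
    return match.group(1) if match else None


def detect_format(header):
    """Classify a file by its first bytes: 'nc' for netCDF, otherwise 'bin'."""
    return "nc" if header.startswith(NETCDF_MAGICS) else "bin"


def read_header(path, nbytes=8):
    """Read the first few bytes of the file at `path`."""
    with open(path, "rb") as handle:
        return handle.read(nbytes)


def check_init_ts(namelist_text, read_header=read_header):
    """Return a warning string if init_ts_file_fmt disagrees with the file, else None.

    Assumption: init_ts_file is a direct path to the data file.
    """
    path = namelist_value(namelist_text, "init_ts_file")
    fmt = namelist_value(namelist_text, "init_ts_file_fmt")
    if path is None or fmt is None:
        return "init_ts_file or init_ts_file_fmt not found in namelist"
    actual = detect_format(read_header(path))
    if actual != fmt:
        return "init_ts_file_fmt = '%s' but %s looks like '%s'" % (fmt, path, actual)
    return None
```

Calling check_init_ts on the text of pop2_in before resubmitting would return a warning string when the declared format and the file contents disagree, and None when they match.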

mlevy · Apr 11, 2013

I have recreated the issue in CESM 1.0.5 on yellowstone:

Create a B1850WCN case (f19_g16 resolution)
Build
Submit - the first run will finish successfully
Submit again - the second run will error out (it will find an rpointer file and set init_ts_file_fmt = 'nc', but because CONTINUE_RUN is false it will set init_ts_file to the initial file, which is binary format)

The fix we put in CESM 1.1 isn't directly applicable (POP uses the build-namelist script to generate pop2_in for CESM 1.1 and later), but I'll port it over to the old build system.
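
The mismatch in the last step can be modelled in a few lines. This is an illustrative sketch only, not the actual CESM scripts: the function names and file names are hypothetical, and the "consistent" version is just one way to keep the two settings in agreement, not necessarily the fix that went into CESM 1.1. It assumes the initial file is binary, as in this compset.

```python
def mismatched_settings(rpointer_exists, continue_run, initial_file, restart_file):
    """Model of the failure: format and file name are chosen from different conditions."""
    # The format is keyed on whether an rpointer file is present...
    fmt = "nc" if rpointer_exists else "bin"
    # ...but the file name is keyed on CONTINUE_RUN.
    path = restart_file if continue_run else initial_file
    return path, fmt


def consistent_settings(continue_run, initial_file, restart_file):
    """One consistent alternative: derive both settings from CONTINUE_RUN alone."""
    if continue_run:
        return restart_file, "nc"
    return initial_file, "bin"


# Second submission with CONTINUE_RUN still FALSE: an rpointer file exists
# from the first run, so the binary initial file gets paired with 'nc'.
# The file names here are placeholders, not real CESM file names.
second_run = mismatched_settings(
    rpointer_exists=True,
    continue_run=False,
    initial_file="initial_ts.bin",
    restart_file="case.pop.restart.nc",
)
```

Here second_run pairs the binary initial file with the 'nc' format, which is the combination that leads to the solver failure shown in the first post.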

santos · Apr 26, 2013

This post was originally in the WACCM forums, since a WACCM user encountered this problem, but I have moved it here since it seems more appropriate.

rneely · May 6, 2013

Hey all,

I actually ran into this problem running CESM 1.0.5 on NOAA's Zeus computer after I created a new compset for running a coupled waccm-sc model with 1850 conditions.

After working around the problem by manually changing init_ts_file_fmt as above, I found that I could run the model for a day or two in startup mode (CONTINUE_RUN = FALSE) and save a restart file. I could then change CONTINUE_RUN to TRUE, rebuild, and then the model would have init_ts_file_fmt as 'nc' and would restart and resubmit automatically.

It seems that the model just needs to know you want to continue the run before it will set the correct variables.

santos · May 6, 2013

Hi, Ryan.

Setting CONTINUE_RUN to TRUE is the intended way to continue a run; the model is designed to start at the beginning every time you submit a job unless you set this (or set RESUBMIT > 0, which will set CONTINUE_RUN to TRUE for you at the end of the first job). Furthermore, this is a runtime option, so rebuilding should not have been necessary.

Can you let us know what you were doing before, when the crash actually happened? What (if anything) were you changing between the startup and restart runs?
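
The interaction between CONTINUE_RUN and RESUBMIT described above can be sketched as a toy model. This is not CESM code: run_chain is a made-up name, and the sketch assumes RESUBMIT is decremented by one on each automatic resubmission.

```python
def run_chain(continue_run, resubmit):
    """Return the run type of each job in a submission chain.

    continue_run: value of CONTINUE_RUN when the first job is submitted.
    resubmit: value of RESUBMIT when the first job is submitted.
    """
    kinds = []
    while True:
        kinds.append("continue" if continue_run else "startup")
        if resubmit <= 0:
            break
        # At the end of a job with RESUBMIT > 0, the scripts set
        # CONTINUE_RUN to TRUE and resubmit (assumed: RESUBMIT decrements).
        resubmit -= 1
        continue_run = True
    return kinds
```

With CONTINUE_RUN = FALSE and RESUBMIT = 0, every manual submission is a fresh startup run, which is the situation in which the second submit in this thread failed.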

rneely · May 6, 2013

Yes, the only thing I changed between the startup and restart was to set CONTINUE_RUN to TRUE and rebuild. This works every time for me, though it seems like it is unnecessary. If I did not rebuild, the model would look for the bin files instead of the nc restart files.

hannay · Oct 25, 2013

Is there a fix for this issue? I am running into the problem when starting from a binary file: when I then try to restart, it crashes. Should I try to restart with init_ts_file_fmt = 'nc' instead? Thanks

mlevy · Oct 25, 2013

Can you point me to a case directory? I want to check a couple of things:
1) Make sure your case is using a netcdf restart file with init_ts_file_fmt='bin'
2) See what version of CESM you are running, because I think different versions had different fixes (though I might be thinking of a different issue...)

Thanks!
~Mike
