Welcome to the new DiscussCESM forum!

Problem running CESM: failed to converge

koichi omi

koichi
New Member
Hi all,

I am running CESM2.1.2 on my laboratory's computer.

I set STOP_OPTION=nsteps, STOP_N=1, but the calculation never stops.

The cesm.log file says:
"
imp_sol: time step 1800.000 failed to converge @ (lchnk,vctrpos,nstep) = 2616 52 0
imp_sol: time step 1800.000 failed to converge @ (lchnk,vctrpos,nstep) = 1288 52 0
imp_sol: time step 1800.000 failed to converge @ (lchnk,vctrpos,nstep) = 1952 53 0
imp_sol: time step 1800.000 failed to converge @ (lchnk,vctrpos,nstep) = 2948 81 0
imp_sol: time step 1800.000 failed to converge @ (lchnk,vctrpos,nstep) = 3031 70 0
imp_sol: time step 1800.000 failed to converge @ (lchnk,vctrpos,nstep) = 3114 70 0
imp_sol: time step 1800.000 failed to converge @ (lchnk,vctrpos,nstep) = 3114 86 0
imp_sol: time step 1800.000 failed to converge @ (lchnk,vctrpos,nstep) = 2450 70 0
imp_sol: time step 1800.000 failed to converge @ (lchnk,vctrpos,nstep) = 2450 86 0
imp_sol: time step 1800.000 failed to converge @ (lchnk,vctrpos,nstep) = 3363 68 0
imp_sol: time step 1800.000 failed to converge @ (lchnk,vctrpos,nstep) = 3363 84 0
...
"
Lines like these are repeated throughout the log file.
Please see the attached log file.
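Because the same message repeats many times, it can help to tally it before reading the whole log. The Python sketch below is a hypothetical helper (not part of CESM) that counts imp_sol convergence failures per model step (nstep) and collects the chunk indices (lchnk) involved. Its regular expression is an assumption based only on the lines quoted above.

```python
import re
from collections import Counter

# Pattern inferred from the quoted cesm.log lines (an assumption, not a CESM spec), e.g.
# imp_sol: time step 1800.000 failed to converge @ (lchnk,vctrpos,nstep) = 2616 52 0
IMP_SOL_RE = re.compile(
    r"imp_sol: time step\s+([\d.]+) failed to converge @ "
    r"\(lchnk,vctrpos,nstep\) =\s+(\d+)\s+(\d+)\s+(\d+)"
)


def summarize_imp_sol_failures(lines):
    """Return (failure count per nstep, set of chunk indices) for imp_sol messages."""
    per_nstep = Counter()
    chunks = set()
    for line in lines:
        match = IMP_SOL_RE.search(line)
        if match:
            _dtime, lchnk, _vctrpos, nstep = match.groups()
            per_nstep[int(nstep)] += 1
            chunks.add(int(lchnk))
    return per_nstep, chunks
```

It accepts any iterable of lines, so an open file such as cesm.log.200219-132645.txt can be passed directly. If every failure reports nstep 0, as in the excerpt above, that is consistent with a run that is stuck on its first step rather than one that fails later.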

Is this normal behavior?

Here is the case configuration:

resolution: f09_f09_mg17
compset: FW1850 (1850_CAM60%WCTS_CLM50%SP_CICE%PRES_DOCN%DOM_MOSART_CISM2%NOEVOLVE_SWAV )

Are there any suggestions for fixing this error?

Sincerely,
 

Attachments

  • cesm.log.200219-132645.txt (316.7 KB)

koichi omi

koichi
New Member
Thank you so much for your reply.

I tried STOP_OPTION=nsteps, STOP_N=10 with the same case configuration.
In addition, I tried STOP_OPTION=nyears.
However, the same problem occurred.

I also noticed the following warning in the log file when the model reads the input data:
"WARNING: Rearr optional argument is a pio2 feature, ignored in pio1"
Please see the attached log file.
Is this warning related to the problem?

Are there any suggestions for fixing this problem?
 

Attachments

  • cesm.log.200220-104745.txt (322.6 KB)

jedwards

CSEG and Liaisons
Staff member
Right at the top of the log you sent me I see:
(seq_timemgr_clockInit) WARNING: Stop time too short, not all components will be advanced and restarts won't be written
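The warning means the run ends before some components reach their first coupling time. A rough Python sketch of that check follows. It assumes one nsteps step equals one atmosphere coupling interval and takes per-day coupling counts such as the case's ATM_NCPL and OCN_NCPL; the example counts are hypothetical, so read the real values from your own case (for example with ./xmlquery).

```python
import math

SECONDS_PER_DAY = 86400


def coupling_intervals(ncpl_per_day):
    """Convert per-day coupling counts (e.g. ATM_NCPL) into intervals in seconds."""
    return {comp: SECONDS_PER_DAY / n for comp, n in ncpl_per_day.items()}


def components_not_advanced(stop_n, ncpl_per_day, base="atm"):
    """Components whose coupling interval is longer than a STOP_N-step run.

    Assumes one 'nsteps' step equals one coupling interval of `base`.
    """
    intervals = coupling_intervals(ncpl_per_day)
    run_seconds = stop_n * intervals[base]
    return sorted(comp for comp, dt in intervals.items() if dt > run_seconds)


def min_steps_to_advance_all(ncpl_per_day, base="atm"):
    """Smallest STOP_N (in base steps) that reaches every component's coupling time."""
    intervals = coupling_intervals(ncpl_per_day)
    return math.ceil(max(intervals.values()) / intervals[base])


# Hypothetical coupling counts per day; not taken from the case in this thread.
example_ncpl = {"atm": 48, "ocn": 24, "glc": 1}
print(components_not_advanced(1, example_ncpl))
print(components_not_advanced(10, example_ncpl))
print(min_steps_to_advance_all(example_ncpl))
```

With these made-up counts, a 1-step run leaves both "glc" and "ocn" unadvanced, a 10-step run still leaves "glc", and 48 steps (one day) would be needed to reach every coupling time.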

Could you send your drv_in file and cpl.log for the case with STOP_N=10, STOP_OPTION='nsteps'?
 

koichi omi

koichi
New Member
Thank you for your quick reply.

Here are the drv_in and cpl.log files.

Sincerely,
 

Attachments

  • cpl.log.200220-104745.txt (79.7 KB)
  • drv_in.txt (5.9 KB)

jedwards

CSEG and Liaisons
Staff member
I don't see any problem here: the cpl.log ends before the first time step starts, so how does this indicate that it doesn't stop after 10 steps?
 

koichi omi

koichi
New Member
Thank you for your reply.
Sorry, I should have explained the problem more clearly.
The problem is that the calculation never advances past the initial step, so it never ends.
I would like to know why this happens.
 

koichi omi

koichi
New Member
Thank you for your reply.

I tried F1850, and the run completed successfully.
What should I do to get FW1850 working?
 

mmills

CSEG and Liaisons
Staff member
The failing implicit solver message indicates that the chemistry is likely unstable with the initial condition provided. This can happen if you make changes in chemistry, or perhaps just because you are running on a different computer than the one the initial condition file was generated on. Have you made any changes to the chemistry?

In cases such as this, you can usually get to a stable state by running the model forward a few days with higher time splitting (shorter time steps) for dynamics, tracer advection, and vertical remapping. The default time splits for WACCM are:

fv_nsplit = 16
fv_nspltrac = 4
fv_nspltvrm = 4
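As a worked example of what "higher time splitting" means, the sketch below divides the physics time step by each split count. It assumes dtime = 1800 s (the value in the imp_sol messages) and treats each fv_* value as the number of substeps per physics time step. When the counts do not divide evenly the model may adjust them, so treat the results as approximate.

```python
def fv_substep_seconds(dtime, nsplit, nspltrac, nspltvrm):
    """Approximate substep lengths (s), treating each split as substeps per physics step."""
    return {
        "dynamics": dtime / nsplit,
        "tracer advection": dtime / nspltrac,
        "vertical remapping": dtime / nspltvrm,
    }


DTIME = 1800.0  # physics time step reported in the imp_sol messages

default = fv_substep_seconds(DTIME, nsplit=16, nspltrac=4, nspltvrm=4)
spinup = fv_substep_seconds(DTIME, nsplit=128, nspltrac=128, nspltvrm=128)

for name in default:
    print(f"{name}: {default[name]:g} s -> {spinup[name]:g} s")
```

With the default splits the dynamics substep is 112.5 s and the tracer and remapping substeps are 450 s; with a split of 128 all three drop to about 14 s.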

Try running for 2 days with this in your user_nl_cam file in your case directory:

fv_nsplit = 128
fv_nspltrac = 128
fv_nspltvrm = 128
inithist = 'DAILY'

If the run completes 2 days, you can continue the run after removing these lines from the user_nl_cam. The purpose of the inithist line is to produce a new initial condition file (*.cam.i.*) that you can use for future stable startups. Let me know if this works.
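Adding and later removing these lines by hand is easy to get wrong across several runs. The hypothetical Python helpers below (not part of CESM or CIME) wrap the temporary settings in comment markers in user_nl_cam so they can be removed cleanly afterward. They assume user_nl_cam accepts "!" comment lines, and they do not change the run length, which is set separately through the case's STOP_OPTION and STOP_N.

```python
from pathlib import Path

# Hypothetical marker comments; "!" starts a comment line in user_nl_cam.
BEGIN_MARKER = "! BEGIN temporary spin-up settings"
END_MARKER = "! END temporary spin-up settings"

SPINUP_SETTINGS = [
    "fv_nsplit = 128",
    "fv_nspltrac = 128",
    "fv_nspltvrm = 128",
    "inithist = 'DAILY'",
]


def add_spinup_settings(path):
    """Append the spin-up block to user_nl_cam unless it is already there."""
    nl_file = Path(path)
    text = nl_file.read_text() if nl_file.exists() else ""
    if BEGIN_MARKER in text:
        return
    if text and not text.endswith("\n"):
        text += "\n"
    block = "\n".join([BEGIN_MARKER, *SPINUP_SETTINGS, END_MARKER]) + "\n"
    nl_file.write_text(text + block)


def remove_spinup_settings(path):
    """Delete everything between (and including) the spin-up markers."""
    nl_file = Path(path)
    kept, inside = [], False
    for line in nl_file.read_text().splitlines(keepends=True):
        stripped = line.strip()
        if stripped == BEGIN_MARKER:
            inside = True
        elif stripped == END_MARKER:
            inside = False
        elif not inside:
            kept.append(line)
    nl_file.write_text("".join(kept))
```

From the case directory, add_spinup_settings("user_nl_cam") would be called before the 2-day run and remove_spinup_settings("user_nl_cam") before continuing; any other user settings in the file are left untouched.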
 

koichi omi

koichi
New Member
Thank you so much for your reply.

All I changed was MPI_RUN_COMMAND and run_exe. I didn't change the chemistry.
Also, I have been running CESM on the same computer throughout.

I tried the settings you suggested, but the same convergence problem occurred.
 

koichi omi

koichi
New Member
This problem was solved by deleting all of the input files and downloading them again.
The cause was probably that one of the input data files was corrupted.
Thank you!
 

jedwards

CSEG and Liaisons
Staff member
For cesm2.1.x there is a new option to the check_input_data command: ./check_input_data --chksum compares checksums for all of the input data used in a case and reports any corrupted input files.
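The idea behind such a check can be sketched in a few lines of Python: hash each file and compare the result with a known-good value. This is an illustration only, not how check_input_data is implemented; the MD5 algorithm and the `expected` mapping of paths to digests are assumptions for the example, so use ./check_input_data --chksum itself for real cases.

```python
import hashlib
from pathlib import Path


def md5sum(path, chunk_size=1 << 20):
    """MD5 hex digest of a file, read in chunks so large files need little memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_corrupted(expected):
    """Return paths that are missing or whose MD5 differs from `expected[path]`."""
    bad = []
    for path, want in expected.items():
        p = Path(path)
        if not p.is_file() or md5sum(p) != want.lower():
            bad.append(str(path))
    return sorted(bad)
```

Reading in fixed-size chunks keeps memory use flat even for multi-gigabyte input files, and treating a missing file as corrupted mirrors the fix above of simply downloading the data again.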
 

Hemraj

Hemraj Bhattarai
New Member
I am running into the same problem (imp_sol: time step 1800.000 failed to converge @ (lchnk,vctrpos,nstep)). These are my main model settings:
codebase=cesm2_1_3
compset=FC2010climo
resolution=f09_f09_mg17

As suggested above, I also checked whether my input files are corrupted (using ./check_input_data --chksum), but they seem to be fine.
Could you please suggest why this problem is occurring and how I can resolve it?
Thank you so much.
 

For a better understanding of the problem, I am also attaching the log file.
 

Attachments

  • cesm.log.295327.txt (285.2 KB)

jedwards

CSEG and Liaisons
Staff member
This looks like a possible system issue rather than a model issue - a model issue would normally present a traceback and abort message at the end of the log. Check the lnd.log to see if there is further info there and also check the slurm logs in your case directory for errors there.
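To follow that advice quickly, a short Python sketch can pull the last lines of cesm.log and flag any that look like an abort or traceback. The marker strings are assumptions; the exact wording depends on the compiler, the MPI library, and the CESM version, so adjust them to what your system actually prints.

```python
from collections import deque

# Assumed marker strings; adjust to your compiler, MPI library, and scheduler.
ABORT_MARKERS = ("ERROR", "ABORT", "Backtrace", "Traceback", "SIGSEGV")


def tail_lines(path, n=100):
    """Return the last `n` lines of a text file."""
    with open(path, errors="replace") as f:
        return list(deque(f, maxlen=n))


def abort_evidence(lines, markers=ABORT_MARKERS):
    """Lines containing any marker; an empty list suggests no model abort was logged."""
    return [line.rstrip("\n") for line in lines if any(m in line for m in markers)]
```

An empty result from abort_evidence(tail_lines("cesm.log.295327.txt")) would be consistent with the job being stopped from outside the model, in which case the Slurm output in the case directory is the next place to look.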
 

Hemraj

Hemraj Bhattarai
New Member
Thank you so much for the reply.
I checked both the lnd.log and the Slurm logs ($case>logs>run_environment.txt.297530.210604-044944) but didn't find any errors. For reference, both files are attached.
Could the problem be caused by the input file I modified?
Thank you.
 

Attachments

  • run_environment.txt.297530.txt (13.9 KB)
  • lnd.log.297530.txt (185.8 KB)