coupled model(B_1850-2000_CN) stopped after running for a few years

Hi,

I am using CESM1_2_0. I created a hybrid case with:
create_newcase -case ../caseoutput/b20trcn_f19g16 -compset B_1850-2000_CN -res f19_g16 -mach szhpc

The model was then started from 1948, using the 1948 initial conditions from b40.20th.track1.2deg.001.

The model stopped making progress at 1959-09, although "bjobs" still reports the job as running.

I do not know the reason, but cesm.log contains "BalanceCheck: soil balance error nstep" and "filew failed, worst i, j, qtmp, q" messages. What do these mean? The last lines of cesm.log are shown below, and some log files are attached.

Thanks,
yao

**********
BalanceCheck: soil balance error nstep = 202442 point = 3732 imbalance = -0.000003 W/m2
BalanceCheck: soil balance error nstep = 202633 point = 3737 imbalance = 0.000000 W/m2
BalanceCheck: soil balance error nstep = 202634 point = 3737 imbalance = 0.000000 W/m2
BalanceCheck: soil balance error nstep = 202637 point = 3737 imbalance = 0.000000 W/m2
BalanceCheck: soil balance error nstep = 202638 point = 3737 imbalance = 0.000000 W/m2
BalanceCheck: soil balance error nstep = 202727 point = 11043 imbalance = 0.000000 W/m2
BalanceCheck: soil balance error nstep = 202728 point = 11043 imbalance = 0.000000 W/m2
BalanceCheck: soil balance error nstep = 202729 point = 11046 imbalance = 0.000000 W/m2
BalanceCheck: soil balance error nstep = 202730 point = 11046 imbalance = 0.000000 W/m2
Opened file ./b20trcn_f19g16.rtm.h0.1959-07.nc to write 655360
Opened file ./b20trcn_f19g16.clm2.h0.1959-07.nc to write 655360
Opened file b20trcn_f19g16.cam.h0.1959-07.nc to write 655360
BalanceCheck: soil balance error nstep = 203023 point = 11045 imbalance = -0.000002 W/m2
BalanceCheck: soil balance error nstep = 203024 point = 11045 imbalance = -0.000002 W/m2
filew failed, worst i, j, qtmp, q = 1 72
-2.189726374793342E-015 1.335410596380857E-026
QNEG3 from convect_deep/CLDLIQ:m= 2 lat/lchnk= 913 Min. mixing ratio violated at 1 points. Reset to 0.0E+00 Worst =-2.7E-10 at i,k= 6 24
filew failed, worst i, j, qtmp, q = 1 69
-2.374679514594295E-011 7.881057934984575E-032
filew failed, worst i, j, qtmp, q = 1 73
-2.940830193003910E-018 2.341641996943399E-045
dpcoup cant adjust 3 565 8 -8.398103431537513E-024
-1.006041895856967E-021 4.294449624196567E-024
filew failed, worst i, j, qtmp, q = 1 74
-9.091953665231636E-021 5.014301619971557E-039
Opened file ./b20trcn_f19g16.rtm.h0.1959-08.nc to write 655360
Opened file ./b20trcn_f19g16.clm2.h0.1959-08.nc to write 655360
Opened file b20trcn_f19g16.cam.h0.1959-08.nc to write 655360
filew failed, worst i, j, qtmp, q = 1 63
-8.228847990948502E-019 1.745182945588461E-040
filew failed, worst i, j, qtmp, q = 1 63
-1.891540041527286E-018 -8.047831091565376E-019
Opened file ./b20trcn_f19g16.rtm.h0.1959-09.nc to write 655360
Opened file ./b20trcn_f19g16.clm2.h0.1959-09.nc to write 655360
Opened file b20trcn_f19g16.cam.h0.1959-09.nc to write 655360
filew failed, worst i, j, qtmp, q = 1 63
-4.258028852041570E-009 6.091987584039642E-032
filew failed, worst i, j, qtmp, q = 1 62
-4.030730745896910E-011 -3.944113039400929E-011
filew failed, worst i, j, qtmp, q = 1 63
-8.378404384572377E-009 -4.131869037032669E-009
********
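
For anyone who wants to check how large the BalanceCheck imbalances in an excerpt like this are, here is a minimal sketch. It assumes the line format shown above; the function name `summarize_balance_errors` is made up for illustration and is not part of CESM.

```python
import re

# Pattern assumed from the cesm.log excerpt above, e.g.
# "BalanceCheck: soil balance error nstep = 202442 point = 3732 imbalance = -0.000003 W/m2"
BALANCE_RE = re.compile(
    r"BalanceCheck:\s*soil balance error\s+nstep\s*=\s*(\d+)\s+"
    r"point\s*=\s*(\d+)\s+imbalance\s*=\s*(-?\d+\.\d+)\s*W/m2"
)

def summarize_balance_errors(lines):
    """Return (count, largest absolute imbalance in W/m2, nstep where it occurred)."""
    count, worst, worst_step = 0, 0.0, None
    for line in lines:
        m = BALANCE_RE.search(line)
        if m is None:
            continue  # not a BalanceCheck line
        count += 1
        imbalance = abs(float(m.group(3)))
        if worst_step is None or imbalance > worst:
            worst, worst_step = imbalance, int(m.group(1))
    return count, worst, worst_step
```

Feeding it the lines of cesm.log (for example `summarize_balance_errors(open("cesm.log"))`, with the file name being whatever your run produced) gives the number of messages and the largest imbalance, which makes it easier to see whether the values stay near the printing precision or grow over time.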
 

eaton

CSEG and Liaisons
The "filew failed ..." messages are from CAM's FV dycore. They are just warnings that the dycore isn't able to fix negative tracer values with the borrowing scheme it's trying to use. This doesn't cause a model failure. Any negative tracer values will subsequently be "fixed" in a non-mass-conserving way by the qneg3 algorithm in the physics package.
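
To illustrate the difference between the two kinds of fix, here is a toy sketch. It is not CAM's code: `borrow_fill` and `clip_to_min` are hypothetical names, the "grid" is a 1-D list, and the borrowing rule (take mass from the two adjacent cells in proportion to what they hold) is an assumption made only for the example.

```python
def borrow_fill(q):
    """Toy borrowing: fill each negative cell from its immediate neighbours.

    Returns (new_q, failed_indices). Where filling succeeds, the total is
    unchanged; where the neighbours hold too little, the cell is left
    negative and its index is reported as failed.
    """
    q = list(q)
    failed = []
    for i, v in enumerate(q):
        if v >= 0.0:
            continue
        deficit = -v
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(q) and q[j] > 0.0]
        available = sum(q[j] for j in neighbours)
        if available < deficit:
            failed.append(i)  # analogous in spirit to a "failed" warning
            continue
        for j in neighbours:
            q[j] -= deficit * q[j] / available  # take a proportional share
        q[i] = 0.0
    return q, failed


def clip_to_min(q, qmin=0.0):
    """Toy clipping: reset values below qmin to qmin (does not conserve the total).

    Returns (new_q, number_of_violations, worst_value_or_None).
    """
    violations = [v for v in q if v < qmin]
    fixed = [max(v, qmin) for v in q]
    worst = min(violations) if violations else None
    return fixed, len(violations), worst
```

With `[1.0, -0.5, 1.0]` the borrowing succeeds and the sum stays the same. With `[1e-3, -1.0, 1e-3]` the neighbours cannot cover the deficit, so borrowing fails and only clipping removes the negative value, at the cost of changing the total.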
 

mai

Member
If the atm.log.140422-232906 file (which you did not attach) does not have any error messages at the end, I'd say your job is still running. Everything else you have shown looks like normal output.
 
Thanks eaton and mai. Sorry for forgetting to upload atm.log. The log file and a picture have now been uploaded. I do not see any error in atm.log. The picture shows the local (China) time at which each monthly output file was created. Before 1959-09, a new set of output files was created every 14 minutes. The model then stopped at 07:21, while it was writing atm.log. It was about 15:00 when I wrote this post, so no new files were written between 07:21 and 15:00. I do not know what the problem is.
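
The same timestamp check can be scripted instead of read off a screenshot. This is only a sketch: the function names and the factor of 3 are arbitrary choices for illustration, and the expected interval (about 14 minutes per model month here) has to come from your own run.

```python
import os
import time

def newest_file_age(run_dir, now=None):
    """Return (path, age in seconds) of the most recently modified regular file.

    Returns (None, None) if run_dir contains no regular files.
    """
    now = time.time() if now is None else now
    newest_path, newest_mtime = None, None
    for entry in os.scandir(run_dir):
        if not entry.is_file():
            continue
        mtime = entry.stat().st_mtime
        if newest_mtime is None or mtime > newest_mtime:
            newest_path, newest_mtime = entry.path, mtime
    if newest_path is None:
        return None, None
    return newest_path, now - newest_mtime

def looks_stalled(run_dir, expected_interval_s, factor=3.0, now=None):
    """True if nothing in run_dir changed for more than factor * expected_interval_s."""
    _, age = newest_file_age(run_dir, now=now)
    return age is not None and age > factor * expected_interval_s
```

For example, `looks_stalled("/path/to/run", 14 * 60)` (the path is a placeholder) would return True in the situation described above, where the last file was touched more than seven hours earlier while the scheduler still listed the job as running.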
 

santos

Member
There is no error in the log, and from what you are saying, the batch job continued but the model stopped producing output. My best guess is that there was a temporary problem with your system. You can try restarting the job and see if it succeeds this time.
 
OK, I also encountered a similar situation: the model stopped at 2007-09. I am using CESM1_2_1 and running a startup case from 1979 to 2012. Have you solved the problem?
 

zhuchenxia

New Member
Hi, Yao. Have you solved the problem, and do you still remember how you solved it? I have encountered a similar situation.

Here is some detailed information about my case.
compset: F_2000_CAM5
mach: our own machine
processors: 18
mpi: mpich
build: succeed
submit: fail

I ran the case with INFO_DBUG=1. It ran for 30 hours and then stopped without producing any output. In CaseStatus there is no 'run failed' entry; the file just ends at 'run started'. But 'bjobs' shows no running jobs.

I looked at cesm.log in the run directory. Here are its last few lines.
QNEG3 from vertical diffusion/SO2:m= 8 lat/lchnk= 1922 Min. mixing ratio violated at 2 points. Reset to 1.0E-36 Worst =-1.5E-12 at i,k= 2 30
filew failed, worst i, j, qtmp, q = 1 66
-7.151256598823758E-022 2.214308297315904E-025
filew failed, worst i, j, qtmp, q = 1 67
-3.961143198214729E-021 1.163926728481917E-025
QNEG3 from vertical diffusion/SO2:m= 8 lat/lchnk= 1921 Min. mixing ratio violated at 2 points. Reset to 1.0E-36 Worst =-2.0E-12 at i,k= 2 30
filew failed, worst i, j, qtmp, q = 1 89
-5.505644719907718E-021 2.077154760046541E-024
filew failed, worst i, j, qtmp, q = 1 89
-6.688683957690169E-021 1.198538870746115E-026
filew failed, worst i, j, qtmp, q = 1 114
-5.860415243590058E-022 1.944859505689513E-024
filew failed, worst i, j, qtmp, q = 1 113
-1.571191409239440E-021 1.891108692832233E-024
filew failed, worst i, j, qtmp, q = 1 48
-1.752370976580843E-032 6.691991436730873E-034
filew failed, worst i, j, qtmp, q = 1 128
-1.192603958284727E-031 2.776626993647009E-034
filew failed, worst i, j, qtmp, q = 1 128
-5.025388103881758E-031 2.774024828304842E-034
filew failed, worst i, j, qtmp, q = 1 129
-7.195791309399674E-033 2.773799223920591E-034

In the run directory, the last file the model wrote to was atm.log. I searched atm.log for 'error' and found only these matches:
1. UW_errorPBL
2. print_energy_errors is set F
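
Both matches are names that merely contain the string "error", not error messages. A whole-word search avoids such hits. The sketch below uses Python's `re` module; the function name `standalone_error_lines` is hypothetical, and which words actually signal a failure in a given model version is an assumption you would need to check.

```python
import re

# "\b" is a word boundary. Because "_" and letters count as word characters,
# "UW_errorPBL" and "print_energy_errors" do not match, while "error" as a
# separate word (in any letter case) does.
WORD_ERROR = re.compile(r"\berror\b", re.IGNORECASE)

def standalone_error_lines(lines):
    """Return the lines that contain 'error' as a standalone word."""
    return [line for line in lines if WORD_ERROR.search(line)]
```

Note that this still reports lines such as "BalanceCheck: soil balance error nstep = ...", since "error" is a separate word there, so the result is a list of candidates to read rather than a verdict.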

I have attached my cesm.log and atm.log to show more details. If you need more information, please don't hesitate to contact me.
Any suggestions would be appreciated.
Thank you very much!
 

Attachments

  • atm.log.zip
    113.4 KB · Views: 1
  • cesm.log.zip
    35.4 KB · Views: 2