Fail to run default CESM1.2.2.1 compset=E_1850_CN with 2 nodes on Cheyenne

I want to speed up the CESM1.2.2.1 E_1850_CN compset on Cheyenne by using multiple nodes. I created the case with:

Code:
~/ucar_models/cesm1_2_2_1/scripts/create_newcase \
    -case      test_64cores \
    -compset   E_1850_CN \
    -res       f45_g37 \
    -mach      cheyenne

In env_run.xml I set

Code:
STOP_N="12780"
DOCN_SOM_FILENAME="pop_frc.gx3v7.110128.nc"

In env_mach_pes.xml, I set

[The env_mach_pes.xml entries were not preserved in this copy of the post.]

In test_64cores.run,
Code:
#!/bin/csh -f
###PBS -A
#PBS -N test_64cores
#PBS -q regular
#PBS -l select=2:ncpus=36:mpiprocs=36:ompthreads=1
#PBS -l walltime=08:00:00
#PBS -j oe
#PBS -S /bin/csh -V
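
As a quick sanity check on the request above, the number of MPI ranks PBS will provide can be worked out from the `select` line: 2 chunks times 36 `mpiprocs` gives 72 ranks, whereas the case name suggests 64 cores. The sketch below is only an illustration; `pbs_total_ranks` is a hypothetical helper, and it assumes a single chunk specification of the form `select=N:...:mpiprocs=M`. It does not show that a mismatch is the cause of the hang, only how to compute the figure to compare against the task counts in env_mach_pes.xml.

```shell
#!/bin/sh
# Hypothetical helper (not part of CESM or PBS): compute the total number of
# MPI ranks requested by a single-chunk PBS "select" line.
pbs_total_ranks() {
    line=$1
    # Number of chunks (here: nodes) requested.
    chunks=$(printf '%s\n' "$line" | sed -n 's/.*select=\([0-9][0-9]*\).*/\1/p')
    # MPI processes per chunk.
    procs=$(printf '%s\n' "$line" | sed -n 's/.*mpiprocs=\([0-9][0-9]*\).*/\1/p')
    echo $((chunks * procs))
}

# The select line from test_64cores.run:
pbs_total_ranks '#PBS -l select=2:ncpus=36:mpiprocs=36:ompthreads=1'
# prints 72
```
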
However, the run hangs: it neither terminates nor makes further progress until it hits the wall-clock limit. The last few lines of each component log are:
CESM
Code:
19: BalanceCheck: soil balance error nstep =    424571 point =  1109 imbalance =   -0.000002 W/m2
19: BalanceCheck: soil balance error nstep =    424572 point =  1109 imbalance =   -0.000002 W/m2
1: Opened file ./test_64cores.rtm.h0.0025-03.nc to write      458752
1: Opened file ./test_64cores.clm2.h0.0025-03.nc to write      458752
1: Opened file test_64cores.cam.h0.0025-03.nc to write      458752
2: BalanceCheck: soil balance error nstep =    424837 point =   141 imbalance =   -0.000000 W/m2
2: BalanceCheck: soil balance error nstep =    424838 point =   141 imbalance =   -0.000000 W/m2
3: BalanceCheck: soil balance error nstep =    424883 point =   204 imbalance =   -0.000000 W/m2
3: BalanceCheck: soil balance error nstep =    424884 point =   204 imbalance =   -0.000000 W/m2
53:  filew failed, worst i, j, qtmp, q =            1          30
53: -8.211380995416666E-009  1.859680624937745E-041
20: QNEG3 from TPHYSBCb:m=  3 lat/lchnk=    148 Min. mixing ratio violated at    2 points.  Reset to  0.0E+00 Worst =-1.2E-12 at i,k=  10 23
1: BalanceCheck: soil balance error nstep =    424935 point =    97 imbalance =   -0.000000 W/m2
1: BalanceCheck: soil balance error nstep =    424936 point =    97 imbalance =   -0.000000 W/m2
18: BalanceCheck: soil balance error nstep =    424959 point =  1057 imbalance =   -0.000002 W/m2
18: BalanceCheck: soil balance error nstep =    424960 point =  1057 imbalance =   -0.000002 W/m2
20: BalanceCheck: soil balance error nstep =    425003 point =  1159 imbalance =   -0.000001 W/m2
19: BalanceCheck: soil balance error nstep =    425003 point =  1110 imbalance =   -0.000000 W/m2
19: BalanceCheck: soil balance error nstep =    425004 point =  1110 imbalance =   -0.000000 W/m2
20: BalanceCheck: soil balance error nstep =    425004 point =  1159 imbalance =   -0.000001 W/m2
19: BalanceCheck: soil balance error nstep =    425095 point =  1109 imbalance =   -0.000001 W/m2
19: BalanceCheck: soil balance error nstep =    425096 point =  1109 imbalance =   -0.000001 W/m2
38: BalanceCheck: soil balance error nstep =    425101 point =  2236 imbalance =   -0.000001 W/m2
38: BalanceCheck: soil balance error nstep =    425102 point =  2236 imbalance =   -0.000001 W/m2
53:  filew failed, worst i, j, qtmp, q =            1          30
53: -1.490387661098496E-016  7.093350521850853E-047
31: BalanceCheck: soil balance error nstep =    425131 point =  1865 imbalance =   -0.000001 W/m2
31: BalanceCheck: soil balance error nstep =    425132 point =  1865 imbalance =   -0.000001 W/m2
60: BalanceCheck: soil balance error nstep =    425145 point =  3617 imbalance =   -0.000003 W/m2
60: BalanceCheck: soil balance error nstep =    425146 point =  3617 imbalance =   -0.000003 W/m2
23: BalanceCheck: soil balance error nstep =    425169 point =  1357 imbalance =   -0.000004 W/m2
23: BalanceCheck: soil balance error nstep =    425170 point =  1357 imbalance =   -0.000004 W/m2
6: BalanceCheck: soil balance error nstep =    425189 point =   359 imbalance =   -0.000001 W/m2
6: BalanceCheck: soil balance error nstep =    425190 point =   359 imbalance =   -0.000001 W/m2
51: BalanceCheck: soil balance error nstep =    425283 point =  3049 imbalance =   -0.000001 W/m2
51: BalanceCheck: soil balance error nstep =    425284 point =  3049 imbalance =   -0.000001 W/m2
CPL
Code:
(seq_diag_print_mct) NET AREA BUDGET (m2/m2): period =  monthly: date =    250401     0
                       atm            lnd            ocn         ice nh         ice sh        *SUM*  
        area    -1.00000000     0.29324025     0.64718132     0.04019648     0.01938454     0.00000258
  
(seq_diag_print_mct) NET HEAT BUDGET (W/m2): period =  monthly: date =    250401     0
                       atm            lnd            rof            ocn         ice nh         ice sh        *SUM*  
     hfreeze     0.00000000     0.00000000     0.00000000     0.07910454    -0.02622360    -0.05286458     0.00001636
       hmelt     0.00000000     0.00000000     0.00000000    -0.74348212     0.42499467     0.31831747    -0.00016998
      hnetsw  -163.05330953    38.32536366     0.00000000   123.34557208     0.88798738     0.48733692    -0.00704949
       hlwdn  -322.74136669    81.85235878     0.00000000   230.18081722     6.71059416     3.99699017    -0.00060636
       hlwup   382.22905060  -100.23357861     0.00000000  -268.95603266    -8.26359495    -4.77674509    -0.00090071
     hlatvap    77.82491763   -11.88317259     0.00000000   -65.87975725    -0.04217840    -0.02029389    -0.00048449
     hlatfus     0.79383088    -0.34822528     0.00000000    -0.26460311    -0.10267895    -0.07831654     0.00000700
      hiroff     0.00000000     0.03837988     0.00003043     0.00000000     0.00000000     0.00000000     0.03841031
        hsen    18.41571865    -7.80060996     0.00000000   -10.74228657     0.15480912    -0.02895466    -0.00132343
       *SUM*    -6.53115846    -0.04948413     0.00003043     7.01933214    -0.25629057    -0.15453020     0.02789921
  
(seq_diag_print_mct) NET WATER BUDGET (kg/m2s*1e6): period =  monthly: date =    250401     0
                       atm            lnd            rof            ocn         ice nh         ice sh        *SUM*  
     wfreeze     0.00000000     0.00000000     0.00000000    -0.20972689     0.06952565     0.14015786    -0.00004338
       wmelt     0.00000000     0.00000000     0.00000000    -0.28424494     0.50163837    -0.21658632     0.00080711
       wrain   -28.88472229     6.75483040     0.00000000    22.03273135     0.06024532     0.03685768    -0.00005755
       wsnow    -2.37887588     1.04352795     0.00000000     0.79293710     0.30769837     0.23469147    -0.00002099
       wevap    31.10445695    -4.74133316     0.00000000   -26.34136635    -0.01483905    -0.00711210    -0.00019372
     wrunoff     0.00000000    -2.80489561     0.12314988     0.00000000     0.00000000     0.00000000    -2.68174572
     wfrzrof     0.00000000    -0.11501312    -0.00009119     0.00000000     0.00000000     0.00000000    -0.11510431
       *SUM*    -0.15914122     0.13711647     0.12305869    -4.00966972     0.92426866     0.18800857    -2.79635855
  
 tStamp_write: model date =   250401       0 wall clock = 2019-04-18 01:38:04 avg dt =     2.38 dt =     2.84
 memory_write: model date =   250401       0 memory =     146.74 MB (highwater)         -0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
 tStamp_write: model date =   250402       0 wall clock = 2019-04-18 01:38:06 avg dt =     2.38 dt =     2.38
 memory_write: model date =   250402       0 memory =     146.74 MB (highwater)         -0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
 tStamp_write: model date =   250403       0 wall clock = 2019-04-18 01:38:09 avg dt =     2.38 dt =     2.37
 memory_write: model date =   250403       0 memory =     146.74 MB (highwater)         -0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
 tStamp_write: model date =   250404       0 wall clock = 2019-04-18 01:38:11 avg dt =     2.38 dt =     2.39
 memory_write: model date =   250404       0 memory =     146.74 MB (highwater)         -0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
 tStamp_write: model date =   250405       0 wall clock = 2019-04-18 01:38:13 avg dt =     2.38 dt =     2.36
 memory_write: model date =   250405       0 memory =     146.74 MB (highwater)         -0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
 tStamp_write: model date =   250406       0 wall clock = 2019-04-18 01:38:16 avg dt =     2.38 dt =     2.43
 memory_write: model date =   250406       0 memory =     146.74 MB (highwater)         -0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
 tStamp_write: model date =   250407       0 wall clock = 2019-04-18 01:38:18 avg dt =     2.38 dt =     2.40
 memory_write: model date =   250407       0 memory =     146.74 MB (highwater)         -0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
 tStamp_write: model date =   250408       0 wall clock = 2019-04-18 01:38:21 avg dt =     2.38 dt =     2.39
 memory_write: model date =   250408       0 memory =     146.74 MB (highwater)         -0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
 tStamp_write: model date =   250409       0 wall clock = 2019-04-18 01:38:23 avg dt =     2.38 dt =     2.37
 memory_write: model date =   250409       0 memory =     146.74 MB (highwater)         -0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
 tStamp_write: model date =   250410       0 wall clock = 2019-04-18 01:38:25 avg dt =     2.38 dt =     2.38
 memory_write: model date =   250410       0 memory =     146.74 MB (highwater)         -0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
 tStamp_write: model date =   250411       0 wall clock = 2019-04-18 01:38:28 avg dt =     2.38 dt =     2.36
 memory_write: model date =   250411       0 memory =     146.74 MB (highwater)         -0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
 tStamp_write: model date =   250412       0 wall clock = 2019-04-18 01:38:30 avg dt =     2.38 dt =     2.36
 memory_write: model date =   250412       0 memory =     146.74 MB (highwater)         -0.00 MB (usage)  (pe=    0 comps= cpl ATM LND OCN ICE GLC ROF WAV)
ATM
Code:
nstep, te   425343   0.32921039180626612E+10   0.32921041332761240E+10   0.11929028825786629E-04   0.98505215837519107E+05
 nstep, te   425344   0.32921264972353024E+10   0.32921250316163564E+10  -0.81237513104121090E-04   0.98505239159271499E+05
 nstep, te   425345   0.32921471968084836E+10   0.32921456711939802E+10  -0.84562986199844059E-04   0.98505249864586105E+05
 nstep, te   425346   0.32921716055284190E+10   0.32921694904318500E+10  -0.11723724872707561E-03   0.98505269089986512E+05
 nstep, te   425347   0.32921957574583359E+10   0.32921931562663460E+10  -0.14418092821176325E-03   0.98505270934110289E+05
 nstep, te   425348   0.32922196046240435E+10   0.32922173836605086E+10  -0.12310530945831881E-03   0.98505282935014067E+05
 nstep, te   425349   0.32922440268539519E+10   0.32922415843710651E+10  -0.13538384641695618E-03   0.98505291279411394E+05
 nstep, te   425350   0.32922652208534374E+10   0.32922635071687074E+10  -0.94987447426059490E-04   0.98505297846811067E+05
 nstep, te   425351   0.32922871265105324E+10   0.32922853150614657E+10  -0.10040639742915476E-03   0.98505305220259645E+05
  
      120000      250412
 Total Mass=   985.053077818377      (mb), Dry Mass=   982.880000093488      (mb
 )
 Total Precipitable Water =   22.1603331466039      (kg/m**2)
 PS max =    1037.62475230829       min =    567.014606950163     
 U  max =    66.1332969844372       min =   -47.1004067702001     
 V  max =    33.7656336834553       min =   -34.9560846633157     
 T  max =    310.242788710309       min =    183.428902372904     
 W (mb/day) max =    380.038874494323       min =   -407.924193252446     
 Average Height (geopotential units) =    582.988339352997     
 PRECC max =    41.2052734358385       min =   0.000000000000000E+000
 PRECL max =    22.8397352784598       min =   0.000000000000000E+000
 Total precp=   2.73879696830441       CON=   2.06939326576035       LS=
  0.669403702544058     
  
 nstep, te   425352   0.32923056843956847E+10   0.32923045000447469E+10  -0.65647116472378241E-04   0.98505307781837677E+05
 nstep, te   425353   0.32923244978453741E+10   0.32923233300704021E+10  -0.64728328066199621E-04   0.98505312330263710E+05
 nstep, te   425354   0.32923392900191512E+10   0.32923387833463516E+10  -0.28084249955686805E-04   0.98505309301401896E+05
 nstep, te   425355   0.32923543066691456E+10   0.32923538723877988E+10  -0.24071680734112838E-04   0.98505309512877124E+05
 nstep, te   425356   0.32923643985989852E+10   0.32923649802449522E+10   0.32239920973220952E-04   0.98505307527387660E+05
 nstep, te   425357   0.32923754230899429E+10   0.32923760318558717E+10   0.33743144592790234E-04   0.98505314875918775E+05
 nstep, te   425358   0.32923790140984435E+10   0.32923805358942785E+10   0.84351278844188780E-04   0.98505300128855233E+05
 nstep, te   425359   0.32923836379697976E+10   0.32923848794635139E+10   0.68814485489876364E-04   0.98505286270728568E+05
OCN
Code:
(docn_comp_run) ocn: model date   250412       0s
(docn_comp_run) ocn: model date   250412    1800s
(docn_comp_run) ocn: model date   250412    3600s
(docn_comp_run) ocn: model date   250412    5400s
(docn_comp_run) ocn: model date   250412    7200s
(docn_comp_run) ocn: model date   250412    9000s
(docn_comp_run) ocn: model date   250412   10800s
(docn_comp_run) ocn: model date   250412   12600s
(docn_comp_run) ocn: model date   250412   14400s
(docn_comp_run) ocn: model date   250412   16200s
(docn_comp_run) ocn: model date   250412   18000s
(docn_comp_run) ocn: model date   250412   19800s
(docn_comp_run) ocn: model date   250412   21600s
(docn_comp_run) ocn: model date   250412   23400s
(docn_comp_run) ocn: model date   250412   25200s
(docn_comp_run) ocn: model date   250412   27000s
(docn_comp_run) ocn: model date   250412   28800s
(docn_comp_run) ocn: model date   250412   30600s
(docn_comp_run) ocn: model date   250412   32400s
(docn_comp_run) ocn: model date   250412   34200s
(docn_comp_run) ocn: model date   250412   36000s
(docn_comp_run) ocn: model date   250412   37800s
(docn_comp_run) ocn: model date   250412   39600s
(docn_comp_run) ocn: model date   250412   41400s
(docn_comp_run) ocn: model date   250412   43200s
(docn_comp_run) ocn: model date   250412   45000s
(docn_comp_run) ocn: model date   250412   46800s
(docn_comp_run) ocn: model date   250412   48600s
(docn_comp_run) ocn: model date   250412   50400s
(docn_comp_run) ocn: model date   250412   52200s
(docn_comp_run) ocn: model date   250412   54000s
LND
Code:
clm2: completed timestep       425329
 clm2: completed timestep       425330
 clm2: completed timestep       425331
 clm2: completed timestep       425332
 clm2: completed timestep       425333
 clm2: completed timestep       425334
 clm2: completed timestep       425335
 clm2: completed timestep       425336
 clm2: completed timestep       425337
 clm2: completed timestep       425338
 clm2: completed timestep       425339
 clm2: completed timestep       425340
 clm2: completed timestep       425341
 clm2: completed timestep       425342
 clm2: completed timestep       425343
 clm2: completed timestep       425344
 clm2: completed timestep       425345
 clm2: completed timestep       425346
 clm2: completed timestep       425347
 clm2: completed timestep       425348
 clm2: completed timestep       425349
 clm2: completed timestep       425350
 clm2: completed timestep       425351
 clm2: completed timestep       425352
 clm2: completed timestep       425353
 clm2: completed timestep       425354
 clm2: completed timestep       425355
 clm2: completed timestep       425356
 clm2: completed timestep       425357
 clm2: completed timestep       425358
ICE
Code:
aero:            3  faero-fsoot   :    964238.712753577     
   6965.40539540224     
 aero:            3  aerotot       :    11632605283.1229     
   154355202.439234     
 aero:            3  aerotot change:    964238.703186035     
   6965.40536493063     
 aero:            3  aeromax agg:   2.635567523154776E-003
  7.296360311253620E-005
                                             Arctic                 Antarctic
total ice area  (km^2) =    2.00945086160375476E+07   1.35309127651979420E+07
total ice extent(km^2) =    2.20291190032451749E+07   1.47384636229330283E+07
total ice volume (m^3) =    6.88024434738068750E+13   2.34776381742008086E+13
total snw volume (m^3) =    6.66084391472641406E+12   5.72946900102083301E+12
tot kinetic energy (J) =    1.74521771242845000E+14   3.04268053722771375E+14
rms ice speed    (m/s) =        0.07311672633334403       0.16119528570787445
average albedo         =        0.72837027208873206       0.72235144764844084
max ice volume     (m) =       11.36771336308637892       8.15412867414148046
max ice speed    (m/s) =        0.37128519369046487       0.33065583035276547
max strength    (kN/m) =      918.44658971979947637     176.50532225058483959
 ----------------------------
arwt rain h2o kg in dt =    2.52054563698879585E+10   3.20214785572211838E+10
arwt snow h2o kg in dt =    2.80552510902386841E+11   3.04370879343336914E+11
arwt evap h2o kg in dt =   -4.19320814133738632E+10   3.49570613782832861E+09
arwt frzl h2o kg in dt =    1.43208999101032887E+10   2.21800242156592621E+11
arwt frsh h2o kg in dt =   -9.30511758540998383E+10  -1.14381027288429443E+12
arwt ice mass (kg)     =    6.30918406654809040E+16   2.15289942057421400E+16
arwt snw mass (kg)     =    2.19807849185971675E+15   1.89072477033687500E+15
arwt tot mass (kg)     =    6.52899191573406240E+16   2.34197189760790160E+16
arwt tot mass chng(kg) =    3.71197961616000000E+11   1.70549857907200000E+12
arwt water flux        =    3.71197961623104004E+11   1.70549857907927344E+12
 (=rain+snow+evap+frzl-fresh)  
water flux error       =    1.08807056249070093E-16   3.10568948646613347E-16
 ----------------------------
arwt atm heat flux (W) =   -1.61052913247670625E+14  -4.09693210590881625E+14
arwt ocn heat flux (W) =   -1.86565098595157125E+14  -1.04561653219656187E+14
arwt frzl heat flux(W) =    2.65493572222303711E+12   4.11193004486971953E+13
arwt tot energy    (J) =   -2.27898341317840061E+22  -7.70622817393212444E+21
arwt net heat      (J) =    4.11430493254742320E+16  -6.23251544075860736E+17
arwt tot energy chng(J)=    4.11458421611560960E+16  -6.23250774161358848E+17
arwt heat error        =    1.22547433461525362E-10   9.99080853188842203E-11
 ----------------------------
arwt salt mass (kg)    =    2.52367362661923625E+14   8.61159768229685625E+13
arwt salt mass chng(kg)=    9.59008541255586863E+08   5.35289761172562981E+09
arwt salt flx in dt(kg)=   -9.59008541289906263E+08  -5.35289761175689507E+09
arwt salt flx error    =   -1.35989853938951434E-16  -3.63059909932104165E-16
 ----------------------------
ROF
Code:
(Rtmrun) model date is      250326           0
  
 (Rtmrun) model date is      250327           0
  
 (Rtmrun) model date is      250328           0
  
 (Rtmrun) model date is      250329           0
  
 (Rtmrun) model date is      250330           0
  
 (Rtmrun) model date is      250331           0
  
 (Rtmrun) model date is      250401           0
 hist_htapes_wrapup : Creating history file ./test_64cores.rtm.h0.0025-03.nc
  at nstep =        70800
 calling htape_create for file t =            1
 htape_create : Opening netcdf htape ./test_64cores.rtm.h0.0025-03.nc
 htape_create : Successfully defined netcdf history file            1
 
 hist_htapes_wrapup : Writing current time sample to local history file 
 ./test_64cores.rtm.h0.0025-03.nc at nstep =        70800 
  for history time interval beginning at    8819.00000000000       and ending at
     8850.00000000000     
 
 
 hist_htapes_wrapup : Closing local history file 
 ./test_64cores.rtm.h0.0025-03.nc at nstep =        70800
 
  
 (Rtmrun) model date is      250402           0
  
 (Rtmrun) model date is      250403           0
  
 (Rtmrun) model date is      250404           0
  
 (Rtmrun) model date is      250405           0
  
 (Rtmrun) model date is      250406           0
  
 (Rtmrun) model date is      250407           0
  
 (Rtmrun) model date is      250408           0
  
 (Rtmrun) model date is      250409           0
  
 (Rtmrun) model date is      250410           0
  
 (Rtmrun) model date is      250411           0
  
 (Rtmrun) model date is      250412           0

Again: the run hangs without terminating or making further progress until it hits the wall-clock limit. I am now using 2 nodes. In env_mach_pes.xml I also tried 72 and 68 cores, and neither works. Does anyone know whether I am setting this up correctly?
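
One way to pin down where the run stopped advancing is to pull the last model date and wall-clock time from the coupler log's `tStamp_write` lines and compare them with the other component logs. In the excerpts above, the coupler's last stamp is model date 250412 at 01:38:30, and the ATM, LND, OCN and ROF logs also end on or just after that model day. The sketch below is a minimal illustration; `last_tstamp` is a hypothetical helper, the field positions are assumed from the log lines quoted in this post, and the sample input is copied from the CPL excerpt rather than read from a real log file.

```shell
#!/bin/sh
# Hypothetical helper: read coupler log text on stdin and print the model date
# and wall-clock time of the last tStamp_write line.
last_tstamp() {
    awk '/tStamp_write: model date =/ { d = $5; t = $10 " " $11 }
         END { if (d != "") print d, t }'
}

# Sample input: two lines taken from the CPL log excerpt in this post.
last_tstamp <<'EOF'
 tStamp_write: model date =   250411       0 wall clock = 2019-04-18 01:38:28 avg dt =     2.38 dt =     2.36
 tStamp_write: model date =   250412       0 wall clock = 2019-04-18 01:38:30 avg dt =     2.38 dt =     2.36
EOF
# prints: 250412 2019-04-18 01:38:30
```

Against a real run, the same function could be fed the coupler log file on stdin; the log file name depends on the case and is not shown in this post.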
 