CESM 1_2_0 SOM PE layouts on Yellowstone

zwang19@...
Dear All,

I would like to run the Slab Ocean Model (SOM) for many model years in CESM 1_2_0 at low resolution (f45_g37) on Yellowstone.

Probably because the case needs few computational resources, CESM assigned the ‘caldera’ queue by default after setup.

However, the run is very slow (about 8 hours of wall-clock time per model year).

I checked the model timing and found this:

    TOT Run Time:   27073.821 seconds       74.175 seconds/mday         3.19 myears/wday
    LND Run Time:     404.472 seconds        1.108 seconds/mday       213.61 myears/wday
    ROF Run Time:      48.541 seconds        0.133 seconds/mday      1779.94 myears/wday
    ICE Run Time:    1140.622 seconds        3.125 seconds/mday        75.75 myears/wday
    ATM Run Time:   24680.386 seconds       67.617 seconds/mday         3.50 myears/wday
    OCN Run Time:      24.727 seconds        0.068 seconds/mday      3494.16 myears/wday
    GLC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    WAV Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    CPL Run Time:     373.764 seconds        1.024 seconds/mday       231.16 myears/wday
    CPL COMM Time:   2080.388 seconds        5.700 seconds/mday        41.53 myears/wday
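As a quick check on these numbers, the throughput column can be reproduced from the seconds/mday column, and each component's share of the total can be computed directly. This is only a sketch: the 365-day model year is an assumption, and the function names are made up for illustration.

```python
# Run times in seconds, copied from the timing summary above.
RUN_TIME_S = {
    "TOT": 27073.821,
    "LND": 404.472,
    "ROF": 48.541,
    "ICE": 1140.622,
    "ATM": 24680.386,
    "OCN": 24.727,
    "CPL": 373.764,
}

SECONDS_PER_WALL_DAY = 86400.0
DAYS_PER_MODEL_YEAR = 365  # assumption: no-leap model calendar


def myears_per_wday(seconds_per_mday):
    """Convert seconds per model day into model years per wall-clock day."""
    return SECONDS_PER_WALL_DAY / (seconds_per_mday * DAYS_PER_MODEL_YEAR)


def share_of_total(component):
    """Fraction of the TOT run time spent in one component."""
    return RUN_TIME_S[component] / RUN_TIME_S["TOT"]


if __name__ == "__main__":
    print(f"TOT: {myears_per_wday(74.175):.2f} myears/wday")
    print(f"ATM: {myears_per_wday(67.617):.2f} myears/wday")
    for name in ("ATM", "ICE", "LND", "CPL"):
        print(f"{name} share of total: {100 * share_of_total(name):.1f}%")
```

Under these assumptions the atmosphere accounts for roughly 91% of the total run time, so any speed-up has to come from the ATM layout.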

 

So the ATM component accounts for most of the run time (24680 of 27074 seconds), and I tried changing the model PE layout.

The default PE settings below run without a problem on Yellowstone’s caldera queue, but when I changed all NTASKS to 16, or increased NTASKS to 16 for ATM only, the model took more than 20 minutes to run 5 model days and was killed for exceeding the wall-clock limit.

Copying env_mach_pes.xml files from other cases didn't work either.

Could you please help me make the model run faster?

 

 

<entry id="NTASKS_ATM"   value="8"  />
<entry id="NTHRDS_ATM"   value="1"  />
<entry id="ROOTPE_ATM"   value="0"  />
<entry id="NINST_ATM"   value="1"  />
<entry id="NINST_ATM_LAYOUT"   value="concurrent"  />

<entry id="NTASKS_LND"   value="8"  />
<entry id="NTHRDS_LND"   value="1"  />
<entry id="ROOTPE_LND"   value="0"  />
<entry id="NINST_LND"   value="1"  />
<entry id="NINST_LND_LAYOUT"   value="concurrent"  />

<entry id="NTASKS_ICE"   value="8"  />
<entry id="NTHRDS_ICE"   value="1"  />
<entry id="ROOTPE_ICE"   value="0"  />
<entry id="NINST_ICE"   value="1"  />
<entry id="NINST_ICE_LAYOUT"   value="concurrent"  />

<entry id="NTASKS_OCN"   value="8"  />
<entry id="NTHRDS_OCN"   value="1"  />
<entry id="ROOTPE_OCN"   value="0"  />
<entry id="NINST_OCN"   value="1"  />
<entry id="NINST_OCN_LAYOUT"   value="concurrent"  />

<entry id="NTASKS_CPL"   value="8"  />
<entry id="NTHRDS_CPL"   value="1"  />
<entry id="ROOTPE_CPL"   value="0"  />

<entry id="NTASKS_GLC"   value="8"  />
<entry id="NTHRDS_GLC"   value="1"  />
<entry id="ROOTPE_GLC"   value="0"  />
<entry id="NINST_GLC"   value="1"  />
<entry id="NINST_GLC_LAYOUT"   value="concurrent"  />

<entry id="NTASKS_ROF"   value="8"  />
<entry id="NTHRDS_ROF"   value="1"  />
<entry id="ROOTPE_ROF"   value="0"  />
<entry id="NINST_ROF"   value="1"  />
<entry id="NINST_ROF_LAYOUT"   value="concurrent"  />

<entry id="NTASKS_WAV"   value="8"  />
<entry id="NTHRDS_WAV"   value="1"  />
<entry id="ROOTPE_WAV"   value="0"  />
<entry id="NINST_WAV"   value="1"  />
<entry id="NINST_WAV_LAYOUT"   value="concurrent"  />

<entry id="PSTRID_ATM"   value="1"  />
<entry id="PSTRID_LND"   value="1"  />
<entry id="PSTRID_ICE"   value="1"  />
<entry id="PSTRID_OCN"   value="1"  />
<entry id="PSTRID_CPL"   value="1"  />
<entry id="PSTRID_GLC"   value="1"  />
<entry id="PSTRID_ROF"   value="1"  />
<entry id="PSTRID_WAV"   value="1"  />

<entry id="TOTALPES"   value="8"  />
<entry id="PES_LEVEL"   value="1r"  />
<entry id="MAX_TASKS_PER_NODE"   value="30"  />
<entry id="PES_PER_NODE"   value="16"  />
<entry id="COST_PES"   value="16"  />
<entry id="CCSM_PCOST"   value="2"  />
<entry id="CCSM_TCOST"   value="0"  />
<entry id="CCSM_ESTCOST"   value="6"  />

</config_definition>
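For anyone comparing layouts between cases, a listing like the one above can be summarized with a few lines of Python. This is a rough sketch, not part of the CESM scripts: the helper names are hypothetical, the sample is an abbreviated copy of the entries above wrapped in a `<config_definition>` element, and the task-span rule assumes every PSTRID is 1, so that a component occupies MPI tasks ROOTPE through ROOTPE + NTASKS - 1.

```python
import xml.etree.ElementTree as ET

# Abbreviated sample in the same <entry id=... value=... /> form as the listing above.
SAMPLE = """<config_definition>
<entry id="NTASKS_ATM"   value="8"  />
<entry id="NTHRDS_ATM"   value="1"  />
<entry id="ROOTPE_ATM"   value="0"  />
<entry id="NTASKS_OCN"   value="8"  />
<entry id="NTHRDS_OCN"   value="1"  />
<entry id="ROOTPE_OCN"   value="0"  />
</config_definition>"""

COMPONENTS = ("ATM", "LND", "ICE", "OCN", "CPL", "GLC", "ROF", "WAV")


def read_entries(xml_text):
    """Map each entry id to its value string."""
    root = ET.fromstring(xml_text)
    return {e.get("id"): e.get("value") for e in root.iter("entry")}


def layout(entries):
    """Return {component: (ntasks, nthrds, rootpe)} for the components present."""
    result = {}
    for comp in COMPONENTS:
        if f"NTASKS_{comp}" in entries:
            result[comp] = (
                int(entries[f"NTASKS_{comp}"]),
                int(entries.get(f"NTHRDS_{comp}", "1")),
                int(entries.get(f"ROOTPE_{comp}", "0")),
            )
    return result


def mpi_task_span(layout_by_comp):
    """Highest MPI task count needed, assuming a stride of 1 for every component."""
    return max(rootpe + ntasks for ntasks, _, rootpe in layout_by_comp.values())


if __name__ == "__main__":
    pes = layout(read_entries(SAMPLE))
    for comp, (ntasks, nthrds, rootpe) in pes.items():
        print(f"{comp}: ntasks={ntasks} nthrds={nthrds} rootpe={rootpe}")
    print("MPI tasks spanned:", mpi_task_span(pes))
```

With every ROOTPE at 0 and every NTASKS at 8, all components share the same 8 tasks, which is consistent with the TOTALPES value of 8 in the listing.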

 

Best,

Zaiyu

jedwards

You can keep the original PE layout and switch to the small queue, which uses dedicated rather than shared resources, by editing the $CASE.run script. Another option is to increase NTHRDS for the atm component from 1 to 4; this change will cause your job to run in the regular queue, which also gives you dedicated resources.

CESM Software Engineer

zwang19@...

Thanks a lot for your help. I tried changing the queue from caldera to small; the corresponding timing improved only slightly, from 68.325 seconds/mday to 68.050 seconds/mday.

Then I increased NTHRDS_ATM from 1 to 4, and the run time dropped to 55.858 seconds/mday, still with the atmosphere component consuming most of the time.

Are there any further improvements I could make?
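To put those two changes in perspective, the relative reduction in seconds/mday can be worked out as below. The function name is made up for this sketch, and it assumes both results are measured against the same 68.325 seconds/mday baseline, which the post does not state explicitly.

```python
def percent_reduction(before, after):
    """Percentage drop in seconds/mday going from `before` to `after`."""
    return 100.0 * (before - after) / before


BASELINE = 68.325  # seconds/mday on the caldera queue (from the post)

queue_change = percent_reduction(BASELINE, 68.050)   # caldera -> small queue
thread_change = percent_reduction(BASELINE, 55.858)  # NTHRDS_ATM 1 -> 4

if __name__ == "__main__":
    print(f"queue change:  {queue_change:.1f}% faster")
    print(f"thread change: {thread_change:.1f}% faster")
```

On that assumption the queue change is worth well under 1%, while four threads give about an 18% reduction, far from the fourfold gain that perfect thread scaling would deliver.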

zwang19@...

I later set all NTASKS to 255, the maximum supported on Yellowstone, and kept NTHRDS at 1.

The model now runs 49 model years per day!
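For a rough idea of how well the run scales, the speed-up can be compared with the increase in task count. This sketch assumes the 49 myears/wday figure is directly comparable with the 3.19 myears/wday total from the original 8-task run, and it ignores any difference between queues; the function names are illustrative only.

```python
def speedup(baseline_throughput, new_throughput):
    """How many times faster the new run is, in model years per wall-clock day."""
    return new_throughput / baseline_throughput


def parallel_efficiency(baseline_tasks, new_tasks, achieved_speedup):
    """Achieved speed-up divided by the ideal speed-up from the extra tasks."""
    return achieved_speedup / (new_tasks / baseline_tasks)


if __name__ == "__main__":
    s = speedup(3.19, 49.0)               # 8 tasks -> 255 tasks
    e = parallel_efficiency(8, 255, s)
    print(f"speed-up: {s:.1f}x, parallel efficiency: {100 * e:.0f}%")
```

Under those assumptions the run is about 15 times faster on about 32 times as many tasks, an efficiency near 48%, so a smaller task count may give a better trade-off between turnaround and core-hours.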

My only complaint concerns the CONTINUE_RUN option.

According to the user guide: "A brief note on restarting runs. When you first begin a branch, hybrid or startup run, CONTINUE_RUN must be set to FALSE. When you successfully run and get a restart file, you will need to change CONTINUE_RUN to TRUE for the remainder of your run."

The model should warn the user if CONTINUE_RUN is set to TRUE on an initial run, rather than silently accepting the setting and running indefinitely without stopping.
