
Speeding up my simulation

Hi everyone,

I am trying to create a faster PES layout for the BWCN compset on Yellowstone. The default layout and timing are:

env_mach_pes.xml:
timing:  component       comp_pes    root_pe   tasks  x threads instances (stride)
  ---------        ------     -------   ------   ------  ---------  ------
  cpl = cpl        360         0        180    x 2       1      (1     )
  glc = sglc       360         0        180    x 2       1      (1     )
  wav = swav       360         0        180    x 2       1      (1     )
  lnd = clm        120         0        60     x 2       1      (1     )
  rof = rtm        120         0        60     x 2       1      (1     )
  ice = cice       240         60       120    x 2       1      (1     )
  atm = cam        360         0        180    x 2       1      (1     )
  ocn = pop2       60          180      30     x 2       1      (1     )

  total pes active           : 420
  pes per node               : 16
  pe count for cost estimate : 224

  Overall Metrics:
    Model Cost:            1263.13   pe-hrs/simulated_year
    Model Throughput:         4.26   simulated_years/day

    Init Time   :      72.006 seconds
    Run Time    :   40600.688 seconds       55.617 seconds/day
    Final Time  :       0.059 seconds

    Actual Ocn Init Wait Time     :       0.000 seconds
    Estimated Ocn Init Run Time   :       0.000 seconds
    Estimated Run Time Correction :       0.000 seconds
      (This correction has been applied to the ocean and total run times)

Runs Time in total seconds, seconds/model-day, and model-years/wall-day
CPL Run Time represents time in CPL pes alone, not including time associated with data exchange with other components

    TOT Run Time:   40600.688 seconds       55.617 seconds/mday         4.26 myears/wday
    LND Run Time:     584.937 seconds        0.801 seconds/mday       295.42 myears/wday
    ROF Run Time:      21.871 seconds        0.030 seconds/mday      7900.87 myears/wday
    ICE Run Time:    2406.515 seconds        3.297 seconds/mday        71.81 myears/wday
    ATM Run Time:   37422.500 seconds       51.264 seconds/mday         4.62 myears/wday
    OCN Run Time:   11552.612 seconds       15.825 seconds/mday        14.96 myears/wday
    GLC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    WAV Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    CPL Run Time:    2230.241 seconds        3.055 seconds/mday        77.48 myears/wday
    CPL COMM Time:  29009.397 seconds       39.739 seconds/mday         5.96 myears/wday
As you can see, the ATM run time is 51.264 seconds/mday. I also noticed that the PES_LEVEL is '2rp'. I tried ramping up the number of cores to get ATM to run faster, but it actually ran slower (see timing below):

timing:  component       comp_pes    root_pe   tasks  x threads instances (stride)
  ---------        ------     -------   ------   ------  ---------  ------
  cpl = cpl        480         0        240    x 2       1      (1     )
  glc = sglc       360         0        180    x 2       1      (1     )
  wav = swav       360         0        180    x 2       1      (1     )
  lnd = clm        60          0        30     x 2       1      (1     )
  rof = rtm        120         0        60     x 2       1      (1     )
  ice = cice       480         60       240    x 2       1      (1     )
  atm = cam        1280        0        640    x 2       1      (1     )
  ocn = pop2       60          180      30     x 2       1      (1     )

  total pes active           : 1280
  pes per node               : 16
  pe count for cost estimate : 688

  Overall Metrics:
    Model Cost:            5022.65   pe-hrs/simulated_year
    Model Throughput:         3.29   simulated_years/day

    Init Time   :      71.674 seconds
    Run Time    :     360.018 seconds       72.004 seconds/day
    Final Time  :       0.080 seconds

    Actual Ocn Init Wait Time     :       0.000 seconds
    Estimated Ocn Init Run Time   :      16.866 seconds
    Estimated Run Time Correction :      16.866 seconds
      (This correction has been applied to the ocean and total run times)

Runs Time in total seconds, seconds/model-day, and model-years/wall-day
CPL Run Time represents time in CPL pes alone, not including time associated with data exchange with other components

    TOT Run Time:     360.018 seconds       72.004 seconds/mday         3.29 myears/wday
    LND Run Time:      10.600 seconds        2.120 seconds/mday       111.66 myears/wday
    ROF Run Time:       0.539 seconds        0.108 seconds/mday      2195.85 myears/wday
    ICE Run Time:      15.844 seconds        3.169 seconds/mday        74.70 myears/wday
    ATM Run Time:     319.446 seconds       63.889 seconds/mday         3.71 myears/wday
    OCN Run Time:      84.330 seconds       16.866 seconds/mday        14.03 myears/wday
    GLC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    WAV Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    CPL Run Time:      20.090 seconds        4.018 seconds/mday        58.91 myears/wday
    CPL COMM Time:     24.674 seconds        4.935 seconds/mday        47.97 myears/wday

So then I compared to an optimized B compset for 1850 with WACCM that someone else is using; it uses a PES_LEVEL of '3rcm'. I have no idea what PES_LEVEL is, since the only description I can find is 'pes level determined by automated initialization (DO NOT EDIT)'. Like a scientist, I edited it in my simulation anyway (along with halving my PES_PER_NODE to 16), and now it runs a little faster; see the following timing (note I used many fewer PEs for ATM):

  component       comp_pes    root_pe   tasks  x threads instances (stride)
  ---------        ------     -------   ------   ------  ---------  ------
  cpl = cpl        240         0        240    x 1       1      (1     )
  glc = sglc       1           0        1      x 1       1      (1     )
  wav = swav       1           0        1      x 1       1      (1     )
  lnd = clm        48          240      48     x 1       1      (1     )
  rof = rtm        30          0        30     x 1       1      (1     )
  ice = cice       240         0        240    x 1       1      (1     )
  atm = cam        320         0        320    x 1       1      (1     )
  ocn = pop2       32          320      32     x 1       1      (1     )

  total pes active           : 352
  pes per node               : 16
  pe count for cost estimate : 704

  Overall Metrics:
    Model Cost:            3425.68   pe-hrs/simulated_year
    Model Throughput:         4.93   simulated_years/day

    Init Time   :      54.028 seconds
    Run Time    :     239.968 seconds       47.994 seconds/day
    Final Time  :       0.146 seconds

    Actual Ocn Init Wait Time     :      54.340 seconds
    Estimated Ocn Init Run Time   :      12.721 seconds
    Estimated Run Time Correction :       0.000 seconds
      (This correction has been applied to the ocean and total run times)

Runs Time in total seconds, seconds/model-day, and model-years/wall-day
CPL Run Time represents time in CPL pes alone, not including time associated with data exchange with other components

    TOT Run Time:     239.968 seconds       47.994 seconds/mday         4.93 myears/wday
    LND Run Time:       7.713 seconds        1.543 seconds/mday       153.45 myears/wday
    ROF Run Time:       0.441 seconds        0.088 seconds/mday      2683.81 myears/wday
    ICE Run Time:       9.087 seconds        1.817 seconds/mday       130.25 myears/wday
    ATM Run Time:     225.255 seconds       45.051 seconds/mday         5.25 myears/wday
    OCN Run Time:      63.605 seconds       12.721 seconds/mday        18.61 myears/wday
    GLC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    WAV Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    CPL Run Time:       4.826 seconds        0.965 seconds/mday       245.25 myears/wday
    CPL COMM Time:    133.160 seconds       26.632 seconds/mday         8.89 myears/wday

So my question is: what is PES_LEVEL? What else can I do to make this run faster? I know that ATM runs sequentially with LND and ICE, which all in turn run parallel to OCN, so I am really just trying to reduce the ATM time as much as possible. Thanks for any advice!

-Dr. Ethan D. Peck
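The concurrency argument above can be checked against the numbers with a rough cost model (a sketch only; the real coupler schedule is more complicated than two branches): ATM, LND, ICE, and CPL share PEs and run one after another, while OCN runs concurrently on its own PEs, so the wall time per model day is roughly the slower of the two branches. The helper name below is hypothetical.

```python
# Rough cost model, assuming ATM + LND + ICE + CPL run sequentially on one set
# of PEs while OCN runs concurrently on another. Times are seconds per model
# day taken from the first timing table in this post.
def wall_time_per_mday(atm, lnd, ice, cpl, ocn):
    """Estimate wall seconds per model day as the slower of the two branches."""
    sequential_branch = atm + lnd + ice + cpl
    return max(sequential_branch, ocn)

# Default-layout numbers (seconds/mday):
est = wall_time_per_mday(atm=51.264, lnd=0.801, ice=3.297, cpl=3.055, ocn=15.825)
print(round(est, 3))  # 58.417 -- in the same ballpark as the reported 55.617
```

The estimate lands within about 5% of the measured TOT time, and it shows why shrinking ATM is the right target: the sequential branch dominates, and OCN's 15.8 s/mday is hidden underneath it as long as the components actually overlap.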
 

jedwards

CSEG and Liaisons
Staff member
The part you are missing is the ROOTPE settings. ROOTPE_ICE should be the same as NTASKS_LND, and ROOTPE_OCN should be the same as NTASKS_ATM. When you increased NTASKS_ATM without changing ROOTPE_OCN, you caused the ATM and OCN PEs to overlap and prevented those components from running concurrently.
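The overlap can be seen by treating each component's MPI tasks as the half-open range [ROOTPE, ROOTPE + NTASKS), ignoring threading. A quick sketch (hypothetical helper names, numbers from the second timing table in this thread):

```python
# Sketch: detect overlapping PE ranges between components that are meant to
# run concurrently. A component occupies MPI tasks [root_pe, root_pe + ntasks).
def pes(root_pe, ntasks):
    return range(root_pe, root_pe + ntasks)

def overlaps(a, b):
    # Two half-open ranges intersect iff each starts before the other ends.
    return a.start < b.stop and b.start < a.stop

# Second layout: NTASKS_ATM was raised to 640 but ROOTPE_OCN stayed at 180,
# so ATM and OCN share tasks 180-209 and cannot run concurrently.
atm = pes(root_pe=0, ntasks=640)
ocn = pes(root_pe=180, ntasks=30)
print(overlaps(atm, ocn))  # True -> OCN is serialized behind ATM

# With ROOTPE_OCN equal to NTASKS_ATM, OCN sits after ATM's tasks.
ocn_fixed = pes(root_pe=640, ntasks=30)
print(overlaps(atm, ocn_fixed))  # False -> ATM and OCN can run at the same time
```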
 

jedwards

CSEG and Liaisons
Staff member
PE_Levels was someone's idea of how to provide several PE layouts for the same compset; it's not very well maintained.
 
So I tried adding the ROOTPEs; these are the timing results:

  component       comp_pes    root_pe   tasks  x threads instances (stride)
  ---------        ------     -------   ------   ------  ---------  ------
  cpl = cpl        240         0        240    x 1       1      (1     )
  glc = sglc       1           0        1      x 1       1      (1     )
  wav = swav       1           0        1      x 1       1      (1     )
  lnd = clm        48          240      48     x 1       1      (1     )
  rof = rtm        30          0        30     x 1       1      (1     )
  ice = cice       240         48       240    x 1       1      (1     )
  atm = cam        640         0        640    x 1       1      (1     )
  ocn = pop2       32          640      32     x 1       1      (1     )

  total pes active           : 672
  pes per node               : 16
  pe count for cost estimate : 1344

  Overall Metrics:
    Model Cost:            7001.57   pe-hrs/simulated_year
    Model Throughput:         4.61   simulated_years/day

    Init Time   :      55.585 seconds
    Run Time    :     256.907 seconds       51.381 seconds/day
    Final Time  :       0.099 seconds

    Actual Ocn Init Wait Time     :      62.345 seconds
    Estimated Ocn Init Run Time   :      12.756 seconds
    Estimated Run Time Correction :       0.000 seconds
      (This correction has been applied to the ocean and total run times)

Runs Time in total seconds, seconds/model-day, and model-years/wall-day
CPL Run Time represents time in CPL pes alone, not including time associated with data exchange with other components

    TOT Run Time:     256.907 seconds       51.381 seconds/mday         4.61 myears/wday
    LND Run Time:       7.535 seconds        1.507 seconds/mday       157.08 myears/wday
    ROF Run Time:       0.422 seconds        0.084 seconds/mday      2804.65 myears/wday
    ICE Run Time:       9.049 seconds        1.810 seconds/mday       130.79 myears/wday
    ATM Run Time:     236.425 seconds       47.285 seconds/mday         5.01 myears/wday
    OCN Run Time:      63.780 seconds       12.756 seconds/mday        18.56 myears/wday
    GLC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    WAV Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    CPL Run Time:      12.279 seconds        2.456 seconds/mday        96.39 myears/wday
    CPL COMM Time:    141.907 seconds       28.381 seconds/mday         8.34 myears/wday

This did not go any faster than the last ones. Any other ideas to speed up the runs, or should I just give up on this?

-Ethan
 

jedwards

CSEG and Liaisons
Staff member
In some versions of the model, the variable MP_EAGER_LIMIT is set to 0 in the file env_mach_specific. Try commenting this out; it should help performance. However, in some CESM configurations this creates a memory leak, so watch the memory usage output in cpl.log. It might be expected to grow a little over time, but if you've triggered the memory leak it will grow very quickly.

Also, why did you turn threading off? Setting each component's NTHRDS=2 should give you an additional 10-15% through the use of the system's hyperthreading capability.
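For reference, a sketch of how re-enabling threading might look with xmlchange from the case directory, assuming CESM1-era syntax as used on Yellowstone (check your model version's documentation; the values are illustrative):

```shell
# Sketch, run from the case directory (CESM1-era xmlchange syntax assumed).
# Restore 2 threads per task for the components discussed above.
./xmlchange -file env_mach_pes.xml -id NTHRDS_ATM -val 2
./xmlchange -file env_mach_pes.xml -id NTHRDS_LND -val 2
./xmlchange -file env_mach_pes.xml -id NTHRDS_ICE -val 2
./xmlchange -file env_mach_pes.xml -id NTHRDS_OCN -val 2
./xmlchange -file env_mach_pes.xml -id NTHRDS_CPL -val 2
# Changes to env_mach_pes.xml require cleaning and re-running setup,
# then rebuilding the case, before they take effect.
./cesm_setup -clean && ./cesm_setup
```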
 