Speeding up my simulation

Hi everyone,

I am trying to create a faster PES layout for the BWCN compset on Yellowstone. The default layout and timing are:

env_mach_pes.xml:

timing:

  component       comp_pes    root_pe   tasks  x threads instances (stride)
  ---------        ------     -------   ------   ------  ---------  ------
  cpl = cpl        360         0        180    x 2       1      (1     )
  glc = sglc       360         0        180    x 2       1      (1     )
  wav = swav       360         0        180    x 2       1      (1     )
  lnd = clm        120         0        60     x 2       1      (1     )
  rof = rtm        120         0        60     x 2       1      (1     )
  ice = cice       240         60       120    x 2       1      (1     )
  atm = cam        360         0        180    x 2       1      (1     )
  ocn = pop2       60          180      30     x 2       1      (1     )

  total pes active           : 420
  pes per node               : 16
  pe count for cost estimate : 224

  Overall Metrics:
    Model Cost:            1263.13   pe-hrs/simulated_year
    Model Throughput:         4.26   simulated_years/day

    Init Time   :      72.006 seconds
    Run Time    :   40600.688 seconds       55.617 seconds/day
    Final Time  :       0.059 seconds

    Actual Ocn Init Wait Time     :       0.000 seconds
    Estimated Ocn Init Run Time   :       0.000 seconds
    Estimated Run Time Correction :       0.000 seconds
      (This correction has been applied to the ocean and total run times)

Runs Time in total seconds, seconds/model-day, and model-years/wall-day
CPL Run Time represents time in CPL pes alone, not including time associated with data exchange with other components

    TOT Run Time:   40600.688 seconds       55.617 seconds/mday         4.26 myears/wday
    LND Run Time:     584.937 seconds        0.801 seconds/mday       295.42 myears/wday
    ROF Run Time:      21.871 seconds        0.030 seconds/mday      7900.87 myears/wday
    ICE Run Time:    2406.515 seconds        3.297 seconds/mday        71.81 myears/wday
    ATM Run Time:   37422.500 seconds       51.264 seconds/mday         4.62 myears/wday
    OCN Run Time:   11552.612 seconds       15.825 seconds/mday        14.96 myears/wday
    GLC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    WAV Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    CPL Run Time:    2230.241 seconds        3.055 seconds/mday        77.48 myears/wday
    CPL COMM Time:  29009.397 seconds       39.739 seconds/mday         5.96 myears/wday

As you can see, the ATM run time is 51.264 seconds per model day. I also noticed that the PES_LEVEL is '2rp'. I tried ramping up the number of cores to get ATM to run faster, but it actually ran slower (see timing below):

timing:

  component       comp_pes    root_pe   tasks  x threads instances (stride)
  ---------        ------     -------   ------   ------  ---------  ------
  cpl = cpl        480         0        240    x 2       1      (1     )
  glc = sglc       360         0        180    x 2       1      (1     )
  wav = swav       360         0        180    x 2       1      (1     )
  lnd = clm        60          0        30     x 2       1      (1     )
  rof = rtm        120         0        60     x 2       1      (1     )
  ice = cice       480         60       240    x 2       1      (1     )
  atm = cam        1280        0        640    x 2       1      (1     )
  ocn = pop2       60          180      30     x 2       1      (1     )

  total pes active           : 1280
  pes per node               : 16
  pe count for cost estimate : 688

  Overall Metrics:
    Model Cost:            5022.65   pe-hrs/simulated_year
    Model Throughput:         3.29   simulated_years/day

    Init Time   :      71.674 seconds
    Run Time    :     360.018 seconds       72.004 seconds/day
    Final Time  :       0.080 seconds

    Actual Ocn Init Wait Time     :       0.000 seconds
    Estimated Ocn Init Run Time   :      16.866 seconds
    Estimated Run Time Correction :      16.866 seconds
      (This correction has been applied to the ocean and total run times)

Runs Time in total seconds, seconds/model-day, and model-years/wall-day
CPL Run Time represents time in CPL pes alone, not including time associated with data exchange with other components

    TOT Run Time:     360.018 seconds       72.004 seconds/mday         3.29 myears/wday
    LND Run Time:      10.600 seconds        2.120 seconds/mday       111.66 myears/wday
    ROF Run Time:       0.539 seconds        0.108 seconds/mday      2195.85 myears/wday
    ICE Run Time:      15.844 seconds        3.169 seconds/mday        74.70 myears/wday
    ATM Run Time:     319.446 seconds       63.889 seconds/mday         3.71 myears/wday
    OCN Run Time:      84.330 seconds       16.866 seconds/mday        14.03 myears/wday
    GLC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    WAV Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    CPL Run Time:      20.090 seconds        4.018 seconds/mday        58.91 myears/wday
    CPL COMM Time:     24.674 seconds        4.935 seconds/mday        47.97 myears/wday
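
For anyone comparing the two timing summaries above, the throughput and cost lines can be reproduced from the seconds-per-model-day figure and the "pe count for cost estimate". The sketch below assumes a 365-day model year and that cost is simply PE count times wall-clock hours per simulated year. The function names are made up for illustration and are not part of CESM. Under those assumptions the results match the reported values to within rounding.

```python
SECONDS_PER_WALL_DAY = 86400.0
DAYS_PER_MODEL_YEAR = 365  # assumption: no-leap calendar


def throughput_myears_per_wday(sec_per_mday):
    """Simulated years per wall-clock day, from seconds per model day."""
    return SECONDS_PER_WALL_DAY / (sec_per_mday * DAYS_PER_MODEL_YEAR)


def cost_pe_hours_per_myear(pe_count, sec_per_mday):
    """PE-hours per simulated year: PEs charged times wall hours per model year."""
    wall_hours_per_myear = sec_per_mday * DAYS_PER_MODEL_YEAR / 3600.0
    return pe_count * wall_hours_per_myear


# (pe count for cost estimate, seconds per model day) from the two summaries
runs = {
    "default layout": (224, 55.617),
    "more ATM pes": (688, 72.004),
}

for name, (pe_count, sec_per_mday) in runs.items():
    print(
        f"{name}: "
        f"{throughput_myears_per_wday(sec_per_mday):.2f} myears/wday, "
        f"{cost_pe_hours_per_myear(pe_count, sec_per_mday):.1f} pe-hrs/myear"
    )
```

Read this way, the second run loses throughput because seconds per model day went up, and its cost roughly quadruples because the PE count used for the estimate rose from 224 to 688 at the same time.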

So then I compared to an optimized B compset for 1850 with WACCM that someone else is using, and it uses a PES_LEVEL of '3rcm'. I have no idea what the PES_LEVEL is, since the only description I can find is 'pes level determined by automated initialization (DO NOT EDIT)'.

Like a scientist, I edited it in my simulation (along with halving my PES_PER_NODE to 16), and now it runs a little faster. See the following timing (note I used many fewer pes for atm):

  component       comp_pes    root_pe   tasks  x threads instances (stride)
  ---------        ------     -------   ------   ------  ---------  ------
  cpl = cpl        240         0        240    x 1       1      (1     )
  glc = sglc       1           0        1      x 1       1      (1     )
  wav = swav       1           0        1      x 1       1      (1     )
  lnd = clm        48          240      48     x 1       1      (1     )
  rof = rtm        30          0        30     x 1       1      (1     )
  ice = cice       240         0        240    x 1       1      (1     )
  atm = cam        320         0        320    x 1       1      (1     )
  ocn = pop2       32          320      32     x 1       1      (1     )

  total pes active           : 352
  pes per node               : 16
  pe count for cost estimate : 704

  Overall Metrics:
    Model Cost:            3425.68   pe-hrs/simulated_year
    Model Throughput:         4.93   simulated_years/day

    Init Time   :      54.028 seconds
    Run Time    :     239.968 seconds       47.994 seconds/day
    Final Time  :       0.146 seconds

    Actual Ocn Init Wait Time     :      54.340 seconds
    Estimated Ocn Init Run Time   :      12.721 seconds
    Estimated Run Time Correction :       0.000 seconds
      (This correction has been applied to the ocean and total run times)

Runs Time in total seconds, seconds/model-day, and model-years/wall-day
CPL Run Time represents time in CPL pes alone, not including time associated with data exchange with other components

    TOT Run Time:     239.968 seconds       47.994 seconds/mday         4.93 myears/wday
    LND Run Time:       7.713 seconds        1.543 seconds/mday       153.45 myears/wday
    ROF Run Time:       0.441 seconds        0.088 seconds/mday      2683.81 myears/wday
    ICE Run Time:       9.087 seconds        1.817 seconds/mday       130.25 myears/wday
    ATM Run Time:     225.255 seconds       45.051 seconds/mday         5.25 myears/wday
    OCN Run Time:      63.605 seconds       12.721 seconds/mday        18.61 myears/wday
    GLC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    WAV Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    CPL Run Time:       4.826 seconds        0.965 seconds/mday       245.25 myears/wday
    CPL COMM Time:    133.160 seconds       26.632 seconds/mday         8.89 myears/wday

So my question is: what is PES_LEVEL? What else can I do to make this run faster? I know that ATM runs sequentially with LND and ICE, which all in turn run in parallel with OCN. So I am really just trying to reduce ATM as much as possible. Thanks for any advice!

-Dr. Ethan D. Peck
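
The sequencing described in this post can be turned into a rough estimate of what limits the run. The sketch below is a simplification, not the coupler's actual scheduling: it assumes ATM, LND and ICE run one after another while OCN runs concurrently on separate PEs, and it ignores coupler and communication time. The function name is hypothetical.

```python
def estimate_sec_per_mday(atm, lnd, ice, ocn):
    """Rough seconds per model day and the limiting side.

    Assumption: ATM, LND and ICE run back to back, OCN runs concurrently
    with that chain, and coupler/communication time is ignored.
    """
    sequential_chain = atm + lnd + ice
    limiter = "ATM+LND+ICE" if sequential_chain >= ocn else "OCN"
    return max(sequential_chain, ocn), limiter


# seconds/mday taken from the most recent timing table in this post
estimate, limiter = estimate_sec_per_mday(
    atm=45.051, lnd=1.543, ice=1.817, ocn=12.721
)
print(f"estimated {estimate:.3f} s/mday, limited by {limiter}")
```

With those inputs the estimate is about 48.4 seconds per model day against the 47.994 reported, and OCN at 12.721 has a lot of slack, which fits the view that ATM is the component to focus on.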
 

jedwards

CSEG and Liaisons
Staff member
The part you are missing is the ROOTPE settings. ROOTPE_ICE should be the same as NTASKS_LND, and ROOTPE_OCN should be the same as NTASKS_ATM. When you increased NTASKS_ATM without changing ROOTPE_OCN, you caused the ATM and OCN PEs to overlap and prevented those components from running concurrently.
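
A small check of the overlap described here, using the root_pe and tasks columns from the timing tables earlier in the thread. It treats each component as occupying MPI ranks root_pe through root_pe + tasks - 1, which is an assumption about how the layout maps to ranks. The helper names are illustrative only.

```python
def ranks(root_pe, ntasks):
    """Half-open range of MPI ranks [root_pe, root_pe + ntasks)."""
    return range(root_pe, root_pe + ntasks)


def overlaps(a, b):
    """True if two rank ranges share at least one rank."""
    return max(a.start, b.start) < min(a.stop, b.stop)


# Default layout: ATM has 180 tasks at root 0, OCN has 30 tasks at root 180.
default_atm, default_ocn = ranks(0, 180), ranks(180, 30)
print(overlaps(default_atm, default_ocn))  # False

# 1280-pe run: NTASKS_ATM raised to 640, ROOTPE_OCN left at 180.
big_atm, stale_ocn = ranks(0, 640), ranks(180, 30)
print(overlaps(big_atm, stale_ocn))  # True

# Applying the rule ROOTPE_OCN = NTASKS_ATM moves OCN past the ATM ranks.
moved_ocn = ranks(640, 30)
print(overlaps(big_atm, moved_ocn))  # False
```

The same check can be run on ROOTPE_ICE against NTASKS_LND before submitting a new layout.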
 

jedwards

CSEG and Liaisons
Staff member
PES_LEVEL was someone's idea of how to provide several PE layouts for the same compset; it's not very well maintained.
 
So I tried adding the ROOTPEs; these are the timing results:

  component       comp_pes    root_pe   tasks  x threads instances (stride)
  ---------        ------     -------   ------   ------  ---------  ------
  cpl = cpl        240         0        240    x 1       1      (1     )
  glc = sglc       1           0        1      x 1       1      (1     )
  wav = swav       1           0        1      x 1       1      (1     )
  lnd = clm        48          240      48     x 1       1      (1     )
  rof = rtm        30          0        30     x 1       1      (1     )
  ice = cice       240         48       240    x 1       1      (1     )
  atm = cam        640         0        640    x 1       1      (1     )
  ocn = pop2       32          640      32     x 1       1      (1     )

  total pes active           : 672
  pes per node               : 16
  pe count for cost estimate : 1344

  Overall Metrics:
    Model Cost:            7001.57   pe-hrs/simulated_year
    Model Throughput:         4.61   simulated_years/day

    Init Time   :      55.585 seconds
    Run Time    :     256.907 seconds       51.381 seconds/day
    Final Time  :       0.099 seconds

    Actual Ocn Init Wait Time     :      62.345 seconds
    Estimated Ocn Init Run Time   :      12.756 seconds
    Estimated Run Time Correction :       0.000 seconds
      (This correction has been applied to the ocean and total run times)

Runs Time in total seconds, seconds/model-day, and model-years/wall-day
CPL Run Time represents time in CPL pes alone, not including time associated with data exchange with other components

    TOT Run Time:     256.907 seconds       51.381 seconds/mday         4.61 myears/wday
    LND Run Time:       7.535 seconds        1.507 seconds/mday       157.08 myears/wday
    ROF Run Time:       0.422 seconds        0.084 seconds/mday      2804.65 myears/wday
    ICE Run Time:       9.049 seconds        1.810 seconds/mday       130.79 myears/wday
    ATM Run Time:     236.425 seconds       47.285 seconds/mday         5.01 myears/wday
    OCN Run Time:      63.780 seconds       12.756 seconds/mday        18.56 myears/wday
    GLC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    WAV Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    CPL Run Time:      12.279 seconds        2.456 seconds/mday        96.39 myears/wday
    CPL COMM Time:    141.907 seconds       28.381 seconds/mday         8.34 myears/wday

This did not go any faster than the last ones. Any other ideas to speed up the runs, or should I just quit with this?

-Ethan
 

jedwards

CSEG and Liaisons
Staff member
In some versions of the model, the file env_mach_specific sets the variable MP_EAGER_LIMIT to 0. Try commenting this out; it should help performance. However, in some CESM configurations this creates a memory leak, so watch the memory usage output in cpl.log. Memory might be expected to grow a little over time, but if you've triggered the memory leak it will grow very quickly.

Also, why did you turn threading off? Setting each component's NTHRDS=2 should give you an additional 10-15% through the use of the system's hyperthreading capability.
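
One way to tell "a little growth" from "very quick growth" is to fit a trend to successive memory readings. The sketch below assumes the readings (in MB) have already been copied out of cpl.log by hand; it does not parse the log, since the exact line format is not shown in this thread. The 1% per sample threshold and the sample values are made up purely for illustration.

```python
def growth_per_sample(samples_mb):
    """Least-squares slope of memory (MB) against sample index."""
    n = len(samples_mb)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2.0
    mean_y = sum(samples_mb) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples_mb))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def looks_like_leak(samples_mb, max_fraction_per_sample=0.01):
    """Flag growth above a hypothetical fraction of the first reading per sample."""
    return growth_per_sample(samples_mb) > max_fraction_per_sample * samples_mb[0]


# Made-up readings, not taken from any real cpl.log
steady = [900.0, 901.0, 901.5, 902.0, 902.0]
leaking = [900.0, 960.0, 1025.0, 1090.0, 1150.0]

print(looks_like_leak(steady))   # False
print(looks_like_leak(leaking))  # True
```

A sensible threshold depends on the configuration and on how often memory is reported, so it is best tuned against a short run with MP_EAGER_LIMIT left at its original setting.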
 