
HS94/PK02 Performance

I am running HS94/PK02 configurations of CESM2 on a local machine and am finding that the Eulerian core is *much* slower than the Reading dry dynamical core ('IGCM1'; http://www.met.reading.ac.uk/~mike/dyn_models/igcm/). For a T42L60 configuration I need about 17.1 PE-hours/model year using CESM2, whereas for a very similar configuration I need only 1.46 PE-hours/model year with the Reading dry dynamical core on the same hardware (a fairly recent 80-core Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz).

Is this a reasonable timing for this simpler-model configuration of CESM2? What kind of throughputs do others see? Is there anything obvious I can do to increase the speed?

Thanks,
Peter
 

islas

Moderator
Staff member
Hi Peter,

Could you also send along the "create_newcase" command you used and any xmlchange commands you used to configure your case?

Thanks,
Isla
 

goldy

AMP / CGD / NCAR
Staff member
Peter,

It might also be helpful to see the CIME configuration for your machine. If it is a supported machine, please report the name of the machine from /cime/config/machines/config_machines.xml. Otherwise, please send the machine entry from your machine file (config_machines.xml, probably in your $HOME/.cime directory).

Thanks,
--Steve
 
Hi Isla, Steve - thanks for the quick replies.

I've followed the steps here:
http://www.cesm.ucar.edu/models/simpler-models/held-suarez.html
to generate the cases that I've been running, but I have also included code to read the temperature relaxation profile from an external NetCDF file. I can easily run a pure HS94 case with no modifications if that is useful.

The machine is not supported; the relevant XML from config_machines.xml follows.

Thank you,
Peter
Code:
<machine MACH="orm">
    <DESC>
       80-core CentOS 7.4 machine hosted by BioHPC at Cornell
    </DESC>
    <NODENAME_REGEX>cbsuorm.biohpc.cornell.edu</NODENAME_REGEX>
    <OS>LINUX</OS>
    <COMPILERS>gnu</COMPILERS>
    <MPILIBS>mpich</MPILIBS>
    <PROJECT>none</PROJECT>
    <SAVE_TIMING_DIR> </SAVE_TIMING_DIR>
    <CIME_OUTPUT_ROOT>/local/storage/$USER/cesm/scratch</CIME_OUTPUT_ROOT>
    <DIN_LOC_ROOT>/local/storage/cesm/inputdata</DIN_LOC_ROOT>
    <DIN_LOC_ROOT_CLMFORC>/local/storage/cesm/inputdata/lmwg</DIN_LOC_ROOT_CLMFORC>
    <DOUT_S_ROOT>/local/storage/$USER/cesm/archive/$CASE</DOUT_S_ROOT>
    <BASELINE_ROOT>/local/storage/cesm/cesm_baselines</BASELINE_ROOT>
    <CCSM_CPRNC>/home/$USER/cesm/cime/tools/cprnc/cprnc</CCSM_CPRNC>
    <GMAKE>gmake</GMAKE>
    <GMAKE_J>8</GMAKE_J>
    <BATCH_SYSTEM>none</BATCH_SYSTEM>
    <SUPPORTED_BY>aph28@cornell.edu</SUPPORTED_BY>
    <MAX_TASKS_PER_NODE>20</MAX_TASKS_PER_NODE>
    <MAX_MPITASKS_PER_NODE>20</MAX_MPITASKS_PER_NODE>
    <PROJECT_REQUIRED>FALSE</PROJECT_REQUIRED>
    <mpirun mpilib="default">
      <executable>mpiexec</executable>
      <arguments>
        <arg name="ntasks"> -np {{ total_tasks }} </arg>
      </arguments>
    </mpirun>
    <module_system type="none"/>
    <environment_variables>
      <env name="OMP_STACKSIZE">256M</env>
      <env name="NETCDF_HOME">/usr/local</env>
    </environment_variables>
    <resource_limits>
      <resource name="RLIMIT_STACK">-1</resource>
    </resource_limits>
  </machine>
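For reference, generating a case by following that page would look roughly like the sketch below; the case name, directories, and resolution alias are illustrative assumptions rather than the exact commands used.
Code:
# Hypothetical HS94 case generation along the lines of the simpler-models page;
# case name/paths are placeholders, and --res should be the T42 Eulerian alias
# the page specifies for your CESM2 checkout.
cd cime/scripts
./create_newcase --case ~/cases/hs94_t42l30 --compset FHS94 \
    --res T42z30_T42_mg17 --run-unsupported
cd ~/cases/hs94_t42l30
./case.setup
./case.build
./case.submit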
 

islas

Moderator
Staff member
Hi Peter, Steve,

I have run the out-of-the-box Held-Suarez with no modifications and get 14 PE-hrs/year at T42L30 resolution, which is still, I think, around 7 times more expensive than what Peter was finding with the Reading core at L60 (i.e., double the vertical resolution).

Isla
 
Hi Isla, Steve:

I did the same thing last night and found 4.4 PE-hours/model year at T42L30 -- which is significantly faster, but still much slower than IGCM.

Peter
 

goldy

AMP / CGD / NCAR
Staff member
Peter,

I wonder if this is a scaling issue. When I run on a local CentOS cluster, the default layout is to use 48 tasks. With that layout, I get 24 PE-hours / simulated year. When I rebuild and rerun with 24 tasks, I get 4.5 PE-hours / simulated year. What is your setup? The output of ./preview_run would be informative. Try running with fewer tasks.

--Steve
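The current layout can be inspected from the case directory without rebuilding; a minimal sketch using standard CIME case tools:
Code:
# Inspect the current PE layout and the MPI launch command for this case.
./preview_run        # prints tasks, threads, environment settings, and the mpirun line
./xmlquery NTASKS    # per-component MPI task counts
./xmlquery NTHRDS    # per-component thread counts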
 
Hi Steve,

For the standard HS94 run at T42L30 that I just did, the output of preview_run is:
Code:
CASE INFO:
  nodes: 1
  total tasks: 20
  tasks per node: 20
  thread count: 1

BATCH INFO:
  FOR JOB: case.run
    ENV:
      Setting Environment OMP_STACKSIZE=256M
      Setting Environment NETCDF_HOME=/usr/local
      Setting Environment OMP_NUM_THREADS=1
    SUBMIT CMD:
      None

  FOR JOB: case.st_archive
    ENV:
      Setting Environment OMP_STACKSIZE=256M
      Setting Environment NETCDF_HOME=/usr/local
      Setting Environment OMP_NUM_THREADS=1
    SUBMIT CMD:
      None

MPIRUN:
  mpiexec  -np 20  /local/storage/aph28/cesm/scratch/hs3/bld/cesm.exe  >> cesm.log.$LID 2>&1

Is the number of tasks simply the number of CPUs used? We have tried running with only 1 CPU and found only a slight (10%) speed-up in terms of PE-hours/model year, but we can explore this further.

Edit: If this is not the case, how should I change the number of tasks?

Thanks,
Peter
 

goldy

AMP / CGD / NCAR
Staff member
Peter,

The first section (CASE INFO) shows 20 tasks, and your mpiexec looks like it is set to run 20 tasks, so that is good. Is 20 tasks on a single node a good layout for this machine?

To change the number of tasks, the command is:
Code:
./xmlchange NTASKS=xx
To rebuild in the same case, you have to clean and reset it first:
Code:
./case.build --clean-all

Code:
./case.setup --reset

Code:
./case.build

More info in the CIME manual: https://esmci.github.io/cime/

--Steve
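Putting those steps together, a typical sequence to change the task count and rebuild might look like the sketch below; the NTASKS value is illustrative.
Code:
# Illustrative sequence for changing the PE layout and rebuilding a case;
# run from the case directory. NTASKS=20 is just an example value.
./xmlchange NTASKS=20
./case.build --clean-all
./case.setup --reset
./case.build
./case.submit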
 
Hi Steve,

I'm not sure how to determine a good layout for this machine, other than to say that there is only one node with the following processor (from lscpu):
Code:
Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz
CPU(s):                80
On-line CPU(s) list:   0-79
Thread(s) per core:    2
Core(s) per socket:    20
Socket(s):             2
NUMA node(s):          2

From this perspective, anything less than 80 tasks on the single node seems reasonable (?). I have tried running the base HS94 T42L30 case with NTASKS set to 1, 5, 20, and 40. I am somewhat confused by the timing information output, but my impression is that the scaling is quite linear from 1 to 20 tasks, and that the efficiency falls off substantially at 40 tasks.

For the 5 CPU case, the relevant snippet from the timing log file is
Code:
grid        : a%T42z30_l%null_oi%null_r%null_g%null_w%null_m%gx1v7
  compset     : 2000_CAM%HS94_SLND_SICE_SOCN_SROF_SGLC_SWAV
  run_type    : startup, continue_run = FALSE (inittype = TRUE)
  stop_option : ndays, stop_n = 1000
  run_length  : 1000 days (999.986111111 for ocean)

  component       comp_pes    root_pe   tasks  x threads instances (stride) 
  ---------        ------     -------   ------   ------  ---------  ------  
  cpl = cpl        5           0        5      x 1       1      (1     ) 
...

  total pes active           : 5 
  mpi tasks per node               : 20 
  pe count for cost estimate : 20

  Overall Metrics: 
    Model Cost:              20.71   pe-hrs/simulated_year 
    Model Throughput:        23.18   simulated_years/day 

    Init Time   :       0.344 seconds 
    Run Time    :   10212.713 seconds       10.213 seconds/day 
    Final Time  :       0.001 seconds
For the 20 CPU case:
Code:
grid        : a%T42z30_l%null_oi%null_r%null_g%null_w%null_m%gx1v7
  compset     : 2000_CAM%HS94_SLND_SICE_SOCN_SROF_SGLC_SWAV
  run_type    : startup, continue_run = FALSE (inittype = TRUE)
  stop_option : ndays, stop_n = 2000
  run_length  : 2000 days (1999.98611111 for ocean)

  component       comp_pes    root_pe   tasks  x threads instances (stride) 
  ---------        ------     -------   ------   ------  ---------  ------  
  cpl = cpl        20          0        20     x 1       1      (1     ) 
  atm = cam        20          0        20     x 1       1      (1     ) 
...
  total pes active           : 20 
  mpi tasks per node               : 20 
  pe count for cost estimate : 20 

  Overall Metrics: 
    Model Cost:               4.43   pe-hrs/simulated_year 
    Model Throughput:       108.44   simulated_years/day 

    Init Time   :       0.313 seconds 
    Run Time    :    4365.908 seconds        2.183 seconds/day 
    Final Time  :       0.000 seconds
And for the 40 CPU case:
Code:
grid        : a%T42z30_l%null_oi%null_r%null_g%null_w%null_m%gx1v7
  compset     : 2000_CAM%HS94_SLND_SICE_SOCN_SROF_SGLC_SWAV
  run_type    : startup, continue_run = FALSE (inittype = TRUE)
  stop_option : ndays, stop_n = 1000
  run_length  : 1000 days (999.986111111 for ocean)

  component       comp_pes    root_pe   tasks  x threads instances (stride) 
  ---------        ------     -------   ------   ------  ---------  ------  
  cpl = cpl        40          0        40     x 1       1      (1     ) 
  atm = cam        40          0        40     x 1       1      (1     ) 
...

  total pes active           : 40 
  mpi tasks per node               : 40 
  pe count for cost estimate : 40 

  Overall Metrics: 
    Model Cost:              10.59   pe-hrs/simulated_year 
    Model Throughput:        90.69   simulated_years/day 

    Init Time   :       0.469 seconds 
    Run Time    :    2610.075 seconds        2.610 seconds/day 
    Final Time  :       0.001 seconds

Note that the 20 CPU case was run for 2000 days while the others ran for 1000 days. The wall time for the 20 CPU case was about 4400 s (or 2200 s for 1000 days), and for 5 CPUs it was just over 10200 s, or ~4 times the wall-clock time of the 20 CPU case, but the PE-hrs/simulated year stats aren't consistent with this. The 1 CPU case is still running, but its progress looks consistent with the 5 and 20 CPU cases.

On the other hand, the 40 CPU case took 2600 s to run 1000 days, i.e. longer than 20 CPUs despite using twice the cores.

(For comparison, the wall-clock time for IGCM to run 1000 days was 14400 s on a single core of the same machine, with twice the number of vertical levels and a slightly shorter timestep. Taking this into account, I still get that CESM is running about 10 times slower.)

So the layout certainly makes a difference in the above - I'm not sure why my 40 CPU run was so much less efficient.

Could there be issues with compiler/MPI flags in the build that would be contributing to the slow speed?

Thanks,

Peter
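Putting the comparison above into numbers (Run Time figures from the three timing logs, normalized to 1000 simulated days):
Code:
# Wall-clock seconds per 1000 simulated days, from the Run Time lines above:
#   5 tasks : 10212.7 s            (1000-day run)
#  20 tasks :  4365.9 / 2 ~ 2183 s (2000-day run)
#  40 tasks :  2610.1 s            (1000-day run)
echo "10212.713 / (4365.908/2)" | bc -l   # ~4.7x faster going from 5 to 20 tasks
echo "2610.075 / (4365.908/2)"  | bc -l   # ~1.2 -> 40 tasks is slower than 20 in wall time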
 

goldy

AMP / CGD / NCAR
Staff member
Peter,

Two points:

Your 5 PE run was actually pretty efficient. The issue can be seen in this line from the timing file:
Code:
pe count for cost estimate : 20
The system thinks (correctly or not, I do not know) that you are being charged for (or reserving) 20 PEs even though you are only running on 5. 20 PEs seems to be a sweet spot for this machine.

As far as 40 goes, I believe the issue is a lack of strong scaling in the Eulerian dycore. I believe you would get better performance from 40 vs. 20 PEs for a higher-resolution run such as T85L30.
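That is consistent with the 5-task timing log quoted earlier, assuming Model Cost is computed from the cost-estimate PE count rather than the active tasks; a rough check:
Code:
# Assumed formula: pe-hrs/simulated_year = pe_count * run_time_s / 3600 / (sim_days / 365)
echo "20 * 10212.713 / 3600 / (1000/365)" | bc -l   # ~20.7 -> matches the reported Model Cost
echo " 5 * 10212.713 / 3600 / (1000/365)" | bc -l   # ~5.2  -> cost counting only the 5 active tasks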

 
Steve,

The lack of strong scaling in the 40 PE case makes sense. However, the 5 PE case (and the 1 PE and 20 PE cases) is still extremely slow (an order of magnitude) relative to the other dynamical core I've been using. I have a hard time believing that the code base is really that much slower. Can you suggest other avenues for improving performance?

Thanks,
Peter
 