
HS94/PK02 Performance

I am running HS94/PK02 configurations of CESM2 on a local machine and am finding that the Eulerian core is *much* slower than the Reading dry dynamical core ('IGCM1'; http://www.met.reading.ac.uk/~mike/dyn_models/igcm/). For a T42L60 configuration I need about 17.1 PE-hours/model year using CESM2, whereas for a very similar configuration I need only 1.46 PE-hours/model year with the Reading dry dynamical core on the same hardware (a fairly recent 80-core Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz).

Is this a reasonable timing for this simpler-model configuration of CESM2? What kind of throughputs do others see? Is there anything obvious I can do to increase the speed?

Thanks,
Peter
 

islas

Moderator
Staff member
Hi Peter,

Could you also send along the "create_newcase" command you used and any xmlchange commands you used to configure your case?

Thanks,
Isla
 

goldy

AMP / CGD / NCAR
Staff member
Peter,

It might also be helpful to see the CIME configuration for your machine. If it is a supported machine, please report the name of the machine from /cime/config/machines/config_machines.xml. Otherwise, please send the machine entry from your machine file (config_machines.xml, probably in your $HOME/.cime directory).

Thanks,
--Steve
 
Hi Isla, Steve - thanks for the quick replies.

I've followed the steps here:
http://www.cesm.ucar.edu/models/simpler-models/held-suarez.html
to generate the cases that I've been running, but I have also included code to read the temperature relaxation profile from an external NetCDF file. I can easily run a pure HS94 case with no modifications if that is useful.

The machine is not supported; the relevant XML from config_machines.xml follows.

Thank you,
Peter
Code:
<machine MACH="orm">
    <DESC>
       80-core CentOS 7.4 machine hosted by BioHPC at Cornell
    </DESC>
    <NODENAME_REGEX>cbsuorm.biohpc.cornell.edu</NODENAME_REGEX>
    <OS>LINUX</OS>
    <COMPILERS>gnu</COMPILERS>
    <MPILIBS>mpich</MPILIBS>
    <PROJECT>none</PROJECT>
    <SAVE_TIMING_DIR> </SAVE_TIMING_DIR>
    <CIME_OUTPUT_ROOT>/local/storage/$USER/cesm/scratch</CIME_OUTPUT_ROOT>
    <DIN_LOC_ROOT>/local/storage/cesm/inputdata</DIN_LOC_ROOT>
    <DIN_LOC_ROOT_CLMFORC>/local/storage/cesm/inputdata/lmwg</DIN_LOC_ROOT_CLMFORC>
    <DOUT_S_ROOT>/local/storage/$USER/cesm/archive/$CASE</DOUT_S_ROOT>
    <BASELINE_ROOT>/local/storage/cesm/cesm_baselines</BASELINE_ROOT>
    <CCSM_CPRNC>/home/$USER/cesm/cime/tools/cprnc/cprnc</CCSM_CPRNC>
    <GMAKE>gmake</GMAKE>
    <GMAKE_J>8</GMAKE_J>
    <BATCH_SYSTEM>none</BATCH_SYSTEM>
    <SUPPORTED_BY>aph28@cornell.edu</SUPPORTED_BY>
    <MAX_TASKS_PER_NODE>20</MAX_TASKS_PER_NODE>
    <MAX_MPITASKS_PER_NODE>20</MAX_MPITASKS_PER_NODE>
    <PROJECT_REQUIRED>FALSE</PROJECT_REQUIRED>
    <mpirun mpilib="default">
      <executable>mpiexec</executable>
      <arguments>
        <arg name="ntasks"> -np {{ total_tasks }} </arg>
      </arguments>
    </mpirun>
    <module_system type="none"/>
    <environment_variables>
      <env name="OMP_STACKSIZE">256M</env>
      <env name="NETCDF_HOME">/usr/local</env>
    </environment_variables>
    <resource_limits>
      <resource name="RLIMIT_STACK">-1</resource>
    </resource_limits>
  </machine>
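For reference, generating a case by following that page would look roughly like the sketch below; the case name, directories, and resolution alias are illustrative assumptions rather than the exact commands used.
Code:
# Hypothetical HS94 case generation along the lines of the simpler-models page;
# case name/paths are placeholders, and --res should be the T42 Eulerian alias
# the page specifies for your CESM2 checkout.
cd cime/scripts
./create_newcase --case ~/cases/hs94_t42l30 --compset FHS94 \
    --res T42z30_T42_mg17 --run-unsupported
cd ~/cases/hs94_t42l30
./case.setup
./case.build
./case.submit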
 

islas

Moderator
Staff member
Hi Peter, Steve,

I have run the out-of-the-box Held-Suarez with no modifications and get 14 PE-hrs/year at T42L30 resolution, which is still, I think, around 7 times more expensive than what Peter was finding with the Reading core at L60 (i.e., double the vertical resolution).

Isla
 
Hi Isla, Steve:

I did the same thing last night and found 4.4 PE-hours/model year at T42L30 -- which is significantly faster, but still much slower than IGCM.

Peter
 

goldy

AMP / CGD / NCAR
Staff member
Peter,

I wonder if this is a scaling issue. When I run on a local CentOS cluster, the default layout is to use 48 tasks. With that layout, I get 24 PE-hours / simulated year. When I rebuild and rerun with 24 tasks, I get 4.5 PE-hours / simulated year. What is your setup? The output of ./preview_run would be informative. Try running with fewer tasks.

--Steve
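The current layout can be inspected from the case directory without rebuilding; a minimal sketch using standard CIME case tools:
Code:
# Inspect the current PE layout and the MPI launch command for this case.
./preview_run        # prints tasks, threads, environment settings, and the mpirun line
./xmlquery NTASKS    # per-component MPI task counts
./xmlquery NTHRDS    # per-component thread counts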
 
Hi Steve,

For the standard HS94 run at T42L30 that I just did, the output of preview_run is:
Code:
CASE INFO:
  nodes: 1
  total tasks: 20
  tasks per node: 20
  thread count: 1

BATCH INFO:
  FOR JOB: case.run
    ENV:
      Setting Environment OMP_STACKSIZE=256M
      Setting Environment NETCDF_HOME=/usr/local
      Setting Environment OMP_NUM_THREADS=1
    SUBMIT CMD:
      None

  FOR JOB: case.st_archive
    ENV:
      Setting Environment OMP_STACKSIZE=256M
      Setting Environment NETCDF_HOME=/usr/local
      Setting Environment OMP_NUM_THREADS=1
    SUBMIT CMD:
      None

MPIRUN:
  mpiexec  -np 20  /local/storage/aph28/cesm/scratch/hs3/bld/cesm.exe  >> cesm.log.$LID 2>&1

Is the number of tasks simply the number of CPUs used? We have tried running with only 1 CPU and found only a slight (10%) speed-up in terms of PE-hours/model year, but we can explore this further.

Edit: If this is not the case, how should I change the number of tasks?

Thanks,
Peter
 

goldy

AMP / CGD / NCAR
Staff member
Peter,

The first section (CASE INFO) shows 20 tasks, and your mpiexec looks like it is set to run 20 tasks, so that is good. Is 20 tasks on a single node a good layout for this machine?

To change the number of tasks, the command is:
Code:
./xmlchange NTASKS=xx
To rebuild in the same case, you have to clean and reset it first:
Code:
./case.build --clean-all

Code:
./case.setup --reset

Code:
./case.build

More info in the CIME manual: https://esmci.github.io/cime/

--Steve
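Putting those steps together, a typical sequence to change the task count and rebuild might look like the sketch below; the NTASKS value is illustrative.
Code:
# Illustrative sequence for changing the PE layout and rebuilding a case;
# run from the case directory. NTASKS=20 is just an example value.
./xmlchange NTASKS=20
./case.build --clean-all
./case.setup --reset
./case.build
./case.submit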
 
Hi Steve,

I'm not sure how to determine a good layout for this machine, other than to say that there is only one node with the following processor (from lscpu):
Code:
Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz
CPU(s):                80
On-line CPU(s) list:   0-79
Thread(s) per core:    2
Core(s) per socket:    20
Socket(s):             2
NUMA node(s):          2

From this perspective, anything less than 80 tasks on the single node seems reasonable (?). I have tried running the base HS94 T42L30 case with NTASKS set to 1, 5, 20, and 40. I am somewhat confused by the timing information output, but my impression is that the scaling is quite linear from 1 to 20 tasks, and that the efficiency falls off substantially at 40 tasks.

For the 5 CPU case, the relevant snippet from the timing log file is
Code:
grid        : a%T42z30_l%null_oi%null_r%null_g%null_w%null_m%gx1v7
  compset     : 2000_CAM%HS94_SLND_SICE_SOCN_SROF_SGLC_SWAV
  run_type    : startup, continue_run = FALSE (inittype = TRUE)
  stop_option : ndays, stop_n = 1000
  run_length  : 1000 days (999.986111111 for ocean)

  component       comp_pes    root_pe   tasks  x threads instances (stride) 
  ---------        ------     -------   ------   ------  ---------  ------  
  cpl = cpl        5           0        5      x 1       1      (1     ) 
...

  total pes active           : 5 
  mpi tasks per node               : 20 
  pe count for cost estimate : 20

  Overall Metrics: 
    Model Cost:              20.71   pe-hrs/simulated_year 
    Model Throughput:        23.18   simulated_years/day 

    Init Time   :       0.344 seconds 
    Run Time    :   10212.713 seconds       10.213 seconds/day 
    Final Time  :       0.001 seconds
For the 20 CPU case:
Code:
grid        : a%T42z30_l%null_oi%null_r%null_g%null_w%null_m%gx1v7
  compset     : 2000_CAM%HS94_SLND_SICE_SOCN_SROF_SGLC_SWAV
  run_type    : startup, continue_run = FALSE (inittype = TRUE)
  stop_option : ndays, stop_n = 2000
  run_length  : 2000 days (1999.98611111 for ocean)

  component       comp_pes    root_pe   tasks  x threads instances (stride) 
  ---------        ------     -------   ------   ------  ---------  ------  
  cpl = cpl        20          0        20     x 1       1      (1     ) 
  atm = cam        20          0        20     x 1       1      (1     ) 
...
  total pes active           : 20 
  mpi tasks per node               : 20 
  pe count for cost estimate : 20 

  Overall Metrics: 
    Model Cost:               4.43   pe-hrs/simulated_year 
    Model Throughput:       108.44   simulated_years/day 

    Init Time   :       0.313 seconds 
    Run Time    :    4365.908 seconds        2.183 seconds/day 
    Final Time  :       0.000 seconds
And for the 40 CPU case:
Code:
grid        : a%T42z30_l%null_oi%null_r%null_g%null_w%null_m%gx1v7
  compset     : 2000_CAM%HS94_SLND_SICE_SOCN_SROF_SGLC_SWAV
  run_type    : startup, continue_run = FALSE (inittype = TRUE)
  stop_option : ndays, stop_n = 1000
  run_length  : 1000 days (999.986111111 for ocean)

  component       comp_pes    root_pe   tasks  x threads instances (stride) 
  ---------        ------     -------   ------   ------  ---------  ------  
  cpl = cpl        40          0        40     x 1       1      (1     ) 
  atm = cam        40          0        40     x 1       1      (1     ) 
...

  total pes active           : 40 
  mpi tasks per node               : 40 
  pe count for cost estimate : 40 

  Overall Metrics: 
    Model Cost:              10.59   pe-hrs/simulated_year 
    Model Throughput:        90.69   simulated_years/day 

    Init Time   :       0.469 seconds 
    Run Time    :    2610.075 seconds        2.610 seconds/day 
    Final Time  :       0.001 seconds

Note that the 20 CPU case was run for 2000 days while the others ran for 1000 days. The wall time for the 20 CPU case was about 4400 s (or 2200 s for 1000 days), and for 5 CPUs it was just over 10200 s, or ~4 times the wall-clock time of the 20 CPU case, but the PE-hrs/simulated year stats aren't consistent with this. The 1 CPU case is still running, but its progress looks consistent with the 5 and 20 CPU cases.

On the other hand, the 40 CPU case took 2600 s to run 1000 days, i.e. longer than 20 CPUs despite using twice the cores.

(For comparison, the wall-clock time for IGCM to run 1000 days was 14400 s on a single core of the same machine, with twice the number of vertical levels and a slightly shorter timestep. Taking this into account, I still get that CESM is running about 10 times slower.)

So the layout certainly makes a difference in the above - I'm not sure why my 40 CPU run was so much less efficient.

Could there be issues with compiler/MPI flags in the build that would be contributing to the slow speed?

Thanks,

Peter
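Putting the comparison above into numbers (Run Time figures from the three timing logs, normalized to 1000 simulated days):
Code:
# Wall-clock seconds per 1000 simulated days, from the Run Time lines above:
#   5 tasks : 10212.7 s            (1000-day run)
#  20 tasks :  4365.9 / 2 ~ 2183 s (2000-day run)
#  40 tasks :  2610.1 s            (1000-day run)
echo "10212.713 / (4365.908/2)" | bc -l   # ~4.7x faster going from 5 to 20 tasks
echo "2610.075 / (4365.908/2)"  | bc -l   # ~1.2 -> 40 tasks is slower than 20 in wall time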
 

goldy

AMP / CGD / NCAR
Staff member
Peter,

Two points:

Your 5 PE run was actually pretty efficient. The issue can be seen in this line from the timing file:
Code:
pe count for cost estimate : 20
The system thinks (correctly or not, I do not know) that you are being charged for (or reserving) 20 PEs even though you are only running on 5. 20 PEs seems to be a sweet spot for this machine.

As far as 40 goes, I believe the issue is a lack of strong scaling in the Eulerian dycore. I believe you would get better performance from 40 vs. 20 PEs for a higher-resolution run such as T85L30.
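That is consistent with the 5-task timing log quoted earlier, assuming Model Cost is computed from the cost-estimate PE count rather than the active tasks; a rough check:
Code:
# Assumed formula: pe-hrs/simulated_year = pe_count * run_time_s / 3600 / (sim_days / 365)
echo "20 * 10212.713 / 3600 / (1000/365)" | bc -l   # ~20.7 -> matches the reported Model Cost
echo " 5 * 10212.713 / 3600 / (1000/365)" | bc -l   # ~5.2  -> cost counting only the 5 active tasks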

 
Steve,

The lack of strong scaling in the 40 PE case makes sense. However, the 5 PE case (and the 1 PE and 20 PE cases) is still extremely slow (an order of magnitude) relative to the other dynamical core I've been using. I have a hard time believing that the code base is really that much slower. Can you suggest other avenues for improving performance?

Thanks,
Peter
 