
How to get the best load balancing for a 2.2.2 case

cristian-vasile.achim

Cristian Achim
New Member
Hello,

I hope I got the right section of the forum.
I am trying to get cesm2.2.2 running on a new machine (Mahti, 128 cores per node, gnu compilers, openmpi). I was able to get a test case to run, but I am not able to get a good load balance.

This is the output of ./describe_version:

Code:
$ ./describe_version
------------------------------------------------------------------------
git describe:
release-cesm2.2.2-0-g779b0a3
------------------------------------------------------------------------

We are missing some modules, but they do not matter for the cases that are important to us.

Have you made any changes to files in the source tree?
I changed the XML files to reflect the machine we are running on (compilers, batch settings, PE layout).


Describe every step you took leading up to the problem:
I created a test case using:
Code:
/projappl/project_2008521/cesm2.2.2/cime/scripts/create_newcase --compset FWmadSD --res f09_f09_mg17 --case test_omp_atm_36_lnd_5_ice_1_thrds_2_cpt_4 --mach mahti

I have tried different configurations with different numbers of nodes assigned to the LND and ICE components, and I am not able to get higher than 2.86 simulated years per day.
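For reference, a per-component layout like the one in the timing output below can be expressed with xmlchange roughly as follows (a sketch using the numbers from the 18-node ATM case, not the exact commands I used; the remaining components are set similarly):

Code:
# tasks, threads and root PEs per component; ICE shares the last ATM tasks
# (1088-1151) while OCN gets its own 64 tasks (1152-1215)
./xmlchange NTASKS_ATM=1152,ROOTPE_ATM=0
./xmlchange NTASKS_LND=320,ROOTPE_LND=0
./xmlchange NTASKS_ROF=256,ROOTPE_ROF=0
./xmlchange NTASKS_ICE=64,ROOTPE_ICE=1088
./xmlchange NTASKS_OCN=64,ROOTPE_OCN=1152
./xmlchange NTHRDS=2
./xmlchange MAX_MPITASKS_PER_NODE=64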

Here are the best timings I got:

18 nodes for ATM, 64 tasks per node, 2 threads per task:
Code:
Case        : test_omp_atm_18_lnd_5_ice_1_thrds_2_cpt_2
  LID         : 4423577.250425-161416
  Machine     : mahti
  Caseroot    : /users/cristian/test_omp_atm_18_lnd_5_ice_1_thrds_2_cpt_2
  Timeroot    : /users/cristian/test_omp_atm_18_lnd_5_ice_1_thrds_2_cpt_2/Tools
  User        : cristian
  Curr Date   : Fri Apr 25 16:25:26 2025
  grid        : a%0.9x1.25_l%0.9x1.25_oi%0.9x1.25_r%r05_g%null_w%null_z%null_m%gx1v7
  compset     : HIST_CAM60%WCMD%SDYN_CLM50%SP_CICE%PRES_DOCN%DOM_MOSART_SGLC_SWAV_SIAC_SESP
  run type    : startup, continue_run = FALSE (inittype = TRUE)
  stop option : ndays, stop_n = 5
  run length  : 5 days (4.979166666666667 for ocean)

  component       comp_pes    root_pe   tasks  x threads instances (stride)
  ---------        ------     -------   ------   ------  ---------  ------ 
  cpl = cpl        4608        0        1152   x 2       1      (1     )
  atm = cam        4608        0        1152   x 2       1      (1     )
  lnd = clm        1280        0        320    x 2       1      (1     )
  ice = cice       256         1088     64     x 2       1      (1     )
  ocn = docn       256         1152     64     x 2       1      (1     )
  rof = mosart     1024        0        256    x 2       1      (1     )
  glc = sglc       256         0        64     x 2       1      (1     )
  wav = swav       256         0        64     x 2       1      (1     )
  iac = siac       2           0        1      x 1       1      (1     )
  esp = sesp       2           0        1      x 1       1      (1     )

  total pes active           : 4864
  mpi tasks per node               : 64
  pe count for cost estimate : 1216

  Overall Metrics:
    Model Cost:           10771.50   pe-hrs/simulated_year
    Model Throughput:         2.71   simulated_years/day

    Init Time   :     226.028 seconds
    Run Time    :     436.840 seconds       87.368 seconds/day
    Final Time  :       0.002 seconds

    Actual Ocn Init Wait Time     :     382.250 seconds
    Estimated Ocn Init Run Time   :       0.000 seconds
    Estimated Run Time Correction :       0.000 seconds
      (This correction has been applied to the ocean and total run times)

Runs Time in total seconds, seconds/model-day, and model-years/wall-day
CPL Run Time represents time in CPL pes alone, not including time associated with data exchange with other components

    TOT Run Time:     436.840 seconds       87.368 seconds/mday         2.71 myears/wday
    CPL Run Time:       2.336 seconds        0.467 seconds/mday       506.66 myears/wday
    ATM Run Time:     416.642 seconds       83.328 seconds/mday         2.84 myears/wday
    LND Run Time:      12.373 seconds        2.475 seconds/mday        95.66 myears/wday
    ICE Run Time:       2.939 seconds        0.588 seconds/mday       402.71 myears/wday
    OCN Run Time:       0.058 seconds        0.012 seconds/mday     20321.21 myears/wday
    ROF Run Time:       0.817 seconds        0.163 seconds/mday      1448.67 myears/wday
    GLC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    WAV Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    IAC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    ESP Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    CPL COMM Time:     26.208 seconds        5.242 seconds/mday        45.16 myears/wday
   NOTE: min:max driver timers (seconds/day):   
                            CPL (pes 0 to 1151)
                                                ATM (pes 0 to 1151)
                                                LND (pes 0 to 319)
                                                                                     ICE (pes 1088 to 1151)
                                                                                       OCN (pes 1152 to 1215)
                                                ROF (pes 0 to 255)
                                                GLC (pes 0 to 63)
                                                WAV (pes 0 to 63)
                                                IAC (pes 0 to 0)
                                                ESP (pes 0 to 0)


If this is a port to a new machine: Please attach any files you added or changed for the machine port (e.g., config_compilers.xml, config_machines.xml, and config_batch.xml) and tell us the compiler version you are using on this machine.


Describe your problem or question:

I am trying to run cesm2.2.2 as efficiently as possible. Unfortunately, I have not been able to balance the time and resources: the LND component takes 12-14 s, while the ICE component takes 2-4 s. I tried to use more nodes for LND, but that resulted in worse performance. Is there a way/strategy to improve the performance?


Cristian
 

Attachments

  • entries_config_files.txt
    4.9 KB

jedwards

CSEG and Liaisons
Staff member
There is no reason to set different components to different task counts in an F case. All of the components run serially anyway.
So the best tuning is to optimize the task count for the atmosphere and use the same task count for the other components.
So do

./xmlchange ROOTPE=0
./xmlchange NTASKS=1152

(or whatever maximum NTASKS you can achieve). There is generally little or no gain from using threads, so

./xmlchange NTHRDS=1
 

cristian-vasile.achim

Cristian Achim
New Member
There is no reason to set different components to different task counts in an F case. All of the components run serially anyway.
So the best tuning is to optimize the task count for the atmosphere and use the same task count for the other components.
So do

./xmlchange ROOTPE=0
./xmlchange NTASKS=1152

(or whatever maximum NTASKS you can achieve). There is generally little or no gain from using threads, so

./xmlchange NTHRDS=1
Thank you for your reply. I will follow your advice and do more tests with all ROOTPE=0.
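Concretely, my understanding is that the next test looks something like this (a sketch, assuming the usual CIME reset/rebuild steps after changing the layout):

Code:
# same layout for every component, then reconfigure, rebuild and resubmit
./xmlchange ROOTPE=0,NTHRDS=1
./xmlchange NTASKS=1152
./case.setup --reset
./case.build
./case.submit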

Regarding the multithreading: I tested NTHRDS=1 with 128 tasks per node and NTHRDS=2 with 64 tasks per node. NTHRDS=2 with 64 tasks per node gave me more than double the performance.

Cristian
 

jedwards

CSEG and Liaisons
Staff member
I would be surprised if that was because of threading; more likely it has to do with reducing the memory footprint of the model.
You can confirm this by comparing performance using 64 tasks per node and NTHRDS=1.
 

cristian-vasile.achim

Cristian Achim
New Member
I would be surprised if that was because of threading; more likely it has to do with reducing the memory footprint of the model.
You can confirm this by comparing performance using 64 tasks per node and NTHRDS=1.
I do not know CESM so well, but I do know our cluster, and we have seen on many occasions that memory-bound applications benefit from using fewer tasks per node. I have also noted that for some applications using MPI collectives, having fewer tasks per node decreases the communication time. In this case, using some OpenMP did help to reduce the computation time.
As I said, I am not familiar with CESM and I am not sure how things are parallelized, but I have tests from last week:


NTHRDS=1, 128 tasks per node

Code:
  Overall Metrics:
    Model Cost:           24490.10   pe-hrs/simulated_year
    Model Throughput:         2.38   simulated_years/day


NTHRDS=1, 64 tasks per node

Code:
  Overall Metrics:
    Model Cost:           13443.84   pe-hrs/simulated_year
    Model Throughput:         2.17   simulated_years/day


NTHRDS=2, 64 tasks per node

Code:
  Overall Metrics:
    Model Cost:           10771.50   pe-hrs/simulated_year
    Model Throughput:         2.71   simulated_years/day

The differences are not big, but it is almost 20%, and I guess that while what you said is true, it is still good to try to use all available cores. If I knew more about CESM I could maybe find better running configurations.
At least I now know that in the F configurations the components run serially, so I can tweak the run a little. Thanks for taking the time to look into my question and clarify it.

Cristian
 

jedwards

CSEG and Liaisons
Staff member
Try a case with

./xmlchange MAX_MPITASKS_PER_NODE=64
./xmlchange NTHRDS=1

and the same task count as your NTHRDS=2 run above.
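The two runs can then be compared directly from the timing summaries CESM writes under the case directory, for example (a sketch, assuming the standard timing-file location):

Code:
# throughput of every completed run of this case
grep "Model Throughput" timing/cesm_timing.*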
 