
Strong Scaling Discrepancy between Individual Models and Total

Jbuzan

Jonathan R. Buzan
Member
Code:
$ ./describe_version
            ccs_config at tag ccs_config_cesm0.0.109
M                      HEAD detached at 797acd7
                      Changes not staged for commit:
                        (use "git add <file>..." to update what will be committed)
                        (use "git restore <file>..." to discard changes in working directory)
                          modified:   machines/config_batch.xml
                          modified:   machines/config_machines.xml

                      no changes added to commit (use "git add" and/or "git commit -a")

                 share at tag share1.0.19
                  cime at tag cime6.0.246
                   mct at tag MCT_2.11.0
            mpi-serial at tag MPIserial_2.5.0
                   cam at tag cam6_3_162
M                      HEAD detached at ab476f9b
                      Changes not staged for commit:
                        (use "git add/rm <file>..." to update what will be committed)
                        (use "git restore <file>..." to discard changes in working directory)
                          deleted:    cime_config/testdefs/testmods_dirs/cam/outfrq9s_waccm_ma_mam4/shell_commands
                          deleted:    cime_config/testdefs/testmods_dirs/cam/outfrq9s_waccm_ma_mam4/user_nl_cam
                          deleted:    cime_config/testdefs/testmods_dirs/cam/outfrq9s_waccm_ma_mam4/user_nl_clm

                      no changes added to commit (use "git add" and/or "git commit -a")

                   ww3 at tag ww3i_0.0.2
                   rtm at tag rtm1_0_79
                pysect at tag 3.2.2
                mosart at tag mosart1_0_49
             mizuroute at tag cesm-coupling.n02_v2.1.2
                   fms at tag fi_240516
            parallelio at tag pio2_6_2
                 cdeps at tag cdeps1.0.37
                 cmeps at tag cmeps0.14.63
                  cice at tag cesm_cice6_5_0_9
                  cism at tag cismwrap_2_2_001
                   clm at tag ctsm5.2.007
                   mom at tag mi_240522
    testfails = 0, local mods = 2, needs updates 0

    The submodules labeled with 'M' above are not in a clean state.
    The following are options for how to proceed:
    (1) Go into each submodule which is not in a clean state and issue a 'git status'
        Either revert or commit your changes so that the submodule is in a clean state.
    (2) use the --force option to git-fleximod
    (3) you can name the particular submodules to update using the git-fleximod command line
    (4) As a last resort you can remove the submodule (via 'rm -fr [directory]')
        then rerun git-fleximod update.





./create_newcase --case /glade/derecho/scratch/jbuzan/cases/Test_24_nodes --compiler intel --compset F2000climo --res ne30pg3_ne30pg3_mg17 --driver nuopc --mpilib mpich --run-unsupported --mach derecho

Changes to env_mach_pes (timing files below).

Describe your problem or question:
I am running scaling tests of the F2000 case on Eiger and Derecho by increasing the number of cores.
Eiger and Derecho are nearly identical machines, and for the same processor count and case setup their performance is almost the same.

However, when I go from 12 nodes to 24 nodes (with ATM = CPL), the ATM throughput increases substantially, but the TOT throughput barely changes. My understanding is that the model should scale strongly up to the number of elements in the grid.

Example:
24 nodes
TOT = 18.37 myears/wday ATM = 33.73 myears/wday
12 nodes
TOT = 14.79 myears/wday ATM = 19.18 myears/wday
6 nodes
TOT = 9.33 myears/wday ATM = 10.61 myears/wday
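
To put numbers on the scaling, here is a minimal sketch that turns the throughputs above into speedup and parallel efficiency relative to the 6-node run. The function name `strong_scaling` and the choice of 6 nodes as the baseline are illustrative assumptions, not part of any CESM tool.

```python
# Throughput in model years per wall-clock day, copied from the example above.
NODES = [6, 12, 24]
THROUGHPUT = {
    "TOT": [9.33, 14.79, 18.37],
    "ATM": [10.61, 19.18, 33.73],
}


def strong_scaling(nodes, throughput):
    """Return (speedup, efficiency) lists relative to the first entry.

    Hypothetical helper: speedup is the throughput ratio to the baseline,
    efficiency is that speedup divided by the increase in node count.
    """
    base_nodes, base_rate = nodes[0], throughput[0]
    speedup = [rate / base_rate for rate in throughput]
    efficiency = [s / (n / base_nodes) for s, n in zip(speedup, nodes)]
    return speedup, efficiency


for name, rates in THROUGHPUT.items():
    speedup, efficiency = strong_scaling(NODES, rates)
    for n, s, e in zip(NODES, speedup, efficiency):
        print(f"{name} {n:2d} nodes: speedup {s:.2f}, efficiency {e:.0%}")
```

With these inputs the ATM efficiency at 24 nodes comes out near 79%, while the TOT efficiency falls to roughly 49%.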

Is this the expected behavior for CESM? This is showing a strong disconnect between the ATM performance and the TOT model performance. Detailed timing files below.
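
One way to quantify the disconnect is to convert each throughput into wall-clock seconds per model year and subtract the ATM time from the total. This is only a rough sketch: it assumes the ATM time sits on the critical path and that everything else (other components, coupler, I/O, waiting) adds to it sequentially, which the attached timing files would need to confirm. The helper names are hypothetical.

```python
# Throughput in model years per wall-clock day, copied from the example above.
NODES = [6, 12, 24]
TOT = [9.33, 14.79, 18.37]
ATM = [10.61, 19.18, 33.73]
SECONDS_PER_DAY = 86400.0


def seconds_per_model_year(rate):
    """Wall-clock seconds needed for one model year at the given throughput."""
    return SECONDS_PER_DAY / rate


def non_atm_seconds(tot_rate, atm_rate):
    """Residual wall-clock time per model year not accounted for by ATM.

    Assumption: TOT time = ATM time + everything else, run sequentially.
    """
    return seconds_per_model_year(tot_rate) - seconds_per_model_year(atm_rate)


for n, tot, atm in zip(NODES, TOT, ATM):
    total = seconds_per_model_year(tot)
    rest = non_atm_seconds(tot, atm)
    print(f"{n:2d} nodes: total {total:.0f} s/yr, "
          f"non-ATM {rest:.0f} s/yr ({rest / total:.0%} of total)")
```

Under that assumption the non-ATM residual grows with node count instead of shrinking (from about 1100 s to about 2100 s per model year), which would point at the non-ATM entries in the timing files as the place to look.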

Thank you for your help!

Cheers,
-Jonathan

Examples:
Derecho with 24 nodes, to demonstrate the same performance as Eiger
Eiger with 24 nodes
Eiger with 12 nodes
Eiger with 6 nodes
 

Attachments

  • 6_nodes_eiger.txt (4.9 KB)
  • 12_nodes_eiger.txt (4.9 KB)
  • 24_nodes_derecho.txt (4.5 KB)
  • 24_nodes_eiger.txt (4.9 KB)

Jbuzan

Jonathan R. Buzan
Member
I have two scaling charts. The first is for the fully coupled case. Granted, the atmosphere in the fully coupled case is capped at 15 nodes (the last test I was able to run at the time), but both the atmosphere and the total simulation are still scaling strongly.

The second chart is for the fixed-SST case, which stops scaling strongly at ~12 nodes.

Compset: BLT1850_v0c Grid: ne30pg3_t232

Compset: F2000climo Grid: ne30pg3_ne30pg3_mg17
 

Attachments

  • BCASE.png (92.1 KB)
  • FCASE.png (87.6 KB)