
Optimizing runtime

Hi all,
I could use help figuring out how to set up env_mach_pes.xml to optimize runtime. Currently the file has been changed from the default by setting NTASKS to 256 for each model component, but I don't think this is ideal: it runs one model year in just under 4 hours at a charge of about 1100 core hours per model year.

My problem is that my computing allocation ends at the end of the month, but we just found a flaw in the setup of my current simulation and need to restart it from the beginning. I want to get it as far as possible before Feb 28. I have plenty of core-hour time, so I don't mind being charged more per model year. This simulation had been running for months and will become the final chapter of my PhD dissertation, which I am defending in a few months, so I am now quite behind since finding the flaw, though I'm glad I found it and can fix it. I don't have time to run a batch of timing tests, and I don't have a strong background in env_mach_pes.xml setup or runtime optimization, so any advice on changes that would make CESM1.2 run faster would be greatly appreciated. Thanks!
 
Hi and welcome.

You can find different env_mach_pes.xml setups and timings for CESM1.2 here:


Chris
Thanks! This table is for Yellowstone; has someone made one for Cheyenne? I saw this thread from a few years ago asking about that, but there doesn't seem to be one: CESM1.2.2 timing table for Cheyenne

I am running resolution f09_g16, so if the Yellowstone timing data is reasonably close to what would occur on Cheyenne, it looks like it would cost ~1500 core hours to run 10 years per day with this layout (total_pes, tasks x threads, root_pe):


total_pes  tasks x threads  root_pe
     1200      600 x 2           0
      420      210 x 2           0
      420      210 x 2           0
      780      390 x 2         210
     1200      600 x 2           0
     1200      600 x 2           0
       60       30 x 2         600
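(Side note for anyone reproducing this arithmetic: the per-component PE counts and node footprint of a layout like this can be checked with a small script. The sketch below is illustrative, not CESM code; it assumes the simple accounting total MPI tasks = max(rootpe + ntasks), with every task getting the maximum thread count, and 36 cores per node as on Cheyenne. The actual scripts may reserve slightly differently.)

```python
# Illustrative helper (not CESM code): summarize a PE layout.
# Each entry: component -> (ntasks, nthrds, rootpe), as in env_mach_pes.xml.
layout = {
    "cpl": (600, 2, 0),
    "lnd": (210, 2, 0),
    "rof": (210, 2, 0),
    "ice": (390, 2, 210),
    "atm": (600, 2, 0),
    "glc": (600, 2, 0),
    "ocn": (30, 2, 600),
}

def summarize(layout, cores_per_node=36):
    for comp, (ntasks, nthrds, rootpe) in layout.items():
        print(f"{comp}: {ntasks * nthrds:4d} PEs ({ntasks} x {nthrds}, root_pe {rootpe})")
    # Total MPI tasks: the highest task index any component reaches.
    total_tasks = max(nt + rp for nt, _, rp in layout.values())
    # Simple accounting: assume every task gets the max thread count.
    total_pes = total_tasks * max(nth for _, nth, _ in layout.values())
    nodes = -(-total_pes // cores_per_node)  # ceiling division
    return total_tasks, total_pes, nodes

print(summarize(layout))  # (630, 1260, 35)
```

For this layout that works out to 630 MPI tasks, or 1260 PEs at 2 threads each (35 nodes).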

I also found this timing table for CESM2 on Cheyenne: CESM2 Timing, Performance & Load Balancing Data. For the same resolution it seems to report 91.96 simulated_years/day at a cost of 573.15 pe-hrs/simulated_year, which is substantially different. Could it really run nearly 100 years per day at that resolution?
https://csegweb.cgd.ucar.edu/timing...iso_clm50d006_1deg_GSWP3V1_hist.180420-124850

Does this make sense? Or does an equivalent table exist for CESM1.2 on Cheyenne?
 

fischer

CSEG and Liaisons
Staff member
Hi,

Sorry, we don't have a CESM1.2.2 timing table for Cheyenne. We don't have the resources to create new timing tables for the older models. Your best approach would be to use the same layouts from Yellowstone on Cheyenne, then run for a month and look at the timing files. You could also try doubling the layout to get faster throughput. You shouldn't use the layouts from CESM2; the load balancing would be all wrong.

Chris
 
Thank you! Based on the Yellowstone timing data, is this the layout to try?

./xmlchange NTASKS_CPL=600
./xmlchange NTHRDS_CPL=2
./xmlchange ROOTPE_CPL=0
./xmlchange NTASKS_LND=210
./xmlchange NTHRDS_LND=2
./xmlchange ROOTPE_LND=0
./xmlchange NTASKS_ROF=210
./xmlchange NTHRDS_ROF=2
./xmlchange ROOTPE_ROF=0
./xmlchange NTASKS_ICE=390
./xmlchange NTHRDS_ICE=2
./xmlchange ROOTPE_ICE=210
./xmlchange NTASKS_ATM=600
./xmlchange NTHRDS_ATM=2
./xmlchange ROOTPE_ATM=0
./xmlchange NTASKS_GLC=600
./xmlchange NTHRDS_GLC=2
./xmlchange ROOTPE_GLC=0
./xmlchange NTASKS_OCN=30
./xmlchange NTHRDS_OCN=2
./xmlchange ROOTPE_OCN=600

Or if doubled:
./xmlchange NTASKS_CPL=1200
./xmlchange NTHRDS_CPL=4
./xmlchange ROOTPE_CPL=0
./xmlchange NTASKS_LND=420
./xmlchange NTHRDS_LND=4
./xmlchange ROOTPE_LND=0
./xmlchange NTASKS_ROF=420
./xmlchange NTHRDS_ROF=4
./xmlchange ROOTPE_ROF=0
./xmlchange NTASKS_ICE=780
./xmlchange NTHRDS_ICE=4
./xmlchange ROOTPE_ICE=420
./xmlchange NTASKS_ATM=1200
./xmlchange NTHRDS_ATM=4
./xmlchange ROOTPE_ATM=0
./xmlchange NTASKS_GLC=1200
./xmlchange NTHRDS_GLC=4
./xmlchange ROOTPE_GLC=0
./xmlchange NTASKS_OCN=60
./xmlchange NTHRDS_OCN=4
./xmlchange ROOTPE_OCN=1200
 
Update: I tried the first option, but the model did not complete and gave the following error message:

392: Warning: Departure points out of bounds in remap
392: my_task, i, j = 182 6 7
392: dpx, dpy = -1492310.60043372 -4.15831637132760
392: HTN(i,j), HTN(i+1,j) = 55886.2986191883 55886.2986191883
392: HTE(i,j), HTE(i,j+1) = 59395.4550164216 59395.4550164216
392: istep1, my_task, iblk = 3 182 1
392: Global block: 316
392: Global i and j: 300 30
392:(shr_sys_abort) ERROR: remap transport: bad departure points
392:(shr_sys_abort) WARNING: calling shr_mpi_abort() and stopping
454:(shr_sys_abort) ERROR: remap transport: bad departure points
454:(shr_sys_abort) WARNING: calling shr_mpi_abort() and stopping
392:MPT ERROR: Rank 392(g:392) is aborting with error code 1001.
392: Process ID: 61461, Host: r5i5n21, Program: /glade/scratch/srogstad/RCP85PertCoupledBiasExt/bld/cesm.exe
392: MPT Version: HPE MPT 2.19 02/23/19 05:30:09
392:
392:MPT: --------stack traceback-------
454:MPT ERROR: Rank 454(g:454) is aborting with error code 1001.
454: Process ID: 2662, Host: r5i5n25, Program: /glade/scratch/srogstad/RCP85PertCoupledBiasExt/bld/cesm.exe
454: MPT Version: HPE MPT 2.19 02/23/19 05:30:09
454:
454:MPT: --------stack traceback-------
392:MPT: Attaching to program: /proc/61461/exe, process 61461
392:MPT: [New LWP 61478]
392:MPT: Missing separate debuginfo for /glade/u/apps/ch/os/lib64/libm.so.6
392:MPT: Try: zypper install -C "debuginfo(build-id)=4e96cf37d52b9c2f3648e691878b682da5abfa42"
392:MPT: Missing separate debuginfo for /glade/u/apps/ch/os/lib64/libdl.so.2
392:MPT: Try: zypper install -C "debuginfo(build-id)=5eb2f40ad3b0125943aba8f08dd08609351a2967"
392:MPT: Missing separate debuginfo for /glade/u/apps/ch/os/lib64/libpthread.so.0
392:MPT: Try: zypper install -C "debuginfo(build-id)=4f3d05f200db29c6835a48e466e0378a8541fd36"
392:MPT: [Thread debugging using libthread_db enabled]
392:MPT: Using host libthread_db library "/glade/u/apps/ch/os/lib64/libthread_db.so.1".
392:MPT: Missing separate debuginfo for /glade/u/apps/ch/os/lib64/librt.so.1
392:MPT: Try: zypper install -C "debuginfo(build-id)=b115bb26e97505a5bd3b56d70d20459aa1206ac9"
392:MPT: Missing separate debuginfo for /glade/u/apps/ch/os/lib64/libc.so.6
392:MPT: Try: zypper install -C "debuginfo(build-id)=93c4deac1088eb84fbd01cf2a2c54399f516e9a7"
392:MPT: Missing separate debuginfo for /glade/u/apps/ch/os/lib64/libgcc_s.so.1
392:MPT: Try: zypper install -C "debuginfo(build-id)=5f9ec139af58fa59c33f72d1b3e56f083f1613ae"
392:MPT: Missing separate debuginfo for /glade/u/apps/ch/os/usr/lib64/libnuma.so.1
 
I'm going to keep updating my attempts here in case they help anyone else. The error above seemed to be related to the load-balancing attempt, so I went back to what I was doing before and simply doubled the number of tasks. That did indeed speed it up. Here is part of the timing table:

component    comp_pes  root_pe  tasks x threads  instances (stride)
---------    --------  -------  ---------------  ------------------
cpl = cpl       512       0       512 x 1          1 (1)
glc = sglc      512       0       512 x 1          1 (1)
wav = swav      256       0       256 x 1          1 (1)
lnd = clm       512       0       512 x 1          1 (1)
rof = rtm       512       0       512 x 1          1 (1)
ice = cice      512       0       512 x 1          1 (1)
atm = cam       512       0       512 x 1          1 (1)
ocn = pop2      512       0       512 x 1          1 (1)

total pes active : 512
pes per node : 36
pe count for cost estimate : 512

Overall Metrics:
Model Cost: 1220.07 pe-hrs/simulated_year
Model Throughput: 10.07 simulated_years/day
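(The two Overall Metrics numbers are consistent with each other: cost in pe-hrs per simulated year is just the PE count times 24 hours divided by throughput in years per day. A quick check in Python, using the figures from the timing file above:)

```python
# Cost and throughput from the timing file should satisfy
# cost [pe-hrs/simulated_year] = pe_count * 24 [hrs/day] / throughput [years/day].
pe_count = 512
throughput = 10.07                          # simulated_years/day
cost = pe_count * 24 / throughput
print(f"{cost:.2f} pe-hrs/simulated_year")  # ~1220, matching the reported 1220.07
```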
 

fischer

CSEG and Liaisons
Staff member
Try using a number of tasks that's divisible by 36 (the number of CPUs per node), something like 540 or 576.
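(A trivial helper for picking such task counts; the function name is made up for illustration:)

```python
import math

# Hypothetical helper: round a task count down/up to whole 36-core nodes.
def node_multiples(ntasks, cores_per_node=36):
    down = (ntasks // cores_per_node) * cores_per_node
    up = math.ceil(ntasks / cores_per_node) * cores_per_node
    return down, up

print(node_multiples(512))  # (504, 540); 576 is the next whole-node count above 540
```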
 
Before seeing your last message I had tried a few more things.

Here are the results of the first test:
./xmlchange NTASKS_CPL=1024
./xmlchange NTHRDS_CPL=1
./xmlchange NTASKS_LND=1024
./xmlchange NTHRDS_LND=1
./xmlchange NTASKS_ROF=1024
./xmlchange NTHRDS_ROF=1
./xmlchange NTASKS_ICE=1024
./xmlchange NTHRDS_ICE=1
./xmlchange NTASKS_ATM=1024
./xmlchange NTHRDS_ATM=1
./xmlchange NTASKS_GLC=1024
./xmlchange NTHRDS_GLC=1
./xmlchange NTASKS_OCN=1024
./xmlchange NTHRDS_OCN=1

component    comp_pes  root_pe  tasks x threads  instances (stride)
---------    --------  -------  ---------------  ------------------
cpl = cpl      1024       0      1024 x 1          1 (1)
glc = sglc     1024       0      1024 x 1          1 (1)
wav = swav      256       0       256 x 1          1 (1)
lnd = clm      1024       0      1024 x 1          1 (1)
rof = rtm      1024       0      1024 x 1          1 (1)
ice = cice     1024       0      1024 x 1          1 (1)
atm = cam      1024       0      1024 x 1          1 (1)
ocn = pop2     1024       0      1024 x 1          1 (1)

total pes active : 1024
pes per node : 36
pe count for cost estimate : 1024

Overall Metrics:
Model Cost: 1802.75 pe-hrs/simulated_year
Model Throughput: 13.63 simulated_years/day


Also this was the timing info from my original layout:
component    comp_pes  root_pe  tasks x threads  instances (stride)
---------    --------  -------  ---------------  ------------------
cpl = cpl       256       0       256 x 1          1 (1)
glc = sglc      256       0       256 x 1          1 (1)
wav = swav      256       0       256 x 1          1 (1)
lnd = clm       256       0       256 x 1          1 (1)
rof = rtm       256       0       256 x 1          1 (1)
ice = cice      256       0       256 x 1          1 (1)
atm = cam       256       0       256 x 1          1 (1)
ocn = pop2      256       0       256 x 1          1 (1)

total pes active : 256
pes per node : 36
pe count for cost estimate : 256

Overall Metrics:
Model Cost: 980.13 pe-hrs/simulated_year
Model Throughput: 6.27 simulated_years/day
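(Putting the three timing files reported in this thread side by side shows the diminishing returns of simply adding tasks. The dict and loop below are just an illustration using the numbers quoted above:)

```python
# Numbers quoted from the timing files earlier in this thread.
runs = {
    256:  {"cost": 980.13,  "throughput": 6.27},
    512:  {"cost": 1220.07, "throughput": 10.07},
    1024: {"cost": 1802.75, "throughput": 13.63},
}

base = runs[256]["throughput"]
for pes, r in runs.items():
    speedup = r["throughput"] / base
    ideal = pes / 256
    print(f"{pes:5d} PEs: {speedup:.2f}x speedup, "
          f"{speedup / ideal:.0%} parallel efficiency vs the 256-task run")
```

Quadrupling the tasks from 256 to 1024 only buys about a 2.2x speedup, i.e. roughly half the ideal scaling, which is why the cost per simulated year nearly doubles.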

Three other tests failed, the first with this error:
./xmlchange NTASKS_CPL=1024
./xmlchange NTHRDS_CPL=2
./xmlchange NTASKS_LND=1024
./xmlchange NTHRDS_LND=2
./xmlchange NTASKS_ROF=1024
./xmlchange NTHRDS_ROF=2
./xmlchange NTASKS_ICE=1024
./xmlchange NTHRDS_ICE=2
./xmlchange NTASKS_ATM=1024
./xmlchange NTHRDS_ATM=2
./xmlchange NTASKS_GLC=1024
./xmlchange NTHRDS_GLC=2
./xmlchange NTASKS_OCN=1024
./xmlchange NTHRDS_OCN=2

557:MPT: from /glade/u/apps/opt/intel/2017u1/compilers_and_libraries/linux/lib/intel64/libiomp5.so
557:MPT: #9 0x00002ae5eb1f9b17 in __kmp_invoke_task_func (gtid=-1222237176)
557:MPT: at ../../src/kmp_runtime.c:7084
557:MPT: #10 0x00002ae5eb1fabd3 in __kmp_fork_call (loc=0xffffffffb7262408, gtid=0,
557:MPT: call_context=(unknown: 4132157544), argc=-1551200472,
557:MPT: microtask=0xffffffffeb4f90a0, invoker=0x7ffcf64bb880, ap=0x7ffcf64c4ec0)
557:MPT: at ../../src/kmp_runtime.c:2357
557:MPT: #11 0x00002ae5eb1d26f8 in __kmpc_fork_call (loc=0xffffffffb7262408, argc=0,
557:MPT: microtask=0x7ffcf64bb868) at ../../src/kmp_csupport.c:339
557:MPT: #12 0x000000000132f324 in ice_dyn_evp_mp_evp_ ()
692:MPT: #13 0x00000000014571bd in ice_step_mod_mp_step_dynamics_ ()
692:MPT: #14 0x0000000001310b84 in ice_comp_mct_mp_ice_run_mct_ ()
692:MPT: #15 0x000000000040c08c in ccsm_comp_mod_mp_ccsm_run_ ()
692:MPT: #16 0x000000000042aaa6 in MAIN__ ()
692:MPT: #17 0x0000000000408bde in main ()
692:MPT: (gdb) A debugging session is active.

and these two, after seeing the suggestion of multiples of 36. Both just ran out the wall clock and didn't seem to do anything:
./xmlchange NTASKS_CPL=1224
./xmlchange NTHRDS_CPL=1
./xmlchange NTASKS_LND=1224
./xmlchange NTHRDS_LND=1
./xmlchange NTASKS_ROF=1224
./xmlchange NTHRDS_ROF=1
./xmlchange NTASKS_ICE=1224
./xmlchange NTHRDS_ICE=1
./xmlchange NTASKS_ATM=1224
./xmlchange NTHRDS_ATM=1
./xmlchange NTASKS_GLC=1224
./xmlchange NTHRDS_GLC=1
./xmlchange NTASKS_OCN=1224
./xmlchange NTHRDS_OCN=1

./xmlchange NTASKS_CPL=1152
./xmlchange NTHRDS_CPL=1
./xmlchange NTASKS_LND=1152
./xmlchange NTHRDS_LND=1
./xmlchange NTASKS_ROF=1152
./xmlchange NTHRDS_ROF=1
./xmlchange NTASKS_ICE=1152
./xmlchange NTHRDS_ICE=1
./xmlchange NTASKS_ATM=1152
./xmlchange NTHRDS_ATM=1
./xmlchange NTASKS_GLC=1152
./xmlchange NTHRDS_GLC=1
./xmlchange NTASKS_OCN=1152
./xmlchange NTHRDS_OCN=1

1056:2b5d720e1000-2b5d720e2000 -w-s 00202000 00:06 38081 /dev/infiniband/uverbs0
1056:2b5d720e2000-2b5d720e3000 -w-s 00203000 00:06 38081 /dev/infiniband/uverbs0
1056:2b5d720e3000-2b5d720e4000 -w-s 00204000 00:06 38081 /dev/infiniband/uverbs0
1056:2b5d720e4000-2b5d720e5000 -w-s 00205000 00:06 38081 /dev/infiniband/uverbs0
1056:2b5d720e5000-2b5d720e6000 -w-s 00206000 00:06 38081 /dev/infiniband/uverbs0
1056:2b5d720e6000-2b5d720e7000 -w-s 00207000 00:06 38081 /dev/infiniband/uverbs0
1056:2b5d720e7000-2b5d720f7000 rw-s 0fc10000 00:06 38081 /dev/infiniband/uverbs0

After the failures I went back to ntasks=1024, the last configuration that I knew worked. It ran, but produced a different timing file:
component    comp_pes  root_pe  tasks x threads  instances (stride)
---------    --------  -------  ---------------  ------------------
cpl = cpl      1024       0      1024 x 1          1 (1)
glc = sglc     1024       0      1024 x 1          1 (1)
wav = swav      256       0       256 x 1          1 (1)
lnd = clm      1024       0      1024 x 1          1 (1)
rof = rtm      1024       0      1024 x 1          1 (1)
ice = cice     1024       0      1024 x 1          1 (1)
atm = cam      1024       0      1024 x 1          1 (1)
ocn = pop2     1024       0      1024 x 1          1 (1)

total pes active : 1024
pes per node : 36
pe count for cost estimate : 1024

Overall Metrics:
Model Cost: 1734.61 pe-hrs/simulated_year
Model Throughput: 14.17 simulated_years/day


Any other suggestions? If not, I'll probably just go with the ntasks=1024 option so I can get this simulation restarted sooner.
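(For planning purposes, throughput translates directly into model years achievable before an allocation deadline. The sketch below is hypothetical: days_left = 28 is an example value, not a number from this thread:)

```python
# Hypothetical planning sketch. days_left is an assumed example, not from the thread.
throughput = 14.17        # simulated_years/day (1024-task timing file above)
cost = 1734.61            # pe-hrs/simulated_year (same file)
days_left = 28            # assumption for illustration

years_possible = throughput * days_left
pe_hours_needed = cost * years_possible
print(f"~{years_possible:.0f} model years for ~{pe_hours_needed:,.0f} pe-hrs")
```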
 

fischer

CSEG and Liaisons
Staff member
I took the layout for a B compset on Yellowstone for CESM2.1 and adjusted it for Cheyenne. This is just an educated guess at what might work better for you.

<NTASKS_ATM>900</NTASKS_ATM> <ROOTPE_ATM> 0</ROOTPE_ATM>
<NTASKS_LND>324</NTASKS_LND> <ROOTPE_LND> 0</ROOTPE_LND>
<NTASKS_ICE>576</NTASKS_ICE> <ROOTPE_ICE>324</ROOTPE_ICE>
<NTASKS_OCN>108</NTASKS_OCN> <ROOTPE_OCN>900</ROOTPE_OCN>
<NTASKS_CPL>900</NTASKS_CPL> <ROOTPE_CPL> 0</ROOTPE_CPL>
<NTASKS_GLC>900</NTASKS_GLC> <ROOTPE_GLC> 0</ROOTPE_GLC>
 
Thanks for the suggestion. I tried this configuration, but the model just ran out the wall clock and never executed.

I think for now I will stick with having ntasks=1024 for all components. I restarted my simulation and it is moving along at a pretty steady clip. I'm still getting about 15 model years/day at a charge of 1600 core hours/year.
 

dbailey

CSEG and Liaisons
Staff member
Sorry, I just noticed this one. Did the departure points error go away? It is indicative of a CFL violation in the CICE model. I have created a new FAQ on this here:

 

Hello, thanks for sharing this link. The errors did go away when I ran it as ntasks=1024 and nthreads=1 for all components.
 