
throughput optimization

pradalma

Marie-Aude Pradal
New Member
Hi

I have verified my port to the machine Rockfish (a machine at JHU), using the instructions for the 3-member, super-fast ensemble verification (resulting in the 3 output files case.cesm_tag.uf.000, case.cesm_tag.uf.001 and case.cesm_tag.uf.002).
After verifying the port, I created a new case with:
create_newcase --case testb1850_01 --compset B1850 --res f09_g17_gl4 --machine rockfish --pesfile ~/.cime/config_pes.xml
then proceeded to run:
case.setup
case.build
check_input_data
case.submit

As of right now the machine is set up as follows:

config_machines.xml
<init_path lang="perl">/data/apps/linux-centos8-cascadelake/gcc-9.3.0/lmod-8.3-tbez7qmxvu3dwikqbs3hdafke5vcsbxv/lmod/lmod/init/perl</init_path>
<init_path lang="python">/data/apps/linux-centos8-cascadelake/gcc-9.3.0/lmod-8.3-tbez7qmxvu3dwikqbs3hdafke5vcsbxv/lmod/lmod/init/env_modules_python.py</init_path>
<cmd_path lang="sh">module</cmd_path>
<cmd_path lang="csh">module</cmd_path>
<cmd_path lang="perl">/data/apps/linux-centos8-cascadelake/gcc-9.3.0/lmod-8.3-tbez7qmxvu3dwikqbs3hdafke5vcsbxv/lmod/lmod/libexec/lmod perl</cmd_path>
<cmd_path lang="python">module</cmd_path>
<modules>
<command name="purge"/>
<command name="load">standard/2020.10</command>
<command name="unload">openmpi/3.1.6</command>
</modules>
<modules compiler="gnu">
<command name="load">cesm/2.x</command>
</modules>
<modules compiler="intel">
<command name="load">intel/2022.0</command>
<command name="load">intel-mkl/2022.0</command>
<command name="load">cesm/2.x</command>
</modules>
</module_system>
<environment_variables>
<env name="OMP_STACKSIZE">256M</env>
<env name="NETCDF_PATH">$ENV{NETCDF}</env>
<env name="OMP_NUM_THREADS">1</env>
</environment_variables>
</machine>
</config_machines>



config_pes.xml:

<ntasks_lnd>-1</ntasks_lnd>
<ntasks_rof>-1</ntasks_rof>
<ntasks_ice>-2</ntasks_ice>
<ntasks_ocn>-1</ntasks_ocn>
<ntasks_glc>-1</ntasks_glc>
<ntasks_wav>-1</ntasks_wav>
<ntasks_cpl>-3</ntasks_cpl>
</ntasks>
<nthrds>
<nthrds_atm>1</nthrds_atm>
<nthrds_lnd>1</nthrds_lnd>
<nthrds_rof>1</nthrds_rof>
<nthrds_ice>1</nthrds_ice>
<nthrds_ocn>1</nthrds_ocn>
<nthrds_glc>1</nthrds_glc>
<nthrds_wav>1</nthrds_wav>
<nthrds_cpl>1</nthrds_cpl>
</nthrds>
<rootpe>
<rootpe_atm>0</rootpe_atm>
<rootpe_lnd>0</rootpe_lnd>
<rootpe_rof>0</rootpe_rof>
<rootpe_ice>-1</rootpe_ice>
<rootpe_ocn>-3</rootpe_ocn>
<rootpe_glc>0</rootpe_glc>
<rootpe_wav>0</rootpe_wav>
<rootpe_cpl>0</rootpe_cpl>
</rootpe>
</pes>
</mach>
</grid>
</config_pes>



I have no experience with optimizing a run for throughput; I could use any help!
On our current run (testb1850_01),
the timing file shows this:
total pes active : 192
mpi tasks per node : 48
pe count for cost estimate : 192

Overall Metrics:
Model Cost: 15651.97 pe-hrs/simulated_year
Model Throughput: 0.29 simulated_years/day

Init Time : 210.770 seconds
Run Time : 4020.197 seconds 804.039 seconds/day
Final Time : 0.191 seconds

Actual Ocn Init Wait Time : 3334.348 seconds
Estimated Ocn Init Run Time : 4.644 seconds
Estimated Run Time Correction : 0.000 seconds
(This correction has been applied to the ocean and total run times)

Runs Time in total seconds, seconds/model-day, and model-years/wall-day
CPL Run Time represents time in CPL pes alone, not including time associated with data exchange with other components

TOT Run Time: 4020.197 seconds 804.039 seconds/mday 0.29 myears/wday
CPL Run Time: 208.698 seconds 41.740 seconds/mday 5.67 myears/wday
ATM Run Time: 3365.850 seconds 673.170 seconds/mday 0.35 myears/wday
LND Run Time: 186.269 seconds 37.254 seconds/mday 6.35 myears/wday
ICE Run Time: 37.430 seconds 7.486 seconds/mday 31.62 myears/wday
OCN Run Time: 557.270 seconds 111.454 seconds/mday 2.12 myears/wday
ROF Run Time: 18.114 seconds 3.623 seconds/mday 65.34 myears/wday

We cannot afford to run this slowly; we would need a throughput of at least 5 simulated years per day.
I am not sure whether this can be accomplished by specifying a different PE layout, by reducing the number of variables we output, or both (we are currently using the default values from create_newcase with this command: create_newcase --case testb1850_01 --compset B1850 --res f09_g17_gl4 --machine rockfish --pesfile ~/.cime/config_pes.xml).


The stats file in the timing directory shows this:
***** GLOBAL STATISTICS ( 192 MPI TASKS) *****

$Id: gptl.c,v 1.157 2011-03-28 20:55:18 rosinski Exp $
'count' is cumulative. All other stats are max/min
'on' indicates whether the timer was active during output, and so stats are lower or upper bounds.

name on processes threads count walltotal wallmax (proc thrd ) wallmin (proc thrd )
"CPL:INIT" - 192 192 1.920000e+02 3.925829e+04 210.770 ( 72 0) 185.592 ( 190 0)
"CPL:cime_pre_init1" - 192 192 1.920000e+02 3.880440e+02 2.035 ( 55 0) 2.008 ( 160 0)
"CPL:ESMF_Initialize" - 192 192 1.920000e+02 4.800000e-02 0.001 ( 96 0) 0.000 ( 0 0)
"CPL:cime_pre_init2" - 192 192 1.920000e+02 6.767000e+00 0.049 ( 144 0) 0.028 ( 48 0)
"CPL:cime_init" - 192 192 1.920000e+02 3.886343e+04 208.707 ( 134 0) 183.534 ( 180 0)
"CPL:init_comps" - 192 192 1.920000e+02 3.237899e+04 168.995 ( 72 0) 167.567 ( 144 0)
"CPL:comp_init_pre_all" - 192 192 1.920000e+02 1.091075e-02 0.000 ( 169 0) 0.000 ( 22 0)
"CPL:comp_init_cc_atm" - 192 192 1.920000e+02 1.584994e+04 110.069 ( 5 0) 0.000 ( 170 0)
"CPL:comp_init_cc_lnd" - 192 192 1.920000e+02 2.919690e+03 20.298 ( 1 0) 0.000 ( 144 0)
"CPL:comp_init_cc_rof" - 192 192 1.920000e+02 1.637530e+02 1.142 ( 40 0) 0.000 ( 146 0)
"CPL:comp_init_cc_ocn" - 192 192 1.920000e+02 1.120859e+04 156.997 ( 158 0) 25.507 ( 121 0)
CPL:comp_init_cc_ice" - 192 192 1.920000e+02 3.657838e+02 2.541 ( 29 0) 0.000 ( 153 0)
"CPL:comp_init_cc_glc" - 192 192 1.920000e+02 7.810636e+02 5.426 ( 33 0) 0.000 ( 144 0)
"CPL:comp_init_cc_wav" - 192 192 1.920000e+02 3.913213e+01 0.273 ( 2 0) 0.000 ( 152 0)
"CPL:comp_init_cc_esp" - 192 192 1.920000e+02 3.140450e-02 0.000 ( 51 0) 0.000 ( 154 0)
"comp_init_cc_iac" - 192 192 1.920000e+02 2.992821e-02 0.000 ( 111 0) 0.000 ( 146 0)
"CPL:comp_init_cx_all" - 192 192 1.920000e+02 1.050946e+03 10.647 ( 184 0) 3.762 ( 22 0)
"CPL:comp_list_all" - 192 192 1.920000e+02 3.396273e-03 0.001 ( 0 0) 0.000 ( 151 0)
"CPL:init_maps" - 144 144 1.440000e+02 1.711666e+03 11.890 ( 18 0) 11.884 ( 126 0)
"CPL:init_aream" - 144 144 1.440000e+02 3.489308e+01 0.242 ( 138 0) 0.242 ( 61 0)
"CPL:init_domain_check" - 144 144 1.440000e+02 2.021436e+00 0.014 ( 0 0) 0.014 ( 97 0)
"CPL:init_areacor" - 192 192 1.920000e+02 1.042950e+03 14.538 ( 177 0) 2.397 ( 28 0)
"CPL:init_fracs" - 144 144 1.440000e+02 2.069594e+00 0.031 ( 69 0) 0.002 ( 3 0)
"CPL:init_aoflux" - 144 144 1.440000e+02 4.045177e-02 0.001 ( 132 0) 0.000 ( 106 0)
 

jedwards

CSEG and Liaisons
Staff member
The default out-of-the-box PE layout is not going to be optimized for your system; you will need to do this yourself.
A good place to start is here:
TOT Run Time: 4020.197 seconds 804.039 seconds/mday 0.29 myears/wday
CPL Run Time: 208.698 seconds 41.740 seconds/mday 5.67 myears/wday
ATM Run Time: 3365.850 seconds 673.170 seconds/mday 0.35 myears/wday
LND Run Time: 186.269 seconds 37.254 seconds/mday 6.35 myears/wday
ICE Run Time: 37.430 seconds 7.486 seconds/mday 31.62 myears/wday
OCN Run Time: 557.270 seconds 111.454 seconds/mday 2.12 myears/wday
ROF Run Time: 18.114 seconds 3.623 seconds/mday 65.34 myears/wday

Let's shoot for a target of 5 ypd. To get that we need to reduce the ATM run time by a factor of about 13.
It's currently using 144 tasks; let's increase it to 1824. I usually keep the CPL ntasks the same as ATM:
./xmlchange NTASKS_ATM=1824,NTASKS_CPL=1824
Now we need to double the OCN tasks so that it will keep up:
./xmlchange NTASKS_OCN=-2 (negative values reflect the number of nodes so this is the same as NTASKS_OCN=96)
Then reset the rootpe for the OCN so that it follows the ATM tasks:
./xmlchange ROOTPE_OCN=1824
Finally, we change the ICE, LND and ROF tasks to use all that is available:
./xmlchange NTASKS_LND=-19,NTASKS_ROF=-19
./xmlchange ROOTPE_ICE=-19,NTASKS_ICE=-19
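
After these xmlchange calls the new layout still has to be propagated into the run scripts and the executable; a minimal sketch of the usual follow-up (the same sequence that comes up later in this thread):

./pelayout            # sanity-check the resulting layout
./case.setup --reset  # regenerate the run scripts for the new PE layout
./case.build --clean  # an existing build may not match the new layout
./case.build
./case.submit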

There is a test PFS that I use for tuning that you may want to try.
./create_test PFS.f09_g17_gl4.B1850.rockfish_gnu

Also, the latest version of the Intel compiler is available for download and will give much better performance than gnu.

Once you have done this initial run with the new tuning, you can fine-tune by balancing the ICE and LND+ROF tasks, and then
balance ATM+ICE with the OCN. Once you are completely happy with the performance, you can save the layout to config_pes.xml and make it the default for your system.
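
For reference, a sketch of what such a config_pes.xml entry could look like, following the same tag nesting as the fragment you quoted above (the attribute values and task counts here are placeholders to adapt, not recommendations):

<config_pes>
  <grid name="any">
    <mach name="rockfish">
      <pes compset="any" pesize="any">
        <comment>tuned B1850 f09_g17_gl4 layout for rockfish</comment>
        <ntasks>
          <ntasks_atm>1824</ntasks_atm>
          <ntasks_cpl>1824</ntasks_cpl>
          <ntasks_ocn>-2</ntasks_ocn>
          <!-- remaining ntasks_* entries as in the fragment above -->
        </ntasks>
        <nthrds>
          <nthrds_atm>1</nthrds_atm>
          <!-- remaining nthrds_* entries, all 1 -->
        </nthrds>
        <rootpe>
          <rootpe_atm>0</rootpe_atm>
          <rootpe_ocn>1824</rootpe_ocn>
          <!-- remaining rootpe_* entries as in the fragment above -->
        </rootpe>
      </pes>
    </mach>
  </grid>
</config_pes>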
 

pradalma

Marie-Aude Pradal
New Member
Hi Jim

Thank you so much for your reply.
First, I ran the xmlchange commands, followed by case.setup, case.build, and case.submit.

Here is what the model timing file shows:

************ PROCESS 0 ( 0) ************

$Id: gptl.c,v 1.157 2011-03-28 20:55:18 rosinski Exp $
GPTL was built without threading
HAVE_MPI was true
HAVE_COMM_F2C was true
ENABLE_PMPI was false
HAVE_PAPI was false
Underlying timing routine was MPI_Wtime.
Per-call utr overhead est: 3.09944e-08 sec.
If overhead stats are printed, roughly half the estimated number is
embedded in the wallclock stats for each timer.
Print method was most_frequent.
If a '%_of' field is present, it is w.r.t. the first timer for thread 0.
If a 'e6_per_sec' field is present, it is in millions of PAPI counts per sec.

A '*' in column 1 below means the timer had multiple parents, though the
values printed are for all calls.
Further down the listing may be more detailed information about multiple
parents. Look for 'Multiple parent info'

Stats for thread 0:
On Called Recurse Wallclock max min UTR Overhead
"CPL:INIT" - 1 - 629.414429 629.414429 629.414429 0.000000
"CPL:cime_pre_init1" - 1 - 2.465000 2.465000 2.465000 0.000000
"CPL:ESMF_Initialize" - 1 - 0.000000 0.000000 0.000000 0.000000
"CPL:cime_pre_init2" - 1 - 0.138000 0.138000 0.138000 0.000000
"CPL:cime_init" - 1 - 626.811401 626.811401 626.811401 0.000000
"CPL:init_comps" - 1 - 257.582794 257.582794 257.582794 0.000000
"CPL:comp_init_pre_all" - 1 - 0.000025 0.000025 0.000025 0.000000
"CPL:comp_init_cc_atm" - 1 - 97.020538 97.020538 97.020538 0.000000

[...]
Overhead sum = 0.00278 wallclock seconds
Total calls = 44901

Multiple parent info for thread 0:
Columns are count and name for the listed child
Rows are each parent, with their common child being the last entry, which is indented.
Count next to each parent is the number of times it called the child.
Count next to child is total number of times it was called by the listed parents.

18 CPL:RESTART
18 CPL:RUN_LOOP
36 PIO:PIO_initdecomp_dof

1 CPL:RESTART
23 CPL:RUN_LOOP
24 PIO:PIO_closefile


thread 0 had some hash collisions:
hashtable[0][475] had 2 entries: CPL:BUDGET1 CPL:O2CT
hashtable[0][486] had 2 entries: l:begcnbal_col CPL:BUDGET2
hashtable[0][517] had 2 entries: l:shr_orb_decl CPL:a2c_atma2atmx
hashtable[0][590] had 2 entries: l:check_fields CPL:atmprep_ocn2atm
hashtable[0][883] had 2 entries: CPL:c2r_rofx2rofr CPL:r2c_rofr2rofx
hashtable[0][1165] had 2 entries: CPL:ATM_RUN CPL:w2c_wavw2wavx
hashtable[0][1403] had 2 entries: CPL:atmocnp_accum l:dyn_subgrid
hashtable[0][1777] had 3 entries: l:lc_clm2_adv_timestep CPL:C2A a:PIO:pio_write_darray
hashtable[0][1784] had 2 entries: l:bgp1 r:mosartr_tot
hashtable[0][1790] had 2 entries: l:bgp2 PIO:pio_read_nf
hashtable[0][1891] had 2 entries: CPL:W2C PIO:pio_get_var_1d_double
hashtable[0][1984] had 2 entries: r:mosartr_subcycling CPL:BUDGET
Total collisions thread 0 = 13
Entry information:
num_zero = 1807 num_one = 229 num_two = 11 num_more = 1
Most = 3

Thread 0 total memory usage = 109.912 KB
Hashmem = 32.768 KB
Regionmem = 73.152 KB (papimem portion = 0 KB)
Parent/child arrays = 3.992 KB

Total memory usage all threads = 109.912 KB

threadid[0] = 0


I am not sure what to make of this...
Overall the run took about half the time that it did previously, not the 1/13 we were aiming for. I think the hash collisions can't be good. But I also don't understand why the model timing file does not have a line about the throughput.


Any help greatly appreciated.

Marie
 

jedwards

CSEG and Liaisons
Staff member
What file are you looking at? It should be the one named cesm_timing.* in the timing subdirectory of the case directory.
It would also help if you used the PFS test as I suggested. Please show me the result of the ./pelayout command in your case directory.
 

pradalma

Marie-Aude Pradal
New Member
The PFS test is still running.

For testB1850_01, I was looking in the OUTPUT subdirectory. My bad.

Here is what the cesm_timing file shows:

total pes active : 1920
mpi tasks per node : 48
pe count for cost estimate : 1920

Overall Metrics:
Model Cost: 255709.15 pe-hrs/simulated_year
Model Throughput: 0.18 simulated_years/day

Init Time : 629.453 seconds
Run Time : 6567.872 seconds 1313.574 seconds/day
Final Time : 0.129 seconds

Actual Ocn Init Wait Time : 5533.253 seconds
Estimated Ocn Init Run Time : 3.553 seconds
Estimated Run Time Correction : 0.000 seconds
(This correction has been applied to the ocean and total run times)

Runs Time in total seconds, seconds/model-day, and model-years/wall-day
CPL Run Time represents time in CPL pes alone, not including time associated with data exchange with other components

TOT Run Time: 6567.872 seconds 1313.574 seconds/mday 0.18 myears/wday
CPL Run Time: 327.368 seconds 65.474 seconds/mday 3.62 myears/wday
ATM Run Time: 5854.369 seconds 1170.874 seconds/mday 0.20 myears/wday
LND Run Time: 241.935 seconds 48.387 seconds/mday 4.89 myears/wday
ICE Run Time: 39.802 seconds 7.960 seconds/mday 29.74 myears/wday
OCN Run Time: 426.404 seconds 85.281 seconds/mday 2.78 myears/wday
 

jedwards

CSEG and Liaisons
Staff member
Your total run time went up, not down. Please send the output of ./pelayout.
Also, the test I recommended was with the gnu compiler, but if you have intel that would be better:
PFS.f09_g17_gl4.B1850.rockfish_intel

You may also want to install and use PnetCDF for I/O. Finally, you may want to discuss this result with your system administration staff; there may be some subtlety with respect to your system and using the proper communication network that I am missing.
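
If you do install PnetCDF, one way it is usually enabled is through the case's PIO settings; a sketch, assuming the standard CIME PIO_TYPENAME variable (use ./xmlquery -p PIO in the case directory to see which PIO variables your version exposes):

./xmlquery -p PIO                  # list the PIO-related settings
./xmlchange PIO_TYPENAME=pnetcdf   # switch parallel I/O from netcdf to pnetcdf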
 

pradalma

Marie-Aude Pradal
New Member
Yes, I noticed that the throughput went from 0.29 to 0.18 yr/day. Yikes.
Here is the output of ./pelayout:

Comp NTASKS NTHRDS ROOTPE
CPL : 1824/ 1; 0
ATM : 1824/ 1; 0
LND : 912/ 1; 0
ICE : 912/ 1; 912
OCN : 96/ 1; 1824
ROF : 912/ 1; 0
GLC : 48/ 1; 0
WAV : 48/ 1; 0
IAC : 1/ 1; 0
ESP : 1/ 1; 0
 

jedwards

CSEG and Liaisons
Staff member
I don't see any errors in your pelayout. I think as a next step you should use a less complicated F compset
to look at scaling on your system - I would recommend
PFS.f19_mg17.QPC6.rockfish_intel

run that test with different values of NTASKS: 48, 96, 144, 192, 384, 768
and look at the timing results, is it scaling as expected?
 

pradalma

Marie-Aude Pradal
New Member
I don't see any errors in your pelayout. I think as a next step you should use a less complicated F compset
to look at scaling on your system - I would recommend
PFS.f19_mg17.QPC6.rockfish_intel

run that test with different values of NTASKS: 48, 96, 144, 192, 384, 768
and look at the timing results, is it scaling as expected?

I get an error message when trying to run the PFS.f19_mg17.QPC6 test:

./create_test PFS.f19_mg17.QPC6.rockfish_intel
Testnames: ['PFS.f19_mg17.QPC6.rockfish_intel']
Using project from .cime/config: agnanad1
create_test will do up to 1 tasks simultaneously
create_test will use up to 60 cores simultaneously
Creating test directory /home/mpradal1/scr16_agnanad1/cesm/OUTPUT/PFS.f19_mg17.QPC6.rockfish_intel.20230119_125517_ylq87s
RUNNING TESTS:
PFS.f19_mg17.QPC6.rockfish_intel
Starting CREATE_NEWCASE for test PFS.f19_mg17.QPC6.rockfish_intel with 1 procs
Finished CREATE_NEWCASE for test PFS.f19_mg17.QPC6.rockfish_intel in 0.594539 seconds (FAIL). [COMPLETED 1 of 1]
Case dir: /home/mpradal1/scr16_agnanad1/cesm/OUTPUT/PFS.f19_mg17.QPC6.rockfish_intel.20230119_125517_ylq87s
Errors were:
ERROR: no alias f19_mg17 defined

Due to presence of batch system, create_test will exit before tests are complete.
To force create_test to wait for full completion, use --wait
At test-scheduler close, state is:
FAIL PFS.f19_mg17.QPC6.rockfish_intel (phase CREATE_NEWCASE)
Case dir: /home/mpradal1/scr16_agnanad1/cesm/OUTPUT/PFS.f19_mg17.QPC6.rockfish_intel.20230119_125517_ylq87s
test-scheduler took 0.6242105960845947 seconds
 

pradalma

Marie-Aude Pradal
New Member
Hi Jim,

We are not yet set up to use the Intel compiler, so I ran the test with gnu. Here is what is returned in the timing file:
vi cesm_timing.PFS.f19_f19_mg17.QPC6.rockfish_gnu.20230119_131318_uert7u.11541762.230119-132859
stop option : ndays, stop_n = 20
run length : 20 days (19.979166666666668 for ocean)

component comp_pes root_pe tasks x threads instances (stride)
--------- ------ ------- ------ ------ --------- ------
cpl = cpl 96 0 96 x 1 1 (1 )
atm = cam 96 0 96 x 1 1 (1 )
lnd = slnd 96 0 96 x 1 1 (1 )
ice = sice 96 0 96 x 1 1 (1 )
ocn = docn 96 0 96 x 1 1 (1 )
rof = srof 96 0 96 x 1 1 (1 )
glc = sglc 96 0 96 x 1 1 (1 )
wav = swav 96 0 96 x 1 1 (1 )
iac = siac 1 0 1 x 1 1 (1 )
esp = sesp 1 0 1 x 1 1 (1 )

total pes active : 96
mpi tasks per node : 48
pe count for cost estimate : 96

Overall Metrics:
Model Cost: 982.03 pe-hrs/simulated_year
Model Throughput: 2.35 simulated_years/day

Init Time : 10.678 seconds
Run Time : 2017.872 seconds 100.894 seconds/day
Final Time : 0.002 seconds
 

jedwards

CSEG and Liaisons
Staff member
run that test with different values of NTASKS: 48, 96, 144, 192, 384, 768
and look at the timing results, is it scaling as expected?

Use ./xmlchange NTASKS=xx in the case directory to change the task count.
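
A minimal sketch of the sweep, run from the PFS test's case directory (each run leaves its own cesm_timing.* file under timing/, so the throughputs can be compared afterwards):

# repeat for each task count (48, 96, 144, 192, 384, 768),
# waiting for the previous job to finish before changing the layout again:
./xmlchange NTASKS=96
./case.setup --reset
./case.build
./case.submit
# afterwards, compare the runs:
grep "Model Throughput" timing/cesm_timing.*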
 

pradalma

Marie-Aude Pradal
New Member
It looks like speed slows down if I use more than 1 core.
Do you mind taking a look at my config_compilers.xml file to see if anything looks odd?

[mpradal1@login01 .cime]$ vi config_compilers.xml
<append compile_threaded="TRUE"> -fopenmp </append>
<!-- Ideally, we would also have 'invalid' in the ffpe-trap list. But at
least with some versions of gfortran (confirmed with 5.4.0, 6.3.0 and
7.1.0), gfortran's isnan (which is called in cime via the
CPRGNU-specific shr_infnan_isnan) causes a floating point exception
when called on a signaling NaN. -->
<append DEBUG="TRUE"> -g -Wall -Og -fbacktrace -ffpe-trap=zero,overflow -fcheck=bounds </append>
<append DEBUG="FALSE"> -O </append>
</FFLAGS>
<FFLAGS_NOOPT>
<base> -O0 </base>
</FFLAGS_NOOPT>
<FIXEDFLAGS>
<base> -ffixed-form </base>
</FIXEDFLAGS>
<FREEFLAGS>
<base> -ffree-form </base>
</FREEFLAGS>
<HAS_F2008_CONTIGUOUS>FALSE</HAS_F2008_CONTIGUOUS>
<LDFLAGS>
<append compile_threaded="TRUE"> -fopenmp </append>
</LDFLAGS>
<MPICC> mpicc </MPICC>
<MPICXX> mpicxx </MPICXX>
<MPIFC> mpif90 </MPIFC>
<NETCDF_PATH>$ENV{NETCDF_DIR}</NETCDF_PATH>
<SCC> gcc </SCC>
<SCXX> g++ </SCXX>
<SFC> gfortran </SFC>
<SUPPORTS_CXX>TRUE</SUPPORTS_CXX>
</compiler>
</config_compilers>
 

jedwards

CSEG and Liaisons
Staff member
More than one core or more than one node? If you slow down when using more than one node I would
guess that your network is not configured correctly or you are not compiling correctly to use it.

You might try running an MPI benchmark suite such as the Intel MPI Benchmarks (GitHub: intel/mpi-benchmarks).
If that suite confirms a slowdown you will need to present it to your system administrators.
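
A minimal sketch of what such a check could look like, assuming the suite builds its standard IMB-MPI1 executable and that mpirun is how MPI jobs are launched on your system (adjust to your batch setup):

# 2 full nodes (2 x 48 ranks): point-to-point and collective performance across the interconnect
mpirun -np 96 ./IMB-MPI1 PingPong Allreduce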
 

pradalma

Marie-Aude Pradal
New Member
Jim,

I don't understand the xmlchange process.
Here is what I changed with xmlchange and what xmlquery returned:
xmlquery COST_PES
COST_PES: 288
[mpradal1@login01 B1850_test_speed_1node]$ xmlchange COST_PES=48
For your changes to take effect, run:
./case.setup --reset
[mpradal1@login01 B1850_test_speed_1node]$ ./case.setup --reset
Successfully cleaned .case.run
Successfully cleaned env_mach_specific.xml
Successfully cleaned Macros.make
Successfully cleaned Macros.cmake
Setting Environment OMP_STACKSIZE=256M
Setting Environment NETCDF_PATH=/data/apps/extern/anaconda/envs/cesm/2.x
Setting Environment OMP_NUM_THREADS=1
/home/mpradal1/cesm2/cime/scripts/B1850_test_speed_1node/env_mach_specific.xml already exists, delete to replace
job is case.run USER_REQUESTED_WALLTIME None USER_REQUESTED_QUEUE None WALLTIME_FORMAT %H:%M:%S
Creating batch scripts
Writing case.run script from input template /home/mpradal1/cesm2/cime/config/cesm/machines/template.case.run
Creating file .case.run
Writing case.st_archive script from input template /home/mpradal1/cesm2/cime/config/cesm/machines/template.st_archive
Creating file case.st_archive
If an old case build already exists, might want to run 'case.build --clean' before building
You can now run './preview_run' to get more info on how your case will be run
[mpradal1@login01 B1850_test_speed_1node]$ xmlquery COST_PES
COST_PES: 288
[mpradal1@login01 B1850_test_speed_1node]$
 

jedwards

CSEG and Liaisons
Staff member
COST_PES is not the variable that you want to change; it is only used for computing the timing table and has
no effect on the model configuration. What you want to change is NTASKS:

./xmlchange NTASKS=48 will set all components to use 48 tasks
./xmlchange NTASKS_ATM=48 will set the ATM component to use 48 tasks and leave the others alone.
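
A quick way to confirm that a change took effect is to query it back, for example:

./xmlchange NTASKS=48
./xmlquery NTASKS_ATM NTASKS_OCN NTASKS_CPL   # should all report 48 now
./pelayout                                    # full per-component view, as shown earlier in this thread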
 

pradalma

Marie-Aude Pradal
New Member
OK, I did indeed set NTASKS to 48.
I am running benchmark tests to see what the maximum speed is that we can get on our system.
If COST_PES is used for computing the timing tables, does that mean the reported model throughput may not reflect what it actually is? I am confused about what the value of COST_PES does.
 

jedwards

CSEG and Liaisons
Staff member
COST_PES is an internal variable which is updated automatically. It doesn't affect the model throughput calculation; it only affects Model Cost.
You don't need to worry about it.
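
For what it's worth, a back-of-the-envelope check against the first timing file you posted (just reading the numbers, not an official formula): Model Cost is roughly pe count × wall-clock run time / simulated time, i.e. 192 pes × (4020.197 s / 3600) hr / (5 model days / 365) yr ≈ 15652 pe-hrs/simulated_year, matching the reported 15651.97. Model Throughput is just simulated time over wall time, (5/365) yr / (4020.197 s / 86400) days ≈ 0.29 simulated_years/day, with no dependence on the pe count.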
 