
CCSM3 on bluefire: Help with load balance

Dear CCSM3 experts

I set up CCSM3 to run on bluefire by modifying the original scripts from bluevista.
I need your help and advice on how to produce a load-balanced and efficient setup,
to save GAUs on my runs.

I am mostly interested in the T85_gx1v3 resolution.
The original (TASK,THREAD) pairs (i.e. MPI tasks and OpenMP threads)
on bluevista for the T85_gx1v3 resolution are listed below.
They are available in the latest downloadable version of CCSM3 (3.0.1_beta14),
in the file "scripts/ccsm_utils/Machines/env.ibm.bluevista".
They seem to be the same for the other, older NCAR IBM machines (bluesky, blackforest,
eagle, and even "generic_ibm"):

Component   TASK   THREAD
atm           16        4
lnd           28        1
ocn           20        1
ice            8        1
cpl            8        1

(Total processors = 128 = 16x4 + 28x1 + 20x1 + 8x1 + 8x1 )
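As a quick check of the arithmetic above, here is a minimal Python sketch that multiplies tasks by threads for each component and sums the result. The dictionary and function names are only illustrative (they are not part of the CCSM3 scripts); the numbers are copied from the table.

```python
# (MPI tasks, OpenMP threads) per component, copied from the table above.
# The names "layout", "processors_per_component" and "total_processors"
# are illustrative only; they do not come from the CCSM3 scripts.
layout = {
    "atm": (16, 4),
    "lnd": (28, 1),
    "ocn": (20, 1),
    "ice": (8, 1),
    "cpl": (8, 1),
}


def processors_per_component(layout):
    """Return processors used by each component (tasks x threads)."""
    return {name: tasks * threads for name, (tasks, threads) in layout.items()}


def total_processors(layout):
    """Return the total processor count for the whole layout."""
    return sum(processors_per_component(layout).values())


per_component = processors_per_component(layout)
total = total_processors(layout)
for name, count in per_component.items():
    share = 100.0 * count / total
    print(f"{name}: {count:3d} processors ({share:.1f}% of total)")
print(f"total: {total}")
```

Printing the share per component makes it easy to see how the processors are split when trying a different set of (TASK,THREAD) pairs.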

I ran this 128-processor setup on bluefire using two nodes with SMT.
It took about 38 minutes of wall time to produce 5 days of simulation.
This is very slow, and most likely not load balanced.
At this rate, it would take about 46 hours to simulate one year!
The ocean does not seem to have enough processors, whereas the atmosphere
and land perhaps have too many.

I experimented with several different "(TASK,THREAD)" sets,
using 1, 2, and 4 nodes, both with SMT and with processor binding.
In my experience, using one node or four nodes makes things worse,
and so does processor binding.
However, with two nodes and different "(TASK,THREAD)" pairs,
I could reduce the wall time for 5 simulation days to about 8.5 minutes.
This is much better than before,
but it is still slow, and would take about 10 hours to simulate one year.
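The per-year estimates quoted above follow from a simple linear extrapolation of the 5-day timings. A minimal Python sketch of that calculation is below. It assumes a 365-day model year and that wall time scales linearly with simulated days (ignoring startup and I/O overhead); the function names are illustrative only.

```python
def hours_per_simulated_year(wall_minutes, simulated_days, days_per_year=365):
    """Extrapolate wall-clock hours needed for one simulated year.

    Assumes wall time scales linearly with the number of simulated days
    and that a model year has 365 days (both are assumptions).
    """
    return wall_minutes * (days_per_year / simulated_days) / 60.0


def simulated_years_per_wall_day(wall_minutes, simulated_days, days_per_year=365):
    """Convert the same timing into simulated years per wall-clock day."""
    return 24.0 / hours_per_simulated_year(wall_minutes, simulated_days, days_per_year)


# Timings quoted in this post, both for 5 simulated days.
timings = {
    "original bluevista layout": 38.0,  # wall minutes
    "best layout found so far": 8.5,    # wall minutes
}

for label, minutes in timings.items():
    hours = hours_per_simulated_year(minutes, 5)
    rate = simulated_years_per_wall_day(minutes, 5)
    print(f"{label}: {hours:.1f} h per simulated year, "
          f"{rate:.2f} simulated years per wall-clock day")
```

Expressing the result as simulated years per wall-clock day makes it easier to compare against throughput figures that other users report.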

Nevertheless, there are so many possible combinations of "(TASK,THREAD)"
pairs, total number of processors, SMT, and processor binding,
that I cannot possibly try them all to find the sweet spot.

Hence my question:

Can anybody there tell me the optimal setup for CCSM3 T85_gx1v3 on bluefire?

Even better:

Can anybody post a load balanced env.ibm.bluefire file on this bulletin board?

Any suggestions, or previous experience with load balancing T85_gx1v3 (or other resolutions),
would also be greatly appreciated.


Thank you very much.
Gus Correa
 
Did you figure this out? I just did some benchmarking for a T85_gx1v3 run and got around 10 simulated years per wall clock day. I can send you my configuration if you still need it.
 