Why is CAM slower in CCSM3.0 than in CCSM2.0.1?

Hello CAM maintainers and user list

I posted this on the CCSM list, but maybe it is appropriate to post it here also,
since the questions are related to CAM's performance.

I installed CCSM2.0.1 and CCSM3.0 on our beowulf cluster at LDEO.
Somehow the new version runs about 47% slower than the old one.
A 1-year t42_gx1v3 full dynamic run, using 32 cpus,
takes about 21h 13min on CCSM2.0.1, but lingers for 31h 15min on CCSM3.0,
using the same cpu/component distribution.

FYI, our cluster nodes are Dual 1.2GHz Athlon, 1GB ram, with Myrinet 2000,
Linux kernel 2.4.18, MPICH 1.2, Gnu and PGI 5.2-4 compilers, and PBS.

The timing data provided by the coupler suggest that the atmosphere component
"cam" is much slower in CCSM3.0 than it was in CCSM2.0.1.
Here is a comparison of timers "t25" on CCSM3.0 and "t26" on CCSM2.0.1,
which correspond to the atm -> cpl communication:

CCSM3.0 cpl log file : (shr_timer_print) timer 27: 8760 calls, 70403.943s, id: t25
CCSM2.0.1 cpl log file: (shr_timer_print) timer 27: 8760 calls, 47996.592s, id: t26

Cam's MPI communication was significantly modified in CCSM3.0.
Would this be the reason for the drop in speed?

Cam's timing files suggest that this is the case.
They show that, compared to CCSM2.0.1:

1. The total time "cam" takes between send/recv to/from the coupler increased by about 67%;

MODEL (cam component)         cam timer        No. of calls   wall time (s)
CCSM3.0   (cam's timing.0)    ccsm_rcvtosnd    8761           69598.148
CCSM3.0   (cam's timing.0)    ccsm_sndtorcv    8760           28763.375
CCSM3.0   total cam <-> cpl communication time:               98361.523

CCSM2.0.1 (cam's timing.0)    ccsm_rcvtosnd    8761           46653.520
CCSM2.0.1 (cam's timing.0)    ccsm_sndtorcv    8760           12143.051
CCSM2.0.1 total cam <-> cpl communication time:               58769.571

2. The total time spent on all MPI routines (i.e. communication time)
increased by about 33%.
MODEL        Total wall time spent on MPI calls
CCSM3.0      18500 s
CCSM2.0.1    13894 s

Most of the difference appears to be
due to the replacement of "mpi_sendrecv" by "mpi_alltoallv":


MODEL (cam component)         MPI function     No. of calls   wall time (s)
CCSM3.0   (cam's timing.0)    mpi_alltoallv    52562          16671.400
CCSM2.0.1 (cam's timing.0)    mpi_sendrecv     762120         4625.800

3. Overall, some of cam's most computationally intensive routines became significantly slower:

(Wall time in seconds)
MODEL/ROUTINE   phys_driver   radctl   dynpkg   realloc4(a)
CCSM3.0         72292         35701    38137    16870
CCSM2.0.1       54795         19286    20077    1827

___________________

Questions:

A) Is there a simple way to improve the performance of cam in CCSM3.0?

B) Was there a significant increase in the calculations performed by cam's physics/dynamics algorithms,
which might justify the 47% increase in wall time?
Or is the MPI framework of the new "cam" in CCSM3.0 tuned to NCAR's shared memory machines,
but not optimized for distributed memory beowulf clusters?

( I tried increasing cam's cpus from 6 to 8, while decreasing the land (clm) cpus from 4 to 2.
However, CCSM3.0 (walltime 26h30min) is still 32% slower than CCSM2.0.1 (walltime 20h07min).
I guess the problem is beyond load balance, and CCSM3 is in fact slower than its predecessor. )

C) Is there a simple option (namelist option, macro definition for compilation, or other)
that would restore the style of cam's MPI communication to what it was in CCSM2.0.1,
which seems to be more efficient on beowulf clusters?

Thank you very much.

Gus Correa
 
Hi,
I am the author of much of the current MPI logic in CAM physics and the CAM spectral dycores.

1) You can run the code using mpi_sendrecv instead of mpi_alltoallv by adding the following lines to the CAM namelist ...

dyn_alltoall = 1
phys_alltoall = 1
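
For reference, a minimal sketch of where these lines would sit, assuming the namelist uses CAM3's usual &camexp group (keep whatever other settings your case already has):

&camexp
 ! ... existing case settings ...
 dyn_alltoall  = 1    ! dycore transposes use mpi_sendrecv instead of mpi_alltoallv
 phys_alltoall = 1    ! physics load-balancing exchange uses mpi_sendrecv
/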

2) Unless mpi_alltoallv is seriously broken in your MPI library, I doubt that MPI is the culprit here. Load imbalance shows up in the MPI timers, and the physics cost has increased since CAM2.0. How much, I don't know - you'll have to ask one of the science developers. CAM3 has load balancing options that work in CCSM3, but CAM2 load balancing was never validated in CCSM2 to my knowledge. If you compile CAM (2 or 3) with the CPP flag TIMING_BARRIERS set (adding -DTIMING_BARRIERS to CPP_FLAGS) you can trap load imbalance in the timing barrier timer events. First, though, you might simply compare min and max timers for the events bc_physics and ac_physics across all processes for both CAM2 and CAM3 to check whether there are obvious differences.
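
As an illustration only (the exact make variable depends on your local build scripts, and the grep over the per-process timing.* files is just one way to eyeball the spread across ranks):

# rebuild cam with the barrier timers enabled (variable name follows the note above)
CPP_FLAGS += -DTIMING_BARRIERS

# then compare min/max of the physics timers across MPI ranks
grep "bc_physics" timing.*
grep "ac_physics" timing.*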

3) There are LOTS of MPI tuning options in CAM3. However, in my experience, most of them don't make much difference. It might be worthwhile trying load balancing. A low communication option is

phys_loadbalance = 3
phys_alltoall = 1

The optimal physics load balance (and maximum communication cost) is

phys_loadbalance = 2
phys_alltoall = 0 or 1, depending on whether you want to use mpi_alltoallv or mpi_sendrecv to implement the alltoallv

There are also options to use non-blocking send/recv commands if you think that these would be better than mpi_sendrecv. I'd wait until you look at the load imbalance and sendrecv vs. alltoallv issues.
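
To make the two load-balancing configurations concrete, the same &camexp sketch as above applies (these are namelist settings, so no rebuild is needed; try them one at a time and compare the timers):

 ! low-communication load balancing
 phys_loadbalance = 3
 phys_alltoall    = 1

 ! optimal load balancing, here using the mpi_alltoallv implementation
 phys_loadbalance = 2
 phys_alltoall    = 0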

Hope that this helps.

Pat Worley
 

pjr

I can comment a bit on the physics changes between CAM2 and CAM3.
We have added significantly to the physics cost: explicit transport of two more water species (ice and liquid), sedimentation and phase changes between condensate species, explicit representation of about 10 aerosol species, significant improvements in radiative transfer and cloud overlap, and code modifications that have occasionally sacrificed performance for usability. There is no doubt that CAM3 is more expensive than CAM2.


Phil
 
Hi, CAM users:

I have run CCSM3 many times with a monthly run set in env_run and obtained 10 years of monthly-mean model output in only one day of wall time. But when I run the stand-alone CAM3.1 for another climate study, I find that with the same T31 resolution and a similar processor count (30 CPUs for CCSM3, 32 for CAM3.1), I can only get one year of monthly-mean output in a one-day run, which is about 10 times slower than CCSM3.

Is that because in CAM3.1 there is no option to choose "daily", "monthly", or "yearly" as in the CCSM3 env_run script, and one instead has to set, for example, nelapse = -3650 ("daily") for 10 years? Does my CAM3.1 result, one year of model output costing one day of wall time, make sense? And is it possible to set a "monthly" run in CAM3.1 as in CCSM3? I tried making dtime 10 times larger, but the model stopped.

I would appreciate a quick answer, especially from software engineers involved in developing CAM3.1.

John
from Purdue University
School of Earth and Atmosphere Sciences
Indiana
 

pjr

John,

There must be something wrong in your run setup. One obvious thing to check is whether you are running with "debug" enabled. That will certainly do it.
Also, things such as whether you are using the same number of processors and whether you have MPI and OpenMP enabled will make a difference to your throughput. I would look around the department for somebody with more experience than yourself and ask for their advice.

Since CAM3 is a subset of CCSM, there is no reason it should take longer to run, given the same level of optimization, number of CPUs, etc. I bet that the modifications needed to make it perform well amount to fewer than 5 lines of changes to the build and run procedure, but I can't advise you on that remotely.
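
For example (a sketch only; I'm assuming the standard CAM3.1 configure utility and its -spmd/-nosmp/-debug switches, so adjust to your local build procedure), the kind of thing to check is:

# build cam with MPI only, OpenMP off, and WITHOUT the -debug option
./configure -spmd -nosmp
gmake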

Phil
 
Thank you, Phil, for the reply. I have sorted it out and now I can run CAM3.1 about 18 model years in one day. Several factors caused the slow run: 1) OpenMP does not scale here and ought to be turned off in the Intel compiler options; 2) mpirun -v -np 32 $bldrun/cam
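
A rough sketch of what the OpenMP part amounts to (the -openmp flag name is for the Intel compilers of that era and is an assumption about this setup):

# leave -openmp out of the Fortran compile flags so cam is built MPI-only,
# e.g. FFLAGS = -O2        rather than        FFLAGS = -O2 -openmp
# then launch across 32 MPI processes:
mpirun -v -np 32 $bldrun/cam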
 