gus@ldeo_columbia_edu
Hello CCSM maintainers and user list,
I installed CCSM2.0.1 and CCSM3.0 on our beowulf cluster at LDEO.
Somehow the new version runs about 47% slower than the old one.
A 1-year t42_gx1v3 full dynamic run, using 32 cpus,
takes about 21h 13min on CCSM2.0.1, but lingers for 31h 15min on CCSM3.0,
using the same cpu/component distribution.
FYI, our cluster nodes are dual 1.2 GHz Athlons with 1 GB of RAM, connected by Myrinet 2000,
running Linux kernel 2.4.18, with MPICH 1.2, the GNU and PGI 5.2-4 compilers, and PBS.
The timing data provided by the coupler suggest that the atmosphere component
"cam" is much slower in CCSM3.0 than it was in CCSM2.0.1.
Here is a comparison of timers "t25" on CCSM3.0 and "t26" on CCSM2.0.1,
which correspond to the atm -> cpl communication:
CCSM3.0 cpl log file : (shr_timer_print) timer 27: 8760 calls, 70403.943s, id: t25
CCSM2.0.1 cpl log file: (shr_timer_print) timer 27: 8760 calls, 47996.592s, id: t26
Cam's MPI communication was significantly modified in CCSM3.0.
Would this be the reason for the drop in speed?
Cam's timing files suggest that this is the case.
They show that, compared to CCSM2.0.1:
1. The total time cam spends between its send/receive calls to/from the coupler increased by about 67%:
MODEL       cam timer (from cam's timing.0)     No. of calls   Wall time (s)
CCSM3.0     ccsm_rcvtosnd                       8761           69598.148
CCSM3.0     ccsm_sndtorcv                       8760           28763.375
CCSM3.0     total communication time w/ cpl                    98361.523
CCSM2.0.1   ccsm_rcvtosnd                       8761           46653.520
CCSM2.0.1   ccsm_sndtorcv                       8760           12143.051
CCSM2.0.1   total communication time w/ cpl                    58769.571
2. The total time spent on all MPI routines (i.e. communication time)
increased by about 33%.
MODEL       Total wall time on MPI calls (s)
CCSM3.0     18500
CCSM2.0.1   13894
Most of the difference appears to be
due to the replacement of "mpi_sendrecv" by "mpi_alltoallv" (a minimal sketch of the two patterns follows this list):
MODEL       MPI function (from cam's timing.0)   No. of calls   Wall time (s)
CCSM3.0     mpi_alltoallv                        52562          16671.400
CCSM2.0.1   mpi_sendrecv                         762120         4625.800
3. Overall, some of cam's most computationally intensive routines became significantly slower:
(Wall time in seconds)
MODEL       phys_driver   radctl   dynpkg   realloc4(a)
CCSM3.0     72292         35701    38137    16870
CCSM2.0.1   54795         19286    20077    1827
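
As mentioned under item 2 above, here is a minimal, self-contained sketch of the two exchange patterns involved. It is not CAM's actual code: the XOR pairing, the uniform message size CHUNK, and the contiguous buffer layout are placeholders I made up, just to show how a loop of pairwise mpi_sendrecv calls (CCSM2.0.1 style) compares with a single mpi_alltoallv (CCSM3.0 style).

/* Sketch only -- NOT CAM's code.  Every rank exchanges CHUNK doubles with
 * every other rank, first with pairwise MPI_Sendrecv calls (CCSM2.0.1
 * style), then with a single MPI_Alltoallv (CCSM3.0 style). */
#include <mpi.h>
#include <stdlib.h>

#define CHUNK 1024                       /* arbitrary per-partner message size */

int main(int argc, char **argv)
{
    int rank, nprocs, p, step, partner;
    double *sendbuf, *recvbuf;
    int *counts, *displs;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    sendbuf = malloc((size_t)nprocs * CHUNK * sizeof(double));
    recvbuf = malloc((size_t)nprocs * CHUNK * sizeof(double));
    counts  = malloc(nprocs * sizeof(int));
    displs  = malloc(nprocs * sizeof(int));
    for (p = 0; p < nprocs; p++) {
        counts[p] = CHUNK;               /* same amount to/from every rank  */
        displs[p] = p * CHUNK;           /* one contiguous slot per partner */
    }

    /* CCSM2.0.1 style: one pairwise exchange per step. */
    for (step = 1; step < nprocs; step++) {
        partner = rank ^ step;           /* simple symmetric pairing, not CAM's schedule */
        if (partner < nprocs)
            MPI_Sendrecv(sendbuf + displs[partner], CHUNK, MPI_DOUBLE, partner, 0,
                         recvbuf + displs[partner], CHUNK, MPI_DOUBLE, partner, 0,
                         MPI_COMM_WORLD, &status);
    }

    /* CCSM3.0 style: the whole exchange in one collective call. */
    MPI_Alltoallv(sendbuf, counts, displs, MPI_DOUBLE,
                  recvbuf, counts, displs, MPI_DOUBLE, MPI_COMM_WORLD);

    free(sendbuf); free(recvbuf); free(counts); free(displs);
    MPI_Finalize();
    return 0;
}

The point of the sketch is only that the collective version delegates the pairing and scheduling to the MPI library, so its cost on our machines depends on how MPICH 1.2 implements mpi_alltoallv over Myrinet rather than on cam itself.
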
___________________
Questions:
A) Is there a simple way to improve the performance of cam in CCSM3.0?
B) Was there a significant increase in the calculations performed by cam's physics/dynamics algorithms,
which might justify the 47% increase in wall time?
Or is the MPI framework of the new "cam" in CCSM3.0 tuned to NCAR's shared-memory machines,
but not optimized for distributed-memory beowulf clusters?
(I tried increasing cam's cpus from 6 to 8, while decreasing the land (clm) cpus from 4 to 2.
However, CCSM3.0 (wall time 26h 30min) is still 32% slower than CCSM2.0.1 (wall time 20h 07min).
I guess the problem is beyond load balance, and CCSM3.0 is in fact slower than its predecessor.)
C) Is there a simple option (a namelist setting, a compile-time macro definition, or something else)
that would restore cam's MPI communication to the CCSM2.0.1 style,
which seems to be more efficient on beowulf clusters?
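
To make the "macro definition" part of question C concrete, here is a purely hypothetical sketch: OLD_STYLE_SENDRECV is a name I invented, and I do not know whether CAM exposes anything of the kind. It only shows the shape of the compile-time switch I am hoping for (compile with -DOLD_STYLE_SENDRECV to get the pairwise exchange, otherwise the collective one is used).

/* Hypothetical sketch: OLD_STYLE_SENDRECV is an invented macro, not a real
 * CAM/CCSM build option. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    double *send, *recv;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    send = calloc(nprocs, sizeof(double));
    recv = calloc(nprocs, sizeof(double));

#ifdef OLD_STYLE_SENDRECV
    {   /* CCSM2.0.1-like: ring of pairwise exchanges */
        int step, partner, source;
        MPI_Status status;
        recv[rank] = send[rank];                       /* local copy for self */
        for (step = 1; step < nprocs; step++) {
            partner = (rank + step) % nprocs;          /* send to the right   */
            source  = (rank - step + nprocs) % nprocs; /* recv from the left  */
            MPI_Sendrecv(&send[partner], 1, MPI_DOUBLE, partner, 0,
                         &recv[source],  1, MPI_DOUBLE, source,  0,
                         MPI_COMM_WORLD, &status);
        }
    }
#else
    /* CCSM3.0-like: one collective call */
    MPI_Alltoall(send, 1, MPI_DOUBLE, recv, 1, MPI_DOUBLE, MPI_COMM_WORLD);
#endif

    free(send); free(recv);
    MPI_Finalize();
    return 0;
}
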
Thank you very much.
Gus Correa