
Parallel efficiency of CESM1_0_3 on a Beowulf linux cluster

zrj@ustc_edu_cn

New Member
Hi, everybody.

I've set up a Beowulf Linux cluster with 4 PCs. Each PC has 2 CPUs (Intel Xeon E5620 @ 2.40 GHz), and each CPU has 4 cores / 8 threads, so in total the cluster has 8 CPUs, 32 cores, and 64 threads. (For more information, see the attachment CPU.txt.)
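A quick way to confirm this breakdown on Linux is to read /proc/cpuinfo. One thing worth noting: the 64 "threads" are hyperthreads, and MPI jobs usually gain little from running more ranks than the 32 physical cores. A minimal sketch using the standard /proc interface:

```shell
# Count logical CPUs (hyperthreads), cores per socket, and sockets
threads=$(grep -c '^processor' /proc/cpuinfo)
cores=$(grep -m1 '^cpu cores' /proc/cpuinfo | awk -F': ' '{print $2}')
sockets=$(grep '^physical id' /proc/cpuinfo | sort -u | wc -l)
echo "logical CPUs: $threads, cores/socket: $cores, sockets: $sockets"
```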
View attachment 73

The cluster runs CentOS 5.5; the Intel compilers (ifort 11.1 and icc 11.1), OpenMPI 1.4.3, and NetCDF 4.1.1 are installed. CESM1_0_3 compiles and runs on the cluster successfully.

I ran a case EX1 (FW, f45_f45, NTASKS*=8, NTHRDS*=1, ROOTPE*=0, mpirun -np 8 ./ccsm.exe, 5 days) on one node; it took about 18 minutes. However, when I ran another case EX0 (FW, f45_f45, NTASKS*=32, NTHRDS*=1, ROOTPE*=0, mpirun -np 32 -H node00,node01,node02,node03 ./ccsm.exe, 5 days) on all four nodes, it took more time, about 20 minutes. Why is it slower on 4 nodes than on 1 node? I have no idea. The timing files of EX0 and EX1 can be found in the attachments. Could you help me improve the parallel efficiency of CESM on my cluster? How should I set the PE layout?
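In case it matters, the NTASKS/NTHRDS values above were applied from the case directory with xmlchange, roughly as below (a sketch following the CESM1.0 scripts; env_mach_pes.xml is locked after configure, so it has to be cleaned first; exact flag names should be checked against the CESM1.0 user's guide):

```shell
# How the EX0 PE layout was set (CESM1.0-style scripts)
./configure -cleanmach
./xmlchange -file env_mach_pes.xml -id NTASKS_ATM -val 32
./xmlchange -file env_mach_pes.xml -id NTHRDS_ATM -val 1
# repeat for the other components (LND, ICE, OCN, CPL, GLC) as needed
./configure -case
```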

Thanks in advance.

BTW: I tested a simple parallel program, pi.f90; with it, 4 nodes are significantly faster than 1 node.
 

tcraig

Member
According to the timing files, the time on 8 tasks and on 32 tasks is nearly exactly the same, as you noted. The timing files also suggest that 99% of the time is spent in the atmosphere model. What can you tell us about the interconnect and the ability of this system to scale performance on real applications? I suggest you run an additional test: set NTASKS=8 but place 2 tasks on each node. It will be interesting to compare that timing with the other two results.
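With OpenMPI, this placement (8 tasks total, spread 2 per node) can be requested with the -npernode flag. A command sketch, reusing the host list and executable from the earlier runs:

```shell
# 8 tasks total, 2 per node, across the 4 hosts
mpirun -np 8 -npernode 2 -H node00,node01,node02,node03 ./ccsm.exe
```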
 

zrj@ustc_edu_cn

New Member
Thanks for replying.

I ran another case EX2(FW, f45_f45, NTASKS*=8, NTHRDS*=1, ROOTPE*=0, mpirun -np 8 -H node00,node01,node02,node03 ./ccsm.exe, 5days). It took about 22 minutes. The timing file can be found in the attachments.

I ran a simple program (pi.f90, source code in the attachments), the following is the result:

[zrj@node00 tmp]$ mpif90 pi.f90 -shared-intel -ip
[zrj@node00 tmp]$ mpirun ./a.out
Estimate of Pi is 3.14159265358970
Elapsed seconds = 12.1631829738617
[zrj@node00 tmp]$ mpirun -np 4 ./a.out
Estimate of Pi is 2.89050019437654
Elapsed seconds = 3.07132005691528
[zrj@node00 tmp]$ mpirun -np 8 ./a.out
Estimate of Pi is 2.97514860332297
Elapsed seconds = 1.60712099075317
[zrj@node00 tmp]$ mpirun -np 16 ./a.out
Estimate of Pi is 2.99378456940981
Elapsed seconds = 1.50256085395813
[zrj@node00 tmp]$ mpirun -np 8 -H node00,node01,node02,node03 ./a.out
Estimate of Pi is 2.97514860332297
Elapsed seconds = 1.69279789924622
[zrj@node00 tmp]$ mpirun -np 16 -H node00,node01,node02,node03 ./a.out
Estimate of Pi is 2.99378456940981
Elapsed seconds = 0.870617866516113
[zrj@node00 tmp]$ mpirun -np 32 -H node00,node01,node02,node03 ./a.out
Estimate of Pi is 3.12345321446372
Elapsed seconds = 0.452975988388062
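As an aside, the single-process run gets pi right, but every multi-rank estimate drifts away from pi, and deterministically so (both -np 8 runs print the same wrong value). That pattern suggests the interval decomposition in pi.f90 may skip or double-count subintervals, not a network problem. For reference, the serial midpoint-rule integration of 4/(1+x^2) on [0,1] (presumably what pi.f90 computes) can be checked with awk:

```shell
# Serial midpoint-rule estimate of pi = integral of 4/(1+x^2) over [0,1]
awk 'BEGIN {
  n = 1000000; h = 1.0 / n; s = 0
  for (i = 1; i <= n; i++) { x = h * (i - 0.5); s += 4.0 / (1.0 + x * x) }
  printf "%.6f\n", s * h    # prints 3.141593
}'
```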

Is there any way to improve CESM running speed?

THANKS.
 

jacob@mcs_anl_gov

Rob Jacob
New Member
What kind of network is connecting your 4 PCs? CESM sends a lot of messages, and a slow network will hurt performance. pi.f90 sends almost no messages and is not a good test of scaling.
 

zrj@ustc_edu_cn

New Member
My 4 PCs are connected by a 1 Gigabit Ethernet switch. The following are the cluster network settings:

View attachment 76

I used iperf to test the network speed between each pair of nodes. The results are:
[ 4] local 192.168.0.100 port 5001 connected with 192.168.0.101 port 55030
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-10.0 sec 1.10 GBytes 941 Mbits/sec
[ 5] local 192.168.0.100 port 5001 connected with 192.168.0.102 port 45353
[ 5] 0.0-10.0 sec 1.10 GBytes 941 Mbits/sec
[ 4] local 192.168.0.100 port 5001 connected with 192.168.0.103 port 47410
[ 4] 0.0-10.0 sec 1.10 GBytes 941 Mbits/sec

I used NetPIPE to test OpenMPI. I ran mpirun -np 2 -H node00,node01 ./NPmpi and mpirun -np 2 ./NPmpi respectively, and found that MPI transfer between two nodes is significantly slower than within one node. The test results are in the attachments. So I think the key to parallel efficiency is OpenMPI. What can I do to accelerate OpenMPI transfers?
 

jacob@mcs_anl_gov

Rob Jacob
New Member
Hi rjzhou,

I'm afraid it's not the OpenMPI software. It's the 1G switch. The hardware latency of Ethernet switches is too high. You'll need to replace it with an InfiniBand switch. No amount of optimizing the MPI software will get around that.

Rob
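To put rough numbers on this (typical figures, not measurements from this cluster): MPI latency through a Gigabit Ethernet switch is commonly 50-100 us, versus 1-2 us for InfiniBand, while the iperf test above pegs the wire bandwidth at 941 Mb/s. A simple latency-plus-bandwidth cost model for one small message shows why latency dominates for CESM's many small messages:

```shell
# Cost model t = latency + size/bandwidth for a 1 KiB MPI message
# (assumed latencies: 100 us GigE, 2 us InfiniBand; 941 Mb/s from iperf)
awk 'BEGIN {
  bw = 941e6 / 8                 # bytes/s
  s  = 1024                      # message size in bytes
  ge = 100e-6 + s / bw           # Gigabit Ethernet
  ib = 2e-6   + s / bw           # InfiniBand-class latency, same wire speed
  printf "GigE %.0f us, IB %.0f us\n", ge * 1e6, ib * 1e6
}'
# prints: GigE 109 us, IB 11 us
```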
 