zrj@ustc_edu_cn
New Member
Hi, everybody.
I've set up a Beowulf Linux cluster with 4 PCs. Each PC has 2 CPUs (Intel Xeon E5620 @ 2.40 GHz), and each CPU has 4 cores and 8 threads. So in total there are 8 CPUs, 32 cores, and 64 threads. (For more information, see the attachment CPU.txt.)
View attachment 73
The cluster runs CentOS 5.5 with the Intel compilers (ifort 11.1 and icc 11.1); openmpi 1.4.3 and netcdf 4.1.1 are installed successfully. CESM1_0_3 compiles and runs on the cluster successfully.
I ran a case EX1 (FW, f45_f45, NTASKS*=8, NTHRDS*=1, ROOTPE*=0, mpirun -np 8 ./ccsm.exe, 5 days) on one node, and it took about 18 minutes. However, when I ran another case EX0 (FW, f45_f45, NTASKS*=32, NTHRDS*=1, ROOTPE*=0, mpirun -np 32 -H node00,node01,node02,node03 ./ccsm.exe, 5 days) on all four nodes, it took longer, about 20 minutes. Why is it slower on 4 nodes than on 1 node? I have no idea. The timing files of EX0 and EX1 can be found in the attachments. Could you help me improve the parallel efficiency of CESM on my cluster? How should I set the PE layout?
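To put rough numbers on the slowdown, here is a small sketch of the speedup arithmetic. It assumes the two wall-clock times above (about 18 and 20 minutes) are representative, and the function names are just for illustration, not anything from CESM.

```python
# Rough scaling arithmetic for the two runs described above.
# The times are the approximate wall-clock values quoted in this post;
# the helper names are hypothetical and not part of CESM.

def relative_speedup(t_base, t_new):
    """How many times faster the new run is than the baseline run."""
    return t_base / t_new

def relative_efficiency(t_base, n_base, t_new, n_new):
    """Speedup divided by the increase in MPI task count (1.0 = ideal scaling)."""
    return relative_speedup(t_base, t_new) / (n_new / n_base)

# EX1: 8 MPI tasks on one node; EX0: 32 MPI tasks across four nodes.
ex1_minutes, ex1_tasks = 18.0, 8
ex0_minutes, ex0_tasks = 20.0, 32

speedup = relative_speedup(ex1_minutes, ex0_minutes)
efficiency = relative_efficiency(ex1_minutes, ex1_tasks, ex0_minutes, ex0_tasks)

print(f"speedup of EX0 over EX1: {speedup:.2f}")
print(f"parallel efficiency relative to EX1: {efficiency:.3f}")
```

With these numbers the 32-task run is slower than the 8-task run (speedup below 1), so the relative efficiency is well under a quarter of ideal scaling.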
Thanks in advance.
BTW: I tested a simple parallel program, pi.f90, and it runs significantly faster on 4 nodes than on 1 node.