
CAM 5.3 Performance Analysis on Institute HPC Machine

Hello,

We have performed parallel test runs of CESM 1.2.0 (CAM 5.3) as standalone CAM on a Linux machine, using the PGI CDK 13.7 compiler (with its bundled MPI2 library) and parallel builds of the NetCDF 4.2 libraries. We ran 1-day, 30-day, and 365-day tests at the 1.9x2.5 resolution, varying the number of processors over 1, 12, 24, 30, 36, 48, and 60.

When we run the model with 1, 12, 24, and 30 cores, performance is as expected: compute time decreases as cores are added. These runs use the default decomposition. But when we run with 36, 48, and 60 cores, the compute time increases and the model takes longer to complete. We tested different npr_yz combinations for 36, 48, and 60 cores but did not find much improvement. I am attaching the performance table for the 1-day, 30-day, and 365-day test runs. We cannot figure out why the compute time increases when we increase the number of cores, while with fewer cores the timings look right. I also checked with the machine administrator, but we did not find anything. Please comment on the performance of our test runs and suggest what is needed to improve it.

Details of the machine we are using to run the model:
- Nodes: 1 master node and 9 compute nodes, 12 processors per node (hexacore), 24 GB RAM per node
- Master node: Fujitsu Primergy RX300 S7, Intel Xeon E5-2620 @ 2 GHz, 24 GB RAM, 8 TB HDD
- Compute nodes (0-8): Fujitsu Primergy RX200 S7, Intel Xeon E5-2620 @ 2 GHz, 24 GB RAM, 500 GB HDD
- Rocks Cluster 6 with the Torque/PBS job scheduler
- Compiler: PGI CDK 13.7
- Operating system: Linux (CentOS 6.2)

Thank you in anticipation.

Regards,
Ram, IIT Delhi
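(For reference, the 2D decomposition is selected through the npr_yz namelist variable passed to build-namelist. The values below are only an illustration of the kind of combination we tried for a 48-task run, not the exact settings from the attached table; the four values are npr_y, npr_z, nprxy_x, nprxy_y, with npr_y x npr_z and nprxy_x x nprxy_y each equal to the number of MPI tasks.)
------------------------------------------------------------
&camexp
 npr_yz = 8,6,6,8   ! 8 x 6 = 48 tasks for the yz decomposition, 6 x 8 = 48 for xy
/
------------------------------------------------------------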
 

eaton

CSEG and Liaisons
I agree that the scaling looks reasonable out to 30 tasks.  You could go to 32 tasks and still use the default 1D decomposition.  Beyond that you must use the 2D decomposition via the npr_yz variable, as you have done.  This does increase the MPI overhead, but it is surprising to see such bad performance.  I don't have any experience with tuning MPI installations, so I can't offer any advice there.  But a workaround is to try to get more performance by using fewer tasks and running threads within each task.  You should be able to verify whether or not this will help by rerunning the 1-node test using a couple of different threading configurations.  For example, instead of running pure MPI with 12 tasks, try using 6 tasks with 2 threads per task, and then try 3 tasks with 4 threads per task.  If you get similar performance from the pure MPI and the hybrid configurations, there is a good chance this will help significantly at the higher core counts (>30 cpus).
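As a minimal sketch of what changes between those builds (assuming the standalone CAM configure you are already using, with all of your other configure arguments unchanged), only the -ntasks/-nthreads arguments differ, and their product should equal the 12 cpus on the node:
------------------------------------------------------------
# (other configure arguments unchanged from your existing builds)
configure -ntasks 12 -nthreads 1    # pure MPI:  12 tasks x 1 thread = 12 cpus
configure -ntasks 6  -nthreads 2    # hybrid:     6 tasks x 2 threads = 12 cpus
configure -ntasks 3  -nthreads 4    # hybrid:     3 tasks x 4 threads = 12 cpus
------------------------------------------------------------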
 
Hi,

Thanks for your illustrative reply. We installed OpenMPI. When we run the model with OpenMPI using the 1D decomposition we get almost the same compute timings, in fact slightly improved. But for the 2D decomposition at higher core counts (>32), when giving ntasks and nthreads, the model run gets stuck at the lines below:
------------------------------------------------------------
get_rsf: numalb, numcolo3, numsza, nump =   6   20   24   151
get_rsf: size of rsf_tab =   67   151   24   20   6
------------------------------------------------------------
We configured the model with the following commands:
------------------------------------------------------------
/home/opt/app/cesm1_2_0/models/atm/cam/bld/configure -fc_type pgi -fc mpif90 -cc mpicc -dyn fv -hgrid 1.9x2.5 -ntasks 6 -nthreads 7 -test
gmake
/home/opt/app/cesm1_2_0/models/atm/cam/bld/build-namelist -test -config /home/test/NewPerformanceTest/Test9_42/bld/config_cache.xml -namelist "&camexp npr_yz=6,7,7,6 /"
------------------------------------------------------------
This is the Torque/PBS run script:
------------------------------------------------------------
#!/bin/bash
#PBS -q batch
#PBS -N CAM
#PBS -l walltime=24:00:00
#PBS -l nodes=6:ppn=7

cd $HOME/NewPerformanceTest/Test9_42/run
mpiexec -np 42 -hostfile $PBS_NODEFILE $HOME/NewPerformanceTest/Test9_42/bld/cam >& cam.log
------------------------------------------------------------
Now we want to run the model with -ntasks and -nthreads for higher core counts (>32 cpus), but I do not understand how to decide the split between ntasks and nthreads (for 36, 42, and 48 cores). We are also unsure how to set nodes and ppn in the PBS/Torque run script for those combinations. Could you please tell us how to choose tasks and threads, and the corresponding nodes and ppn in the run script?

Thank you in anticipation.

Ram
 

eaton

CSEG and Liaisons
I can't tell whether you've had any success doing the simple tests with OMP.  That is the place to start.  Since you have 12 cpus on a node, the first test is whether a hybrid parallel configuration can give you performance similar to what you get with pure MPI.  For example:

Test 1: 1 node, 6 tasks, 2 threads per task.  To set this run up, first run configure with the arguments "-ntasks 6 -nthreads 2", and in your run script, before the call to build-namelist, set the environment variable OMP_NUM_THREADS to 2.  This value is used by the build-namelist utility to properly set the threading in the namelist file.  The CAM executable ignores the OMP_NUM_THREADS variable and sets up threading by looking at namelist variables.  The PBS command for setting the tasks will look like "-l nodes=1:ppn=6".  You'll need to check system-specific information to see whether the mpiexec command has arguments that control how the threads are assigned to the tasks.  Hopefully the default arrangement will work well.

Once the model has run, the log file will contain information similar to the following, which will tell you whether you've successfully configured both tasks and threads.  Note that this output refers to tasks as "npes":
------------------------------------------------------------
(seq_comm_setcomm)  initialize ID ( 7 ATM             ) pelist   =     0    5     1 ( npes =    6) ( nthreads =  2)
------------------------------------------------------------
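Putting the pieces together, a minimal sketch of Test 1 on your system might look like the following. It reuses the configure options and paths from your 42-core run above; the case directory name Test1_6x2 is only illustrative, and your mpiexec may need extra options to pin threads to cores.
------------------------------------------------------------
# Build for 6 MPI tasks x 2 OpenMP threads (12 cpus on one node)
/home/opt/app/cesm1_2_0/models/atm/cam/bld/configure -fc_type pgi -fc mpif90 -cc mpicc -dyn fv -hgrid 1.9x2.5 -ntasks 6 -nthreads 2 -test
gmake

# Set the thread count before build-namelist so the namelist reflects 2 threads per task
export OMP_NUM_THREADS=2
/home/opt/app/cesm1_2_0/models/atm/cam/bld/build-namelist -test -config $HOME/NewPerformanceTest/Test1_6x2/bld/config_cache.xml
------------------------------------------------------------
Run script (submitted with qsub):
------------------------------------------------------------
#!/bin/bash
#PBS -q batch
#PBS -N CAM_6x2
#PBS -l walltime=24:00:00
#PBS -l nodes=1:ppn=6

cd $HOME/NewPerformanceTest/Test1_6x2/run
export OMP_NUM_THREADS=2
mpiexec -np 6 -hostfile $PBS_NODEFILE $HOME/NewPerformanceTest/Test1_6x2/bld/cam >& cam.log
------------------------------------------------------------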

If this test gives performance similar to the pure MPI test with 12 tasks, then there is hope that using threads will be beneficial.  To see whether using more threads per task might help, do further tests using, for example, 4 tasks with 3 threads per task, then 3 tasks with 4 threads per task, and finally 2 tasks with 6 threads per task.  Note that each test involves rerunning the configure command, rebuilding, and rerunning the build-namelist command, as well as modifying the PBS and mpiexec commands.  These tests all use a single node.  Once you figure out the best way to do threading on a single node, start scaling it up to multiple nodes.  The idea is to use a task/thread combination on each node that fully utilizes the cpus on that node.
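For instance, if 6 tasks x 2 threads turns out to be the best single-node combination, a 36-cpu run would simply repeat that layout on 3 nodes.  A minimal sketch of the corresponding settings (directory names are illustrative; check your MPI documentation for options that keep each task's threads on its own cores):
------------------------------------------------------------
# 36 cpus = 3 nodes x (6 tasks x 2 threads); build with:
#   configure ... -ntasks 18 -nthreads 2

# Run script: 6 task slots per node on 3 nodes, 2 threads per task
#!/bin/bash
#PBS -q batch
#PBS -N CAM_36
#PBS -l walltime=24:00:00
#PBS -l nodes=3:ppn=6

cd $HOME/NewPerformanceTest/Test_36/run
export OMP_NUM_THREADS=2
mpiexec -np 18 -hostfile $PBS_NODEFILE $HOME/NewPerformanceTest/Test_36/bld/cam >& cam.log
------------------------------------------------------------
Note that -np (18) equals nodes x ppn (3 x 6), and the 2 threads per task fill the remaining 6 cpus on each 12-cpu node.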
 