
Jobs slow with more nodes

While running CESM on the Purdue cluster Steele, using OpenMPI (1.5) as the MPI library and the Intel compiler (11.1.072), we encountered a strange situation in which adding more nodes to a job dramatically slowed it down.

A case run on 8 nodes completed in 128 minutes. The same job on 16 nodes slowed dramatically to 230 minutes (two earlier invocations actually hit the 4-hour walltime limit and were evicted from the cluster). This was repeatable. On our other cluster, Coates, we did not see this behavior with the same compiler and libraries: the 8-node time was comparable, and the job sped up with 16 nodes, as expected.
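As a quick sanity check, the timings above can be turned into a parallel-efficiency number (a small illustrative script; the function name is mine, the timings are the ones reported):

```python
def relative_efficiency(t_base, n_base, t_new, n_new):
    """Parallel efficiency of n_new nodes relative to a baseline run
    on n_base nodes: measured speedup divided by ideal speedup."""
    speedup = t_base / t_new
    ideal = n_new / n_base
    return speedup / ideal

# Reported runs: 128 minutes on 8 nodes, 230 minutes on 16 nodes.
eff = relative_efficiency(128, 8, 230, 16)
print(f"relative efficiency at 16 nodes: {eff:.2f}")  # ~0.28
```

Anything below 1.0 means the extra nodes are not paying for themselves; a value well under 0.5, as here, means doubling the node count actually cost wall time outright.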

What kind of bottleneck could be causing this, and is there a good way to debug a situation like this?
 

eaton

CSEG and Liaisons
The scaling curve can turn over due to communication bottlenecks. This depends on the network bandwidth and possibly on the MPI library configuration. I don't have any experience analyzing either of those things. But if you are running an identical model on similar systems and getting very different performance, perhaps your system experts can help.
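The turnover can be sketched with a toy cost model (parameters here are illustrative, not measured from Steele): per-node compute time shrinks like work/n, while communication cost grows with the node count, so total time has a minimum and then rises again.

```python
def model_time(n, work=1024.0, comm=1.0):
    """Toy scaling model: compute shrinks as work/n, communication
    grows linearly as comm*n. Total time bottoms out at n = sqrt(work/comm)."""
    return work / n + comm * n

for n in (8, 16, 32, 64, 128):
    print(f"{n:4d} nodes -> {model_time(n):6.1f} time units")
# With these toy parameters the optimum is 32 nodes; beyond that,
# added communication outweighs the reduced per-node work.
```

A slow interconnect or a poorly tuned MPI stack effectively raises the `comm` coefficient, which pulls the turnover point down toward small node counts, consistent with seeing it between 8 and 16 nodes on one cluster but not another.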
 