
Jobs slow with more nodes

While running CESM on the Purdue cluster Steele, using OpenMPI (1.5) as the MPI library and the Intel compiler (11.1.072), we encountered a strange situation in which adding more nodes to a job dramatically slowed it down.

A case run on 8 nodes completed in 128 minutes. The same job on 16 nodes slowed dramatically to 230 minutes (two earlier invocations actually hit the 4-hour walltime limit and were evicted from the cluster). This was repeatable. On our other cluster, Coates, we did not see this behavior with the same compiler and libraries: the 8-node time was comparable, and the job sped up with 16 nodes, as expected.
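As a quick sanity check, the timings above can be turned into a parallel-efficiency number (a small illustrative script; the function name is mine, the timings are the ones reported):

```python
def relative_efficiency(t_base, n_base, t_new, n_new):
    """Parallel efficiency of n_new nodes relative to a baseline run
    on n_base nodes: measured speedup divided by ideal speedup."""
    speedup = t_base / t_new
    ideal = n_new / n_base
    return speedup / ideal

# Reported runs: 128 minutes on 8 nodes, 230 minutes on 16 nodes.
eff = relative_efficiency(128, 8, 230, 16)
print(f"relative efficiency at 16 nodes: {eff:.2f}")  # ~0.28
```

Anything below 1.0 means the extra nodes are not paying for themselves; a value well under 0.5, as here, means doubling the node count actually cost wall time outright.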

What kind of bottleneck could be causing this, and is there a good way to debug a situation like this?
 

eaton

CSEG and Liaisons
The scaling curve can turn over due to communication bottlenecks. This depends on the network bandwidth and possibly on the MPI library configuration. I don't have any experience analyzing either of those things. But if you are running an identical model on similar systems and getting very different performance, perhaps your system experts can help.
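The turnover can be sketched with a toy cost model (parameters here are illustrative, not measured from Steele): per-node compute time shrinks like work/n, while communication cost grows with the node count, so total time has a minimum and then rises again.

```python
def model_time(n, work=1024.0, comm=1.0):
    """Toy scaling model: compute shrinks as work/n, communication
    grows linearly as comm*n. Total time bottoms out at n = sqrt(work/comm)."""
    return work / n + comm * n

for n in (8, 16, 32, 64, 128):
    print(f"{n:4d} nodes -> {model_time(n):6.1f} time units")
# With these toy parameters the optimum is 32 nodes; beyond that,
# added communication outweighs the reduced per-node work.
```

A slow interconnect or a poorly tuned MPI stack effectively raises the `comm` coefficient, which pulls the turnover point down toward small node counts, consistent with seeing it between 8 and 16 nodes on one cluster but not another.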
 