
performance on AMD EPYC cluster

l_vankampenhout@uu_nl

Leo van Kampenhout
Member
Hello,

I'm in the process of porting CESM2 to the new Dutch supercomputer Snellius, which is built on AMD EPYC 7H12 processors. Each node has 2 CPUs with 64 cores each, making 128 cores per node in total. Nodes are connected with InfiniBand HDR100 (100 Gbps) in a fat-tree topology.

So far, the performance I'm seeing is disappointing: throughput is worse than on the previous Intel-based machine from 2015, using the same number of cores. I've been experimenting with compiler flags and similar settings, but these don't seem to make much difference.

Does anyone have experience with this kind of AMD system? Given the large number of cores per node, should I aim for hybrid parallelization (OpenMP + MPI) rather than pure MPI, as I use now? If so, any advice on how to configure that would be helpful. So far I haven't been able to get it running with good results.

Leo
 

jedwards

CSEG and Liaisons
Staff member
I have ported to a very similar machine, Lonestar6 at TACC, and I find the performance is pretty good. Generally, hybrid parallelization doesn't work well until all opportunities to use pure MPI have been exhausted. What compiler are you using on your system?
 

l_vankampenhout@uu_nl

Leo van Kampenhout
Member
The model cost is around 5000 pe-hours/simulated_year on 1280 cores for a 1-degree, 5-day test run. The timing file is attached.

By the way, thanks for responding, Jim.
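
For readers who prefer throughput to cost, the figure quoted above can be converted into simulated years per wall-clock day. This is only a rough sketch: the function name is made up for illustration, and it assumes the quoted cost is computed over all 1280 cores for the whole run.

```python
def simulated_years_per_day(cores: int, pe_hours_per_sim_year: float) -> float:
    """Convert a cost in pe-hours per simulated year into throughput."""
    # One simulated year takes (cost / cores) wall-clock hours.
    wall_hours_per_sim_year = pe_hours_per_sim_year / cores
    # Number of simulated years that fit into 24 wall-clock hours.
    return 24.0 / wall_hours_per_sim_year


if __name__ == "__main__":
    # Figures quoted in this post: about 5000 pe-hours/simulated_year on 1280 cores.
    sypd = simulated_years_per_day(1280, 5000.0)
    print(f"{sypd:.2f} simulated years per day")
```

With these numbers the result is a little over 6 simulated years per day.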
 

Attachments

  • cesm_timing.snellius_scaling_n10f.216015.211201-102826.txt
    9.8 KB

jedwards

CSEG and Liaisons
Staff member
GCC is not known as a performant compiler. I would suggest trying the Intel compiler, which is now available without charge.
Also, your timing file indicates that some load balancing could improve performance.

Using the same total PE count, you should be able to better balance ice and lnd/rof by changing:
NTASKS_ICE=128,NTASKS_LND=896,NTASKS_ROF=896,ROOTPE_ICE=896
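
To see what these settings imply, here is a small sketch that maps each component to its range of MPI ranks and to whole nodes. It assumes pure MPI with one task per core and 128 cores per node (from the first post). ROOTPE_LND and ROOTPE_ROF are not given above, so the value 0 used here is an assumption, and the helper names are made up for illustration.

```python
CORES_PER_NODE = 128  # from the first post: 2 CPUs x 64 cores

# Values suggested above. ROOTPE_LND and ROOTPE_ROF are not stated in the
# thread; 0 is an assumption made for this sketch.
layout = {
    "ICE": {"ntasks": 128, "rootpe": 896},
    "LND": {"ntasks": 896, "rootpe": 0},
    "ROF": {"ntasks": 896, "rootpe": 0},
}


def pe_range(ntasks, rootpe):
    """Return the inclusive (first, last) MPI rank used by a component."""
    return rootpe, rootpe + ntasks - 1


def overlaps(a, b):
    """True if two inclusive rank ranges share at least one rank."""
    return a[0] <= b[1] and b[0] <= a[1]


ranges = {name: pe_range(**cfg) for name, cfg in layout.items()}

for name, (first, last) in ranges.items():
    nodes = layout[name]["ntasks"] / CORES_PER_NODE
    print(f"{name}: ranks {first}-{last} ({nodes:g} nodes)")

# ICE and LND sit on disjoint ranks, so they can run side by side.
print("ICE/LND overlap:", overlaps(ranges["ICE"], ranges["LND"]))
```

Under these assumptions, lnd and rof share ranks 0-895 (7 nodes) while ice gets ranks 896-1023 (1 node), so ice and land together span 8 of the 10 nodes without overlapping. In a case directory, settings like these are normally applied with ./xmlchange, followed by a reset of the case setup.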
 