Buying a cluster: infiniband vs gigabit ethernet

kshell

New Member
I will be purchasing a small cluster (~$40K) to run CAM and CCSM. Has anyone done CAM/CCSM benchmarks comparing infiniband to gigabit ethernet? Specifically, does infiniband increase performance more than going with ethernet and spending the extra money on more nodes? I realize this all varies based on the chip, # of CPUs per node, # of nodes, switch type, compiler, etc., but any experience you have with these sorts of comparisons would be useful.

Thanks,
-Karen
 

mmoore

New Member
Oddly enough, yes.

The testing we did was using CAM 3.1.1 on a 16-node, dual 3.0 GHz Xeon cluster;
each node had 2 GB of memory. The tests were performed about 2.5 years ago.
I have a printout of the graphs, but I'm not sure I can find the original file.

In a nutshell, at 8 CPUs (not nodes) we were getting 1.8 years/day for IB, vs
1.5 years/day for Gig-e. At 16 CPUs, that went to 3.55 years/day for IB, vs
2.6 years/day for Gig-e. After 16 CPUs, the Gig-e started falling apart.
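
To make the scaling in those numbers explicit, here is a minimal sketch that computes how much of the ideal 2x speedup each interconnect delivered when going from 8 to 16 CPUs. It assumes all four figures are in simulated years per wall-clock day (taking the gigE figure at 16 CPUs as 2.6 years/day); the function name scaling_efficiency is made up for this illustration.

```python
def scaling_efficiency(rate_small, rate_large, cpus_small, cpus_large):
    """Fraction of the ideal speedup achieved when growing the CPU count."""
    speedup = rate_large / rate_small   # measured speedup
    ideal = cpus_large / cpus_small     # perfect linear scaling
    return speedup / ideal

# Throughput in simulated years per day, as quoted in the post above.
ib_eff = scaling_efficiency(1.8, 3.55, 8, 16)
gige_eff = scaling_efficiency(1.5, 2.6, 8, 16)

print(f"IB:   {ib_eff:.0%} of ideal scaling from 8 to 16 CPUs")
print(f"gigE: {gige_eff:.0%} of ideal scaling from 8 to 16 CPUs")
```

On these figures IB keeps roughly 99% of ideal scaling over that doubling and gigE roughly 87%, which fits the observation that gigE degrades beyond 16 CPUs.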

Spend the money on the processors and storage for a small cluster.

Mark Moore, mmoore@ucar.edu
Systems Administrator
 
Hi Karen,

A lot will depend on the exact configuration you need, especially in terms of extras like storage, a rack, etc. Ignoring those extra costs right now, you could probably get 8 nodes (each 8 cores / 16 GB of RAM) with IB for under $40K, and that'd be my suggestion. If you're mainly running CAM and CCSM, you might also wish to consider the AMD processors if you're not already doing so - they're miserable on most serial codes, but for parallel applications they tend to be quite good, especially those like CAM and CCSM which need lots of bandwidth.

The above assumes 'cheap' SDR infiniband, which is still quite a bit better than gigabit ethernet. Take a look at: http://www.clustermonkey.net//content/view/222/1/

I'll respectfully disagree with Mark's comment, since his own numbers on only 16 processors show IB scaling much better - at 16 processors (8 nodes), he was getting 3.55 years/day on IB and only 2.6 years/day on gigE. That's a difference of 1.365x for what should be (these days) a difference in price of only 1.1x if you can get the cheap SDR cards / network listed above. Also, bear in mind that Mark's data is using gigE with a 2:1 core to interconnect ratio whereas these days it'll more likely be 8:1 with dual-socket nodes, which will make gigE scale even worse. Also, the processors are faster these days, so for the same model, communication will occur more frequently... all these things say IB is probably better, in my opinion.
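
The arithmetic behind that comparison can be sketched as follows. This is only an illustration: it takes both throughput figures as simulated years per day, and it takes the 1.1x price ratio as the rough estimate given in the post, not a quoted price. The variable names are invented for the example.

```python
# Throughput at 16 processors, in simulated years per wall-clock day.
ib_rate = 3.55
gige_rate = 2.6

# Assumed cost of the IB cluster relative to the gigE one (estimate from the post).
price_ratio = 1.1

speedup = ib_rate / gige_rate          # how much faster IB runs
value_ratio = speedup / price_ratio    # throughput per dollar, IB relative to gigE

print(f"IB speedup over gigE: {speedup:.3f}x")
print(f"IB throughput per dollar vs gigE: {value_ratio:.2f}x")
```

Under these assumptions IB delivers about 1.365x the throughput and about 1.24x the throughput per dollar at 16 processors. A price ratio above roughly 1.365x would reverse the conclusion at this processor count.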

Finally, just bear in mind that it's close - you're pretty much at the cross-over point for your applications. If the difference were between buying 5 nodes with gigE or 4 nodes with IB, it's hard to say which would be best. Hope that's all clear, and if you aren't buying for another 5-6 weeks, I have some hardware coming in soon that'll have both IB and gigE links, and I can do an apples-to-apples comparison for you. Just let me know.

Cheers,
- Brian

(PS. Heh, sorry for the length of this!)
 

kshell

New Member
Mark and Brian,

Thanks for the input. I'd been leaning towards gigE, but I hadn't seen the cheap SDR Infiniband numbers before; that might change the tipping point somewhat. I just got access to a cluster with gigE and Infiniband here at Oregon State, and I'll be doing some benchmarks on it. I'll post the results when I have something useful.

-Karen
 

kshell

New Member
Just to follow up: I ended up buying an 18-node cluster, with 2 quad-core Intel chips (E5420) per node, plus a head node of the same configuration. I went with IB rather than gigE, since the benchmarks indicated a significant speed-up with IB when using even just a few nodes. I'm using Rocks for cluster management and the Intel compilers. So far, I've been happy with it. I'll post the compiler options for CCSM in another thread.

-Karen
 