
linux cluster success stories?

Even though the model still does not run on the Linux cluster here at Yale, I am still clinging to some small bit of hope that it will run once I fix some small error that I have overlooked thus far. Right now it appears that CAM3.0 does not like the MPICH that is currently installed, which is version 1.2.6.

If anybody has successfully run CAM3.0 on a Linux cluster, could you please email me and tell me what version of MPICH you are using?
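
(For reference, one quick way to double-check which MPICH an install actually provides is the mpichversion utility that ships with MPICH 1.2.x, assuming it was installed; the prefix below is only an example.)

# Print the MPICH version string (adjust the install prefix to your site)
/usr/local/mpich/bin/mpichversion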

Thanks,
Cathy
 
Hope this isn't premature, but our port/install went fairly well. Over the last few days we assembled a new Linux cluster and got CAM 3.0p1 up and running relatively painlessly.

Our initial set up is:
8 P4/3.06GHz nodes, 1GB RAM, 40GB local drives; 2x500GB HDD and 2GB RAM on the master node.
Gigabit Ethernet (Netgear GS116 switch)
Fedora Core 3, full install on master, just basics on the nodes.
Lahey Fortran 6.2 Pro.
mpich-1.2.5 (distributed on the Lahey CD)

Notes/gotchas:
1) The default security settings for Fedora can be a pain - we had to edit the pam.d configuration, install and turn on rsh, and make the C shell the default. We wrote a script to do that on each node (a rough sketch follows this list).
2) Had some library problems until we installed all the Lahey libraries on every node (not just the run-time libraries).
3) For MPICH, we had to comment out a couple of the external declarations in mpif.h, as noted in that file, to keep the Lahey compiler from bombing on missing external subroutines (also sketched below).
4) Ultimately used the -tp4 and -O2 options for the compile.
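
As a rough illustration of note 1, a per-node script along these lines should cover it. This is a sketch, not our exact script: package and service names are from a Fedora Core 3-era setup, the username is only an example, and the exact pam.d and hosts.equiv edits depend on local security policy.

# Run as root on each compute node.
yum -y install rsh rsh-server tcsh   # rsh client/server plus the C shell
chkconfig rsh on                     # enable rsh under xinetd
chkconfig rlogin on                  # and rlogin, if needed
service xinetd restart
# pam.d (and /etc/hosts.equiv) also need editing so rsh works without a
# password between nodes; the details depend on local policy.
usermod -s /bin/tcsh camuser         # make the C shell the default for the model user (example name)

For note 3, the fix amounts to commenting out the EXTERNAL declarations that the comments inside mpif.h point to. Which symbols are involved can vary between MPICH versions, so treat the following as an illustration only:

cp mpif.h mpif.h.orig
# Prefix the offending EXTERNAL line(s) with a Fortran comment character, e.g.:
sed -i 's/^\( *EXTERNAL  *MPI_WTIME.*\)/C\1/' mpif.h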

For the perturbation growth test, we ran once with the compile options in the supplied build scripts and once with -tp4 -O2. The second was closer to the NCAR IBM T31 run - I can post links to the graphs if anybody is interested. We ran that test standalone, then on 2, 4, 6, and 8 nodes and compared the output to ensure identical results (a sketch of the launch step follows the timings). Run times for that test:

Stand alone (no -spmd): 7m 49s
2 nodes: 4m 13s
4 nodes: 2m 37s
6 nodes: 1m 58s
8 nodes: 1m 49s
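
For anybody who wants to reproduce the scaling test, the launch step looked roughly like the sketch below (MPICH-1 style mpirun; the install prefix, machine file, and node names are examples, not our exact commands - the build itself just used the scripts supplied with CAM 3.0 plus the -tp4 -O2 options from note 4):

# One hostname per line, one MPI task per compute node (names are examples)
cat > ~/machines <<EOF
node1
node2
node3
node4
EOF
# Launch the SPMD build of CAM on 4 nodes; for the standalone timing we
# built without -spmd and ran the executable directly.
/usr/local/mpich/bin/mpirun -np 4 -machinefile ~/machines ./cam < namelist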

At T31, the 8-node config takes a bit less than 1 minute per model day for longer runs.
We just finished some multi-year runs, and they seem to be OK; we haven't finished formally comparing them with the control runs yet. Next is to add the rest of the nodes (another 8, for 16 total). We will also probably try the Intel compiler - on some of our in-house CFD codes, it gives as much as a 5% improvement in run times compared with Lahey.

Regards,

Chuck
 