
Porting CCSM3 to a Linux/AMD Opteron cluster

Hello,

I am trying to port CCSM3 to a Linux cluster based on AMD Opteron processors using PGI compilers. I have managed to build the software, and it runs for some time (about 5 minutes) on 28 nodes before it stops with a segmentation violation. The output to stdout/stderr just before it stops is:

(cpl_bundle_copy) WARNING: bundle aoflux_o has accum count = 0
(flux_atmOcn) FYI: this routine is not threaded
print_memusage iam 4 stepon after dynpkg. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= 21282 21280 859 5957 0
... (the above two lines are repeated 12 times, one time from each of the cam allocated nodes)
--- mpimon --- Aborting run after process-3 terminated abnormally Childprocess 873 got signal SIGSEGV(11): segmentation violation ---

I am not sure how to track down this error, and I hope you could give me some advice.

Best regards,

Egil Støren
The Norwegian Meteorological Institute
Norway
 

gcarr@ucar_edu

New Member
We are working on getting CCSM3 to run on Opteron clusters. We now have one at NCAR which will facilitate our work.

There have been two specific kinds of issues to date: compiler version and Myrinet version/configuration. There may also be an additional complication that we have not yet fleshed out involving 32-bit vs 64-bit mode, as the Opteron supports both.

The Opteron cluster that I previously worked on was running in 32-bit mode with PGI 5.1-6 and Myrinet (mpich-gm).

NOTE: If you are running over an Ethernet network rather than Myrinet, source code changes are needed to support the p4 driver in mpich, and these are not yet in the repository.

With regard to mpich-gm: with mpich-1.2.5..12, Myricom added "--enable-sharedlib" to their suggested build for the PGI compilers in mpich.make.pgi. We needed to delete this flag to get things to run. With the older mpich-1.2.5..10 this was not an issue.

Our work to date has shown that no code changes are needed to get CCSM3 to build and run using PGI 5.1-6. However, this build did not pass the restart tests.

Work with the newer 5.2-4 compiler has started, as has work in 64-bit mode, but results are still some way off. We are also going to test the Pathscale compiler and may test other options. No work on running with OpenMP has been done.

There have been no build changes (except for explicit path names). So you should be able to use the "jazz" files in the scripts/cc*utils/Machines subdirectory to create new files for your machine with appropriate changes for your machine name, file paths, batch method, archive scripts, etc. In addition, you may want to change the models/bld/Macros.Linux file if you wish to specify the netcdf path as was done for "jazz". A simple modification in scripts/ccsm_utils/Tools/check_machine is needed to add your machine name. This should be enough to get things started.

More will be posted and tagged in the repository when we get it working on our NCAR machine "lightning".
 
Thanks for the information on your work with Opteron clusters. I think at least I have managed to define my machine (blizzard) into the system, and built the software for a T31_gx3v5 case with no complaints from the compiler.

The problem I mentioned in my previous mail (segmentation violation) occurred in the CSIM program at the first call to subroutine construct_fields in ice_transport_remap.F. I was able to eliminate this problem by removing the -Mrecursive compiler option when compiling this file. But unfortunately a similar problem occurred later in the run, this time in the coupler.

Since removing recursion was so successful, I am inclined to remove it for all Fortran compilations. But I am not sure if recursion is in fact used in the software. If so, I might introduce errors that are not easy to track down. The recursion compiler option is set in the Macros.Linux file, and I suppose it is put there for a reason. Do you know if it is safe to turn off recursion? If recursion is in fact used, is it possible to produce a list of files where this is the case, so that I could turn on the recursion option only for those files?
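
One way to approach the "list of files" question is to scan the source tree for procedures declared with the "recursive" keyword, which Fortran 90 requires for any procedure that calls itself. The following is only a rough sketch in Python: the function names, the set of file suffixes, and the rule that .f/.F files are fixed form are all assumptions made for illustration, not something taken from the CCSM3 build. Note also that PGI's -Mrecursive changes where local variables are stored (on the stack), so an empty result is a hint rather than proof that the option can be dropped safely.

```python
import re
from pathlib import Path

# A procedure statement with the RECURSIVE prefix, for example
# "recursive subroutine walk(n)" or "integer recursive function f(n)".
RECURSIVE_RE = re.compile(r"\brecursive\b.*\b(?:subroutine|function)\b", re.IGNORECASE)

# Assumed suffixes; adjust to whatever the source tree actually uses.
# Treating .f/.F as fixed form is also an assumption.
FIXED_FORM = {".f", ".F"}
FREE_FORM = {".f90", ".F90"}


def declares_recursion(line, fixed_form=False):
    """Return True if a source line declares a RECURSIVE procedure."""
    # Fixed-form comment lines start with c, C, or * in column 1.
    if fixed_form and line[:1] in ("c", "C", "*"):
        return False
    code = line.split("!", 1)[0]  # drop any trailing "!" comment
    return RECURSIVE_RE.search(code) is not None


def files_declaring_recursion(root):
    """Return a sorted list of files under root with a RECURSIVE declaration."""
    hits = []
    for path in Path(root).rglob("*"):
        if not path.is_file() or path.suffix not in (FIXED_FORM | FREE_FORM):
            continue
        fixed = path.suffix in FIXED_FORM
        lines = path.read_text(errors="replace").splitlines()
        if any(declares_recursion(line, fixed) for line in lines):
            hits.append(str(path))
    return sorted(hits)
```

Calling files_declaring_recursion("models") from the top of the source tree would return the candidate files, which could then keep the recursion option while it is removed elsewhere.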

Best regards,

Egil
 

gcarr@ucar_edu

New Member
Even on our Xeon clusters we have had difficulty getting correct results with the PGI compiler, depending on the compiler options used. If you check the files and scripts for the "jazz" machine you will see that we use a very limited number of options. There are known problems with "fast", and there may be issues with other options such as the "sse" options. We do use the -Mrecursive option with PGI 5.1-3 and 5.1-6 on our Xeon clusters. We are still working on our Opteron cluster.
 
We have now managed to run the CCSM3 model on our Linux cluster based on AMD Opteron processors. We have also successfully run the recommended validation tests described in ch. 7 of the user guide.

For the benefit of others who may struggle with problems similar to those we encountered, I will briefly summarise our experiences:

The problems with segmentation violation were solved when we discovered that the default stack size on our CPUs was set too low. For various reasons, we use mpimon from Scali as a substitute for mpirun from mpich. In the $CASE.$MACH.run script we used the following commands to start up the executables on the different nodes:

limit stacksize unlimited
mpimon -inherit_limits ...
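
To confirm that a process started on a compute node really sees the raised limit, a small check like the one below can be run through the same launcher and with the same limits as the model. This is a minimal sketch that assumes Python is available on the nodes; the function names are made up for illustration. It only reads the limits of the current process with the standard resource module and changes nothing.

```python
import resource


def format_limit(value):
    """Render an rlimit value (in bytes) as text, with 'unlimited' for no limit."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value // 1024} kB"


def describe_stack_limit():
    """Return the soft and hard stack limits of the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    return f"stack soft={format_limit(soft)} hard={format_limit(hard)}"
```

Printing describe_stack_limit() from each node shows whether the soft limit is "unlimited" there or still at a small system default.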

When running the tests (using the create_test script) we had to make a small change in the script in order to avoid problems with overly long file paths:

The line:
set casebase = T$testcase.$grid.$compset.$mach

was substituted with:
set casebase = T$testcase

Best regards,

Egil
 