Compiling CCSM pop with IBM XLF11 compiler and -qhot option

njn01 · May 29, 2008

We recently became aware that the IBM XLF11 compiler miscompiles the pop2 code when the -qhot option is used. The following summary, written by Keith Lindsay, describes what is currently known about the problem.

Notes on IBM XLF11 Compilation of CCSM pop Code with the -qhot Option
=====================================================================

All of my tests have been conducted with the ccsm3_5_beta20 tag,
in gx3v5 ocean-only (pop2) cases.

I have not run any tests on any pre-ccsm3_5 code, i.e., code that
is based on pop1.4. So I cannot say if older, pop1.4-based code is
being compiled correctly or not when -qhot is used with xlf11. In
the transition from pop1.4 to pop2, every line of code changed. Many
arrays gained a new block dimension. So I would think that from
the compiler's point of view, pop1.4 and pop2 are completely different
pieces of code, and conclusions about pop2 compilation would not
apply to the pop1.4 code.

If you've run cases on bluevista that compiled ocean code with xlf11
and -qhot, I strongly advise you to rerun a portion without -qhot
and compare the results carefully to the run with -qhot.

One approach is to do three very short branch runs, generating
pop tavg files every timestep:
A) without -qhot
B) without -qhot, with the barotropic convergence
criterion reduced by a factor of 10
C) with -qhot

If the divergence between C) and A) does not grow faster than the
divergence between B) and A), then your existing bluevista runs are
probably fine.
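
The comparison above can be sketched in a few lines of Python. This is only an illustration, not a tool that was used in these tests. It assumes the per-timestep tavg fields from runs A, B and C have already been read into plain lists of floats (reading the files is not shown). The function names and the `tolerance` factor are hypothetical, and RMS difference is just one possible measure of divergence.

```python
import math


def rms_difference(field_x, field_ref):
    """Root-mean-square difference between two equally sized flat fields."""
    if len(field_x) != len(field_ref):
        raise ValueError("fields must have the same number of points")
    total = sum((x - r) ** 2 for x, r in zip(field_x, field_ref))
    return math.sqrt(total / len(field_x))


def divergence_series(run_x, run_ref):
    """RMS difference at each timestep; each run is a list of flat fields."""
    return [rms_difference(x, r) for x, r in zip(run_x, run_ref)]


def qhot_looks_suspect(run_a, run_b, run_c, tolerance=10.0):
    """Return True if C-vs-A exceeds B-vs-A by more than `tolerance`.

    run_a: without -qhot
    run_b: without -qhot, tighter barotropic convergence criterion
    run_c: with -qhot
    `tolerance` is an assumed safety factor, not a value from the tests.
    """
    baseline = divergence_series(run_b, run_a)   # benign perturbation growth
    candidate = divergence_series(run_c, run_a)  # growth caused by -qhot
    return any(c > tolerance * b for c, b in zip(candidate, baseline))
```

A result of False only says that, for the field and timesteps examined, the -qhot run stays within the chosen factor of the baseline. It does not prove that the compilation is correct.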

Another important finding, from a reduced-size test case that miscompiles,
is that the miscompilation happens for some array sizes and not for
others. So if you find that you're comfortable with results using -qhot
in a particular configuration, you should not conclude that results
are safe in cases in which there are changes to your configuration
that affect array sizes, such as changing PE count or resolution. This
is one of the main reasons that I don't want to personally get into
attempting to validate pop1.4 code. There are too many caveats that
could lead to a false sense of security.

More details on the tests that I have done:

I ran cases in which pop2 was compiled with and without -qhot. Both
cases ran for two days, and in each case, the ocean model wrote double
precision tavg files at every timestep.

There were big differences between these runs. One of the signals looks
like it is due to a miscompilation of tidal mixing code, which is a new
vertical mixing subparameterization in ccsm3_5. The model's mixing is
happening incorrectly near the sea floor, which generates a surface
signal near shallow topography. There may be other miscompilations
going on, but once I found such a large signal, I didn't look deeper.

I also have some longer runs, in the same configurations, that show
that removing -qhot does affect the large-scale circulation in a
non-trivial way.

In the configuration that I was running, xlf10 and xlf11 generate
identical results when -qhot is not used.

Keith Lindsay
