jsewall@es_ucsc_edu
New Member
We are trying (and sort of have managed) to get CCSM3 (version 3.0 beta22, to be exact) running here in the Netherlands on an SGI Origin 3800, a cc-NUMA machine.
I say we "sort of" have the model running (fully coupled system) because we continue to encounter strange "random" errors. Some examples:
If we try to run the ocean on fewer than four processors, there are "random" issues (the occurrence changes with the number of debugging print statements we have in the code) with the use of global_gather, which result in the ocean model reading the first two ghost cells when moving from the computational grid (with ghost cells) to the global domain (without ghost cells). This shift causes the ocean to "lose" the islands and bomb.
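For illustration, here is a minimal sketch of the kind of off-by-nghost copy we suspect; the names and sizes are made up, and this is not actual POP code:

program gather_shift
  ! Sketch only: a local block with nghost ghost cells on each side,
  ! gathered into a global array that has none.
  implicit none
  integer, parameter :: nghost = 2, nx = 8, ny = 8
  real(8) :: local(1-nghost:nx+nghost, 1-nghost:ny+nghost)
  real(8) :: global(nx, ny)
  local = 0.0d0
  ! Suspected bug: the copy starts at the first ghost cell, so every
  ! field lands shifted by nghost points and the island/land masks
  ! move off their true locations.
  global(1:nx, 1:ny) = local(1-nghost:nx-nghost, 1-nghost:ny-nghost)
  ! Correct: copy only the interior (physical) points.
  global(1:nx, 1:ny) = local(1:nx, 1:ny)
end program gather_shift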
The ice model will sometimes not run on more than one processor. I don't know why, but on one processor it works, while on two or four it bombs; but even then, the entire system will eventually bomb. With careful setting of MPI environment variables (I think; this is out of my league), CSIM will run on four processors and the entire system works much better.
Our total distribution is over 22 processors (the only configuration we have found that will both run and restart): atmosphere on 8, ocean on 8, ice on 4, and land and coupler on 1 each.
The model will build and run only if the following compiler flags and MPI environment variables are set (and only these! other combinations cause the islands to disappear, the land fraction between CLM and CAM to disagree, etc.), together with the processor distribution outlined above:
MPI env vars:
setenv _DSM_WAIT SPIN
setenv MPI_OPENMP_INTEROP
setenv MPI_DSM_VERBOSE
setenv MPI_STATIC_NO_MAP
setenv TRAP_FPE "UNDERFL=FLUSH_ZERO; OVERFL=ABORT,TRACE; DIVZERO=ABORT,TRACE"
setenv OMP_DYNAMIC FALSE
setenv MPC_GANG OFF
setenv _DSM_VERBOSE ON
setenv _DSM_PLACEMENT ROUND_ROBIN
setenv MPI_BUFS_PER_HOST 512
setenv MPI_BUFS_PER_PROC 1024
Compiler flags:
CPPDEFS := -DIRIX64 -DSGI
CC := cc
CFLAGS := -c -64
FIXEDFLAGS :=
FREEFLAGS :=
FC := f90
FFLAGS := -c -64 -mips4 -O2 -r8 -i4 -show -extend_source
MOD_SUFFIX := mod
LD := $(FC)
LDFLAGS := -64 -mips4 -O2 -r8 -i4 -show -mp
We are running hybrid OpenMP/MPI with CAM. We have now moved to trying the entire system with MPI only.
In an attempt to debug some of our problems with an actual debugger, we have tried compiling with -g and other options. Unfortunately, the model will not run when compiled that way; i.e., in attempting to debug, you end up debugging errors that don't occur when you aren't debugging.
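One workaround that survives optimization (a sketch with made-up names, not anything from the CCSM source) is to hand-roll the subscript checks, so the test runs in exactly the -O2 build that fails:

subroutine check_index(i, lo, hi, name)
  ! Abort with a message if subscript i is outside [lo, hi].  Unlike a
  ! -g build, this check behaves the same in the optimized executable.
  implicit none
  integer, intent(in) :: i, lo, hi
  character(*), intent(in) :: name
  if (i .lt. lo .or. i .gt. hi) then
     print *, 'index out of bounds in ', name, ': i =', i, ' valid', lo, hi
     stop
  end if
end subroutine check_index

A call such as "call check_index(j, 1, ny, 'global_gather')" just before a suspect access would then report the bad subscript without changing the optimization level.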
I thought we had a stable configuration: the entire system ran and restarted out to 3 years. At 3 years, on a restart (without rebuilding, and using the same land surface file that had worked before), the land fraction in CLM and CAM suddenly disagreed.
I moved the model back a year, to a point where the land fractions certainly agreed and the model had run before, and got the same error.
The land fraction suddenly disagreeing, and the islands disappearing in the ocean at other times, both look like an array-out-of-bounds error (which we have been, and still are, looking for, but can't find, since it seems to change location). I think this has something to do with how we are distributing the model over processors and how MPI works on this machine, since the symptoms are sensitive to both. Does this sound plausible?
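To make the suspicion concrete, here is a tiny stand-alone example (made-up names, nothing to do with CLM or CAM) of how a one-element overrun can silently rewrite an adjacent field, which would explain symptoms that move around whenever the memory layout changes (extra print statements, -g, different processor counts):

program oob_demo
  ! Two arrays laid out back to back in a common block, as in many
  ! older model codes.  Writing one element past 'work' lands in
  ! 'landfrac', with no error in an unchecked -O2 build.
  implicit none
  real(8) :: work(10), landfrac(10)
  common /state/ work, landfrac
  integer :: i
  landfrac = 1.0d0
  do i = 1, 11            ! loop runs one element too far
     work(i) = 0.0d0      ! i = 11 overwrites landfrac(1)
  end do
  print *, 'landfrac(1) =', landfrac(1)   ! prints 0.0, not 1.0
end program oob_demo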
I am hoping someone can shed some light (MPI environment variables, OpenMP vs. MPI, compiler flags, processor distributions we are missing) so that we can get the model running stably on this architecture.
Thanks for any help,
Jake