sang-ki_lee@noaa_gov
New Member
Our lab bought a new Linux machine: 32 CPUs (Quad-Core AMD Opteron(tm) Processor 8382,
2612.051 MHz) with 132 GB of shared memory in total. The motherboard has 8 blades, each
containing 4 CPUs. CCSM3 compiles on the new machine with mpich2 (1.1.1) without any error
messages (ifort and gcc were used to build both CCSM3 and the MPI libraries).
The CCSM3 run is executed interactively (no batch system) for T31_gx3v5, compset B:
/usr/local/mpich2/bin/mpirun -np 4 cpl : -np 2 csim : -np 4 clm : -np 4 pop : -np 8 cam
I get the following error messages shortly after the model run starts:
Fatal error in MPI_Comm_dup: Invalid communicator, error stack:
MPI_Comm_dup(167): MPI_Comm_dup("comm=0x0", new_comm=0x9f7758) failed
MPI_Comm_dup(95).: Invalid communicator
Fatal error in MPI_Comm_dup: Invalid communicator, error stack:
MPI_Comm_dup(167): MPI_Comm_dup(comm=0x0, new_comm=0x357ec98) failed
MPI_Comm_dup(95).: Invalid communicator
......
rank 21 in job 17 phodmod.aoml.noaa.gov_51674 caused collective abort of all ranks
My Macros.Linux looks like this:
INCLDIR := -I$(INCROOT) -I/usr/local/include -I/usr/local/mpich2/include
SLIBS := -L/usr/local/lib -lnetcdf -L/usr/local/mpich2/lib -lmpich
ULIBS := -L$(LIBROOT) -lesmf -lmct -lmpeu -lmph
CPP := NONE
CPPFLAGS := -DLINUX -DPGF90 -DNO_SHR_VMATH
CPPDEFS := -DLINUX
CC := /usr/local/mpich2/bin/mpicc
CFLAGS := -c
FIXEDFLAGS :=
FREEFLAGS := -FR
FC := /usr/local/mpich2/bin/mpif90
FFLAGS := -c -r8 -i4 -extend_source -assume byterecl
MOD_SUFFIX := mod
LD := $(FC)
LDFLAGS := -L/usr/lib64 -lrdmacm -libverbs -libumad -lpthread
AR := ar
ifeq ($(MODEL),pop)
CPPDEFS := $(CPPDEFS) -DPOSIX -Dimpvmix -Dcoupled -DNPROC_X=$(NX) -DNPROC_Y=$(NY)
FIXEDFLAGS := -convert big_endian
endif
ifeq ($(MODEL),csim)
CPPDEFS := $(CPPDEFS) -Dcoupled -DNPROC_X=$(NX) -DNPROC_Y=$(NY) -D_MPI
FIXEDFLAGS := -convert big_endian
endif
ifeq ($(THREAD),TRUE)
CPPDEFS := $(CPPDEFS) -D_OPENMP -DTHREADED_OMP
FREEFLAGS := $(FREEFLAGS) -mp
LDFLAGS := $(LDFLAGS) -mp
endif
ifeq ($(DEBUG),TRUE)
endif
I also compiled CCSM3 with openmpi (1.3.3) and launched it the same way:
/usr/local/bin/mpirun -np 4 cpl : -np 2 csim : -np 4 clm : -np 4 pop : -np 8 cam
The error message in this case looks like this:
(main) -------------------------------------------------------------------------
(main) contract init: establish domain & router for lnd
(main) -------------------------------------------------------------------------
(cpl_contract_init) cpl-recv-lnd
[phodmod:04785] *** Process received signal ***
[phodmod:04785] Signal: Segmentation fault (11)
[phodmod:04785] Signal code: Address not mapped (1)
[phodmod:04785] Failing at address: 0x1a254eb9b0
.....
[phodmod:04783] [ 0] /lib64/libpthread.so.0 [0x3905a0e4c0]
[phodmod:04783] [ 1] clm(decompmod_mp_initdecomp_+0x2257) [0x4f50c7]
[phodmod:04783] [ 2] clm(initializemod_mp_initialize_+0x342) [0x536f82]
[phodmod:04783] [ 3] clm(MAIN__+0x8b) [0x58539b]
[phodmod:04783] [ 4] clm(main+0x3c) [0x423ddc]
[phodmod:04783] [ 5] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3904e1d974]
[phodmod:04783] [ 6] clm [0x423ce9]
[phodmod:04783] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 9 with PID 4785 on node phodmod.aoml.noaa.gov exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
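To make the rank numbers in the two error messages easier to read, here is a small sketch that maps an MPI rank back to a component. It assumes that ranks are handed out in the order the executables appear on the mpirun line, which I have not verified on this machine; the names LAYOUT, rank_ranges and component_of are only for illustration and are not part of CCSM3 or either MPI library.

```python
# Component layout copied from the mpirun line: (executable, process count).
LAYOUT = [("cpl", 4), ("csim", 2), ("clm", 4), ("pop", 4), ("cam", 8)]


def rank_ranges(layout):
    """Return {name: (first_rank, last_rank)}.

    Assumption: ranks are assigned contiguously, in command-line order.
    """
    ranges = {}
    start = 0
    for name, count in layout:
        ranges[name] = (start, start + count - 1)
        start += count
    return ranges


def component_of(rank, layout):
    """Return the executable name that owns the given rank."""
    for name, (first, last) in rank_ranges(layout).items():
        if first <= rank <= last:
            return name
    raise ValueError(f"rank {rank} is outside the launched job")


if __name__ == "__main__":
    for name, (first, last) in rank_ranges(LAYOUT).items():
        print(f"{name}: ranks {first}-{last}")
    # Ranks named in the two error messages.
    for rank in (21, 9):
        print(f"rank {rank} -> {component_of(rank, LAYOUT)}")
```

Under that assumption the job has 22 processes in total, rank 21 (the mpich2 abort) is the last cam process, and rank 9 (the openmpi segmentation fault) is a clm process, which agrees with the clm frames in the backtrace.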
Any help would be appreciated.
Thank you.
Sang-Ki