CCSM3 on 32 CPUs (Quad-Core AMD Opteron)

Our lab bought a new Linux machine: 32 CPUs (2612.051 MHz each), Quad-Core AMD Opteron(tm)
Processor 8382, with 132 GB of total shared memory. The motherboard has 8 blades, each
containing 4 CPUs. CCSM3 compiles on the new machine with mpich2 (1.1.1) without error
messages (ifort and gcc were used to build both CCSM3 and the MPI libraries).
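
As a sanity check on the toolchain, the MPICH2 compiler wrappers can report exactly which underlying compilers and flags they invoke (a quick sketch; the paths match the install above):

# Show what each wrapper actually calls (should report ifort / gcc)
/usr/local/mpich2/bin/mpif90 -show
/usr/local/mpich2/bin/mpicc -show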

The CCSM3 run is executed interactively (no batch system) for the T31_gx3v5 compset B:
/usr/local/mpich2/bin/mpirun -np 4 cpl : -np 2 csim : -np 4 clm : -np 4 pop : -np 8 cam
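
Before the coupled run, the launcher itself can be smoke-tested in the same 22-rank MPMD layout with a trivial program (hostname here just stands in for the model executables):

# Verify the MPMD launch mechanics independently of CCSM3
/usr/local/mpich2/bin/mpirun -np 4 hostname : -np 2 hostname : -np 4 hostname : -np 4 hostname : -np 8 hostname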

I get the following error messages shortly after the run starts:

Fatal error in MPI_Comm_dup: Invalid communicator, error stack:
MPI_Comm_dup(167): MPI_Comm_dup(comm=0x0, new_comm=0x9f7758) failed
MPI_Comm_dup(95).: Invalid communicator
Fatal error in MPI_Comm_dup: Invalid communicator, error stack:
MPI_Comm_dup(167): MPI_Comm_dup(comm=0x0, new_comm=0x357ec98) failed
MPI_Comm_dup(95).: Invalid communicator
......
rank 21 in job 17 phodmod.aoml.noaa.gov_51674 caused collective abort of all ranks
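
A known cause of MPI_Comm_dup failing with comm=0x0 is a mismatch between the mpif.h seen at compile time and the MPI library linked at build time. If the MPI libraries were built shared, the link can be checked per component (the binary names are taken from the run command above):

# Confirm all five components resolve to the same libmpich
for exe in cpl csim clm pop cam; do
    echo "== $exe =="; ldd ./$exe | grep -i mpi
done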

My Macros.Linux looks like this:

INCLDIR := -I$(INCROOT) -I/usr/local/include -I/usr/local/mpich2/include
SLIBS := -L/usr/local/lib -lnetcdf -L/usr/local/mpich2/lib -lmpich
ULIBS := -L$(LIBROOT) -lesmf -lmct -lmpeu -lmph
CPP := NONE
CPPFLAGS := -DLINUX -DPGF90 -DNO_SHR_VMATH
CPPDEFS := -DLINUX
CC := /usr/local/mpich2/bin/mpicc
CFLAGS := -c
FIXEDFLAGS :=
FREEFLAGS := -FR
FC := /usr/local/mpich2/bin/mpif90
FFLAGS := -c -r8 -i4 -extend_source -assume byterecl
MOD_SUFFIX := mod
LD := $(FC)
LDFLAGS := -L/usr/lib64 -lrdmacm -libverbs -libumad -lpthread
AR := ar
ifeq ($(MODEL),pop)
CPPDEFS := $(CPPDEFS) -DPOSIX -Dimpvmix -Dcoupled -DNPROC_X=$(NX) -DNPROC_Y=$(NY)
FIXEDFLAGS := -convert big_endian
endif
ifeq ($(MODEL),csim)
CPPDEFS := $(CPPDEFS) -Dcoupled -DNPROC_X=$(NX) -DNPROC_Y=$(NY) -D_MPI
FIXEDFLAGS := -convert big_endian
endif
ifeq ($(THREAD),TRUE)
CPPDEFS := $(CPPDEFS) -D_OPENMP -DTHREADED_OMP
FREEFLAGS := $(FREEFLAGS) -mp
LDFLAGS := $(LDFLAGS) -mp
endif
ifeq ($(DEBUG),TRUE)
endif
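
Note that INCLDIR above lists -I/usr/local/include before -I/usr/local/mpich2/include. If a stale mpi.h or mpif.h from another MPI installation is sitting in /usr/local/include, it gets picked up first and can produce exactly this kind of invalid-communicator failure. A quick way to check for competing headers:

# Any hits in /usr/local/include would shadow the mpich2 headers
ls -l /usr/local/include/mpi.h /usr/local/include/mpif.h 2>/dev/null
ls -l /usr/local/mpich2/include/mpi.h /usr/local/mpich2/include/mpif.h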

I also compiled CCSM3 with openmpi (1.3.3) and launched it the same way:
/usr/local/bin/mpirun -np 4 cpl : -np 2 csim : -np 4 clm : -np 4 pop : -np 8 cam

This time the error message looks like this:

(main) -------------------------------------------------------------------------
(main) contract init: establish domain & router for lnd
(main) -------------------------------------------------------------------------
(cpl_contract_init) cpl-recv-lnd
[phodmod:04785] *** Process received signal ***
[phodmod:04785] Signal: Segmentation fault (11)
[phodmod:04785] Signal code: Address not mapped (1)
[phodmod:04785] Failing at address: 0x1a254eb9b0
.....
[phodmod:04783] [ 0] /lib64/libpthread.so.0 [0x3905a0e4c0]
[phodmod:04783] [ 1] clm(decompmod_mp_initdecomp_+0x2257) [0x4f50c7]
[phodmod:04783] [ 2] clm(initializemod_mp_initialize_+0x342) [0x536f82]
[phodmod:04783] [ 3] clm(MAIN__+0x8b) [0x58539b]
[phodmod:04783] [ 4] clm(main+0x3c) [0x423ddc]
[phodmod:04783] [ 5] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3904e1d974]
[phodmod:04783] [ 6] clm [0x423ce9]
[phodmod:04783] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 9 with PID 4785 on node phodmod.aoml.noaa.gov exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
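
Since the crash is inside clm's initdecomp, one cheap thing to try with ifort-built executables is raising the stack limit before launching, because ifort places large automatic arrays on the stack (this is a single shared-memory node, so the shell limit applies to all ranks):

# Raise the stack limit, then relaunch the same OpenMPI command
ulimit -s unlimited
/usr/local/bin/mpirun -np 4 cpl : -np 2 csim : -np 4 clm : -np 4 pop : -np 8 cam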

Any help would be appreciated.

Thank you.

Sang-Ki
 
There are a total of 8 sockets (not blades), each containing
a quad-core chip, for a total of 32 cores. Blades would imply a cluster,
and that is not the case here.
 