Trouble porting CESM to a generic_IBM cluster

Dear all,

We recently managed to compile CESM on a generic_IBM cluster called MareNostrum (Barcelona Supercomputing Center). To build the model successfully, I followed your advice to use MPI-2 libraries. Unfortunately, we are still having trouble getting cesm-1.0.3 to run: this time we get a run-time error.

We use the XLF compiler and the 64-bit version of the 1.3.1..9 MPICH2 module. To get the code to compile correctly, we deleted the "disable-mpi2" flag from PIO_CONFIG_OPTS and commented out lines 112-114 of the Makefile in the $CASEROOT/Tools folder (the option SLIBS += -L$(LIB_MPI) -lmpi, which is already supplied by the mpif90 wrapper).
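
For reference, the two edits amount to something like the following. This is only a rough sketch: the sed commands, the Macros file name, and the line numbers are assumptions based on our own setup and may differ in yours.

# 1) drop the MPI-2 switch (assumed to appear as --disable-mpi2) wherever
#    PIO_CONFIG_OPTS is set; we edited the case Macros file by hand
sed -i 's/--disable-mpi2//' $CASEROOT/Macros.*

# 2) comment out the explicit MPI link line (former lines 112-114 of Tools/Makefile),
#    since the mpif90 wrapper already adds -L$(LIB_MPI) -lmpi at link time
sed -i '112,114 s/^/#/' $CASEROOT/Tools/Makefile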

Unfortunately, the resulting binary cannot be executed on the MareNostrum machine due to some MPI-related errors. More specifically, the run script fails at the "mpiexec" command: after a few lines in the ccsm.log file, the execution aborts with the following error:

(seq_comm_setcomm) initialize ID ( 7 GLOBAL ) pelist = 0 63 1 ( npes = 64) ( nthreads = 1)
(seq_comm_setcomm) initialize ID ( 2 ATM ) pelist = 0 63 1 ( npes = 64) ( nthreads = 1)
(seq_comm_setcomm) initialize ID ( 1 LND ) pelist = 0 63 1 ( npes = 64) ( nthreads = 1)
(seq_comm_setcomm) initialize ID ( 4 ICE ) pelist = 0 63 1 ( npes = 64) ( nthreads = 1)
(seq_comm_setcomm) initialize ID ( 5 GLC ) pelist = 0 63 1 ( npes = 64) ( nthreads = 1)
(seq_comm_setcomm) initialize ID ( 3 OCN ) pelist = 0 63 1 ( npes = 64) ( nthreads = 1)
(seq_comm_setcomm) initialize ID ( 6 CPL ) pelist = 0 63 1 ( npes = 64) ( nthreads = 1)
(seq_comm_joincomm) initialize ID ( 8 CPLATM ) join IDs = 6 2 ( npes = 64) ( nthreads = 1)
(seq_comm_joincomm) initialize ID ( 9 CPLLND ) join IDs = 6 1 ( npes = 64) ( nthreads = 1)
(seq_comm_joincomm) initialize ID ( 10 CPLICE ) join IDs = 6 4 ( npes = 64) ( nthreads = 1)
(seq_comm_joincomm) initialize ID ( 11 CPLOCN ) join IDs = 6 3 ( npes = 64) ( nthreads = 1)
(seq_comm_joincomm) initialize ID ( 12 CPLGLC ) join IDs = 6 5 ( npes = 64) ( nthreads = 1)

[s16c5b01:5068] *** An error occurred in MPI_Gather
[s16c5b01:5068] *** on communicator MPI COMMUNICATOR 5 CREATE FROM 0
[s16c5b01:5068] *** MPI_ERR_TYPE: invalid datatype
[s16c5b01:5068] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)

We also tried to compile the code with the brand-new OpenMPI libraries, but we got a similar error at runtime.

At first sight, one could interpret the error as an inconsistency in the MPI libraries, i.e. some routines compiled against plain MPI and others against MPI-2. However, a look at all the respective build logs reveals that every module is compiled correctly with the MPICH-2 libraries and with the same mpif90 wrapper (i.e. the full path is consistent throughout).
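
For completeness, this is roughly how we checked it (a sketch only; the names and location of the *.bldlog.* build logs depend on the case setup):

# list every distinct compiler command that appears in the component build logs;
# a single mpif90 wrapper path should show up
grep -h mpif90 *.bldlog.* | awk '{print $1}' | sort -u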

After an extensive review of the code, the BSC staff told me that there are a few conflicting directives in the mpeu source code. If we understand it correctly, in B1850WCN/mct/mpeu there is some sort of interface which should translate the MP_INTEGER, MP_REAL, ... variables into MPI_INTEGER, MPI_REAL, ...

It seems plausible that the error on this particular machine is due to a wrong translation of the above-mentioned variables into MPI datatypes, which could cause the invalid-datatype runtime error. I am currently running this model on Finisterrae, but there the code was compiled with ifort (not with xlf as on MareNostrum), and that machine is very different.

Have you encountered a similar runtime issue on your machine?

Is there perhaps some additional flag we could add to prevent this runtime issue from occurring?

The MPI environment variables are set as follows:

setenv MP_RC_USE_LMC yes
setenv LAPI_DEBUG_RC_WAIT_ON_QP_SETUP yes
setenv MP_INFOLEVEL 2
setenv MP_EUIDEVICE sn_all
setenv MP_SHARED_MEMORY yes
setenv LAPI_USE_SHM yes
setenv MP_EUILIB mx
#setenv MP_EAGER_LIMIT 32k
setenv MP_BULK_MIN_MSG_SIZE 64k
setenv MP_POLLING_INTERVAL 20000000
setenv MEMORY_AFFINITY MCM
setenv LAPI_DEBUG_ENABLE_AFFINITY YES
setenv LAPI_DEBUG_BINDPROC_AFFINITY YES
setenv MP_SYNC_QP YES
setenv MP_RFIFO_SIZE 16777216
setenv MP_SHM_ATTACH_THRESH 500000
setenv MP_EUIDEVELOP min
setenv MP_USE_BULK_XFER yes
setenv MP_BUFFER_MEM 64M

The compilation commands in the log files read as follows:

MCT:

mpif90 -c /gpfs/apps/OPENMPI/1.5.3/XLC/64/include/ -WF,-DSYSLINUX,-DCPRXLF -O2 -qarch=auto -qsuffix=f=F90:cpp=F90 m_stdio.F90
/gpfs/apps/OPENMPI/1.5.3/XLC/64/bin/mpif90 -q64 -c /gpfs/apps/OPENMPI/1.5.3/XLC/64/include/ -WF,-DSYSLINUX,-DCPRXLF -O2 -qarch=auto -qsuffix=f=F90:cpp=F90 m_stdio.F90

PIO:

/gpfs/apps/OPENMPI/1.5.3/XLC/64/bin/mpif90 -q64 -c -WF,-DMCT_INTERFACE -WF,-DHAVE_MPI -WF,-DCO2A -WF,-DAIX -WF,-DSEQ_ -WF,-DFORTRAN_SAME -O3 -qstrict -qarch=ppc970 -qtune=ppc970 -qcache=auto -q64 -g -O2 -qstrict -Q -qinitauto -WF,-DSYSLINUX,-DLINUX,-DCPRXLF -WF,-DSPMD,-DHAVE_MPI,-DUSEMPIIO,-D_NETCDF,-D_NOPNETCDF,-D_NOUSEMCT,-D_USEBOX,-DPIO_GPFS_HINTS -I/gpfs/apps/NETCDF/64/include pio_kinds.F90

CCSM:

/gpfs/apps/OPENMPI/1.5.3/XLC/64/bin/mpif90 -q64 -c -I. -I/usr/include -I/gpfs/apps/NETCDF/64/include -I/gpfs/apps/NETCDF/64/include -I/gpfs/apps/OPENMPI/1.5.3/XLC/64/include -I. -I/gpfs/projects/ucm18/ucm18119/B1850WCN/SourceMods/src.drv -I/home/ucm18/ucm18119/code/cesm1_0_3/models/drv/driver -I/gpfs/projects/ucm18/ucm18119/B1850WCN/lib/include -WF,-DMCT_INTERFACE -WF,-DHAVE_MPI -WF,-DCO2A -WF,-DAIX -WF,-DSEQ_ -WF,-DFORTRAN_SAME -O3 -qstrict -qarch=ppc970 -qtune=ppc970 -qcache=auto -q64 -g -O2 -qstrict -Q -qinitauto -qsuffix=f=f90:cpp=F90 -qfree=f90 /home/ucm18/ucm18119/code/cesm1_0_3/models/drv/driver/ccsm_driver.F90

/gpfs/apps/OPENMPI/1.5.3/XLC/64/bin/mpif90 -q64 -o /gpfs/projects/ucm18/ucm18119/B1850WCN/run/ccsm.exe ccsm_comp_mod.o ccsm_driver.o map_atmatm_mct.o map_atmice_mct.o map_atmlnd_mct.o map_atmocn_mct.o map_glcglc_mct.o map_iceice_mct.o map_iceocn_mct.o map_lndlnd_mct.o map_ocnocn_mct.o map_rofocn_mct.o map_rofrof_mct.o map_snoglc_mct.o map_snosno_mct.o mrg_x2a_mct.o mrg_x2g_mct.o mrg_x2i_mct.o mrg_x2l_mct.o mrg_x2o_mct.o mrg_x2s_mct.o seq_avdata_mod.o seq_diag_mct.o seq_domain_mct.o seq_flux_mct.o seq_frac_mct.o seq_hist_mod.o seq_rearr_mod.o seq_rest_mod.o -L/gpfs/projects/ucm18/ucm18119/B1850WCN/lib -latm -llnd -lice -locn -lglc -L/gpfs/projects/ucm18/ucm18119/B1850WCN/lib -lcsm_share -lmct -lmpeu -lpio -L/opt/ibmcmp/xlmass/5.0/lib64/ -lmassvp6_64 -L/gpfs/apps/NETCDF/64/lib -lnetcdf


We run the executable with the following command:

/gpfs/apps/OPENMPI/1.5.3/bin/mpiexec ./ccsm.exe >&! ccsm.log.$LID

Should we change some flags and recompile the model? If so, how?

Thank you!
 
I solved the problem. For some strange reason, two subroutines from the MCT/mpeu source code had been compiled with the standard mpif90 wrapper (without MPICH-2 support) during the first compilation attempts. The clean-build script did not remove their object files, so every subsequent compilation skipped these two routines: while the rest of the code was built with MPICH-2, these two routines were not, and this inherent inconsistency made the executions fail.
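
In practice the fix was simply to force a truly clean rebuild. A minimal sketch of what we did, assuming a standard cesm1_0_3 case layout (the script names below follow the $CASE.$MACH pattern of our case, and the object files are assumed to live under $EXEROOT; both may differ in your setup):

cd $CASEROOT
./B1850WCN.generic_ibm.clean_build        # clean what the scripts know about

# the stale mct/mpeu objects survived the clean in our case, so remove them by hand
find $EXEROOT -name '*.o' -exec rm {} \;
find $EXEROOT -name '*.mod' -exec rm {} \;

./B1850WCN.generic_ibm.build              # full rebuild with the MPICH-2 mpif90 wrapper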

CESM103 now runs on the MareNostrum cluster, but it still has to be optimized. I have been running short tests with pure MPI and with a hybrid OpenMP+MPI configuration. Some flags have been tuned to the MN-specific architecture, but the timing of the executions is still poor compared to what I obtain on another cluster with the same number of MPI tasks. I know that this may partly be due to the specific architecture of this cluster, but I have been watching the I/O data stream carefully during the execution and discovered that, for pure MPI executions, it hangs for 20-30 minutes every time the ocean module writes a binary ".do." file. The problem gets worse for executions with the hybrid parallel configuration: the whole code runs much faster than in pure MPI, but every time the ocean module writes a restart file, the whole execution hangs at that point and stops there (without any runtime error, though...).

I think there may be an I/O bottleneck in the ocean code, which could perhaps be solved by changing the compilation flags for the POP module. The flags I am using on this machine are the following:

ifeq ($(strip $(MODEL)),pop2)
   FFLAGS := $(FPPFLAGS) -O3 -qstrict -q64 -g -qfullpath -qsigtrap=xl__trcedump -qarch=ppc970 \
             -d -qmaxmem=-1 -qtune=ppc970 -qalias=noaryovrlp -qnosave
endif


Could you please give me some advice on how to optimize them, in order to solve this bottleneck problem?

On the other machine I am not experiencing this, but that architecture is very different, and the RAM per node is much larger.

Thank you
 