m_kliphuis@phys_uu_nl
Dear Sir/Madam,
I put this message on the "ORNL Porting and Performance" section a few weeks ago but I did not get any response. Hopefully someone here can help me out with one or more of my questions below ;-)
I am trying to improve the performance of CCSM3.0 on an
SGI Origin 3800 supercomputer. Could you please help me
out with some burning questions?
Perhaps you have heard of the "Dutch Challenge project".
For this project I used the CSM1.4 model to generate 62
ensemble members of 140 years each on an SGI Origin 3800
supercomputer. I got the best Years/day/cpu when I
used 8 processors (3 x atm, 2 x ocn, 1 x lnd, 1 x ice and 1 x cpl).
With this setup the computer was able to generate 4 years
per 24 hours.
I am now trying to find the most efficient setup (in terms of
Years/day/cpu) for the CCSM3.0 model. When I put 16 processors
on the atm component I found out that I can match ocn
to atm processing time by putting only 2 processors on the
ocn component. I then get the best Years/day/cpu when I also
put 2 processors on the cpl and 1 on the ice and lnd component.
I ran the model for 10 days. Could you please check the times in the file table.data that I got after running the getTiming.csh script?
COMMON,atm,lnd,ice,ocn,cpl
node,1*16,1*1,1*1,2*1,2*1
cpu,16,1,1,2,2
atm,total, 580.403
lnd,total, 602.665
ice,total, 602.59
ocn,total, 600.72
cpl,total, 568.957
atm,send, 0.064
lnd,send, 0.026
ice,send, 4.90
ocn,send, 0.10
cpl,send, 5.517
atm,recv, 2.158
lnd,recv, 33.550
ice,recv, 221.36
ocn,recv, 280.59
cpl,recv, 15.889
atm,s_r, 0.802
lnd,s_r, 1.370
ice,s_r, 299.72
ocn,s_r, 0.00
cpl,s_r, 10.862
atm,r_s, 326.531
lnd,r_s, 1.459
ice,r_s, 68.14
ocn,r_s, 320.02
cpl,r_s, 6.412
ENDRECORDER
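To make the table above easier to scan, here is a small helper I use (my own sketch, not part of the CCSM tooling) that pulls out the per-component totals and flags the slowest component:

```python
# Parse the "total" rows of the getTiming.csh table (pasted inline for this
# sketch) and report each component's total time plus the slowest component.
# Note: totals include time spent waiting on other components, so a high
# total does not necessarily mean a high workload.
table = """\
atm,total, 580.403
lnd,total, 602.665
ice,total, 602.59
ocn,total, 600.72
cpl,total, 568.957
"""

totals = {}
for line in table.splitlines():
    comp, field, value = (part.strip() for part in line.split(","))
    if field == "total":
        totals[comp] = float(value)

slowest = max(totals, key=totals.get)
print(slowest, totals[slowest])
```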
The CPL main time for this 10-day run was 569 seconds. This means that I
can again generate 4 years per day, but now I need 22 processors instead of 8 ;-(
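For reference, this is the arithmetic behind the "4 years per day on 22 processors" figure (a sketch under my own assumptions: a 365-day model year and the processor counts from the table above):

```python
# Turn the coupler main time for a 10-model-day run into Years/day and
# Years/day/cpu, the metric I am trying to optimize.
run_days = 10
cpl_main_s = 568.957           # cpl,total from table.data (wall-clock seconds)
cpus = 16 + 1 + 1 + 2 + 2      # atm + lnd + ice + ocn + cpl = 22

seconds_per_year = cpl_main_s * (365.0 / run_days)  # wall clock per model year
years_per_day = 86400.0 / seconds_per_year
years_per_day_per_cpu = years_per_day / cpus
print(round(years_per_day, 2), round(years_per_day_per_cpu, 3))
```

So roughly 4.2 model years per wall-clock day, which matches what I saw, but at a much worse Years/day/cpu than the 8-processor CSM1.4 setup.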
Question 1:
Is it possible that CCSM3.0 is a factor of 3 slower than CSM1.4?
Question 2:
I always get ice processing times (ice,r_s) that are much higher than the lnd processing times (lnd,r_s), even if I put 10 times as many processors on the ice component as on the lnd component.
In the article "An introduction to load balancing CCSM3 components" by
G. Carr, even with 48 processors on lnd and 8 on ice, the lnd processing times were 3 times higher than the ice processing times.
I don't understand why the lnd processing times are so low in my case. Do you think that something is wrong?
Question 3:
If the ice,recv and ocn,recv times were not so high, the CPL main time would be much lower and thus the performance much better. Is it normal for the recv times of these components to be so high?
Question 4:
From the batch job I get an error file with the following messages:
print_memusage iam 0 spetru_uv. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= -1 132016 -1 -1 0
print_memusage iam 0 post-inidat. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= -1 137824 -1 -1 0
print_memusage iam 0 Start aerosol_initialize. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= -1 139168 -1 -1 0
print_memusage iam 0 End aerosol_initialize. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= -1 159104 -1 -1 0
print_memusage iam 0 stepon after dynpkg. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= -1 361296 -1 -1 0
print_memusage iam 0 End stepon. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= -1 364272 -1 -1 0
I first thought this could be a stacksize problem, but the stacksize is 1 GB per processor, and I guess this should be enough.
The model just continues, so it does not seem to be such a big problem. Do you know how to solve it? And could it affect the performance?
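In case it helps: this is how I inspect the limits (a sketch; `MP_SLAVE_STACKSIZE` is the IRIX environment variable that I believe controls the slave-thread stacks, please correct me if that is wrong):

```shell
# Print the current per-process stack limit (in KB); "unlimited" means no cap.
ulimit -s

# In csh/tcsh (the shell the CCSM run scripts use), the equivalent is:
#   limit stacksize unlimited
#
# On IRIX the master thread's limit does not apply to OpenMP slave threads;
# their stack size is set separately (in bytes), e.g.:
#   setenv MP_SLAVE_STACKSIZE 256000000
```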
Question 5:
For the 10-day run I set NTASKS=1 and NTHRDS=16 for the
atm component. I guess you would expect that, for instance,
NTASKS=2 and NTHRDS=8 would give better performance, but
it didn't; it was even worse. Could there be something wrong
with my Macros.IRIX64 file? (see below)
#===============================================================================
# Makefile macros for "teras," an SGI O3800 system at SARA (Netherlands)
#
# Notes: (for details, see man pages for f90, ld, & debug_group)
# -64 => 64 bit object code (memory addressing)
# -show => prints name of linker being used
# -v => prints name of linker, libs linked in, ...
# -extend_source => 132 char line length for fixed format lines
# -mp => recognize multiprocessing directives
# -r8 -i4 => default is 8-byte reals, 4-byte integers
# -C => array bounds checking (same as -DEBUG:subscript_check)
# -DEBUG:... => activates various options, see man debug_group
#===============================================================================
INCLDIR := -I ${MPT_SGI}/usr/include -I /usr/include -I /usr/local/include -I${INCROOT}
SLIBS := -lfpe -lnetcdf -lscs
ULIBS := -L${LIBROOT} -lesmf -lmct -lmpeu -lmph -lmpi
CPP := /lib/cpp
CPPFLAGS :=
CPPDEFS := -DIRIX64 -DSGI
ifeq ($(MACH),chinook)
SLIBS := -lfpe -lmpi \
         -L/usr/local/lib64/r8i4 -lmss \
         -L/usr/local/lib64/r4i4 -lnetcdf -lscs
CPPDEFS := $(CPPDEFS) -DMSS
endif
ifeq ($(MACH),guyot)
INCLDIR := -I $(NETCDF_INC) $(INCLDIR)
endif
CC := cc
CFLAGS := -c -64
FIXEDFLAGS :=
FREEFLAGS :=
FC := f90
FFLAGS := -c -64 -mips4 -O2 -r8 -i4 -show -extend_source
MOD_SUFFIX := mod
LD := $(FC)
LDFLAGS := -64 -mips4 -O2 -r8 -i4 -show -mp
AR := ar
# start kliphuis
INC_NETCDF := /usr/local/opt/netcdf/include
INCLDIR := -I $(INC_NETCDF) $(INCLDIR)
LIB_NETCDF := -L /usr/local/opt/netcdf/lib -l netcdf
#SLIBS := $(LIB_NETCDF) -lfpe -lscs
SLIBS := $(LIB_NETCDF) -lfpe
# end kliphuis
ifeq ($(MODEL),pop)
CPPDEFS := $(CPPDEFS) -DPOSIX -Dimpvmix -Dcoupled -DNPROC_X=$(NX) -DNPROC_Y=$(NY)
endif
ifeq ($(MODEL),csim)
CPPDEFS := $(CPPDEFS) -Dcoupled -DNPROC_X=$(NX) -DNPROC_Y=$(NY) -D_MPI
endif
ifeq ($(THREAD),TRUE)
ULIBS := -L${LIBROOT} -lesmf -lmct -lmpeu -lmph -lmp -lmpi
# CPPFLAGS := $(CPPFLAGS) -D_OPENMP
CPPDEFS := $(CPPDEFS) -D_OPENMP -DTHREADED_OMP
FFLAGS := $(FFLAGS) -mp
endif
ifeq ($(DEBUG),TRUE)
FFLAGS := $(FFLAGS) -C -DEBUG:trap_uninitialized:verbose_runtime
endif
Once again, thanks a lot for your time.
Kind regards,
Michael Kliphuis
Institute for Marine and Atmospheric Research Utrecht
email: M.Kliphuis@phys.uu.nl