
Problem running CAM4 in parallel mode

Hi,

I am trying to run CAM4 in parallel mode, using SPMD.
For configure I issued the command:
./configure -dyn fv -hgrid 10x15 -ntasks 6 -nosmp

configure executed successfully, but gmake failed with this kind of error:


/opt/mpich2_intel/lib/libmpich.a(type_struct.o): In function `MPI_Type_struct':
type_struct.c:(.text+0x41d): undefined reference to `pthread_getspecific'
type_struct.c:(.text+0x485): undefined reference to `pthread_setspecific'
type_struct.c:(.text+0x55c): undefined reference to `pthread_setspecific'
gmake: *** [/shome/2009asz8230/CAM4a/test_spmd/cam] Error 1


Please suggest what I should do.

Thanks,
Ram
 

eaton

CSEG and Liaisons
It is the mpi library that has the unsatisfied external (pthread_getspecific), so I would expect that this external is in another library. You can try looking for it yourself in the directory where the mpi library is (/opt/mpich2_intel/lib). If you find it there then just add that library to the link line. This is most easily done by using the -ldflags argument of configure.
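For example, a minimal sketch of the search (assuming GNU nm is available; note that pthread_getspecific is normally provided by the system libpthread rather than by an MPI library):

# scan the archives in the MPI library directory for the symbol
nm -A /opt/mpich2_intel/lib/*.a 2>/dev/null | grep pthread_getspecific

# if it is the system pthread library that provides it,
# add it to the link line via configure
./configure -dyn fv -hgrid 10x15 -ntasks 6 -nosmp -ldflags "-lpthread"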

It is often easier to get the correct mpi libraries by using the mpif90 wrapper in the link phase of the build. To do this make sure its location is in your path, then add the argument

-linker mpif90

to the configure command.

Also see the post http://bb.cgd.ucar.edu/showthread.php?t=1670 for another approach using the command "mpif90 -show" to display the mpi libs that need to be linked.
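For example (a sketch; the exact output depends on your MPICH2 installation), running

mpif90 -show

prints the full underlying compiler invocation, and the -L and -l options it contains can be copied into the -ldflags argument of configure.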
 
Hi Brian,

Thanks for replying. I proceeded according to your suggestions and also followed the link you posted.

Now gmake completes successfully, but the run fails.

For configure I issued the command:

/shome/2009ast3222/CAM4/models/atm/cam/bld/configure -dyn fv -hgrid 10x15 -ntasks 6 -nosmp -linker mpif90 -fc ifort -cc icc -test

and, as a second approach:

/shome/2009ast3222/CAM4/models/atm/cam/bld/configure -dyn fv -hgrid 10x15 -ntasks 6 -nosmp -ldflags "-lmpichf90 -lmpichf90 -lmpich -lpthread -lrt" -fc ifort -cc icc -test

For both approaches, configure and gmake completed successfully, but the run terminates.

I am posting the last part of the run log (the log is the same for both approaches):

Domain Information

Horizontal domain: nx = 24
ny = 19
No. of categories: nc = 1
No. of ice layers: ni = 4
No. of snow layers:ns = 1
Processors: total = 1
Processor shape: square-pop
Distribution type: cartesian
Distribution weight: latitude
max_blocks = 1
Number of ghost cells: 1

CalcWorkPerBlock: Total blocks: 6 Ice blocks: 6 IceFree blocks: 0 Land blocks: 0
Processors (X x Y) = 1 x 1
Active processors: 1
(shr_sys_abort) ERROR: ice: no. blocks exceed max: increase max to 6
(shr_sys_abort) WARNING: calling shr_mpi_abort() and stopping
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
cam 0000000000F329AE Unknown Unknown Unknown
cam 0000000000CF6D74 Unknown Unknown Unknown
cam 0000000000D39715 Unknown Unknown Unknown
cam 00000000007A1EE7 Unknown Unknown Unknown
cam 000000000079973E Unknown Unknown Unknown
cam 00000000007E973D Unknown Unknown Unknown
cam 0000000000427B91 Unknown Unknown Unknown
cam 000000000077BA99 Unknown Unknown Unknown
cam 000000000056DCBA Unknown Unknown Unknown
cam 0000000000404D82 Unknown Unknown Unknown
libc.so.6 0000003BF8A1D974 Unknown Unknown Unknown
cam 0000000000404CA9 Unknown Unknown Unknown


Similarly, when I tried SMP mode, configure and gmake completed successfully but the run terminates.

The last part of the run log in SMP mode:

(seq_mct_drv) : Setting fractions
(seq_mct_drv) : Initializing atm/ocn flux component
(seq_flux_atmocn_mct) computing only ocn albedos
(seq_mct_drv) : Calling map_lnd2atm_mct
(seq_mct_drv) : Calling map_ocn2atm_mct for mapping o2x_ox to o2x_ax
(seq_mct_drv) : Calling map_ocn2atm_mct for mapping xao_ox to xao_ax
(seq_mct_drv) : Calling map_ice2atm_mct for mapping i2x_ix to i2x_ax
(seq_mct_drv) : Calling mrg_x2a_run_mct
(seq_mct_drv) : Calling atm_init_mct
FV subcycling - n2 nsplit = 1 1
Divergence damping: use 2nd order damping
nstep, te 0 0.32829430336840811E+10 0.32829430336840811E+10 0.00000000000000000E+00 0.98518018633902189E+05
Segmentation fault


So what should I do next?

Thanks,
Ram
 

eaton

CSEG and Liaisons
The problem appears to be that when you run the job it is not actually using 6 mpi tasks. The output that you provided shows this:

Processors: total = 1

At the top of the log file there is information about the tasks and threads assigned to each component. A run using 6 tasks and 1 thread will have output that looks like this:

(seq_comm_setcomm) initialize ID ( 7 GLOBAL ) pelist = 0 5 1 ( npes = 6) ( nthreads = 1)
(seq_comm_setcomm) initialize ID ( 2 ATM ) pelist = 0 5 1 ( npes = 6) ( nthreads = 1)
(seq_comm_setcomm) initialize ID ( 1 LND ) pelist = 0 5 1 ( npes = 6) ( nthreads = 1)
(seq_comm_setcomm) initialize ID ( 4 ICE ) pelist = 0 5 1 ( npes = 6) ( nthreads = 1)
(seq_comm_setcomm) initialize ID ( 5 GLC ) pelist = 0 5 1 ( npes = 6) ( nthreads = 1)
(seq_comm_setcomm) initialize ID ( 3 OCN ) pelist = 0 5 1 ( npes = 6) ( nthreads = 1)
(seq_comm_setcomm) initialize ID ( 6 CPL ) pelist = 0 5 1 ( npes = 6) ( nthreads = 1)
(seq_comm_joincomm) initialize ID ( 8 CPLATM ) join IDs = 6 2 ( npes = 6) ( nthreads = 1)
(seq_comm_joincomm) initialize ID ( 9 CPLLND ) join IDs = 6 1 ( npes = 6) ( nthreads = 1)
(seq_comm_joincomm) initialize ID ( 10 CPLICE ) join IDs = 6 4 ( npes = 6) ( nthreads = 1)
(seq_comm_joincomm) initialize ID ( 11 CPLOCN ) join IDs = 6 3 ( npes = 6) ( nthreads = 1)
(seq_comm_joincomm) initialize ID ( 12 CPLGLC ) join IDs = 6 5 ( npes = 6) ( nthreads = 1)

This problem is generally due to a problem with the mpirun command (or whatever command you are using to launch the mpi job). You'll need to get help from your system administrator to resolve this.
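A quick way to sanity-check the launcher independently of CAM (a sketch; it assumes the hostname command is available on the compute nodes):

# this should print 6 lines, one per task; if it prints only 1,
# the launcher is not actually starting multiple tasks
mpirun -np 6 hostname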
 
Hi,

I tried to launch the run using this command:

mpirun -np 16 /shome/2009ast3222/CAM4/t_spmd/bld_1/cam

But the run still terminates. I am posting the run log:


[2009ast3222@ajaymeru run4]$ mpirun -np 16 /shome/2009ast3222/CAM4/t_spmd/bld_1/cam
(seq_comm_setcomm) initialize ID ( 7 GLOBAL ) pelist = 0 15 1 ( npes = 16) ( nthreads = 1)
(seq_comm_setcomm) initialize ID ( 2 ATM ) pelist = 0 15 1 ( npes = 16) ( nthreads = 1)
(seq_comm_setcomm) initialize ID ( 1 LND ) pelist = 0 15 1 ( npes = 16) ( nthreads = 1)
(seq_comm_setcomm) initialize ID ( 4 ICE ) pelist = 0 15 1 ( npes = 16) ( nthreads = 1)
(seq_comm_setcomm) initialize ID ( 5 GLC ) pelist = 0 15 1 ( npes = 16) ( nthreads = 1)
(seq_comm_setcomm) initialize ID ( 3 OCN ) pelist = 0 15 1 ( npes = 16) ( nthreads = 1)
(seq_comm_setcomm) initialize ID ( 6 CPL ) pelist = 0 15 1 ( npes = 16) ( nthreads = 1)
(seq_comm_joincomm) initialize ID ( 8 CPLATM ) join IDs = 6 2 ( npes = 16) ( nthreads = 1)
(seq_comm_joincomm) initialize ID ( 9 CPLLND ) join IDs = 6 1 ( npes = 16) ( nthreads = 1)
(seq_comm_joincomm) initialize ID ( 10 CPLICE ) join IDs = 6 4 ( npes = 16) ( nthreads = 1)
(seq_comm_joincomm) initialize ID ( 11 CPLOCN ) join IDs = 6 3 ( npes = 16) ( nthreads = 1)
(seq_comm_joincomm) initialize ID ( 12 CPLGLC ) join IDs = 6 5 ( npes = 16) ( nthreads = 1)

(seq_comm_printcomms) ID layout : global pes vs local pe for each ID
gpe LND ATM OCN ICE GLC CPL GLOBAL CPLATM CPLLND CPLICE CPLOCN CPLGLC nthrds
--- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------
0 : 0 0 0 0 0 0 0 0 0 0 0 0 1
1 : 1 1 1 1 1 1 1 1 1 1 1 1 1
2 : 2 2 2 2 2 2 2 2 2 2 2 2 1
3 : 3 3 3 3 3 3 3 3 3 3 3 3 1
4 : 4 4 4 4 4 4 4 4 4 4 4 4 1
5 : 5 5 5 5 5 5 5 5 5 5 5 5 1
6 : 6 6 6 6 6 6 6 6 6 6 6 6 1
7 : 7 7 7 7 7 7 7 7 7 7 7 7 1
8 : 8 8 8 8 8 8 8 8 8 8 8 8 1
9 : 9 9 9 9 9 9 9 9 9 9 9 9 1
10 : 10 10 10 10 10 10 10 10 10 10 10 10 1
11 : 11 11 11 11 11 11 11 11 11 11 11 11 1
12 : 12 12 12 12 12 12 12 12 12 12 12 12 1
13 : 13 13 13 13 13 13 13 13 13 13 13 13 1
14 : 14 14 14 14 14 14 14 14 14 14 14 14 1
(seq_mct_drv) USE_ESMF_LIB is NOT set, using esmf_wrf_timemgr
(seq_mct_drv) : ------------------------------------------------------------
(seq_mct_drv) : NCAR CPL7 Community Climate System Model (CCSM)
(seq_mct_drv) : ------------------------------------------------------------
(seq_mct_drv) : (Online documentation is available on the CCSM
(seq_mct_drv) : Models page: http://www.ccsm.ucar.edu/models/
(seq_mct_drv) : License information is available as a link from above
(seq_mct_drv) : ------------------------------------------------------------
(seq_mct_drv) : DATE 09/24/10 TIME 10:24:48
(seq_mct_drv) : ------------------------------------------------------------


(t_initf) Read in prof_inparm namelist from: drv_in
15 : 15 15 15 15 15 15 15 15 15 15 15 15 1

forrtl: No such file or directory
forrtl: severe (29): file not found, unit 10, file /shome/2009ast3222/CAM4/t_spmd/run4/drv_in
Image PC Routine Line Source
cam 000000000106701A Unknown Unknown Unknown
cam 000000000106621A Unknown Unknown Unknown
cam 000000000100D5CA Unknown Unknown Unknown
cam 0000000000FB93B6 Unknown Unknown Unknown
cam 0000000000FB89D2 Unknown Unknown Unknown
cam 0000000000FCB55F Unknown Unknown Unknown
cam 0000000000C4BEF7 Unknown Unknown Unknown
cam 0000000000574506 Unknown Unknown Unknown
cam 0000000000404D82 Unknown Unknown Unknown
libc.so.6 0000003BF8A1D974 Unknown Unknown Unknown
cam 0000000000404CA9 Unknown Unknown Unknown
rank 0 in job 13 ajaymeru.cas.iitd.ernet.in_40366 caused collective abort of all ranks
exit status of rank 0: return code 29


So is there still a problem with the MPI installation?
And one more thing: can I install MPI in my own account?

Thanks,
Ram
 

eaton

CSEG and Liaisons
The problem appears to be that the namelist input files are not being found:

forrtl: severe (29): file not found, unit 10, file /shome/2009ast3222/CAM4/t_spmd/run4/drv_in

You need to execute the build-namelist command in the run directory so that the namelist files are produced there (that is my preferred way of doing things). If you executed the build-namelist command in the build directory, then all the files it produces need to be copied to the run directory.
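For example, a minimal sketch of the first approach (the paths are illustrative; it assumes build-namelist's -config option, which points it at the config_cache.xml file that configure wrote to the build directory):

cd /path/to/rundir
/shome/2009ast3222/CAM4/models/atm/cam/bld/build-namelist -config /path/to/blddir/config_cache.xml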

 
Hi Brian,

Thanks for solving my queries; my problem with tasks is solved.
But now I am facing a problem with threads, i.e. SMP mode. I requested 8 threads in the configure command:
./configure -dyn fv -hgrid 1.9x2.5 -nospmd -nthreads 8 -fc ifort -cc icc -test

The starting lines of my log file:

(seq_comm_setcomm) initialize ID ( 7 GLOBAL ) pelist = 0 0 1 ( npes = 1) ( nthreads = 1 )
(seq_comm_setcomm) initialize ID ( 2 ATM ) pelist = 0 0 1 ( npes = 1) ( nthreads = 8 )
(seq_comm_setcomm) initialize ID ( 1 LND ) pelist = 0 0 1 ( npes = 1) ( nthreads = 8 )
(seq_comm_setcomm) initialize ID ( 4 ICE ) pelist = 0 0 1 ( npes = 1) ( nthreads = 8 )
(seq_comm_setcomm) initialize ID ( 5 GLC ) pelist = 0 0 1 ( npes = 1) ( nthreads = 1 )
(seq_comm_setcomm) initialize ID ( 3 OCN ) pelist = 0 0 1 ( npes = 1) ( nthreads = 8 )
(seq_comm_setcomm) initialize ID ( 6 CPL ) pelist = 0 0 1 ( npes = 1) ( nthreads = 8 )
(seq_comm_joincomm) initialize ID ( 8 CPLATM ) join IDs = 6 2 ( npes = 1) ( nthreads = 8 )
(seq_comm_joincomm) initialize ID ( 9 CPLLND ) join IDs = 6 1 ( npes = 1) ( nthreads = 8 )
(seq_comm_joincomm) initialize ID ( 10 CPLICE ) join IDs = 6 4 ( npes = 1) ( nthreads = 8 )
(seq_comm_joincomm) initialize ID ( 11 CPLOCN ) join IDs = 6 3 ( npes = 1) ( nthreads = 8 )
(seq_comm_joincomm) initialize ID ( 12 CPLGLC ) join IDs = 6 5 ( npes = 1) ( nthreads = 8 )

(seq_comm_printcomms) ID layout : global pes vs local pe for each ID
gpe LND ATM OCN ICE GLC CPL GLOBAL CPLATM CPLLND CPLICE CPLOCN CPLGLC nthrds
--- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------
0 : 0 0 0 0 0 0 0 0 0 0 0 0 8



The last part of my log file:

(seq_mct_drv) : Setting fractions
(seq_mct_drv) : Initializing atm/ocn flux component
(seq_flux_atmocn_mct) computing only ocn albedos
(seq_mct_drv) : Calling map_lnd2atm_mct
(seq_mct_drv) : Calling map_ocn2atm_mct for mapping o2x_ox to o2x_ax
(seq_mct_drv) : Calling map_ocn2atm_mct for mapping xao_ox to xao_ax
(seq_mct_drv) : Calling map_ice2atm_mct for mapping i2x_ix to i2x_ax
(seq_mct_drv) : Calling mrg_x2a_run_mct
(seq_mct_drv) : Calling atm_init_mct
FV subcycling - n2 nsplit = 1 4
Divergence damping: use 2nd order damping
nstep, te 0 0.33264096717818675E+10 0.33264096717818675E+10 0.00000000000000000E+00 0.98525963359206798E+05
Segmentation fault

As with the tasks problem, you suggested some flags, so I assume there is again some problem with flags. Am I correct?

Could you also suggest where to look for documentation on these flags? And are the flags system-specific or compiler-specific? I know I have too many queries, but please do guide me.

I am working on a Red Hat-based Linux system (CentOS), and I am using ifort.

Thanks again,

Ram
 

eaton

CSEG and Liaisons
There is nothing in the output you posted that indicates what the problem might be.

How many cores are in a shared memory node in your cluster? And is hyperthreading turned on? Generally the maximum thread count will be the same as the number of cores in a shared memory node, or twice that number if hyperthreading is on.

Are you able to run successfully with a single thread? That is, configure with "-nthreads 1" and set the environment variable OMP_NUM_THREADS to 1.
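For example, a minimal sketch of the single-thread test (bash syntax; under csh use setenv instead of export):

./configure -dyn fv -hgrid 1.9x2.5 -nospmd -nthreads 1 -fc ifort -cc icc
gmake
export OMP_NUM_THREADS=1
./cam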
 
Hi Brian,

I ran with 1 thread as per your suggestion, and it ran successfully.

The cluster I am working on uses dual-socket quad-core Intel Xeon processors, so there are 16 logical CPUs in shared memory on each node, all with the same CPU configuration. And hyperthreading is on.

Thanx
Ram
 

eaton

CSEG and Liaisons
If you can run with threading on, then the thing to try is just increasing the number of threads until it fails. Also check that the answers are identical independent of the thread count; that check is important evidence that threading is working correctly. Of course the other important check is that the wall time to complete the run scales appropriately with the increased compute resource. Your nodes sound plenty big enough to run 8 threads. Sometimes, though, a per-thread stack size limit can be exceeded, and this can typically be overcome by setting an environment variable. Check your compiler documentation for information on this.
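For example, with ifort the per-thread stack can typically be raised like this (a sketch in bash syntax; KMP_STACKSIZE is Intel-specific, and OMP_STACKSIZE is the standard OpenMP equivalent in later compilers):

# raise the stack available to each OpenMP thread
export KMP_STACKSIZE=64m
# raise the main process stack limit as well
ulimit -s unlimited
export OMP_NUM_THREADS=8
./cam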

One other point is that we generally find the best performance at small core counts to come from either pure mpi configurations, or from hybrid configurations with a small number of threads per task. If your system is dual socket quad core chips, then that's 8 cores per node, and with hyperthreading you can assign up to 16 processes to the node. I would recommend looking at the performance of pure mpi with 16 tasks first, then moving to hybrid configurations with 8 tasks and 2 threads per task, then try 4 tasks with 4 threads per task. I think it's unlikely that using 8 threads per task will perform well on this system.
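For example, a sketch of the configurations to compare (configure flags as used earlier in this thread; the launch commands assume your MPICH2 mpirun):

# pure MPI: 16 tasks, no threading
./configure -dyn fv -hgrid 1.9x2.5 -ntasks 16 -nosmp -fc ifort -cc icc
mpirun -np 16 ./cam

# hybrid: 8 tasks with 2 threads per task
./configure -dyn fv -hgrid 1.9x2.5 -ntasks 8 -nthreads 2 -fc ifort -cc icc
export OMP_NUM_THREADS=2
mpirun -np 8 ./cam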
 