
parallel run: Fatal error in MPI_Irecv

Dear Helper,

My machine is x86 and the compiler is ifort. Here is my configure command for the MPI run:

$camcfg/configure -fc mpif90 -fc_type intel -cc mpicc -dyn fv -hgrid 10x15 -ntasks 6 -nosmp -test -v -debug

It compiles OK, but when it runs it stops with the following error message:

----------
(seq_comm_setcomm)  initialize ID (  1 GLOBAL          ) pelist   =     0     5     1 ( npes =     6) ( nthreads =  1)
(seq_comm_setcomm)  initialize ID (  2 CPL             ) pelist   =     0     5     1 ( npes =     6) ( nthreads =  1)
(seq_comm_setcomm)  initialize ID ( 15 ATM             ) pelist   =     0     5     1 ( npes =     6) ( nthreads =  1)
(seq_comm_joincomm) initialize ID ( 16 CPLATM          ) join IDs =     2    15       ( npes =     6) ( nthreads =  1)
(seq_comm_jcommarr) initialize ID (  3 ALLATMID        ) join multiple comp IDs       ( npes =     6) ( nthreads =  1)
(seq_comm_joincomm) initialize ID (  9 CPLALLATMID     ) join IDs =     2     3       ( npes =     6) ( nthreads =  1)
(seq_comm_setcomm)  initialize ID ( 17 LND             ) pelist   =     0     5     1 ( npes =     6) ( nthreads =  1)
(seq_comm_joincomm) initialize ID ( 18 CPLLND          ) join IDs =     2    17       ( npes =     6) ( nthreads =  1)
(seq_comm_jcommarr) initialize ID (  4 ALLLNDID        ) join multiple comp IDs       ( npes =     6) ( nthreads =  1)
(seq_comm_joincomm) initialize ID ( 10 CPLALLLNDID     ) join IDs =     2     4       ( npes =     6) ( nthreads =  1)
(seq_comm_setcomm)  initialize ID ( 19 OCN             ) pelist   =     0     5     1 ( npes =     6) ( nthreads =  1)
(seq_comm_joincomm) initialize ID ( 20 CPLOCN          ) join IDs =     2    19       ( npes =     6) ( nthreads =  1)
(seq_comm_jcommarr) initialize ID (  5 ALLOCNID        ) join multiple comp IDs       ( npes =     6) ( nthreads =  1)
(seq_comm_joincomm) initialize ID ( 11 CPLALLOCNID     ) join IDs =     2     5       ( npes =     6) ( nthreads =  1)
(seq_comm_setcomm)  initialize ID ( 21 ICE             ) pelist   =     0     5     1 ( npes =     6) ( nthreads =  1)
(seq_comm_joincomm) initialize ID ( 22 CPLICE          ) join IDs =     2    21       ( npes =     6) ( nthreads =  1)
(seq_comm_jcommarr) initialize ID (  6 ALLICEID        ) join multiple comp IDs       ( npes =     6) ( nthreads =  1)
(seq_comm_joincomm) initialize ID ( 12 CPLALLICEID     ) join IDs =     2     6       ( npes =     6) ( nthreads =  1)
(seq_comm_setcomm)  initialize ID ( 23 GLC             ) pelist   =     0     5     1 ( npes =     6) ( nthreads =  1)
(seq_comm_joincomm) initialize ID ( 24 CPLGLC          ) join IDs =     2    23       ( npes =     6) ( nthreads =  1)
(seq_comm_jcommarr) initialize ID (  7 ALLGLCID        ) join multiple comp IDs       ( npes =     6) ( nthreads =  1)
(seq_comm_joincomm) initialize ID ( 13 CPLALLGLCID     ) join IDs =     2     7       ( npes =     6) ( nthreads =  1)
(seq_comm_setcomm)  initialize ID ( 25 ROF             ) pelist   =     0     5     1 ( npes =     6) ( nthreads =  1)
(seq_comm_joincomm) initialize ID ( 26 CPLROF          ) join IDs =     2    25       ( npes =     6) ( nthreads =  1)
(seq_comm_jcommarr) initialize ID (  8 ALLROFID        ) join multiple comp IDs       ( npes =     6) ( nthreads =  1)
(seq_comm_joincomm) initialize ID ( 14 CPLALLROFID     ) join IDs =     2     8       ( npes =     6) ( nthreads =  1)
Fatal error in MPI_Gather: Invalid datatype, error stack:
MPI_Gather(761): MPI_Gather(sbuf=0x7fffb06c9990, scount=1, INVALID DATATYPE, rbuf=0xa0a8a20, rcount=1, INVALID DATATYPE, root=0, comm=0x84000002) failed
MPI_Gather(663): Null Datatype pointer
Fatal error in MPI_Gather: Invalid datatype, error stack:
MPI_Gather(761): MPI_Gather(sbuf=0x7fff28c78c10, scount=1, INVALID DATATYPE, rbuf=0xa0a8a20, rcount=1, INVALID DATATYPE, root=0, comm=0x84000002) failed
MPI_Gather(663): Null Datatype pointer
Fatal error in MPI_Gather: Invalid datatype, error stack:
MPI_Gather(761): MPI_Gather(sbuf=0x7fff83c6c990, scount=1, INVALID DATATYPE, rbuf=0xa0a8a20, rcount=1, INVALID DATATYPE, root=0, comm=0x84000002) failed
MPI_Gather(663): Null Datatype pointer
Fatal error in MPI_Gather: Invalid datatype, error stack:
MPI_Gather(761): MPI_Gather(sbuf=0x7fffc24dc710, scount=1, INVALID DATATYPE, rbuf=0xa0a8a20, rcount=1, INVALID DATATYPE, root=0, comm=0x84000002) failed
MPI_Gather(663): Null Datatype pointer
Fatal error in MPI_Gather: Invalid datatype, error stack:
MPI_Gather(761): MPI_Gather(sbuf=0x7fff2fc44610, scount=1, INVALID DATATYPE, rbuf=0xa0a8a20, rcount=1, INVALID DATATYPE, root=0, comm=0x84000002) failed
MPI_Gather(663): Null Datatype pointer

-----------------

I would appreciate any suggestions on how to go about this. Thanks a lot.
 

eaton

CSEG and Liaisons
I suspect the problem is in the MPI installation.  Can you run some simple tests to verify that mpi is working correctly on your system? 
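For example, a bare-bones check along these lines (just a sketch, not part of CAM; build it with the same mpif90 you used for the model and run it with mpirun) would confirm that every rank starts, sees the right communicator size, and exits cleanly:

      program hello_mpi
!     minimal MPI sanity check -- illustrative only
      include 'mpif.h'
      integer ierr, rank, nprocs

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      write(*,*) 'hello from rank', rank, 'of', nprocs
      call MPI_BARRIER(MPI_COMM_WORLD, ierr)
      call MPI_FINALIZE(ierr)
      end

If that runs cleanly, a slightly larger test that exercises a collective such as mpi_gather or mpi_reduce would be a good next step.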
 
Dear Eaton,

I tested a program. It runs OK, but the performance is very poor. Also, I am testing on an SMP platform, and I wonder whether this is an adequate test for MPI. I am pretty new to parallel computing, so any help is very much appreciated. The source code is below, and the results follow after it.

! *****************************************************************************
! FILE: mpi_array.f
! DESCRIPTION:
!   MPI Example - Array Assignment - Fortran Version
!   This program demonstrates a simple data decomposition. The master task
!   first initializes an array and then distributes an equal portion that
!   array to the other tasks. After the other tasks receive their portion
!   of the array, they perform an addition operation to each array element.
!   They also maintain a sum for their portion of the array. The master task
!   does likewise with its portion of the array. As each of the non-master
!   tasks finish, they send their updated portion of the array to the master.
!   An MPI collective communication call is used to collect the sums
!   maintained by each task.  Finally, the master task displays selected
!   parts of the final array and the global sum of all array elements.
!   NOTE: the number of MPI tasks must be evenly divisible by 4.
! AUTHOR: Blaise Barney
! LAST REVISED: 01/24/09
! **************************************************************************

      program array
      include 'mpif.h'

      integer   ARRAYSIZE, MASTER
      parameter (ARRAYSIZE = 160000000)
      parameter (MASTER = 0)

      integer  numtasks, taskid, ierr, dest, offset, i, j, tag1,
     &         tag2, source, chunksize
      real*4   mysum, sum, data(ARRAYSIZE)
      integer  status(MPI_STATUS_SIZE)
      common   /a/ data

! ***** Initializations *****
      call MPI_INIT(ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, numtasks, ierr)
      i = MOD(numtasks, 4)
      if (i .ne. 0) then
        call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
        stop
      end if
      call MPI_COMM_RANK(MPI_COMM_WORLD, taskid, ierr)
      write(*,*)'MPI task',taskid,'has started...'
      chunksize = (ARRAYSIZE / numtasks)
      tag2 = 1
      tag1 = 2

!***** Master task only ******
      if (taskid .eq. MASTER) then

!       Initialize the array
        sum = 0.0
        do i=1, ARRAYSIZE
        do j=1,200
          data(i) = i * 1.0 *2.0 / 2.0
          sum = sum + data(i)
        end do
        end do
        write(*,20) sum

!       Send each task its portion of the array - master keeps 1st part
        offset = chunksize + 1
        do dest=1, numtasks-1
          call MPI_SEND(offset, 1, MPI_INTEGER, dest, tag1,
     &      MPI_COMM_WORLD, ierr)
          call MPI_SEND(data(offset), chunksize, MPI_REAL, dest,
     &      tag2, MPI_COMM_WORLD, ierr)
          write(*,*) 'Sent',chunksize,'elements to task',dest,
     &      'offset=',offset
          offset = offset + chunksize
        end do

!       Master does its part of the work
        offset = 1
        call update(offset, chunksize, taskid, mysum)

!       Wait to receive results from each task
        do i=1, numtasks-1
          source = i
          call MPI_RECV(offset, 1, MPI_INTEGER, source, tag1,
     &      MPI_COMM_WORLD, status, ierr)
          call MPI_RECV(data(offset), chunksize, MPI_REAL,
     &      source, tag2, MPI_COMM_WORLD, status, ierr)
        end do

!       Get final sum and print sample results
        call MPI_Reduce(mysum, sum, 1, MPI_REAL, MPI_SUM, MASTER,
     &    MPI_COMM_WORLD, ierr)
        print *, 'Sample results:'
        offset = 1
        do i=1, numtasks
          write (*,30) data(offset:offset+4)
          offset = offset + chunksize
        end do
        write(*,40) sum

      end if


!***** Non-master tasks only *****

      if (taskid .gt. MASTER) then

!       Receive my portion of array from the master task */
        call MPI_RECV(offset, 1, MPI_INTEGER, MASTER, tag1,
     &    MPI_COMM_WORLD, status, ierr)
        call MPI_RECV(data(offset), chunksize, MPI_REAL, MASTER,
     &    tag2, MPI_COMM_WORLD, status, ierr)

        call update(offset, chunksize, taskid, mysum)

!       Send my results back to the master
        call MPI_SEND(offset, 1, MPI_INTEGER, MASTER, tag1,
     &    MPI_COMM_WORLD, ierr)
        call MPI_SEND(data(offset), chunksize, MPI_REAL, MASTER,
     &    tag2, MPI_COMM_WORLD, ierr)

        call MPI_Reduce(mysum, sum, 1, MPI_REAL, MPI_SUM, MASTER,
     &    MPI_COMM_WORLD, ierr)

      endif


      call MPI_FINALIZE(ierr)

  20  format('Initialized array sum = ',E12.6)
  30  format(5E14.6)
  40  format('*** Final sum= ',E12.6,' ***')

      end




      subroutine update(myoffset, chunksize, myid, mysum)
        integer   ARRAYSIZE, myoffset, chunksize, myid, i
        parameter (ARRAYSIZE = 160000000)
        real*4 mysum, data(ARRAYSIZE)
        common /a/ data
!       Perform addition to each of my array elements and keep my sum
        mysum = 0
        do i=myoffset, myoffset + chunksize-1
          data(i) = data(i) + i * 1.0
          mysum = mysum + data(i)
        end do
        write(*,50) myid,mysum
  50    format('Task',I4,' mysum = ',E12.6)
      end subroutine update

------ Results ------
msun[589] time /usr/local/mpich2-1.4.1p1/bin/mpirun -np 4 mpi_array
 MPI task           2 has started...
 MPI task           1 has started...
 MPI task           3 has started...
 MPI task           0 has started...
Initialized array sum = 0.450360E+16
 Sent    40000000 elements to task           1 offset=    40000001
Task   1 mysum = 0.465009E+16
 Sent    40000000 elements to task           2 offset=    80000001
Task   2 mysum = 0.867365E+16
 Sent    40000000 elements to task           3 offset=   120000001
Task   0 mysum = 0.159791E+16
Task   3 mysum = 0.105973E+17
 Sample results:
  0.200000E+01  0.400000E+01  0.600000E+01  0.800000E+01  0.100000E+02
  0.800000E+08  0.800000E+08  0.800000E+08  0.800000E+08  0.800000E+08
  0.160000E+09  0.160000E+09  0.160000E+09  0.160000E+09  0.160000E+09
  0.240000E+09  0.240000E+09  0.240000E+09  0.240000E+09  0.240000E+09
*** Final sum= 0.255189E+17 ***
135.172u 5.240s 0:35.20 398.8%  0+0k 0+0io 0pf+0w
------------------
 

eaton

CSEG and Liaisons
That looks like a good test. I think the next step is to identify where the failing call to mpi_gather is coming from. You should be able to get this from a stack traceback in a debugger. Since the error message says an invalid datatype is being used, we need to identify the actual call to see whether that makes sense.
 
Dear Eaton,

Thanks for your confirmation. I configured with -debug and compiled the code. However, when I tried to debug, I couldn't trace anything. I already sent an email to our system admin to look at gdb, but I do not think that will help a lot. In your experience, can I insert print statements before the subroutines that call MPI_Gather() to find out where the failure is? I usually do this for my other codes, since they are mostly simple codes.

----
$camcfg/configure -fc mpif90 -fc_type intel -cc mpicc -dyn fv -hgrid 10x15 -ntasks 6 -nosmp -test -v -debug
 

eaton

CSEG and Liaisons
One thing you might try with the debugger is to build the code using "-ntasks 1".  This will still invoke the mpi library and should still give the same error as with 6 tasks.  If you have no luck with the debugger, then print statements are the best option.  Just look for the code that's producing the final output in the log file and start from that point looking for the next call to an mpi_gather.
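For example (an illustration only, not CAM code), a gather call instrumented like this shows whether the datatype handle being passed looks sane and what the return code is; the same pattern works around any of the mpi calls in the driver:

      program gather_check
!     illustrative diagnostic -- print the handles before the call and
!     the error code after it
      include 'mpif.h'
      integer ierr, rank, nprocs, sendval
      integer recvbuf(64)   ! big enough for this toy check

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

      sendval = rank
      write(*,*) 'rank', rank, ': datatype =', MPI_INTEGER,
     &           ' comm =', MPI_COMM_WORLD
      call MPI_GATHER(sendval, 1, MPI_INTEGER, recvbuf, 1,
     &                MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
      write(*,*) 'rank', rank, ': MPI_GATHER returned ierr =', ierr

      call MPI_FINALIZE(ierr)
      end

Comparing the datatype value printed inside the model with the value this standalone program prints tells you whether the model was compiled against the same mpif.h as the MPI library it is linked to.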
 
Hi, Eaton,
I added print lines in m_MCTWorld.F90; that is where the MPI run stops. My first question is which mpif.h is used: the one in /usr/local/include or mpi-serial/mpif.h? My second question is where MPI_Irecv is defined. I found one in recv.c under the models/utils/mct/mpi-serial directory; that is the only one I found, and I wonder whether it is the right one. Any hints are welcome. It is getting closer. Thanks.

---- between lines 249-256 ----
  if(myGid == 0) then
    do i=1,ncomps
print*,'*hello1', i, root_nprocs(i), MP_INTEGER, MP_ANY_SOURCE, globalcomm,reqs(i),ier
       call MPI_IRECV(root_nprocs(i), 1, MP_INTEGER, MP_ANY_SOURCE,i, &
     globalcomm, reqs(i), ier)
       if(ier /= 0) call MP_perr_die(myname_,'MPI_IRECV(root_nprocs)',ier)
    enddo
  endif
-----

Then I ran mpirun -np 1 ../bld/cam and the results are:

---
............
(seq_comm_joincomm) initialize ID ( 14 CPLALLROFID     ) join IDs =     2     8       ( npes =     1) ( nthreads =  1)
 =hello6
 *hello1           0           0           4          -1  1140850688           0
           0
 *hello1           1           0           4          -1  1140850688           0
           0
Fatal error in MPI_Irecv: Invalid datatype, error stack:
MPI_Irecv(145): MPI_Irecv(buf=0xa283e30, count=1, INVALID DATATYPE, src=MPI_PROC_NULL, tag=1, MPI_COMM_WORLD, request=0xa283c80) failed
MPI_Irecv(111): Null Datatype pointer
------ 
 

eaton

CSEG and Liaisons
When -ntasks is specified to configure, the mpi-serial library should not be built, and mct/mpi-serial should not be in the list of paths set by the -I options that are searched for include files.  You can verify this by looking at the output from make.  If the mpi-serial version of mpif.h is being used in this build, I think that could explain the error you are seeing.
 
Hi, Eaton,

Thanks for your help. I checked, and mpi-serial is not used. Later I found out that my MPI include and lib paths were not included on the ifort -c and ifort -o commands. I always thought these things would be taken care of by configure and make, but in my case I need to add them.

After I did that, the program ran much further, but a new problem appeared, which I need to work on. If you could take a look and help, that would be great. Thanks.

------ close to the end ------
............
  heat_capacity             =        T
  atmbndy                   =  default
  fyear_init                =     1900
  ycycle                    =        1
  atm_data_type             = default
  calc_strair               =        T
  calc_Tsfc                 =        T
  Tfrzpt                    = linear_S
  update_ocn_f              =        F
  oceanmixed_ice            =        F
  sss_data_type             = default
  sst_data_type             = default

Diagnostic point 1: lat, lon =   90.00    0.00
Diagnostic point 2: lat, lon =  -65.00  -45.00
  tr_iage                   =        F
  restart_age               =        F
  tr_FY                     =        F
  restart_FY                =        F
  tr_lvl                    =        F
  restart_lvl               =        F
  tr_pond                   =        T
  restart_pond              =        F
  tr_aero                   =        F
  restart_aero              =        F


Domain Information

  Horizontal domain: nx =     24
                     ny =     19
  No. of categories: nc =      1
  No. of ice layers: ni =      4
  No. of snow layers:ns =      1
  Processors:  total    =      1
  Processor shape:       square-pop
  Distribution type:     roundrobin
  Distribution weight:     latitude
  {min,max}Blocks =            1     2
  Number of ghost cells:       1

CalcWorkPerBlock: Total blocks:     8 Ice blocks:     8 IceFree blocks:     0 Land blocks:     0
forrtl: severe (408): fort: (2): Subscript #2 of the array BLOCKINDEX has value 3 which is greater than the upper bound of 2

Image              PC                Routine            Line        Source
cam                000000000288B2F1  Unknown               Unknown  Unknown
cam                000000000288A2C5  Unknown               Unknown  Unknown
cam                0000000002819B2A  Unknown               Unknown  Unknown
cam                00000000027ADCF2  Unknown               Unknown  Unknown
cam                00000000027ACC0E  Unknown               Unknown  Unknown
cam                0000000000EEE86E  ice_distribution_         809  ice_distribution.F90
cam                0000000000EEDDEB  ice_distribution_         159  ice_distribution.F90
cam                0000000000F06A57  ice_domain_mp_ini         509  ice_domain.F90
cam                0000000000FFA89B  ice_grid_mp_init_         254  ice_grid.F90
cam                000000000048B70A  cice_initmod_mp_c         106  CICE_InitMod.F90
cam                0000000000EC883B  ice_comp_mct_mp_i         254  ice_comp_mct.F90
cam                0000000000976BA2  ccsm_comp_mod_mp_        1074  ccsm_comp_mod.F90
cam                0000000000996B12  MAIN__                     90  ccsm_driver.F90
cam                000000000040033C  Unknown               Unknown  Unknown
cam                0000000002898AC4  Unknown               Unknown  Unknown
cam                0000000000400209  Unknown               Unknown  Unknown
-----------------------  
 
Now, when I configure with -ntasks 4 and run with mpirun -np 1 ../bld/cam, the program stops at the allocate at line 233 in m_MCTWorld.F90:

---------
! allocate space on global root to receive info about
! the other components
 if(myGid == 0) then
    allocate(nprocs(ncomps),compids(ncomps),&
    reqs(ncomps),status(MP_STATUS_SIZE,ncomps),&
    root_nprocs(ncomps),stat=ier)
    if (ier /= 0) then
       call die(myname_, 'allocate(nprocs,...)',ier)
    endif
 endif
------

I typed 'unlimit', and it does not work. I also linked with the -heap-arrays option, and that does not work either. Any suggestions? Thanks.
 

eaton

CSEG and Liaisons
The only reason that -ntasks is an argument to configure (rather than just using -spmd) is that the cice model sets up its decomposition at build time.  So when using cice as the sea ice component, you should run the model with the same number of tasks that was specified via -ntasks when the executable was built.  I think that's the root of this problem.
 
Dear Eaton,
Now I can do mpirun on my own x86 machine with 8 processors. However, when I submit to Sun Grid Engine as a batch job, it has problems.

My configure:
$camcfg/configure -fc mpif90 -fc_type intel -cc mpicc -dyn fv -hgrid 10x15 -ntasks 4 -nosmp -test -v -debug

My submission:
qsub -pe scf_multicpu 4 -l arch=lx-amd64 test.sh

My test.sh is a one-line command:
/SPG_ops/utils/x86_64/sles11/mpich/mpich2-1.4.1p1-ifort/bin/mpirun /homedir/msun/cam5/work/bld/cam

The end of the results and the error message:
----------------------------
..............
 Attempting to read monthly vegetation data .....
 nstep =            0  month =            1  day =            1
 (GETFIL): attempting to find local file surfdata_10x15_simyr2000_c090928.nc
 (GETFIL): using
 /homedir/msun/cam5/data/lnd/clm2/surfdata/surfdata_10x15_simyr2000_c090928.nc
 Opened existing file
 /homedir/msun/cam5/data/lnd/clm2/surfdata/surfdata_10x15_simyr2000_c090928.nc
          37
 Successfully read monthly vegetation data for
 month            1
 
 dtime_sync=         1800  dtime_clm=         1800  mod =            0
(lnd_init_mct) :Atmospheric input is from a prognostic model
(seq_mct_drv) : Initialize rof component ROF
(rof_init_mct) :RTM land model initialization
 Read in rtm_inparm namelist from: rof_in
 define run:
    run type              = initial
 RTM will not be active
(seq_mct_drv) : Initialize ocn component OCN
(shr_file_setIO) file ocn_modelio.nml non existant
------ End ------

The error:
-----
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source            
cam                0000000004EA3561  Unknown               Unknown  Unknown
cam                000000000459E866  shr_mpi_mod_mp_sh        2064  shr_mpi_mod.F90
cam                0000000004675F2D  shr_sys_mod_mp_sh         278  shr_sys_mod.F90
cam                0000000004668DD0  shr_stream_mod_mp        2901  shr_stream_mod.F90
cam                0000000004630A92  shr_stream_mod_mp         608  shr_stream_mod.F90
cam                0000000004625E5D  shr_strdata_mod_m        1150  shr_strdata_mod.F90
cam                00000000014B857A  docn_comp_mod_mp_         275  docn_comp_mod.F90
cam                000000000333B7B3  ocn_comp_mct_mp_o          61  ocn_comp_mct.F90
cam                0000000000EA7912  ccsm_comp_mod_mp_        1051  ccsm_comp_mod.F90
cam                0000000000F07814  MAIN__                     90  ccsm_driver.F90
cam                000000000040033C  Unknown               Unknown  Unknown
cam                00000000050A67E4  Unknown               Unknown  Unknown
cam                0000000000400209  Unknown               Unknown  Unknown
------------------------

I appreciate your help. Thanks.
 

santos

Member
The segmentation fault is due to a bug in shr_mpi_mod that has been fixed in more recent versions. However, the bug is in the "abort" method, so your real problem is the fact that shr_stream_mod is aborting; I'm not sure whether that has anything to do with the ocn_modelio.nml warning.

If you change line 2064 of shr_mpi_mod.F90 to use "rc" instead of "rcode", you should at least get a better error message. In fact, I'm surprised that you didn't get one anyway, since the message should have been printed and flushed before the segfault.
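The pattern in question is roughly the following (an illustration of that class of bug, not the actual shr_mpi_mod source): an optional dummy argument is handed straight to MPI_ABORT, which can segfault when the caller never supplied it, whereas a local copy is always defined:

! illustration only -- names and the default code are made up, not CESM code
subroutine abort_example(string, rcode)
   implicit none
   include 'mpif.h'
   character(*), optional, intent(in) :: string
   integer,      optional, intent(in) :: rcode
   integer :: rc, ierr

   rc = 1001                          ! default error code
   if (present(rcode))  rc = rcode
   if (present(string)) write(*,*) trim(string)

   ! buggy form: call MPI_ABORT(MPI_COMM_WORLD, rcode, ierr)
   ! which touches rcode even when the caller omitted it
   call MPI_ABORT(MPI_COMM_WORLD, rc, ierr)
end subroutine abort_example

So if shr_stream_mod called the abort routine without an explicit return code, the abort itself would crash and hide the real error message.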
 
I made progress. Thanks for your help. Now I run into problems at line 1074, the call to ice_init_mct(). The results:

----------
  maskhalo_remap            =        T
  maskhalo_bound            =        T
  kstrength                 =        0
  krdg_partic               =        1
  krdg_redist               =        1
  advection                 =    remap
  shortwave                 =     dEdd
  albedo_type               =  default
  R_ice                     =        0.00
  R_pnd                     =        0.00
  R_snw                     =        1.75
  dT_mlt_in                 =        1.00
   rsnw_melt_in             =     1000.00
  albicev                   =        0.75
  albicei                   =        0.45
  albsnowv                  =        0.98
  albsnowi                  =        0.73
  heat_capacity             =        T
  atmbndy                   =  default
  fyear_init                =     1900
  ycycle                    =        1
  atm_data_type             = default
  calc_strair               =        T
  calc_Tsfc                 =        T
  Tfrzpt                    = linear_S
  update_ocn_f              =        F
  oceanmixed_ice            =        F
  sss_data_type             = default
  sst_data_type             = default
 
Diagnostic point 1: lat, lon =   90.00    0.00
Diagnostic point 2: lat, lon =  -65.00  -45.00
  tr_iage                   =        F
  restart_age               =        F
  tr_FY                     =        F
  restart_FY                =        F
  tr_lvl                    =        F
  restart_lvl               =        F
  tr_pond                   =        T
  restart_pond              =        F
  tr_aero                   =        F
  restart_aero              =        F
 

Domain Information

  Horizontal domain: nx =     24
                     ny =     19
  No. of categories: nc =      1
  No. of ice layers: ni =      4
  No. of snow layers:ns =      1
  Processors:  total    =      4
  Processor shape:       square-pop
  Distribution type:     roundrobin
  Distribution weight:     latitude
  {min,max}Blocks =            1     8
  Number of ghost cells:       1

CalcWorkPerBlock: Total blocks:    57 Ice blocks:    57 IceFree blocks:     0 Land blocks:     0
------------
The error message:

forrtl: severe (408): fort: (2): Subscript #2 of the array BLOCKINDEX has value 9 which is greater than the upper bound of 8

Image              PC                Routine            Line        Source            
cam                00000000050A12C1  Unknown               Unknown  Unknown
cam                00000000050A0295  Unknown               Unknown  Unknown
cam                000000000502F9BA  Unknown               Unknown  Unknown
cam                0000000004FC3B82  Unknown               Unknown  Unknown
cam                0000000004FC2A9E  Unknown               Unknown  Unknown
cam                0000000001BD3510  ice_distribution_         809  ice_distribution.F90
cam                0000000001BCD9EF  ice_distribution_         159  ice_distribution.F90
cam                0000000001C125F1  ice_domain_mp_ini         509  ice_domain.F90
cam                0000000001D835AD  ice_grid_mp_init_         254  ice_grid.F90
cam                00000000004EC765  cice_initmod_mp_c         106  CICE_InitMod.F90
cam                0000000001B42DFB  ice_comp_mct_mp_i         254  ice_comp_mct.F90
cam                0000000000EA8F55  ccsm_comp_mod_mp_        1075  ccsm_comp_mod.F90
cam                0000000000F087B4  MAIN__                     90  ccsm_driver.F90
cam                000000000040033C  Unknown               Unknown  Unknown
cam                00000000050AEA94  Unknown               Unknown  Unknown
cam                0000000000400209  Unknown               Unknown  Unknown
----------

I tried to print from subroutine ice_init_mct(), but nothing printed from this subroutine. I appreciate your continued support.
 
Yes, it looks like #9. However, this time I did exactly as you said in #10.

Configure:
$camcfg/configure -fc mpif90 -fc_type intel -cc mpicc -dyn fv -hgrid 10x15 -ntasks 4 -nosmp -test -v -debug

qsub run:
qsub -pe scf_multicpu 4 -l arch=lx-amd64 test.sh

And the question is: it runs OK if I use my local machine, like
mpirun -np 4 ../bld/cam
but not in SGE. However, I still suspect this relates more to how I set up the model than to my mpich installation. Thanks.
 
Now, after I set up the right mpich, the model runs well. The problem was due to confusion among the mpich installations on my machine, over which I have no control; I did not set up the environment correctly, so I compiled with one version of mpich but ran with another. Thank you all for your help. Cheers!
 