How to deal with these mpi errors?

liushan@...
Hi,

I want to run CCSM3 on our machine under MPICH2. It compiled successfully, but I get MPI error messages when running it.

1) In the run script, I request 32 CPUs (using the PBS batch system). After starting the mpd daemons, I run "/mnt/storage-space/disk1/mpich/bin/mpiexec -l -n 2 $EXEROOT/all/cpl : -n 2 $EXEROOT/all/csim : -n 8 $EXEROOT/all/clm : -n 4 $EXEROOT/all/pop : -n 16 $EXEROOT/all/cam".
The job ends very quickly after I qsub it, with error messages like:
rank 5 in job 1 compute-0-10.local_46741 caused collective abort of all ranks
exit status of rank 5: return code 1
AND
14: Fatal error in MPI_Cart_shift: Invalid communicator, error stack:
14: MPI_Cart_shift(172): MPI_Cart_shift(MPI_COMM_NULL, direction=1, displ=1, source=0x2582aa0, dest=0x2582aa4) failed
14: MPI_Cart_shift(80).: Null communicator
15: Fatal error in MPI_Cart_shift: Invalid communicator, error stack:
15: MPI_Cart_shift(172): MPI_Cart_shift(MPI_COMM_NULL, direction=1, displ=1, source=0x2582aa0, dest=0x2582aa4) failed
5: Assertion failed in file helper_fns.c at line 337: 0
15: MPI_Cart_shift(80).: Null communicator
5: memcpy argument memory ranges overlap, dst_=0xf2c37f4 src_=0xf2c37f4 len_=4
9: Assertion failed in file helper_fns.c at line 337: 0
5:
9: memcpy argument memory ranges overlap, dst_=0x1880ce64 src_=0x1880ce64 len_=4
5: internal ABORT - process 5
9:
9: internal ABORT - process 9
4: Assertion failed in file helper_fns.c at line 337: 0
4: memcpy argument memory ranges overlap, dst_=0x1c9615d0 src_=0x1c9615d0 len_=4
4:
4: internal ABORT - process 4
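
Since "mpiexec -l" prefixes each output line with the rank, it may help to map the ranks in the log back to the five components. Below is a minimal sketch, assuming ranks are numbered consecutively in the order the executables appear on the mpiexec line (I have not confirmed this on our machine). The names LAYOUT, rank_ranges and component_of are only for illustration and are not part of CCSM3.

```python
# Layout copied from the mpiexec line in this post: (component, process count),
# in the order the executables appear on the command line.
LAYOUT = [("cpl", 2), ("csim", 2), ("clm", 8), ("pop", 4), ("cam", 16)]


def rank_ranges(layout):
    """Return {component: (first_rank, last_rank)}.

    Assumption: ranks are assigned consecutively in command-line order.
    """
    ranges = {}
    start = 0
    for name, count in layout:
        ranges[name] = (start, start + count - 1)
        start += count
    return ranges


def component_of(rank, layout):
    """Return the name of the component that owns a given global rank."""
    for name, (first, last) in rank_ranges(layout).items():
        if first <= rank <= last:
            return name
    raise ValueError(f"rank {rank} is outside the layout")


if __name__ == "__main__":
    total = sum(count for _, count in LAYOUT)
    print("total processes:", total)
    # Ranks that appear in the error log above.
    for rank in (4, 5, 9, 14, 15):
        print(rank, component_of(rank, LAYOUT))
```

Under that assumption, ranks 4, 5 and 9 (the helper_fns.c assertion) would belong to clm, and ranks 14 and 15 (the MPI_Cart_shift failure) to pop.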

I have been told that mixing different MPI implementations can cause the "Invalid communicator" error. I edited my .bashrc and Macros files and rebuilt the case from scratch after replacing MPICH1 with MPICH2, but the "Invalid communicator" error still appears. I am sure my CCSM3 Makefiles now point to the MPICH2 include files, and the library directories have also been changed to point to MPICH2. Why is this error still here?

2) What puzzles me is that if I remove any one of the five components (cpl, csim, clm, pop, cam), the model runs. For example, without "cpl", the command "/mnt/storage-space/disk1/mpich/bin/mpiexec -l -n 2 $EXEROOT/all/csim : -n 8 $EXEROOT/all/clm : -n 4 $EXEROOT/all/pop : -n 16 $EXEROOT/all/cam" is OK (I mean that in this case I can see these process names with the "top" command). But if I run all five at the same time, I cannot see these process names with "top", and the error messages shown above appear.

3) The resolution I use is T31_gx3v5. I requested two nodes (each with 16 GB of memory and 16 CPUs). Are these resources enough to run CCSM3?
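
To make the numbers in this question concrete, here is a small sketch of the arithmetic, assuming the 32 processes are spread evenly over the two nodes. The function name resource_summary is hypothetical. It only shows the per-process share and does not say whether that share is enough for T31_gx3v5.

```python
import math


def resource_summary(nodes, cpus_per_node, mem_per_node_gb, total_procs):
    """Summarize CPU slots and memory share, assuming an even spread of processes."""
    procs_per_node = math.ceil(total_procs / nodes)
    return {
        "total_slots": nodes * cpus_per_node,
        "procs_per_node": procs_per_node,
        "oversubscribed": procs_per_node > cpus_per_node,
        "mem_per_proc_gb": mem_per_node_gb / procs_per_node,
    }


if __name__ == "__main__":
    # 2 + 2 + 8 + 4 + 16 processes from the mpiexec line in this post.
    total_procs = 2 + 2 + 8 + 4 + 16
    print(resource_summary(nodes=2, cpus_per_node=16,
                           mem_per_node_gb=16, total_procs=total_procs))
```

With these inputs there is one process per CPU and about 1 GB of memory per process.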

Can anyone give some suggestions?

Thanks in advance!

L. S
