how to run ccsm3 under mpich 1.2.7

Hi everyone,

Does anyone know how to set up CCSM3 to run under MPICH version 1.2.7?

In my run script I wrote "/opt/pgi/linux86-64/7.1/mpi/mpich/bin/mpirun -p4pg $EXEROOT/mpirun.pgfile ./$COMPONENTS[1]", but something seems to be wrong with it.

I got error messages like these. ccsm.o contains:

p23_27852: p4_error: : 197
rm_l_23_27871: (1.363281) net_send: could not write to fd=5, errno = 32
(cpl_comm_init) cpl_comm_comp, size: 137 4
(cpl_comm_init) comm world : comm,npe,pid 133 32 22
(cpl_comm_init) comm component: comm,npe,pid 137 4 2
(cpl_comm_init) comm world pe0: atm,ice,lnd,ocn,cpl,me 8 1 2 6 0 6
(cpl_comm_init) mph cid : atm,ice,lnd,ocn,cpl,me 1 2 3 4 5 4

p6_25585: p4_error: net_recv read: probable EOF on socket: 1
P4 procgroup file is /mnt/storage-space/disk1/lius/exe/TDB .01a .T31_gx3v5.B.jazz.205822/mpirun.pgfile.


And ccsm.e contains:

23 - MPI_CART_SHIFT : Null communicator
[23] Aborting program !
[23] Aborting program!
PGFIO/stdio: Bad file descriptor
PGFIO-F-/OPEN/unit=10/error code returned by host stdio - 9.
File name = mph_processors_map.in formatted, sequential access record = 8
In source file /mnt/storage-space/disk1/lius/ccsmroot/ccsm3/models/csm_share/shr/shr_msg_mod.F90, at line number 102
22 - MPI_CART_SHIFT : Null communicator
[22] Aborting program !
[22] Aborting program!



Here is my system information:

Machine: Intel Xeon cluster with Linux, 16 CPUs per node
MPI: MPICH 1.2.7, using mpirun
PGI: 7.1
Batch: PBS
Network: Gbit Ethernet



Can anyone give some advice?

Thanks.



Liu. S
 
Hi liushanoh/gingko

Suggestions:

1) Upgrade to OpenMPI or to MPICH2.
MPICH1 (mpich-1.2.7) is too old, is no longer maintained,
and doesn't work on current Linux kernels,
which is why you see the "p4" errors.

http://www.open-mpi.org/
http://www.mcs.anl.gov/research/projects/mpich2/

2) Register on the CGD Bulletin Board,
and move your questions to that forum:

http://bb.cgd.ucar.edu/

The CGD BB is active, and you can get more answers and advice there.

This mailing list is dead.
It should already have been shut down.
Nobody answers it ... well, maybe only one person does ... :)

I hope this helps.
Gus Correa
 
Hi liushanoh/gingko

The CGD forum is the right place to ask questions,
even if they are not very responsive during the holidays.
You can post these questions on the CCSM3 general setup or
software engineering forums.
Also, search their archive.
You may find that somebody had the same problem, with a solution.

"Invalid communicator" can
happen if you mix two types of MPI (e.g., MPICH1 and MPICH2).
Make sure the MPI include directories in your Makefiles
all point to the *same* MPICH2 include directory.
Likewise for the MPI libraries.
If you hardwired any mpi.h into your source files,
it must also point
to the same MPICH2 mpi.h.
(If you use the mpif90 and mpicc compiler wrappers,
they should be able to find the right
include files, as long as your Makefiles
are not pointing in the wrong direction.)
And compile everything fresh, from scratch,
so that no leftover object files or modules get mixed in.
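
One way to check for a mixed build is to list which MPI include directories the build files actually reference, and which old objects are still lying around in the build tree. The following is only a minimal sketch, not part of the CCSM3 scripts: the function names list_mpi_includes and find_stale_objects are hypothetical, and it assumes the include paths are passed with -I and contain the string "mpi".

```shell
#!/bin/sh
# Hypothetical helpers; not part of CCSM3 or MPICH.

# Print the distinct -I include directories whose path contains "mpi"
# (any case) in the given build files. Seeing more than one directory
# suggests that two MPI installations are being mixed.
list_mpi_includes() {
    grep -ho -e '-I *[^ ]*[Mm][Pp][Ii][^ ]*' "$@" 2>/dev/null |
        sed 's/^-I *//' |
        sort -u
}

# Print leftover object and Fortran module files under a build directory,
# so they can be reviewed and deleted before a clean rebuild.
find_stale_objects() {
    find "$1" -type f \( -name '*.o' -o -name '*.mod' \) | sort
}
```

If list_mpi_includes prints more than one directory for your Makefiles, that is a hint that MPICH1 and MPICH2 paths are both in play. With the MPICH compiler wrappers, "mpif90 -show" and "mpicc -show" print the underlying compile command without running it, which is another way to see the include and library paths in use.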

It may be tricky to change the limits (limit stacksize unlimited)
on your mpirun command.
It is probably easier to ask your system administrator to change the
limits on your compute nodes directly.
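
If changing the node limits is not an option, one commonly used workaround is to start each MPI rank through a small wrapper that raises the soft stack limit before launching the model executable. The sketch below is an assumption about how that could look in sh (the equivalent of "limit stacksize unlimited" in csh), not something taken from the CCSM3 scripts. The function name run_with_unlimited_stack is hypothetical, and the soft limit can only be raised up to the hard limit already configured on the compute node.

```shell
#!/bin/sh
# Hypothetical wrapper; not part of CCSM3 or MPICH.
# Tries to set the soft stack limit to unlimited, then runs the given command.
run_with_unlimited_stack() {
    (
        # Subshell: the caller's own limits stay unchanged.
        # If the hard limit forbids "unlimited", warn and carry on.
        ulimit -s unlimited 2>/dev/null ||
            echo "warning: could not raise stack limit" >&2
        exec "$@"
    )
}
```

Saved in a script whose last line is run_with_unlimited_stack "$@", this could be placed between the launcher and the executable, for example "mpiexec -n 4 ./stack_wrapper.sh ./model.exe" (both file names are placeholders). If the warning appears, the hard limit is the obstacle and only the system administrator can lift it.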

I am not sure about the "t_setoption" message.
It may come from a routine that is part of ESMF.
It's been a while since I compiled CCSM3,
so I don't remember all details.

Please move your questions to the CGD forum.
Good luck.

Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------


s l wrote:
> Hi Correa,
>
> Thanks for your suggestions. Just as you said, only you replied to me. Hehe :)
> I put my questions on the CGD forum, but so far there are no replies.
>
> I want to ask you two questions:
>
> 1) I have upgraded to MPICH2, and there is no "p4 error" any more. However, there are still some problems.
>
> ccsm.e contains:
> 14: Fatal error in MPI_Cart_shift: Invalid communicator, error stack:
> 15: Fatal error in MPI_Cart_shift: Invalid communicator, error stack:
> 14: MPI_Cart_shift(172): MPI_Cart_shift(MPI_COMM_NULL, direction=1, displ=1, source=0x2582aa0, dest=0x2582aa4) failed
> 14: MPI_Cart_shift(80).: Null communicator
> 15: MPI_Cart_shift(172): MPI_Cart_shift(MPI_COMM_NULL, direction=1, displ=1, source=0x2582aa0, dest=0x2582aa4) failed
> 15: MPI_Cart_shift(80).: Null communicator
> And ccsm.o contains error messages like
> "rank 15 in job 1 compute-0-18.local_35674 caused collective abort of all ranks
> exit status of rank 15: return code 1 "
>
> I have added "limit stacksize unlimited" to the run script, but "error stack" still appears. Could you give some advice on how to solve it?
>
> 2) In the ccsm.o file, the message "t_setoption: option disabled: Usr Sys " appears many times. Is it an error message? If it is, how can I deal with it?
>
> Thanks a lot! :)
>
> Liu. S
 