
ERROR when porting CESM1.0 on a generic_linux_pgi

Hi all,
when porting CESM1.0 to our generic_linux_pgi machine, we get the following error at run time:

......
8 pes participating in computation for CLM

-----------------------------------

NODE# NAME
( 0) compute-0-40.local
( 1) cluster.sysu.edu.cn
( 2) cluster.sysu.edu.cn
( 3) cluster.sysu.edu.cn
( 4) cluster.sysu.edu.cn
( 5) cluster.sysu.edu.cn
( 6) cluster.sysu.edu.cn
( 7) cluster.sysu.edu.cn
Reading setup_nml
Reading grid_nml
Reading ice_nml
Reading tracer_nml
CalcWorkPerBlock: Total blocks: 8 Ice blocks: 8 IceFree blocks: 0 Land blocks: 0
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
p0_8804: (2263.718750) net_recv failed for fd = 5
p0_8804: p4_error: net_recv read, errno = : 110
p0_8804: (2277.730469) net_send: could not write to fd=4, errno = 32


Is this due to MPI?

We have:
1) compiled netCDF and MPICH with -fc=pgf90 -f77=pgf77 -cc=pgcc,
2) changed the three *.generic_linux_pgi files in $CESMROOT/scripts/ccsm_utils/Machines/,
3) set -max_tasks_per_node 8 and set all of NTASKS=8, NTHRDS=1, ROOTPE=0 in env_mach_pes.xml,
4) changed the PBS options to -l nodes=1:ppn=8 and -l walltime=48:00:00 (a sketch of steps 3 and 4 follows this list).
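
For reference, here is a minimal sketch of what steps 3 and 4 look like in practice. It assumes the per-component entries of CESM 1.0's env_mach_pes.xml (NTASKS_ATM, NTHRDS_ATM, ROOTPE_ATM, and so on for LND, ICE, OCN, CPL, GLC) and the xmlchange utility in the case directory; exact entry names may differ in your checkout.

# step 3: put every component on 8 MPI tasks, 1 thread, root PE 0
# (run from the case directory; NTASKS_*/NTHRDS_*/ROOTPE_* are assumed entry names)
for comp in ATM LND ICE OCN CPL GLC; do
  ./xmlchange -file env_mach_pes.xml -id NTASKS_$comp -val 8
  ./xmlchange -file env_mach_pes.xml -id NTHRDS_$comp -val 1
  ./xmlchange -file env_mach_pes.xml -id ROOTPE_$comp -val 0
done

# step 4: PBS directives in the batch/run script -- one node, 8 cores
#PBS -l nodes=1:ppn=8
#PBS -l walltime=48:00:00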

Any help would be appreciated.

leo

eaton

CSEG and Liaisons
It does look like an MPI problem. What looks strange is that task 0 is assigned to a different node (compute-0-40.local) than tasks 1-7 (cluster.sysu.edu.cn). Is that right? It looks like the job launcher (mpiexec or mpirun) is not creating a valid list of hosts, since your PBS resource request is for 8 tasks on 1 node.
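
One way to check this (a sketch, assuming an MPICH1-style mpirun launched from inside the PBS job; the executable name ccsm.exe is the CESM1 default and may differ on your port):

# print the host list PBS hands the job; with -l nodes=1:ppn=8 all 8 lines
# should name the same node
cat $PBS_NODEFILE

# launch with an explicit machine file so the launcher cannot fall back to a
# stale or system-wide default host list
mpirun -np 8 -machinefile $PBS_NODEFILE ./ccsm.exe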
 