
ERROR when porting CESM1.0 on a generic_linux_pgi

Hi all,
when porting CESM1.0 to our generic_linux_pgi machine, we get the following error at run time:

......
8 pes participating in computation for CLM

-----------------------------------

NODE# NAME
( 0) compute-0-40.local
( 1) cluster.sysu.edu.cn
( 2) cluster.sysu.edu.cn
( 3) cluster.sysu.edu.cn
( 4) cluster.sysu.edu.cn
( 5) cluster.sysu.edu.cn
( 6) cluster.sysu.edu.cn
( 7) cluster.sysu.edu.cn
Reading setup_nml
Reading grid_nml
Reading ice_nml
Reading tracer_nml
CalcWorkPerBlock: Total blocks: 8 Ice blocks: 8 IceFree blocks: 0 Land blocks: 0
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
p0_8804: (2263.718750) net_recv failed for fd = 5
p0_8804: p4_error: net_recv read, errno = : 110
p0_8804: (2277.730469) net_send: could not write to fd=4, errno = 32


Is this due to MPI?

We have:
1) compiled netCDF and MPICH with -fc=pgf90 -f77=pgf77 -cc=pgcc,
2) changed the three *.generic_linux_pgi files in $CESMROOT/scripts/ccsm_utils/Machines/,
3) set -max_tasks_per_node 8 and set all of NTASKS=8, NTHRDS=1, ROOTPE=0 in env_mach_pes.xml,
4) changed the PBS options to -l nodes=1:ppn=8 and -l walltime=48:00:00 (a sketch of steps 3 and 4 follows this list).
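
For reference, here is a minimal sketch of what steps 3 and 4 look like in practice. It assumes the per-component entries of CESM 1.0's env_mach_pes.xml (NTASKS_ATM, NTHRDS_ATM, ROOTPE_ATM, and so on for LND, ICE, OCN, CPL, GLC) and the xmlchange utility in the case directory; exact entry names may differ in your checkout.

# step 3: put every component on 8 MPI tasks, 1 thread, root PE 0
# (run from the case directory; NTASKS_*/NTHRDS_*/ROOTPE_* are assumed entry names)
for comp in ATM LND ICE OCN CPL GLC; do
  ./xmlchange -file env_mach_pes.xml -id NTASKS_$comp -val 8
  ./xmlchange -file env_mach_pes.xml -id NTHRDS_$comp -val 1
  ./xmlchange -file env_mach_pes.xml -id ROOTPE_$comp -val 0
done

# step 4: PBS directives in the batch/run script -- one node, 8 cores
#PBS -l nodes=1:ppn=8
#PBS -l walltime=48:00:00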

Any help would be appreciated.

leo

eaton

CSEG and Liaisons
It does look like an MPI problem. What looks strange is that task 0 is assigned to a different node (compute-0-40.local) than tasks 1-7 (cluster.sysu.edu.cn). Is that right? It looks like the job launcher (mpiexec or mpirun) is not creating a valid list of hosts, since your PBS resource request is for 8 tasks on 1 node.
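
One way to check this (a sketch, assuming an MPICH1-style mpirun launched from inside the PBS job; the executable name ccsm.exe is the CESM1 default and may differ on your port):

# print the host list PBS hands the job; with -l nodes=1:ppn=8 all 8 lines
# should name the same node
cat $PBS_NODEFILE

# launch with an explicit machine file so the launcher cannot fall back to a
# stale or system-wide default host list
mpirun -np 8 -machinefile $PBS_NODEFILE ./ccsm.exe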
 