
Broken pipe error on a Linux cluster

I am running CAM on a Linux cluster using MPICH version 1.2.6. The model configures and compiles fine, but when I try to run it, it crashes. The end of the output file looks like this:

-----------------------------------------
Number of lats passed north & south = 3
Node Partition Extended Partition
-----------------------------------------
0 1- 32 -2- 35
1 33- 64 30- 67
procid 0 assigned 473 spectral coefficients and
21 m values: 1 5 9 13
17 21 25 29 33 37
41 4 8 12 16 20
24 28 32 36 40
procid 1 assigned 473 spectral coefficients and
22 m values: 2 6 10 14
18 22 26 30 34 38
42 3 7 11 15 19
23 27 31 35 39 43
SPMDBUF: Allocating SPMD buffers of size 2387984
**** Summary of Logical Unit assignments ****

Restart pointer unit (nsds) = 1
Master restart unit (nrg) = 2
Abs/ems unit for restart (nrg2) = 3
History restart unit (luhrest) = 4
p0_24535: (0.189976) net_send: could not write to fd=4, errno = 32
p4_error: latest msg from perror: Broken pipe
p0_24535: p4_error: net_send write: -1
p0_24535: (2.195281) net_send: could not write to fd=4, errno = 32
-------------------------------------------------------------------------------------
Has anybody seen these errors before? If so, what do they mean, and are they fixable?

Thanks,
Cathy
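For anyone decoding the log above: errno 32 on Linux is EPIPE ("Broken pipe"), meaning one process tried to write to a socket whose peer had already gone away — so the rank that prints the error is usually not the rank that actually failed first. The number can be decoded with Python's standard errno and os modules (shown purely for illustration):

```python
import errno
import os

# errno 32 from the net_send message corresponds to EPIPE:
# the remote end of the socket/pipe closed before the write completed.
code = 32
print(errno.errorcode[code])  # prints "EPIPE"
print(os.strerror(code))      # prints "Broken pipe"
```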

jmccaa

Hi Cathy,

MPI problems can be difficult to debug. One reason for the model dying immediately is a lack of memory for the MPI communications.
Do you have the environment variable P4_GLOBMEMSIZE set to a large value? I have it set as follows:
setenv P4_GLOBMEMSIZE 16777216

I don't know exactly what this does, but I have found it to be a prerequisite for the model to run on our clusters. You have to put this either in your submission script or in your .cshrc file so it gets picked up by the model.
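If the cluster's batch scripts use a Bourne shell (sh/bash) rather than csh, the equivalent of Jim's setenv line looks like this. A minimal sketch: the 16777216 value (16 MiB) is taken from Jim's setting, and whether it is large enough will depend on the run size.

```shell
# bash/sh equivalent of `setenv P4_GLOBMEMSIZE 16777216` for a submission
# script. P4_GLOBMEMSIZE is an MPICH-1 p4-device setting; here it is simply
# exported so child processes (e.g. the model executable) inherit it.
export P4_GLOBMEMSIZE=16777216

# Verify the variable is visible to child processes:
sh -c 'echo "P4_GLOBMEMSIZE=$P4_GLOBMEMSIZE"'
```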

Jim