problem with running cam5.0

weili@psu_edu

New Member
Hi,

I am trying to run cam5.0 (fv 1.9x2.5) on an x86_64 Linux cluster (32 nodes, 8 processors/node, 16 GB memory/node). The model was configured with " $cfgdir/configure -dyn fv -hgrid 1.9x2.5 -ocn docn -ntask 8 -nosmp" and run on 2 nodes x 4 processors. The model stops running with the following message:

********
Attempting to read surface boundary data .....
(GETFIL): attempting to find local file surfdata_1.9x2.5_simyr2000_c091005.nc
(GETFIL): using
/mc1/s0/wwl5090/CAM4.0/inputdata/lnd/clm2/surfdata/surfdata_1.9x2.5_simyr2000_c091005.nc
Successfully read surface boundary data

rank 3 in job 6 mc1031.met.psu.edu_60200 caused collective abort of all ranks
exit status of rank 3: killed by signal 9
*********
Has anybody encountered a similar problem?

Thanks very much!
 

eaton

CSEG and Liaisons
This configuration runs for me on a similar Linux cluster. It appears to be a system problem with MPI.

To confirm that the problem is in the MPI system, try running on one node with just one task. The answers should be identical to a serial run. If that works, try 2 tasks. The answers from 1 and 2 tasks should be identical. If two tasks work on 1 node, then try it on 2 nodes.
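
If it helps to keep track of the steps, the escalation above can be sketched as a small Python helper. This is only an illustration under stated assumptions: `run_case` is a hypothetical callback that you would write yourself to launch the model on a given layout and return some digest of the answers (or None if the run aborted), `reference` is the digest from a serial run, and the last step assumes 2 tasks spread over 2 nodes. None of these names come from CAM or from the MPI launcher.

```python
from typing import Callable, Optional, Sequence, Tuple

# (nodes, tasks) layouts in the order suggested above.
# The final (2, 2) step assumes 2 tasks spread across 2 nodes.
LADDER: Sequence[Tuple[int, int]] = ((1, 1), (1, 2), (2, 2))


def first_failing_step(
    run_case: Callable[[int, int], Optional[str]],
    reference: str,
) -> Optional[Tuple[int, int]]:
    """Return the first (nodes, tasks) layout that fails, or None if all pass.

    run_case is a hypothetical, user-supplied callback: it runs the model on
    the given layout and returns a digest of the answers, or None if the run
    aborted. reference is the digest obtained from a serial run.
    """
    for nodes, tasks in LADDER:
        digest = run_case(nodes, tasks)
        # A layout fails if the run aborted or the answers differ
        # from the serial reference.
        if digest is None or digest != reference:
            return (nodes, tasks)
    return None
```

The first layout returned is the smallest one that breaks: a failure at (1, 1) points away from multi-node communication, while a failure that only appears at (2, 2) points toward the interconnect or the MPI setup between nodes.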
 

weili@psu_edu

New Member
Hi eaton,

I actually already tried one node with one task, and then more nodes with more tasks, but none of them worked. I am using mvapich2-1.2p1 and pgi-10.6. Does this matter?

Thanks.
 