Errors when running CCSM on a Linux cluster

Hello,

I recently installed CCSM on a Gigabit Ethernet Linux cluster using mpich-1.2.5.2 and PGI 6.0-5. The model builds fine, but when I run it, I get the following output (each error is repeated many times; I've only pasted a few here):

-----------------------------------------------------------------
Terminated
recv-atm
p1_23528: p4_error: net_recv read: probable EOF on socket: 1
(cpl_contract_init) ice-send-cpl
p3_23574: p4_error: net_recv read: probable EOF on socket: 1
(cpl_contract_init) ice-send-cpl
p4_23597: p4_error: net_recv read: probable EOF on socket: 1
(cpl_contract_init) ice-send-cpl
p5_23620: p4_error: net_recv read: probable EOF on socket: 1
(cpl_contract_init) ice-send-cpl
p8_23689: p4_error: net_recv read: probable EOF on socket: 1
(cpl_comm_init) setting up communicators, name = ocn
===================================
mph attempted to call MPI_INIT
(cpl_comm_init) cpl_comm_comp, size: 137 24
(cpl_comm_init) comm world : comm,npe,pid 133 56 37
(cpl_comm_init) comm component: comm,npe,pid 137 24 21
(cpl_comm_init) comm world pe0: atm,ice,lnd,ocn,cpl,me 40 2 10 16 0 16
(cpl_comm_init) mph cid : atm,ice,lnd,ocn,cpl,me 1 2 3 4 5 4
(cpl_contract_init) ocn-send-cpl
p37_17996: p4_error: net_recv read: probable EOF on socket: 1
(cpl_comm_init) setting up communicators, name = ocn
===================================
...
p19_17600: p4_error: net_recv read: probable EOF on socket: 1
p41_18123: p4_error: net_recv read: probable EOF on socket: 1
p16_17534: p4_error: net_recv read: probable EOF on socket: 1
p45_18211: p4_error: net_recv read: probable EOF on socket: 1
p10_17402: p4_error: net_recv read: probable EOF on socket: 1
p12_17446: p4_error: net_recv read: probable EOF on socket: 1
p11_17424: p4_error: net_recv read: probable EOF on socket: 1

bm_list_23506: (97.578573) wakeup_slave: unable to interrupt slave 0 pid 23505
bm_list_23506: (97.579286) wakeup_slave: unable to interrupt slave 0 pid 23505
p2_23551: p4_error: net_recv read: probable EOF on socket: 1
Broken pipe
Broken pipe
Killing MPICH slave process, PID 23505
Killing MPICH slave process, PID 23506
--------------------------------------------------------------------------

It seems like the different model components are unable to communicate with the coupler.

Does anybody know what any of these errors mean? Do they indicate a problem in the model source code? An MPICH problem? Something else?
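Because each error repeats many times, it can help to collapse the run output into a short summary before reading it. The following is only a rough sketch in Python: the function name `summarize_p4_errors` is hypothetical, and it assumes that the number after `p` in each `pN_PID:` prefix is the MPICH process index. It counts each distinct p4_error message and lists the processes that reported it.

```python
import re
from collections import Counter

# Matches MPICH ch_p4 error lines such as
# "p1_23528: p4_error: net_recv read: probable EOF on socket: 1"
P4_ERROR = re.compile(r"^p(\d+)_(\d+):\s+p4_error:\s+(.*)$")


def summarize_p4_errors(lines):
    """Count each p4_error message and collect the process indices reporting it."""
    counts = Counter()
    procs = {}
    for line in lines:
        m = P4_ERROR.match(line.strip())
        if m is None:
            continue  # not a p4_error line (coupler output, "Broken pipe", etc.)
        index, _pid, message = m.groups()
        counts[message] += 1
        procs.setdefault(message, set()).add(int(index))
    return counts, procs


# A few lines copied from the output above.
sample = [
    "p1_23528: p4_error: net_recv read: probable EOF on socket: 1",
    "(cpl_contract_init) ice-send-cpl",
    "p3_23574: p4_error: net_recv read: probable EOF on socket: 1",
    "p37_17996: p4_error: net_recv read: probable EOF on socket: 1",
    "Broken pipe",
]

counts, procs = summarize_p4_errors(sample)
for message, n in counts.items():
    print(f"{n}x {message} (processes {sorted(procs[message])})")
```

To run it on a real log, the lines of the file can be passed in place of `sample`, for example `summarize_p4_errors(open(path))` with `path` pointing at the captured output.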

Thanks,
Cathy
 
I looked in the cpl.log file and found that the actual error CCSM is reporting is:

---------------------------------------------------------------------------
MCT::m_AttrVect::indexRA_:: ERROR--attribute not found: "afrac" Traceback:

(frac_set) ->MCT::m_AttrVect::indexRA_
MCT(MPEU)::m_List::clean_: deallocate(aList%...) error, stat =1
MCT(MPEU)::m_List::clean_: deallocate(aList%...) error, stat =1
MCT::m_AttrVect::indexRA_:: ERROR--attribute not found: "afrac" Traceback:

(cpl_bundle_mult) ->MCT::m_AttrVect::indexRA_
MCT(MPEU)::m_List::clean_: deallocate(aList%...) error, stat =1
MCT::m_AttrVect::indexRA_:: ERROR--attribute not found: "afrac" Traceback:

(cpl_bundle_mult) ->MCT::m_AttrVect::indexRA_
MCT(MPEU)::m_List::clean_: deallocate(aList%...) error, stat =1
MCT::m_AttrVect::indexRA_:: ERROR--attribute not found: "afrac" Traceback:

(cpl_bundle_mult) ->MCT::m_AttrVect::indexRA_
MCT(MPEU)::m_List::clean_: deallocate(aList%...) error, stat =1
MCT::m_AttrVect::indexRA_:: ERROR--attribute not found: "afrac" Traceback:

(cpl_bundle_mult) ->MCT::m_AttrVect::indexRA_
MCT(MPEU)::m_List::clean_: deallocate(aList%...) error, stat =1
(cpl_map_bun) WARNING: bundle aoflux_o has accum count = 0
MCT::m_AttrVect::indexRA_:: ERROR--attribute not found: "afrac" Traceback:

(cpl_bundle_mult) ->MCT::m_AttrVect::indexRA_
MCT(MPEU)::m_List::clean_: deallocate(aList%...) error, stat =1
--------------------------------------------------------------------

I'm sure nobody has seen errors like these before, but does anybody know what "afrac" is? Apparently CCSM cannot locate this attribute, but I can't figure out where it's coming from.
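To see which routines are asking for the missing attribute, the cpl.log messages can be grouped by attribute name and by the routine named in the traceback line that follows each error. This is a rough Python sketch, not part of CCSM or MCT: the function name `missing_attributes` is hypothetical, and it assumes every "attribute not found" line is followed by a traceback line of the form `(routine) ->...`, as in the excerpt above.

```python
import re
from collections import Counter

# 'MCT::m_AttrVect::indexRA_:: ERROR--attribute not found: "afrac" Traceback:'
ATTR_ERROR = re.compile(r'ERROR--attribute not found:\s*"([^"]+)"')
# '(cpl_bundle_mult) ->MCT::m_AttrVect::indexRA_'
CALLER = re.compile(r"^\((\w+)\)\s*->")


def missing_attributes(lines):
    """Pair each 'attribute not found' error with the routine in its traceback."""
    found = Counter()
    pending = None  # attribute name still waiting for its traceback line
    for line in lines:
        m = ATTR_ERROR.search(line)
        if m:
            pending = m.group(1)
            continue
        c = CALLER.match(line.strip())
        if c and pending is not None:
            found[(pending, c.group(1))] += 1
            pending = None
    return found


# A few lines copied from the cpl.log excerpt above.
sample = [
    'MCT::m_AttrVect::indexRA_:: ERROR--attribute not found: "afrac" Traceback:',
    "",
    "(frac_set) ->MCT::m_AttrVect::indexRA_",
    "MCT(MPEU)::m_List::clean_: deallocate(aList%...) error, stat =1",
    'MCT::m_AttrVect::indexRA_:: ERROR--attribute not found: "afrac" Traceback:',
    "",
    "(cpl_bundle_mult) ->MCT::m_AttrVect::indexRA_",
    "(cpl_map_bun) WARNING: bundle aoflux_o has accum count = 0",
    'MCT::m_AttrVect::indexRA_:: ERROR--attribute not found: "afrac" Traceback:',
    "",
    "(cpl_bundle_mult) ->MCT::m_AttrVect::indexRA_",
]

found = missing_attributes(sample)
for (attribute, routine), n in found.items():
    print(f'"{attribute}" not found in {routine}: {n} time(s)')
```

In the excerpt above, the first failure comes from `frac_set` and the later ones from `cpl_bundle_mult`, so the earliest caller is probably the place to start looking.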

Thanks,
Cathy
 