B1850 issue

pansah

Peter Ansah
New Member
The B1850 compset at resolution "f19_g17" fails with the UCX error below when I submit. I tried the QPC4 compset with --res f45_f45_mg37 and got no UCX error (at least after manually exporting UCX_TLS=ud,sm,self). Does this have anything to do with the resolution? I understand this may be a system issue, or maybe not, but someone must have encountered it before.

[1721139988.509109] [uagc20-12:212195:0] select.c:630 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable
[1721139988.509109] [uagc20-12:212196:0] select.c:630 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable
[1721139988.509189] [uagc21-03:158607:0] select.c:630 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable
[1721139988.509140] [uagc21-01:261909:0] select.c:630 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable
[1721139988.509401] [uagc21-04:56013:0] select.c:630 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable
[1721139988.509179] [uagc21-05:127821:0] select.c:630 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable
[1721139988.509343] [uagc21-02:2655724:0] select.c:630 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable
[1721139988.509188] [uagc21-03:158608:0] select.c:630 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable

MPIDI_OFI_mpi_init_hook(1602)....:
insert_addr_table_roots_only(451): OFI get address vector map failed
Abort(1615247) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(178)............:
MPID_Init(1532)..................:
MPIDI_OFI_mpi_init_hook(1602)....:
insert_addr_table_roots_only(451): OFI get address vector map failed
Abort(1615247) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(178)............:
MPID_Init(1532)..................:
MPIDI_OFI_mpi_init_hook(1602)....:
insert_addr_table_roots_only(451): OFI get address vector map failed
Abort(1615247) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(178)............:
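
For readers following along, the manual workaround mentioned in this post amounts to setting one environment variable before submitting. A minimal sketch, assuming a bash-like shell; whether these transports are available is machine-specific, and the submit step is omitted because it depends on the case:

```shell
# Restrict UCX to the transports named in the post above:
# ud (unreliable datagram), sm (shared memory), self (loopback).
export UCX_TLS=ud,sm,self

# Show the value that child processes (e.g. the job launcher) will inherit.
echo "UCX_TLS=${UCX_TLS}"
```

Depending on the batch system, the variable may need to be set inside the job environment rather than only in the login shell.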
 

fischer

CSEG and Liaisons
Staff member
Hi Peter,

The B1850 compset at f19_g17 uses significantly more memory than QPC4 at f45_f45_mg37. So if you're running with the same PE layout for both, you're probably running out of memory.

Chris
 

pansah

Peter Ansah
New Member
What is a PE layout, please? I did not specify any system requirements. The former was automatically assigned 1 node with 24 tasks/node, while the latter automatically took 6 nodes with 24 tasks/node.
 

fischer

CSEG and Liaisons
Staff member
Hi Peter,

PE layout is the processor element layout, such as the number of nodes and the number of tasks per node. This is set in config_pes.xml that I pointed out in the other thread. You can also change the number of nodes used by changing the values of NTASKS in config_pes.xml in your run directory. But when you change that, you'll need to rebuild. Having said that, I'm not sure if 6 nodes would be enough to run a B1850 compset.

Chris
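
As a rough illustration of the relationship described above, the sketch below converts a node count into a total task count, using the 24 tasks/node figure quoted in this thread. It ignores threads and per-component layouts, and it assumes a standard CIME case directory in which `./xmlchange` and `./case.setup` are available; the guarded block is skipped anywhere else.

```shell
#!/usr/bin/env bash
tasks_per_node=24   # figure quoted in this thread; machine-specific
nodes=6             # node count mentioned in this thread

# Total MPI tasks needed to fill that many nodes (threads ignored).
ntasks=$(( nodes * tasks_per_node ))
echo "nodes=${nodes} tasks_per_node=${tasks_per_node} -> NTASKS=${ntasks}"

# Only touch the case if the CIME case scripts are present in this directory.
if [ -x ./xmlchange ] && [ -x ./case.setup ]; then
    ./xmlchange NTASKS="${ntasks}"   # same task count for every component (a simplification)
    ./case.setup --reset             # regenerate the batch and PE settings
    # Rebuild afterwards (./case.build), as noted in the post above.
fi
```

Giving every component the same task count is only the simplest possible layout; a coupled compset such as B1850 may need a layout tuned per component.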
 

pansah

Peter Ansah
New Member
Okay. Thank you! Is it possible to run on just 1 node with whatever CPUs or processor tasks get automatically assigned?
 