
Problems with distributing tasks over multiple nodes

johanna_teresa

Johanna Malle
New Member
Hi,
After a recent port to our HPC system, I am having problems distributing tasks over multiple nodes. Everything works as long as I run on 1 node, but as soon as I request more, it stops working. Even though e.g. 5 nodes are requested and granted, the model seems to try to run all tasks on the same node and eventually fails with a memory error. I am using Open MPI and recently added the --oversubscribe option on the recommendation of our IT department. I guess that is why too many tasks are allocated to one node, but without it the run always fails with the message "There are not enough slots available in the system to satisfy the 100 slots that were requested by the application". I have also tried various PE layouts, but it always crashes at the same point; see the attached log/config files. If anyone has had a similar problem or sees a mistake in my configuration, please let me know. It would be greatly appreciated.
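To check where the tasks actually land, a small diagnostic along these lines might help. It is only a minimal sketch, assuming the output of something like `mpirun -np 100 hostname` (one hostname per line) is available as text. The function names `ranks_per_node` and `slots_shortfall`, the node names, and the figure of 36 slots per node are hypothetical and not taken from the attached files.

```python
from collections import Counter


def ranks_per_node(hostname_lines):
    """Count how many MPI tasks reported each hostname.

    `hostname_lines` is any iterable of strings, e.g. the lines printed
    by running `hostname` under mpirun. Blank lines are ignored.
    """
    return Counter(line.strip() for line in hostname_lines if line.strip())


def slots_shortfall(requested_tasks, slots_per_node, num_nodes):
    """Return how many requested tasks exceed the slots the launcher sees.

    A result of 0 means the request fits; a positive value is the number
    of tasks that would only fit with oversubscription.
    """
    return max(0, requested_tasks - slots_per_node * num_nodes)


if __name__ == "__main__":
    # Hypothetical output: every task reports the same node.
    sample = ["node01\n"] * 4 + ["\n"]
    for host, count in sorted(ranks_per_node(sample).items()):
        print(f"{host}: {count} tasks")

    # Hypothetical numbers: 100 tasks, 36 slots per node.
    print("shortfall with 1 visible node:", slots_shortfall(100, 36, 1))
    print("shortfall with 5 visible nodes:", slots_shortfall(100, 36, 5))
```

If every task reports the same hostname, that would suggest the launcher only sees one node's slots, which would be consistent with both the "not enough slots" message and the memory error once --oversubscribe is added.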
Cheers,
Johanna
 

Attachments

  • version.txt
    7.4 KB · Views: 2
  • files.zip
    6.5 KB · Views: 2