Hi,
After a recent port to our HPC system, I am having problems distributing tasks over multiple nodes. Everything works as long as I run on a single node, but as soon as I request more, the run fails. Even though, e.g., 5 nodes are requested and granted, the model appears to run all tasks on the same node and eventually fails with a memory error. I am using Open MPI and recently added the --oversubscribe option on the recommendation of our IT department (which I guess is why too many tasks are allocated to one node). Without it, the run always fails with the message "There are not enough slots available in the system to satisfy the 100 slots that were requested by the application". I have also tried various PE layouts, but it always crashes at the same point; see the attached log/config files. If anyone has had a similar problem or sees a mistake in my configuration, please let me know. It would be greatly appreciated.
Cheers,
Johanna