number of tasks problem

TCNasa · Jul 7, 2023

I'm trying to increase the number of tasks in CESM to run faster, but if I use any NTASKS greater than 12, the code fails like this:

[bn12:13502] *** An error occurred in MPI_Group_range_incl
[bn12:13502] *** reported by process [2537488385,7]
[bn12:13502] *** on communicator MPI_COMM_WORLD
[bn12:13502] *** MPI_ERR_RANK: invalid rank
[bn12:13502] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[bn12:13502] *** and potentially your MPI job)

Any idea where to look for the problem?

jedwards · Jul 10, 2023

What are the settings for MAX_TASKS_PER_NODE and MAX_MPITASKS_PER_NODE in config_machines.xml?

TCNasa · Jul 10, 2023

Both are set to 1 in that file, but I change them to 6 or sometimes 12 with the xmlchange command.

jedwards · Jul 10, 2023

I believe that this is the problem - what is the hardware? How many cpus per shared memory node?

TCNasa · Jul 10, 2023

I think there are 12 cpus per amd64 node on our system. Usually we have no problem running 12 tasks. We want to increase the number of tasks to speed up the code if possible.

TCNasa · Jul 10, 2023

I know there are about 10 different "sections" of CESM: atm, cpl, etc. They don't run simultaneously, do they? Usually we can run 12 tasks per component.

TCNasa · Jul 12, 2023

For some reason, my CESM keeps defaulting to just 12 tasks. Here is a message from the cesm log file:
(seq_comm_setcomm) init ID ( 1 GLOBAL ) pelist = 0 11 1 ( npes = 12) ( nthreads = 1)( suffix =)
[zf17:11678] *** An error occurred in MPI_Group_range_incl
[zf17:11678] *** reported by process [759234561,10]
[zf17:11678] *** on communicator MPI_COMM_WORLD
[zf17:11678] *** MPI_ERR_RANK: invalid rank
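
For readers following along: MPI_Group_range_incl takes (first, last, stride) triplets and reports MPI_ERR_RANK when a triplet names a rank that does not exist in the group. The log shows the global communicator with npes = 12, so one possible reading (an assumption, not confirmed in this thread) is that a later layout asks for ranks beyond 11 while only 12 processes were launched. A minimal Python sketch of that check follows; `invalid_ranks` is a hypothetical helper, not part of CESM or any MPI library, and the 0 to 23 range is made up for illustration.

```python
def invalid_ranks(ranges, group_size):
    """Return ranks named by (first, last, stride) triplets that lie outside 0..group_size-1."""
    bad = []
    for first, last, stride in ranges:
        if stride == 0:
            raise ValueError("stride must be non-zero")
        # The range is inclusive of `last`, so step one past it in the stride direction.
        stop = last + (1 if stride > 0 else -1)
        for rank in range(first, stop, stride):
            if not 0 <= rank < group_size:
                bad.append(rank)
    return bad


# The pelist from the log above: ranks 0 through 11, stride 1, with 12 processes.
print(invalid_ranks([(0, 11, 1)], group_size=12))

# Hypothetical layout asking for ranks 0 through 23 when only 12 processes exist.
print(len(invalid_ranks([(0, 23, 1)], group_size=12)))
```

The first call finds nothing wrong, which matches the GLOBAL line in the log; the second flags every rank from 12 upward.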

jedwards · Jul 12, 2023

Post the config_machines.xml and config_batch.xml files for your port and I'll see if I can spot the problem.
Also have you run an mpi hello world type program using more than 12 tasks and confirmed that it works?
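
A rough way to try the launch side of that suggestion without compiling anything is sketched below in Python. It is only a launcher smoke test, not a full MPI hello world: it never calls into MPI, it just reports what the launcher set in the environment. It assumes Open MPI, whose mpirun sets OMPI_COMM_WORLD_RANK and OMPI_COMM_WORLD_SIZE for each process; the file name `hello_launch.py` and the function name are hypothetical.

```python
import os
import socket


def describe_process(env=os.environ):
    """Report this process's rank and the job size as set by the launcher."""
    # These variables are set by Open MPI's mpirun. Other MPI libraries use
    # different names, hence the "unknown" fallback.
    rank = env.get("OMPI_COMM_WORLD_RANK", "unknown")
    size = env.get("OMPI_COMM_WORLD_SIZE", "unknown")
    return f"rank {rank} of {size} on {socket.gethostname()}"


if __name__ == "__main__":
    print(describe_process())
```

Running it as `mpirun -np 24 python3 hello_launch.py` (24 is just an example above 12) should print one line per process if the launcher can start that many; a failure here would point at the MPI or node configuration rather than at CESM.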

TCNasa · Jul 12, 2023

Here are the config files. They usually reside in cime/config/cesm/machines

jedwards · Jul 12, 2023

First - these values should both be 12 according to your description:

<MAX_TASKS_PER_NODE>1</MAX_TASKS_PER_NODE>
<MAX_MPITASKS_PER_NODE>1</MAX_MPITASKS_PER_NODE>

Then the entire mpirun section in config_machines.xml is commented out. It's not clear to me why this was done or what you expect to happen.
You also have batch system set to none in config_machines.xml - does your cluster not have a batch system?
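
The check described in this reply can be sketched in Python using only the standard library. The two elements are the ones quoted above; the surrounding `<machine>` wrapper, its MACH value, and the helper name `check_node_limits` are assumptions for illustration, and the expected value of 12 comes from the poster's description of the hardware rather than from any general rule.

```python
import xml.etree.ElementTree as ET

# Fragment holding the two values as posted. The <machine> wrapper and its
# MACH name are placeholders, not the poster's actual file.
FRAGMENT = """
<machine MACH="example">
  <MAX_TASKS_PER_NODE>1</MAX_TASKS_PER_NODE>
  <MAX_MPITASKS_PER_NODE>1</MAX_MPITASKS_PER_NODE>
</machine>
"""


def check_node_limits(xml_text, cores_per_node):
    """Return messages for per-node limits that differ from cores_per_node."""
    root = ET.fromstring(xml_text)
    problems = []
    for tag in ("MAX_TASKS_PER_NODE", "MAX_MPITASKS_PER_NODE"):
        elem = root.find(tag)
        if elem is None or elem.text is None:
            problems.append(f"{tag} is missing")
            continue
        value = int(elem.text.strip())
        if value != cores_per_node:
            problems.append(f"{tag} is {value}, expected {cores_per_node}")
    return problems


for line in check_node_limits(FRAGMENT, cores_per_node=12):
    print(line)
```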

TCNasa · Jul 12, 2023

I had to give them a .txt extension to post them

TCNasa · Jul 12, 2023

When I alter those values in config_machines.xml and uncomment the mpirun section I get a different problem:
--------------------------------------------------------------------------
Your job has requested more processes than the ppr for
this topology can support:

App: 1
Number of procs: 12
PPR: 6:node

Please revise the conflict and try again.
--------------------------------------------------------------------------

jedwards · Jul 12, 2023

It would seem to suggest that you have 6 procs per node, not 12.
Do you have a system administrator that you can consult?
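
The arithmetic behind that reading of the message can be sketched as follows. This is a minimal Python example that pulls the two numbers out of the text posted above and computes how many nodes the request would need at that processes-per-node limit; `parse_ppr_error` and `nodes_needed` are hypothetical helper names, and the regular expressions assume the message wording shown in this thread.

```python
import math
import re

# The relevant lines of the message as posted in the thread.
MESSAGE = """
App: 1
Number of procs: 12
PPR: 6:node
"""


def parse_ppr_error(message):
    """Extract (requested processes, processes per node) from the message text."""
    procs = re.search(r"Number of procs:\s*(\d+)", message)
    ppr = re.search(r"PPR:\s*(\d+):node", message)
    if procs is None or ppr is None:
        raise ValueError("message does not contain the expected fields")
    return int(procs.group(1)), int(ppr.group(1))


def nodes_needed(procs, ppr):
    """Smallest number of nodes that can hold `procs` processes at `ppr` per node."""
    return math.ceil(procs / ppr)


procs, ppr = parse_ppr_error(MESSAGE)
print(f"{procs} processes at {ppr} per node need {nodes_needed(procs, ppr)} nodes")
```

With the posted values, 12 processes at 6 per node need 2 nodes, so a single-node launch would be rejected under that limit.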

TCNasa · Jul 12, 2023

Yes, thanks for your efforts

TCNasa

Tom Caldwell

Member

jedwards

CSEG and Liaisons
