
MPIR_Group_check_valid_ranges(331) and PMPI_Group_range_incl fatal error

MarkR_UoLeeds

Mark Richardson
New Member
Hello, I am porting CESM2 (2.1.3, cime5.6) to a new platform; the main variation is that the job scheduler is SGE. After some success with config_batch.xml I have found that the CESM run fails. The build should be using Intel 19.0.6 and Intel MPI, and each node has 40 cores. I was running the smoke test SMS.f19_g17.X.arc4_intel.20210809_170336_jrngry, and I also tried scripts_regression_tests.py. I wonder if I should try OpenMPI or the MVAPICH2 that are also available.
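For reference, the test was generated roughly like this (a sketch; the exact create_test flags for cime5.6 are my assumption, and the test root matches the paths shown further down):

Bash:
# Sketch only: run from the cime/scripts directory of the CESM2 checkout.
# arc4_intel must already be defined in config_machines.xml / config_compilers.xml for the port.
cd cime/scripts
./create_test SMS.f19_g17.X.arc4_intel --test-root /nobackup/earmgr/cesm_sims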

In the "run" directory there is a log:
[earmgr@login2.arc4 run]$ more cesm.log.210809-172625
Invalid PIO rearranger comm max pend req (comp2io), 0
Resetting PIO rearranger comm max pend req (comp2io) to 64
PIO rearranger options:
comm type = p2p
comm fcd = 2denable
max pend req (comp2io) = 0
enable_hs (comp2io) = T
enable_isend (comp2io) = F
max pend req (io2comp) = 64
enable_hs (io2comp) = F
enable_isend (io2comp) = T
(seq_comm_setcomm) init ID ( 1 GLOBAL ) pelist = 0 0 1 ( npes = 1) ( nthreads = 1)( suffix =)
Abort(537497356) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Group_range_incl: Invalid argument, error stack:
PMPI_Group_range_incl(200)........: MPI_Group_range_incl(group=0x88000006, n=1, ranges=0x117c900, new_group=0x7fffe35d1214) failed
MPIR_Group_check_valid_ranges(331): The 0th element of a range array ends at 39 but must be nonnegative and less than 1
 

fischer

CSEG and Liaisons
Staff member
Hi Mark,

Doing a quick Google, it looks like you're trying to start a job with a different number of tasks than the model is expecting. You should be able to run ./preview_run to get more information on the total number of tasks you're running.

Chris
 

MarkR_UoLeeds

Mark Richardson
New Member
Hi Chris,
Here is the preview, which looks okay (1 node has 40 cores); I wonder if that is correct for SMS?
Additional info: the system admins have "wrapped" mpirun, so you have to be explicit in the SGE options about MPI placement.
#$ -l nodes=1,ppn=40,tpp=1
mpirun ./hello_world.exe
is what a standard hello world would look like if I want to run a fully populated node (a fuller sketch is below). Where do I find the submission script?
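A fuller sketch of what the admins expect for a fully populated node (the wallclock value is just a placeholder, and -cwd is my assumption):

Bash:
#!/bin/bash
# Sketch of a standard SGE submission on this system: one node, 40 MPI ranks, 1 thread each.
#$ -cwd
#$ -V
#$ -l h_rt=01:00:00
#$ -l nodes=1,ppn=40,tpp=1
mpirun ./hello_world.exe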


Bash:
[earmgr@login2.arc4 SMS.f19_g17.X.arc4_intel.20210809_170336_jrngry]$ ./preview_run
CASE INFO:
  nodes: 1
  total tasks: 40
  tasks per node: 40
  thread count: 1

BATCH INFO:
  FOR JOB: case.test
    ENV:
      module command is module unload openmpi
      module command is module load intel/19.0.4 intelmpi/2019.4.243 netcdf/4.6.3
      Setting Environment OMP_STACKSIZE=64M
      Setting Environment OMP_NUM_THREADS=1

    SUBMIT CMD:
      qsub -q 40core-192G.q -l h_rt=24:00:00 -v ARGS_FOR_SCRIPT='' .case.test

    MPIRUN (job=case.test):
      mpirun
                    -np
                    40
                 /nobackup/earmgr/cesm_sims/SMS.f19_g17.X.arc4_intel.20210809_170336_jrngry/bld/cesm.exe  >> cesm.log.$LID 2>&1
 

MarkR_UoLeeds

Mark Richardson
New Member
I think the error is in:
max pend req (io2comp) = 64

So I need to work out where to set "PIO_REARR_COMM_MAX_PEND_REQ_IO2COMP" or something similar (but of course I am still finding my way around CESM).
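If it does turn out to be a PIO setting, this is roughly how I would expect to inspect and change it from the case directory (a sketch; the variable names are my guess based on stock CIME, hence the xmlquery first to confirm what actually exists):

Bash:
# Sketch: list any PIO rearranger settings defined for this case, then adjust one.
./xmlquery --partial-match PIO_REARR
./xmlchange PIO_REARR_COMM_MAX_PEND_REQ_COMP2IO=64   # assumption: this is the comp2io knob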

Mark
 

fischer

CSEG and Liaisons
Staff member
The submission script is .case.test in your case directory. This file is rewritten every time you run case.submit, and you can change its settings by changing the settings in env_batch.xml.
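For most settings you shouldn't need to edit the generated script by hand; something like this from the case directory should work (a sketch with placeholder values; --subgroup limits the change to the case.test job, and my assumption is that case.setup --reset re-renders the batch scripts):

Bash:
# Sketch: adjust batch settings for the test job, then regenerate the batch scripts.
./xmlchange JOB_QUEUE=40core-192G.q --subgroup case.test
./xmlchange JOB_WALLCLOCK_TIME=02:00:00 --subgroup case.test
./case.setup --reset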
 

jedwards

CSEG and Liaisons
Staff member
There is no error in the PIO information; the problem is apparent in this line:
(seq_comm_setcomm) init ID ( 1 GLOBAL ) pelist = 0 0 1 ( npes = 1) ( nthreads = 1)( suffix =)

The mpi_init call is only seeing 1 task. What does the header section of .case.test look like?
 

MarkR_UoLeeds

Mark Richardson
New Member
#!/usr/bin/env python
#$ -N test.SMS.f19_g17.X.arc4_intel.20210809_170336_jrngry
#$ -V
#$ -l nodes=1,ppn=40,tpp=1
#$ -pe ib 40
#$ -l node_type=40core-192G

So there is a slight conflict between nodes set to 1 (meaning only use all of one node) and "-pe ib 40" (which I think tells SGE to use any available cores). I will remove the -pe option from config_batch.xml.
 

MarkR_UoLeeds

Mark Richardson
New Member
Just to close the discussion on this specific error (PIO): the mpirun command was malformed.

When I inspected the .case.run file I found the command had been spread over three lines, so the SGE batch scheduler was only picking up the first line. I fixed it in config_machines.xml.

So this line (78) had been incorrectly formatted (the current, correct version is shown):

75 <mpirun mpilib="default">
76 <executable>mpirun</executable>
77 <arguments>
78 <arg name="ntasks"> -np {{ total_tasks }} </arg>
79 </arguments>
80 </mpirun>
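A quick sanity check after the fix is to look at the rendered command again (a sketch; the expected shape is a single line, with the full path as in the earlier preview output):

Bash:
# Sketch: after correcting the <arg> entry, the rendered mpirun command should be one line.
./preview_run
# expected shape:
#   mpirun -np 40 <case_dir>/bld/cesm.exe >> cesm.log.$LID 2>&1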

I also had an extra directive in config_batch.xml, but now it is more appropriate for my parallel machine:

<directives>
<directive> -N {{ job_id }}</directive>
<directive> -V </directive>
<directive>-l nodes={{ num_nodes }},ppn={{ tasks_per_node }},tpp={{ thread_count }}</directive>
</directives>
 