
Runtime error in CESM2: E compset with a slab ocean model for year 1850

Dear All, I am trying to run CESM2 with the user compset "1850_CAM60%CCTS_CLM45%SP_CICE_DOCN%SOM_MOSART_SGLC_SWAV_TEST". I am able to build this compset at one-degree resolution without any error messages. However, the simulation is killed automatically about 2 minutes after it starts running, with the following error:

"Attaching 27164 to 1202607.pbshpc
Attaching 21343 to 1202607.pbshpc
Attaching 35481 to 1202607.pbshpc
Attaching 5385 to 1202607.pbshpc
Attaching 40470 to 1202607.pbshpc
Attaching 15478 to 1202607.pbshpc
Attaching 27936 to 1202607.pbshpc
Attaching 2585 to 1202607.pbshpc
Attaching 34965 to 1202607.pbshpc
Attaching 35824 to 1202607.pbshpc
Attaching 32387 to 1202607.pbshpc
Attaching 17541 to 1202607.pbshpc
Attaching 5050 to 1202607.pbshpc
Attaching 25202 to 1202607.pbshpc
Attaching 15207 to 1202607.pbshpc
Attaching 12545 to 1202607.pbshpc
Attaching 6784 to 1202607.pbshpc
Attaching 40250 to 1202607.pbshpc
Attaching 40277 to 1202607.pbshpc
Attaching 2618 to 1202607.pbshpc
Attaching 24877 to 1202607.pbshpc
Attaching 18602 to 1202607.pbshpc
Attaching 24686 to 1202607.pbshpc
Invalid PIO rearranger comm max pend req (comp2io), 0
Resetting PIO rearranger comm max pend req (comp2io) to 64
Resetting PIO rearranger comm max pend req (comp2io) to 64
PIO rearranger options:
comm type =
p2p
comm fcd =
2denable
comm fcd =
2denable
max pend req (comp2io) = 0
enable_hs (comp2io) = T
enable_isend (comp2io) = F
max pend req (io2comp) = 64
enable_hs (io2comp) = F
enable_isend (io2comp) = T
(seq_comm_setcomm) init ID ( 1 GLOBAL ) pelist = 0 479 1 ( npes = 480) ( nthreads = 1)( suffix =)
(seq_comm_setcomm) init ID ( 2 CPL ) pelist = 0 479 1 ( npes = 480) ( nthreads = 1)( suffix =)
(seq_comm_setcomm) init ID ( 5 ATM ) pelist = 0 479 1 ( npes = 480) ( nthreads = 1)( suffix =)
(seq_comm_joincomm) init ID ( 6 CPLATM ) join IDs = 2 5 ( npes = 480) ( nthreads = 1)
(seq_comm_jcommarr) init ID ( 3 ALLATMID ) join multiple comp IDs ( npes = 480) ( nthreads = 1)
(seq_comm_joincomm) init ID ( 4 CPLALLATMID ) join IDs = 2 3 ( npes = 480) ( nthreads = 1)
(seq_comm_setcomm) init ID ( 9 LND ) pelist = 0 479 1 ( npes = 480) ( nthreads = 1)( suffix =)
(seq_comm_joincomm) init ID ( 10 CPLLND ) join IDs = 2 9 ( npes = 480) ( nthreads = 1)
(seq_comm_jcommarr) init ID ( 7 ALLLNDID ) join multiple comp IDs ( npes = 480) ( nthreads = 1)
(seq_comm_joincomm) init ID ( 8 CPLALLLNDID ) join IDs = 2 7 ( npes = 480) ( nthreads = 1)
Fatal error in PMPI_Group_range_incl: Invalid argument, error stack:
PMPI_Group_range_incl(213)........: MPI_Group_range_incl(group=0x88000000, n=1, ranges=0x65eec90, new_group=0x7ffee46b0da8) failed
MPIR_Group_check_valid_ranges(326): The 0th element of a range array ends at 559 but must be nonnegative and less than 480
Fatal error in PMPI_Group_range_incl: Invalid argument, error stack:
PMPI_Group_range_incl(213)........: MPI_Group_range_incl(group=0x88000000, n=1, ranges=0x65eec90, new_group=0x7ffe50377a28) failed
MPIR_Group_check_valid_ranges(326): The 0th element of a range array ends at 559 but must be nonnegative and less than 480
Fatal error in PMPI_Group_range_incl: Invalid argument, error stack:
PMPI_Group_range_incl(213)........: MPI_Group_range_incl(group=0x88000000, n=1, ranges=0x65eec90, new_group=0x7ffcd44e8128) failed
MPIR_Group_check_valid_ranges(326): The 0th element of a range array ends at 559 but must be nonnegative and less than 480
Fatal error in PMPI_Group_range_incl: Invalid argument, error stack:
PMPI_Group_range_incl(213)........: MPI_Group_range_incl(group=0x88000000, n=1, ranges=0x65eec90, new_group=0x7ffcfeaa4528) failed
MPIR_Group_check_valid_ranges(326): The 0th element of a range array ends at 559 but must be nonnegative and less than 480
Fatal error in PMPI_Group_range_incl: Invalid argument, error stack:
PMPI_Group_range_incl(213)........: MPI_Group_range_incl(group=0x88000000, n=1, ranges=0x65eec90, new_group=0x7ffe3af4f5a8) failed
MPIR_Group_check_valid_ranges(326): The 0th element of a range array ends at 559 but must be nonnegative and less than 480
Fatal error in PMPI_Group_range_incl: Invalid argument, error stack:
PMPI_Group_range_incl(213)........: MPI_Group_range_incl(group=0x88000000, n=1, ranges=0x65eec90, new_group=0x7fff94e14928) failed
MPIR_Group_check_valid_ranges(326): The 0th element of a range array ends at 559 but must be nonnegative and less than 480
Fatal error in PMPI_Group_range_incl: Invalid argument, error stack:
PMPI_Group_range_incl(213)........: MPI_Group_range_incl(group=0x88000000, n=1, ranges=0x65eec90, new_group=0x7ffd89e3ab28) failed
MPIR_Group_check_valid_ranges(326): The 0th element of a range array ends at 559 but must be nonnegative and less than 480
Fatal error in PMPI_Group_range_incl: Invalid argument, error stack:
PMPI_Group_range_incl(213)........: MPI_Group_range_incl(group=0x88000000, n=1, ranges=0x65eec90, new_group=0x7fff94e14928) failed
MPIR_Group_check_valid_ranges(326): The 0th element of a range array ends at 559 but must be nonnegative and less than 480
Fatal error in PMPI_Group_range_incl: Invalid argument, error stack:
PMPI_Group_range_incl(213)........: MPI_Group_range_incl(group=0x88000000, n=1, ranges=0x65eec90, new_group=0x7ffd89e3ab28) failed
MPIR_Group_check_valid_ranges(326): The 0th element of a range array ends at 559 but must be nonnegative and less than 480
Fatal error in PMPI_Group_range_incl: Invalid argument, error stack:
PMPI_Group_range_incl(213)........: MPI_Group_range_incl(group=0x88000000, n=1, ranges=0x65eec90, new_group=0x7fff94e14928) failed
MPIR_Group_check_valid_ranges(326): The 0th element of a range array ends at 559 but must be nonnegative and less than 480
Fatal error in PMPI_Group_range_incl: Invalid argument, error stack:
PMPI_Group_range_incl(213)........: MPI_Group_range_incl(group=0x88000000, n=1, ranges=0x65eec90, new_group=0x7ffd89e3ab28) failed
MPIR_Group_check_valid_ranges(326): The 0th element of a range array ends at 559 but must be nonnegative and less than 480
Fatal error in PMPI_Group_range_incl: Invalid argument, error stack:
.........."
I am also attaching my env_mach_pes.xml for perusal.
I sincerely thank you for your help.

Best,
Pawan Vats
 

Attachments

  • env_mach_pes.txt
    6.9 KB · Views: 15

nusbaume

Jesse Nusbaumer
CSEG and Liaisons
Staff member
Hi Pawan,

Are you attempting this run on Cheyenne, or has this version of CESM been ported to a different machine? If using a ported version, please provide the information described here:


Or, at the very least, provide information on the compiler and libraries (e.g., MPI and NetCDF) you are using. I am also moving this thread to the "Infrastructure" forum, as its members are likely to be much better at debugging MPI issues than I am.

Thanks, and have a great day!

Jesse
 
Dear Jesse,
Thanks for your reply. I am running CESM on the PUDAM machine, not on Cheyenne. I am attaching several files with the details: the describe_version output, the steps starting from the create_newcase command, the changes I made to the XML files (via xmlchange), the config_compilers.xml, config_machines.xml, and config_batch.xml files, and the machine details. Please let me know if you need any further information.
 

Attachments

  • describe_Step_for_running_CESM.pdf
    58.4 KB · Views: 5
  • dependencies versions & installation procedure.txt
    2.9 KB · Views: 2
  • config_batch.txt
    4 KB · Views: 5
  • config_compilers.txt
    6.4 KB · Views: 8
  • config_machines.txt
    2.6 KB · Views: 10
  • manage_externals.log.txt
    29.9 KB · Views: 3

nusbaume

Jesse Nusbaumer
CSEG and Liaisons
Staff member
Hi Pawan,

Thanks for sending along the files! I am going to try and bring @jedwards into this conversation, as he is our resident porting expert, and will likely be more helpful than I am.

Jesse
 

jedwards

CSEG and Liaisons
Staff member
Hi Pawan,

It appears from your env_mach_pes.xml file that you are trying to run on 192 PEs over 4 nodes of your system, with 48 tasks per node.
However, the output log suggests that you are running on 480 tasks. What is the result of running ./preview_run in your case directory?
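For reference, the layout the case will actually use can also be checked quickly from the case directory with the standard CIME tools (./pelayout, if available in your CIME version, prints a compact summary):

./xmlquery NTASKS,ROOTPE,NTHRDS
./pelayout
./preview_run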
 
Dear Jedwards,
Thanks for your reply. Please find the output of the ./preview_run command below; I am also attaching the env_mach_pes.xml file.

./preview_run
CASE INFO:
nodes: 32
total tasks: 640
tasks per node: 20
thread count: 1
BATCH INFO:
FOR JOB: case.run
ENV:
module command is /usr/share/Modules/3.2.10/bin/modulecmd python purge
module command is /usr/share/Modules/3.2.10/bin/modulecmd python load apps/CESM/dep/impi2015
Setting Environment NETCDF=/home/soft/centOS/apps/wrf/impi20152
Setting Environment PNETCDF=/home/soft/centOS/apps/wrf/impi20152
Setting Environment HDF5=/home/soft/centOS/apps/wrf/impi20152
Setting Environment PHDF5=/home/soft/centOS/apps/wrf/impi20152
Setting Environment MPIROOT=/home/soft/intel2015//impi/5.0.3.048/intel64
Setting Environment CIMEROOT=/home/apps/centos7/CESM/cesm2/cesm2.1.1/cime
Setting Environment OMP_NUM_THREADS=1
SUBMIT CMD:
qsub -v ARGS_FOR_SCRIPT='--resubmit' .case.run
MPIRUN:
mpiexec.hydra -np $PBS_NTASKS /scratch/cas/phd/asz138508/CSM/CESM2/Exp/somexpt//E1850l45CAM6Ch3/bld/cesm.exe >> cesm.log.$LID 2>&1
 

Attachments

  • env_mach_pes.xml.txt
    6.9 KB · Views: 4
Dear Jedwards,
Sorry, previously I had chosen an old file while uploading the XML file; please find attached the correct env_mach_pes.xml file. The ./preview_run output is the same as posted above.


Thanks
Pawan
 

Attachments

  • env_mach_pes.xml.txt
    6.9 KB · Views: 14

jedwards

CSEG and Liaisons
Staff member
ICE and LND are both set to 480 tasks, but with ROOTPE_ICE = 80, so you have lnd and ice tasks overlapping, as well as ice and ocn tasks. Based on the compset you are trying to run, may I suggest that you change this with ./xmlchange ROOTPE=0. Also, I am not sure that $PBS_NTASKS should be used in the mpiexec line; please change it to
<arg name="num_tasks"> -np {{ total_tasks }}</arg> in config_machines.xml.

If you are still having problems, please try running some simple compsets: start with X, and when it works try A, then F.
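For reference, from the case directory those changes would look roughly like this (the --reset step picks up the new PE layout; a rebuild is then required):

./xmlchange ROOTPE=0
./case.setup --reset
./case.build

and in config_machines.xml, inside the <mpirun> block for your machine, the task-count argument would be:

<arg name="num_tasks"> -np {{ total_tasks }}</arg>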
 