Error when running CESM2.2

Dear All,

I'm trying to run CESM2.2 on my own cluster, which has no batch system. The case was set up and built without error. I am using the I2000Clm50Vic compset with the f19_g17 resolution for a test run. When I submit the case, I get the following errors:

-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
Fatal error in PMPI_Group_range_incl: Invalid argument, error stack:
PMPI_Group_range_incl(195)........: MPI_Group_range_incl(group=0x88000000, n=1, ranges=0x7ffe318e4460, new_group=0x7ffe318e4004) failed
MPIR_Group_check_valid_ranges(323): The 0th element of a range array ends at 63 but must be nonnegative and less than 1
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=201953292:
system msg for write_line failure : Bad file descriptor
Fatal error in PMPI_Group_range_incl: Invalid argument, error stack:
PMPI_Group_range_incl(195)........: MPI_Group_range_incl(group=0x88000000, n=1, ranges=0x7fff9764eb70, new_group=0x7fff9764e714) failed
MPIR_Group_check_valid_ranges(323): The 0th element of a range array ends at 63 but must be nonnegative and less than 1
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=626700:
system msg for write_line failure : Bad file descriptor
... (the same PMPI_Group_range_incl error is repeated by the remaining ranks) ...
 Invalid PIO rearranger comm max pend req (comp2io),            0
 Resetting PIO rearranger comm max pend req (comp2io) to           64
 PIO rearranger options:
   comm type     = p2p
   comm fcd      = 2denable
   max pend req (comp2io)  =            0
   enable_hs (comp2io)     = T
   enable_isend (comp2io)  = F
   max pend req (io2comp)  =           64
   enable_hs (io2comp)     = F
   enable_isend (io2comp)  = T
 (seq_comm_setcomm)  init ID (  1 GLOBAL          ) pelist   =     0     0     1 ( npes =     1) ( nthreads =  1)( suffix =)
... (this PIO block and the fatal error above are printed again by another process) ...
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[64377,1],0]
  Exit code:    12
--------------------------------------------------------------------------
42 total processes failed to start

Unfortunately, I could not find the root of the problem. Any help would be greatly appreciated.

Kind regards

The XML for my machine (config_machines.xml) is:

<DESC>Linux 64bit</DESC>
<NODENAME_REGEX>none</NODENAME_REGEX>
<OS>LINUX</OS>
<COMPILERS>gnu</COMPILERS>
<MPILIBS>mpich</MPILIBS>
<CIME_OUTPUT_ROOT>/home/as2/CESM/projects/scratch</CIME_OUTPUT_ROOT>
<DIN_LOC_ROOT>/mnt/FNas/CESM/projects/cesm-inputdata</DIN_LOC_ROOT>
<DIN_LOC_ROOT_CLMFORC>/mnt/FNas/CESM/projects/cesm-inputdata/atm/datm7</DIN_LOC_ROOT_CLMFORC>
<DOUT_S_ROOT>/mnt/FNas/CESM/projects/scratch/archive/$CASE</DOUT_S_ROOT>
<BASELINE_ROOT>/mnt/FNas/CESM/projects/baselines</BASELINE_ROOT>
<CCSM_CPRNC>$CIMEROOT/tools/cprnc/build/cprnc</CCSM_CPRNC>
<GMAKE_J>8</GMAKE_J>
<BATCH_SYSTEM>none</BATCH_SYSTEM>
<SUPPORTED_BY>asakalli</SUPPORTED_BY>
<MAX_TASKS_PER_NODE>32</MAX_TASKS_PER_NODE>
<MAX_MPITASKS_PER_NODE>32</MAX_MPITASKS_PER_NODE>
<mpirun mpilib="default">
  <executable>/usr/bin/mpirun</executable>
  <arguments>
    <arg name="num_tasks">-np 64</arg>
    <arg name="hostfile">--hostfile $ENV{HOME}/my_hosts_ip</arg>
  </arguments>
</mpirun>
<environment_variables>
  <env name="NETCDF_DIR">/home/as2/local/netcdf461</env>
  <env name="NETCDF_PATH">/home/as2/local/netcdf461</env>
</environment_variables>

The XML for my compiler (config_compilers.xml) is:

<CFLAGS>
  <base> -std=gnu99 </base>
  <append compile_threaded="TRUE"> -fopenmp </append>
  <append DEBUG="TRUE"> -g -Wall -Og -fbacktrace -ffpe-trap=invalid,zero,overflow -fcheck=bounds </append>
  <append DEBUG="FALSE"> -O3 </append>
</CFLAGS>
<CPPDEFS>
  <append> -DFORTRANUNDERSCORE -DNO_R16 -DCPRGNU </append>
</CPPDEFS>
<CXX_LINKER>FORTRAN</CXX_LINKER>
<FC_AUTO_R8>
  <base> -fdefault-real-8 </base>
</FC_AUTO_R8>
<FFLAGS>
  <base> -fconvert=big-endian -ffree-line-length-none -ffixed-line-length-none </base>
  <append compile_threaded="TRUE"> -fopenmp </append>
  <append DEBUG="TRUE"> -g -Wall -Og -fbacktrace -ffpe-trap=zero,overflow -fcheck=bounds </append>
  <append DEBUG="FALSE"> -O3 </append>
</FFLAGS>
<FIXEDFLAGS>
  <base> -ffixed-form </base>
</FIXEDFLAGS>
<FREEFLAGS>
  <base> -ffree-form </base>
</FREEFLAGS>
<HAS_F2008_CONTIGUOUS>FALSE</HAS_F2008_CONTIGUOUS>
<MPICC> /usr/bin/mpicc </MPICC>
<MPICXX> /usr/bin/mpicxx </MPICXX>
<MPIFC> /usr/bin/mpif90 </MPIFC>
<SCC> /usr/bin/gcc </SCC>
<SCXX> /usr/bin/g++ </SCXX>
<SFC> /usr/bin/gfortran </SFC>
<SUPPORTS_CXX>TRUE</SUPPORTS_CXX>
<SLIBS>
  <append> -L/usr/lib -llapack -lblas -L/home/as2/local/netcdf461/lib/ -Wl,-Bsymbolic-functions -Wl,-z,relro -lnetcdf -lnetcdff </append>
</SLIBS>

The output from pelayout is:

Comp  NTASKS  NTHRDS  ROOTPE
CPL :     64/     1;      0
ATM :     64/     1;      0
LND :     64/     1;      0
ICE :     64/     1;      0
OCN :     64/     1;      0
ROF :     64/     1;      0
GLC :     64/     1;      0
WAV :     64/     1;      0
ESP :      1/     1;      0

And the output from preview_run is:

CASE INFO:
  nodes: 2
  total tasks: 64
  tasks per node: 32
  thread count: 1

BATCH INFO:
  FOR JOB: case.run
    ENV:
      Setting Environment NETCDF_DIR=/home/as2/local/netcdf461
      Setting Environment NETCDF_PATH=/home/as2/local/netcdf461
      Setting Environment OMP_NUM_THREADS=1
    SUBMIT CMD:
      None
  FOR JOB: case.st_archive
    ENV:
      Setting Environment NETCDF_DIR=/home/as2/local/netcdf461
      Setting Environment NETCDF_PATH=/home/as2/local/netcdf461
      Setting Environment OMP_NUM_THREADS=1
    SUBMIT CMD:
      None

MPIRUN:
  /usr/bin/mpirun -np 64 --hostfile /home/as2/my_hosts_ip /home/as2/CESM/projects/scratch/denemeI2000Clm50VicSecond/bld/cesm.exe >> cesm.log.$LID 2>&1
 

jedwards

CSEG and Liaisons
Staff member
This looks like an MPI configuration problem - have you tried running something like hello-world? Often MPI will work on one node but fail when you try to use more than one, so make sure to try your hello-world on 64 tasks.
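For reference, a minimal MPI hello-world along these lines might look like the sketch below (the file name hello.c is arbitrary). Each rank reports its rank, the communicator size, and the host it runs on, which makes it easy to see whether all 64 tasks actually join a single MPI_COMM_WORLD:

```c
/* hello.c - minimal MPI sanity check: every rank prints its rank,
 * the size of MPI_COMM_WORLD, and the node it is running on. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, namelen;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of ranks */
    MPI_Get_processor_name(host, &namelen); /* node name */

    printf("Hello from rank %d of %d on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}
```

Build it with the same wrapper CESM uses (/usr/bin/mpicc hello.c -o hello) and launch it exactly the way CIME does (/usr/bin/mpirun -np 64 --hostfile $HOME/my_hosts_ip ./hello). Every line should say "of 64"; if ranks report "of 1" instead, that usually means the /usr/bin/mpirun being invoked belongs to a different MPI installation than the library the executable was linked against - which would also produce the "range array ends at 63 but must be ... less than 1" failure above.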
 