Hello all,
I am porting CESM2.2.2 to a new machine, Sherlock. Using the attached configuration files, I got through the create_newcase, case.setup, and case.build stages successfully, but hit errors at runtime for the test case "b.e20.B1850.f19_g17.test" (compset=B1850, res=f19_g17). The case was built with MPICH, Slurm, and the GNU compilers (the complete environment is in the attached spack.yaml). The run can be submitted via case.submit, but it aborts almost immediately with:
Abort(473542924) on node 0: Fatal error in internal_Group_range_incl: Invalid argument, error stack:
internal_Group_range_incl(45858)..: MPI_Group_range_incl(group=0x88000000, n=1, ranges=0x7ffdaf007fe8, newgroup=0x7ffdaf007e54) failed
MPIR_Group_check_valid_ranges(280): The 0th element of a range array starts at 1 but must be nonnegative and less than 1
The complete log and env_mach_pes.xml are attached.
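For context, the case was created and built roughly as follows (flags and paths reproduced from memory, so treat them as approximate):

./create_newcase --case b.e20.B1850.f19_g17.test --compset B1850 --res f19_g17 --machine sherlock --run-unsupported
cd b.e20.B1850.f19_g17.test
./case.setup
./case.build
./case.submit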
I found two possibly relevant threads in the forum:
Run time error in CESM2, E component set that consist of slab ocean model for year 1850
modifying batch settings: CESM1 versus CESM2
Based on these discussions, I suspect the error is related to how ROOTPE and NTASKS are set, but I have no clue how to set them properly. I made a few trial-and-error attempts, e.g. setting NTASKS=1 for every component and giving each component a unique ROOTPE via ./xmlchange (a sketch of those commands is below), then re-ran ./case.setup --reset, rebuilt, and resubmitted, but the error persists.
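Roughly, those attempts looked like this (only three components shown here; I applied the same pattern to the remaining components):

./xmlchange NTASKS_ATM=1,ROOTPE_ATM=0
./xmlchange NTASKS_LND=1,ROOTPE_LND=1
./xmlchange NTASKS_OCN=1,ROOTPE_OCN=2
./case.setup --reset
./case.build
./case.submit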
Do you have any suggestions on how to resolve this error? Or, more generally, what are the criteria for keeping ROOTPE, NTASKS, the batch job's total task count, cpus-per-task, etc. consistent with one another? Any advice would be greatly appreciated.
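If it helps, I can also post the full PE layout; I have been inspecting it with the following commands (./pelayout assuming it is available in this CIME version):

./pelayout
./xmlquery NTASKS,NTHRDS,ROOTPE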
Thanks!
Yuhan
Btw, here is the output of ./preview_run:
CASE INFO:
nodes: 2
total tasks: 10
tasks per node: 8
thread count: 1
BATCH INFO:
FOR JOB: case.run
ENV:
[... skipped module loading lines]
Setting Environment OMP_STACKSIZE=256M
Setting Environment OMP_NUM_THREADS=1
SUBMIT CMD:
sbatch .case.run --resubmit
MPIRUN (job=case.run):
srun -n 10 -d 1 /scratch/users/yhanw/cesm/case/b.e20.B1850.f19_g17.test/bld/cesm.exe >> cesm.log.$LID 2>&1