CESM2 case run error when porting to a new machine

yhanw

Yuhan
New Member
Hello all,

I am porting CESM2.2.2 to a new machine, Sherlock. Using the attached configuration files, I successfully got through the create_newcase, case.setup, and case.build stages, but hit errors at runtime for the test case "b.e20.B1850.f19_g17.test" (compset=B1850, res=f19_g17). The case was built with mpich, slurm, and gnu (the complete environment is in the attached spack.yaml). The run can be submitted via case.submit but quickly fails with:

Abort(473542924) on node 0: Fatal error in internal_Group_range_incl: Invalid argument, error stack:
internal_Group_range_incl(45858)..: MPI_Group_range_incl(group=0x88000000, n=1, ranges=0x7ffdaf007fe8, newgroup=0x7ffdaf007e54) failed
MPIR_Group_check_valid_ranges(280): The 0th element of a range array starts at 1 but must be nonnegative and less than 1


The complete log, as well as env_mach_pes.xml, is attached.

I found two possibly relevant threads in the forum:
Run time error in CESM2, E component set that consist of slab ocean model for year 1850
modifying batch settings: CESM1 versus CESM2
Based on these discussions, I'm guessing the error is related to how ROOTPE and NTASKS were set, but I have no clue how to set them properly. I made a few somewhat random attempts, setting all NTASKS=1 and a unique ROOTPE value for each component via ./xmlchange (roughly as sketched below), then re-ran case.setup, rebuilt, and resubmitted, but the errors persist.
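In case it helps, those attempts looked roughly like the sketch below (the ROOTPE values shown are just one of the combinations I tried, not a recommended layout):

# one of the (unsuccessful) layouts I tried
./xmlchange NTASKS=1
./xmlchange ROOTPE_ATM=0,ROOTPE_LND=1,ROOTPE_ICE=2,ROOTPE_OCN=3,ROOTPE_CPL=4,ROOTPE_ROF=5,ROOTPE_GLC=6,ROOTPE_WAV=7,ROOTPE_ESP=8
./case.setup --reset
./case.build
./case.submit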

Do you have any suggestions on how to resolve this error? Or in general, what are the criteria for ensuring consistency between ROOTPE, NTASKS, total-tasks, cpu-per-task, etc.? Any advice would be greatly appreciated.

Thanks!
Yuhan

Btw, here is my ./preview_run:
CASE INFO:
nodes: 2
total tasks: 10
tasks per node: 8
thread count: 1

BATCH INFO:
FOR JOB: case.run
ENV:
[... skipped module loading lines]
Setting Environment OMP_STACKSIZE=256M
Setting Environment OMP_NUM_THREADS=1

SUBMIT CMD:
sbatch .case.run --resubmit

MPIRUN (job=case.run):
srun -n 10 -d 1 /scratch/users/yhanw/cesm/case/b.e20.B1850.f19_g17.test/bld/cesm.exe >> cesm.log.$LID 2>&1
 


yhanw

Yuhan
New Member
I was able to resolve the problem. For future reference, and for anyone running into similar errors, here is my workaround.

1) I followed the resources below to set ROOTPE, NTASKS, and NTHRDS consistently (a small illustrative example is sketched after the link).

Possibly useful: I found this database of PE layouts and load-balancing data for different machines: CESM Timing, Performance & Load Balancing Data.
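As a minimal illustration (the numbers here are made up, not the layout I ended up using), the simplest consistent choice is to run all components on the same block of tasks; if I remember the xmlchange shorthand correctly, a bare NTASKS/NTHRDS/ROOTPE applies to all components:

# illustrative only: stack all components on the same 64 MPI tasks
./xmlchange NTASKS=64,NTHRDS=1,ROOTPE=0
./case.setup --reset

As I understand it, the basic consistency rule is that the total MPI task count must cover the maximum over components of (ROOTPE + NTASKS), and the batch cpus-per-task follows the largest NTHRDS; ./preview_run should then report node and task counts that match env_mach_pes.xml.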

2) I switched my MPI launch command from "srun -n {total_tasks}" to "mpirun -np {total_tasks}" by updating config_machines.xml accordingly (a sketch of the change is below).
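For reference, the relevant <mpirun> block in my config_machines.xml entry now looks roughly like the sketch below (simplified; the element and arg names follow the usual CIME config_machines.xml pattern, so double-check them against your CIME version):

<mpirun mpilib="default">
  <!-- launch the model with mpirun instead of srun -->
  <executable>mpirun</executable>
  <arguments>
    <arg name="num_tasks">-np {{ total_tasks }}</arg>
  </arguments>
</mpirun>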
It turns out that on my HPC machine (Sherlock), a minimal MPI test of these two commands shows a difference: the srun invocation launches 8 independent copies of the program, each of which sees itself as rank 0 of 1, even though -n is set to 8.
srun -n 8 hello_mpi
Hello world from processor sh03-09n01.int, rank 0 out of 1 processors
Hello world from processor sh03-09n01.int, rank 0 out of 1 processors
Hello world from processor sh03-09n01.int, rank 0 out of 1 processors
Hello world from processor sh03-09n01.int, rank 0 out of 1 processors
Hello world from processor sh03-09n01.int, rank 0 out of 1 processors
Hello world from processor sh03-09n01.int, rank 0 out of 1 processors
Hello world from processor sh03-09n01.int, rank 0 out of 1 processors
Hello world from processor sh03-09n01.int, rank 0 out of 1 processors
mpirun -np 8 hello_mpi
Hello world from processor sh03-09n01.int, rank 0 out of 8 processors
Hello world from processor sh03-09n01.int, rank 4 out of 8 processors
Hello world from processor sh03-09n01.int, rank 1 out of 8 processors
Hello world from processor sh03-09n01.int, rank 2 out of 8 processors
Hello world from processor sh03-09n01.int, rank 3 out of 8 processors
Hello world from processor sh03-09n01.int, rank 5 out of 8 processors
Hello world from processor sh03-09n01.int, rank 6 out of 8 processors
Hello world from processor sh03-09n01.int, rank 7 out of 8 processors
 