CESM2 case run error when porting to a new machine

yhanw

Yuhan
New Member
Hello all,

I am porting CESM2.2.2 to a new machine, Sherlock. Using the attached configuration files, I successfully got through the create_newcase, case.setup, and case.build stages, but hit errors at runtime for the test case "b.e20.B1850.f19_g17.test" (compset=B1850, res=f19_g17). The case was built with mpich, slurm, and gnu (the complete environment is in the attached spack.yaml). The run could be submitted via case.submit, but it quickly failed with:

Abort(473542924) on node 0: Fatal error in internal_Group_range_incl: Invalid argument, error stack:
internal_Group_range_incl(45858)..: MPI_Group_range_incl(group=0x88000000, n=1, ranges=0x7ffdaf007fe8, newgroup=0x7ffdaf007e54) failed
MPIR_Group_check_valid_ranges(280): The 0th element of a range array starts at 1 but must be nonnegative and less than 1


The complete log and env_mach_pes.xml are attached.
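For context, the sequence of commands I ran to get to this point was roughly the following (paths, the machine name, and the --run-unsupported flag are specific to my setup; treat this as a sketch of the workflow rather than the exact commands):

./create_newcase --case b.e20.B1850.f19_g17.test --compset B1850 --res f19_g17 --mach sherlock --run-unsupported
cd b.e20.B1850.f19_g17.test
./case.setup
./case.build
./case.submit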

I found two possibly relevant threads in the forum:
Run time error in CESM2, E component set that consist of slab ocean model for year 1850
modifying batch settings: CESM1 versus CESM2
Based on these discussions, I suspect the error is related to how ROOTPE and NTASKS are set, but I have no clue how to set them properly. I made some exploratory attempts, setting NTASKS=1 and a unique ROOTPE for every component via ./xmlchange, then re-ran case.setup, rebuilt, and resubmitted, but the errors persist.
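For reference, those attempts looked roughly like this (the exact ROOTPE values were more or less arbitrary, so this is only illustrative):

./xmlchange NTASKS=1
./xmlchange ROOTPE_ATM=0,ROOTPE_LND=1,ROOTPE_ICE=2,ROOTPE_OCN=3,ROOTPE_ROF=4,ROOTPE_GLC=5,ROOTPE_WAV=6,ROOTPE_ESP=7,ROOTPE_CPL=8
./case.setup --reset
./case.build
./case.submit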

Do you have any suggestions on how to resolve this error? More generally, what are the criteria for keeping ROOTPE, NTASKS, total tasks, CPUs per task, and so on consistent with each other? Any advice would be greatly appreciated.

Thanks!
Yuhan

Btw, here is my ./preview_run:
CASE INFO:
nodes: 2
total tasks: 10
tasks per node: 8
thread count: 1

BATCH INFO:
FOR JOB: case.run
ENV:
[... skipped module loading lines]
Setting Environment OMP_STACKSIZE=256M
Setting Environment OMP_NUM_THREADS=1

SUBMIT CMD:
sbatch .case.run --resubmit

MPIRUN (job=case.run):
srun -n 10 -d 1 /scratch/users/yhanw/cesm/case/b.e20.B1850.f19_g17.test/bld/cesm.exe >> cesm.log.$LID 2>&1
 


yhanw

Yuhan
New Member
I was able to resolve the problem. For future reference and for anyone experiencing similar errors, I'll share my workaround.

1) I followed some guides to set ROOTPE, NTASKS, and NTHRDS (see below; an illustrative layout is also sketched at the end of this post).

Maybe useful - I found this database about PE layout and load balancing on different machines: CESM Timing, Performance & Load Balancing Data.

2) I switched my MPI launch command from "srun -n {total_tasks}" to "mpirun -np {total_tasks}" by updating config_machines.xml accordingly (a sketch of the relevant entry is at the end of this post).
It turns out that on my HPC machine (Sherlock), a minimal MPI hello-world test shows a difference between the two launchers: the srun launch starts 8 independent processes that each report rank 0 out of 1, even though -n is set to 8.
srun -n 8 hello_mpi
Hello world from processor sh03-09n01.int, rank 0 out of 1 processors
Hello world from processor sh03-09n01.int, rank 0 out of 1 processors
Hello world from processor sh03-09n01.int, rank 0 out of 1 processors
Hello world from processor sh03-09n01.int, rank 0 out of 1 processors
Hello world from processor sh03-09n01.int, rank 0 out of 1 processors
Hello world from processor sh03-09n01.int, rank 0 out of 1 processors
Hello world from processor sh03-09n01.int, rank 0 out of 1 processors
Hello world from processor sh03-09n01.int, rank 0 out of 1 processors
mpirun -np 8 hello_mpi
Hello world from processor sh03-09n01.int, rank 0 out of 8 processors
Hello world from processor sh03-09n01.int, rank 4 out of 8 processors
Hello world from processor sh03-09n01.int, rank 1 out of 8 processors
Hello world from processor sh03-09n01.int, rank 2 out of 8 processors
Hello world from processor sh03-09n01.int, rank 3 out of 8 processors
Hello world from processor sh03-09n01.int, rank 5 out of 8 processors
Hello world from processor sh03-09n01.int, rank 6 out of 8 processors
Hello world from processor sh03-09n01.int, rank 7 out of 8 processors
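To make 1) concrete, here is an illustration only (the actual numbers should come from the timing/load-balancing data and your node size, not from this sketch): the simplest consistent layout is to stack every component on the same tasks, e.g. on a single 8-core node:

./xmlchange NTASKS=8,ROOTPE=0,NTHRDS=1
./case.setup --reset
./case.build

With NTHRDS=1, the total number of MPI tasks the case requests is roughly the maximum of ROOTPE + NTASKS over all components, so that sum should not exceed the number of tasks your batch allocation actually provides.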
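And to make 2) concrete, the mpirun block in my config_machines.xml entry now looks roughly like this (the mpilib attribute and the arg name are from my setup and may need adjusting; treat it as a sketch rather than a drop-in replacement):

<mpirun mpilib="mpich">
  <executable>mpirun</executable>
  <arguments>
    <arg name="num_tasks">-np {{ total_tasks }}</arg>
  </arguments>
</mpirun>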
 