Is CESM1.0 thread-safe on Beowulf clusters?

Dear CESM1.0 fans and pros

Has anybody succeeded in running any of the CESM1.0 configurations on a Beowulf cluster?
Is CESM1.0 thread-safe on this class of machines?

I tried to run the SC-WACCM compset with MPI tasks and OpenMP threads,
and the program hangs at runtime.
The symptoms are those of a race condition between the (two) threads on the
atmosphere master task.

I am using NTASKS=16 and NTHRDS=2 (and OMP_NUM_THREADS=2 with 256 megabytes for
KMP_STACKSIZE), and requesting a total of 32 processors (4 nodes of our cluster).
All components have root processor 0, NTASKS=16, NTHRDS=2 (in env_mach_pes.xml).
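As a quick check of the layout arithmetic above, here is a minimal sketch in Python. The helper name cores_required is hypothetical and not part of CESM; it only assumes that every component shares root processor 0 and the same NTASKS x NTHRDS layout, as described in this post.

```python
# Hypothetical helper (not a CESM tool): cores needed when all components
# share the same root processor and the same NTASKS x NTHRDS layout.
def cores_required(ntasks: int, nthrds: int) -> int:
    return ntasks * nthrds

hybrid = cores_required(16, 2)    # MPI + OpenMP layout from this post
mpi_only = cores_required(32, 1)  # MPI-only layout with NTHRDS=1

# Both layouts occupy the same 32 processors; spread over the 4 nodes
# requested, that is 8 processors per node.
per_node = hybrid // 4
print(hybrid, mpi_only, per_node)
```

Both layouts request the same number of processors, so the comparison between them isolates the effect of threading rather than of the total core count.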

The atmosphere component gets stuck when trying to read a chemistry tracer file
(GHG forcing) via the subroutine open_trc_datafile.
The last things printed to atm.log are the messages from open_trc_datafile,
suggesting that two threads may be trying to open the file with no synchronization.
This race condition could happen if the code that opens the file and reads the data
is part of a threaded block of code. I haven't confirmed this, though (too much code to check).
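To illustrate the suspected failure mode and one common guard against it, here is a minimal sketch in Python. It is not the CESM Fortran code: open_once, run_demo, and the temporary file are all hypothetical. It only shows the general idea of letting a single thread perform the open while the others wait, roughly what an OpenMP critical or single region would do.

```python
import os
import tempfile
import threading

_lock = threading.Lock()
_cache = {}      # shared state: path -> file contents
open_count = 0   # how many times the file was actually opened

def open_once(path):
    """Open and read `path` at most once, even if several threads call this."""
    global open_count
    with _lock:                      # only one thread at a time enters
        if path not in _cache:       # later threads find the data cached
            with open(path) as f:
                _cache[path] = f.read()
            open_count += 1
    return _cache[path]

def run_demo(nthreads=2):
    """Have `nthreads` threads request the same (placeholder) data file."""
    global open_count
    open_count = 0
    _cache.clear()
    fd, path = tempfile.mkstemp(text=True)
    with os.fdopen(fd, "w") as f:
        f.write("placeholder forcing data\n")
    results = []
    def worker():
        results.append(open_once(path))
    threads = [threading.Thread(target=worker) for _ in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    os.remove(path)
    return open_count, results
```

Calling run_demo(2) should report a single open shared by both threads. Whether open_trc_datafile is actually reached from a threaded region remains unconfirmed, so this is only a picture of the hypothesis.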

The other components (lnd,ice,ocn) don't even get started.
The only logs produced are from atm, cpl, and ccsm.

Note that the same setup works perfectly well if I use NTASKS=32 and NTHRDS=1
(i.e. MPI tasks only, no threads/OpenMP).
Hence the problem is restricted to the threaded build.

FYI, I am using Intel ifort and icc 10.1.017, OpenMPI 1.4.3, cesm1_0_3.

Do I need to use a thread-safe MPI?
Is this what you have in the IBM Bluefire (and in the ORNL Cray machines)?

Thank you,
Gus Correa
 
Let me answer my own question, at least partially.

The issue seems to be restricted to the SC-WACCM compset of CESM1.0.

To test CESM1.0 thread safety, I compiled and ran the F (F_2000) compset, using
the same NTASKS and NTHRDS (OpenMP threads) specified in the
env_mach_pes.xml that I described in my original posting.
It just works.

Hence, I am going to move this query to the WACCM forum.

Thank you,
Gus Correa
 