Is CESM1.0 thread safe in beowulf clusters?

gus@...

Dear CESM1.0 fans and pros

Has anybody succeeded in running any of the CESM1.0 configurations on a beowulf cluster?
Is CESM1.0 thread-safe on this class of machines?

I tried to run the SC-WACCM compset with MPI tasks and OpenMP threads
and the program hangs at runtime.
The symptoms are those of a race condition between the (two) threads on the
atmosphere master MPI task.

I am using NTASKS=16 and NTHRDS=2 (and OMP_NUM_THREADS=2 with 256 megabytes for
KMP_STACKSIZE), and requesting a total of 32 processors (4 nodes of our cluster).
All components have root processor 0, NTASKS=16, NTHRDS=2 (in env_mach_pes.xml).
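For reference, the arithmetic behind this layout can be sketched as below. This is a hypothetical helper, not part of the CESM scripts; it assumes all components share the same 16 tasks (as they do here, since every root processor is 0) and that the cores are spread evenly over the nodes.

```python
# Hypothetical helper, not part of CESM: sanity-check a hybrid MPI/OpenMP layout.

def cores_needed(ntasks, nthrds):
    """Cores used when each of ntasks MPI tasks runs nthrds OpenMP threads."""
    return ntasks * nthrds


def cores_per_node(total_cores, nodes):
    """Cores per node, assuming the cores are spread evenly over the nodes."""
    if total_cores % nodes != 0:
        raise ValueError("cores do not divide evenly over the nodes")
    return total_cores // nodes


# Values from the layout described above: NTASKS=16, NTHRDS=2, 4 nodes.
total = cores_needed(ntasks=16, nthrds=2)
per_node = cores_per_node(total, nodes=4)
print(total, per_node)
```

With these numbers the request comes to 32 cores, which matches the 32 processors I asked for.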

The atmosphere component gets stuck when trying to read a chemistry tracer file
(GHG forcing) via the subroutine open_trc_datafile.
The last lines printed to atm.log are the messages from open_trc_datafile,
suggesting that two threads may be trying to open the file with no synchronization.
This race condition could happen if the code that opens the file and reads the data
is part of a threaded block, which I haven't confirmed (too much code to check).
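To illustrate the kind of synchronization I suspect is missing, here is a minimal sketch in Python (not the CESM Fortran code, which I have not inspected). It assumes two threads both request the same data file, and uses a lock so that the file is opened only once. The class and function names are made up for the example.

```python
import os
import tempfile
import threading


class SharedDataFile:
    """Opens a file at most once, even if several threads ask for it."""

    def __init__(self, path):
        self.path = path
        self._lock = threading.Lock()
        self._handle = None
        self.open_count = 0  # how many times open() was really called

    def get(self):
        # Only one thread at a time may check and open the file.
        with self._lock:
            if self._handle is None:
                self._handle = open(self.path, "r")
                self.open_count += 1
            return self._handle

    def close(self):
        with self._lock:
            if self._handle is not None:
                self._handle.close()
                self._handle = None


def demo(nthreads=2):
    """Let nthreads threads request the same file; report what happened."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("placeholder tracer data\n")  # stand-in, not real forcing data
        path = tmp.name

    shared = SharedDataFile(path)
    handles = []

    def worker():
        handles.append(shared.get())

    threads = [threading.Thread(target=worker) for _ in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    count = shared.open_count
    same_handle = all(h is handles[0] for h in handles)
    shared.close()
    os.remove(path)
    return count, same_handle


print(demo(2))
```

In an OpenMP code the analogous fix would be to keep the open/read outside the threaded region, or to restrict it to a single thread. Whether that is what open_trc_datafile needs is only my guess.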

The other components (lnd,ice,ocn) don't even get started.
The only logs produced are from atm, cpl, and ccsm.

Note that the same setup works perfectly well if I use NTASKS=32 and NTHRDS=1
(i.e. MPI tasks only, no threads/OpenMP).
Hence the problem is restricted to the threaded build.

FYI, I am using Intel ifort and icc 10.1.017, OpenMPI 1.4.3, cesm1_0_3.

Do I need to use a thread-safe MPI?
Is this what you have on the IBM Bluefire (and on the ORNL Cray machines)?
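For context on this question: the MPI standard defines four ordered thread-support levels, and which one is needed depends on where the MPI calls are made relative to the threaded regions. The sketch below only encodes that ordering. The numeric values are illustrative (the real constants are implementation-defined), the function names are hypothetical, and I do not know which case applies to CESM.

```python
from enum import IntEnum


class ThreadLevel(IntEnum):
    """MPI thread-support levels in increasing order.

    The integer values here are illustrative; only the ordering is
    guaranteed by the MPI standard.
    """
    SINGLE = 0      # only one thread exists in the process
    FUNNELED = 1    # multi-threaded, but only the main thread makes MPI calls
    SERIALIZED = 2  # any thread may call MPI, but never two at the same time
    MULTIPLE = 3    # any thread may call MPI at any time


def required_level(uses_threads, calls_from_any_thread, calls_may_overlap):
    """Lowest level that covers the described usage pattern."""
    if not uses_threads:
        return ThreadLevel.SINGLE
    if not calls_from_any_thread:
        return ThreadLevel.FUNNELED
    if not calls_may_overlap:
        return ThreadLevel.SERIALIZED
    return ThreadLevel.MULTIPLE


def is_sufficient(provided, required):
    """True if the level the MPI library provides covers the requirement."""
    return provided >= required


# Example: OpenMP threads are used, but all MPI calls stay on the main thread.
need = required_level(uses_threads=True,
                      calls_from_any_thread=False,
                      calls_may_overlap=False)
print(need.name, is_sufficient(ThreadLevel.FUNNELED, need))
```

If the model only calls MPI outside its OpenMP regions, the FUNNELED level would be enough and a fully thread-safe (MULTIPLE) MPI build would not be required.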

Thank you,
Gus Correa

Gus Correa Lamont-Doherty Earth Observatory of Columbia University

Let me answer my own question, at least partially.

The issue seems to be restricted to the SC-WACCM compset of CESM1.0.

To test CESM1.0 thread safety, I compiled and ran the F (F_2000) compset, using
the same NTASKS and NTHRDS (OpenMP threads) specified in the
env_mach_pes.xml that I described in my original posting.
It just works.

Hence, I am going to move this query to the WACCM forum.

Thank you,
Gus Correa

Gus Correa Lamont-Doherty Earth Observatory of Columbia University
