gus@ldeo_columbia_edu
Member
Dear CESM1.0 fans and pros,
Has anybody succeeded in running any of the CESM1.0 configurations on a Beowulf cluster?
Is CESM1.0 thread-safe on this class of machines?
I tried to run the SC-WACCM compset with both MPI tasks and OpenMP threads,
but the program hangs at runtime.
The symptoms are those of a race condition between the (two) threads on the
atmosphere master process task.
I am using NTASKS=16 and NTHRDS=2 (and OMP_NUM_THREADS=2 with 256 megabytes for
KMP_STACKSIZE), and requesting a total of 32 processors (4 nodes of our cluster).
All components have root processor 0, NTASKS=16, NTHRDS=2 (in env_mach_pes.xml).
The atmosphere component gets stuck while trying to read a chemistry tracer file
(GHG forcing) via the subroutine open_trc_datafile.
The last things printed to atm.log are the messages from open_trc_datafile,
which suggests that two threads may be trying to open the file without any synchronization.
Such a race could happen if the code that opens the file and reads the data sits inside a
threaded block, though I haven't confirmed this (too much code to check); a sketch of the
pattern I have in mind follows below.
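To make the suspected pattern concrete, here is a minimal C/OpenMP sketch (not the actual
CESM Fortran; the file name is made up). My guess is that the open/read needs to be confined
to a single thread, roughly like this:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        FILE *fp = NULL;

        #pragma omp parallel shared(fp)
        {
            /* Without this guard, every thread would call fopen()/fread() on the
             * same file -- the kind of unsynchronized access I think I am seeing. */
            #pragma omp single
            {
                fp = fopen("ghg_forcing.nc", "rb");   /* hypothetical file name */
                if (fp) {
                    /* ... read the tracer data on one thread only ... */
                    fclose(fp);
                }
            }
            /* implicit barrier at the end of the single block:
             * the other threads wait here until the read is done */
        }
        return 0;
    }

If the equivalent guard is missing around the open/read in open_trc_datafile, that could
explain why the threaded run hangs while the MPI-only run does not.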
The other components (lnd, ice, ocn) don't even start.
The only logs produced are from atm, cpl, and ccsm.
Note that the same setup works perfectly well if I use NTASKS=32 and NTHRDS=1
(i.e., MPI tasks only, no OpenMP threads).
Hence the problem is restricted to the threaded build.
FYI, I am using Intel ifort and icc 10.1.017, OpenMPI 1.4.3, cesm1_0_3.
Do I need to use a thread-safe MPI?
Is that what you have on the IBM Bluefire (and on the ORNL Cray machines)?
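For what it's worth, my (possibly wrong) understanding is that a hybrid MPI+OpenMP code needs
at least MPI_THREAD_FUNNELED from the MPI library, and MPI_THREAD_MULTIPLE if MPI calls can
come from any thread. A tiny C check like the one below (plain MPI, not CESM code) should show
what a given OpenMPI build actually provides:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int provided, rank;

        /* Request the highest level; the library reports what it actually grants. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            if (provided >= MPI_THREAD_MULTIPLE)
                printf("MPI_THREAD_MULTIPLE: any thread may call MPI\n");
            else if (provided >= MPI_THREAD_FUNNELED)
                printf("provided = %d: funneled/serialized only\n", provided);
            else
                printf("provided = %d: no usable thread support\n", provided);
        }

        MPI_Finalize();
        return 0;
    }

If my OpenMPI 1.4.3 build only provides MPI_THREAD_SINGLE, would that by itself explain the hang?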
Thank you,
Gus Correa