Threaded build of SC-WACCM hangs

Dear CESM and WACCM experts,

I tried to run the SC-WACCM compset with both MPI tasks and OpenMP threads,
and the program hangs at runtime.
The problem doesn't happen in the *non-threaded* build of SC-WACCM.
Moreover, it doesn't happen in the *threaded* build of another CESM compset (F_2000)
either.

Could this be an OpenMP race condition or a deadlock in some specific SC-WACCM
source code file, perhaps?
Is there any fix or patch for this?

I originally posted this question in the general CESM forum (software issues),
but later I realized that the problem seems to be specific to the SC-WACCM compset.
Hence I am moving the question to this forum.

I am using NTASKS=16 and NTHRDS=2 (and OMP_NUM_THREADS=2 with 256 megabytes for
KMP_STACKSIZE), and requesting a total of 32 processors (4 nodes of our cluster).
All components have root processor 0, NTASKS=16, NTHRDS=2 (in env_mach_pes.xml).
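
As a sanity check on the processor count, here is a minimal sketch of the arithmetic in Python. It is not part of CESM: the `layout` dictionary and the `total_cores` helper are hypothetical names for illustration only. The component list (atm, lnd, ice, ocn, cpl) is also my shorthand. I am assuming that components sharing root processor 0 run on the same MPI tasks, so the task count is the maximum over components rather than the sum.

```python
# Hypothetical description of the PE layout in env_mach_pes.xml.
# Assumption: every component starts at root processor 0 and therefore
# shares the same set of MPI tasks.
components = ["atm", "lnd", "ice", "ocn", "cpl"]


def make_layout(ntasks, nthrds, rootpe=0):
    """Build the same (rootpe, ntasks, nthrds) setting for every component."""
    return {
        name: {"rootpe": rootpe, "ntasks": ntasks, "nthrds": nthrds}
        for name in components
    }


def total_cores(layout):
    """Cores needed: highest MPI task index in use times threads per task."""
    mpi_tasks = max(c["rootpe"] + c["ntasks"] for c in layout.values())
    threads = max(c["nthrds"] for c in layout.values())
    return mpi_tasks * threads


hybrid_cores = total_cores(make_layout(ntasks=16, nthrds=2))    # threaded run
mpi_only_cores = total_cores(make_layout(ntasks=32, nthrds=1))  # MPI-only run

print("hybrid:", hybrid_cores, "MPI-only:", mpi_only_cores)
```

Under these assumptions, both the threaded layout (NTASKS=16, NTHRDS=2) and the MPI-only layout (NTASKS=32, NTHRDS=1) occupy the same 32 processors, so the two runs differ only in how those processors are used.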

The atmosphere component gets stuck after it reads the GHG forcing file.
The last several lines in the atm.log file are:

**********************
(GETFIL): attempting to find local file ghg_forcing_2000_c110321.nc
(GETFIL): using
/data4/gus/CCSM4.0/inputdata/atm/waccm/ub/ghg_forcing_2000_c110321.nc
open_trc_datafile:
/data4/gus/CCSM4.0/inputdata/atm/waccm/ub/ghg_forcing_2000_c110321.nc
trcdata_init: file%has_ps = F
**********************

The other components (lnd,ice,ocn) don't even get started, as the program hangs.
The only logs produced are from atm, cpl, and ccsm.

Note that the same setup works perfectly well if I use NTASKS=32 and NTHRDS=1
(i.e. MPI tasks only, no threads/OpenMP).
Hence the problem is restricted to the *threaded* build of SC-WACCM.

The issue seems to be restricted to the SC-WACCM compset of CESM1.0,
as at least one other compset (F_2000) works correctly when I compile it in threaded mode.

To test CESM1.0 thread safety, I compiled and ran the F (F_2000) compset, using
the same NTASKS and NTHRDS (OpenMP/threads) specified in the
env_mach_pes.xml that I described above.
It just works.

FYI, I am running cesm1_0_3 on a standard Linux cluster (beowulf) with the
Intel compilers (ifort and icc 10.1.017) and OpenMPI 1.4.3.

Any help is appreciated.
Thank you,
Gus Correa
 