CESM2_0:CAM-Chem hanging in mpialltoallint
raeder

Benjamin Gaubert and I are trying to build a CAM-Chem
version from the released CESM2_0
(/gpfs/fs1/work/raeder/Models/cesm2_0).
I'm building a 1-degree model using compset
HIST_CAM60%CCTS_CLM50%SP_CICE%PRES_DOCN%DOM_MOSART_SGLC_SWAV
and giving it 6 nodes per instance in a 2-instance forecast.

A similar problem happens in single instance forecasts with 1, 2, or 3 nodes.


CAM stops progressing, although when I log onto
the compute nodes, 'top' reports that all the CPUs are very busy.

I put in debug prints and narrowed the problem down to
phys_grid.F90:transpose_block_to_chunk:
call mpialltoallint(rdispls, 1, pdispls, 1, mpicom)
!
# if defined(MODCM_DP_TRANSPOSE)
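For what it's worth, mpialltoallint is a collective call, so it returns only after every task in mpicom has entered it; if the decompositions disagree and some task never makes the call, the remaining tasks spin without progressing, which matches 'top' showing busy CPUs. A rough stand-alone sketch of that behavior, using threads and a barrier in place of MPI tasks (illustration only, not CAM or MPI code):

```python
# Sketch: why a collective "hangs" when one participant never joins.
# Threads stand in for MPI tasks; a Barrier stands in for a collective
# like mpialltoallint, which completes only when ALL tasks enter it.
import threading

NTASKS = 4
barrier = threading.Barrier(NTASKS)
completed = []

def task(rank, calls_collective):
    if calls_collective:
        try:
            # Real MPI would wait indefinitely; we time out to demonstrate.
            barrier.wait(timeout=0.5)
            completed.append(rank)
        except threading.BrokenBarrierError:
            pass  # the collective never completed for this task

# Rank 3 skips the collective (analogous to a decomposition mismatch):
threads = [threading.Thread(target=task, args=(r, r != 3))
           for r in range(NTASKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(completed)  # []: no task completes the collective
```

With all four threads calling barrier.wait, every rank would complete; with one absent, none do.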

It wouldn't be productive for me to try to pursue it any deeper,
so I'm hoping that someone else recognizes a mistake we're
making, or has ideas for things to change.

Jim Edwards suggested that this is a CAM problem rather than
a CIME problem, but there is an open CIME issue about it (2808).

Thanks,
Kevin and Benjamin

raeder

A test without chemistry, using compset
HIST_CAM60_CLM50%BGC-CROP_CICE%PRES_DOCN%DOM_MOSART_SGLC_SWAV,
succeeds.

raeder

Brian Eaton discovered that this was caused by a bad definition of npr_yz (1,1,1,1)
in cam/cime_config/buildnml, which defined a single subdomain and ran the whole
dycore on one task. Changing to the intended default,

build-namelist --ntasks $NTASKS_PER_INST_ATM

fixed this, as well as the problem with initial-file longitudes being set to 0 for large ensembles:

https://bb.cgd.ucar.edu/initial-file-longitudes-are-0s-large-ensemble-cam6
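For intuition: the FV dycore's yz decomposition uses npr_y * npr_z tasks, so npr_yz = (1,1,1,1) places the whole domain on a single task no matter how many tasks the job has. A minimal sketch of the arithmetic (hypothetical helper, not CAM's actual build-namelist logic):

```python
# Hypothetical sketch: choosing a latitude/vertical task decomposition
# for an FV-style dycore. Illustrates only that npr_y * npr_z must equal
# the task count; this is NOT the algorithm CAM's build-namelist uses.

def choose_npr_yz(ntasks):
    """Return (npr_y, npr_z) with npr_y * npr_z == ntasks,
    preferring the most latitude subdomains (largest npr_y)."""
    for npr_z in range(1, ntasks + 1):
        if ntasks % npr_z == 0:
            return (ntasks // npr_z, npr_z)

# The buggy hard-coded setting leaves one subdomain, i.e. one working task:
print(choose_npr_yz(1))    # (1, 1): whole dycore on one task
# Deriving the decomposition from the actual task count spreads the work:
print(choose_npr_yz(216))  # (216, 1): every task owns a subdomain
```

This is why passing --ntasks $NTASKS_PER_INST_ATM matters: the decomposition is derived from the real per-instance task count instead of a fixed (1,1,1,1).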

So this issue is resolved.
