CESM2_0:CAM-Chem hanging in mpialltoallint

raeder

Member
Benjamin Gaubert and I are trying to build a CAM-Chem
version from the released CESM2_0
(/gpfs/fs1/work/raeder/Models/cesm2_0).
I'm building a 1 degree model using compset
HIST_CAM60%CCTS_CLM50%SP_CICE%PRES_DOCN%DOM_MOSART_SGLC_SWAV
and giving it 6 nodes per instance in a 2-instance forecast.
A similar problem happens in single-instance forecasts with 1, 2, or 3 nodes.
CAM stops progressing, although when I logged onto
the compute nodes, 'top' reported that all the CPUs were very busy.
I put in debug prints and narrowed the problem down to
phys_grid.F90:transpose_block_to_chunk:
call mpialltoallint(rdispls, 1, pdispls, 1, mpicom)
!
# if defined(MODCM_DP_TRANSPOSE)

It wouldn't be productive for me to try to pursue it any deeper,
so I'm hoping that someone else recognizes a mistake we're
making, or has ideas for things to change.

Jim Edwards suggested that this is a CAM problem rather than a CIME problem, but there is an open CIME issue about it (2808).

Thanks,
Kevin and Benjamin
 

raeder

Member
Brian Eaton discovered that this was caused by a terrible definition of npr_yz (1,1,1,1) in cam/cime_config/buildnml, which defined 1 subdomain and ran the whole dycore on 1 task. Changing to the intended default,

build-namelist --ntasks $NTASKS_PER_INST_ATM

fixed this, and also fixed the problem with initial-file longitudes being set to 0 for large ensembles:
https://bb.cgd.ucar.edu/initial-file-longitudes-are-0s-large-ensemble-cam6

So this issue is resolved.
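For anyone hitting the same hang: the symptom corresponds to the FV dycore decomposition collapsing to a single MPI task. A minimal sketch of what the bad setting looks like in the generated atm_in, assuming the FV SPMD namelist group is spmd_fv_inparm and that npr_yz is ordered (npr_y, npr_z, nprxy_x, nprxy_y); in the fixed configuration build-namelist computes these values from --ntasks rather than hard-coding them:

```fortran
! Broken atm_in fragment from the buggy buildnml:
! one subdomain, so the entire FV dycore runs on a single task
! while the other tasks spin in the transpose's alltoall.
&spmd_fv_inparm
 npr_yz = 1,1,1,1
/
```

Checking npr_yz in the run directory's atm_in is a quick way to confirm whether a case is affected before digging into MPI traces.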
 