
CESM2_0:CAM-Chem hanging in mpialltoallint

raeder

Member
Benjamin Gaubert and I are trying to build a CAM-Chem version from the released CESM2_0 (/gpfs/fs1/work/raeder/Models/cesm2_0). I'm building a 1-degree model using compset HIST_CAM60%CCTS_CLM50%SP_CICE%PRES_DOCN%DOM_MOSART_SGLC_SWAV and giving it 6 nodes per instance in a 2-instance forecast. A similar problem happens in single-instance forecasts with 1, 2, or 3 nodes.
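(A minimal sketch of how a case along these lines might be created with the CIME scripts; the case name, grid alias, and xmlchange values are illustrative assumptions, not taken from this post, and whether --run-unsupported is required depends on the compset/grid combination:)

    ./create_newcase --case cam_chem_2inst \
        --compset HIST_CAM60%CCTS_CLM50%SP_CICE%PRES_DOCN%DOM_MOSART_SGLC_SWAV \
        --res f09_f09_mm --run-unsupported
    cd cam_chem_2inst
    ./xmlchange NINST_ATM=2      # two forecast instances
    ./xmlchange NTASKS_ATM=432   # total; 216 tasks (6 nodes x 36) per instance
    ./case.setup

(Other active components typically need matching NINST settings; they are omitted here for brevity.)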
CAM stops progressing, although when I logged onto the compute nodes, 'top' reported that all the CPUs were very busy. I put in debug prints and narrowed the problem down to this call in phys_grid.F90:transpose_block_to_chunk:

    call mpialltoallint(rdispls, 1, pdispls, 1, mpicom)
!
#if defined(MODCM_DP_TRANSPOSE)

It wouldn't be productive for me to try to pursue it any deeper, so I'm hoping that someone else recognizes a mistake we're making, or has ideas for things to change. Jim Edwards suggested that this is a CAM problem rather than a CIME problem, but there is an open CIME issue about it (2808).

Thanks,
Kevin and Benjamin
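The collective in question, mpialltoallint, is CAM's integer wrapper around MPI_Alltoall, and MPI_Alltoall returns only after every rank in its communicator has entered the call: a single rank that never makes the call leaves all the others spinning at full CPU, which matches the 'top' symptom described above. A minimal standalone sketch of that failure mode (plain MPI, not CAM code; the rank-0 skip is an artificial stand-in for a mis-sized decomposition):

    program alltoall_hang
    ! Sketch of an MPI_Alltoall hang: ranks that reach the collective
    ! spin until every rank in the communicator has entered it, so
    ! 'top' shows busy CPUs while the run makes no progress.
    use mpi
    implicit none
    integer :: ierr, rank, nprocs
    integer, allocatable :: sendbuf(:), recvbuf(:)

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

    allocate(sendbuf(nprocs), recvbuf(nprocs))
    sendbuf = rank

    if (rank /= 0) then
       ! Ranks 1..nprocs-1 enter the collective and wait for rank 0,
       ! which never calls it -- standing in for a decomposition where
       ! some tasks take a different code path.
       call MPI_Alltoall(sendbuf, 1, MPI_INTEGER, &
                         recvbuf, 1, MPI_INTEGER, MPI_COMM_WORLD, ierr)
    end if

    call MPI_Finalize(ierr)   ! never completes: the job hangs
    end program alltoall_hang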
 

raeder

Member
Brian Eaton discovered that this was caused by a terrible definition of npr_yz (1,1,1,1) in cam/cime_config/buildnml, which defined 1 subdomain and ran the whole dycore on 1 task. Changing to the intended default,

    build-namelist --ntasks $NTASKS_PER_INST_ATM

fixed this, and also fixed the problem of initial-file longitudes being set to 0 for large ensembles: https://bb.cgd.ucar.edu/initial-file-longitudes-are-0s-large-ensemble-cam6

So this issue is resolved.
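For context: npr_yz specifies the FV dycore's 2-D decomposition as (npr_y, npr_z, nprxy_x, nprxy_y), where npr_y*npr_z and nprxy_x*nprxy_y must each equal the number of tasks given to the dycore, so (1,1,1,1) runs it entirely on one task while the remaining tasks wait in the transpose's alltoall. A hypothetical user_nl_cam override for a single 216-task instance (values illustrative only; with the buildnml fix above no override is needed):

    ! user_nl_cam -- hypothetical 216-task (6 nodes x 36) layout;
    ! npr_y*npr_z = nprxy_x*nprxy_y = 216
    npr_yz = 36,6,6,36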
 