modify diabatic heating in CAM6 (CESM2.1.3): is it possible to call mpi_gather/scatter inside the tphysbc/tphysac subroutines?

hhzhang · May 13, 2021

Hello,

I am trying to modify the diabatic heating tendency over the Indian Ocean in CAM6 (CESM2.1.3). I think I've figured out where to do it in the source codes (physpkg.F90), but I am running into some technical questions, as detailed below.

Specifically, at every physics time step, I would like to first compute the domain averaged diabatic heating (computed by CAM physics) over the Indian Ocean and then use this average value to overwrite the model-computed diabatic heating over the Indian Ocean. In the source codes physpkg.F90, these domain averaging and overwriting operations would be done inside the tphysbc/tphysac subroutines, to the physics tendency variable 'ptend%s', right before each 'call physics_update()'.

To do this, it requires that I need to first MPI_gather all chunks from each MPI process to a global field, do the averaging/overwriting, and finally MPI_scatter the modified global ptend%s field to each MPI process. However, it seems that these mpi_gather/scatter functions cannot be directly called from inside the tphysbc/tphysac subroutines, because tphysbc/tphysac are executed sequentially on the chunks of each MPI process, provided that the number of OpenMP threads is less than the number of the chunks on each MPI process (see copied codes below).

-----------------------------------------------------------------------------
!$OMP PARALLEL DO PRIVATE (C, NCOL, phys_buffer_chunk)
do c=begchunk,endchunk
ncol = get_ncols_p(c)
phys_buffer_chunk => pbuf_get_chunk(pbuf2d, c)
!
! surface diagnostics for history files
!
call t_startf('diag_surf')
call diag_surf(cam_in(c), cam_out(c), phys_state(c), phys_buffer_chunk)
call t_stopf('diag_surf')

call tphysac(ztodt, cam_in(c), &
cam_out(c), &
phys_state(c), phys_tend(c), phys_buffer_chunk)
end do ! Chunk loop
-----------------------------------------------------------------------------

I wonder if I can get around this issue by increasing the number of MPI processes (or OMP threads) and setting it equal to the total number of chunks on the physics grid, so that all chunks will be processed in a parallel way to allow the use of mpi_gather/scatter functions. If yes, how?

If not, any suggestions on how to do what I want?

Thanks a lot.
Honghai

goldy · May 19, 2021

Using MPI inside of tphysac or tphysbc is not supported in CAM. However, I think your best bet is to try and only have one chunk and use one thread since then all the columns will be available on each call to tphysbc or tphysac.
To make sure there is only one chunk, set PCOLS to a large enough number. For instance, say you are doing a 1 degree FV run (55296 columns) on 256 tasks. You want to round up (# columns / # tasks) or in this case, PCOLS = 216.
To set the threads to one and PCOLS to 216:

./xmlchange --append CAM_CONFIG_OPTS="-pcols 216"
./xmlchange NTHRDS=1

Then, run ./case.setup && ./case.build && ./case.submit

--Steve

hhzhang · May 20, 2021

Hi Steve,

Thank you very much for answering!

I understand your suggestion to only have one chunk that includes all the columns which will be available on each call to tphysbc or tphysac, but I don't understand your example of setting PCOLS = 216 (i.e., 55296/256). Since PCOLS is the maximum number of columns in a chunk, shouldn't I have to set PCOLS=55296 (or any larger number) in order to ensure having only one chunk? I think what I really don't understand is how chunks/columns are distributed among the MPI tasks, or more specifically in this case, how one chunk including all columns is distributed among different MPI tasks.

Another question is do I have to set the threads to one when there is only one chunk? If there is only one chunk, then I would expect the number of threads shouldn't matter because the chunk loop ('do c=begchunk,endchunk') will not be a loop anymore since begchunk=endchunk. My understanding, which could be wrong, is that when there are more than one chunks and multiple threads, the chunks will be divided into different threads to achieve parallel computing. I guess it's probably a safer practice to set #threads to one in this case?

Best,
Honghai

goldy · May 21, 2021

If you set PCOLS = 55296, you will probably run out of memory.

Columns are distributed among MPI tasks so that every task has the same or close to the same number of columns. That is why I divided by the number of tasks (but make sure PCOLS is large enough to absorb all columns on a task).

The number of chunks is decided on a per-task basis -- you want one chunk *per task* so that MPI calls work properly.

As for threads, you probably do not need to set NTHRDS to 1 but that seems safer since you have no use for threading (at least in CAM). It is certainly something you can play with once your code is working with NTHRDS==1.

modify diabatic heating in CAM6 (CESM2.1.3): is it possible to call mpi_gather/scatter inside the tphysbc/tphysac subroutines?

hhzhang

Honghai Zhang

New Member

goldy

AMP / CGD / NCAR

hhzhang

Honghai Zhang

New Member

goldy

AMP / CGD / NCAR