Modify diabatic heating in CAM6 (CESM2.1.3): is it possible to call mpi_gather/scatter inside the tphysbc/tphysac subroutines?

hhzhang

Honghai Zhang
New Member
Hello,

I am trying to modify the diabatic heating tendency over the Indian Ocean in CAM6 (CESM2.1.3). I think I've figured out where to do it in the source code (physpkg.F90), but I am running into some technical questions, detailed below.

Specifically, at every physics time step, I would like to first compute the domain-averaged diabatic heating (as computed by CAM physics) over the Indian Ocean and then use this average value to overwrite the model-computed diabatic heating over the Indian Ocean. In physpkg.F90, the averaging and overwriting would be applied inside the tphysbc/tphysac subroutines, to the physics tendency variable 'ptend%s', right before each 'call physics_update()'.
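Within a single chunk, the overwrite step I have in mind would look roughly like the sketch below (untested; 'io_mask' is a placeholder logical mask marking the Indian Ocean columns of the chunk, and the declarations would go with the other locals). The real difficulty is the averaging across chunks and tasks, described next.

-----------------------------------------------------------------------------
! Sketch only (untested): regional average/overwrite of ptend%s within one
! chunk, placed just before "call physics_update(...)".  io_mask(1:ncol) is a
! placeholder logical mask for Indian Ocean columns.
real(r8) :: io_sum(pver), io_avg(pver)
integer  :: n_io, i

n_io   = count(io_mask(1:ncol))
io_sum = 0._r8
do i = 1, ncol
   if (io_mask(i)) io_sum(:) = io_sum(:) + ptend%s(i,:)
end do
if (n_io > 0) then
   io_avg(:) = io_sum(:) / real(n_io, r8)
   do i = 1, ncol
      if (io_mask(i)) ptend%s(i,:) = io_avg(:)
   end do
end if
-----------------------------------------------------------------------------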

To do this, I would need to first MPI_gather all chunks from every MPI process into a global field, do the averaging/overwriting, and finally MPI_scatter the modified global ptend%s field back to the MPI processes. However, it seems that these mpi_gather/scatter functions cannot be called directly from inside the tphysbc/tphysac subroutines, because tphysbc/tphysac are executed sequentially over the chunks of each MPI process whenever the number of OpenMP threads is less than the number of chunks on that process (see the code copied below).

-----------------------------------------------------------------------------
!$OMP PARALLEL DO PRIVATE (C, NCOL, phys_buffer_chunk)
   do c=begchunk,endchunk
      ncol = get_ncols_p(c)
      phys_buffer_chunk => pbuf_get_chunk(pbuf2d, c)
      !
      ! surface diagnostics for history files
      !
      call t_startf('diag_surf')
      call diag_surf(cam_in(c), cam_out(c), phys_state(c), phys_buffer_chunk)
      call t_stopf('diag_surf')

      call tphysac(ztodt, cam_in(c), &
                   cam_out(c),       &
                   phys_state(c), phys_tend(c), phys_buffer_chunk)
   end do ! Chunk loop
-----------------------------------------------------------------------------

I wonder if I can get around this issue by increasing the number of MPI processes (or OpenMP threads) so that it equals the total number of chunks on the physics grid, so that all chunks are processed in parallel and the mpi_gather/scatter functions can be used. If so, how?

If not, any suggestions on how to do what I want?


Thanks a lot.
Honghai
 

goldy

AMP / CGD / NCAR
Staff member
Using MPI inside of tphysac or tphysbc is not supported in CAM. However, I think your best bet is to try to have only one chunk and use one thread, since then all the columns will be available on each call to tphysbc or tphysac.
To make sure there is only one chunk, set PCOLS to a large enough number. For instance, say you are doing a 1-degree FV run (55296 columns) on 256 tasks. You want to round up (# columns / # tasks), which in this case gives PCOLS = 216.
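(If the task count does not divide evenly -- say 240 tasks -- then 55296 / 240 = 230.4, so you would round up to PCOLS = 231.)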
To set the threads to one and PCOLS to 216:

./xmlchange --append CAM_CONFIG_OPTS="-pcols 216"
./xmlchange NTHRDS=1

Then, run ./case.setup && ./case.build && ./case.submit

--Steve
 

hhzhang

Honghai Zhang
New Member
Hi Steve,

Thank you very much for answering!

I understand your suggestion to have only one chunk that includes all the columns, so that they are available on each call to tphysbc or tphysac, but I don't understand your example of setting PCOLS = 216 (i.e., 55296/256). Since PCOLS is the maximum number of columns in a chunk, wouldn't I have to set PCOLS = 55296 (or any larger number) to ensure there is only one chunk? I think what I really don't understand is how chunks/columns are distributed among the MPI tasks, or more specifically in this case, how one chunk including all columns would be distributed among different MPI tasks.

Another question: do I have to set the number of threads to one when there is only one chunk? If there is only one chunk, then I would expect the number of threads not to matter, because the chunk loop ('do c=begchunk,endchunk') is no longer really a loop once begchunk=endchunk. My understanding, which could be wrong, is that when there is more than one chunk and multiple threads, the chunks are divided among the threads to achieve parallelism. I guess it's probably safer to set the number of threads to one in this case?


Best,
Honghai
 

goldy

AMP / CGD / NCAR
Staff member
If you set PCOLS = 55296, you will probably run out of memory.

Columns are distributed among MPI tasks so that every task has the same or close to the same number of columns. That is why I divided by the number of tasks (but make sure PCOLS is large enough to absorb all columns on a task).

The number of chunks is decided on a per-task basis -- you want one chunk *per task* so that MPI calls work properly.

As for threads, you probably do not need to set NTHRDS to 1, but that seems safer since you have no use for threading (at least in CAM). It is certainly something you can play with once your code is working with NTHRDS==1.
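
For what it's worth, once every task holds a single chunk and NTHRDS=1, the cross-task averaging you describe might look roughly like the sketch below (untested, and again not an officially supported pattern; 'io_mask' is a placeholder mask for the Indian Ocean columns, mpicom comes from spmd_utils, and r8 is assumed to correspond to MPI_REAL8):

-----------------------------------------------------------------------------
! Rough, untested sketch: with one chunk per task and NTHRDS=1, every task
! enters tphysac/tphysbc exactly once per time step, so a collective call
! should match up across tasks.  The 'use' lines belong at the top of the
! subroutine; io_mask(1:ncol) is a placeholder mask for Indian Ocean columns.
use spmd_utils, only: mpicom
use mpi

real(r8) :: loc_sum(pver), glb_sum(pver), glb_avg(pver)
integer  :: loc_cnt, glb_cnt, i, ierr

loc_cnt = count(io_mask(1:ncol))
loc_sum = 0._r8
do i = 1, ncol
   if (io_mask(i)) loc_sum(:) = loc_sum(:) + ptend%s(i,:)
end do

call mpi_allreduce(loc_sum, glb_sum, pver, mpi_real8,   mpi_sum, mpicom, ierr)
call mpi_allreduce(loc_cnt, glb_cnt, 1,    mpi_integer, mpi_sum, mpicom, ierr)

if (glb_cnt > 0) then
   glb_avg(:) = glb_sum(:) / real(glb_cnt, r8)
   do i = 1, ncol
      if (io_mask(i)) ptend%s(i,:) = glb_avg(:)
   end do
end if
-----------------------------------------------------------------------------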
 