Is there a publicly released version of CESM based on GPUs?


levinzx

Xin
New Member
Dear community,

Is there a publicly released version of GPU-based CESM? It seems that CESM2 is CPU-based. Which repository and guide should I refer to if I'm running CESM on a GPU machine?

Best,
Xin
 

dobbins

Brian Dobbins
CSEG and Liaisons
Staff member
Hi Xin,

Let me add a bit more information here -- Haipeng is correct that there will be some GPU-accelerated code within CESM3, as part of various efforts towards running on GPUs, but we are still quite a ways off from a fully performant, GPU-resident CESM. Basically, parts of the atmosphere model (some physics, some dynamical cores) are indeed runnable now, but other parts are not - on a lot of current equipment, this results in slow data transfers between the CPU and GPU parts of the code, which effectively negate the benefits of the GPUs. On much newer platforms that offer more tightly coupled GPU/CPU memory systems, I expect that to be a lot better, but I don't have numbers on that yet. There are also ongoing efforts to GPU-ize other components, like the ocean model, but those are also still early in development.

In short, this is very much a work in progress. If you're doing GPU development of CAM physics, say, then yes, CESM3 will have some code you can look at, use, modify, etc. But if you want to do science runs on a GPU system, that won't be a target of CESM3.

Hope that helps, and if you have specific needs or questions, I'm happy to provide more info!

Cheers,
- Brian
 

levinzx

Xin
New Member

Hi Brian,

Thank you very much for the insights! We plan to run GPU-based CESM and use the output for some applications, since we have a new GPU-based instance.

It's a very interesting point that hybrid CPU/GPU execution can be limited by data transfers, reducing the benefit relative to a pure GPU architecture. We hadn't thought about this before! It seems that almost all GCMs currently pursuing GPU acceleration are not fully GPU-resident, which means the limitation of slow data transfers between the CPU and GPU parts of the code is likely to persist in these models for a considerable time. Meanwhile, we will still need a machine that supports both CPUs and GPUs to install and run these models.

This is very helpful! We will do some tests on the instance to see if we can set up a hybrid CPU and GPU model. Meanwhile, we are looking forward to the release of CESM3.

Best,
Xin
 

dobbins

Brian Dobbins
CSEG and Liaisons
Staff member
Hi Xin,

Just to be clear, running on many GPUs right now (such as the A100s in NCAR's Derecho) results in slower performance than the CPU-based runs, due to those data transfers. Some parts of the code run ~4x faster, but overall, it's a slowdown, so I'd recommend against running on those GPUs. I haven't yet tried the newer systems, like NVIDIA's Grace Hopper or later, or the newer AMD accelerators with unified memory, which avoid some of the dramatic costs of the data transfers, so those might fare better... but they're likely still not at the point where they're cost-effective, since a lot of code still runs on CPUs. Note that there are also issues with scaling out on GPUs (depending on how much memory you have on one) due to the large state size of CESM.
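To see why a ~4x speedup on part of the code can still yield an overall slowdown, here's a back-of-the-envelope Amdahl's-law sketch. All numbers are made up for illustration, not measured CESM figures:

```python
# Illustrative Amdahl's-law estimate of net speedup for a hybrid CPU/GPU run.
# The parameter values below are hypothetical, not measured CESM numbers.

def net_speedup(gpu_fraction, kernel_speedup, transfer_overhead):
    """Net speedup when part of the runtime moves to the GPU.

    gpu_fraction:      share of total CPU runtime that is GPU-accelerated
    kernel_speedup:    how much faster that share runs on the GPU
    transfer_overhead: extra CPU<->GPU copy time, as a fraction of total runtime
    """
    new_time = (1 - gpu_fraction) + gpu_fraction / kernel_speedup + transfer_overhead
    return 1 / new_time

# 30% of the runtime accelerated 4x, but copies add 35% overhead:
print(round(net_speedup(0.30, 4.0, 0.35), 2))  # 0.89 -> a net slowdown

# With unified memory (near-zero copy cost), the same kernels now help:
print(round(net_speedup(0.30, 4.0, 0.0), 2))   # 1.29 -> a modest net speedup
```

The point of the sketch: unless the accelerated fraction is large or the transfer overhead is small, the un-ported CPU portions plus the copy cost dominate the total runtime.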

If you have a GPU you're looking to use, I'd recommend running on CPUs (or, using existing datasets?), then doing some analysis on the GPU, which likely does have the potential to speed things up considerably.

As for other models, I think some parts are likely not necessary on GPUs - eg, every GPU system still has a CPU on the server, and for some low-cost parts of the model, like the land portion, you can likely run concurrently on CPUs. Typically, our focus is on getting the atmosphere and ocean running on GPUs, since they dominate the cost.

Cheers,
- Brian
 

levinzx

Xin
New Member

Hi Brian,

Thank you very much for the follow-up! We realized that all currently available GPU-enabled GCMs are hybrid, so the data transfer issue will be a problem when running on GPU nodes. I think it is a great idea to test the model's performance on CPUs first and then compare it with the performance on GPU nodes.

Thank you again for the advice!

Best,
Xin
 

dobbins

Brian Dobbins
CSEG and Liaisons
Staff member
Sure, keep us in the loop as your plans firm up -- I've got some exploration of this space planned for the near future (specifically on the newer Grace Hopper / Grace Blackwell nodes), in the hope that it will point towards future development targets for the GPU capabilities.
 