Performance optimization is both an art and a science. But the main question at play is: what do you want to optimize for? Do you want to run as fast as possible, to minimize cost, or to strike some balance between the two? Ah, you just posted that you are primarily thinking about cost.
Unfortunately, FATES also makes performance optimization harder, since the number of subgrid-scale units grows and changes over time, so the performance characteristics will change through your run. I would therefore run for a while first and save a restart file you can start from before you look at timing characteristics.
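A minimal sketch of that workflow with CIME's `xmlchange` (assuming a standard CIME case directory; the 5-year spinup length is just an example, not a recommendation):

```shell
# spin up first so the FATES demographics settle, writing a restart at the end
./xmlchange STOP_OPTION=nyears,STOP_N=5,REST_OPTION=end
./case.submit

# later timing tests then continue from that restart instead of a cold start
./xmlchange CONTINUE_RUN=TRUE
```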
You can't really use the CESM performance tool for I cases, so you'll need to understand the timing files; there's some documentation on that for CIME here:
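For reference, CIME writes a timing summary into the case's timing/ subdirectory after each completed run, and the two numbers you care about can be pulled out with grep (the `cesm_timing.*` file prefix is the usual convention, but it may differ by model version):

```shell
# from the case directory, after a completed run:
# "Model Cost" is in pe-hrs per simulated year, "Model Throughput" in simulated years per day
grep -E "Model Cost|Model Throughput" timing/cesm_timing.*
```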
When your main goal is throughput, it's best to run datm concurrent with CTSM, giving CTSM as many processors as it needs to keep up with datm on a single node. But for optimizing cost you'll want to run sequentially (which is what you are doing here, by the way). Note that @sacks found that datm doesn't scale, so you only want to run it on a single node (he found 4 nodes was slower). You could possibly use 2 or 3 nodes for datm, but that would be the most.
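To make the sequential-versus-concurrent distinction concrete, here's a hedged sketch in `xmlchange` terms: in a sequential layout both components start at root PE 0 and share processors, while a concurrent layout offsets datm's root PE past CTSM's tasks so they run side by side (the 360 offset assumes 10 nodes of 36 tasks for CTSM; adjust for your machine):

```shell
# sequential (your current setup): both components start at PE 0 and share nodes
./xmlchange ATM_ROOTPE=0,LND_ROOTPE=0

# concurrent alternative (throughput-oriented): datm gets its own node
# placed after CTSM's 360 tasks
#   ./xmlchange ATM_ROOTPE=360

./case.setup --reset   # re-generate the PE layout after changing it
```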
So the first change I'd recommend is telling datm to run on fewer nodes. For CTSM, the highest you could go in terms of processors would be that 908 mark, so the most you could use in whole nodes would be 25.
By the way, I recommend using the negative whole-node syntax here so you don't have to do the math. So tell it you want -1 tasks (meaning 1 node) for datm (ATM_NTASKS) and -10 tasks (meaning 10 nodes, or 360 tasks) for CTSM (LND_NTASKS).
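In `xmlchange` terms that suggestion looks like this (run from the case directory):

```shell
# negative NTASKS means whole nodes, so no per-node task math is needed
./xmlchange ATM_NTASKS=-1     # datm: 1 node
./xmlchange LND_NTASKS=-10    # CTSM: 10 nodes (360 tasks at 36 tasks/node)
./case.setup --reset          # regenerate the PE layout after changing tasks
```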
Now the bad news: you can make the model run faster by adding more processors, but that comes at the expense of efficiency. If you really wanted to run efficiently, a smaller number of nodes would likely be the best answer. I suspect that running on just a few nodes would be the most cost-effective, but it would also be too slow for your work. So you'll have to figure out what balance of efficiency and throughput you want. That's where you run some tests, look at both throughput and cost in the timing files, and find the balance you want to see.
One thing that does come into play is how balanced the workload is across processors. If the workload were perfectly balanced, the multiprocessing case would be quite efficient, but that never happens. One approach is to choose a number of tasks that divides evenly into the number of gridcells, though your 908 gridcells make that hard! With FATES, though, you don't expect different gridcells to have a balanced load anyway, and there isn't really a way to get them balanced (there is a performance namelist item you could play with, but it's only static). So you can just try different numbers of tasks and look at the timing to find a value that seems optimized.
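One way to run that sweep is to clone the case for each task count with CIME's `create_clone` tool (a sketch: the case paths and the list of node counts here are just examples, and `create_clone` lives in cime/scripts in most checkouts):

```shell
# build and submit one clone per CTSM node count to compare timing files
for n in 5 10 15 20 25; do
  ./create_clone --case ../cases/perf_${n}nodes --clone ../cases/mycase
  cd ../cases/perf_${n}nodes
  ./xmlchange LND_NTASKS=-${n}          # negative value = whole nodes
  ./case.setup && ./case.build && ./case.submit
  cd -
done
```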
We have a performance test that runs for 20 days, short enough that history output isn't written and doesn't play into the performance. I'd recommend running tests of around that length; you can decide whether you want I/O performance to figure into the optimization. I/O is going to be more variable, though...
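Setting a case up for a timing test like that is just a stop-length change (assuming monthly history, which a 20-day run never reaches; if you do want I/O in the measurement, run a month or more instead):

```shell
# 20-day run: short enough that monthly history files never get written
./xmlchange STOP_OPTION=ndays,STOP_N=20
./case.submit
```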