Performance optimization is both an art and a science. But the main question at play is: what do you want to optimize for? Do you want to run as fast as possible, to minimize cost, or to strike some balance between the two? Ah, you just posted that you are primarily thinking about cost.
Unfortunately, FATES also makes performance optimization harder, since the number of subgrid-scale units grows and changes over time, so the performance characteristics will change through your run. I would therefore run for a while first and save a restart file you can start from before you look at timing characteristics.
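A minimal sketch of that workflow with CIME's `xmlchange` (assuming a standard CIME case directory; the 5-year spinup length is just an example, not a recommendation):

```shell
# spin up first so the FATES demographics settle, writing a restart at the end
./xmlchange STOP_OPTION=nyears,STOP_N=5,REST_OPTION=end
./case.submit

# later timing tests then continue from that restart instead of a cold start
./xmlchange CONTINUE_RUN=TRUE
```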
You can't really use the CESM performance tool for I cases, so you'll need to understand the timing files; there's some documentation on that for CIME here:
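For reference, CIME writes a timing summary into the case's timing/ subdirectory after each completed run, and the two numbers you care about can be pulled out with grep (the `cesm_timing.*` file prefix is the usual convention, but it may differ by model version):

```shell
# from the case directory, after a completed run:
# "Model Cost" is in pe-hrs per simulated year, "Model Throughput" in simulated years per day
grep -E "Model Cost|Model Throughput" timing/cesm_timing.*
```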
When your main goal is throughput, it's best to run datm concurrent with CTSM, giving CTSM as many processors as it needs to keep up with datm on a single node. But for optimizing cost you'll want to run sequentially (which is what you are doing here, by the way). Note that @sacks found that datm doesn't scale, so you only want to run it on a single node (he found 4 nodes was slower). You could possibly use 2 or 3 nodes for datm, but that would be the most.
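To make the sequential-versus-concurrent distinction concrete, here's a hedged sketch in `xmlchange` terms: in a sequential layout both components start at root PE 0 and share processors, while a concurrent layout offsets datm's root PE past CTSM's tasks so they run side by side (the 360 offset assumes 10 nodes of 36 tasks for CTSM; adjust for your machine):

```shell
# sequential (your current setup): both components start at PE 0 and share nodes
./xmlchange ATM_ROOTPE=0,LND_ROOTPE=0

# concurrent alternative (throughput-oriented): datm gets its own node
# placed after CTSM's 360 tasks
#   ./xmlchange ATM_ROOTPE=360

./case.setup --reset   # re-generate the PE layout after changing it
```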
So the first change I'd recommend is telling datm to run on fewer nodes. For CTSM, the highest you could go in terms of processors would be that 908 mark, so the most you could use in whole nodes would be 25.
By the way, I recommend using the negative whole-node syntax here so you don't have to do the math. So tell it you want -1 tasks (meaning 1 node) for datm (ATM_NTASKS) and -10 tasks (meaning 10 nodes, or 360 tasks) for CTSM (LND_NTASKS).
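In `xmlchange` terms that suggestion looks like this (run from the case directory):

```shell
# negative NTASKS means whole nodes, so no per-node task math is needed
./xmlchange ATM_NTASKS=-1     # datm: 1 node
./xmlchange LND_NTASKS=-10    # CTSM: 10 nodes (360 tasks at 36 tasks/node)
./case.setup --reset          # regenerate the PE layout after changing tasks
```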
Now the bad news: you can make the model run faster by adding more processors, but that comes at the expense of efficiency. If you really wanted to run efficiently, a smaller number of nodes would likely be the best answer. I suspect that running on just a few nodes would be the most cost-effective, but it would also be too slow for your work. So you'll have to figure out what balance of efficiency and throughput you want. That's where you run some tests, look at both throughput and cost in the timing files, and find the balance you want to see.
One thing that does come into play is how balanced the workload is across processors. If the workload were perfectly balanced, the multiprocessing case would be quite efficient, but that never happens. One approach is to choose a number of tasks that divides evenly into the number of gridcells, though your 908 gridcells make that hard! With FATES, though, you don't expect different gridcells to have a balanced load anyway, and there isn't really a way to get them balanced (there is a performance namelist item you could play with, but it's only static). So you can just try different numbers of tasks and look at the timing to find a value that seems optimized.
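One way to run that sweep is to clone the case for each task count with CIME's `create_clone` tool (a sketch: the case paths and the list of node counts here are just examples, and `create_clone` lives in cime/scripts in most checkouts):

```shell
# build and submit one clone per CTSM node count to compare timing files
for n in 5 10 15 20 25; do
  ./create_clone --case ../cases/perf_${n}nodes --clone ../cases/mycase
  cd ../cases/perf_${n}nodes
  ./xmlchange LND_NTASKS=-${n}          # negative value = whole nodes
  ./case.setup && ./case.build && ./case.submit
  cd -
done
```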
We have a performance test that runs for 20 days, short enough that history output isn't written and doesn't play into the performance. I'd recommend running tests of around that length; you can decide whether you want I/O performance to figure into the optimization. I/O is going to be more variable, though...
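Setting a case up for a timing test like that is just a stop-length change (assuming monthly history, which a 20-day run never reaches; if you do want I/O in the measurement, run a month or more instead):

```shell
# 20-day run: short enough that monthly history files never get written
./xmlchange STOP_OPTION=ndays,STOP_N=20
./case.submit
```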