OK, this is great! Basically, the scaling looks good - and, already, at 96 cores, we're doing considerably better with the ATM and OCN than we were seeing at 288 cores! (The ocean, in fact, is roughly 4x faster, dropping from 226 seconds to 54!). So this makes the idea of a (general) communication issue less likely, and we know we can already double your performance, and likely more too.
... The catch, though, is that we still don't know why the original run was so slow. Let's try to figure that out - but first, let me answer your questions about POP decomposition and barriers.

At this resolution, with a 2-degree CAM and a 1-degree POP, that's generally expected - and, honestly, what we want. The reason is that we typically (though not in our two tests above!) use different sets of cores for the ATM and OCN, hence the 'ROOTPE_OCN'. That lets the ATM and OCN run concurrently: when the ATM finishes, it goes on to run the other components, like LND, ICE, and GLC, while the ocean finishes up. So it's OK for the ocean to be a little slower.

Now, in our 2n and 4n cases above, we didn't set a ROOTPE, so we ran everything on the same set of cores - which means we can certainly go even faster by running parts, especially the OCN and ATM, concurrently. In theory, that might cut the time roughly in half again, which would put us about 4x faster than the original run, but on only 192 cores instead of 576. And maybe we can scale out even further. Does that make sense? It's a lot of information, so let me know if anything isn't clear, and I'll try to explain better.
(As for the barriers: turning those on most likely slows down the total time a little... but it gives us more accurate per-component timings. It's not responsible for the improvements we're seeing here.)
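Just to make the concurrency idea concrete, here's a rough sketch of what that layout could look like, using the 4n-style counts purely as placeholders (the real numbers will come from the scaling tests below, and WAV would probably again get a smaller count):

./xmlchange NTASKS=96
./xmlchange ROOTPE=0
./xmlchange NTASKS_OCN=96
./xmlchange ROOTPE_OCN=96

That puts the OCN on its own 96 cores (96-191) while the ATM, LND, ICE, etc. share cores 0-95, for 192 cores total - which is where the "about 4x faster on 192 cores" estimate above comes from.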
So, why was the original 576-core run so much slower than the 96-core run? I'm not sure yet, but there are a few clues in the files. First, here's what some of the component times (in seconds) look like, from the original timing file you provided plus the two runs we just did:
           TOT      ATM      OCN      WAV (cores)    ICE     LND
Original:  227.387  110.736  226.589  12.355 (36c)   35.829  28.972
2n:        216.821  85.819   106.492  3.797 (24c)    7.877   8.857
4n:        112.370  44.193   54.228   2.031 (48c)    4.521   4.738
I added the number of cores next to the WAV column because the original run's WAV core count (36) sits between our two test runs (24 and 48). Our two runs show fairly linear scaling with core count, but the 36-core original shows a huge jump in time, which is definitely not expected behavior here. I can't really explain that other than to suggest it might be an issue on a node - either someone else was running on it at the same time, or perhaps a bad network connection on that node? It's hard to say, but it's definitely not normal.
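(If you want to pull these numbers yourself: they come from the timing summary each run writes into the case's timing/ directory, and something along these lines should grab the relevant rows, though the exact file name pattern can vary a bit between CESM versions.)

grep "Run Time:" timing/cesm_timing.*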
So I've got a few suggestions. One: since we're getting good times now, re-run the original case as a sanity test and make sure it's still giving bad times. That'll rule out a temporary issue, like a long-running job sharing a node, a network hiccup, or something like that. If the original is still slow, then we want to continue scaling out, so that we a) reach a similar number of cores, and b) have a chance of landing on whatever nodes the original run was using. So, two: I'd set up two new cases, '8n' and '16n', that continue to scale out. Here are detailed instructions, similar to above, though we'll again lock the WAV component to smaller core counts:
create_newcase --case 8n --compset B1850 --res f19_g17
cd 8n
./xmlchange DOUT_S=false
./xmlchange COMP_RUN_BARRIERS=true
./xmlchange NTASKS=192
./xmlchange NTASKS_WAV=72
./xmlchange NTASKS_ESP=1
./xmlchange ROOTPE=0
create_newcase --case 16n --compset B1850 --res f19_g17
cd 16n
./xmlchange DOUT_S=false
./xmlchange COMP_RUN_BARRIERS=true
./xmlchange NTASKS=384
./xmlchange NTASKS_WAV=96
./xmlchange NTASKS_ESP=1
./xmlchange ROOTPE=0
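Then, for each of those, it's the same setup/build/submit sequence as before (I'm assuming you'll carry over the same run length and other settings you used for the 2n and 4n cases):

./case.setup
./case.build
./case.submit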
Technically this still doesn't put us at the same node count as the 576-core run (at 24 cores per node, that was 24 nodes versus 16 here), but each component (except WAV) will be running on 384 cores, which is more than any component had in the 576-core case, since that run split the ATM and OCN to 288 cores each. So this should tell us whether we're simply running into the limits of the scaling curve for some reason.
At that point, we'll know where your optimal performance is, and we can turn off the barriers and set the ROOTPEs to run the components concurrently, which should give you good performance. If it does turn out we're seeing a bad node, though, we'll have to find some way to track that down!
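Concretely, that last step is mostly just flipping the barrier flag back off and applying NTASKS/ROOTPE settings along the lines of the sketch earlier, with whatever core counts win out:

./xmlchange COMP_RUN_BARRIERS=false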
Hope that helps, and sorry it's taking more work before we get you an answer, but I'm very optimistic we can get you running with decent performance after this. :-)
Any questions? Good luck!
- Brian