How to check CPU time for each component

Dear all, I am running a B1850 case at f19_g17 resolution. However, the run is very slow: when I check the cpl log, each model day (dt) costs almost 300 seconds of wall clock.

e.g. tStamp_write: model date = 00010206 0 wall clock = 2020-07-03 16:39:57 avg dt = 272.78 dt = 272.21
memory_write: model date = 00010206 0 memory = -0.00 MB (highwater) 2062.15 MB (usage) (pe= 0 comps= cpl ATM LND ICE OCN GLC WAV ESP)

So I was wondering: how do I check the time cost for each component? I can see the time steps in the CAM log, but what about POP?

I have also attached the PE layout for the test. Does it look reasonable? Thanks very much!
 

Attachments

  • env_mach_pes.txt
    6.9 KB · Views: 14

fischer

CSEG and Liaisons
Staff member
Hi,
In your case directory there is a timing directory that has timing files in it. The cesm_timing.CASENAME file has total running time for each component.
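(For reference, here is a minimal sketch of how one might look at those files from the case directory; the "Run Time" grep target is an assumption about the usual cesm_timing format, so adjust it to what your file actually contains:

cd $CASEROOT/timing
ls cesm_timing.*                        # one summary file per completed run
grep "Run Time" cesm_timing.CASENAME.*  # per-component wall-clock totals (TOT, ATM, OCN, ...)
)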

Looking at your PE layout, you didn't set any ROOTPE values.

Try using the following.
<entry id="ROOTPE">
<type>integer</type>
<values>
<value compclass="ATM">0</value>
<value compclass="CPL">0</value>
<value compclass="OCN">400</value>
<value compclass="WAV">304</value>
<value compclass="GLC">0</value>
<value compclass="ICE">0</value>
<value compclass="ROF">0</value>
<value compclass="LND">240</value>
<value compclass="ESP">0</value>
</values>
</entry>

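(Roughly equivalent xmlchange commands, run from the case directory, would be the following - just a sketch assuming the standard ROOTPE_* variables; re-run case.setup afterwards so the new layout takes effect:

./xmlchange ROOTPE_OCN=400
./xmlchange ROOTPE_WAV=304
./xmlchange ROOTPE_LND=240
./case.setup --reset
)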
On our main test system we use the following.

<entry id="NTASKS">
<type>integer</type>
<values>
<value compclass="ATM">288</value>
<value compclass="CPL">288</value>
<value compclass="OCN">288</value>
<value compclass="WAV">36</value>
<value compclass="GLC">36</value>
<value compclass="ICE">108</value>
<value compclass="ROF">40</value>
<value compclass="LND">144</value>
<value compclass="ESP">1</value>
<value compclass="IAC">1</value>
</values>
<desc>number of tasks for each component</desc>
</entry>
...
<entry id="ROOTPE">
<type>integer</type>
<values>
<value compclass="ATM">0</value>
<value compclass="CPL">0</value>
<value compclass="OCN">288</value>
<value compclass="WAV">252</value>
<value compclass="GLC">0</value>
<value compclass="ICE">144</value>
<value compclass="ROF">0</value>
<value compclass="LND">0</value>
<value compclass="ESP">0</value>
<value compclass="IAC">0</value>
</values>
<desc>ROOTPE (mpi task in MPI_COMM_WORLD) for each component</desc>
</entry>

And we get the following timesteps:

tStamp_write: model date = 00010108 0 wall clock = 2020-06-25 22:41:15 avg dt = 22.41 dt = 21.06
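(A quick way to double-check the layout a case will actually use - a sketch assuming the usual CESM2 case tools; ./pelayout may not exist in older versions:

./xmlquery NTASKS_ATM,NTASKS_OCN,ROOTPE_OCN,ROOTPE_WAV
./pelayout
)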
 
And to tell the truth, I don't quite understand ROOTPE. Why do we need to set it for some of the components but not others? What if we set ROOTPE to 400 (just as an example) for all the components? Will that speed up the model? Thanks
 
Hi, I followed the main test PE layout you provided above, and the test still shows an avg dt of around 270 seconds.
I have attached the timing output. It looks like POP is taking most of the time to run, and the other components are not that fast either.
Is it because of the flags used during compiling? For instance, the -D_USE_FLOW_CONTROL option for POP, or the -O0 debugging option?
Or is it due to something else?
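(For what it's worth, one quick thing to rule out first is a debug build - a sketch assuming the standard CESM2 case tools; the build-log path is illustrative:

./xmlquery DEBUG                            # should be FALSE for production runs
grep -i -- '-O0' /path/to/bld/atm.bldlog.*  # check the Fortran flags recorded in the build logs under your build (EXEROOT) directory
)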

Thanks very much again.
 

Attachments

  • timing.txt
    9.7 KB · Views: 12

rajkmsaini

Dr. Raj Saini
Member
Hi,

Where can I find this timing file?

Thanks in advance.

Best,
Raj
 
OK, as far as I understand, in the example that you provided, CAM and CPL run on the first 288 processors and POP starts at processor 288. WAV uses the last 36 of the processors that are also used for CAM and CPL. That basically makes POP run alone on processors 288 to 575. Is that correct? If that's the case, POP is still ten times slower than in your example. Where could the problem be? My colleague was joking that maybe the cluster has a bad fan and the temperature is too high...
 

dobbins

Brian Dobbins
CSEG and Liaisons
Staff member
Yes, that's correct - there's a flow for how CESM components operate, and the setup for a B-compset involves the ATM and OCN (the two most expensive components, generally) running first, on large processor counts. Once the ATM is finished, you can run the LND, ICE and WAV components, for example, on the now-idle ATM processors, even while OCN finishes up.
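(To make that flow concrete with the example layout posted earlier (ATM/CPL on tasks 0-287, OCN on 288 tasks starting at ROOTPE 288) - a rough sketch from the editor, with time running left to right:

ranks 0-287   : CPL + ATM ----------------> LND / ICE / WAV on the now-idle ATM ranks
ranks 288-575 : OCN ------------------------------------------------------------------>
)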

With that said, can you share a little more information about your cluster? Basically, I have two sets of questions - one, what kind of processors do you have, and what's the interconnect (eg, Infiniband or Ethernet)? This should shed some light on what's the root cause of the performance you're seeing. By way of (indirect) comparison, if I run a 1-degree (f09) case on our system, the land model -which is typically quite fast- takes ~7.3 seconds/day on 144 cores. And the ATM, again at 1-degree, on 288 cores, takes ~45 seconds/day. Since your times are 29 seconds/day (LND) and 111 seconds/day (ATM), I'm wondering if you aren't able to scale out this far on a slower interconnect, and thus we need a smaller decomposition. (I see this, for example, running on Cloud instances with Ethernet networking). Alternatively, it's just a processor thing, but the CPL COMM times you see make me think it might be interconnect.

Second question, how many processors per node do you have, and how much memory per node? That might help us craft a decomposition that's better suited to your system, though I can't promise anything as this is a difficult task! One we're working on, but still difficult.

Thanks!
 
Thanks very much for the reply, Brian. The cluster I am using has 2 x 12-core Intel Xeon E5-2692 v2 processors per node, so 24 cores and 64 GB of memory per node. But the interconnect is neither InfiniBand nor Ethernet; it is self-developed. I also tried CESM 1.2.2 at T31_g37 resolution with the PE layouts from the CESM1 Timing Table, but the performance is still bad, around 60 seconds to complete one model day. I should mention that I set dt_count = 40 to avoid a CFL problem. Does that setting slow the model significantly?
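(For reference, dt_count is normally set through the POP user namelist - a sketch assuming a CESM2-style user_nl_pop2 file; the value 40 comes from the post above, everything else here is illustrative:

echo "dt_count = 40" >> user_nl_pop2   # more ocean steps per model day generally means proportionally more OCN time
./preview_namelists                    # confirm the generated ocn_in picks it up
)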

Thank you
 

dobbins

Brian Dobbins
CSEG and Liaisons
Staff member
Ah, a self-developed interconnect makes this really interesting! It could be that, it could be the processors (they're slightly behind ours, but not THAT much), or, likely, some combination of both. Let's do a very simple scaling test, if you don't mind. If this is too complex with the B compset, we can switch to an atmosphere-only case, but I think B will be more informative.

So, set up two cases - one for 2 nodes, one for 4. The 2-node case should fit in 64 GB of RAM per node, but if not, we'll change this up to 4 and 8 nodes. Basically, we want to look at the timing files of a run, and of a run on 2x the number of cores, and see if we're getting considerably less than 2x the performance. This isn't perfect by any means, but it's a good start. E.g.:

create_newcase --case 2n --compset B1850 --res f19_g17
create_newcase --case 4n --compset B1850 --res f19_g17


Then, for each case, do the following:
./xmlchange DOUT_S=false
./xmlchange COMP_RUN_BARRIERS=true


The first just turns off the archiver (unnecessary here), and the second turns on 'barriers' between the components. This will make the timings a bit more accurate, at the expense of some performance - but since this is just to understand the performance, that's OK.

Finally, for the 2n case, do the following:
./xmlchange NTASKS=48
./xmlchange NTASKS_WAV=24
./xmlchange NTASKS_ESP=1
./xmlchange ROOTPE=0


And for the 2n case, do this:
./xmlchange NTASKS=96
./xmlchange NTASKS_WAV=48
./xmlchange NTASKS_ESP=1
./xmlchange ROOTPE=0


There's a lot going on here, so let me explain - first, we're setting each component to use 48 (or 96) tasks. Obviously this might be really slow, since we're using far fewer cores than before, but it establishes a baseline with less communication. In theory we could go lower (e.g., 1 node and 2 nodes), but in practice we'd likely run out of RAM on that few processors. Second, we're setting the WAV component to use fewer tasks - 24, or 48. This is because WAV communication performance often behaves very strangely -- this will still let us evaluate WAV scalability, but with less risk that we'll run super slow if there's some issue there. Next, we set the ESP component, which is unused, to just 1 core. Finally, we're running all of these on the same processor set - not doing the atmosphere and ocean concurrently on their own sets.

If you can set up and run those, and send the timing files, I think we can better understand what we're seeing on your system, and hopefully come up with a more optimal layout. Let me know if you run into trouble or have any questions. :-)
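(For completeness, a sketch of the remaining steps for each case, assuming the standard CESM2 workflow; the 5-day run length is just an illustrative choice long enough to produce useful timings:

cd 2n          # and likewise for 4n
./case.setup
./xmlchange STOP_OPTION=ndays,STOP_N=5
./case.build
./case.submit
# after the run completes, look for the cesm_timing.* file in the case's timing/ directory
)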

Thanks!
 

dobbins

Brian Dobbins
CSEG and Liaisons
Staff member
Whoops, the last set of xmlchange commands starting with NTASKS=96 should say for the 4n case, obviously.
 
Hi Brian, I tried the 2n and 4n cases as you suggested and attached the timing files. It looks like the POP and CAM running times drop by about half when we double the processors, and POP is the most expensive component. So, do we need to change the POP decomposition?
What is more, using 96 processors is faster than using 288 processors if I turn on 'barriers'. Why is that? Is it due to the coupler?

Thanks very much
 

Attachments

  • 2n_timing.txt
    11.7 KB · Views: 2
  • 4n_timing.txt
    11.7 KB · Views: 3

dobbins

Brian Dobbins
CSEG and Liaisons
Staff member
OK, this is great! Basically, the scaling looks good - and, already, at 96 cores, we're doing considerably better with the ATM and OCN than we were seeing at 288 cores! (The ocean, in fact, is roughly 4x faster, dropping from 226 seconds to 54!). So this makes the idea of a (general) communication issue less likely, and we know we can already double your performance, and likely more too.

... The catch, though, is we still don't know why the original run was so slow. Let's try to figure that out, but first, let me answer your questions about POP decomposition and barriers. At this resolution, with a 2-degree CAM and a 1-degree POP, POP taking the most time is generally expected - and, honestly, what we want. The reason for this is we typically (but not in our 2 tests above!) use different sets of cores for the ATM and OCN, hence the 'ROOTPE_OCN'. This makes the ATM and OCN parts run concurrently, and when the ATM finishes, those cores then run other components, like LND, ICE and GLC, while the ocean finishes up. So it's OK for the ocean to be a little slower. Now, in our 2n and 4n cases above, we didn't set a ROOTPE, so we ran everything on the same set of cores - this means we can certainly go even faster by running parts, especially the OCN and ATM, concurrently. In theory, this might drop things by about half, which would put us about 4x faster than the original run, but on only 192 cores instead of 576. And maybe we can scale out even further. Does that make sense? It's a lot of information, so let me know if anything isn't clear, and I'll try to explain better.
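(As a purely hypothetical sketch of what such a 192-core concurrent layout could look like - the counts below are the editor's illustration, not values given in this thread:

./xmlchange NTASKS=96
./xmlchange NTASKS_WAV=24,NTASKS_ESP=1
./xmlchange ROOTPE=0
./xmlchange NTASKS_OCN=96,ROOTPE_OCN=96   # OCN alone on ranks 96-191, everything else on ranks 0-95
./xmlchange COMP_RUN_BARRIERS=false       # barriers back off for production runs
)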

(As for the barriers, that actually likely slows down the total time... but gives us more accurate per-component times. It's not responsible for the improvements we're seeing here.)

So, why was the original run on 576 cores so much slower than the 96-core run? I'm not sure yet, but there are a few clues in the files. First, here's what some of the component times look like for the original timing file you provided, plus the two runs we just did:

TOT ATM OCN WAV ICE LND
Original: 227.387 110.736 226.589 12.355 (36c) 35.829 28.972
2n: 216.821 85.819 106.492 3.797 (24c) 7.877 8.857
4n: 112.370 44.193 54.228 2.031 (48c) 4.521 4.738


I added the number of cores next to the WAV component since the original time was in the middle of our two runs - our two showing fairly linear scaling, but the 36-core one showing a huge jump, which is definitely not expected behavior here. I can't really explain that other than to suggest maybe an issue on a node - either, someone else was running on it at the same time, or perhaps a bad network connection on that node? It's hard to say, but it's definitely not normal.

So I've got a few suggestions. One, if we're getting good times now, as a sanity test, re-run the original case and ensure we're still getting bad times from it. That'll rule out some temporary issue, like a long-running code on a node, or a network hiccup, or something like that. If we're still getting bad times from the original, then we want to continue scaling out, so we a) reach similar numbers of cores, and b) reach whatever nodes the original run was using. So two, I'd set up two new cases, '8n' and '16n', that continue to scale out. Here are detailed instructions, similar to those above, but we'll probably lock the WAV component to smaller counts again:

create_newcase --case 8n --compset B1850 --res f19_g17
cd 8n
./xmlchange DOUT_S=false
./xmlchange COMP_RUN_BARRIERS=true
./xmlchange NTASKS=192
./xmlchange NTASKS_WAV=72
./xmlchange NTASKS_ESP=1
./xmlchange ROOTPE=0

create_newcase --case 16n --compset B1850 --res f19_g17
cd 16n
./xmlchange DOUT_S=false
./xmlchange COMP_RUN_BARRIERS=true
./xmlchange NTASKS=384
./xmlchange NTASKS_WAV=96
./xmlchange NTASKS_ESP=1
./xmlchange ROOTPE=0


Technically this still doesn't put us at the same node count as the 576-core run (24 nodes), but each component (except WAV) will be running on 384 cores, which is more than each component had in the 576-core case, since we split ATM and OCN to 288 each. So this should tell us if we're hitting a scaling curve limit for some reason.

At that point, we can know where your optimal performance is, and turn off the barriers and set the ROOTPEs to run different components concurrently, and ideally give you good performance. If we are seeing a bad node, though, we'll have to find some way to track that down!

Hope that helps, and sorry it's taking more work before we get you an answer, but I'm very optimistic we can get you running with decent performance after this. :-)

Any questions? Good luck!
- Brian
 

dobbins

Brian Dobbins
CSEG and Liaisons
Staff member
Oops. The timing table looked nicely formatted when I posted it, but now looks messed up. Here it is in table form:


              TOT        ATM        OCN        WAV             ICE       LND
Original      227.387    110.736    226.589    12.355 (36c)    35.829    28.972
2n            216.821     85.819    106.492     3.797 (24c)     7.877     8.857
4n            112.370     44.193     54.228     2.031 (48c)     4.521     4.738
 
Wow, thanks very much Brian, this is actually very interesting and I have learned a good lesson from you. I will try to digest this information first and then do the test runs. Thanks very much again, I really appreciate it. When I was using CCSM3, someone set it up for me, so now I am finally learning :)
 

dobbins

Brian Dobbins
CSEG and Liaisons
Staff member
No worries, this is complicated stuff - even those of us who deal with it often lack some certainty. We are working on having an automatic load-balancing tool, but it's a ways off from being ready for real use. That said, I'm still inclined to think this might be a bad node or something!

Anyway, I'll look forward to your updates, and to getting you up and running with good performance!
 