
speeding up WACCM on Janus

I am running an F_1850_WACCM compset simulation, 200 years' worth, on the CU Janus supercomputer. I have been doing tests to see how I can increase the speed of the simulation. My ocean/ice dataset is from a previously run interactive ocean run, with the data on a 384x320 lat/lon grid. Right now it takes over 14 hours to run just 7 months, which seems slow to me, and I am running at 1.9x2.5 resolution. I have NTASKS set to 84, NTHRDS to 1, and ROOTPE set to 0. Is there a combination of settings I can use to increase the speed so that I can get more months completed in a 24-hour period? Thank you,
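[Editor's note: for readers following along, a minimal sketch of how a PE layout like this is typically changed from the case directory in CESM1-era releases. The ids, setup scripts, and build script name vary by version, so treat this as illustrative rather than the poster's exact procedure.]

    # From the case directory: raise the task count for the active components
    # (CESM1-style xmlchange syntax; ids and script names differ by release)
    ./xmlchange -file env_mach_pes.xml -id NTASKS_ATM -val 180
    ./xmlchange -file env_mach_pes.xml -id NTASKS_LND -val 180
    ./xmlchange -file env_mach_pes.xml -id NTASKS_ICE -val 180
    ./xmlchange -file env_mach_pes.xml -id NTASKS_OCN -val 180
    ./xmlchange -file env_mach_pes.xml -id NTASKS_CPL -val 180

    # Reapply the PE layout and rebuild (build script name matches the case)
    ./cesm_setup -clean && ./cesm_setup
    ./<casename>.build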
 

jedwards

CSEG and Liaisons
Staff member
Please let us know the model version you are using and send a pointer to your case directory. Thanks.
 

santos

Member
NTASKS=84 is well below where the model will scale to. I would try NTASKS=180 to start with. I don't recall whether or not there's a benefit to threading on Janus.
 
Ok, I am trying my model with 180 tasks as suggested. I will let you know whether it helps speed up my model run. I did have a question about threading. I followed this example to do threading (http://www.cesm.ucar.edu/models/cesm1.0/cesm/cesm_doc_1_0_4/x2574.html) but it did not work for me. I got this error message in the cesm log:

    mpirun noticed that process rank 86 with PID 4713 on node node0358 exited on signal 11 (Segmentation fault).

Any idea what this means? Is threading not available on Janus?
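[Editor's note: not a diagnosis of the segfault above, but for context, enabling hybrid MPI/OpenMP in a CESM1-era case generally means setting NTHRDS alongside NTASKS and then cleaning and rebuilding, since changing the threading requires a rebuilt executable. A hedged sketch with illustrative values:]

    # Example only: 2 OpenMP threads per MPI task for the atmosphere
    ./xmlchange -file env_mach_pes.xml -id NTHRDS_ATM -val 2
    # Total cores used is roughly tasks x threads, so the batch
    # request must be sized accordingly
    ./cesm_setup -clean && ./cesm_setup
    # ...then rebuild so the executable is compiled with OpenMP support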
 
Hi All,
I ran my WACCM simulation on Janus with 180 processors and no threading. It took a day to run a year of WACCM. This is not realistic for my studies. Is there any way I can speed this up? Am I doing something wrong in my setup to make this run so slow?
 

jedwards

CSEG and Liaisons
Staff member
Okay, so you went from 84 tasks running 7 months in 14 hours to 180 tasks running a year in 24 hours. It would help if you could post the simulated-years-per-day figure that is printed at the end of the cpl.log file. Perhaps you should try 240 tasks next? WACCM is expensive; there is no getting around that. There are things you can try to improve the performance, such as different compiler options or environment settings. You could also run under a profiler and see if you can spot any performance bottlenecks. But we don't really have the resources to help you with that.
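[Editor's note: for anyone unsure where those figures live, they appear near the end of the coupler log. Something like the following pulls them out; the run-directory path is a placeholder, and logs that have been archived may be gzipped, in which case zgrep works the same way.]

    # Print the cost/throughput summary from the coupler log
    grep -E "Model (Cost|Throughput)" /path/to/run/cpl.log.*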
 
Here are the results from the cpl log:

  component       comp_pes    root_pe   tasks  x threads instances (stride)
  ---------        ------     -------   ------   ------  ---------  ------ 
  cpl = cpl        180         0        180    x 1       1      (1     )
  glc = sglc       180         0        180    x 1       1      (1     )
  wav = swav       180         0        180    x 1       1      (1     )
  lnd = clm        180         0        180    x 1       1      (1     )
  rof = rtm        180         0        180    x 1       1      (1     )
  ice = cice       180         0        180    x 1       1      (1     )
  atm = cam        180         0        180    x 1       1      (1     )
  ocn = docn       180         0        180    x 1       1      (1     )

  total pes active           : 180
  pes per node               : 12
  pe count for cost estimate : 180

  Overall Metrics:
    Model Cost:            4119.88   pe-hrs/simulated_year
    Model Throughput:         1.05   simulated_years/day

    Init Time   :    1014.669 seconds
    Run Time    :   82397.609 seconds      225.747 seconds/day
    Final Time  :       0.324 seconds

    Actual Ocn Init Wait Time     :       0.009 seconds
    Estimated Ocn Init Run Time   :       0.001 seconds
    Estimated Run Time Correction :       0.000 seconds
      (This correction has been applied to the ocean and total run times)

Runs Time in total seconds, seconds/model-day, and model-years/wall-day
CPL Run Time represents time in CPL pes alone, not including time associated with data exchange with other components

    TOT Run Time:   82397.609 seconds      225.747 seconds/mday         1.05 myears/wday
    LND Run Time:     226.633 seconds        0.621 seconds/mday       381.23 myears/wday
    ROF Run Time:       9.022 seconds        0.025 seconds/mday      9576.59 myears/wday
    ICE Run Time:    4315.607 seconds       11.824 seconds/mday        20.02 myears/wday
    ATM Run Time:   73900.654 seconds      202.468 seconds/mday         1.17 myears/wday
    OCN Run Time:      15.576 seconds        0.043 seconds/mday      5547.03 myears/wday
    GLC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    WAV Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    CPL Run Time:    2087.562 seconds        5.719 seconds/mday        41.39 myears/wday
    CPL COMM Time:   4055.698 seconds       11.112 seconds/mday        21.30 myears/wday

I also attached the txt file which contains more information, if needed.
 

jedwards

CSEG and Liaisons
Staff member
Hi Andrew, to clarify: my request is to use the

    Model Cost:            4119.88   pe-hrs/simulated_year
    Model Throughput:         1.05   simulated_years/day

numbers as a common language for model performance.
Have a great weekend.
 

santos

Member
Yes, I think that would be helpful as well, just to be sure that things are comparable. But if it really is about 7 months in 14 hours, that's pretty much the same as 1 yr/day (7/12 of a year in 14/24 of a day), which suggests that increasing NTASKS from 84 to 180 had no effect, which is weird. I guess it's possible if Janus has *really* high communication costs compared with Yellowstone.

For comparison, I have a recent-ish case on Yellowstone, where B1850WCN gives this (360 NTASKS, 2 NTHRDS):

    Model Cost:            1416.18   pe-hrs/simulated_year
    Model Throughput:         7.05   simulated_years/day

Not only is the atmosphere obviously still scaling well out to that point, but the model cost is 1/3 as much even on twice as many physical cores.

Edit: I guess comparing pe-hr costs could be kind of a fool's errand, if Janus's hardware is too different from Yellowstone's for direct comparison. But still, the model should scale much higher than 84 or even 180 tasks.
 

santos

Member
Well, from your NTASKS=84 run, can you confirm for us what the reported "Model Throughput" is, and whether it's close to 1 yr/day? You said it took "over" 14 hours to get 7 months, but I'm not clear on how much of a gap there is. If we actually can see some scaling, you can try to use yet more PEs. If not, or if that's not enough, then things are more complicated, since we don't have much in the way of resources to do performance tuning, and because it's not obvious whether there even is a performance issue on our end (as opposed to a general system limitation or inefficiency, or a library issue).
 

santos

Member
Ah, I meant the throughput for an NTASKS=84 run (or any other number of tasks besides 180). The point is to compare it to your throughput on 180 tasks and see if there's actually a decent speedup from more cores.
 
I had a follow-up question regarding the performance. I am wondering if there is something I am doing wrong that is making my model run slowly. In my newly run simulation, I added both the ocean dataset (200 years) and a varying solar cycle (also 200 years). When I did not have a solar cycle (i.e., no solar file that I created), the model ran 1 year in 24 hours, as mentioned earlier. When I added the solar data file and parms file with 200 years of data, the simulation only got from January 1 to August 10 in 24 hours; it did not finish a year because my walltime was exceeded. So it appears that the model runs even slower when I use the solar file I created. Is there some reason why this would be the case, or is there something I am doing wrong?
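[Editor's note: for context, a time-varying solar irradiance file is normally pointed to through the CAM namelist. A hedged sketch of the kind of entries involved, in releases that support user_nl_cam; the file path and exact variable names are illustrative and version-dependent.]

    # Append illustrative solar-forcing entries to user_nl_cam in the case directory
    # (check the CAM/WACCM namelist documentation for your version)
    cat >> user_nl_cam << 'EOF'
     solar_data_file = '/path/to/my_solar_spectral_data_200yr.nc'
     solar_data_type = 'SERIAL'
    EOF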
 

santos

Member
Hmm. I have no idea what would cause such a big difference, unless there was some other difference between the two cases that you introduced by mistake (or a system issue). I suppose you could compare the namelist files for the two cases as a sanity check.
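[Editor's note: a quick way to do that comparison, assuming both cases keep their resolved namelists under CaseDocs; the paths here are placeholders.]

    # Compare the resolved atmosphere namelists from the two cases
    diff /path/to/case_no_solar/CaseDocs/atm_in /path/to/case_with_solar/CaseDocs/atm_in
    # Or compare everything that was resolved
    diff -r /path/to/case_no_solar/CaseDocs /path/to/case_with_solar/CaseDocs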
 

jedwards

CSEG and Liaisons
Staff member
The variability of performance on the Janus system can be quite large. I wouldn't draw this kind of conclusion based on a single run of each case.
 