Slow CESM simulation due to I/O on Yellowstone

Hi CESM Board

I previously ran CESM 1.0.4 (CAM4-CLM4 with data ocean) on Yellowstone in early January at 0.25-degree grid spacing. That simulation took about 11 wall-clock hours for 4 simulated months on 512 processors. More recently I ran a few simulations with CESM 1.0.5 and the same model configuration (except that I output several more variables from CAM), but only for 1 simulated month. These took more than 72 wall-clock hours and cost MUCH more in core-hours. I am writing output every hour in all simulations.
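For reference, hourly output with extra fields is usually requested through CAM's history namelist variables; a minimal sketch is below (the field list, the use of the second history stream, and the exact way the namelist is edited all depend on the setup and CESM version, so treat these values as illustrative):

&cam_inparm
 nhtfrq(2) = -1                        ! write the auxiliary (h1) history file every hour
 mfilt(2)  = 24                        ! 24 time samples per h1 file (one day per file)
 fincl2    = 'T','Q','U','V','OMEGA'   ! illustrative extra fields on the h1 stream
/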

When I looked at the timing files for the latest simulations, I noticed an I/O bottleneck that was consuming a huge portion of the wall-clock time.

Is it expected for model runtime to increase by a factor of 30 just from including several additional variables in the output?

Is there an efficient or optimal processor layout for simulations with lots of I/O? For these simulations, I was using 512 processors spread across 64 nodes with 4 OpenMP threads per task.

Any thoughts, help, or advice would be greatly appreciated.

-Ahmed
 

jedwards

CSEG and Liaisons
Staff member
Hi Ahmed,

Threading performance on Yellowstone is still an outstanding issue; you should use at most 2 threads per task. You didn't say which resolution or dycore you are using, so I can't be sure this will work, but for 64 nodes you might try 960 tasks with ptile=15 and 2 OpenMP threads per task. Also try setting ATM_PIO_TYPENAME = "pnetcdf" in the env_run.xml file. If you wish to continue this discussion, please include the model resolution and compset, or a pointer to your Yellowstone case directory, in your response.
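A minimal sketch of how those settings could be applied from the case directory (assuming the CESM 1.0.x xmlchange syntax; the values mirror the suggestion above, and the other components' NTASKS/NTHRDS would need to be adjusted consistently):

./xmlchange -file env_run.xml -id ATM_PIO_TYPENAME -val pnetcdf
./xmlchange -file env_mach_pes.xml -id NTASKS_ATM -val 960
./xmlchange -file env_mach_pes.xml -id NTHRDS_ATM -val 2

The ptile setting itself lives in the #BSUB directives of the case run script, and any env_mach_pes.xml change requires reconfiguring the case (./configure -clean followed by ./configure -case).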

Thanks,
Jim
 
Hi Jim

Thanks for the response. I am still trying to understand what would be most efficient when writing output frequently (every hour). Would ATM_PIO_TYPENAME = "pnetcdf" improve write efficiency? What does it do exactly?

Here are some more details about the configuration:
-compset F_2000
-res 0.23x0.31_0.23x0.31
-case directory = /glade/u/home/abtawfik/cesm1_0_5/scripts/CAM4_0.25

Also, has anyone tried running CESM with output written every hour? If so, any idea how poorly it scales with increasing processor counts?

Thank you again for the help. It is greatly appreciated.

-Ahmed
 

jedwards

CSEG and Liaisons
Staff member
Hi Ahmed,

I have made a few changes in your case that show a 10x speedup over what you have on the same number of nodes:

First, grab my env_mach_pes.xml and env_mach_specific from the directory /glade/scratch/jedwards/cesmtests/CAM4_0.25.
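For example (a sketch, assuming the files are copied straight into your case directory):

cp /glade/scratch/jedwards/cesmtests/CAM4_0.25/env_mach_pes.xml .
cp /glade/scratch/jedwards/cesmtests/CAM4_0.25/env_mach_specific .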

These changes require that you reconfigure your case:

./configure -clean
./configure -case

Then make the following change in env_run.xml (at line 122; the entry contents of this diff were stripped from the original post and are not recoverable here):

and make the following change in CAM4_0.25.run:

< #BSUB -n 480
< #BSUB -R "span[ptile=15]"
---
> #BSUB -n 1024
> #BSUB -R "span[ptile=16]"


One of the primary motivations for moving to the CAM-SE dycore was to improve scaling in high-resolution cases like this one, so consider moving to cesm1_1_1 and the CAM-SE dycore.
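A hedged sketch of setting up such a case (the case name is hypothetical, and the ne120 grid alias is only illustrative of the roughly 0.25-degree CAM-SE grid; check the supported grid and compset lists for cesm1_1_1):

./create_newcase -case CAM4_SE_0.25 -res ne120_ne120 -compset F_2000 -mach yellowstone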

Thanks,
Jim
 