Slow CESM simulation due to I/O on Yellowstone

Hi CESM Board

I previously ran CESM 1.0.4 (CAM4-CLM4 with data ocean) on Yellowstone in early January at 0.25-degree grid spacing. That simulation took about 11 wall-clock hours for 4 simulated months on 512 processors. More recently I ran a few simulations with CESM 1.0.5 and the same model configuration (except that I output several more variables from CAM), but only for 1 simulated month. These took more than 72 wall-clock hours and cost MUCH more in core-hours. I am writing output every hour in all simulations.
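For reference, hourly output with extra fields is usually requested through CAM's history namelist variables; a minimal sketch is below (the field list, the use of the second history stream, and the exact way the namelist is edited all depend on the setup and CESM version, so treat these values as illustrative):

&cam_inparm
 nhtfrq(2) = -1                        ! write the auxiliary (h1) history file every hour
 mfilt(2)  = 24                        ! 24 time samples per h1 file (one day per file)
 fincl2    = 'T','Q','U','V','OMEGA'   ! illustrative extra fields on the h1 stream
/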

When I looked at the timing files for the latest simulations, I noticed an I/O bottleneck that was consuming a huge portion of the wall-clock time.

Is it expected for model runtime to increase by a factor of 30 just from including several additional variables in the output?

Is there an efficient or optimal processor layout for simulations with lots of I/O? For these simulations, I was using 512 processors spread across 64 nodes with 4 OpenMP threads per task.

Any thoughts, help, or advice would be greatly appreciated.

-Ahmed
 

jedwards

CSEG and Liaisons
Staff member
Hi Ahmed,

Threading performance on Yellowstone is still an outstanding issue; you should use at most 2 threads per task. You didn't say which resolution or dycore you are using, so I can't be sure this will work, but for 64 nodes you might try 960 tasks with ptile=15 and 2 OpenMP threads per task. Also try setting ATM_PIO_TYPENAME = "pnetcdf" in the env_run.xml file. If you wish to continue this discussion, please include the model resolution and compset, or a pointer to your Yellowstone case directory, in your response.
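A minimal sketch of how those settings could be applied from the case directory (assuming the CESM 1.0.x xmlchange syntax; the values mirror the suggestion above, and the other components' NTASKS/NTHRDS would need to be adjusted consistently):

./xmlchange -file env_run.xml -id ATM_PIO_TYPENAME -val pnetcdf
./xmlchange -file env_mach_pes.xml -id NTASKS_ATM -val 960
./xmlchange -file env_mach_pes.xml -id NTHRDS_ATM -val 2

The ptile setting itself lives in the #BSUB directives of the case run script, and any env_mach_pes.xml change requires reconfiguring the case (./configure -clean followed by ./configure -case).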

Thanks,
Jim
 
Hi Jim

Thanks for the response. I am still trying to understand what would be most efficient when writing output frequently (every hour). Would ATM_PIO_TYPENAME = "pnetcdf" improve write efficiency? What does it do exactly?

Here are some more details about the configuration:
-compset F_2000
-res 0.23x0.31_0.23x0.31
-case directory = /glade/u/home/abtawfik/cesm1_0_5/scripts/CAM4_0.25

Also, has anyone tried running CESM with output written every hour? If so, any idea how poorly it scales with increasing processor counts?

Thank you again for the help. It is greatly appreciated.

-Ahmed
 

jedwards

CSEG and Liaisons
Staff member
Hi Ahmed,

I have made a few changes in your case that show a 10x speedup over what you have on the same number of nodes:

First, grab my env_mach_pes.xml and env_mach_specific from the directory /glade/scratch/jedwards/cesmtests/CAM4_0.25.
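For example (a sketch, assuming the files are copied straight into your case directory):

cp /glade/scratch/jedwards/cesmtests/CAM4_0.25/env_mach_pes.xml .
cp /glade/scratch/jedwards/cesmtests/CAM4_0.25/env_mach_specific .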

These changes require that you reconfigure your case:

./configure -clean
./configure -case

Then make the following change in env_run.xml (at line 122; the entry contents of this diff were stripped from the original post and are not recoverable here):

and make the following change in CAM4_0.25.run:

< #BSUB -n 480
< #BSUB -R "span[ptile=15]"
---
> #BSUB -n 1024
> #BSUB -R "span[ptile=16]"


One of the primary motivations for moving to the CAM-SE dycore was to improve scaling in high-resolution cases like this one, so consider moving to cesm1_1_1 and the CAM-SE dycore.
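A hedged sketch of setting up such a case (the case name is hypothetical, and the ne120 grid alias is only illustrative of the roughly 0.25-degree CAM-SE grid; check the supported grid and compset lists for cesm1_1_1):

./create_newcase -case CAM4_SE_0.25 -res ne120_ne120 -compset F_2000 -mach yellowstone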

Thanks,
Jim
 