
Gauging interest: CESM ensembles on cloud infrastructure?

jrvb

Rob von Behren
[NOTE: I posted this earlier on the general CESM forum, but then realized many folks may not watch those messages; apologies for the cross-post!]

Greetings!

I am part of the Climate and Energy group at Google Research, and have been using CESM for the past few years. As part of my work, I have built some infrastructure for running large CESM ensembles (hundreds to low thousands of members) on Google Cloud spot instances. This infrastructure handles the messy aspects of running on commodity preemptible hardware, such as:
  • Starting and stopping cloud VMs based on simulation demand
  • Handling failure modes when spot instances are stopped in the middle of a computation (mostly detecting various types of file corruption and selecting a non-corrupt set of restart files before resuming)
  • Managing instance-local disks (which are faster than shared file systems for output) and collecting outputs in Zarr format in cheap long-term storage
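
To make the restart-selection step in the second bullet concrete, here is a minimal sketch of the idea, not the actual implementation. The `RestartSet` record, the `select_restart_set` function, and the `is_intact` callback are all hypothetical names introduced for illustration; the real corruption checks are not described in this post, so the check is left as a caller-supplied function.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Tuple


@dataclass(frozen=True)
class RestartSet:
    """Hypothetical record: the restart files written at one checkpoint."""
    model_date: str          # ISO-style date, e.g. "0001-07-01", so strings sort chronologically
    files: Tuple[str, ...]   # paths of every restart file belonging to this checkpoint


def select_restart_set(
    candidates: Iterable[RestartSet],
    is_intact: Callable[[str], bool],
) -> Optional[RestartSet]:
    """Return the newest checkpoint whose files all pass is_intact, else None.

    is_intact is an assumed, caller-supplied check (for example a size or
    checksum test); it stands in for whatever corruption detection is used.
    """
    for restart_set in sorted(candidates, key=lambda r: r.model_date, reverse=True):
        # An empty set is treated as unusable; otherwise every file must be intact.
        if restart_set.files and all(is_intact(path) for path in restart_set.files):
            return restart_set
    return None
```

A caller would build one `RestartSet` per checkpoint found on disk after a preemption and resume from whatever the function returns, falling back to a fresh start when it returns `None`.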

From a cost and performance perspective, I've been using c2-standard-112 instances, which let me run 1-degree fixed-SST CESM ensembles at a rate of about 1 simulated year per 12 hours and a compute cost of around $13 per simulated year. (For example, a 100-member, 2-year fixed-SST ensemble takes about 1 day and costs around $2600.) I also expect both run times and costs to decrease as new hardware becomes available. (The newly launched c3 instances, for instance, look like they might cut the cost to around $8 per simulated year, although I haven't tested them yet.)
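
For anyone who wants to plug in their own ensemble size, the arithmetic above can be sketched as follows. The function name `ensemble_estimate` is hypothetical, the default rates are the approximate figures quoted in this post, and the wall-clock estimate assumes every member runs concurrently on its own instance with no preemption delays.

```python
def ensemble_estimate(members, sim_years, cost_per_sim_year=13.0, hours_per_sim_year=12.0):
    """Rough cost (USD) and wall-clock hours for a fixed-SST ensemble.

    Defaults are the approximate c2-standard-112 figures from the post.
    Assumes all members run in parallel, so wall-clock time depends only
    on the simulated length, not on the number of members.
    """
    total_cost = members * sim_years * cost_per_sim_year
    wall_hours = sim_years * hours_per_sim_year
    return total_cost, wall_hours


cost, hours = ensemble_estimate(members=100, sim_years=2)
print(f"${cost:,.0f} over about {hours / 24:.1f} days")  # $2,600 over about 1.0 days
```

Passing `cost_per_sim_year=8.0` gives the corresponding figure for the untested c3 estimate: about $1600 for the same 100-member, 2-year ensemble.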

Open sourcing this infrastructure will require a bit of work, but I would be happy to do this if there is sufficient value to the CESM user community. I've put together a quick survey to gauge the level of interest:


If this seems like something you would find useful, I would love to hear from you. Please also feel free to share the survey with colleagues who you think might be interested.

Best,

-Rob von Behren