[NOTE: I posted this earlier on the general CESM forum, but then realized many folks may not watch those messages; apologies for the cross-post!]
Greetings!
I am part of the Climate and Energy group inside Google Research, and have been using CESM for the past few years. As part of my work, I have built some infrastructure for running large CESM ensembles (100s to low 1000s of members) on Google Cloud spot instances. This infrastructure handles the messy aspects of running on commodity preemptible hardware, such as:
- Starting and stopping cloud VMs based on the simulation demand
- Handling failure modes when spot instances are stopped in the middle of a computation (mostly detecting various types of file corruption and selecting a non-corrupt set of restart files before resuming)
- Managing instance local disks (which are faster than shared file systems for output) and collecting outputs to zarr format in cheap long-term storage
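To make the restart-selection idea in the second bullet concrete, here is a minimal Python sketch. It is an illustration only, not the infrastructure's actual code: the file-name pattern, the list of required components, and the function names are assumptions, and the header check only catches empty or overwritten files (a truncated file with a valid header would still pass, so a real check would need to go further).

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed CESM-style restart name: <case>.<component>.r.<YYYY-MM-DD-SSSSS>.nc
RESTART_RE = re.compile(
    r"^(?P<case>.+)\.(?P<comp>[a-z0-9]+)\.r\.(?P<stamp>\d{4}-\d{2}-\d{2}-\d{5})\.nc$"
)

# Leading bytes of netCDF classic (CDF-1/2/5) and netCDF-4 (HDF5) files.
NETCDF_MAGIC = (b"CDF\x01", b"CDF\x02", b"CDF\x05", b"\x89HDF")


def looks_intact(path):
    """Cheap check: the file is readable and starts with a netCDF/HDF5 signature."""
    try:
        with Path(path).open("rb") as f:
            head = f.read(4)
    except OSError:
        return False
    return head.startswith(NETCDF_MAGIC)


def latest_complete_restart(run_dir, required=("cam", "clm2", "cpl")):
    """Return the newest restart timestamp for which every required
    component (hypothetical list) has a file that passes looks_intact,
    or None if no such set exists."""
    by_stamp = defaultdict(dict)
    for path in Path(run_dir).iterdir():
        m = RESTART_RE.match(path.name)
        if m:
            by_stamp[m["stamp"]][m["comp"]] = path
    # Fixed-width timestamps sort correctly as strings; try newest first.
    for stamp in sorted(by_stamp, reverse=True):
        files = by_stamp[stamp]
        if all(c in files and looks_intact(files[c]) for c in required):
            return stamp
    return None
```

The point of walking backwards from the newest timestamp is that a preemption usually damages only the set being written at that moment, so falling back one set loses little simulated time.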
From a cost and performance perspective, I've been using c2-standard-112 instances, which allow me to run 1-degree fixed SST CESM ensembles at a rate of about 1 simulated year per 12 hours and a compute cost of around $13 per simulated year. (So, for example, running a 100-member 2-year fixed SST ensemble takes about 1 day and costs around $2600.) I also expect both the run times and costs to decrease as new hardware becomes available. (For example, the newly launched c3 instances look like they might cut the cost down to around $8 per simulated year, although I haven't tested these out yet.)
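For anyone who wants to scale these numbers to their own ensemble, the arithmetic can be sketched in a few lines of Python. The function name and its defaults are illustrative; it assumes all members run concurrently (one member per instance), and it ignores preemption overhead and storage costs.

```python
def ensemble_estimate(members, years_per_member,
                      hours_per_sim_year=12.0, cost_per_sim_year=13.0):
    """Rough wall-clock hours and compute cost (USD) for an ensemble.

    Defaults are the c2-standard-112 figures quoted above. Assumes every
    member runs at the same time, so wall-clock time depends only on the
    length of a single member.
    """
    sim_years = members * years_per_member
    wall_hours = years_per_member * hours_per_sim_year
    cost = sim_years * cost_per_sim_year
    return wall_hours, cost


wall_hours, cost = ensemble_estimate(100, 2)
print(f"{wall_hours / 24:.1f} days, ${cost:,.0f}")  # 1.0 days, $2,600
```

Passing `cost_per_sim_year=8.0` gives the corresponding figure for the untested c3 estimate.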
Open sourcing this infrastructure will require a bit of work, but I would be happy to do this if there is sufficient value to the CESM user community. I've put together a quick survey to gauge the level of interest:
Survey: CESM ensembles on Google cloud (forms.gle)
If this seems like something you would find useful, I would love it if you can let me know. Please also feel free to share the survey with colleagues who you think might be interested.
Best,
-Rob von Behren