
Nodes/cores/storage for CESM simulations

Aber

New Member
Hi, I apologize for the complete newbie question; this must have been answered before, but I couldn't find easy info on it here or on the web. I'm not even sure my question is sufficiently precise to make any sense... but here it is:
I've never run CESM before. I am trying to determine what resources I would need on my local university HPC cluster to do so (possibly to purchase them to have priority access). To run a full (ocean/atmo) CESM simulation, let's say a historical or transient climate change experiment, in an acceptable amount of time (a few weeks I suppose?), what computing resources are needed, in terms of number of nodes and cores? I am a bit (a lot) out of my depth here, so don't hesitate to ELI5 (explain like I am 5...). I'd also be very grateful if you could describe how much space is needed to store model outputs in that case (for a typical number of variables and spatial/temporal resolution). What about a land-atmo (prescribed SST) case only?

Perhaps the answer for someone like me would be to simply ask to run simulations on NCAR machines, but I heard you could only do this with NSF grants.
Thank you!
 

jedwards

CSEG and Liaisons
Staff member
The answer depends on the length and resolution of the simulation you wish to complete, along with the amount of output data you choose to save. It also depends heavily on the CPUs and network of your HPC system. We usually aim for a minimum throughput of 5 simulated years per wall-clock day.

CESM is highly configurable, and you can scale it to match the resources you have available. If your university is a UCAR member, I believe you can get access to NCAR systems.
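To make that throughput target concrete, here is a rough back-of-the-envelope sketch (my own illustration in Python, not a CESM tool) that converts a throughput in simulated years per wall-clock day into the calendar time a run would take:

# Back-of-the-envelope estimate of how long a run takes at a given throughput.
# Throughput is in simulated years per wall-clock day (ypd), as reported under
# "Model Throughput" in the CESM timing files.

def wallclock_days(simulated_years, throughput_ypd):
    """Calendar days of continuous running needed to finish the simulation."""
    return simulated_years / throughput_ypd

# Example: a 100-year experiment at the 5 ypd minimum would take about 20 days
# of wall-clock time (ignoring queue waits and restarts).
print(wallclock_days(100, 5))   # 20.0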
 

Aber

New Member
Thanks for the quick reply! I guess I am talking about, let's say, 100 years, at a 1 deg resolution (current typical resolution of CESM2 I believe?).
Just copying some info here, but let's say we are talking about nodes such as: 2x Intel(R) Xeon(R) Gold 6226R @ 2.90GHz, 32 cores/node, 192GB memory/node, 100 Gb/s InfiniBand interconnect.

As for the output size, I don't know, but I was assuming there is maybe a standard output level? (e.g., monthly output for a standard set of variables).

(I don't believe our institution is part of UCAR.)
Thanks,
 

jedwards

CSEG and Liaisons
Staff member
If you are running CESM2.1 (the CMIP6 version of the model), you would be using the FV dycore, which has a practical scaling limit of around 1800 cores. I think your system is similar to our previous-generation system, Cheyenne. [Attached: screenshot of a timing file]
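For reference, here is that core count expressed in nodes of the system you described (a rough sketch of mine in Python; the 32 cores/node figure comes from your post above):

# The FV dycore's practical scaling limit of ~1800 cores, expressed in
# 32-core nodes (node spec taken from the earlier post in this thread).
practical_core_limit = 1800
cores_per_node = 32
print(practical_core_limit / cores_per_node)   # 56.25 nodes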
 

Aber

New Member
Thanks - I am not sure I understand, though: let's say I want to buy a certain number of nodes to add to our university cluster, to have priority access for running CESM; with the specs given in my previous reply, for instance, how many would I need to run a typical 100-yr, 1-deg-resolution CESM2 simulation in a reasonable amount of time?
Is the 1800-core number you mentioned a lower limit?? As in, I would need 50+ nodes (given ~32 cores per node)?

Thanks for your help!
 

jedwards

CSEG and Liaisons
Staff member
Sorry, I gave you a land-only timing file, not what I meant to provide.
Here is one for a fully coupled case using 3456 cores, or about 108 nodes of your system. It performed at 29.26 ypd (simulated years per wall-clock day). So in theory, if you wanted to run at the minimum acceptable 5 ypd, you would need about 1/6 of this, or roughly 18 nodes (a rough version of this scaling arithmetic is sketched after the timing profile below). Again, this is for the CMIP6 model; if you intend to run our cutting-edge CMIP7 model, it is considerably more expensive.

---------------- TIMING PROFILE ---------------------
Case : b.e20.B1850.f09_g17.pi_control.all.297
LID : 9749916.chadmin1.180527-144253
Machine : cheyenne
Caseroot : /glade/p/cesmdata/cseg/runs/cesm2_0/b.e20.B1850.f09_g17.pi_control.all.297
Timeroot : /glade/p/cesmdata/cseg/runs/cesm2_0/b.e20.B1850.f09_g17.pi_control.all.297/Tools
User : hannay
Curr Date : Sun May 27 14:48:40 2018
grid : a%0.9x1.25_l%0.9x1.25_oi%gx1v7_r%r05_g%gland4_w%ww3a_m%gx1v7
compset : 1850_CAM60_CLM50%BGC-CROP_CICE_POP2%ECO_MOSART_CISM2%NOEVOLVE_WW3_BGC%BDRD
run_type : hybrid, continue_run = TRUE (inittype = FALSE)
stop_option : nmonths, stop_n = 1
run_length : 31 days (31.0 for ocean)

component     comp_pes   root_pe   tasks x threads   instances (stride)
---------     --------   -------   ---------------   -------------------
cpl = cpl         3456         0       1152 x 3              1 (1)
atm = cam         3456         0       1152 x 3              1 (1)
lnd = clm         2592         0        864 x 3              1 (1)
ice = cice         864       864        288 x 3              1 (1)
ocn = pop          768      1152        256 x 3              1 (1)
rof = mosart      2592         0        864 x 3              1 (1)
glc = cism        3456         0       1152 x 3              1 (1)
wav = ww            96      1408         32 x 3              1 (1)
esp = sesp           1         0          1 x 1              1 (1)

total pes active : 12960
mpi tasks per node : 36
pe count for cost estimate : 4320

Overall Metrics:
Model Cost: 3543.22 pe-hrs/simulated_year
Model Throughput: 29.26 simulated_years/day

Init Time : 66.757 seconds
Run Time : 250.776 seconds 8.090 seconds/day
Final Time : 0.014 seconds

Actual Ocn Init Wait Time : 73.226 seconds
Estimated Ocn Init Run Time : 0.000 seconds
Estimated Run Time Correction : 0.000 seconds
(This correction has been applied to the ocean and total run times)

Runs Time in total seconds, seconds/model-day, and model-years/wall-day
CPL Run Time represents time in CPL pes alone, not including time associated with data exchange with other components

TOT Run Time: 250.776 seconds 8.090 seconds/mday 29.26 myears/wday
CPL Run Time: 24.669 seconds 0.796 seconds/mday 297.46 myears/wday
ATM Run Time: 182.964 seconds 5.902 seconds/mday 40.11 myears/wday
LND Run Time: 26.772 seconds 0.864 seconds/mday 274.10 myears/wday
ICE Run Time: 35.135 seconds 1.133 seconds/mday 208.85 myears/wday
OCN Run Time: 175.634 seconds 5.666 seconds/mday 41.78 myears/wday
ROF Run Time: 2.569 seconds 0.083 seconds/mday 2856.40 myears/wday
GLC Run Time: 0.959 seconds 0.031 seconds/mday 7651.81 myears/wday
WAV Run Time: 45.839 seconds 1.479 seconds/mday 160.08 myears/wday
ESP Run Time: 0.000 seconds 0.000 seconds/mday 0.00 myears/wday
CPL COMM Time: 198.033 seconds 6.388 seconds/mday 37.05 myears/wday
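As a rough illustration of the arithmetic behind the ~18-node estimate (a Python sketch of mine, assuming throughput scales roughly linearly with core count, which only holds approximately):

# Estimate the nodes needed to hit a target throughput by scaling linearly from
# a reference run. Linear scaling is an approximation; real throughput degrades
# as core counts change, so treat the result as a ballpark figure.

def nodes_for_target(ref_cores, ref_ypd, target_ypd, cores_per_node):
    """Nodes needed to reach target_ypd, scaled from a reference configuration."""
    cores_needed = ref_cores * (target_ypd / ref_ypd)
    return cores_needed / cores_per_node

# Reference: the fully coupled run above (3456 cores at 29.26 simulated years/day).
# Target: the minimum acceptable 5 ypd on the 32-core nodes described earlier.
print(nodes_for_target(3456, 29.26, 5, 32))   # ~18.5 nodes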
 

Aber

New Member
Thanks a lot, that's very helpful. So, around 20 nodes for 5 simulated years per day of CESM2 in the fully coupled configuration. How much does that decrease in a land-atmosphere-only configuration (i.e., with prescribed SSTs)?

Thanks,
 