
Running speed difference between spinup and hist

Fuhow

Fu Hao
Member
Hi everyone,


I’m currently running a CTSM case on a regional grid of 687×315, using a regional surfdata file and a landuse.timeseries file. During the accelerated spin-up for the I2000 case, the model took about 1 minute per model day with 416 processes.


However, after finishing the spin-up and switching to a hybrid run for the IHIST case (initialized from the spin-up restart files), with a 16 GB landuse.timeseries dataset spanning 1980–2023, the model now takes about 5 minutes per model day.


Is this large difference in performance normal? Any advice or guidance would be greatly appreciated.


Thanks!
 

slevis

Moderator
Staff member
I do not know exactly how much slower the model should be, but a slowdown is expected for two reasons that I can think of:
- In accel. spin-up the model outputs significantly reduced amounts of history
- In accel. spin-up the model does not engage in transient landuse calculations
 

Fuhow

Fu Hao
Member
I do not know exactly how much slower the model should be, but a slowdown is expected for two reasons that I can think of:
- In accel. spin-up the model outputs significantly reduced amounts of history
- In accel. spin-up the model does not engage in transient landuse calculations
Dear @slevis

Thanks for your response.

For condition 1, I set the same daily output variables in both cases.
For example, during accel, I set:
fire_method = 'nofire'
hist_empty_htapes = .true.
hist_fields_list_file = .false.

hist_fincl1 = 'TOTECOSYSC','TOTSOMC','TOTVEGC','LIVESTEMC','LEAFC','TLAI','GPP','CPOOL','NPP','NEE','GPP','NEP','SOILC_vr','SOILC_HR','TSOI','H2OSOI','ALT','ER','AR','RR','HR'
hist_fincl2 = 'TOTECOSYSC','TOTSOMC','TOTVEGC','LIVESTEMC','LEAFC','TLAI','GPP','CPOOL','NPP','NEE','GPP','NEP','SOILC_vr','SOILC_HR','TSOI','H2OSOI','ALT','ER','AR','RR','HR'

hist_mfilt = 1, 1
hist_nhtfrq = -24, -8760

During the IHIST run, I set:
fire_method = 'nofire'
hist_empty_htapes = .true.
hist_fields_list_file = .false.

hist_fincl1 = 'TOTECOSYSC','TOTSOMC','TOTVEGC','LIVESTEMC','LEAFC','TLAI','GPP','CPOOL','NPP','NEE','GPP','NEP','SOILC_vr','SOILC_HR','TSOI','H2OSOI','ALT','ER','AR','RR','HR'
hist_fincl2 = 'TOTECOSYSC','TOTSOMC','TOTVEGC','LIVESTEMC','LEAFC','TLAI','GPP','CPOOL','NPP','NEE','GPP','NEP','SOILC_vr','SOILC_HR','TSOI','H2OSOI','ALT','ER','AR','RR','HR'

hist_mfilt = 1, 1
hist_nhtfrq = -24, -8760

So I think this might not be the reason for the slow run speed.

For condition 2, I don't know whether the yearly landuse.timeseries file is read at every time step. If so, the 16 GB file must be the reason for the slow run speed; but if the file is only read at the first time step of each year, it might not be. Meanwhile, I checked the output times across a year boundary, for example:
...
01:01 TP_5km_cruj_hist.clm2.h0a.1983-12-30-00000.nc
01:06 TP_5km_cruj_hist.clm2.h0a.1983-12-31-00000.nc
01:10 TP_5km_cruj_hist.clm2.h0a.1984-01-01-00000.nc
01:26 TP_5km_cruj_hist.clm2.h0a.1984-01-02-00000.nc
01:31 TP_5km_cruj_hist.clm2.h0a.1984-01-03-00000.nc
...
It can be seen that the interval between the 1984-01-01 output (end of 1983) and the 1984-01-02 output (start of 1984) is about 11 minutes longer than the other within-year intervals, so I guess the landuse.timeseries file is read at the start of each year, and it is probably not the reason for the slow daily run speed.
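The gaps in the listing above can be checked mechanically. This is a small sketch, not part of my case scripts: it parses "HH:MM filename" lines (the filenames are from the listing above) and prints the minute gap before each file.

```python
from datetime import datetime

def output_gaps(listing):
    """Minute gaps between consecutive history-file timestamps.

    `listing` holds "HH:MM filename" strings, as in an ls-style view
    of the run directory.
    """
    entries = []
    for line in listing:
        stamp, name = line.split(maxsplit=1)
        entries.append((datetime.strptime(stamp, "%H:%M"), name))
    gaps = []
    for (t0, _), (t1, name) in zip(entries, entries[1:]):
        gaps.append((name, int((t1 - t0).total_seconds() // 60)))
    return gaps

listing = [
    "01:01 TP_5km_cruj_hist.clm2.h0a.1983-12-30-00000.nc",
    "01:06 TP_5km_cruj_hist.clm2.h0a.1983-12-31-00000.nc",
    "01:10 TP_5km_cruj_hist.clm2.h0a.1984-01-01-00000.nc",
    "01:26 TP_5km_cruj_hist.clm2.h0a.1984-01-02-00000.nc",
    "01:31 TP_5km_cruj_hist.clm2.h0a.1984-01-03-00000.nc",
]
for name, minutes in output_gaps(listing):
    print(minutes, name)  # gaps: 5, 4, 16, 5 minutes
```

The first file of 1984 takes 16 minutes instead of the usual ~5, i.e. about 11 minutes extra, consistent with a once-per-year read.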

Besides, I compared the lnd.log files of the IHIST and accel runs: the IHIST log shows 48 "Computed Orbital Parameters" blocks between two daily outputs:

(shr_orb_params) ------ Computed Orbital Parameters ------
(shr_orb_params) Eccentricity = 1.671015E-02
(shr_orb_params) Obliquity (deg) = 2.344185E+01
(shr_orb_params) Obliquity (rad) = 4.091374E-01
(shr_orb_params) Long of perh(deg) = 1.026214E+02
(shr_orb_params) Long of perh(rad) = 4.932674E+00
(shr_orb_params) Long at v.e.(rad) = -3.252218E-02

I think this might be the reason for the slow daily output speed.
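To double-check that count, one can tally the header lines in lnd.log. A minimal sketch; the sample list below is a synthetic stand-in for a real log, where 48 blocks per model day would match one recomputation per half-hour time step:

```python
def count_orbital_blocks(log_lines):
    """Count '(shr_orb_params) ------ Computed Orbital Parameters ------'
    headers; each one marks an orbital-parameter recomputation."""
    return sum("Computed Orbital Parameters" in line for line in log_lines)

# Synthetic stand-in for one model day of lnd.log output at a
# half-hourly time step: 48 blocks, one per step.
sample = ["(shr_orb_params) ------ Computed Orbital Parameters ------"] * 48
print(count_orbital_blocks(sample))  # 48
```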

Therefore, I would like to ask you to help check whether my speculation is correct: is the slow runtime at each time step caused by the need to compute orbital information at every step? I have also asked others about the runtime speed of IHIST; their experience is that slowdowns typically occur when starting the IHIST simulation from restart files, and the speed usually returns to normal after running for about 20 years, but I am not sure what causes this. If so, is there any way to solve this problem?
 

oleson

Keith Oleson
CSEG and Liaisons
Staff member
The landuse timeseries should be read in once per year; this is normal.
Orbital calculations are done every time step in a historical run; this is normal.
There are other input streams that are read in at various time scales in a historical run; this is normal.
I would expect a typical historical run to be 2-3 times slower than a spinup depending on configuration, while yours is 5 times slower.
You were already quite slow in your spinup, 3.9 model years/wallclock day. You may just be reaching the limits of your system/processors. Maybe there are memory problems also.

I don't have any specific suggestions other than to add more processors if possible and comparing timing files between your two cases to see where they are differing significantly.
 

Fuhow

Fu Hao
Member
Dear @oleson

Many thanks for your advice.

I noticed that I had previously set TSOI:A, TSOI:X, and TSOI:M in one output file, but only TSOI:A was actually written to the output. I have now set TSOI:X and TSOI:M to be output into separate second and third output files, respectively. This has reduced the computational time per model day to 2–3 minutes, which is roughly 2–3 times my original 1–2 minutes, as you described. However, I am not sure whether there was some kind of conflict or overlap among these variables.
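For reference, the new layout is a sketch like this (simplified, not my exact namelist; in CLM history output the suffix after the colon selects the time-averaging flag, e.g. A = average, X = maximum, M = minimum over the history interval):

```fortran
! Sketch: one averaging flag per field on each tape, so the max and min
! versions of TSOI go to their own history files.
hist_fincl1 = 'TSOI'     ! tape 1: default (averaged) TSOI
hist_fincl2 = 'TSOI:X'   ! tape 2: maximum over the interval
hist_fincl3 = 'TSOI:M'   ! tape 3: minimum over the interval
hist_mfilt  = 1, 1, 1
hist_nhtfrq = -24, -24, -24
```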

I also have another question. During my earlier spin-up run, I used a modified surfdata file that included updated land-use and land-cover data such as PCT_GLACIER, PCT_NATVEG, PCT_CROP, PCT_LAKE, and PCT_URBAN, vegetation data like MONTHLY_LAI and MONTHLY_SAI, as well as some soil physical properties. Now I am using the restart file from this spin-up for a hybrid simulation, performing a historical run with a landuse.timeseries file generated from the original surfdata. This results in an error:

[attachment: 1771608045843.png]


When I set use_init_interp = .true. in the user_nl_clm file, the simulation runs, but the output shows many NA grid cells (white grid points). Is this caused by the interpolation? Which variables are most likely responsible for this issue? Can I modify the corresponding variables in the landuse.timeseries file to match my spin-up surfdata, while still retaining the spin-up restart file, in order to eliminate the grid-cell mismatches between the landuse.timeseries file and the restart file?

[attachment: 1771608799711.png]
 

slevis

Moderator
Staff member
It's an interesting question where the NAs came from.
- My first idea was the restart file that initiated the historical simulation. But the restart file came from the spin-up, and you said that you didn't see NAs in the spin-up.
- I would not expect NAs from the "nearest neighbor" interpolation used by use_init_interp = .true. We use it very often and do not see this behavior.
- If there is a problem with the "original" fsurdat and landuse, you could try confirming that with a "cold" start simulation where you set finidat = ' '. If you see the NAs in this case, then I would suspect bad data in your original fsurdat and/or landuse files. If so, you will need to address that before trying the simulation again.
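Before the cold-start test, one could also pre-screen the original fsurdat for bad data: read the PCT_* fields (e.g. with netCDF4 or xarray) and flag cells containing NaN/fill values or land-unit percentages that do not sum to 100. A minimal sketch with synthetic arrays standing in for real fields (a real fsurdat has more land units than the two shown, so all of them would go into `fields`):

```python
import numpy as np

def screen_pct_fields(fields, fill_value=1.0e36, tol=1.0e-4):
    """Return a boolean (lat, lon) mask of suspect grid cells.

    A cell is flagged if any land-unit field holds NaN or a fill value,
    or if the percentages do not sum to ~100.
    `fields` maps field name -> 2D array (lat, lon).
    """
    stack = np.stack(list(fields.values()))           # (nfields, lat, lon)
    bad_vals = np.isnan(stack) | (np.abs(stack) >= fill_value)
    total = np.where(bad_vals, 0.0, stack).sum(axis=0)
    return bad_vals.any(axis=0) | (np.abs(total - 100.0) > tol)

# Synthetic 2x2 example: cell (0,1) holds a fill value,
# cell (1,1) sums to only 90%.
fields = {
    "PCT_NATVEG": np.array([[60.0, 1.0e36], [50.0, 40.0]]),
    "PCT_CROP":   np.array([[40.0, 30.0], [50.0, 50.0]]),
}
print(screen_pct_fields(fields))  # [[False  True]
                                  #  [False  True]]
```

If this flags cells in the original fsurdat or landuse file, that would point to bad input data rather than the restart file or the interpolation.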
 

Fuhow

Fu Hao
Member
Hi @slevis, an update: while testing in spin-up, I found the NAs disappeared when I used 10 nodes (32 threads per node), and the same NAs appeared with 19 nodes.
 

Fuhow

Fu Hao
Member
It's an interesting question where the NAs came from.
- My first idea was the restart file that initiated the historical simulation. But the restart file came from the spin-up, and you said that you didn't see NAs in the spin-up.
- I would not expect NAs from the "nearest neighbor" interpolation used by use_init_interp = .true.. We use this very often and do not see such a behavior.
- If there is a problem with the "original" fsurdat and landuse, you could try confirming that with a "cold" start simulation where you set finidat = ' '. If you see the NAs in this case, then I would suspect bad data in your original fsurdat and/or landuse files. If so, you will need to address that before trying the simulation again.
Hi, dear @slevis @oleson, after comparing runs with different NTASKS settings, the hybrid run output is normal (without NA grid cells) when using the same NTASKS as the spin-up. But what could lead to this?
 