Regarding using high-resolution USDA CropScape crop cover data for improved crop yield in CLM

uzairrahil · Jun 30, 2025

Dear Scientists,

I am currently working on a regional case study focused on simulating crop yields and projecting their future trends in Lower Michigan. The configuration I’m using is:

CTSM Version: alpha-ctsm5.2.mksrf.23_ctsm5.1.dev171
Resolution: 0.05° over Lower Michigan
Compset: IHISTCLM50_BGC_Crop
Meteorological Forcing: CONUS404

Unfortunately, the crop yield results from my simulation are not satisfactory. To improve accuracy, I am considering incorporating USDA CropScape data, given its high spatial resolution (10–30 m). Specifically, I aim to replace the default crop cover for these crop types: rainfed_temperate_corn, rainfed_temperate_soybean, and spring_wheat.

I have a few uncertainties and would appreciate your guidance:

Do you think incorporating CropScape data will improve the yield results, particularly for the crop types above? Given its higher resolution, I am hopeful it will offer a better spatial representation, but I’m interested in your experience or opinion. What other input data could be replaced to improve the results?
Regarding implementation, I’ve come across two approaches in various threads:
- The first is to modify the raw land cover files and use mksurfdata_map to regenerate the surface dataset after that. I find this complex.
- The second, which I’m considering, is to generate the surface and dynamic land use time series files for the period 1980–2022 (Already Generated), and then modify the crop fractions within the dynamic land use time series using Cropscape data.
- Which approach would you recommend? Are there any specific threads, papers, or resources you suggest I review?
Based on your experience, what are the most critical factors I should pay close attention to when working on such a case? I’d be grateful for any tips or lessons learned you can share.
Which kind of spinup do you recommend? I am considering doing the spinup as follows:

Create a new case for the spinup that repeats the first year of forcing (1980) for every model year:
./xmlchange RUN_STARTDATE=0001-01-01
./xmlchange DATM_YR_ALIGN=1
./xmlchange DATM_YR_END=1980
./xmlchange DATM_YR_START=1980

Then run the spinup case for the equivalent of 1980 to 2022 and use the resulting restart file as finidat for the actual run (1980-2022).

I truly appreciate your support and guidance in advance.

Warm regards,
MUR

uzairrahil · Jul 1, 2025

Your kind support would be greatly appreciated.
@samrabin @oleson @slevis

oleson · Jul 8, 2025

Regarding 2), the most robust way is to modify the raw input files. Then you will be taking advantage of the aggregation rules that are built into the mksurfdata_map code. And you'll be able to generate surface datasets at other resolutions or other regions if needed.
Regarding 3), we generally recommend spinning up over multiple atmospheric years to avoid biasing the model. E.g., if the single year you've chosen to spinup over is a very dry or very wet year, then the initial conditions you generate will be biased toward very dry or very wet conditions.
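
To make the difference between single-year and multi-year spinup forcing concrete, here is a minimal Python sketch of how the forcing year cycles. It assumes the usual DATM convention that model year DATM_YR_ALIGN reads forcing year DATM_YR_START and that the forcing then loops over DATM_YR_START..DATM_YR_END. The function name and the 1980-1989 window are illustrative only, not a recommendation made in this thread.

```python
def forcing_year(model_year, yr_align, yr_start, yr_end):
    """Forcing (data) year read for a given model year.

    Assumption: model year `yr_align` maps to `yr_start`, and the forcing
    cycles through yr_start..yr_end inclusive.
    """
    n_years = yr_end - yr_start + 1
    return yr_start + (model_year - yr_align) % n_years


# Settings from the original post: every model year reuses 1980.
single = [forcing_year(y, 1, 1980, 1980) for y in range(1, 6)]

# Hypothetical multi-year window: model years cycle through 1980-1989.
multi = [forcing_year(y, 1, 1980, 1989) for y in range(1, 13)]

print(single)
print(multi)
```

With a single forcing year, every model year sees the same weather, which is the bias described above. Widening DATM_YR_END lets wet and dry years average out over the spinup.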

uzairrahil · Jul 8, 2025

Thank you very much for your response, @oleson. I greatly appreciate your guidance. I do, however, have a few remaining questions and would value your insight on them.
My goal is to use the USDA CropScape Cropland Data Layer (CDL, 30 m) for a regional case in Lower Michigan to improve crop yield with the IHIST_CLM50_BGC_CROP compset. I use a dynamic land use time series for 1980-2022 and also created surface data with mksurfdata_esmf. I am using CONUS404 forcing at 0.05° resolution.

The generated file "landuse_timeseries_hist_1980-2022_78pfts.txt" lists the raw input paths, with four global files for each year, as shown below:
=================================
/glade/campaign/cesm/cesmdata/inputdata/lnd/clm2/rawdata/CTSM53RawData/globalctsm53histTRENDY2024Deg025_240728/mksrf_landuse_ctsm53_histTRENDY2024_1980.c240728.nc 1980
/glade/campaign/cesm/cesmdata/inputdata/lnd/clm2/rawdata/CTSM53RawData/globalctsm53histTRENDY2024Deg025_240728/mksrf_landuse_ctsm53_histTRENDY2024_1980.c240728.nc 1980
/glade/campaign/cesm/cesmdata/inputdata/lnd/clm2/rawdata/gao_oneill_urban/historical/urban_properties_GaoOneil_05deg_ThreeClass_1980_cdf5_c20220910.nc 1980
/glade/campaign/cesm/cesmdata/inputdata/lnd/clm2/rawdata/lake_area/mksurf_lake_0.05x0.05_hist_clm5_hydrolakes_1980.cdf5.c20220325.nc 1980

and so on for the other years up to 2022 (my simulation end year). The first two entries are the same file, and this pattern repeats for each year.
================================

1- Can you please confirm whether the first two files for each simulation year, which contain variables such as PCT_CFT, PCT_NAT_PFT, and FERTNITRO_CFT, are indeed the ones I need to modify?

2- Besides these files, are there any other files I should be modifying? If so, which variables in those files specifically need to be updated?

3- Should I regrid the USDA CDL dataset to match my simulation resolution of 0.05°, or can CLM use the original 30-meter resolution directly?

4- Since the raw land use/land cover data is global and at a coarser resolution, should I clip the dataset to match my forcing domain (simulation mesh grid) before modification, or can I modify the entire global dataset? If I clip it, will CLM still accept and properly interpret the resulting file during simulation? What would be the proper way to clip or subset?

5- The CDL available for my study area covers 2007-2022, not the earlier years (1980-2006). How will CLM handle this, given that the files from 2007 onwards would have fine-resolution PCT_PFT (if that is the variable I should modify) while the earlier files remain at the coarser default resolution (0.25°)?

6- After the modifications, do I need to add anything to user_nl_clm or change any other configuration?

I look forward to your kind response.

Best regards,
Rahil

oleson · Jul 8, 2025

Given the complications you've listed regarding mismatches in space and time, I've reconsidered my suggestion to use mksurfdata_esmf. I think it would be simplest to modify the landuse timeseries file you've already created at 0.05deg. You shouldn't need to modify the surface dataset since it is for 1980 and your new data starts in 2007. The potential variables to modify would be FERTNITRO_CFT, PCT_CROP, PCT_CROP_MAX, PCT_NAT_PFT, PCT_CFT, PCT_CFT_MAX. I think you'd only need to modify PCT_CROP and PCT_CROP_MAX if you changed the total crop area within a gridcell. Modifying PCT_NAT_PFT should only be necessary if you change the distribution of natural vegetated pfts.

uzairrahil · Jul 11, 2025

Dear Oleson,

Thank you very much for your helpful response.

For final confirmation, I’d like to clarify that the CDL includes crop cover for corn, soybean, and wheat—these are the crops whose coverage I intend to update in the land use time series. In this case, do you think modifying only the following variables would be sufficient: PCT_CFT, PCT_CROP, and PCT_CROP_MAX? Or would you recommend adjusting any additional variables as well?

Thank you again for your guidance.

slevis · Jul 14, 2025

You may not need to modify other variables, as long as the data in the file remain internally consistent (e.g. percentages add to 100 when needed).
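
As an illustration of the internal-consistency point, here is a minimal NumPy sketch that overwrites a few CFT layers and rescales the remaining ones so that each cell still sums to 100. It assumes PCT_CFT is expressed as percent of the crop landunit (so gridcell-level CDL percentages would first need to be divided by PCT_CROP/100), and it works on a plain (cft, lat, lon) array for one time slice. The function name and the CFT indices are hypothetical; check the actual index order in your file.

```python
import numpy as np


def replace_and_renormalize(pct_cft, new_pct, tol=1e-6):
    """Overwrite selected CFT layers, then force each cell to sum to 100.

    pct_cft : (n_cft, ny, nx) array, percent of crop landunit, sums to 100.
    new_pct : dict {cft_index: (ny, nx) array} of replacement percentages.
    """
    out = pct_cft.astype(float).copy()
    idx = sorted(new_pct)
    others = [i for i in range(out.shape[0]) if i not in new_pct]

    for i in idx:
        out[i] = np.clip(new_pct[i], 0.0, 100.0)

    # Shrink or grow the untouched CFTs so they fill whatever is left.
    replaced = out[idx].sum(axis=0)
    other_sum = out[others].sum(axis=0)
    remainder = np.clip(100.0 - replaced, 0.0, None)
    safe_other = np.where(other_sum > 0, other_sum, 1.0)
    out[others] *= np.where(other_sum > 0, remainder / safe_other, 0.0)

    # Final pass: rescale cells that still miss 100 (e.g. replaced > 100),
    # and fall back to the original values where everything became zero.
    total = out.sum(axis=0)
    safe_total = np.where(total > 0, total, 1.0)
    out = np.where(total > 0, out * 100.0 / safe_total, pct_cft)

    assert np.allclose(out.sum(axis=0), 100.0, atol=tol)
    return out


# Two cells, four CFTs; indices 0 and 1 stand in for the replaced crops.
old = np.array([[[40.0, 100.0]], [[30.0, 0.0]], [[20.0, 0.0]], [[10.0, 0.0]]])
new = replace_and_renormalize(
    old, {0: np.array([[50.0, 0.0]]), 1: np.array([[30.0, 0.0]])}
)
print(new[:, 0, :])
```

Note that the final rescaling pass can change the replaced values themselves when there is no other crop in the cell to absorb the difference, so it is worth checking the result against the CDL values afterwards.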

uzairrahil · Jul 25, 2025

Thank you so much, @oleson and @slevis.
I successfully regridded the CDL to 0.05 degrees and replaced the corn, soybean, and wheat values in the land use time series for the years 2007-2010. The model initially gave an error that PCT_CFT did not sum to 100; after I adjusted the values so that they sum to 100, the model ran successfully. However, the results did not improve as much as I expected. The modified file is at: /glade/work/rahilmoh/CLM/surf_data/landuse.timeseries_CLM_UPDATED_(pct_cft_summed_upto_100_for_2007_2010).nc
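
For readers attempting the same regridding step, here is a minimal NumPy sketch of one way to turn a categorical raster into percent cover on a coarser grid by counting pixels per block. It is not necessarily how the regridding above was done. It assumes the CDL has already been reprojected onto a fine lat-lon grid whose cells nest exactly inside the 0.05° cells; the function name is hypothetical, and the class codes in the example should be checked against the CDL legend.

```python
import numpy as np


def class_percent(fine, code, block):
    """Percent of each coarse cell covered by class `code`.

    fine  : (ny, nx) integer array of land cover class codes.
    block : number of fine pixels per coarse cell along each axis.
    """
    ny, nx = fine.shape
    if ny % block or nx % block:
        raise ValueError("fine grid must be an exact multiple of block")
    mask = (fine == code).astype(float)
    # Group pixels into (coarse_y, block, coarse_x, block) and average.
    blocks = mask.reshape(ny // block, block, nx // block, block)
    return 100.0 * blocks.mean(axis=(1, 3))


# Toy 4x4 raster aggregated to 2x2; 1 and 5 are used as example class codes.
fine = np.array(
    [
        [1, 1, 5, 5],
        [1, 0, 5, 24],
        [0, 0, 1, 1],
        [0, 0, 1, 1],
    ]
)
pct_code1 = class_percent(fine, 1, 2)
pct_code5 = class_percent(fine, 5, 2)
print(pct_code1)
print(pct_code5)
```

The result is percent of the gridcell, so it still has to be converted to percent of the crop landunit before it can be written into PCT_CFT.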

This is my user_nl_clm:
fsurdat='/glade/work/rahilmoh/CLM/Model/my_ctsm_for_surfdata_ctsm5.3.050/CTSM/tools/mksurfdata_esmf/surfdata_LM_Rahil_5km_hist_1980_78pfts_c250610.nc'
flanduse_timeseries ='/glade/work/rahilmoh/CLM/surf_data/landuse.timeseries_CLM_UPDATED_(pct_cft_summed_upto_100_for_2007_2010).nc'
finidat = '/glade/derecho/scratch/rahilmoh/archive/MI46_141y135x_IHistClm50BgcCrop_0.05d_CTSM_DynLULC_Conus404_Spinup/rest/0036-01-01-00000/MI46_141y135x_IHistClm50BgcCrop_0.05d_CTSM_DynLULC_Conus404_Spinup.clm2.r.0036-01-01-00000.nc'
use_init_interp = .true.
check_finidat_year_consistency = .false.

limit_irrigation_if_rof_enabled = .true.
use_groundwater_irrigation = .false.

lower_boundary_condition = 3
use_bedrock = .false.
soilwater_movement_method = 1

hist_empty_htapes = .true.

! Monthly average output for both streams
hist_nhtfrq = 0, 0

! Both streams: 12 monthly samples per file, i.e. 1 file per year
hist_mfilt = 12, 12

! Stream 1: grid-averaged output (.true.), Stream 2: subgrid/PFT-level output (.false.)
hist_dov2xy = .true., .false.

! Stream 1: all your standard variables (grid-averaged)
hist_fincl1 = 'EFLX_SOIL_GRND','Qh', 'FSH', 'FCEV', 'FGEV', 'FCTR', 'FSAT',
'EFLX_LH_TOT', 'H2OCAN', 'H2OSNO', 'H2OSOI',
'HK', 'QDRAI', 'QFLX_EVAP_TOT', 'QFLX_EVAP_VEG', 'QFLX_SNOW_GRND', 'QFLX_LIQ_GRND',
'QH2OSFC', 'QINFL', 'QINTR', 'QIRRIG_FROM_SURFACE', 'QIRRIG_DEMAND',
'QIRRIG_DRIP', 'QIRRIG_SPRINKLER',
'QDRAI_PERCH',
'Qle', 'TBOT', 'GRAINC',
'RH', 'Tair', 'WIND', 'FLDS', 'FSDS', 'QBOT',
'QOVER', 'QRUNOFF', 'SNOWLIQ', 'QSOIL', 'QTOPSOIL', 'QVEGE', 'QVEGT',
'RAIN', 'SNOW', 'RAIN_FROM_ATM', 'SNOW_FROM_ATM',
'VOLR', 'VOLRMCH', 'TWS', 'TOTSOILLIQ',
'ZWT', 'ZWT_CH4_UNSAT', 'ZWT_PERCH', 'H2OSOI',
'TLAI', 'GPP', 'QCHARGE'

! Stream 2: subgrid (PFT) level output for GRAINC_TO_FOOD
hist_fincl2 = 'GRAINC_TO_FOOD'

1- I would be grateful if you could look at the file above and suggest other possible modifications to improve crop yield at the county level.

2- I am also attaching the crop yield results compared against observations.

A sample plot showing the changes for a randomly chosen county. Results for most counties are similar, with little difference.

I look forward to your response.

Sincerely,
Rahil

slevis · Jul 28, 2025

@uzairrahil the fact that the results are
- different is a good sign, suggesting that your changes have taken effect in the simulation, but then
- so similar suggests to me that little has changed

You could compare the %area of rainfed corn in the above random county, and I suspect you would find %area almost the same in Modified_LU and Default. If %area of rainfed corn has changed, then maybe your plot does not account for that change. Are there other reasons for which your updates should improve the simulation of county-level harvest?
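
The suggested comparison could be sketched as follows in Python with NumPy. It assumes that PCT_CROP is percent of the gridcell and PCT_CFT is percent of the crop landunit, so the gridcell-level percent area of one CFT is their product divided by 100. The county mask, cell areas, and all numbers below are made up for illustration.

```python
import numpy as np


def cft_gridcell_pct(pct_crop, pct_cft_one):
    """Percent of the gridcell occupied by a single CFT."""
    return pct_crop * pct_cft_one / 100.0


def county_mean_pct(field, mask, cell_area):
    """Area-weighted mean of `field` over cells where mask == 1."""
    weights = cell_area * mask
    return float((field * weights).sum() / weights.sum())


# Hypothetical two-cell county for one year.
pct_crop = np.array([[50.0, 20.0]])
corn_default = np.array([[40.0, 10.0]])   # PCT_CFT layer, default file
corn_modified = np.array([[60.0, 10.0]])  # PCT_CFT layer, modified file
mask = np.array([[1.0, 1.0]])
area = np.array([[1.0, 1.0]])

default_pct = county_mean_pct(cft_gridcell_pct(pct_crop, corn_default), mask, area)
modified_pct = county_mean_pct(cft_gridcell_pct(pct_crop, corn_modified), mask, area)
print(default_pct, modified_pct)
```

If the two numbers are nearly identical for the real files, the small change in simulated yield is expected. If they differ, the county-level yield calculation should use the modified areas as well.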
