Trying to create an ensemble spread throughout time

jzweifel
Member
Hello CAM people,

I am working on a project where I have created a large plume of SO2 emissions on the East Coast of the United States. I have successfully implemented this aerosol plume in a model (F2000) that I have run for 50 years. After the first 30 years (what my advisor suggested for spin-up time), I created monthly restart files from the following 20 years. Now that the aerosol plume is implemented, I am looking to create an ensemble spread of 20 runs, each running for 3 months starting in June of one of the years 31-50.

What would be the best way to go about this? In the past I created 20 clones of my main run after it had run for 50 years and then ran each of those for 3 months, but I didn't realize that I wouldn't be able to measure any natural variability this way. Should I try to make use of the 20 clones I already have and adjust their case reference date so that each starts from a different year's June?

Let me know if this makes sense and whether anyone thinks they would be able to help me. I am always appreciative of your CESM expertise!

Jack
 

jzweifel
Member
Still looking for an answer if anyone has one, thanks!
 

jzweifel
Member
As an update, I'm operating within my clone_1 and looking at xml variables like RUN_REFCASE, RUN_REFDATE, and RUN_REFDIR.

I was able to set my RUN_REFDATE to my desired timestamp of 0031-06-01.

I am now trying to change my RUN_REFDIR as shown below and am getting this error. Does anyone know why?

jzweifel@derecho1:/glade/work/jzweifel/cases/clones/clone_1> ./xmlchange RUN_REFDIR= --caseroot /glade/derecho/scratch/jzweifel/archive/control_F2000/rest/next_20_years/june_rests/0031-06-01-00000/
ERROR: Directory /glade/derecho/scratch/jzweifel/archive/control_F2000/rest/next_20_years/june_rests/0031-06-01-00000/ does not appear to be a valid case directory

I'm also unsure if I was able to change my RUN_REFCASE. I tried to point it at my case, which I call control_F2000, and got no errors, but when I check with ./xmlquery it looks like the change did not take effect.

Any help would be great, thanks!
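
A note on the xmlchange error above: `--caseroot` tells xmlchange which case directory to operate on, so in the command as typed the restart path was read as the case directory and `RUN_REFDIR=` was left with an empty value. The value belongs directly after the `=`. Here is a minimal Python sketch that builds the command that way; the helper name `xmlchange_args` is hypothetical, and the path is the one from the post.

```python
import shlex


def xmlchange_args(variable, value):
    """Build the argument list for: ./xmlchange VARIABLE=value

    The command is meant to be run from inside the case directory,
    so no --caseroot option is needed.
    """
    return ["./xmlchange", f"{variable}={value}"]


refdir = ("/glade/derecho/scratch/jzweifel/archive/control_F2000/rest/"
          "next_20_years/june_rests/0031-06-01-00000/")
cmd = xmlchange_args("RUN_REFDIR", refdir)

# Print the command so it can be pasted into a shell in the case directory.
print(shlex.join(cmd))
```

Running `./xmlquery RUN_REFDIR` afterwards is a simple way to confirm the value was stored.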
 

brianpm

Active Member
There are a few different things going on here. I guess I'd like to start with the beginning, and figure out what you are trying to accomplish.

If you start the model from a restart file, then the resulting simulation will be exactly the same as the original simulation. This is the main purpose of a restart file - it allows the model to stop and then restart from exactly the same state. This means that the experiments you described will be exactly the same as the original simulation that generated the restart files. At least that is how I am reading the description.

To generate an ensemble to estimate natural variability, you describe running 20 3-month simulations each started from a different year. As I said, on the face of it, this seems no different from taking JJA from each of those years from the 50-year simulation.

I'd suggest that a better approach is to use the "pertlim" method to apply a tiny perturbation to the initial conditions, and repeat that by incrementing the pertlim value 20 times. So start 20 simulations from the same initial condition, but with extremely tiny differences. Natural variability will cause these ensemble members to diverge, creating a spread of solutions. This could be done for multiple initial conditions if desired; for example this could be done for each of the June initial conditions, generating 20 20-member ensembles.

Using pertlim is pretty easy. See this thread for example: How does pertlim "work"?
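
As a rough illustration of the pertlim approach, here is a Python sketch that produces one `pertlim` line per ensemble member, to be placed in that member's `user_nl_cam`. The base magnitude of 1e-14 and the choice to scale it by the member number are assumptions for illustration, not values from this thread, and `pertlim_lines` is a hypothetical helper. My understanding is that pertlim acts when CAM reads an initial-condition file, so it is worth checking that it takes effect for the run type you use.

```python
def pertlim_lines(n_members=20, base=1.0e-14):
    """Return a dict mapping member number -> pertlim line for user_nl_cam."""
    lines = {}
    for member in range(1, n_members + 1):
        # Assumed scheme: each member gets a slightly different perturbation size.
        lines[member] = f"pertlim = {member * base:.1e}"
    return lines


# Show the line each ensemble member would add to its user_nl_cam.
for member, line in pertlim_lines().items():
    print(f"member {member:02d}: {line}")
```

Any scheme that gives each member a distinct, tiny value should serve the same purpose; the point is only that no two members start from bit-identical states.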
 

jzweifel
Member
Hi Brian,

The idea here was to put my SO2 forcing file into each of the 20 clones from June 0031 to June 0050 and let the variation in the June conditions of each year interact with the SO2 plume and study that to determine natural variability.

I've made some progress in doing this and am trying to now submit a case from my clone_1 which pulls from my control_F2000 case at time 0031-06-01.

When looking at the log files for my clone_1 with RUN_TYPE = hybrid, I get output like the following:

... list truncated at 256
dec0790.hsn.de.hpc.ucar.edu 214: ERROR: GETFIL: FAILED to get control_F2000.cam.i.0031-06-01-00000.nc
dec0790.hsn.de.hpc.ucar.edu 178: ERROR: GETFIL: FAILED to get control_F2000.cam.i.0031-06-01-00000.nc

And so on...

This seemed pretty straightforward to me: my clone_1 couldn't access the control_F2000.cam.i.0031-06-01-00000.nc file it needed. I looked at my control_F2000, but it only created yearly input files (is that what cam.i is?), unlike my monthly restart files. I then tried to find a variable that would change how often the .i files are produced, and it looks like there used to be a variable called INITHIST, but that doesn't exist here.

Am I not able to use my files in my RUN_REFDIR to initialize my clone?

Any help would be great, I appreciate all your patience and willingness to help!
 

brianpm

Active Member
The `cam.i` files are initial condition files. These can be used to start a new CAM simulation, but that will not be an exact restart from the case because only a subset of the model's state is written to the `.i.` file; the `.r.` files are restart files and contain the full model state.


I'm looking at one of your cases here:
`/glade/work/jzweifel/cases/clone_1`

The CaseStatus pointed me to the log file:
`/glade/derecho/scratch/jzweifel/clone_1/run/cesm.log.3684083.desched1.240301-14033`

Which says the problem is that the restart case name is not ok:

Code:
dec1496.hsn.de.hpc.ucar.edu 0:  ERROR:
 (seq_infodata_Check) : invalid continue restart case name = control_F2000
dec1496.hsn.de.hpc.ucar.edu 0: Image              PC                Routine            Line        Source
cesm.exe           00000000027AC27D  shr_abort_mod_mp_         114  shr_abort_mod.F90
cesm.exe           000000000278A2D3  seq_infodata_mod_        2737  seq_infodata_mod.F90
cesm.exe           0000000000429AAB  cime_comp_mod_mp_         907  cime_comp_mod.F90
cesm.exe           00000000004306BB  MAIN__                     98  cime_driver.F90

I couldn't remember what this could mean, and all your settings seemed good to me. I did some searching, and found this thread:

So I think if you set CONTINUE_RUN to FALSE, you should be good to go.
 

jzweifel
Member
Hi Brian, I saw that old thread and it did look helpful, but unfortunately I still wasn't able to get my case to run.

I'm really struggling to figure out why. Here is some output I got from your machines:

2024-03-05 11:12:11 MODEL EXECUTION BEGINS HERE
run command is mpiexec --label -n 512 /glade/derecho/scratch/jzweifel/clone_1/bld/cesm.exe >> cesm.log.$LID 2>&1
ERROR: RUN FAIL: Command 'mpiexec --label -n 512 /glade/derecho/scratch/jzweifel/clone_1/bld/cesm.exe >> cesm.log.$LID 2>&1 ' failed
See log file for details: /glade/derecho/scratch/jzweifel/clone_1/run/cesm.log.3711693.desched1.240305-111200



When I looked at the log file, here's what I found, many errors that look like this:

dec0902.hsn.de.hpc.ucar.edu 498: ERROR: GETFIL: FAILED to get control_F2000.cam.i.0031-06-01-00000.nc

With CONTINUE_RUN=FALSE I no longer get the "invalid continue restart case name" error, which is nice, but I now get errors like the one above and the run still fails. Can the model use my restart files from that same time to carry the simulation forward? I ask because, again, I have no initial condition file for that timestamp, but I do have the restart files (of which, as you said, the .i files are a thinner version).

Any help would be greatly, greatly appreciated!

Please let me know anything I could do to help you help me!

Thanks!
 

jzweifel
Member
As an update, I found out I needed to set my RUN_TYPE = branch and CONTINUE_RUN = FALSE, as I want to utilize my restart files to base this simulation off of!

I made those changes to RUN_TYPE and CONTINUE_RUN and rebuilt my clone_1 case.
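
To repeat this setup for all 20 clones, a small Python sketch can generate the xmlchange commands for each one. It assumes the clones are named clone_1 through clone_20, that each June restart set sits in a directory named like the 0031-06-01-00000 one mentioned earlier, and that a 3-month run is requested with STOP_OPTION=nmonths and STOP_N=3. `branch_commands` is a hypothetical helper, and it only prints commands rather than running them, so adjust the names to your own layout before using any of it.

```python
REST_ROOT = ("/glade/derecho/scratch/jzweifel/archive/control_F2000/rest/"
             "next_20_years/june_rests")


def branch_commands(year, refcase="control_F2000", rest_root=REST_ROOT):
    """Return xmlchange commands that branch a clone from June 1 of `year`."""
    refdate = f"{year:04d}-06-01"
    settings = {
        "RUN_TYPE": "branch",
        "CONTINUE_RUN": "FALSE",
        "RUN_REFCASE": refcase,
        "RUN_REFDATE": refdate,
        # Assumed directory naming, following the 0031 example in the thread.
        "RUN_REFDIR": f"{rest_root}/{refdate}-00000",
        # Ask CIME to stage the reference files from RUN_REFDIR.
        "GET_REFCASE": "TRUE",
        "STOP_OPTION": "nmonths",
        "STOP_N": "3",
    }
    return [f"./xmlchange {name}={value}" for name, value in settings.items()]


# clone_1 starts from June of year 31, clone_20 from June of year 50.
for member, year in enumerate(range(31, 51), start=1):
    print(f"# run in the case directory for clone_{member}")
    for command in branch_commands(year):
        print(command)
```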

I successfully built/submitted, but now am having my case aborted for another reason...

Looking at the log files, everything looks fairly normal to me until the model starts interacting with my modified SO2 forcing files, i.e.:

/glade/work/jzweifel/cases/control_F2000/updated_so2_emissions/so2_emissions_ag_ship_res_updated.nc 65536
dec1357.hsn.de.hpc.ucar.edu 1: NetCDF: Attribute not found
dec1463.hsn.de.hpc.ucar.edu 129: NetCDF: Attribute not found
dec2261.hsn.de.hpc.ucar.edu 257: NetCDF: Attribute not found
dec2343.hsn.de.hpc.ucar.edu 385: NetCDF: Attribute not found
dec1357.hsn.de.hpc.ucar.edu 1: Opened existing file /glade/work/jzweifel/cases/control_F2000/updated_so2_emissions/so2_emissions_ag_ship_res_updated.nc 65536
dec1357.hsn.de.hpc.ucar.edu 1: NetCDF: Variable not found
dec1463.hsn.de.hpc.ucar.edu 129: NetCDF: Variable not found
dec2343.hsn.de.hpc.ucar.edu 385: NetCDF: Variable not found
dec2261.hsn.de.hpc.ucar.edu 257: NetCDF: Variable not found
dec1357.hsn.de.hpc.ucar.edu 1: NetCDF: Variable not found

and then some messages that look like this:

MPICH ERROR [Rank 55] [job id a70ed393-e292-48d2-9a89-5122efccd4ad] [Wed Mar 6 11:24:59 2024] [dec1357] - Abort(1001) (rank 55 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 55
aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 55
dec2343.hsn.de.hpc.ucar.edu 451: Abort(1001) (rank 451 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 451

Here's the specific log file, any idea what's happening to abort this job?

/glade/derecho/scratch/jzweifel/clone_1/run/cesm.log.3727997.desched1.240306-112433

Thanks again for all the help! Everything has been super useful :)

Jack
 

brianpm

Active Member
I'm glad you made progress by changing the RUN_TYPE!

Looking at your logs, I'm not sure what exactly is going on. One thing I see, though, is that the last file that was modified was actually the atmosphere log file:
atm.log.3727997.desched1.240306-112433

And it seems upset because:
FLDLST: so4_a2_sfgaex in fincl(17, 1) not found
FLDLST: so4_a3_sfgaex in fincl(18, 1) not found
ERROR: FLDLST: 2 errors found, see log

These variables are listed in FINCL1 in your user_nl_cam file.

I don't think these variables are in CAM, and I didn't see any CAM SourceMods in your case directory. If you need them, they will have to be added in the source code via ADDFLD and OUTFLD calls.
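
To pull out every field that CAM rejected, a short Python sketch can scan the atmosphere log for these FLDLST lines. The regular expression is written against the message format shown above and is an assumption about how such lines look in general; `missing_fields` is a hypothetical helper.

```python
import re

# Matches lines like: FLDLST: so4_a2_sfgaex in fincl(17, 1) not found
FLDLST_RE = re.compile(
    r"FLDLST:\s+(\S+)\s+in fincl\(\s*(\d+),\s*(\d+)\)\s+not found"
)


def missing_fields(log_text):
    """Return (field, entry, tape) tuples for each 'not found' FLDLST line."""
    return [(m.group(1), int(m.group(2)), int(m.group(3)))
            for m in FLDLST_RE.finditer(log_text)]


sample = """FLDLST: so4_a2_sfgaex in fincl(17, 1) not found
FLDLST: so4_a3_sfgaex in fincl(18, 1) not found
ERROR: FLDLST: 2 errors found, see log"""

for field, entry, tape in missing_fields(sample):
    print(f"{field}: entry {entry} of fincl{tape} is not a known field")
```

In practice the text would come from reading the atm.log file rather than from the inline `sample` string.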

That might not be the issue you are noting about the netCDF file, however. When I took a look at one of your files:
/glade/work/jzweifel/cases/control_F2000/updated_so2_emissions/so2_emissions_ag_ship_res_updated.nc

I see that your _FillValue attribute is a NaN:

Code:
variables:
        float lon(lon) ;
                lon:_FillValue = nanf ;
        float lat(lat) ;
                lat:_FillValue = nanf ;
        float time(time) ;
                time:_FillValue = nanf ;
                time:units = "days since 1750-01-01" ;
                time:calendar = "Gregorian" ;
        int date(time) ;
                date:units = "YYYYMMDD" ;
                date:long_name = "Date" ;
                date:cell_methods = "time: mean" ;
        float emiss_ag_sol_was(time, lat, lon) ;
                emiss_ag_sol_was:_FillValue = nanf ;
        float emiss_ship(time, lat, lon) ;
                emiss_ship:_FillValue = nanf ;
        float emiss_res_tran(time, lat, lon) ;
                emiss_res_tran:_FillValue = nanf ;

In the past, I have run into trouble with this, as I don't think CAM understands the NaN value, and prefers to have a numeric value for the _FillValue attribute.
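
A quick way to spot this problem is to scan the header printed by `ncdump -h` for NaN fill values. Below is a minimal Python sketch using only the standard library. It assumes the header text has already been captured (for example by redirecting `ncdump -h` to a file), and `nan_fill_variables` is a hypothetical helper.

```python
import re

# Matches header lines like:  lon:_FillValue = nanf ;  (also NaN / NaNf)
NAN_FILL_RE = re.compile(
    r"^\s*(\w+):_FillValue\s*=\s*nanf?\s*;",
    re.IGNORECASE | re.MULTILINE,
)


def nan_fill_variables(header_text):
    """Return the names of variables whose _FillValue is NaN."""
    return NAN_FILL_RE.findall(header_text)


header = """variables:
        float lon(lon) ;
                lon:_FillValue = nanf ;
        int date(time) ;
                date:units = "YYYYMMDD" ;
        float emiss_ship(time, lat, lon) ;
                emiss_ship:_FillValue = nanf ;
"""

for name in nan_fill_variables(header):
    print(f"{name} has a NaN _FillValue")
```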

In case you are using Python with xarray to produce these files, I can report that I've had pretty good luck using xarray's encoding kwarg in the `to_netcdf` method to make sure that _FillValue is not NaN. Here's one way to do it:
Code:
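import xarray as xr
import netCDF4  # netCDF4.default_fillvals is a dict of default fill values


def save_output(data, name):
    """Write data to a netCDF file with a numeric (non-NaN) _FillValue."""
    # The encoding dict is keyed by variable name, so work with a Dataset.
    if not isinstance(data, xr.Dataset):
        data = data.to_dataset()

    # define the encoding for the output netCDF
    enc = {}
    for dv in data.data_vars:
        if data[dv].dtype == 'float32':
            # 'f4' is the default fill value for 32-bit floats
            enc[dv] = {"_FillValue": netCDF4.default_fillvals['f4'],
                       "complevel": 2,
                       "zlib": True}
    for cv in data.coords:
        # coordinate variables get no _FillValue attribute at all
        enc[cv] = {'zlib': False, '_FillValue': None}
    data.to_netcdf(name, encoding=enc)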
 