
How to reproduce identical output?

I have been running a 150-year emission-driven simulation using CESM2. Unfortunately, I lost several decades of data from the early period. I attempted to restore the data by re-running the model from the original bld files, which initially reproduced the same output. I am currently re-running the model to recover the data, with STOP_N set to 5 years.

However, the model's results begin to diverge at a certain point. I still have the ocean monthly output from the original run, which allows me to verify whether the re-run matches the original experiment. For instance, while re-generating the first 90 years of data, the output suddenly deviates from the original starting at year 67.
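One way to pin down the first divergent month is to compare the overlapping monthly history files pair by pair with cprnc, the netCDF comparison tool distributed with CESM/CIME. This is only a sketch: the directory paths and the file-name pattern are placeholders, and it assumes the two runs produced identically named files.

```shell
#!/bin/sh
# Placeholder paths to the original and re-run ocean history output.
ORIG=/path/to/original/ocn/hist
RERUN=/path/to/rerun/ocn/hist

# Walk the re-run files in time order and report the first month whose
# file is not bit-for-bit identical to the original run's file.
for f in "$RERUN"/*.pop.h.*.nc; do
    base=$(basename "$f")
    if ! cprnc "$ORIG/$base" "$f" | grep -q "IDENTICAL"; then
        echo "first divergent file: $base"
        break
    fi
done
```

This relies on cprnc's end-of-run summary stating whether the two files are IDENTICAL or DIFFERENT; adjust the grep if your cprnc version words its summary differently.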

In some instances, starting a branch run from the most recent restart file brings the results back into alignment. In other cases, however, the divergence persists even with this approach. I am curious about the reasons behind these divergent outcomes and would greatly appreciate any suggestions for addressing this issue.
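For reference, the branch runs mentioned above are configured through the case's XML variables. A minimal sketch, assuming a CESM2 case directory; the reference case name and restart date below are placeholders:

```shell
# In the new case directory: branch from the original case at the last
# date whose restart files still reproduce the original output.
./xmlchange RUN_TYPE=branch
./xmlchange RUN_REFCASE=b.e21.original_case_name   # placeholder name
./xmlchange RUN_REFDATE=0067-01-01                 # placeholder date
# Stage the matching restart and rpointer files into the run
# directory, then submit:
./case.submit
```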
 

sacks

Bill Sacks
CSEG and Liaisons
Staff member
Thank you for your post. This sort of divergence shouldn't happen if you are using the same machine, compiler, compiler options and processor count. We do a lot of testing to try to ensure that's the case, though we do very occasionally find bugs that have made it through this testing, which we take very seriously.

Two possible code-related causes of this come to mind; both are rare (at least in our well-tested release code), but they are possible:

1. An issue with reproducibility when restarting the model. This is the most common source of reproducibility errors. We do extensive testing for this, but sometimes rare edge cases slip through our testing. This would typically show up as differences immediately or soon after restarting the model (e.g., at the start of a new 5-year segment if STOP_N is 5 years). If this is the issue, the divergence should typically go away if you use the same STOP_N for each restart segment as you did in the original run: if there was a reproducibility-on-restart issue, it would then affect both runs in the same way, so the two runs would be identical even though both contained the same underlying error.

2. A reproducibility bug in the code, such as one caused by a race condition in the parallelization. If you rerun the same model segment multiple times from the same restart files and see differences between the runs, this could be the cause.
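A quick way to test for case 2 is to run the identical segment twice from the same restart files and compare the results: if the inputs are bit-for-bit identical but the outputs differ, something in the code or machine is nondeterministic. A sketch of the comparison step, with placeholder paths and file names, again using cprnc:

```shell
# run_A and run_B are two submissions of the same segment, started
# from identical restart and rpointer files (placeholder paths).
cprnc run_A/case.pop.h.0067-12.nc run_B/case.pop.h.0067-12.nc
# A "DIFFERENT" summary here points to nondeterminism (e.g., a race
# condition), since both runs had bit-for-bit identical inputs.
```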

In addition, there are some possible machine-related causes:

3. A change in machine configuration since your initial run; for example, a change in compiler or library versions.

4. A machine glitch in either your original or new run. This is rare but can happen.

5. Data corruption of some of the forcing / input data. You can force re-downloading all input data by changing your DIN_LOC_ROOT xml variable to a different, temporary location.
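Both of the settings mentioned above (STOP_N in case 1 and DIN_LOC_ROOT in case 5) are ordinary case XML variables changed from the case directory; a sketch with placeholder values:

```shell
# Case 1: match the restart segment length of the original run
# (assumed here to have used 5-year segments).
./xmlchange STOP_OPTION=nyears
./xmlchange STOP_N=5

# Case 5: point DIN_LOC_ROOT at a fresh, empty directory so all input
# data are re-downloaded rather than read from possibly corrupted
# local copies (the path below is a placeholder).
./xmlchange DIN_LOC_ROOT=/scratch/username/fresh_inputdata
./check_input_data --download
```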
 

Thank you for laying out these specific possibilities.
After running the experiment multiple times, I have confirmed that the results consistently diverge from the same point (not immediately after a restart, even with STOP_N fixed to match the original run). The inconsistency therefore seems likely to be due to a machine glitch in the original run.
Thanks to your comments, I now have a better understanding of the potential causes of this issue, and I appreciate your help once again.
 