
Porting issue resulting in land-only resubmission failure

hayesdev

Hayes Dev
New Member
Hello all,

I believe I'm having a porting issue resulting in land-only resubmission failures. I would love any guidance or insight that anyone can provide.

Background
I've been trying to run land-only SLIM cases on a machine I ported CESM to, using the Intel compiler. They fail upon the first resubmission, even though coupled runs (SLIM+CAM+CICE) run perfectly fine through multiple resubmissions. Here, SLIM is the Simple Land Interface Model, a simple land model that functionally replaces CLM in the CESM infrastructure.

Offline run details
These offline runs are supposed to take atmospheric coupler data from my coupled runs. The failure at first resubmission occurs whether the offline runs use my coupler data or typical GSWP data.

To troubleshoot, I've done the following things...

  • Run offline simulations with my coupled atmospheric data for a variety of run lengths
    • These all ran up until the first resubmission --> this shows it's not a problem with my coupler data
      • 1 year
      • 5 years
      • 10 years
  • I got a Cheyenne account to see whether it was a problem with SLIM code or my own porting
    • I was able to successfully run land-only SLIM with multiple resubmissions using a predefined SLIM compset with GSWP data
      • This same compset failed upon first resubmission when I tried it on my machine
      • Compset
        • 1850_DATM%GSWP3v1_CLM45%SP_SICE_SOCN_SROF_SGLC_SWAV
    • --> this leads me to believe it’s an error with my initial porting
  • I tried some tests from the porting documentation
    • scripts_regression_tests.py showed a failure for the test "test_run_restart"
      • I got the error "stat 100"

I realize this is a CESM forum, but since my problem seems to be rooted in my machine porting, I was hoping someone might be able to point me in the right direction to fix this. Any insight is greatly appreciated. Please let me know if there's any other information that's needed. Thanks!

Potentially Relevant Information Attached
Porting Info
  1. config_machines
  2. config_compilers
  3. config_batch

Regression Test Info
  1. scripts_regression_tests.py used within SLIM
  2. Regression test output terminal

Offline Case Run Log Files

  1. cesm log
  2. lnd log
  3. atm log
 

Attachments

  • config_batch.xml.txt
    3 KB
  • config_compilers.xml.txt
    3 KB
  • config_machines.xml.txt
    6.4 KB
  • 6.22.23_regression_test_output.txt
    75.9 KB
  • scripts_regression_tests.py.txt
    139.5 KB
  • atm.log.203930.230518-011710.txt
    16.8 KB
  • cesm.log.203930.230518-011710.txt
    803 KB
  • lnd.log.203930.230518-011710.txt
    66.9 KB

erik

Erik Kluzek
CSEG and Liaisons
Staff member
Hmmm. This is really odd. I usually tell people to simplify their cases and get something simpler to work. But here you have the more complex case working, and you want to get the simpler case to work.

I also really like everything you've tried, those are all things I would suggest if you hadn't already done them.

But, it also sounds like it runs for the first go through, but then fails on re-submission. So I wonder if there's something going wrong in the restarts? I'm not sure how it's specific to re-submission and only when not coupled to CAM though.

My first question is: what version of SLIM are you using? If I understand this correctly, it's not that it fails at a certain simulation date; it fails just after starting back up from restart? One suggestion would be to go through all the log files and carefully make sure every component is properly starting up from restart, meaning it's finding the correct restart file and starting up from the same simulation time. Maybe also save restart files more often, so that you can start from different simulation time points. Also maybe look to make sure that the restart files actually contain data and aren't just full of zeros or NaNs or something.
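For the "full of zeros or NaNs" check, something along these lines might help. This is only a rough sketch: it works on plain lists of numbers, and it assumes you have already pulled each variable out of the restart file with whatever NetCDF reader you have available. The variable names in the example (tsoil, snow, water) are made up for illustration and are not actual SLIM restart variable names.

```python
import math


def summarize_field(values):
    """Count NaN, infinite, and exactly-zero entries in a flat list of numbers."""
    return {
        "size": len(values),
        "nan": sum(1 for v in values if math.isnan(v)),
        "inf": sum(1 for v in values if math.isinf(v)),
        "zero": sum(1 for v in values if v == 0.0),
    }


def suspicious_fields(fields):
    """Return names of fields that are empty, all zero, or contain NaN/Inf."""
    flagged = []
    for name, values in fields.items():
        s = summarize_field(values)
        all_zero = s["size"] > 0 and s["zero"] == s["size"]
        if s["size"] == 0 or s["nan"] or s["inf"] or all_zero:
            flagged.append(name)
    return flagged


# Hypothetical stand-in for variables read from a restart file.
example = {
    "tsoil": [271.3, 272.8, 274.1],
    "snow": [0.0, 0.0, 0.0],
    "water": [float("nan"), 12.0, 3.5],
}
print(suspicious_fields(example))
```

A flagged field is not necessarily wrong (an all-zero snow field can be perfectly valid), so treat the output as a list of things to look at by hand.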

Another idea is to run with different compilers and see if you get the same problem. You should be able to fairly easily get gnu working here (and gnu is free). I know that's a fairly big ask, but I do find comparing problems with different compilers to be useful. We also run with the NAG compiler and NVHPC, but that's an even bigger ask. Also, how does your version of Intel compare to the version being used on Cheyenne? If it's different, it's possible there is a compiler bug behind this. You can look for the compiler versions in config_machines.xml in the base code for the supported machines and see if you have a match. You can also try to make sure you are using the same versions of things on your machine as are used on Cheyenne, for example by matching the NetCDF version.
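If it helps, comparing module versions between two machine entries can be scripted. The sketch below assumes the usual config_machines.xml layout of machine, module_system, modules, and command elements, and it treats the compiler attribute as a plain string match, which is a simplification. The machine names and version numbers in the sample are invented for illustration and are not the real Cheyenne settings.

```python
import xml.etree.ElementTree as ET

# Invented sample; substitute the text of your own config_machines.xml.
SAMPLE = """
<config_machines>
  <machine MACH="my_port">
    <module_system type="module">
      <modules compiler="intel">
        <command name="load">intel/2021.4</command>
      </modules>
      <modules compiler="gnu">
        <command name="load">gcc/11.2</command>
      </modules>
      <modules>
        <command name="load">netcdf/4.8.1</command>
      </modules>
    </module_system>
  </machine>
  <machine MACH="reference_site">
    <module_system type="module">
      <modules compiler="intel">
        <command name="load">intel/19.0.5</command>
      </modules>
      <modules>
        <command name="load">netcdf/4.7.3</command>
      </modules>
    </module_system>
  </machine>
</config_machines>
"""


def loaded_modules(xml_text, machine, compiler):
    """Map module name -> version for 'load' commands that apply to `compiler`."""
    root = ET.fromstring(xml_text)
    found = {}
    for mach in root.iter("machine"):
        if mach.get("MACH") != machine:
            continue
        for mods in mach.iter("modules"):
            # Blocks without a compiler attribute apply to every compiler.
            if mods.get("compiler") not in (None, compiler):
                continue
            for cmd in mods.findall("command"):
                if cmd.get("name") == "load" and cmd.text:
                    name, _, version = cmd.text.strip().partition("/")
                    found[name] = version
    return found


def version_mismatches(mine, reference):
    """Return {module: (my_version, reference_version)} where they differ."""
    return {
        name: (mine.get(name), ver)
        for name, ver in reference.items()
        if mine.get(name) != ver
    }


mine = loaded_modules(SAMPLE, "my_port", "intel")
ref = loaded_modules(SAMPLE, "reference_site", "intel")
print(version_mismatches(mine, ref))
```

Any module that shows up in the result is a candidate to align with the supported machine before digging further.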

Those are some ideas I have now...
 

jedwards

CSEG and Liaisons
Staff member
This is a SLIM restart science error, and I don't think it has anything to do with the machine port:

warning: snow<0, setting snowmasking factor to zero. (snow(g) =
0.000000000000000E+000 , overwriting so snow(g)=0.0)
testMod_mml
MML ERROR: Soil temperature energy conservation error: pre-phase change
ENDRUN:
ERROR in mml_main.F90 at line 1153
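Messages like the ones above are easy to miss in a large cesm log, so a small scan for them can save time. This is a minimal sketch that assumes the lines of interest contain the upper-case words ERROR or ENDRUN, as in the excerpt; the function name is made up, and the pattern may need widening for other components.

```python
import re

# Upper-case only, so ordinary prose like "conservation error" is not matched twice.
ERROR_PATTERN = re.compile(r"\b(ERROR|ENDRUN)\b")


def find_error_lines(log_text):
    """Return (line_number, text) pairs for lines mentioning ERROR or ENDRUN."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if ERROR_PATTERN.search(line):
            hits.append((number, line.strip()))
    return hits


# The excerpt quoted above, used as sample input.
SAMPLE_LOG = "\n".join([
    "warning: snow<0, setting snowmasking factor to zero. (snow(g) =",
    "0.000000000000000E+000 , overwriting so snow(g)=0.0)",
    "testMod_mml",
    "MML ERROR: Soil temperature energy conservation error: pre-phase change",
    "ENDRUN:",
    "ERROR in mml_main.F90 at line 1153",
])

for number, text in find_error_lines(SAMPLE_LOG):
    print(number, text)
```

To use it on a real log, read the file into a string first and pass that in place of SAMPLE_LOG.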
 