Running prealpha tests: in SMS_D.f09_g16.I1850Clm50BgcSpinup the datm.input_data_list references glade

johnsonb

Ben Johnson
New Member
Hi CGD BB,

I'm in the process of porting CESM to Shaheen II, which is a Cray XC40 at King Abdullah University of Science and Technology.

I began by modifying the config_machines, config_compilers, and config_batch XML files to add the new machine, using the existing cori-haswell entry in each of those files (since Cori is also a Cray XC40) as a starting point for the shaheen entry. After some trial and error I was able to start the prealpha tests for the Intel compiler:

cd $CIMEROOT/scripts
./create_test --xml-category prealpha --xml-machine cheyenne --xml-compiler intel --machine shaheen --compiler intel
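
In case it helps anyone else porting, I sanity-checked that the shaheen entries made it into all three files with a plain grep (this assumes the standard cime/config/cesm/machines layout):

cd $CIMEROOT/config/cesm/machines
# the MACH="shaheen" attribute should appear in all three files
grep -n '"shaheen"' config_machines.xml config_compilers.xml config_batch.xml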


A subset of the tests passed and I'm in the process of working through the failed tests to get them to pass.

One test, SMS_D.f09_g16.I1850Clm50BgcSpinup, is behaving peculiarly. When the case is set up, it creates a datm.input_data_list that includes filepaths that don't reference $DIN_LOC_ROOT. For this port, $DIN_LOC_ROOT is set to:

/lustre/project/k1421/cesm_store/inputdata/

Instead, the filepaths reference the BGCWG project space on GLADE:

/glade/p/cesm/bgcwg_dev/forcing/

This is unique to datm.input_data_list; the other data lists and stream files -- for example, clm.input_data_list -- all have filepaths that reference $DIN_LOC_ROOT. I've attached both files for comparison.
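
For what it's worth, this is how I picked out the offending entries (the input data lists live under Buildconf/ in my case directory; the exact location may differ in other CIME versions):

cd <caseroot of SMS_D.f09_g16.I1850Clm50BgcSpinup>
DIN_LOC_ROOT=$(./xmlquery DIN_LOC_ROOT --value)
# print every entry that does not point under the local inputdata root
grep -v "$DIN_LOC_ROOT" Buildconf/datm.input_data_list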

I've also attached the error messages that occur when trying to run case.submit.

What should I do here? Should I skip this test? Should I rewrite datm.input_data_list so that it references $DIN_LOC_ROOT and try to get:

./check_input_data --download

to transfer the data?
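
In other words, something along these lines -- where the subdirectory under $DIN_LOC_ROOT is just a guess on my part, and I don't know whether these files exist on the public inputdata server at all:

cd <caseroot of SMS_D.f09_g16.I1850Clm50BgcSpinup>
DIN_LOC_ROOT=$(./xmlquery DIN_LOC_ROOT --value)
# hypothetical rewrite: point the GLADE paths at a local directory under $DIN_LOC_ROOT
sed -i.bak "s|/glade/p/cesm/bgcwg_dev/forcing|${DIN_LOC_ROOT}/atm/datm7/bgcwg_forcing|g" Buildconf/datm.input_data_list
./check_input_data --download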

Thank you,
Ben Johnson / johnsonb
 

Attachments

  • clm.input_data_list.txt (1.7 KB)
  • datm.input_data_list.txt (24.6 KB)
  • case.submit.txt (75.7 KB)

jedwards

CSEG and Liaisons
Staff member
This is due to a test mod in the clm source. The shell_commands testmod at
components/clm/cime_config/testdefs/testmods_dirs/clm/cplhist/shell_commands
runs:
./xmlchange DATM_CPLHIST_DIR=/glade/p/cesm/bgcwg_dev/forcing/b.e20.B1850.f09_g17.pi_control.all.221.cplhist/cpl/hist.mon

I think that you can safely skip this test. Have you run the scripts_regression_tests.py in cime/scripts/tests?
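
For example (the exact options vary a bit between CIME versions, so check ./scripts_regression_tests.py --help first):

cd $CIMEROOT/scripts/tests
./scripts_regression_tests.py --machine shaheen --compiler intel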
 

johnsonb

Ben Johnson
New Member
Hi Jim,

Ah, thanks for the guidance; I appreciate it. After reading the CIME documentation, I was under the impression that a necessary step for validating the port was getting it to pass all of the prealpha tests. I found your 2019 CESM Workshop talk, and that, together with your response, leaves me with the impression that the recommended procedure is to run the regression tests first and the ensemble consistency tests second.

Please let me know if these notions are incorrect.

I found Bill's CTSM system testing guide -- it doesn't focus on the regression tests per se, but its explanation and rationale for the CIME tests gave me an understanding that I didn't glean from the CIME documentation. I feel like I now know what to look for in the test results.

After running the regression tests I'm left with 6 cs.status reports:
  1. cs.status.20200626_231627
    SMS.f19_g16.2000_SATM_XLND_SICE_SOCN_XROF_XGLC_SWAV.shaheen_intel (Overall: FAIL)
    ...
    FAIL SMS.f19_g16.2000_SATM_XLND_SICE_SOCN_XROF_XGLC_SWAV.shaheen_intel RUN time=5
    This test ran twice: first via scripts_regression_tests.py, where it failed, so I set DEBUG=TRUE and reran it.
  2. cs.status.20200626_232821
    TESTBUILDFAIL_P1.f19_g16_rx1.A.shaheen_intel (Overall: PASS)
    TESTRUNFAIL_P1.f19_g16_rx1.A.shaheen_intel (Overall: PASS)
    TESTRUNPASS_P1.f19_g16_rx1.A.shaheen_intel (Overall: PASS)
  3. cs.status.20200626_232913
    TESTRUNDIFF_P1.f19_g16_rx1.A.shaheen_intel (Overall: PASS)
  4. cs.status.20200626_232938
    TESTRUNDIFF_P1.f19_g16_rx1.A.shaheen_intel (Overall: DIFF) The only failure is BASELINE
    ...
    FAIL TESTRUNDIFF_P1.f19_g16_rx1.A.shaheen_intel BASELINE fake_testing_only_20200626_232913
    ...
  5. cs.status.20200626_233009 From the status time stamps, these are the G testmods
    TESTRUNPASS_P1.f19_g16_rx1.A.shaheen_intel (Overall: PASS)
    TESTRUNPASS_P1.f45_g37_rx1.A.shaheen_intel (Overall: PASS)
    TESTRUNPASS_P1.ne30_g16_rx1.A.shaheen_intel (Overall: PASS)
  6. cs.status.20200626_233105 From the status time stamps, these are the C testmods
    TESTRUNPASS_P1.f19_g16_rx1.A.shaheen_intel (Overall: DIFF) The only failure is BASELINE
    ...
    FAIL TESTRUNPASS_P1.f19_g16_rx1.A.shaheen_intel BASELINE fake_testing_only_20200626_233009
    ...
    TESTRUNPASS_P1.f45_g37_rx1.A.shaheen_intel (Overall: DIFF) The only failure is BASELINE
    ...
    FAIL TESTRUNPASS_P1.f45_g37_rx1.A.shaheen_intel BASELINE fake_testing_only_20200626_233009
    ...
    TESTRUNPASS_P1.ne30_g16_rx1.A.shaheen_intel (Overall: DIFF) The only failure is BASELINE
    ...
    FAIL TESTRUNPASS_P1.ne30_g16_rx1.A.shaheen_intel BASELINE fake_testing_only_20200626_233009
    ...
All of the tests with an Overall: DIFF result got it only because the BASELINE comparison failed. I haven't checked out a different version of the repository to generate baseline files for comparison, and we're not modifying any source code, so as far as I understand it seems okay that the BASELINE tests are failing. Is this a correct interpretation?
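
For my own notes: my understanding is that a real baseline comparison would require generating baselines from a trusted checkout first and then comparing a later checkout against them, roughly like the lines below (the baseline root and name are placeholders, and the flags may differ slightly between CIME versions):

# generate baselines from a trusted checkout
./create_test SMS.f19_g16.A --machine shaheen --compiler intel --generate --baseline-root /lustre/project/k1421/cesm_baselines --baseline-name port_check
# compare a later checkout against them
./create_test SMS.f19_g16.A --machine shaheen --compiler intel --compare --baseline-root /lustre/project/k1421/cesm_baselines --baseline-name port_check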

The SMS.f19_g16.2000_SATM_XLND_SICE_SOCN_XROF_XGLC_SWAV.shaheen_intel failure concerns me because it is a smoke test that we should be passing. Again, I ran it twice: first via scripts_regression_tests.py, and then a second time with DEBUG=TRUE. I've attached the cesm.log.14915610.200629-201623.txt file generated by the second attempt.

Searching through this file, the relevant error (maybe?) seems to be:
ERROR: if prognostic surface model must also have atm present
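
I found that line with a plain grep of the attached log, in case the surrounding context is useful to anyone:

grep -n -B 2 -A 2 "prognostic surface" cesm.log.14915610.200629-201623.txt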

I'm perplexed by this because COMP_LND=xlnd. I ran XML queries on the case just to double-check that the settings match what one would deduce from the case name, and they all do:
> ./xmlquery COMP_ATM
COMP_ATM: satm
> ./xmlquery COMP_LND
COMP_LND: xlnd
> ./xmlquery COMP_ICE
COMP_ICE: sice
> ./xmlquery COMP_OCN
COMP_OCN: socn
> ./xmlquery COMP_ROF
COMP_ROF: xrof
> ./xmlquery COMP_GLC
COMP_GLC: xglc
> ./xmlquery COMP_WAV
COMP_WAV: swav


Since COMP_LND=xlnd, I don't understand why that error message is being generated. Is it a spurious message? Am I missing something more relevant?

Thank you again for your guidance,
Ben Johnson / johnsonb
 

Attachments

  • cesm.log.14915610.200629-201623.txt (78.3 KB)

johnsonb

Ben Johnson
New Member
Hi CGD BB,

I've also attached the output of describe_version as version_info.txt.

Thank you again,
Ben Johnson / johnsonb
 

Attachments

  • version_info.txt (6.8 KB)

erik

Erik Kluzek
CSEG and Liaisons
Staff member
Hi Ben Johnson

Yes, as @jedwards says, you can skip the SMS_D.f09_g16.I1850Clm50BgcSpinup test. The data for that test is only available on the NCAR machine Cheyenne. I've opened an issue so we'll remember to fix this in the future.


Sorry about the problem, and good luck with your porting work.
 