Porting CESM: Case builds, but fails with segfault

paulhall

Paul Hall
Member
Thanks @jedwards! That seems to mostly work. CMake appears to be having trouble finding NetCDF locally, though. I'm getting an error message saying:

CMake Error at CMakeLists.txt:25 (find_package):
By not providing "FindNetCDF.cmake" in CMAKE_MODULE_PATH this project has
asked CMake to find a package configuration file provided by "NetCDF", but
CMake did not find one.

Could not find a package configuration file provided by "NetCDF" with any
of the following names:

NetCDFConfig.cmake
netcdf-config.cmake

Add the installation prefix of "NetCDF" to CMAKE_PREFIX_PATH or set
"NetCDF_DIR" to a directory containing one of the above files. If "NetCDF"
provides a separate development package or SDK, be sure it has been
installed.
despite the relevant NetCDF module being loaded (as a check, nc-config --version returns netCDF 4.9.0 and nf-config --version returns netCDF-Fortran 4.6.0).

Looking through the system NetCDF installation, I can find configuration files named netCDFConfig.cmake.in and netCDF-FortranConfig.cmake.in to point at, but no NetCDFConfig.cmake or netcdf-config.cmake as suggested by the error message. Can I just rename netCDFConfig.cmake.in to netCDFConfig.cmake and point CMake to that directory?
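
For reference, a minimal sketch of what the error message itself suggests - pointing CMake at the NetCDF install prefix rather than renaming files. This assumes nc-config and nf-config report the install roots; the exact cmake invocation for cprnc may differ:

# make the NetCDF C and Fortran installs visible to find_package(NetCDF)
export CMAKE_PREFIX_PATH=$(nc-config --prefix):$(nf-config --prefix):$CMAKE_PREFIX_PATH
# or pass the prefixes directly on the cmake command line (semicolon-separated inside CMake)
cmake -DCMAKE_PREFIX_PATH="$(nc-config --prefix);$(nf-config --prefix)" .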
 

paulhall

Paul Hall
Member
Hi @jedwards

That did the trick as far as building cprnc goes. After building, I updated the config_machines.xml file accordingly and then re-ran scripts_regression_tests.py (see the attached text file for output). Having cprnc available appears to have eliminated one of the failures, but 5 tests are still failing. Any ideas?

Thanks!
 

Attachments

  • testlog3.txt
    102.9 KB

paulhall

Paul Hall
Member
For the sake of comparison, I tried cloning cesm2.2.0 and running its scripts_regression_tests.py on my local cluster. This resulted in 364 failed tests (see attached text file), many with messages like: xml.etree.ElementTree.ParseError: no element found: line 122, column 0. Are there significant differences in the regression tests between the two versions of the code (cesm2.2.0 vs cesm2.3.beta14)?
 

Attachments

  • testlog_220.txt
    860.7 KB

jedwards

CSEG and Liaisons
Staff member
There are significant differences in the XML files between the two versions - in particular, 2.2.0 uses config_compilers.xml while 2.3 uses the CMake modules.
 

paulhall

Paul Hall
Member
Thanks @jedwards

I was aware of the config_compilers.xml vs. CMake differences, and included a config_compilers.xml (with what I believe are the appropriate settings) in my $HOME/.cime directory before running the test. I guess there is more to it than that?

Is there a way to get more detailed information about the tests that have failed, beyond stdout from scripts_regression_tests.py? I notice that the bldlog files mentioned in the testlogs seem to be deleted. For example, for a failed test the testlog contains the line:

ERROR: /oscar/data/ccvstaff/phall1/opt/cesm/cesm2_3_beta14_floe/cime/CIME/build_scripts/buildlib.cprnc FAILED, cat /oscar/scratch/phall1/cesm/scripts_regression_test.20231018_145814/ERS_Ln7.f19_g16_rx1.A.oscar_gnu.fake_testing_only_20231018_150900/bld/cprnc.bldlog.231018-150907

but when I try to look at cprnc.bldlog.231018-150907 after running scripts_regression_tests.py, the file (and the directory it is supposed to be in) doesn't exist.
 

jedwards

CSEG and Liaisons
Staff member
At the top (line 48) of scripts_regression_tests.py set
NO_TEARDOWN=TRUE and run again.
You can also run individual tests and get them working before going on to the next one, for example:
./scripts_regression_tests.py G_TestMacrosBasic

But after looking again at the test results you sent, I see that you have a formatting error in your config_compilers.xml file:

> no element found: line 122, column 0
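
As a quick sanity check on well-formedness before re-running the whole suite, something like the following sketch would work (xmllint assumes libxml2 is installed; the Python one-liner needs only the standard library):

xmllint --noout ~/.cime/config_compilers.xml
# or, without xmllint:
python -c "import xml.etree.ElementTree as ET, os; ET.parse(os.path.expanduser('~/.cime/config_compilers.xml'))"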
 

paulhall

Paul Hall
Member
Thanks @jedwards! Good catch on the formatting of config_compilers.xml. I had a typo; I've fixed it and am re-running the regression tests for cesm2.2.0.

Is there a way to set NO_TEARDOWN=TRUE for cesm2.3? scripts_regression_tests.py is very different, and NO_TEARDOWN doesn't seem to be defined in this newer version (at least not in the same way).
 

jedwards

CSEG and Liaisons
Staff member
For more recent versions of scripts_regression_tests.py, it's a command-line argument: --no-teardown

./scripts_regression_tests.py --help
usage:
scripts_regression_tests.py [TEST] [TEST]
OR
scripts_regression_tests.py --help

EXAMPLES:
# Run the full suite
> scripts_regression_tests.py

# Run single test file (with or without extension)
> scripts_regression_tests.py test_unit_doctest

# Run single test class from a test file
> scripts_regression_tests.py test_unit_doctest.TestDocs

# Run single test case from a test class
> scripts_regression_tests.py test_unit_doctest.TestDocs.test_lib_docs

Script containing CIME python regression test suite. This suite should be run to confirm overall CIME correctness.

positional arguments:
tests Specific tests to run e.g. test_unit* (default: None)

options:
-h, --help show this help message and exit
--fast Skip full system tests, which saves a lot of time (default: False)
--no-batch Do not submit jobs to batch system, run locally. If false, will default to machine setting. (default: False)
--no-fortran-run Do not run any fortran jobs. Implies --fast Used for github actions (default: False)
--no-cmake Do not run cmake tests (default: False)
--no-teardown Do not delete directories left behind by testing (default: False)
--machine MACHINE Select a specific machine setting for cime (default: None)
--compiler COMPILER Select a specific compiler setting for cime (default: None)
--mpilib MPILIB Select a specific compiler setting for cime (default: None)
--test-root TEST_ROOT
Select a specific test root for all cases created by the testing (default: None)
--timeout TIMEOUT Select a specific timeout for all tests (default: None)
--verbose Enable verbose logging (default: False)
--debug Enable debug logging (default: False)
--silent Disable all logging (default: False)
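
For example, a sketch combining the flag with a single test and the machine/compiler names that appear in the test IDs earlier in this thread (adjust to your setup):

./scripts_regression_tests.py --no-teardown --machine oscar --compiler gnu test_unit_doctest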
 

paulhall

Paul Hall
Member
Thanks @jedwards. I'll give that a try. In the meantime, I re-ran scripts_regression_tests.py for cesm2.2.0 with the corrected config_compilers.xml file. Fixing config_compilers.xml knocked it down to 328 failed tests (see the attached logfile). Many of the failed tests seem to throw the error:

NameError: name 'sys' is not defined

Any chance this is another XML formatting issue? Is there any guidance in the CESM or CIME online docs for interpreting the output from these tests?
 

Attachments

  • testlog_220_1.txt
    513 KB

jedwards

CSEG and Liaisons
Staff member
I think the issue here is that you need an older Python version, and I suspect the sys error is coming from your virtual environment.
 

jedwards

CSEG and Liaisons
Staff member
It works with Python 2.7.17, and it should also work with an earlier version of Python 3 - I was trying to track down which one. 3.4 through 3.6 should work.
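
A minimal sketch of pinning an older interpreter just for the test run (the conda environment name is hypothetical; a module load or any other Python 3.6 install would do the same job):

conda create -n cime-py36 python=3.6
conda activate cime-py36
python --version                # expect Python 3.6.x
./scripts_regression_tests.py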
 

paulhall

Paul Hall
Member
So I have managed to whittle the number of failed tests for cesm2.3 down to 4 (and 15 skipped). One of the remaining failed tests is test_self_build_cprnc, and I suspect that is failing because CMake cannot find the NetCDF package configuration file, as I experienced when building cprnc manually. Any thoughts about the 3 remaining failed tests (test_full_system, test_run_restart_too_many_fails, and test_user_concurrent_mods)?

Thanks again for all of your help with this, @jedwards!
 

Attachments

  • testlog5.txt
    123.4 KB

jedwards

CSEG and Liaisons
Staff member
For the system tests you should go into the case directory and look at TestStatus.log to determine why they fail.
You can also try running them with create_test from the command line - for example:
./create_test SMS_D_Ln9_Mmpi-serial.f19_g16_rx1.A.oscar_gnu

and for the cprnc build you need to look at the log: cat /oscar/scratch/phall1/cesm/scripts_regression_test.20231019_162116/ERS_Ln7.f19_g16_rx1.A.oscar_gnu.fake_testing_only_20231019_163043/bld/cprnc.bldlog.231019-163050

This test should work - it was only the documentation that was out of date, not the cprnc build process itself.
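
A sketch of that kind of inspection, using the case directory from the log path above (paths will differ per run):

cd /oscar/scratch/phall1/cesm/scripts_regression_test.20231019_162116/ERS_Ln7.f19_g16_rx1.A.oscar_gnu.fake_testing_only_20231019_163043
cat TestStatus                                  # per-phase pass/fail summary
grep -iE "error|fail" TestStatus.log | head     # first errors in the detailed log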
 

paulhall

Paul Hall
Member
Hi @jedwards

So it appears that at least some of the failed tests are producing errors similar to the error that led me to start the thread: a segfault early in the run process. For example, from the cesm.log file for ERI_Ln9.f09_g16.X.oscar_gnu (I was directed here from the relevant TestStatus.log):

(t_initf) Read in prof_inparm namelist from: drv_in
(t_initf) Using profile_disable= F
(t_initf) profile_timer= 4
(t_initf) profile_depth_limit= 12
(t_initf) profile_detail_limit= 2
(t_initf) profile_barrier= F
(t_initf) profile_outpe_num= 1
(t_initf) profile_outpe_stride= 0
(t_initf) profile_single_file= F
(t_initf) profile_global_stats= T
(t_initf) profile_ovhd_measurement= F
(t_initf) profile_add_detail= F
(t_initf) profile_papi_enable= F

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0 0x2b71012563ff in ???
#1 0x2b7111b370de in ???
#2 0x2b7113401a7d in ???
#3 0x2b71134025d1 in ???
#4 0x2b710091695a in ???
#5 0x2b71009171a8 in ???
#6 0x2b71144797da in ???
#7 0x2b71008db157 in ???
#8 0x5f5c9d in ???
#9 0x5faecc in ???
#10 0x5fa4a7 in ???
#11 0x5abb37 in ???
#12 0x5abe94 in ???
#13 0x483a08 in ???
#14 0x4ef66a in ???
#15 0x2b70fd86576b in ???
#16 0x2b70fd865ade in ???
#17 0x2b70fd86785c in ???
#18 0x2b70fda1b9e8 in ???
#19 0x2b70fe1ea46e in ???
#20 0x2b70fd63b7cb in ???
#21 0x2b70fd63bb3f in ???
#22 0x2b70fd91a7ad in ???
#23 0x2b70fd92eac2 in ???
#24 0x2b70fd639b08 in ???
#25 0x2b70fdb22a5b in ???
#26 0x2b70fdd5a007 in ???
#27 0x2b70fe19ca94 in ???
#28 0x2b70fd86576b in ???
#29 0x2b70fd865ade in ???
#30 0x2b70fd8671ee in ???
#31 0x2b70fda1ba88 in ???
#32 0x2b70fe1928bd in ???
#33 0x2b70fd63b7cb in ???
#34 0x2b70fd63bb3f in ???
#35 0x2b70fd91a7ad in ???
#36 0x2b70fd92eac2 in ???
#37 0x2b70fd639b08 in ???
#38 0x2b70fdb22a5b in ???
#39 0x2b70fdd5a007 in ???
#40 0x2b70fe19ca94 in ???
#41 0x2b70fd86576b in ???
#42 0x2b70fd865ade in ???
#43 0x2b70fd8671ee in ???
#44 0x2b70fda1ba88 in ???
#45 0x2b70fe1928bd in ???
#46 0x2b70fd63b7cb in ???
#47 0x2b70fd63bb3f in ???
#48 0x2b70fd91a7ad in ???
#49 0x2b70fd92eac2 in ???
#50 0x2b70fd639b08 in ???
#51 0x2b70fdb22a5b in ???
#52 0x2b70fdd5a007 in ???
#53 0x41a0b6 in ???
#54 0x41a79c in ???
#55 0x2b7101242554 in ???
#56 0x40dba8 in ???
#57 0xffffffffffffffff in ???
srun: error: node1911: task 1: Segmentation fault
slurmstepd: error: mpi/pmix_v4: _errhandler: node1911 [0]: pmixp_client_v2.c:212: Error handler invoked: status = -61, source = [slurm.pmix.11740130.0:1]
slurmstepd: error: *** STEP 11740130.0 ON node1911 CANCELLED AT 2023-10-20T14:53:39 ***
srun: Job step aborted: Waiting up to 182 seconds for job step to finish.
srun: error: node1911: tasks 0,2-47: Killed

It seems as though I'm back at square one. Any ideas where to go from here with this?

Thanks!
 

jedwards

CSEG and Liaisons
Staff member
Maybe try this case in debug mode to see if the backtrace is better resolved?
You can run the test
ERI_D_Ln9.f09_g16.X.oscar_gnu

or you can rerun the existing test after
./xmlchange DEBUG=TRUE
rm -fr bld
./case.build
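
A small follow-up sketch, assuming the standard CIME case workflow: after the rebuild, the test is resubmitted from the same case directory.

./case.submit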
 

paulhall

Paul Hall
Member
Re-running the test with:

create_test --debug ERI_D_Ln9.f09_g16.X.oscar_gnu

also results in a segfault, with only slightly more information in the backtrace in the cesm.log file:
(t_initf) Read in prof_inparm namelist from: drv_in
(t_initf) Using profile_disable= F
(t_initf) profile_timer= 4
(t_initf) profile_depth_limit= 12
(t_initf) profile_detail_limit= 2
(t_initf) profile_barrier= F
(t_initf) profile_outpe_num= 1
(t_initf) profile_outpe_stride= 0
(t_initf) profile_single_file= F
(t_initf) profile_global_stats= T
(t_initf) profile_ovhd_measurement= F
(t_initf) profile_add_detail= F
(t_initf) profile_papi_enable= F

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0 0x2ac56e6153ff in ???
#1 0x2ac57ef530de in ???
#2 0x2ac58081da7d in ???
#3 0x2ac58081e5d1 in ???
#4 0x2ac56dcd595a in ???
#5 0x2ac56dcd61a8 in ???
#6 0x2ac5818957da in ???
#7 0x2ac56dc9a157 in ???
#8 0x691829 in PIOc_inq_type
at /oscar/data/ccvstaff/phall1/opt/cesm/cesm2_3_beta14_floe/libraries/parallelio/src/clib/pio_nc.c:485
#9 0x697363 in PIOc_put_att_tc
at /oscar/data/ccvstaff/phall1/opt/cesm/cesm2_3_beta14_floe/libraries/parallelio/src/clib/pio_getput_int.c:59
#10 0x6968f8 in PIOc_put_att_text
at /oscar/data/ccvstaff/phall1/opt/cesm/cesm2_3_beta14_floe/libraries/parallelio/src/clib/pio_nc.c:3189
#11 0x626915 in __pionfatt_mod_MOD_put_att_id_text
at /oscar/data/ccvstaff/phall1/opt/cesm/cesm2_3_beta14_floe/libraries/parallelio/src/flib/pionfatt_mod.F90.in:222
#12 0x626cf4 in __pionfatt_mod_MOD_put_att_desc_text
at /oscar/data/ccvstaff/phall1/opt/cesm/cesm2_3_beta14_floe/libraries/parallelio/src/flib/pionfatt_mod.F90.in:182
#13 0x4c337a in __med_io_mod_MOD_med_io_write_int
at /oscar/data/ccvstaff/phall1/opt/cesm/cesm2_3_beta14_floe/components/cmeps/cime_config/../mediator/med_io_mod.F90:1131
#14 0x5562b2 in __med_phases_restart_mod_MOD_med_phases_restart_write
at /oscar/data/ccvstaff/phall1/opt/cesm/cesm2_3_beta14_floe/components/cmeps/cime_config/../mediator/med_phases_restart_mod.F90:437
#15 0x2ac56ac2476b in ???
#16 0x2ac56ac24ade in ???
#17 0x2ac56ac2685c in ???
#18 0x2ac56adda9e8 in ???
#19 0x2ac56b5a946e in ???
#20 0x2ac56a9fa7cb in ???
#21 0x2ac56a9fab3f in ???
#22 0x2ac56acd97ad in ???
#23 0x2ac56acedac2 in ???
#24 0x2ac56a9f8b08 in ???
#25 0x2ac56aee1a5b in ???
#26 0x2ac56b119007 in ???
#27 0x2ac56b55ba94 in ???
#28 0x2ac56ac2476b in ???
#29 0x2ac56ac24ade in ???
#30 0x2ac56ac261ee in ???
#31 0x2ac56addaa88 in ???
#32 0x2ac56b5518bd in ???
#33 0x2ac56a9fa7cb in ???
#34 0x2ac56a9fab3f in ???
#35 0x2ac56acd97ad in ???
#36 0x2ac56acedac2 in ???
#37 0x2ac56a9f8b08 in ???
#38 0x2ac56aee1a5b in ???
#39 0x2ac56b119007 in ???
#40 0x2ac56b55ba94 in ???
#41 0x2ac56ac2476b in ???
#42 0x2ac56ac24ade in ???
#43 0x2ac56ac261ee in ???
#44 0x2ac56addaa88 in ???
#45 0x2ac56b5518bd in ???
#46 0x2ac56a9fa7cb in ???
#47 0x2ac56a9fab3f in ???
#48 0x2ac56acd97ad in ???
#49 0x2ac56acedac2 in ???
#50 0x2ac56a9f8b08 in ???
#51 0x2ac56aee1a5b in ???
#52 0x2ac56b119007 in ???
#53 0x41d085 in esmapp
at /oscar/data/ccvstaff/phall1/opt/cesm/cesm2_3_beta14_floe/components/cmeps/cime_config/../cesm/driver/esmApp.F90:133
#54 0x41d78c in main
at /oscar/data/ccvstaff/phall1/opt/cesm/cesm2_3_beta14_floe/components/cmeps/cime_config/../cesm/driver/esmApp.F90:7
srun: error: node2301: task 1: Segmentation fault
slurmstepd: error: mpi/pmix_v4: _errhandler: node2301 [0]: pmixp_client_v2.c:212: Error handler invoked: status = -61, source = [slurm.pmix.11746329.0:1]
slurmstepd: error: *** STEP 11746329.0 ON node2301 CANCELLED AT 2023-10-20T20:58:23 ***
srun: Job step aborted: Waiting up to 182 seconds for job step to finish.
srun: error: node2301: tasks 0,2-47: Killed

This looks very similar (though not identical) to the backtrace in the very first message in this thread. Any ideas?

Thanks @jedwards!
 