Porting CESM: Case builds, but fails with segfault

paulhall

Paul Hall
Member
I'm attempting to port CESM to our local cluster (Oscar, RHEL7.9, slurm), but I am running into a segfault I can't figure out. I was hoping someone might have some ideas about what is going wrong (or how to figure out exactly where things are going off the rails).

I'm using CESM2_3_beta14 and building with GNU compilers (10.2), OpenMPI (4.0.7), Python (3.9.0), ESMF (8.5.0), and NetCDF (4.7.4). As a simple test, I am attempting a short run (5 days) with the default settings for the CMOM compset on a T62_t061 set of grids using the NUOPC driver. I have run this case successfully with this version of CESM on Cheyenne.

The case builds successfully and starts to run, but appears to stall out 2-3 minutes after the run starts (the model doesn't fail outright; it keeps running but stops writing to log files or generating other output). The only error message I can find is in the cesm.log file for the run, which indicates a segmentation fault (Program received signal SIGSEGV: Segmentation fault - invalid memory reference.). Building with DEBUG=TRUE generates slightly more information in the log file, suggesting that the issue may lie in the parallelio library (see attached file). I've run into the same issue attempting other scenarios that I had previously run successfully on Cheyenne.

Any ideas about what I'm missing in the porting process (e.g., compiler flags?) or how to go about debugging this?

Thanks!
 

Attachments

  • cesm.log.11590434.231010-120909.txt
    3.2 KB · Views: 18

jedwards

CSEG and Liaisons
Staff member
Update your parallelio (pio) library to version 2.6.2. To do this, edit the file Externals.cfg and change
[parallelio]
-tag = pio2_5_10
+tag = pio2_6_2

Then run manage_externals/checkout_externals to get the updated library.
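
Roughly, assuming a standard CESM checkout, the full sequence from the top of the source tree would be:

# after changing the tag in Externals.cfg as above
./manage_externals/checkout_externals            # fetch ParallelIO at the pio2_6_2 tag
./manage_externals/checkout_externals --status   # optional: confirm the externals are up to date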
 

paulhall

Paul Hall
Member
Thanks Jim! I tried updating to pio2_6_2, as you suggested. However, the model now fails to build, with the error message:

ERROR: /oscar/data/ccvstaff/phall1/opt/cesm/cesm2_3_beta14_floe/cime/CIME/build_scripts/buildlib.pio FAILED

Looking at the pio.bldlog, I see a lot of warnings plus the following errors:

Error: Symbol pio_inq_var_filter_ids referenced at (1) not found in module pio_nf
/oscar/data/ccvstaff/phall1/opt/cesm/cesm2_3_beta14_floe/libraries/parallelio/src/flib/pio.F90:90:33:

90 | PIO_inq_var_filter_ids , &
| 1
Error: Symbol pio_inq_var_filter_info referenced at (1) not found in module pio_nf
/oscar/data/ccvstaff/phall1/opt/cesm/cesm2_3_beta14_floe/libraries/parallelio/src/flib/pio.F90:91:33:

91 | PIO_inq_var_filter_info , &
| 1
Error: Symbol pio_inq_filter_avail referenced at (1) not found in module pio_nf
/oscar/data/ccvstaff/phall1/opt/cesm/cesm2_3_beta14_floe/libraries/parallelio/src/flib/pio.F90:92:33:

92 | PIO_inq_filter_avail , &
| 1
Error: Symbol pio_def_var_szip referenced at (1) not found in module pio_nf
make[2]: *** [src/flib/CMakeFiles/piof.dir/pio.F90.o] Error 1
make[1]: *** [src/flib/CMakeFiles/piof.dir/all] Error 2
make: *** [all] Error 2

Any ideas?
 

paulhall

Paul Hall
Member
Switching to netcdf 4.9.0 produces a variety of new errors on build (see attached pio.bldlog). Do I need to modify my compiler flags in some way to get around these errors? Or is it something else entirely? Thanks!
 

Attachments

  • pio.bldlog.231010-161706.txt
    48.4 KB · Views: 4

jedwards

CSEG and Liaisons
Staff member
You need to add -std=gnu99 to your CFLAGS. There are a lot of warnings in the log because you are passing Fortran flags to the C compiler.
 

paulhall

Paul Hall
Member
I'm pretty sure I am using -std=gnu99 in the CFLAGS. See line 79 of the pio.bldlog above, for example (-std=gnu99 is the first flag listed for CFLAGS). I did build with DEBUG = TRUE and INFO_DBUG = 2. Could that be the source of all the warnings? Even if that is the case, it shouldn't be causing the actual errors (not just warnings), should it?

If I wanted to change the CFLAGS, do I just add a gnu.cmake file to my ~/.cime directory and make the changes there (my impression is that config_compilers.xml has been deprecated)?
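
For example, my (possibly wrong) understanding is that a ~/.cime/gnu.cmake gets read on top of the default macros, so something along these lines would adjust the C flags (the DEBUG block here is just illustrative):

string(APPEND CFLAGS " -std=gnu99")
if (DEBUG)
  string(APPEND CFLAGS " -g")
endif()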

Thanks!
 

paulhall

Paul Hall
Member
Following up on fortran flags being passed to the c compiler, it looks like two of the culprits (-fbacktrace and -fcheck=bounds) are appended to CFLAGS for DEBUG cases in the default gnu.cmake file in ccs_config. Should they be removed?
 

paulhall

Paul Hall
Member
Following up, I created a copy of the default gnu.cmake file, placed it in my ~/.cime directory, and removed the problematic flags (-fbacktrace, -fcheck=bounds, -ffpe-trap=invalid,zero,overflow) from the CFLAGS options appended for DEBUG, and then rebuilt. This eliminated the warnings showing up when building parallelio, but it did not eliminate the errors that are causing the build to fail.

It looks like the issue may be an undeclared variable in pio_nc4.c (see the attached pio.bldlog). Any thoughts as to how to fix this?

Thanks!
 

Attachments

  • pio.bldlog.231011-115446.txt
    42.9 KB · Views: 1

jedwards

CSEG and Liaisons
Staff member
Yes, there is a bug in pio_nc4.c in debug mode. Remove the marked code from the file:

diff --git a/src/clib/pio_nc4.c b/src/clib/pio_nc4.c
index 3074bbd9..f8158747 100644
--- a/src/clib/pio_nc4.c
+++ b/src/clib/pio_nc4.c
@@ -1336,10 +1336,6 @@ PIOc_def_var_filter(int ncid, int varid, unsigned int id, size_t nparams, unsign
int mpierr = MPI_SUCCESS, mpierr2; /* Return code from MPI function codes. */

PLOG((1, "PIOc_def_var_filter ncid = %d varid = %d id = %d nparams = %d", ncid, varid, id, nparams));
-#ifdef DEBUG
- for(i=0; i<nparams; i++)
- PLOG(1, " param %d %d\n",i, params);
-#endif

/* Get the file info. */
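
If it is easier than editing by hand, you can save that hunk to a patch file (keeping the diff formatting intact) and apply it in the parallelio checkout, roughly:

cd libraries/parallelio
git apply /path/to/pio_nc4_fix.patch   # the patch file name is just an example; or delete the marked lines by hand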
 

paulhall

Paul Hall
Member
Thanks @jedwards

Commenting out those lines of code in pio_nc4.c appears to lead to additional errors. I'll work on eliminating those so I can run in DEBUG mode.

In the meantime, I tried building without DEBUG (since the PIO bug seems to be related to debug mode). The case built, but segfaults on running, as before. Without DEBUG mode the backtrace in the log file doesn't contain any useful information. If nothing else, this suggests that upgrading pio may not be the ultimate fix for whatever the underlying problem is.

Any other thoughts about fixes to try while I work on getting DEBUG mode to work would be appreciated. Thanks again for your help with this!
 

jedwards

CSEG and Liaisons
Staff member
Just for clarity: The lines above that you should comment out were only those with the - at the beginning.

For the gnu compiler, try adding -g1 to the cflags; this should give you the traceback info without full DEBUG mode.
Also if you have lines like:
/var/run/palsd/c6e7d570-7f83-43d3-9649-877fc30e6fe1/files/cesm.exe() [0x9ace2a6]
in your output you can sometimes get line information using
$ addr2line -e bld/cesm.exe 0x9ace2a6
/glade/work/jedwards/sandboxes/cesm2.2/cime/src/externals/pio2/src/clib/pioc_support.c:534
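
If you are using the ~/.cime/gnu.cmake mechanism you mentioned, appending the flag there should work, I believe:

string(APPEND CFLAGS " -g1")
string(APPEND FFLAGS " -g1")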
 

paulhall

Paul Hall
Member
@jedwards

I see. Thanks for the clarification. As you deduced, I charged ahead and just commented out that entire code block. Commenting out just the lines you intended seems to fix the bug with PIO in debug mode (i.e., it builds!).

Unfortunately, even with the new PIO, the job still fails with the same error as originally (segfault). I'm attaching the cesm.log file. Any suggestions for how to determine exactly what is going wrong are appreciated.

Thanks!
 

Attachments

  • cesm.log.11604625.231011-171223.txt
    5.9 KB · Views: 3

jedwards

CSEG and Liaisons
Staff member
I have no idea - it's failing really early in the run. Is the stack size set to unlimited (ulimit -s)? Are you using enough tasks for the job?
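
Quick checks, roughly:

ulimit -s           # run inside the job environment, not just the login shell; it should report unlimited
./xmlquery NTASKS   # run in the case directory to see the task counts actually being used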
 

paulhall

Paul Hall
Member
Thanks @jedwards

The stack size is unlimited, but looking at these kinds of settings is probably a good idea.

I'm not sure about the number of tasks... I haven't been changing NTASKS from the defaults. I will look into it.

Comparing the machine settings for a successful run on Cheyenne to the ones I am currently using on Oscar, the main difference that stands out to me is OMP_STACKSIZE. I don't believe I'm building or running with threads (i.e., ompthreads=1 on Cheyenne), so it's not clear to me why this would make a difference, but it is set to 1024M on Cheyenne and just 64M on my local cluster. Is it possible that OMP_STACKSIZE could be an issue? And if so, do you have any recommendations for a good default setting?
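
For reference, the relevant block in my config_machines.xml entry looks roughly like this, so I assume bumping it would just mean changing the value (e.g., to 1024M to match Cheyenne):

<environment_variables>
  <env name="OMP_STACKSIZE">64M</env>
</environment_variables>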
 

jedwards

CSEG and Liaisons
Staff member
The defaults are based on Cheyenne - they may not be suitable for your system. Maybe you should make sure that scripts_regression_tests.py works on your system, or try some simple cases like ./create_test SMS.f19_g17.X
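
Roughly (the exact location of scripts_regression_tests.py moves around between CIME versions):

cd $CESMROOT/cime/CIME/tests && python scripts_regression_tests.py
cd $CESMROOT/cime/scripts && ./create_test SMS.f19_g17.X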
 

paulhall

Paul Hall
Member
Hi @jedwards

I have been following your advice and attempting to run the regression test scripts for my port of CESM to Oscar, our local cluster. Running scripts_regression_tests.py results in a total of 253 tests, with 6 failures and 15 skipped (see the attached txt file with output from scripts_regression_tests.py).

I'm trying to parse the output from the tests so I can address the issues, and I was wondering if you had any thoughts or advice on how to identify and address the underlying problem(s). Thanks in advance for any insight you can provide!
 

Attachments

  • testlog1.txt
    102.6 KB · Views: 1

jedwards

CSEG and Liaisons
Staff member
This looks pretty good. I think perhaps most of these failures are because you didn't install cprnc?
You need to build the cprnc app and install it in a location pointed to in the config_machines.xml file.
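
From memory, something along these lines (the exact CMake invocation varies with CIME version, and you may need to point CMake at your NetCDF install):

cd $CESMROOT/cime/CIME/non_py/cprnc
mkdir build && cd build
cmake .. && make
# then set CCSM_CPRNC in your config_machines.xml entry to the full path of the resulting cprnc executable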
 

paulhall

Paul Hall
Member
Thanks! I didn't realize that cprnc needed to be built. Looking at the README in the cprnc directory included with cime ($CESMROOT/cime/CIME/non_py/cprnc), it appears that the location of cprnc in the cime file hierarchy has moved around and the instructions for building it, at least for the version I am using (cime6.0.125), no longer apply. For example, the build instructions assume you are running from CIME/data/cprnc and say to:

export CIMEROOT=../..
MPILIB=mpi-serial source ./.env_mach_specific.sh
../configure --macros-format=CMake --mpilib=mpi-serial

But there is no configure executable or .env_mach_specific.sh file in the relevant directory ($CESMROOT/cime/CIME/non_py). Am I missing something?

Thanks!
 