I'm attempting to port CESM to our local cluster (Oscar, RHEL 7.9, Slurm), but I'm running into a segfault I can't figure out. I was hoping someone might have ideas about what is going wrong, or how to pin down exactly where things go off the rails.
I'm using CESM2_3_beta14 and building with GNU compilers (10.2), OpenMPI (4.0.7), Python (3.9.0), ESMF (8.5.0), and NetCDF (4.7.4). As a simple test, I am attempting a short run (5 days) with the default settings for the CMOM compset on a T62_t061 set of grids using the NUOPC driver. I have run this case successfully with this version of CESM on Cheyenne.
The case builds successfully and starts to run, but appears to stall 2-3 minutes after the run starts: the job doesn't exit, it keeps running, but it stops writing to the log files and generates no other output. The only error message I can find is in the cesm.log file for the run, which reports a segmentation fault (Program received signal SIGSEGV: Segmentation fault - invalid memory reference.). Rebuilding with DEBUG=TRUE produces slightly more information in the log file, which suggests that the issue may lie in the ParallelIO (PIO) library (see attached file). I've hit the same problem with other cases that I had previously run successfully on Cheyenne.
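A minimal Python sketch for pulling the fault and the lines that follow it out of cesm.log, assuming the marker text is exactly the message quoted above. The function name `find_segfaults` and the `context` parameter are illustrative only and are not part of CESM or CIME.

```python
from typing import List, Tuple

# Marker text taken from the error message quoted in the post.
SEGV_MARKER = "Program received signal SIGSEGV"


def find_segfaults(log_lines: List[str], context: int = 10) -> List[Tuple[int, List[str]]]:
    """Return (1-based line number, following lines) for each SIGSEGV report.

    `context` is how many lines after the marker to keep; with a DEBUG build
    these are typically where any backtrace frames would appear.
    """
    hits = []
    for i, line in enumerate(log_lines):
        if SEGV_MARKER in line:
            following = [l.rstrip("\n") for l in log_lines[i + 1 : i + 1 + context]]
            hits.append((i + 1, following))
    return hits
```

To use it, read the log with something like `open(path, errors="replace").readlines()` and pass the result in. Since every MPI rank may write its own report, the first hit is usually the most useful one to read.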
Any ideas about what I'm missing in the porting process (e.g., compiler flags?) or how to go about debugging this?
Thanks!