Segmentation fault when running mksurfdata_esmf

Carter4444

Carter Watson
New Member
Good day everyone!

I have been working fairly consistently night and day the past three weeks to try to get CTSM configured for WRF on my PI's server (I'm a master's student at the University of Georgia). Via brute force and ignorance, I have reached the final stage in compiling all necessary materials for the run, that is obtaining the fsurdat file needed in the ctsm.cfg file. However, I am being blocked by a rather annoying segmentation fault, which I am failing to get around. Below is the code that I used to build the mksurfdata file:

- conda activate ctsm_pylib
- cd $HOME/WRF/CTSM/tools/mksurfdata_esmf
- export PIO=$HOME/WRF/Downloads/ParallelIO
- export PIO_INCLUDE_DIR=$PIO_PATH/include
- export PIO_LIB_DIR=$PIO_PATH/lib
- export NETCDF=$HOME/WRF/Library
- export ESMF_F90COMPILEPATHS=/opt/home/cwat/WRF/esmf/mod/modO/Linux.gfortran.64.mpiuni.default
- export MPILIB=mpich
- export FFLAGS="$FFLAGS -fallow-argument-mismatch"
- vi src/CMakeCache.txt # changed the two .so PIO files to .a. Changed ${NETCDF} to $ENV{NETCDF}.
- vi src/nanMod.F90 # fix the BOZ syntax.
- vi src/mksurfdata.F90 # I made the following changes:
#Line 271 ##added an 8 after the I [2(a,I) —> 2(a,I8)]
#LINE 289 ##added full path to pio_iotype.txt file [after compiling the first time it said it was unable to find this file. After the fix, it no longer says this.]
#Line 295 ##added an 8 after the i [(i) —> (i8)]
#Line 315 ##error that the line was too long—I split lines in two [& *returnkey* &]
#Line 328 ##add 8s after the Is in this line [(a, I, a, I) —> (a, I8, a, I8)]
- ./gen_mksurfdata_build --machine ctsm-build --verbose

The build of this worked (the build log is attached), but I am unsure as to whether or not my little bug fixes could be contributing to the segmentation fault or not. Either way, following this I was able to successfully create the surfdata input file (attached) and make a jobscript (also attached). However, upon running ./mksurfdata_jobscript_single.sh, it gives me the following error:

(ctsm_pylib) cwat@prospero:~/WRF/CTSM/tools/mksurfdata_esmf$ ./mksurfdata_jobscript_single.sh
Attempting to initialize control settings .....
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 2578958 RUNNING AT prospero
= EXIT CODE: 139
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
real 0m5.461s
user 0m3.531s
sys 0m0.752s
Error running for namelist /opt/home/cwat/WRF/CTSM/tools/mksurfdata_esmf/surfdata_NA_SSP2-4.5_2023_78pfts_c240526.namelist
I've tried to dig into this segmentation fault with gdb and Valgrind, but because the jobscript is a Bash script that launches an executable built with CMake, Valgrind got confused and gdb straight-up didn't work. I've spent the past several hours trying to recompile mksurfdata while carefully tinkering with the F90 files, but my knowledge of this stuff is very limited... I'm studying ecology, after all.

Attached I have the mksurfdata build log, a copy of the segmentation fault error (as also seen above), a copy of the jobscript, mksurfdata_esmf/src CMakeLists.txt file, and the surfdata namelist file. If anyone would like my domain file, my mesh file, and my SCRIP file, or anything else, I would be happy to send/email them. Any/all expertise would be greatly appreciated!! :)
 

Attachments

  • Build mksurfdata log.txt (36.2 KB)
  • CMakeLists.txt (2.2 KB)
  • mksurfdata_esmf segmentation fault.txt (1.4 KB)
  • mksurfdata_jobscript_single.sh.txt (1.2 KB)
  • surfdata_NA_SSP2-4.5_2023_78pfts_c240526.namelist.txt (6.2 KB)

oleson

Keith Oleson
CSEG and Liaisons
Staff member
Is there a "PET" log in that directory? If so, does that provide any useful information?
 

slevis

Moderator
Staff member
And/Or possibly a mksurfdata.o1234567 (some 7-digit number)? Or is this where you see the BAD TERMINATION message?
 

Carter4444

Carter Watson
New Member
Thank you both for your responses! There is no additional file created. The error is produced nearly immediately after running the program. Attached I have a photo of my terminal showing the contents of the command, the error, and the files in the directory.

Let me know if there are any other resources I can provide, I really appreciate the help!
 

Attachments

  • Screenshot 2024-05-28 at 4.47.16 PM.png (641.3 KB)

slevis

Moderator
Staff member
Does the system you're on have the option of running jobs in batch mode, i.e. you submit to a batch queue and the job runs when its turn comes?

I only have experience running mksurfdata_esmf in batch mode, and I think that the ...jobscript...sh file is written assuming that you will submit the job to a batch queue. Look inside the ...jobscript...sh file for the mpi command that it uses. To run interactively, you may need to modify that line in the script.
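
A quick way to locate that line is to search the jobscript for common MPI launcher names. The sketch below is only illustrative: the function name find_mpi_launch_lines and the list of launchers are assumptions, not something taken from the CTSM tooling, and the script on your machine may use a launcher that is not in the list.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every line (with its line number) of a jobscript
# that mentions a common MPI launcher as a whole word.
# The launcher list is an assumption; extend it for your site if needed.
find_mpi_launch_lines() {
  local script="$1"
  grep -nwE 'mpirun|mpiexec|mpibind|srun|aprun' "$script"
}
```

For example, find_mpi_launch_lines mksurfdata_jobscript_single.sh prints the matching lines, and the launcher on those lines is what would need to change for an interactive run.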
 

slevis

Moderator
Staff member
On our system (derecho) we submit the job to a batch queue by typing "qsub mksurfdata_jobscript_single.sh"
 

Carter4444

Carter Watson
New Member
I originally tried to use "qsub" (offered through the SGE package), but I kept getting an installation error that, according to their message boards, has been around for nearly a decade without a fix. So, I thought I could change course and just make the shell script executable. What should I edit in the jobscript? My system recognizes "time mpirun".
 

slevis

Moderator
Staff member
I guess you could try that (mpirun) if you have not already. And you may need to talk to the system administrators of the machine that you're on.

Otherwise, @erik do you have suggestions for submitting mksurfdata_jobscript_single.sh in interactive mode instead of batch?
 

Carter4444

Carter Watson
New Member
So, I tried to go around the jobscript altogether by doing the following:
- cd /opt/home/cwat/WRF/CTSM/python/ctsm/toolchain
- . /opt/home/cwat/WRF/CTSM/tools/mksurfdata_esmf/tool_bld/.env_mach_specific.sh
- time mpirun /opt/home/cwat/WRF/CTSM/tools/mksurfdata_esmf/tool_bld/mksurfdata < /opt/home/cwat/WRF/CTSM/tools/mksurfdata_esmf/surfdata_NA_SSP2-4.5_2023_78pfts_c240526.namelist

I got the same error, minus the "Error running for namelist /opt/home/cwat/WRF/CTSM/tools/mksurfdata_esmf/surfdata_NA_SSP2-4.5_2023_78pfts_c240526.namelist" line, which is built into the bash script. Is it possible that the mksurfdata build was corrupted? I had to make a number of changes to the mksurfdata.F90 file for the build to succeed... could that be what is throwing it off? I am unsure why building mksurfdata required so much additional editing of the dependent Fortran files.
 

oleson

Keith Oleson
CSEG and Liaisons
Staff member
I'm not sure what to do at this point, sorry. It's difficult to support installation on a machine we don't have access to. I can offer to try to produce a surface dataset for you if you can provide the required input datasets. That won't help you in the long-term necessarily but maybe it will help you get going with the next step in your setup.
I expect I'd need the mesh file (/opt/home/cwat/WRF/CTSM/tools/site_and_regional/esmfmeshfile.nc?) and the namelist (surfdata_NA_SSP2-4.5_2023_78pfts_c240526.namelist). Maybe I can retrieve them from an ftp site or something?
 

Carter4444

Carter Watson
New Member
That would be amazing, thank you! I will work to get you those files. In the meantime, I was able to make a bash script that produced a rundown of the specific segmentation faults taking place. I have copied and pasted this file below. Any thoughts on this would be exceptionally helpful:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff3836640 (LWP 13066)]
[New Thread 0x7ffff3035640 (LWP 13067)]
Thread 1 "mksurfdata" received signal SIGSEGV, Segmentation fault.
0x0000555555e06d2d in get_centerCoords_from_ESMFMesh_file(int, int, char*, long long, long long, int, int*, int&, double*&) ()
#0 0x0000555555e06d2d in get_centerCoords_from_ESMFMesh_file(int, int, char*, long long, long long, int, int*, int&, double*&) ()
#1 0x0000555555b965c5 in ESMCI_mesh_create_from_ESMFMesh_file(int, char*, bool, ESMC_CoordSys_Flag, ESMCI::DistGrid*, ESMCI::Mesh**) ()
#2 0x0000555555b9830c in ESMCI_mesh_create_from_file(char*, ESMC_FileFormat_Flag, bool, bool, ESMC_CoordSys_Flag, ESMC_MeshLoc_Flag, char*, ESMCI::DistGrid*, ESMCI::DistGrid*, ESMCI::Mesh**, int*) ()
#3 0x0000555555b946c3 in ESMCI::MeshCap::meshcreatefromfilenew(char*, ESMC_FileFormat_Flag, bool, bool, ESMC_CoordSys_Flag, ESMC_MeshLoc_Flag, char*, ESMCI::DistGrid*, ESMCI::DistGrid*, int*) ()
#4 0x0000555555b994b2 in c_esmc_meshcreatefromfile_ ()
#5 0x0000555555922087 in __esmf_meshmod_MOD_esmf_meshcreatefromfile ()
#6 0x00005555555b82d3 in mklaimod::mklai (file_mesh_i=..., file_data_i=..., mesh_o=..., pioid_o=..., rc=18, _file_mesh_i=512, _file_data_i=512) at /opt/home/cwat/WRF/CTSM/tools/mksurfdata_esmf/src/mklaiMod.F90:92
#7 0x00005555556083c9 in mksurfdata () at /opt/home/cwat/WRF/CTSM/tools/mksurfdata_esmf/src/mksurfdata.F90:425
#8 0x00005555556148f4 in main (argc=1, argv=0x7fffffffe177) at /opt/home/cwat/WRF/CTSM/tools/mksurfdata_esmf/src/mksurfdata.F90:88
#9 0x00007ffff4c78d90 in __libc_start_call_main (main=main@entry=0x5555556148bb <main>, argc=argc@entry=1, argv=argv@entry=0x7fffffffde58) at ../sysdeps/nptl/libc_start_call_main.h:58
#10 0x00007ffff4c78e40 in __libc_start_main_impl (main=0x5555556148bb <main>, argc=1, argv=0x7fffffffde58, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffde48) at ../csu/libc-start.c:392
#11 0x00005555555859b5 in _start ()
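
For anyone trying to reproduce a backtrace like the one above, here is a minimal sketch, assuming gdb is installed and the executable can be started directly, without the MPI launcher. It is not the script used in this post; the helper name gdb_backtrace_cmd is hypothetical. The helper only prints the gdb command so it can be inspected before it is run, and it assumes the paths contain no spaces.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a non-interactive gdb command line that runs an
# executable with a namelist on stdin and prints a backtrace if it crashes.
#   -batch  : exit gdb after the listed commands have been executed
#   -ex CMD : execute a gdb command ("run < file" redirects stdin, "bt" = backtrace)
gdb_backtrace_cmd() {
  local exe="$1" namelist="$2"
  printf 'gdb -batch -ex "run < %s" -ex "bt" %s\n' "$namelist" "$exe"
}
```

For example, gdb_backtrace_cmd tool_bld/mksurfdata surfdata.namelist prints the command to run. File names and line numbers such as those in frames #6 to #8 only appear if the code was compiled with debug symbols.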
 

oleson

Keith Oleson
CSEG and Liaisons
Staff member
It seems to be crashing while trying to make the LAI/SAI/HEIGHT fields (line 92 of mklaiMod.F90), which is fairly far along in the overall process of creating a surface dataset. You should be getting a log file (*.log) in that directory, but it seems you are not, right? Not sure why. I guess I'd check that the input files are valid netCDF files and are not corrupted in some way:

/opt/home/cwat/WRF/CTSM/ctsm_build_dir/inputdata/lnd/clm2/rawdata/pftcftdynharv.0.25x0.25.LUH2.histsimyr1850-2015.c20230226/mksrf_pftlaihgt_ctsm52_histLUH2_2005.c20230226.nc

/opt/home/cwat/WRF/CTSM/ctsm_build_dir/inputdata/lnd/clm2/mappingdata/grids/UNSTRUCTgrid_0.25x0.25_nomask_cdf5_c200129.nc

Another possibility is that you are running out of memory on your machine...
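
As a first, very rough validity check on the two files above, one can look at the leading bytes: netCDF classic, 64-bit-offset, and CDF5 files begin with "CDF" followed by the byte 1, 2, or 5, and netCDF-4 files begin with the HDF5 signature. The sketch below assumes only od and tr are available; the function name nc_kind is hypothetical. It checks the header only, so a truncated download would still pass, and dumping the header with ncdump -h is a reasonable next step.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify a file by its first four bytes.
# This inspects the file signature only; it does not validate the contents.
nc_kind() {
  local magic
  # Hex dump of the first 4 bytes with spaces and newlines stripped.
  magic="$(od -An -tx1 -N4 "$1" 2>/dev/null | tr -d ' \n')"
  case "$magic" in
    43444601) echo "classic" ;;        # "CDF" followed by 0x01
    43444602) echo "64bit-offset" ;;   # "CDF" followed by 0x02
    43444605) echo "cdf5" ;;           # "CDF" followed by 0x05
    89484446) echo "netcdf4-hdf5" ;;   # 0x89 followed by "HDF"
    *)        echo "unknown" ;;        # unreadable file or not netCDF
  esac
}
```

Running nc_kind on each of the two paths should print one of the known kinds; "unknown" would point to a bad or incomplete download.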
 

Carter4444

Carter Watson
New Member
Thank you for the insight! It's strange that there's no .log file that's being made.

I looked into both of the input files via ncview:
- "mksrf_pftlaihgt_ctsm52..." appeared to be in reasonable shape, though I'm not entirely sure what to be looking for, as this is a massive file. It gave me a warning upon opening that says "Note: udunits: unknown units for pft: "index"", though it doesn't seem like this should be an issue.
- "UNSTRUCTgrid_0.25x0.25..." was difficult to view and was slow to open, despite the small file size. Also, it contains the variable "centerCoords" (which is referenced in the SIGSEGV), but there was no discernible evidence that it wasn't operating properly. To double check, I redownloaded it, this time from https://svn-ccsm-inputdata.cgd.ucar.edu/trunk/inputdata/. The error persisted. I aim to do the same with the first file, but it is extremely large, so ideally I will try other options first.

It is perhaps serendipitous that you brought up my machine's memory. It has 96 GB of RAM at the moment, but we have been talking about adding another 128 GB or so. If you think 96 GB is too little for this process, we will add more memory right away. I really appreciate all your help and insights. If you have any other thoughts, please let me know. :)
 

Carter4444

Carter Watson
New Member
Hello, long time no talk! I really appreciate all your help! So, just a few updates:
  • I have been working with the admin of my server to install some batch software (SLURM) that can be used to manage the process.
  • As best I can discern, you are 100% correct that my machine is running out of memory during the file creation process. After running "ulimit -s unlimited", the process can now run for nearly 50 minutes before crashing. Watching memory usage in htop, I had no idea it used so much. The hope is that SLURM will be able to allocate resources better so it won't get close to the 96 GB that I have.
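
One caveat on the second point: ulimit -s sets the per-process stack size limit, not total RAM, so the longer run after raising it suggests the first crash may have been stack-related and separate from overall memory use. Below is a minimal sketch for applying the higher limit only to a single command, assuming bash; the wrapper name with_max_stack is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run a command with the soft stack limit raised to the
# hard limit. The subshell keeps the change from leaking into the calling shell.
with_max_stack() {
  (
    # "ulimit -H -s" prints the hard stack limit (a size in KB or "unlimited").
    ulimit -S -s "$(ulimit -H -s)" 2>/dev/null \
      || echo "warning: could not raise the stack limit" >&2
    "$@"
  )
}
```

For example, with_max_stack ./mksurfdata_jobscript_single.sh runs the jobscript under the raised limit without changing the limits of the login shell.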
In the meantime (because we are running into issues getting SLURM installed), I would really like to take you up on your offer to produce a surface dataset. I have two mesh files with four namelists (you only have to make one, just whatever you've got time for). They can be found below:
For the first mesh file:
  • carterwatson.net/ctsmerror/esmfmeshfile_d01.nc
  • carterwatson.net/ctsmerror/surfdata_NA_SSP2-4.5_2018_78pfts_c240625_d01.namelist
  • carterwatson.net/ctsmerror/surfdata_NA_SSP2-4.5_2019_78pfts_c240625_d01.namelist
For the second mesh file:
  • carterwatson.net/ctsmerror/esmfmeshfile_d02.nc
  • carterwatson.net/ctsmerror/surfdata_NA_SSP2-4.5_2018_78pfts_c240625_d02.namelist
  • carterwatson.net/ctsmerror/surfdata_NA_SSP2-4.5_2019_78pfts_c240625_d02.namelist
It is my understanding that you will likely have to recreate the namelist files because of how specific they are to my local directories (where the input data is stored). To do so, I have included a file outlining the specific commands I ran to create the files (this does not include my running of ./download_input_data, which was needed to draw the files from NCAR):
  • carterwatson.net/ctsmerror/steps_followed.txt
Please let me know if you are able to help! Have a wonderful day!
 

oleson

Keith Oleson
CSEG and Liaisons
Staff member
That site is blocked as not secure. Do you have an ftp site or something associated with your university?
 

oleson

Keith Oleson
CSEG and Liaisons
Staff member
You can find the namelists, log files, and surface datasets here:

ftp://ftp.cgd.ucar.edu/pub/oleson/surfdata_NA
 

Carter4444

Carter Watson
New Member
Wonderful, thank you! I am unfortunately having difficulty accessing the files (it's not accepting my ucar username and password, and it won't let me log in as a guest)... is there another way you can upload these files for me to receive? I'm terribly sorry for being such a bother, but I really appreciate your help.
 

oleson

Keith Oleson
CSEG and Liaisons
Staff member
You mentioned a ucar username and password. Do you have access to Derecho/Casper?
 