
forrtl: severe (174): SIGSEGV, segmentation fault occurred

Yao

New Member
Dear helper,

I'm running F1850 at res=f09_f09_mg17 with the Intel compiler and impi.

It often fails at the end of a simulation month, such as 0001-02-28 or 0001-08-31.

The cesm.log file shows:
Opened file F180PhCO2.cam.h2.0001-05-31-00000.nc to write 3
max rss=662.4 MB

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
cesm.exe 0000000002F28A01 Unknown Unknown Unknown
cesm.exe 0000000002F26B3B Unknown Unknown Unknown
cesm.exe 0000000002EB8CB4 Unknown Unknown Unknown
cesm.exe 0000000002EB8AC6 Unknown Unknown Unknown
cesm.exe 0000000002E340F9 Unknown Unknown Unknown
cesm.exe 0000000002E3FDA6 Unknown Unknown Unknown
libpthread-2.17.s 00002B5E885575F0 Unknown Unknown Unknown
libmpifort.so.12. 00002B5E87156270 __I_MPI___intel_a Unknown Unknown
libmpi.so.12.0 00002B5E874FBC5D Unknown Unknown Unknown
libmpi.so.12 00002B5E87506584 ADIOI_GEN_WriteSt Unknown Unknown
libmpi.so.12.0 00002B5E879BC37C Unknown Unknown Unknown
libmpi.so.12 00002B5E879BD3F5 PMPI_File_write_a Unknown Unknown
cesm.exe 0000000002DF5723 Unknown Unknown Unknown
cesm.exe 0000000002DF4B1E Unknown Unknown Unknown
cesm.exe 0000000002DF77BB Unknown Unknown Unknown
cesm.exe 0000000002D5C222 Unknown Unknown Unknown
cesm.exe 0000000002BE2215 piodarray_mp_darr 1318 piodarray.F90.in
cesm.exe 0000000002BF68E5 piodarray_mp_writ 1278 piodarray.F90.in
cesm.exe 0000000002BF7E9F piodarray_mp_writ 221 piodarray.F90.in
cesm.exe 0000000001DA67F5 ncdio_pio_mp_ncd_ 1684 ncdio_pio.F90.in
cesm.exe 0000000001D2F5BF histfilemod_mp_hf 3050 histFileMod.F90
cesm.exe 0000000001D1F13C histfilemod_mp_hi 3553 histFileMod.F90
cesm.exe 0000000001CA7C9E clm_driver_mp_clm 1177 clm_driver.F90
cesm.exe 0000000001C92F92 lnd_comp_mct_mp_l 456 lnd_comp_mct.F90
cesm.exe 0000000000436540 component_mod_mp_ 728 component_mod.F90
cesm.exe 00000000004171AA cime_comp_mod_mp_ 2720 cime_comp_mod.F90
cesm.exe 00000000004361BD MAIN__ 125 cime_driver.F90
cesm.exe 000000000041465E Unknown Unknown Unknown
libc-2.17.so 00002B5E88A88505 __libc_start_main Unknown Unknown
cesm.exe 0000000000414569 Unknown Unknown Unknown

The stack size has been set to unlimited. I'm really confused by this error and would appreciate any suggestions on how to move forward.
 

jedwards

CSEG and Liaisons
Staff member
What version of netcdf or pnetcdf are you using? What filesystem? Please try rerunning with DEBUG=TRUE.
You can change the cam output options so that the history files are written more frequently, thus shortening the time to failure, by editing user_nl_cam and adding the line:
nhtfrq = -1,-1, -1, -1, -1
 

jedwards

CSEG and Liaisons
Staff member
My mistake, that is an error writing a clm history file, not a cam one, so you want to edit user_nl_clm and add the line:
hist_nhtfrq = -1, -1, -1
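
For reference, the two suggestions above (the user_nl_clm line and DEBUG=TRUE) might be applied from the case directory roughly as follows. This is only a sketch: it assumes the standard CIME case scripts (xmlchange, case.build) are present in the current directory, and that a clean rebuild is wanted after switching DEBUG, which is typically the case. Negative history frequencies are interpreted as hours, so -1 requests hourly output.

```shell
# Assumption: this is run from the CESM case directory.
# Append the hourly CLM history setting to user_nl_clm.
echo "hist_nhtfrq = -1, -1, -1" >> user_nl_clm

# Switch to a debug build only if the CIME case tools are available here.
if [ -x ./xmlchange ] && [ -x ./case.build ]; then
    ./xmlchange DEBUG=TRUE
    ./case.build --clean-all   # discard objects from the non-debug build
    ./case.build
else
    echo "CIME case tools not found; run this from the case directory."
fi
```

After the rebuild, resubmitting the case should reach the failing history write much sooner than a monthly write would.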
 

Yao

New Member
Hi jedwards,

My NetCDF version is 4.5.0 and pnetcdf is 1.10.0, on a Lustre filesystem. I will try DEBUG=TRUE. Thanks for your reply.
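
For anyone collecting the same information, a minimal sketch of how the library versions and filesystem type could be checked is shown here. It assumes nc-config and pnetcdf-config are on PATH (they may live under a module or install prefix on your system) and that GNU coreutils stat is available.

```shell
# Report library versions if the helper tools are on PATH.
for tool in nc-config pnetcdf-config; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: $("$tool" --version)"
    else
        echo "$tool: not found on PATH"
    fi
done

# Report the filesystem type of the current directory (GNU coreutils stat).
# Run this from the run or output directory, not from $HOME.
fs_type=$(stat -f -c %T .)
echo "filesystem: $fs_type"
```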
 

jedwards

CSEG and Liaisons
Staff member
Both of those are quite old versions; you might consider updating to netcdf 4.7.4 and pnetcdf 1.12.1.
 

sunnyabc

In-Sun Song
New Member
I see the same symptom when writing cam restart files in the final month of the simulation with cesm2.1.2 on a local Linux cluster in a university lab. According to an IT engineer, an analysis with 'iostat' showed no significant I/O bottleneck when the segfault occurred. CESM2.1.2 is built and run on a small Linux cluster with two nodes (CentOS7), using NetCDF 4.6.3 and pnetcdf 1.10.0. The storage disks are mounted via NFS over an InfiniBand switch. After setting DEBUG=TRUE and setting the restart frequency to daily, the model dies at the second cam restart file (e.g., xxxx.cam.r.2016-12-02-000000.nc). The PIO backend is pnetcdf. The MPI version is mvapich2-2.2. Do you have any further ideas?
 

jedwards

CSEG and Liaisons
Staff member
The netcdf and pnetcdf versions are both very old; the current netcdf version is 4.8.1 and pnetcdf is at 1.12.3. Parallel IO operations are not recommended on NFS filesystems. We recommend either GPFS or Lustre for parallel IO. You didn't indicate which compiler you are using. As a first step, try setting PIO_TYPENAME=netcdf.
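
A sketch of how the PIO backend might be switched from the case directory, assuming the standard CIME xmlquery and xmlchange scripts are present there (the variable pio_type is just a local shell name used for illustration):

```shell
# Assumption: this is run from the CESM case directory.
pio_type=netcdf

if [ -x ./xmlchange ] && [ -x ./xmlquery ]; then
    ./xmlquery PIO_TYPENAME                 # show the current backend
    ./xmlchange PIO_TYPENAME="$pio_type"    # switch from pnetcdf to netcdf
    ./xmlquery PIO_TYPENAME                 # confirm the change
else
    echo "CIME case tools not found; run this from the case directory."
fi
```

If the run then gets past the restart write, that would point toward the pnetcdf path on this filesystem rather than the model itself.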
 

sunnyabc

In-Sun Song
New Member
Hi jedwards, thank you so much for your reply. Let me try setting PIO_TYPENAME=netcdf. I'm using the Intel compiler 19.0.3.199 (20190206). -- In-Sun
 