BWmaHIST mpirun fail

engeir

Eirik Enger
New Member
Description
When I try to run the BWmaHIST compset, the model fails after submission with the error:
Code:
ERROR: RUN FAIL: Command 'mpirun  /cluster/work/users/een023/cesm/e_BWmaHIST_vanilla/bld/cesm.exe  >> cesm.log.$LID 2>&1 ' failed
See log file for details: /cluster/work/users/een023/cesm/e_BWmaHIST_vanilla/cesm.log.3831477.211221-142951
In the log file (attached), the lines where the error seems to occur are
Code:
Opened existing file
/cluster/projects/nn9348k/cesm/input-data/atm/cam/chem/emis/CMIP6_emissions_175 0_2015_2deg/emissions-cmip6_so4_a2_contvolcano_vertical_850-5000_1.9x2.5_c20190 417.nc          44
pio_support::pio_die:: myrank=          -1 : ERROR: ionf_mod.F90:         235 :
NetCDF: Attempting netcdf-3 operation on netcdf-4 file
pio_support::pio_die:: myrank=          -1 : ERROR: ionf_mod.F90:         235 :
NetCDF: Attempting netcdf-3 operation on netcdf-4 file
pio_support::pio_die:: myrank=          -1 : ERROR: ionf_mod.F90:         235 :
NetCDF: Attempting netcdf-3 operation on netcdf-4 file
pio_support::pio_die:: myrank=          -1 : ERROR: ionf_mod.F90:         235 :
NetCDF: Attempting netcdf-3 operation on netcdf-4 file

I see that the file ionf_mod.F90 is located at
Code:
$CESMROOT/cime/src/externals/pio1/pio/ionf_mod.F90
Line 235 of that file is a call to check_netcdf, which is defined in pio_utils.F90 in the same directory.

I tried replacing the .nc file with a different version; that is, in user_nl_cam I include
Code:
/cluster/projects/nn9348k/cesm/input-data/atm/cam/chem/emis/CMIP6_emissions_1750_2015_2deg/emissions-cmip6_so4_a2_contvolcano_vertical_850-5000_1.9x2.5_c20190417-v_nc4.nc
where this file was created with the command
Code:
nccopy -k nc4 infile.nc infile-v_nc4.nc
This makes the error for that specific file go away, but an equivalent error pops up for a different .nc file, and fixing that next .nc file makes the error occur for a third, and so on.

At which point I am lost.



Steps to reproduce:
Bash:
./create_newcase --case /cluster/home/een023/model/CESM/cesm2.1.3/CESM/cime/cases/e_BWmaHIST_vanilla --res f19_g17 --compset=BWmaHIST --mach fram
# cd into case directory...
./case.setup
./case.build
./case.submit
 

Attachments

  • cesm.log.3831477.211221-142951.txt (44.1 KB)
  • README.case.txt (2.7 KB)
  • version_info.txt (6.4 KB)

jedwards

CSEG and Liaisons
Staff member
Files should NOT be in netcdf4 format. Convert to cdf5 using nccopy -k cdf5. Also, the logging creates a problem here: the error belongs to the next file to be opened, not the previous one.
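Since the log does not name the offending file, one way to avoid fixing files one at a time is to check the format of every candidate input file up front. The following is only a sketch, not something from the CESM tooling: it classifies files by their leading magic bytes (`CDF\x01`, `CDF\x02`, `CDF\x05` for the classic family, `\x89HDF` for netCDF-4/HDF5) and prints a suggested `nccopy -k cdf5` command for each netCDF-4 file. The function names and the `-cdf5.nc` output suffix are hypothetical, and it assumes the HDF5 signature sits at the start of the file, which is the usual case. Where the netCDF utilities are installed, `ncdump -k file.nc` reports the same information.

```shell
# Classify a netCDF file by its first four bytes.
# (Sketch: assumes the HDF5 signature is at offset 0.)
nc_kind() {
    magic=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
    case "$magic" in
        43444601) echo "classic" ;;        # "CDF" + 0x01
        43444602) echo "64-bit offset" ;;  # "CDF" + 0x02
        43444605) echo "cdf5" ;;           # "CDF" + 0x05
        89484446) echo "netcdf4" ;;        # 0x89 + "HDF"
        *)        echo "unknown" ;;
    esac
}

# Print (but do not run) a conversion command for each netCDF-4 file.
# The "-cdf5.nc" output name is a hypothetical naming choice.
suggest_conversion() {
    for f in "$@"; do
        if [ "$(nc_kind "$f")" = "netcdf4" ]; then
            echo "nccopy -k cdf5 $f ${f%.nc}-cdf5.nc"
        fi
    done
}
```

For example, `suggest_conversion /path/to/emis/*.nc` would list the files in that directory that still need converting; the printed commands can then be reviewed and run by hand.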
 

engeir

Eirik Enger
New Member
Ok, I see. But if the error is not related to the last opened file, how do I know what caused the error?
 

engeir

Eirik Enger
New Member
While trying to run the same compset as described above, but with a custom forcing file, say
Code:
VolcanEESMv3.11_SO2_850-2016_Mscale_Zreduc_2deg_c220111.nc
I get a segmentation fault (at least that is what it seems to be; the snippet below is printed repeatedly over the last 2000 lines of the log file cesm.log.xxx):
Code:
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
cesm.exe           00000000033C074D  Unknown               Unknown  Unknown
libpthread-2.17.s  00002AD781C0E630  Unknown               Unknown  Unknown
cesm.exe           0000000001495F5F  mo_util_mp_rebin_          52  mo_util.F90
cesm.exe           0000000000F5ED0C  tracer_data_mp_in        1675  tracer_data.F90
cesm.exe           0000000000F6F6E4  tracer_data_mp_ad         667  tracer_data.F90
cesm.exe           00000000012F180E  mo_extfrc_mp_extf         336  mo_extfrc.F90
cesm.exe           00000000010FF1CB  chemistry_mp_chem        1179  chemistry.F90
cesm.exe           00000000006CA5DF  physpkg_mp_phys_r        2351  physpkg.F90
cesm.exe           00000000004EE672  cam_comp_mp_cam_r         258  cam_comp.F90
cesm.exe           00000000004E4B22  atm_comp_mct_mp_a         287  atm_comp_mct.F90
cesm.exe           0000000000434AC6  component_mod_mp_         267  component_mod.F90
cesm.exe           0000000000429167  cime_comp_mod_mp_        2015  cime_comp_mod.F90
cesm.exe           0000000000431C5E  MAIN__                    114  cime_driver.F90
cesm.exe           000000000041522E  Unknown               Unknown  Unknown
libc-2.17.so       00002AD78213F555  __libc_start_main     Unknown  Unknown
cesm.exe           0000000000415129  Unknown               Unknown  Unknown

I created the file VolcanEESMv3.11_SO2_850-2016_Mscale_Zreduc_2deg_c220111.nc with the script createVolcEruptV3.ncl (the same script that is named in the metadata of the original VolcanEESMv3.11_SO2_850-2016_Mscale_Zreduc_2deg_c191125.nc), followed by the command
Code:
nccopy -k cdf5 in.nc out.nc

Since I could not find the coordinate file used in the NCL-script (coords_1.9x2.5_L88_c150828.nc) I instead used fv_1.9x2.5_L30.nc. Could this be the whole reason why the segmentation fault arises? Or is the error perhaps related to a different thing?

From this resource it seems it might be related to pnetcdf?
 

jedwards

CSEG and Liaisons
Staff member
I think that the number of vertical levels used in your file (L30) needs to match the number used in the model (L88).
 

engeir

Eirik Enger
New Member
Okay, that makes sense. Would you know where such a coordinate file might be found? I see the original NCL-script loads in
/glade/work/mmills/inputdata/grids/coords_1.9x2.5_L88_c150828.nc
which I suppose is a private directory.
 

engeir

Eirik Enger
New Member
With the L88 coords file I still get the same segmentation fault.
Looking at the creation script createVolcEruptV3.ncl and the forcing files it generates, and comparing this to the attributes in VolcanEESMv3.11_SO2_850-2016_Mscale_Zreduc_2deg_c191125.nc it almost seems like the latter was created from some different NCL-script?
E.g. the variable altitude_int is a dimension in VolcanEESMv3.11_SO2_850-2016_Mscale_Zreduc_2deg_c191125.nc, but a variable in the corresponding forcing file generated from createVolcEruptV3.ncl.
Could that be something that would cause a segmentation fault?
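To make that comparison concrete, here is a rough sketch of how the dimensions of the two files could be listed side by side. It is not part of the original workflow: `list_dims` is a hypothetical helper that reads the text produced by `ncdump -h` on standard input and prints one `name size` pair for each entry in the `dimensions:` section, so a name such as altitude_int that exists only as a variable would simply not show up.

```shell
# Print "name size" for every dimension in an `ncdump -h` header
# read from standard input. (Hypothetical helper; sketch only.)
list_dims() {
    awk '
        /^dimensions:/               { in_dims = 1; next }
        /^(variables|data):/ || /^}/ { in_dims = 0 }
        in_dims && $2 == "="         { print $1, $3 }
    '
}
```

A possible use is `ncdump -h VolcanEESMv3.11_SO2_850-2016_Mscale_Zreduc_2deg_c191125.nc | list_dims > ref.txt`, the same for the c220111 file into `new.txt`, and then `diff ref.txt new.txt` to see which dimensions are missing or have a different size.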
 