
CESM defaults to using netcdf instead of pnetcdf

johnsonb

Ben Johnson
New Member
Hi DiscussCESM,

I've ported CESM2_1_3 to Shaheen II, which is a Cray XC40 at King Abdullah University of Science and Technology. The port to Shaheen is intended to enable the Data Assimilation Research Section to run 1000-member CAM ensembles to test data assimilation algorithms.

The port has passed the regression tests and the UFCAM and POP ensemble consistency tests. I believe it hasn't passed the regular CAM ECT only because the necessary comparison files weren't available.

Since we are running such large ensembles with DART, I am trying to get a small, three-member ensemble to run with pnetCDF to improve performance, but I can't get past the runtime error, "NetCDF: Attempt to use feature that was not turned on when netCDF was built."

When building CESM, I followed @jedwards's instructions in this previous DiscussCESM thread, "You should build and install netcdf and pnetcdf separately and link both."

netcdf and pnetcdf are built on Shaheen with the Intel compiler, which is the same compiler we are using for CESM.

I tried to get pnetcdf to link following the IO Through Data Models section in the CIME documentation, but the instructions seem incomplete, and I suspect I interpreted them incorrectly. For example, it says to, "make sure the Macros.make variables PNETCDF_PATH, INC_PNETCDF, and LIB_PNETCDF "

So, within config_compilers.xml, I attempted to set INC_PNETCDF and LIB_PNETCDF with paths to the pnetcdf include and lib directories, respectively, but got an error, "Schemas validity error : Element 'INC_PNETCDF': This element is not expected."

So I commented those key/values out and just set NETCDF and PNETCDF respectively within config_compilers.xml.

At the end of the cesm.bldlog (attached) it appears that only PNETCDF gets linked:
ftn -o /lustre/scratch/x_johnsobk/FHIST_BGC.f09_d025.075.e03/bld/cesm.exe cime_comp_mod.o cime_driver.o component_mod.o component_type_mod.o cplcomp_exchange_mod.o map_glc2lnd_mod.o map_lnd2glc_mod.o map_lnd2rof_irrig_mod.o mrg_mod.o prep_aoflux_mod.o prep_atm_mod.o prep_glc_mod.o prep_ice_mod.o prep_lnd_mod.o prep_ocn_mod.o prep_rof_mod.o prep_wav_mod.o seq_diag_mct.o seq_domain_mct.o seq_flux_mct.o seq_frac_mct.o seq_hist_mod.o seq_io_mod.o seq_map_mod.o seq_map_type_mod.o seq_rest_mod.o t_driver_timers_mod.o -L/lustre/scratch/x_johnsobk/FHIST_BGC.f09_d025.075.e03/bld/lib/ -latm -L/lustre/scratch/x_johnsobk/FHIST_BGC.f09_d025.075.e03/bld/lib/ -lice -L../../intel/mpt/debug/nothreads/mct/noesmf/lib/ -lclm -L/lustre/scratch/x_johnsobk/FHIST_BGC.f09_d025.075.e03/bld/lib/ -locn -L/lustre/scratch/x_johnsobk/FHIST_BGC.f09_d025.075.e03/bld/lib/ -lrof -L/lustre/scratch/x_johnsobk/FHIST_BGC.f09_d025.075.e03/bld/lib/ -lglc -L/lustre/scratch/x_johnsobk/FHIST_BGC.f09_d025.075.e03/bld/lib/ -lwav -L/lustre/scratch/x_johnsobk/FHIST_BGC.f09_d025.075.e03/bld/lib/ -lesp -L../../intel/mpt/debug/nothreads/mct/noesmf/c3a1l1i1o1r1g1w1e1/lib -lcsm_share -L../../intel/mpt/debug/nothreads/lib -lpio -lgptl -lmct -lmpeu -mkl=cluster -L/opt/cray/pe/parallel-netcdf/1.11.1.1/INTEL/19.0/lib -lpnetcdf -mkl
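
For anyone wanting to check a long link line like this one without reading it by eye, here is a minimal Python sketch that lists the libraries named explicitly with -l flags. The function name linked_libraries is hypothetical and not part of CIME, and only the tail of the link line is reproduced. Note that the Cray ftn wrapper can add libraries from loaded modules implicitly, so a library missing from this list is suggestive rather than conclusive.

```python
import shlex


def linked_libraries(link_line):
    """Return the library names given via explicit -l flags, in order.

    Hypothetical helper for illustration; it only inspects the text of the
    link line and knows nothing about what a compiler wrapper adds itself.
    """
    return [
        token[2:]
        for token in shlex.split(link_line)
        if token.startswith("-l") and len(token) > 2
    ]


# Tail of the link line quoted above (object files and earlier flags omitted).
link_tail = (
    "-lpio -lgptl -lmct -lmpeu -mkl=cluster "
    "-L/opt/cray/pe/parallel-netcdf/1.11.1.1/INTEL/19.0/lib -lpnetcdf -mkl"
)

libs = linked_libraries(link_tail)
missing = [name for name in ("netcdf", "netcdff", "pnetcdf") if name not in libs]
print(missing)  # prints ['netcdf', 'netcdff']
```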

This produces a runtime error. The cesm.log says:
1: Rank 1 [Tue Oct 27 21:25:24 2020] [c9-3c0s2n2] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
353: pio_support::pio_die:: myrank= -1 : ERROR: ionf_mod.F90: 235 :
353: NetCDF: Attempt to use feature that was not turned on when netCDF was built.


Is this caused by the fact that netcdf is not linked? Here is the relevant output of nc-config --all and the full output is attached as "nc-config.txt":
This netCDF 4.6.3 has been built with the following features:
. . .
--has-pnetcdf -> no
--has-szlib -> no
--has-cdf5 -> yes
--has-parallel4 -> no
--has-parallel -> no
. . .
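
The same flags can be checked programmatically. Below is a minimal Python sketch, assuming the "--has-NAME -> yes/no" line format shown above; the function name parse_nc_config is hypothetical, and the sample text is just the excerpt quoted in this post, not the full nc-config output.

```python
def parse_nc_config(text):
    """Map each --has-NAME flag in nc-config style output to True or False."""
    features = {}
    for line in text.splitlines():
        key, arrow, value = line.partition("->")
        key = key.strip()
        if arrow and key.startswith("--has-"):
            features[key[len("--has-"):]] = value.strip() == "yes"
    return features


# Excerpt quoted above; in practice the text would come from `nc-config --all`.
sample = """\
--has-pnetcdf -> no
--has-szlib -> no
--has-cdf5 -> yes
--has-parallel4 -> no
--has-parallel -> no
"""

features = parse_nc_config(sample)
disabled = sorted(name for name, enabled in features.items() if not enabled)
print(disabled)  # prints ['parallel', 'parallel4', 'pnetcdf', 'szlib']
```

In this excerpt every parallel-related flag is "no", and only cdf5 support is enabled.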


My next step is to try appending the netcdf link using the SLIBS tag in config_compilers.xml:
<SLIBS>
<append>-L$(NETCDF_PATH_KAUST) -lnetcdff</append>
</SLIBS>


Here are some relevant xmlquery settings from the case:
./xmlquery MPILIB
MPILIB: mpt
./xmlquery PIO_TYPENAME
PIO_TYPENAME: ['CPL:pnetcdf', 'ATM:pnetcdf', 'LND:pnetcdf', 'ICE:pnetcdf', 'OCN:pnetcdf', 'ROF:pnetcdf', 'GLC:pnetcdf', 'WAV:pnetcdf', 'ESP:pnetcdf']


Any advice would be greatly appreciated.

Thank you,
Ben Johnson / johnsonb
 

Attachments

  • describe_version.txt (5.5 KB)
  • nc-config.txt (990 bytes)
  • cesm.bldlog.201027-211925.gz (1.3 KB)
  • cesm.log.16060302.201027-212502.txt (929.4 KB)
  • config_compilers.xml.txt (44.1 KB)
  • config_machines.xml.txt (123.2 KB)

jedwards

CSEG and Liaisons
Staff member
I think that this may be a problem in config_machines.xml:
<!-- BKJ 10-15-2020 Switching out the cray netcdf hdf5 parallel libraries because DART needs nco -->
<!--<command name="load">cray-netcdf-hdf5parallel/4.6.3.2</command>
<command name="load">cray-hdf5-parallel/1.10.5.2</command>-->

So you don't have any netcdf loaded at runtime as far as I can tell. Perhaps you should try getting CESM performing well with pnetcdf before adding DART to the mix. It might also help to know what file it was trying to read when you got the error; usually that will be in one of the component logs.
 

johnsonb

Ben Johnson
New Member
Hi Jim,

Thank you for your prompt response. Yes, I agree with you -- it is unwise to attempt to get pnetcdf and DART running in one go.

I am trying to get CESM running with pnetcdf alone first using a three-member CAM ensemble.

The "Switching out the cray netcdf hdf5 parallel" comment is from a couple of weeks ago when I got an 80-member CAM ensemble working with DART. To avoid a DART conflict with nco, I just loaded cray-netcdf and cray-parallel-netcdf rather than cray-netcdf-hdf5parallel.

However, that configuration was running slowly because it was using serial netcdf, so now I am attempting to get CESM alone working with pnetcdf.

If you look at the line above the one you copied, it reads:

<command name="load">cray-parallel-netcdf/1.11.1.1</command>
<!-- BKJ 10-15-2020 Switching out the cray netcdf hdf5 parallel libraries because DART needs nco -->
<!--<command name="load">cray-netcdf-hdf5parallel/4.6.3.2</command>
<command name="load">cray-hdf5-parallel/1.10.5.2</command>-->


So both netcdf and pnetcdf are loaded at runtime, as far as I can tell.

I've attached software_environment.txt for this case. Modulefiles 30-32 are:
30) cray-parallel-netcdf/1.11.1.1
31) cmake/3.13.4
32) cray-netcdf/4.6.3.2


As for the individual file, the last lines of all three atm_000?.log files (sample log 0003 attached) read:
(GETFIL): using
/lustre/project/k1421/cesm_store/inputdata/atm/cam/tracer_cnst/tracer_cnst_halons_WACCM6_3Dmonthly_L70_1975-2014_c180216.nc
open_trc_datafile:
/lustre/project/k1421/cesm_store/inputdata/atm/cam/tracer_cnst/tracer_cnst_halons_WACCM6_3Dmonthly_L70_1975-2014_c180216.nc


The same file is on glade here:
/glade/p/cesmdata/cseg/inputdata/atm/cam/tracer_cnst/tracer_cnst_halons_WACCM6_3Dmonthly_L70_1975-2014_c180216.nc

Additionally, the cesm.log reports:
257: /lustre/project/k1421/cesm_store/inputdata/atm/cam/dst/dst_source2x2tunedcam6-2
257: x2-04062017.nc 3
1: pio_support::pio_die:: myrank= -1 : ERROR: ionf_mod.F90: 235 :
1: NetCDF: Attempt to use feature that was not turned on when netCDF was built.


Thank you,
Ben Johnson / johnsonb
 

Attachments

  • software_environment.txt (38.1 KB)
  • atm_0003.log.16060302.201027-212502.txt (72.7 KB)

jedwards

CSEG and Liaisons
Staff member
I think that you might try converting the file tracer_cnst_halons_WACCM6_3Dmonthly_L70_1975-2014_c180216.nc to the classic cdf5 model. Do this with: nccopy -k cdf5 oldname newname
It looks like the cray build of netcdf is not reading the netcdf4/hdf5 format correctly.
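
To find which input files might need this conversion, one option is to look at each file's leading bytes: classic-family netCDF files begin with the bytes CDF followed by a version byte (1, 2, or 5), while netCDF-4 files are HDF5 files and begin with the HDF5 signature. The following is a minimal Python sketch under that assumption; the function name netcdf_kind and its return labels are hypothetical, and the demonstration uses synthetic files rather than real CESM input data. Running ncdump -k on a file reports the format more reliably.

```python
import os
import tempfile

HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"
CLASSIC_VERSIONS = {1: "classic", 2: "64-bit offset", 5: "cdf5"}


def netcdf_kind(path):
    """Guess the on-disk netCDF variant from the first bytes of the file."""
    with open(path, "rb") as handle:
        magic = handle.read(8)
    if magic == HDF5_SIGNATURE:
        # netCDF-4 files are HDF5 files; this check cannot tell them
        # apart from other HDF5 data.
        return "hdf5"
    if len(magic) >= 4 and magic[:3] == b"CDF":
        return CLASSIC_VERSIONS.get(magic[3], "unknown")
    return "unknown"


# Demonstration on synthetic files that contain only the leading bytes.
with tempfile.TemporaryDirectory() as tmp:
    for name, head in (("a.nc", HDF5_SIGNATURE), ("b.nc", b"CDF\x05")):
        path = os.path.join(tmp, name)
        with open(path, "wb") as handle:
            handle.write(head + b"\x00" * 16)
        print(name, netcdf_kind(path))
```

Files reported as hdf5 would be the candidates for nccopy -k cdf5.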
 

johnsonb

Ben Johnson
New Member
Hi Jim,

Thank you for your advice. You were correct on the file format -- converting the halons file to the cdf5 model allowed the run to complete with this build of parallel-netcdf.

I'm using the load balancing tools to determine the optimal PE layout for a given ensemble size, but thus far the largest changes in performance that I've encountered come from which netCDF library CESM is built with. The ensembles were actually running faster with the serial netCDF that I started with.

I plan to iterate through the different netCDF/parallel netCDF/netcdf-hdf5parallel libraries available on Shaheen (there are many), so I was wondering if you could clarify my original point of confusion.

You mention in this thread that, "You should build and install netcdf and pnetcdf separately and link both."

Is that what I should be doing in this case, by using the SLIBS tag in config_compilers.xml to add the serial library:
<SLIBS>
<append>-L$(NETCDF_PATH_KAUST) -lnetcdff</append>
</SLIBS>

in addition to specifying the parallel library using the PNETCDF_PATH key? Using the SLIBS tag is what is done for the cori-haswell entry in config_compilers.xml -- cori-haswell and Shaheen are both Cray XC40s.

Or should I only be building with parallel netCDF using the PNETCDF_PATH key in config_compilers.xml?

Thank you,
Ben Johnson / johnsonb
 