valgrind

mark@...
valgrind

Is it possible to run CESM2 under valgrind? I tried this on Cori by doing these two things:

1) added a "module load valgrind" to cime/config/cesm/machines/config_machines.xml in the cori-haswell section

      <modules mpilib="!mpi-serial">
        <command name="rm">cray-netcdf-hdf5parallel</command>
        <command name="load">cray-netcdf-hdf5parallel/4.4.1.1.6</command>
        <command name="load">cray-hdf5-parallel/1.10.1.1</command>
        <command name="load">cray-parallel-netcdf/1.8.1.3</command>
        <command name="load">valgrind</command>
      </modules>

 

2) modified env_mach_specific.xml

  <mpirun mpilib="default">
    <executable>srun</executable>
    <arguments>
      <arg name="label"> --label</arg>
      <arg name="num_tasks"> -n {{ total_tasks }}</arg>
      <arg name="valgrind"> valgrind --leak-check=yes</arg>
      <arg name="binding"> -c {{ srun_binding }}</arg>
    </arguments>
  </mpirun>

 

The log file in the case directory itself showed this:

run command is srun  --label  -n 45  valgrind --leak-check=yes  -c 2 /global/cscratch1/sd/mbranson/ne5-has/bld/cesm.exe  >> cesm.log.$LID 2>&1  
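The logged command is consistent with the <arg> entries being joined in the order they appear in the XML, followed by the executable path. A minimal Python sketch of that assembly is below. It is only an illustration under that assumption, not CIME's actual implementation; the function name build_run_command and the bare "cesm.exe" placeholder are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# The <mpirun> block from the post, as edited in env_mach_specific.xml.
MPIRUN_XML = """
<mpirun mpilib="default">
  <executable>srun</executable>
  <arguments>
    <arg name="label"> --label</arg>
    <arg name="num_tasks"> -n {{ total_tasks }}</arg>
    <arg name="valgrind"> valgrind --leak-check=yes</arg>
    <arg name="binding"> -c {{ srun_binding }}</arg>
  </arguments>
</mpirun>
"""


def build_run_command(xml_text, values, model_exe):
    """Join the executable, each <arg> in document order, and the model path.

    Assumption: arguments are concatenated in the order they appear,
    which is what the logged run command suggests.
    """
    root = ET.fromstring(xml_text)
    parts = [root.findtext("executable").strip()]
    for arg in root.find("arguments"):
        # Replace {{ name }} placeholders with the supplied values.
        text = re.sub(
            r"\{\{\s*(\w+)\s*\}\}",
            lambda match: str(values[match.group(1)]),
            arg.text,
        )
        parts.append(text.strip())
    parts.append(model_exe)
    return " ".join(parts)


command = build_run_command(
    MPIRUN_XML, {"total_tasks": 45, "srun_binding": 2}, "cesm.exe"
)
print(command)
```

With this ordering, the "-c 2" binding option ends up after the word valgrind rather than next to the other srun options.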

 

But the cesm log file had this:

16: valgrind: 2: command not found
19: valgrind: 2: command not found
18: valgrind: 2: command not found
 6: valgrind: 2: command not found
 1: valgrind: 2: command not found
 7: valgrind: 2: command not found
 4: valgrind: 2: command not found
17: valgrind: 2: command not found
 5: valgrind: 2: command not found

 

So I suspect that the module load for valgrind added to config_machines.xml is only used during the build stage (i.e., the module is not loaded when the model is actually executed).

Is there any workaround for this?

Thanks,

Mark Branson

 
jedwards

After you modified config_machines.xml, did you create a new case? Otherwise the change isn't used. Since you don't intend this to be a permanent change, you should modify env_mach_specific.xml in the case instead of config_machines.xml. Then you can see the environment that CESM will use with the command

source .env_mach_specific.sh  (for bash users)

or 

source .env_mach_specific.csh  (for csh and tcsh users)

 

Then run module list. If valgrind is listed, it should also be available on the compute nodes.
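A loaded module only helps if it actually puts the tool on the search path. As a complementary check, here is a minimal Python sketch that reports where a command resolves on PATH. The helper name locate_tool is hypothetical; shutil.which is the standard-library lookup.

```python
import shutil


def locate_tool(name, search_path=None):
    """Return the absolute path of `name`, or None if it is not found.

    When search_path is None, the PATH of the current environment is used.
    """
    return shutil.which(name, path=search_path)


if __name__ == "__main__":
    location = locate_tool("valgrind")
    print(location if location else "valgrind is not on PATH")
```

Running this after sourcing .env_mach_specific.sh (or .csh), and again inside a batch job, would show whether both environments resolve the same valgrind binary.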

CESM Software Engineer

mark@...

Thanks for the reply, Jim.  I followed your advice and added the module load for valgrind into env_mach_specific.xml, and now by doing a source .env_mach_specific.csh I can see that the valgrind module is indeed being loaded.  But I still get "valgrind: command not found" in the cesm log when I try to run the model.  

I was able to successfully run a helloWorld sample Fortran program under valgrind through the batch scheduler, so I feel confident that valgrind is indeed available on the compute nodes on Cori.

Mark

 

jedwards

Add some code to your hello world to show you the full path to valgrind, and look at the runenv file written to the case log directory.

CESM Software Engineer

mark@...

Here are the pertinent parts of my run_environment file (to avoid posting all 500 lines of it).  You can see that it seems to load the valgrind module correctly.

Currently Loaded Modulefiles:
  1) modules/3.2.11.1                  11) cray-libsci/19.02.1
  2) altd/2.0                          12) pmi/5.0.14
  3) darshan/3.1.7                     13) atp/2.1.3
  4) cray-hdf5-parallel/1.10.2.0       14) PrgEnv-intel/6.0.5
  5) valgrind/3.15.0                   15) intel/19.0.0.117
  6) craype-haswell                    16) cray-netcdf-hdf5parallel/4.6.1.3
  7) craype-hugepages2M                17) cray-parallel-netcdf/1.8.1.4
  8) craype-network-aries              18) git/2.21.0
  9) craype/2.6.0                      19) cmake/3.14.4
 10) cray-mpich/7.7.8

LD_LIBRARY_PATH=/global/common/cori_cle6/software/intel/compilers_and_libraries_2019.0.117/linux/compiler/lib/intel64:/global/common/cori_cle6/software/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64:/opt/cray/job/2.2.4-7.0.0.1_3.26__g36b56f4.ari/lib64:/usr/common/software/valgrind/3.15.0/intel/lib/valgrind:/usr/syscom/nsg/lib

VALGRIND_INCLUDE=-I/usr/common/software/valgrind/3.15.0/intel/include/valgrind

VALGRIND_DIR=/usr/common/software/valgrind/3.15.0/intel

VALGRIND_LINK_OPTS=-L/usr/common/software/valgrind/3.15.0/intel/lib/valgrind -lcoregrind-amd64-linux -lvex-amd64-linux -lgcc

LOADEDMODULES=modules/3.2.11.1:nsg/1.2.0:altd/2.0:darshan/3.1.7:cray-hdf5-parallel/1.10.2.0:valgrind/3.15.0:udreg/2.3.2-7.0.0.1_4.23__g8175d3d.ari:ugni/6.0.14.0-7.0.0.1_7.25__ge78e5b0.ari:dmapp/7.1.1-7.0.0.1_5.15__g25e5077.ari:gni-headers/5.0.12.0-7.0.0.1_7.30__g3b1768f.ari:xpmem/2.2.17-7.0.0.1_3.20__g7acee3a.ari:job/2.2.4-7.0.0.1_3.26__g36b56f4.ari:dvs/2.11_2.2.131-7.0.0.1_7.3__gd2a05f7e:alps/6.6.50-7.0.0.1_3.30__g962f7108.ari:rca/2.2.20-7.0.0.1_4.29__g8e3fb5b.ari:craype-haswell:craype-hugepages2M:craype-network-aries:craype/2.6.0:cray-mpich/7.7.8:cray-libsci/19.02.1:pmi/5.0.14:atp/2.1.3:PrgEnv-intel/6.0.5:intel/19.0.0.117:cray-netcdf-hdf5parallel/4.6.1.3:cray-parallel-netcdf/1.8.1.4:git/2.21.0:cmake/3.14.4

VALGRIND_MPI_LINK=-L/usr/common/software/valgrind/3.15.0/intel/lib/valgrind -lmpiwrap-amd64-linux
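Scanning a 500-line environment dump by eye is error-prone, so a small script can pull out the relevant entries. The sketch below assumes the file contains plain KEY=VALUE lines like the excerpt above; the helper names parse_env_dump and find_module are hypothetical, and the sample text is a shortened copy of the posted excerpt.

```python
# Shortened sample taken from the excerpt above (LOADEDMODULES is truncated).
SAMPLE = """\
Currently Loaded Modulefiles:
  5) valgrind/3.15.0                   15) intel/19.0.0.117
VALGRIND_DIR=/usr/common/software/valgrind/3.15.0/intel
LOADEDMODULES=modules/3.2.11.1:nsg/1.2.0:altd/2.0:darshan/3.1.7:cray-hdf5-parallel/1.10.2.0:valgrind/3.15.0
"""


def parse_env_dump(text):
    """Collect KEY=VALUE lines into a dict, ignoring everything else."""
    env = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip().isidentifier():
            env[key.strip()] = value
    return env


def find_module(env, name):
    """Return the loaded 'name/version' entry for a module, or None."""
    for module in env.get("LOADEDMODULES", "").split(":"):
        if module.split("/")[0] == name:
            return module
    return None


env = parse_env_dump(SAMPLE)
print(find_module(env, "valgrind"))
print(env.get("VALGRIND_DIR"))
```

The same dict can be used to inspect PATH, which is the variable that decides whether the valgrind binary itself can be found; it is not part of the excerpt posted here.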

mark@...

I finally got it to work.  My original change to env_mach_specific.xml, which gave "valgrind: command not found", was this:

  <mpirun mpilib="default">
    <executable>srun</executable>
    <arguments>
      <arg name="label"> --label</arg>
      <arg name="num_tasks"> -n {{ total_tasks }}</arg>
      <arg name="valgrind"> valgrind --leak-check=full --dsymutil=yes --track-origins=yes --log-file=vallog</arg>
      <arg name="binding"> -c {{ srun_binding }}</arg>
    </arguments>
  </mpirun>

 

When I changed it to this (making valgrind the last argument), it worked.

 

  <mpirun mpilib="default">
    <executable>srun</executable>
    <arguments>
      <arg name="label"> --label</arg>
      <arg name="num_tasks"> -n {{ total_tasks }}</arg>
      <arg name="binding"> -c {{ srun_binding }}</arg>
      <arg name="valgrind"> valgrind --leak-check=full --dsymutil=yes --track-origins=yes --log-file=vallog</arg>
    </arguments>
  </mpirun>

 
jedwards

I guess that makes it an srun error then?

CESM Software Engineer
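One likely reading of why the ordering matters: srun takes its own options first and treats the first non-option word as the program to launch, passing everything after it to that program. With valgrind placed before the binding argument, "-c 2" would be handed to valgrind instead of srun, which is consistent with the logged "valgrind: 2: command not found" (valgrind apparently trying to launch "2"). The Python sketch below is a toy model of that splitting, not srun's real parser; the function name, the option set, and the bare "cesm.exe" placeholder are assumptions for illustration.

```python
def split_launcher_command(tokens, options_taking_value):
    """Split tokens into (launcher options, program, program arguments).

    Toy model: options are consumed until the first word that does not
    start with '-'; that word is the program, the rest are its arguments.
    """
    launcher_options = []
    i = 0
    while i < len(tokens) and tokens[i].startswith("-"):
        launcher_options.append(tokens[i])
        # Short options such as -n and -c are followed by a separate value.
        if tokens[i] in options_taking_value and i + 1 < len(tokens):
            launcher_options.append(tokens[i + 1])
            i += 1
        i += 1
    program = tokens[i] if i < len(tokens) else None
    return launcher_options, program, tokens[i + 1:]


SRUN_VALUE_OPTIONS = {"-n", "-c"}

# Original ordering: valgrind appears before the binding option.
failing = "--label -n 45 valgrind --leak-check=yes -c 2 cesm.exe".split()
# Working ordering: valgrind is the last argument before the executable.
working = "--label -n 45 -c 2 valgrind --leak-check=full cesm.exe".split()

for tokens in (failing, working):
    options, program, program_args = split_launcher_command(
        tokens, SRUN_VALUE_OPTIONS
    )
    print("srun options:", options)
    print("program:     ", program)
    print("program args:", program_args)
```

In the first ordering the program arguments include "-c" and "2"; in the second they contain only valgrind's own flag and the model executable.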
