
ERROR in building GPTL without PAPI

bidyut

BIDYUT BIKASH GOSWAMI
Member
Thank you in advance.


What version of the code are you using?

bgoswami:CESM220$ ./describe_version
------------------------------------------------------------------------
git describe:
cesm2.2.0-0-g332937b
------------------------------------------------------------------------

------------------------------------------------------------------------
git status:
Not currently on any branch.
Untracked files:
(use "git add <file>..." to include in what will be committed)
xmlchange_before_run.md

nothing added to commit but untracked files present (use "git add" to track)
------------------------------------------------------------------------

------------------------------------------------------------------------
manage_externals status:
Processing externals description file : Externals.cfg
Processing externals description file : Externals_CAM.cfg
Processing externals description file : .gitmodules
Processing submodules description file : .gitmodules
Processing externals description file : ../Externals_cime.cfg
Processing externals description file : Externals_CISM.cfg
Processing externals description file : Externals_CLM.cfg
Processing externals description file : Externals_POP.cfg
Checking status of externals: cam, chem_proc, carma, cosp2, clubb, silhs, pumas, atmos_phys, atmos_cubed_sphere, cice, cdeps, fox, cime, cmeps, cism, source_cism, clm, fates, ptclm, fms, mom, mosart, pop, cvmix, marbl, rtm, ww3,
M ./cime
modified sandbox, on cime5.8.32
HEAD detached at cime5.8.32
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
modified: config/cesm/machines/config_batch.xml
modified: config/cesm/machines/config_compilers.xml
modified: config/cesm/machines/config_machines.xml
modified: scripts/lib/CIME/XML/env_mach_specific.py




Have you made any changes to files in the source tree?
  • Changes in config_batch.xml file:
bgoswami:CESM220$ sed -n '169,169p' cime/config/cesm/machines/config_batch.xml
<directive> --mem-per-cpu=2g </directive>

bgoswami:CESM220$ sed -n '643,652p' cime/config/cesm/machines/config_batch.xml
<batch_system MACH="bbg" type="slurm">
<batch_submit>sbatch</batch_submit>
<submit_args>
<arg flag="--time" name="$JOB_WALLCLOCK_TIME"/>
<arg flag="-p" name="$JOB_QUEUE"/>
</submit_args>
<queues>
<queue default="true">defaultp</queue>
</queues>
</batch_system>


  • Changes in config_compilers.xml file:
bgoswami:CESM220$ sed -n '1367,1388p' cime/config/cesm/machines/config_compilers.xml
<compiler MACH="bbg" COMPILER="gnu">
<CFLAGS>
<append DEBUG="FALSE"> -O2 </append>
</CFLAGS>
<CONFIG_ARGS>
<base> --host=Linux </base>
</CONFIG_ARGS>
<CPPDEFS>
<append> -DLINUX </append>
</CPPDEFS>
<FFLAGS>
<append DEBUG="FALSE"> -fallow-invalid-boz -fallow-argument-mismatch -O2 </append>
</FFLAGS>
<NETCDF_PATH>/mnt/nfs/clustersw/Debian/bookworm/openmpi/4.1.8/usr/netcdf/4.8.1</NETCDF_PATH>
<!-- <PIO_FILESYSTEM_HINTS>lustre</PIO_FILESYSTEM_HINTS> -->
<SLIBS>
<base> -L${NETCDF_PATH}/lib -lnetcdf -lnetcdff -L/mnt/nfs/clustersw/Debian/bookworm/openblas/0.3.29/lib -lopenblas </base>
</SLIBS>
<CPPDEFS>
<append MODEL="gptl"> -DHAVE_SLASHPROC </append>
</CPPDEFS>
</compiler>

  • Changes in config_machines.xml file:
bgoswami:CESM220$ sed -n '53,112p' cime/config/cesm/machines/config_machines.xml
<machine MACH="bbg" >
<DESC>ISTA HPC, batch system is SLURM</DESC>
<OS>LINUX</OS>
<COMPILERS>gnu</COMPILERS>
<MPILIBS>openmpi</MPILIBS>
<PROJECT>CESM</PROJECT>
<CIME_OUTPUT_ROOT>/nfs/scistore16/mullegrp/bgoswami/CESM220_output</CIME_OUTPUT_ROOT>
<DIN_LOC_ROOT>/nfs/scistore16/mullegrp/bgoswami/model_input/inputdata</DIN_LOC_ROOT>
<DIN_LOC_ROOT_CLMFORC>$DIN_LOC_ROOT</DIN_LOC_ROOT_CLMFORC>
<DOUT_S_ROOT>${CIME_OUTPUT_ROOT}/archive/$CASE</DOUT_S_ROOT>
<BASELINE_ROOT>${CIME_OUTPUT_ROOT}/cesm_baselines</BASELINE_ROOT>
<CCSM_CPRNC>/nfs/scistore16/mullegrp/bgoswami/CESM220/cime/tools/cprnc</CCSM_CPRNC>
<GMAKE_J>12</GMAKE_J>
<BATCH_SYSTEM>slurm</BATCH_SYSTEM>
<SUPPORTED_BY>bgoswami -at- ist.ac.at</SUPPORTED_BY>
<MAX_TASKS_PER_NODE>12</MAX_TASKS_PER_NODE>
<MAX_MPITASKS_PER_NODE>12</MAX_MPITASKS_PER_NODE>
<PROJECT_REQUIRED>TRUE</PROJECT_REQUIRED>
<mpirun mpilib="default">
<executable>srun</executable>
<arguments>
<arg name="num_tasks"> -n {{ total_tasks }}</arg>
<arg name="thread_count"> -d $ENV{OMP_NUM_THREADS}</arg>
</arguments>
</mpirun>
<module_system type="module">
<init_path lang="perl">/mnt/nfs/clustersw/Debian/bookworm/lmod/lmod/init/perl</init_path>
<init_path lang="python">/mnt/nfs/clustersw/Debian/bookworm/lmod/lmod/init/env_modules_python.py</init_path>
<init_path lang="sh">/mnt/nfs/clustersw/Debian/bookworm/lmod/lmod/init/bash</init_path>
<init_path lang="bash">/mnt/nfs/clustersw/Debian/bookworm/lmod/lmod/init/bash</init_path>
<cmd_path lang="perl">/mnt/nfs/clustersw/Debian/bookworm/lmod/lmod/libexec/lmod perl</cmd_path>
<cmd_path lang="python">/mnt/nfs/clustersw/Debian/bookworm/lmod/lmod/libexec/lmod python</cmd_path>
<cmd_path lang="tcsh">/mnt/nfs/clustersw/Debian/bookworm/lmod/lmod/libexec/lmod module</cmd_path>
<cmd_path lang="bash">/mnt/nfs/clustersw/Debian/bookworm/lmod/lmod/libexec/lmod module</cmd_path>
<modules>
<!--command name="purge"></command-->
<command name="purge"/>
<!--command name="load">scicomp-formats/20220527</command-->
<!--command name="load">gcc/12.2</command-->
<command name="load">git-lfs/3.6.1</command>
<command name="load">openmpi/4.1.8</command>
<command name="load">netcdf/4.8.1</command>
<command name="load">pnetcdf/1.12.3</command>
<command name="load">python/3.10.6</command>
<command name="load">perl/5.38.0</command>
<command name="load">gptl/8.1.1</command>
<command name="load">openblas/0.3.29</command>
<command name="load">cmake/3.24.2</command>
<command name="load">papi/7.0.1</command>
<!--command name="load">pgi/2019.04</command-->
</modules>
</module_system>
<environment_variables>
<env name="OMP_STACKSIZE">64M</env>
<!--env name="GPTL_VERBOSE">0</env>
<env name="GPTL_MEMORY">0</env>
<env name="CESM_GPTL_NOMEMORY">TRUE</env-->
</environment_variables>

</machine>

  • Changes in cime/scripts/lib/CIME/XML/env_mach_specific.py file:
bgoswami:CESM220$ sed -n '133,133p' cime/scripts/lib/CIME/XML/env_mach_specific.py
return run_cmd_no_fail("bash -c '{}module list'".format(source_cmd), combine_output=True)


Describe every step you took leading up to the problem:

  1. Downloaded CESM2.2.0
  2. Placed it in /nfs/scistore16/mullegrp/bgoswami/CESM220
  3. Edited the files mentioned above, with one difference: in config_compilers.xml I was initially compiling GPTL with -DHAVE_PAPI. Both ./case.setup and ./case.build succeeded, but the job failed at run time with: ERROR: (shr_mem_init): GPTLget_memusage mrss0 failed
  4. I contacted our HPC system admin, who told me that GPTL with -DHAVE_PAPI did not work because I do not have root access. I then compiled GPTL without PAPI, and the model appeared to build successfully.
  5. However, I still get the same error at run time: "ERROR: (shr_mem_init): GPTLget_memusage mrss0 failed"
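
One way to confirm which GPTL options actually reached the compiler in steps 3 to 5 is to list the -DHAVE_* defines that appear in the GPTL build log. The following is a minimal sketch, assuming the compile commands are echoed into the log and that the log may be gzip-compressed like the attached gptl.bldlog; the function name list_have_defines is hypothetical and not part of CESM or CIME.

```shell
#!/bin/sh
# list_have_defines: hypothetical helper, not part of CESM/CIME.
# Prints each distinct -DHAVE_* define found in a build log, one per line.
list_have_defines() {
    # Decompress .gz logs on the fly; read plain logs as-is.
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | grep -o -- '-DHAVE_[A-Z0-9_]*' | sort -u
}

# Run only when a log path is given as the first argument.
if [ -n "$1" ]; then
    list_have_defines "$1"
fi
```

If -DHAVE_PAPI still shows up after the rebuild, the old setting is probably still in effect somewhere; if -DHAVE_SLASHPROC is missing, the CPPDEFS entry for gptl may not have been picked up.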


If this is a port to a new machine: Please attach any files you added or changed for the machine port (e.g., config_compilers.xml, config_machines.xml, and config_batch.xml) and tell us the compiler version you are using on this machine.
Please attach any log files showing error messages or other useful information.

  • Attached:
    • config_batch.xml, config_compilers.xml, config_machines.xml, and env_mach_specific.py (I modified this python script so that bash runs the commands in the xml files, and not the default /bin/sh)
    • gptl.bldlog
    • cesm.log and cpl.log


Describe your problem or question:
  1. Is it OK to build GPTL without PAPI?
  2. If yes, kindly check my xml and log files and let me know what I should do to address the error I am getting.

Regards,
Bidyut
 

Attachments

  • gptl.bldlog.250627-150958.gz
    1.2 KB · Views: 0

bidyut

BIDYUT BIKASH GOSWAMI
Member
Please find the remaining files attached here. Thank you.
 

Attachments

  • cesm.log.36624671.zip
    4.8 KB · Views: 1
  • cpl.log.36624671.zip
    617 bytes · Views: 0
  • XML_and_py_scripts.zip
    33.1 KB · Views: 0

jedwards

CSEG and Liaisons
Staff member
First I highly recommend not using cesm2.2.0 - consider using the latest cesm3 or go back to cesm2.1.5.

Is it possible that you have -DHAVE_SLASHPROC set on a machine that does not have this mechanism? From the command line try:
Code:
ls /proc
It is fine to compile gptl without papi but you should also inform your sys admin that using papi should not require root access.
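
A somewhat stricter check than listing /proc is to read a per-process memory file directly. The sketch below assumes that the -DHAVE_SLASHPROC code path reads /proc/<pid>/statm (an assumption about the GPTL source, not something confirmed in this thread); the function name check_statm is hypothetical.

```shell
#!/bin/sh
# check_statm: hypothetical helper, not part of CESM or GPTL.
# Reads a statm-style file and prints its first three fields.
# Returns non-zero if the file is missing or unreadable.
check_statm() {
    if [ ! -r "$1" ]; then
        echo "cannot read $1" >&2
        return 1
    fi
    # On Linux, statm holds seven integers counted in pages; the first
    # three are total program size, resident set size, and shared pages.
    read -r size resident shared rest < "$1"
    echo "size=$size resident=$resident shared=$shared"
}

# Check the statm file of the current shell process.
check_statm "/proc/$$/statm" || echo "statm check failed"
```

Login and compute nodes can differ, so it is worth running this inside a batch job as well as on the login node.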
 

bidyut

BIDYUT BIKASH GOSWAMI
Member
Thank you. I ran ls /proc and it returned a number of files and directories. If I understand correctly, this means -DHAVE_SLASHPROC is OK. Please correct me if that is wrong.
 

jedwards

CSEG and Liaisons
Staff member
It should then be okay. Do you see any errors in the submit log, the one generated by the batch control system?
I see a couple of problems in your log that you will need to consult your sysadmin about. Have you tried running an MPI hello world program across multiple nodes?
[1751144239.431866] [eta354:3298996:0] ucp_context.c:1263 UCX WARN network device 'ib0' is not available, please use one or more of: 'ens22f0'(tcp), 'ibp66s0:1'(ib), 'ibs3'(tcp), 'lo'(tcp)
This warning indicates that your MPI and network are not getting along as they should.

Try replacing lines 119-125 of GPTLget_memusage.c with:

Code:
pid = (int) getpid ();
if (pid <= 0) {
  fprintf (stderr, "get_memusage: pid %d is non-positive\n", pid);
  return -1;
}
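
The replacement keeps only a sanity check that the PID is positive. If the original lines rejected PIDs wider than some fixed number of digits (an assumption, since the original lines are not shown in this thread), large PIDs on the compute nodes could trigger the failure: the UCX warning above appears to carry a seven-digit process ID (3298996), assuming the usual [host:pid:thread] prefix. A rough sketch for checking how large PIDs can get on a node follows; pid_digits is a hypothetical helper.

```shell
#!/bin/sh
# pid_digits: hypothetical helper; prints the number of characters
# (decimal digits) in the PID passed as the first argument.
pid_digits() {
    echo "${#1}"
}

echo "this shell: pid=$$ digits=$(pid_digits "$$")"

# Linux exposes the upper bound on PIDs here; skip quietly elsewhere.
if [ -r /proc/sys/kernel/pid_max ]; then
    max=$(cat /proc/sys/kernel/pid_max)
    echo "pid_max=$max digits=$(pid_digits "$max")"
fi
```

As with the /proc check, the values on a compute node are the ones that matter, so run it through the batch system.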
 