Issue installing on CentOS 8 with Slurm and Lmod

william.wilson

William Wilson
New Member
case.setup cannot find our module command. We use bash primarily on our system. As noted in the title, we are on CentOS 8 using Slurm for our scheduler and Lmod for modules.

When we run create_newcase we do not get any errors. This is the last bit of output:

*********************************************************************************************************************************
This compset and grid combination is not scientifically supported, however it is used in 10 tests.
*********************************************************************************************************************************

Using project from config_machines.xml: none
No charge_account info available, using value from PROJECT
cesm model version found: cesm2.2.0
Batch_system_type is slurm
job is case.run USER_REQUESTED_WALLTIME None USER_REQUESTED_QUEUE None WALLTIME_FORMAT %H:%M:%S
job is case.st_archive USER_REQUESTED_WALLTIME None USER_REQUESTED_QUEUE None WALLTIME_FORMAT %H:%M:%S
Creating Case directory /scratch/wew/cesmtest

When we then run case.setup, it cannot find the module command:

ERROR: module command None load openmpi/3.1.6 netcdf-c/4.7.4 anaconda2/2019.10 failed with message:
/bin/sh: None: command not found

In config_machines.xml we have an entry for our machine as follows:

<machine MACH="monsoon">
<DESC>
Example port to centos8 linux system with gcc, netcdf, pnetcdf and mpich
using modules from Environment Modules – A Great Tool for Clusters » ADMIN Magazine
</DESC>
<NODENAME_REGEX>cn*</NODENAME_REGEX>
<OS>LINUX</OS>
<PROXY> </PROXY>
<COMPILERS>gnu</COMPILERS>
<MPILIBS>openmpi</MPILIBS>
<PROJECT>none</PROJECT>
<SAVE_TIMING_DIR> </SAVE_TIMING_DIR>
<CIME_OUTPUT_ROOT>/scratch/$USER/cesm/scratch</CIME_OUTPUT_ROOT>
<DIN_LOC_ROOT>/common/contrib/cesm/inputdata</DIN_LOC_ROOT>
<DIN_LOC_ROOT_CLMFORC>common/contrib/cesm/inputdata/lmwg</DIN_LOC_ROOT_CLMFORC>
<DOUT_S_ROOT>/common/contrib/cesm/archive/$CASE</DOUT_S_ROOT>
<BASELINE_ROOT>/common/contrib/cesm/cesm_baselines</BASELINE_ROOT>
<CCSM_CPRNC>/scratch/$USER/cesm2/tools/cime/tools/cprnc/cprnc</CCSM_CPRNC>
<GMAKE>make</GMAKE>
<GMAKE_J>8</GMAKE_J>
<BATCH_SYSTEM>slurm</BATCH_SYSTEM>
<SUPPORTED_BY>hpcsupport -at- nau.edu</SUPPORTED_BY>
<MAX_TASKS_PER_NODE>8</MAX_TASKS_PER_NODE>
<MAX_MPITASKS_PER_NODE>8</MAX_MPITASKS_PER_NODE>
<PROJECT_REQUIRED>FALSE</PROJECT_REQUIRED>
<mpirun mpilib="openmpi" compiler="gnu">
<executable>mpirun</executable>
<arguments>
<arg name="ntasks"> -np {{ total_tasks }} </arg>
</arguments>
</mpirun>
<module_system type="module">
<init_path lang="bash">/packages/lmod/lmod/init/bash</init_path>
<cmd_path lang="bash">module</cmd_path>
<modules compiler="gnu">
<command name="load">openmpi/3.1.6</command>
<command name="load">netcdf-c/4.7.4</command>
<command name="load">anaconda2/2019.10</command>
</modules>
</module_system>

<environment_variables>
<env name="OMP_STACKSIZE">256M</env>
<env name="MODULEPATH">/packages/modulefiles</env>
<env name="TMPDIR">/tmp/$SLURM_JOB_USER</env>
<env name="JOBDIR">$ENV{TMPDIR}/$SLURM_JOB_ID</env>
</environment_variables>
<resource_limits>
<resource name="RLIMIT_STACK">-1</resource>
</resource_limits>
</machine>

We also tried adding entries for csh and sh, like the one for bash, and it made no difference. What do we need to look at to take care of the module issue? Thanks.
 

jedwards

CSEG and Liaisons
Staff member
CIME uses the Python interface to modules: you must have lang="python" init_path and cmd_path entries.
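For an Lmod install like the one shown above, the Python hooks normally live next to the bash init script; a minimal sketch of the extra entries, assuming the usual Lmod layout (check the paths against your install):

```xml
<module_system type="module">
  <init_path lang="bash">/packages/lmod/lmod/init/bash</init_path>
  <!-- assumed location of Lmod's python bindings; adjust to your install -->
  <init_path lang="python">/packages/lmod/lmod/init/env_modules_python.py</init_path>
  <cmd_path lang="bash">module</cmd_path>
  <cmd_path lang="python">module</cmd_path>
  ...
</module_system>
```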
 

william.wilson

William Wilson
New Member
Thanks, that helped get past the module issue. Now we're getting an environment-variable issue. Currently we have this set:
<environment_variables>
<env name="OMP_STACKSIZE">256M</env>
<env name="MODULEPATH">/packages/modulefiles</env>
<env name="TMPDIR">/tmp/$USER</env>
<env name="JOBDIR">$ENV{TMPDIR}/$SLURM_JOB_ID</env>
</environment_variables>

Getting this error:
ERROR: Undefined env var 'TMPDIR'
 

jedwards

CSEG and Liaisons
Staff member
$USER and $SLURM_JOB_ID are both environment variables and should use the $ENV{} syntax.
You shouldn't need MODULEPATH here.
If you still get the error after making these changes you may need to do
<env name="JOBDIR">/tmp/$ENV{USER}/$ENV{SLURM_JOB_ID}</env>
 

william.wilson

William Wilson
New Member
I appreciate your help. I think we are getting closer. So I have the following just to simplify things:
<environment_variables>
<env name="OMP_STACKSIZE">256M</env>
<env name="TMPDIR">/tmp/$USER</env>
<env name="JOBDIR">/scratch/$ENV{USER}</env>
</environment_variables>

Now we're getting a compile error. I should note we are using anaconda3; the Python version is 3.8.5.
Setting Environment OMP_STACKSIZE=256M
Setting Environment TMPDIR=/tmp/wew
Setting Environment JOBDIR=/scratch/wew
Setting resource.RLIMIT_STACK to -1 from (-1, -1)
ERROR: Command: '/packages/anaconda3/2020.11/bin/xmllint --xinclude --noout --schema /scratch/wew/cesm2/cime/config/xml_schemas/config_compilers_v2.xsd /scratch/wew/cesm2/cime/config/cesm/machines/config_compilers.xml' failed with error '/scratch/wew/cesm2/cime/config/cesm/machines/config_compilers.xml:77: element ADD_FFLAGS: Schemas validity error : Element 'ADD_FFLAGS': This element is not expected.
/scratch/wew/cesm2/cime/config/cesm/machines/config_compilers.xml fails to validate' from dir '/scratch/wew/cesmtest'
 

jedwards

CSEG and Liaisons
Staff member
The message is pretty self-explanatory:

/cesm/machines/config_compilers.xml:77: element ADD_FFLAGS: Schemas validity error : Element 'ADD_FFLAGS': This element is not expected.

Attach your config_compilers.xml if you can't figure it out.
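If the offending line is a leftover v1-style element (ADD_FFLAGS and friends), the v2 schema expects the extra flags to be nested as an append child of the base flag element instead; a hedged sketch of the change (the actual flags are whatever line 77 of your file adds):

```xml
<!-- v1-style element, rejected by config_compilers_v2.xsd -->
<!-- <ADD_FFLAGS> ...extra flags... </ADD_FFLAGS> -->

<!-- v2-style equivalent -->
<FFLAGS>
  <append> ...extra flags... </append>
</FFLAGS>
```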
 

william.wilson

William Wilson
New Member
We're a lot closer. We use Spack for software installs on our system and have netcdf-c and netcdf-fortran installed. I've added the setup to the machines file. When I run case.build I get:

/scratch/wew/cesm/scratch/cesmtest/bld/glc/lib//libglimmercismfortran.a(dgmres.f.o): In function `dgmres_':
dgmres.f:(.text+0x23e7): undefined reference to `dcopy_'
dgmres.f:(.text+0x24cf): undefined reference to `dnrm2_'
dgmres.f:(.text+0x2599): undefined reference to `dcopy_'
collect2: error: ld returned 1 exit status
make: *** [/scratch/wew/cesmtest/Tools/Makefile:985: /scratch/wew/cesm/scratch/cesmtest/bld/cesm.exe] Error 1

My relevant portion of config_machines.xml is
<module_system type="module">
<init_path lang="bash">/packages/lmod/lmod/init/bash</init_path>
<init_path lang="python">/packages/lmod/lmod/init/env_modules_python.py</init_path>
<cmd_path lang="bash">module</cmd_path>
<cmd_path lang="python">module</cmd_path>
<modules compiler="gnu">
<command name="load">openmpi/3.1.6</command>
<command name="load">netcdf-c/4.7.4</command>
<command name="load">netcdf-fortran/4.5.3-gtyy5o4</command>
<command name="load">anaconda2/2019.10</command>
</modules>
</module_system>

<environment_variables>
<env name="OMP_STACKSIZE">256M</env>
<env name="NETCDF_C_PATH">/packages/gcc-8.3.1/netcdf-c/4.7.4-opdm2fw</env>
<env name="NETCDF_FORTRAN_PATH">/packages/openmpi-3.1/netcdf-fortran/4.5.3-gtyy5o4</env>
<env name="TMPDIR">/tmp/$USER</env>
<env name="JOBDIR">/scratch/$ENV{USER}</env>
</environment_variables>
 

william.wilson

William Wilson
New Member
Still having issues at case.submit. We're still getting the following despite having lapack and lapack-devel installed; I even installed the latest lapack.

/scratch/wew/cesm/scratch/cesmtest/bld/glc/lib//libglimmercismfortran.a(dgmres.f.o): In function `dgmres_':
dgmres.f:(.text+0x23e7): undefined reference to `dcopy_'
dgmres.f:(.text+0x24cf): undefined reference to `dnrm2_'
dgmres.f:(.text+0x2599): undefined reference to `dcopy_'
collect2: error: ld returned 1 exit status
make: *** [/scratch/wew/cesmtest/Tools/Makefile:985: /scratch/wew/cesm/scratch/cesmtest/bld/cesm.exe] Error 1

Current LD_LIBRARY_PATH is /packages/openmpi/3.1.6/lib:/packages/openmpi/3.1.6/lib64:/packages/lapack/3.9.0/lib64

From config_machines.xml

<module_system type="module">
<init_path lang="bash">/packages/lmod/lmod/init/bash</init_path>
<init_path lang="python">/packages/lmod/lmod/init/env_modules_python.py</init_path>
<cmd_path lang="bash">module</cmd_path>
<cmd_path lang="python">module</cmd_path>
<modules compiler="gnu">
<command name="load">openmpi/3.1.6</command>
<command name="load">netcdf-c/4.7.4</command>
<command name="load">lapack/3.9.0</command>
<command name="load">netcdf-fortran/4.5.3-gtyy5o4</command>
<command name="load">anaconda3/2020.11</command>
</modules>
</module_system>

<environment_variables>
<env name="OMP_STACKSIZE">256M</env>
<env name="NETCDF_C_PATH">/packages/gcc-8.3.1/netcdf-c/4.7.4-opdm2fw</env>
<env name="NETCDF_FORTRAN_PATH">/packages/openmpi-3.1/netcdf-fortran/4.5.3-gtyy5o4</env>
<env name="TMPDIR">/tmp/$USER</env>
<env name="JOBDIR">/scratch/$ENV{USER}</env>
</environment_variables>
 

jedwards

CSEG and Liaisons
Staff member
Have you added the lapack and blas libraries to SLIBS in config_compilers.xml?
<SLIBS>
<append> -llapack -lblas </append>
</SLIBS>
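For reference, that append sits inside your machine/compiler block in config_compilers.xml; a sketch, assuming the lapack location from the LD_LIBRARY_PATH shown above (adjust -L to wherever liblapack and libblas actually live):

```xml
<compiler COMPILER="gnu" MACH="monsoon">
  <SLIBS>
    <!-- library path assumed from the LD_LIBRARY_PATH in the previous post -->
    <append> -L/packages/lapack/3.9.0/lib64 -llapack -lblas </append>
  </SLIBS>
</compiler>
```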
 

jedwards

CSEG and Liaisons
Staff member
The file in your post does not exist on the server - what are you doing that generates that error?

We support several methods for accessing our inputdata server; see the file cime/config/cesm/config_inputdata.xml
We do have a fix for the wget issue. You can get it here: ESMCI/cime
 

william.wilson

William Wilson
New Member
I went through the following procedure:
created a new case
did ./case.setup
did ./case.build
did ./case.submit and got the svn errors.

Also getting an FTP error when paging back through the output.


Trying to download file: 'lmwg/atm_forcing.datm7.GSWP3.0.5d.v1.c170516/Precip/clmforc.GSWP3.c2011.0.5x0.5.Prec.1909-05.nc' to path '/scratch/wew/cesminput/lmwg/atm_forcing.datm7.GSWP3.0.5d.v1.c170516/Precip/clmforc.GSWP3.c2011.0.5x0.5.Prec.1909-05.nc' using FTP protocol.
ERROR from ftp server, trying next server

Also for wget.


Model datm missing file file220 = '/scratch/wew/cesminput/lmwg/atm_forcing.datm7.GSWP3.0.5d.v1.c170516/TPHWL/clmforc.GSWP3.c2011.0.5x0.5.TPQWL.1919-04.nc'
Trying to download file: 'lmwg/atm_forcing.datm7.GSWP3.0.5d.v1.c170516/TPHWL/clmforc.GSWP3.c2011.0.5x0.5.TPQWL.1919-04.nc' to path '/scratch/wew/cesminput/lmwg/atm_forcing.datm7.GSWP3.0.5d.v1.c170516/TPHWL/clmforc.GSWP3.c2011.0.5x0.5.TPQWL.1919-04.nc' using WGET protocol.
wget failed with output: and errput --2021-02-09 13:33:32-- ftp://ftp.cgd.ucar.edu/cesm/inputdata/lmwg/atm_forcing.datm7.GSWP3.0.5d.v1.c170516/TPHWL/clmforc.GSWP3.c2011.0.5x0.5.TPQWL.1919-04.nc
=> ‘/scratch/wew/cesminput/lmwg/atm_forcing.datm7.GSWP3.0.5d.v1.c170516/TPHWL/clmforc.GSWP3.c2011.0.5x0.5.TPQWL.1919-04.nc’

The professor asking for the software to be set up gave me the following for a test case:

scripts/create_newcase --case /scratch/$USER/cesmtest --res f09_g17 --compset I1850Clm50Sp --machine monsoon -i <input directory to be used>

$USER is of course the userid of the person in question. He gave the compset and the res. If there is a better compset and res to use for a quick test, please let me know.
 

erik

Erik Kluzek
CSEG and Liaisons
Staff member
William, sorry about the trouble you are running into.

We talk about some of these issues in the CTSM User's Guide here...

There's also a cime issue that relates to this...

But in your specific case you need to change your definition of DIN_LOC_ROOT_CLMFORC. As discussed above, on Cheyenne we had to put the forcing data on a different disk (and hence directory) than the rest of the inputdata because of a disk-space limitation. That is what leads to the data-download problems you are running into. In your case, however, there is no reason to put your DIN_LOC_ROOT_CLMFORC in a different directory.

You have defined it here...

<DIN_LOC_ROOT_CLMFORC>common/contrib/cesm/inputdata/lmwg</DIN_LOC_ROOT_CLMFORC>

Change it so that it points to this...

<DIN_LOC_ROOT_CLMFORC>$DIN_LOC_ROOT/atm/datm7</DIN_LOC_ROOT_CLMFORC>

That should get the paths to line up so it can download the data. As discussed in the first link I shared above, I recommend that you only download the forcing data a few years at a time. It's likely going to take days to download, so you don't want to do it all at once.
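One hedged way to limit the download from the case directory is to narrow the forcing years before fetching anything; the variable names below assume the standard GSWP3 datm streams, so check env_run.xml in your case for the exact names:

```bash
# restrict the GSWP3 forcing streams to a short span of years
./xmlchange DATM_CLMNCEP_YR_START=1901,DATM_CLMNCEP_YR_END=1905

# regenerate the namelists, then let CIME fetch only the files the case now needs
./preview_namelists
./check_input_data --download
```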

Let us know if you continue to have trouble.
 

jonwells04

Jon Wells
New Member
Hi Erik,

Thanks so much for helping with the machine setup. I'm going to post what we've done so far and our current status.

The DIN_LOC_ROOT_CLMFORC setting you describe above allowed us to start downloading the data.

Once we could access the forcing data, we ran into the same error as the following post:
Problem in downloading input data when submit the case(error: 'UNSET/atm_forcing.datm7..')

We deleted the same aerosol deposition file, redownloaded it, and the "ERROR: (shr_stream_verifyTCoord) ERROR: calendar dates must be increasing" went away.

A new problem arose running clm5: ./create_newcase --compset I1850Clm50BgcCropG --res f45_g37 --machine monsoon --case /scratch/$USER/ctsm_test --run-unsupported

The case would submit and run for about 3-4 minutes and then error out. The run.ctsm_test file showed the following error:
run command is mpirun -np 8 /scratch/jw2636/cesm/scratch/ctsm_test/bld/cesm.exe >> cesm.log.$LID 2>&1
ERROR: RUN FAIL: Command 'mpirun -np 8 /scratch/jw2636/cesm/scratch/ctsm_test/bld/cesm.exe >> cesm.log.$LID 2>&1 ' failed
See log file for details: /scratch/jw2636/cesm/scratch/ctsm_test/run/cesm.log.37361198.210222-221913
slurmstepd: error: Detected 1 oom-kill event(s) in step 37361198.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

The cesm log is attached and shows strange "netCDF: Invalid dimension ID or name" and "netCDF: Variable not found" messages but no obvious error message as far as I could tell.

I followed the following post and used ./case.submit --resubmit-immediate to try to get past the memory error:
Resubmit memory failure

But ./case.submit --resubmit-immediate didn't work (maybe I did it wrong?).

The other suggestion in the above thread was to ssh back to the login node before resubmitting the job, so I added the following to config_batch.xml, based on the stampede-skx example:
<batch_system MACH="monsoon" type="slurm" >
<batch_submit>ssh monsoon.hpc.nau.edu cd $CASEROOT ; sbatch</batch_submit>
<submit_args>
<arg flag="--time" name="$JOB_WALLCLOCK_TIME"/>
<arg flag="-p" name="$JOB_QUEUE"/>
</submit_args>
<queues>
<queue walltimemax="48:00:00" nodemin="1" nodemax="256" default="true">normal</queue>
<queue walltimemax="02:00:00" nodemin="1" nodemax="8" >dev</queue>
</queues>
</batch_system>

After recreating the case, case.setup, case.build, and case.submit --resubmit-immediate:
Submitting job script ssh monsoon.hpc.nau.edu 'cd /scratch/jw2636/ctsm_test ; sbatch --time 48:00:00 -p normal .case.run --completion-sets-continue-run'
jw2636@monsoon.hpc.nau.edu's password:
ERROR: Command: 'ssh monsoon.hpc.nau.edu 'cd /scratch/jw2636/ctsm_test ; sbatch --time 48:00:00 -p normal .case.run --completion-sets-continue-run'' failed with error 'sbatch: error: invalid partition specified: normal
sbatch: error: Batch job submission failed: Invalid partition name specified' from dir '/scratch/jw2636/ctsm_test'

William, what are the correct partition names on Monsoon/Slurm (to replace normal/dev in the above example), and can we avoid the password prompt if sshing back to the login node is the way to go?

Erik, can you think of any other workaround, or a config file that maybe hasn't been set up correctly, that would cause Slurm's out-of-memory handler to kill jobs and keep Slurm from resubmitting? I attached our machines folder as well.

Thanks!
 

Attachments

  • cesm.log.12500.201120-104804 (1).txt (12.5 KB)
  • machines.zip (228.4 KB)

jedwards

CSEG and Liaisons
Staff member
I think that you need to remove the ssh from batch_submit and just use sbatch in that line.

If you really think you need ssh there, you should be able to do
```
ssh monsoon.hpc.nau.edu date
```
and get a date back without a password prompt.
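That would reduce the block to something like this sketch, with the queue names left as placeholders since "normal" was rejected (replace them with the partitions actually defined on Monsoon):

```xml
<batch_system MACH="monsoon" type="slurm">
  <batch_submit>sbatch</batch_submit>
  <submit_args>
    <arg flag="--time" name="$JOB_WALLCLOCK_TIME"/>
    <arg flag="-p" name="$JOB_QUEUE"/>
  </submit_args>
  <queues>
    <!-- placeholder queue names; use the partition names sinfo reports on Monsoon -->
    <queue walltimemax="48:00:00" nodemin="1" nodemax="256" default="true">normal</queue>
    <queue walltimemax="02:00:00" nodemin="1" nodemax="8">dev</queue>
  </queues>
</batch_system>
```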
 