Issue installing on CentOS 8 with Slurm and Lmod

william.wilson

William Wilson
New Member
case.setup cannot find our module command. We use bash primarily on our system. As noted in the title, we are on CentOS 8 using Slurm for our scheduler and Lmod for modules.

When we run create_newcase we do not get any errors. This is the last bit of output:

*********************************************************************************************************************************
This compset and grid combination is not scientifically supported, however it is used in 10 tests.
*********************************************************************************************************************************

Using project from config_machines.xml: none
No charge_account info available, using value from PROJECT
cesm model version found: cesm2.2.0
Batch_system_type is slurm
job is case.run USER_REQUESTED_WALLTIME None USER_REQUESTED_QUEUE None WALLTIME_FORMAT %H:%M:%S
job is case.st_archive USER_REQUESTED_WALLTIME None USER_REQUESTED_QUEUE None WALLTIME_FORMAT %H:%M:%S
Creating Case directory /scratch/wew/cesmtest

When we then run case.setup, it cannot find the module command:

ERROR: module command None load openmpi/3.1.6 netcdf-c/4.7.4 anaconda2/2019.10 failed with message:
/bin/sh: None: command not found

In config_machines.xml we have an entry for our machine as follows:

<machine MACH="monsoon">
<DESC>
Example port to centos8 linux system with gcc, netcdf, pnetcdf and mpich
using modules from Environment Modules – A Great Tool for Clusters » ADMIN Magazine
</DESC>
<NODENAME_REGEX>cn*</NODENAME_REGEX>
<OS>LINUX</OS>
<PROXY> </PROXY>
<COMPILERS>gnu</COMPILERS>
<MPILIBS>openmpi</MPILIBS>
<PROJECT>none</PROJECT>
<SAVE_TIMING_DIR> </SAVE_TIMING_DIR>
<CIME_OUTPUT_ROOT>/scratch/$USER/cesm/scratch</CIME_OUTPUT_ROOT>
<DIN_LOC_ROOT>/common/contrib/cesm/inputdata</DIN_LOC_ROOT>
<DIN_LOC_ROOT_CLMFORC>common/contrib/cesm/inputdata/lmwg</DIN_LOC_ROOT_CLMFORC>
<DOUT_S_ROOT>/common/contrib/cesm/archive/$CASE</DOUT_S_ROOT>
<BASELINE_ROOT>/common/contrib/cesm/cesm_baselines</BASELINE_ROOT>
<CCSM_CPRNC>/scratch/$USER/cesm2/tools/cime/tools/cprnc/cprnc</CCSM_CPRNC>
<GMAKE>make</GMAKE>
<GMAKE_J>8</GMAKE_J>
<BATCH_SYSTEM>slurm</BATCH_SYSTEM>
<SUPPORTED_BY>hpcsupport -at- nau.edu</SUPPORTED_BY>
<MAX_TASKS_PER_NODE>8</MAX_TASKS_PER_NODE>
<MAX_MPITASKS_PER_NODE>8</MAX_MPITASKS_PER_NODE>
<PROJECT_REQUIRED>FALSE</PROJECT_REQUIRED>
<mpirun mpilib="openmpi" compiler="gnu">
<executable>mpirun</executable>
<arguments>
<arg name="ntasks"> -np {{ total_tasks }} </arg>
</arguments>
</mpirun>
<module_system type="module">
<init_path lang="bash">/packages/lmod/lmod/init/bash</init_path>
<cmd_path lang="bash">module</cmd_path>
<modules compiler="gnu">
<command name="load">openmpi/3.1.6</command>
<command name="load">netcdf-c/4.7.4</command>
<command name="load">anaconda2/2019.10</command>
</modules>
</module_system>

<environment_variables>
<env name="OMP_STACKSIZE">256M</env>
<env name="MODULEPATH">/packages/modulefiles</env>
<env name="TMPDIR">/tmp/$SLURM_JOB_USER</env>
<env name="JOBDIR">$ENV{TMPDIR}/$SLURM_JOB_ID</env>
</environment_variables>
<resource_limits>
<resource name="RLIMIT_STACK">-1</resource>
</resource_limits>
</machine>

We also tried adding entries for csh and sh, like the one for bash, and it made no difference. What do we need to look at to take care of the module issue? Thanks.
 

jedwards

CSEG and Liaisons
Staff member
CIME uses the Python interface to modules: you must have lang="python" init_path and cmd_path entries.
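For an Lmod install like the one shown above, the Python hooks normally live next to the bash init script; a minimal sketch of the extra entries, assuming the usual Lmod layout (check the paths against your install):

```xml
<module_system type="module">
  <init_path lang="bash">/packages/lmod/lmod/init/bash</init_path>
  <!-- assumed location of Lmod's python bindings; adjust to your install -->
  <init_path lang="python">/packages/lmod/lmod/init/env_modules_python.py</init_path>
  <cmd_path lang="bash">module</cmd_path>
  <cmd_path lang="python">module</cmd_path>
  ...
</module_system>
```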
 

william.wilson

William Wilson
New Member
Thanks, that helped get past the module issue. Now we're getting an environment-variable issue. Currently we have this set:
<environment_variables>
<env name="OMP_STACKSIZE">256M</env>
<env name="MODULEPATH">/packages/modulefiles</env>
<env name="TMPDIR">/tmp/$USER</env>
<env name="JOBDIR">$ENV{TMPDIR}/$SLURM_JOB_ID</env>
</environment_variables>

Getting this error:
ERROR: Undefined env var 'TMPDIR'
 

jedwards

CSEG and Liaisons
Staff member
$USER and $SLURM_JOB_ID are both environment variables and should use the $ENV{} syntax.
You shouldn't need MODULEPATH here.
If you still get the error after making these changes you may need to do
<env name="JOBDIR">/tmp/$ENV{USER}/$ENV{SLURM_JOB_ID}</env>
 

william.wilson

William Wilson
New Member
I appreciate your help. I think we are getting closer. So I have the following just to simplify things:
<environment_variables>
<env name="OMP_STACKSIZE">256M</env>
<env name="TMPDIR">/tmp/$USER</env>
<env name="JOBDIR">/scratch/$ENV{USER}</env>
</environment_variables>

Now we're getting a compile error. I should note we are using anaconda3; the Python version is 3.8.5.
Setting Environment OMP_STACKSIZE=256M
Setting Environment TMPDIR=/tmp/wew
Setting Environment JOBDIR=/scratch/wew
Setting resource.RLIMIT_STACK to -1 from (-1, -1)
ERROR: Command: '/packages/anaconda3/2020.11/bin/xmllint --xinclude --noout --schema /scratch/wew/cesm2/cime/config/xml_schemas/config_compilers_v2.xsd /scratch/wew/cesm2/cime/config/cesm/machines/config_compilers.xml' failed with error '/scratch/wew/cesm2/cime/config/cesm/machines/config_compilers.xml:77: element ADD_FFLAGS: Schemas validity error : Element 'ADD_FFLAGS': This element is not expected.
/scratch/wew/cesm2/cime/config/cesm/machines/config_compilers.xml fails to validate' from dir '/scratch/wew/cesmtest'
 

jedwards

CSEG and Liaisons
Staff member
The message is pretty self-explanatory:

/cesm/machines/config_compilers.xml:77: element ADD_FFLAGS: Schemas validity error : Element 'ADD_FFLAGS': This element is not expected.

Attach your config_compilers.xml if you can't figure it out.
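If the offending line is a leftover v1-style element (ADD_FFLAGS and friends), the v2 schema expects the extra flags to be nested as an append child of the base flag element instead; a hedged sketch of the change (the actual flags are whatever line 77 of your file adds):

```xml
<!-- v1-style element, rejected by config_compilers_v2.xsd -->
<!-- <ADD_FFLAGS> ...extra flags... </ADD_FFLAGS> -->

<!-- v2-style equivalent -->
<FFLAGS>
  <append> ...extra flags... </append>
</FFLAGS>
```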
 

william.wilson

William Wilson
New Member
We're a lot closer. We use Spack for software installs on our system and have netcdf-c and netcdf-fortran installed. I've added the setup to the machines file. When I run case.build I get:

/scratch/wew/cesm/scratch/cesmtest/bld/glc/lib//libglimmercismfortran.a(dgmres.f.o): In function `dgmres_':
dgmres.f:(.text+0x23e7): undefined reference to `dcopy_'
dgmres.f:(.text+0x24cf): undefined reference to `dnrm2_'
dgmres.f:(.text+0x2599): undefined reference to `dcopy_'
collect2: error: ld returned 1 exit status
make: *** [/scratch/wew/cesmtest/Tools/Makefile:985: /scratch/wew/cesm/scratch/cesmtest/bld/cesm.exe] Error 1

My relevant portion of config_machines.xml is
<module_system type="module">
<init_path lang="bash">/packages/lmod/lmod/init/bash</init_path>
<init_path lang="python">/packages/lmod/lmod/init/env_modules_python.py</init_path>
<cmd_path lang="bash">module</cmd_path>
<cmd_path lang="python">module</cmd_path>
<modules compiler="gnu">
<command name="load">openmpi/3.1.6</command>
<command name="load">netcdf-c/4.7.4</command>
<command name="load">netcdf-fortran/4.5.3-gtyy5o4</command>
<command name="load">anaconda2/2019.10</command>
</modules>
</module_system>

<environment_variables>
<env name="OMP_STACKSIZE">256M</env>
<env name="NETCDF_C_PATH">/packages/gcc-8.3.1/netcdf-c/4.7.4-opdm2fw</env>
<env name="NETCDF_FORTRAN_PATH">/packages/openmpi-3.1/netcdf-fortran/4.5.3-gtyy5o4</env>
<env name="TMPDIR">/tmp/$USER</env>
<env name="JOBDIR">/scratch/$ENV{USER}</env>
</environment_variables>
 

william.wilson

William Wilson
New Member
Still having issues at case.submit. We're still getting the following despite having lapack and lapack-devel installed; I even installed the latest lapack.

/scratch/wew/cesm/scratch/cesmtest/bld/glc/lib//libglimmercismfortran.a(dgmres.f.o): In function `dgmres_':
dgmres.f:(.text+0x23e7): undefined reference to `dcopy_'
dgmres.f:(.text+0x24cf): undefined reference to `dnrm2_'
dgmres.f:(.text+0x2599): undefined reference to `dcopy_'
collect2: error: ld returned 1 exit status
make: *** [/scratch/wew/cesmtest/Tools/Makefile:985: /scratch/wew/cesm/scratch/cesmtest/bld/cesm.exe] Error 1

Current LD_LIBRARY_PATH is /packages/openmpi/3.1.6/lib:/packages/openmpi/3.1.6/lib64:/packages/lapack/3.9.0/lib64

From config_machines.xml

<module_system type="module">
<init_path lang="bash">/packages/lmod/lmod/init/bash</init_path>
<init_path lang="python">/packages/lmod/lmod/init/env_modules_python.py</init_path>
<cmd_path lang="bash">module</cmd_path>
<cmd_path lang="python">module</cmd_path>
<modules compiler="gnu">
<command name="load">openmpi/3.1.6</command>
<command name="load">netcdf-c/4.7.4</command>
<command name="load">lapack/3.9.0</command>
<command name="load">netcdf-fortran/4.5.3-gtyy5o4</command>
<command name="load">anaconda3/2020.11</command>
</modules>
</module_system>

<environment_variables>
<env name="OMP_STACKSIZE">256M</env>
<env name="NETCDF_C_PATH">/packages/gcc-8.3.1/netcdf-c/4.7.4-opdm2fw</env>
<env name="NETCDF_FORTRAN_PATH">/packages/openmpi-3.1/netcdf-fortran/4.5.3-gtyy5o4</env>
<env name="TMPDIR">/tmp/$USER</env>
<env name="JOBDIR">/scratch/$ENV{USER}</env>
</environment_variables>
 

jedwards

CSEG and Liaisons
Staff member
Have you added the lapack and blas libraries to SLIBS in config_compilers.xml?
<SLIBS>
<append> -llapack -lblas </append>
</SLIBS>
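For reference, that append sits inside your machine/compiler block in config_compilers.xml; a sketch, assuming the lapack location from the LD_LIBRARY_PATH shown above (adjust -L to wherever liblapack and libblas actually live):

```xml
<compiler COMPILER="gnu" MACH="monsoon">
  <SLIBS>
    <!-- library path assumed from the LD_LIBRARY_PATH in the previous post -->
    <append> -L/packages/lapack/3.9.0/lib64 -llapack -lblas </append>
  </SLIBS>
</compiler>
```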
 

jedwards

CSEG and Liaisons
Staff member
The file in your post does not exist on the server - what are you doing that generates that error?

We support several methods for accessing our inputdata server; see the file cime/config/cesm/config_inputdata.xml
We do have a fix for the wget issue. You can get it here: ESMCI/cime
 

william.wilson

William Wilson
New Member
I went through the following procedure:
created a new case
did ./case.setup
did ./case.build
did ./case.submit and got the svn errors.

Also getting an FTP error when paging back through the output.


Trying to download file: 'lmwg/atm_forcing.datm7.GSWP3.0.5d.v1.c170516/Precip/clmforc.GSWP3.c2011.0.5x0.5.Prec.1909-05.nc' to path '/scratch/wew/cesminput/lmwg/atm_forcing.datm7.GSWP3.0.5d.v1.c170516/Precip/clmforc.GSWP3.c2011.0.5x0.5.Prec.1909-05.nc' using FTP protocol.
ERROR from ftp server, trying next server

Also for wget.


Model datm missing file file220 = '/scratch/wew/cesminput/lmwg/atm_forcing.datm7.GSWP3.0.5d.v1.c170516/TPHWL/clmforc.GSWP3.c2011.0.5x0.5.TPQWL.1919-04.nc'
Trying to download file: 'lmwg/atm_forcing.datm7.GSWP3.0.5d.v1.c170516/TPHWL/clmforc.GSWP3.c2011.0.5x0.5.TPQWL.1919-04.nc' to path '/scratch/wew/cesminput/lmwg/atm_forcing.datm7.GSWP3.0.5d.v1.c170516/TPHWL/clmforc.GSWP3.c2011.0.5x0.5.TPQWL.1919-04.nc' using WGET protocol.
wget failed with output: and errput --2021-02-09 13:33:32-- ftp://ftp.cgd.ucar.edu/cesm/inputdata/lmwg/atm_forcing.datm7.GSWP3.0.5d.v1.c170516/TPHWL/clmforc.GSWP3.c2011.0.5x0.5.TPQWL.1919-04.nc
=> ‘/scratch/wew/cesminput/lmwg/atm_forcing.datm7.GSWP3.0.5d.v1.c170516/TPHWL/clmforc.GSWP3.c2011.0.5x0.5.TPQWL.1919-04.nc’

The professor asking for the software to be set up gave me the following for a test case:

scripts/create_newcase --case /scratch/$USER/cesmtest --res f09_g17 --compset I1850Clm50Sp --machine monsoon -i <input directory to be used>

$USER is of course the userid of the person in question. He gave the compset and the res. If there is a better compset and res to use for a quick test, please let me know.
 

erik

Erik Kluzek
CSEG and Liaisons
Staff member
William, sorry about the trouble you are running into.

We talk about some of these issues in the CTSM User's Guide here...

There's also a cime issue that relates to this...

But in your specific case you need to change your definition of DIN_LOC_ROOT_CLMFORC. As discussed above, on Cheyenne we had to put the forcing data on a different disk (and hence directory) than the rest of the inputdata because of a disk-space limitation. That is what leads to the data-download problems you are running into. In your case, however, there is no reason to put your DIN_LOC_ROOT_CLMFORC in a different directory.

You have defined it here...

<DIN_LOC_ROOT_CLMFORC>common/contrib/cesm/inputdata/lmwg</DIN_LOC_ROOT_CLMFORC>

Change it so that it points to this...

<DIN_LOC_ROOT_CLMFORC>$DIN_LOC_ROOT/atm/datm7</DIN_LOC_ROOT_CLMFORC>

That should get the paths to line up so it can download the data. As discussed in the first link I shared above, I recommend that you only download the forcing data a few years at a time. It's likely going to take days to download, so you don't want to do it all at once.
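One hedged way to limit the download from the case directory is to narrow the forcing years before fetching anything; the variable names below assume the standard GSWP3 datm streams, so check env_run.xml in your case for the exact names:

```bash
# restrict the GSWP3 forcing streams to a short span of years
./xmlchange DATM_CLMNCEP_YR_START=1901,DATM_CLMNCEP_YR_END=1905

# regenerate the namelists, then let CIME fetch only the files the case now needs
./preview_namelists
./check_input_data --download
```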

Let us know if you continue to have trouble.
 

jonwells04

Jon Wells
New Member
Hi Erik,

Thanks so much for helping with the machine setup. I'm going to post what we've done so far and our current status.

The DIN_LOC_ROOT_CLMFORC setting you describe above allowed us to start downloading the data.

Once we could access the forcing data, we ran into the same error as the following post:
Problem in downloading input data when submit the case(error: 'UNSET/atm_forcing.datm7..')

We deleted the same aerosol deposition file, redownloaded it, and the "ERROR: (shr_stream_verifyTCoord) ERROR: calendar dates must be increasing" went away.

A new problem arose running clm5: ./create_newcase --compset I1850Clm50BgcCropG --res f45_g37 --machine monsoon --case /scratch/$USER/ctsm_test --run-unsupported

The case would submit and run for about 3-4 minutes and then error out. The run.ctsm_test file showed the following error:
run command is mpirun -np 8 /scratch/jw2636/cesm/scratch/ctsm_test/bld/cesm.exe >> cesm.log.$LID 2>&1
ERROR: RUN FAIL: Command 'mpirun -np 8 /scratch/jw2636/cesm/scratch/ctsm_test/bld/cesm.exe >> cesm.log.$LID 2>&1 ' failed
See log file for details: /scratch/jw2636/cesm/scratch/ctsm_test/run/cesm.log.37361198.210222-221913
slurmstepd: error: Detected 1 oom-kill event(s) in step 37361198.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

The cesm log is attached and shows strange "netCDF: Invalid dimension ID or name" and "netCDF: Variable not found" messages but no obvious error message as far as I could tell.

I followed the following post and used ./case.submit --resubmit-immediate to try to get past the memory error:
Resubmit memory failure

But ./case.submit --resubmit-immediate didn't work (maybe I did it wrong?).

The other suggestion in the above thread was to ssh back to the login node before resubmitting the job, so I added the following to config_batch.xml, based on the stampede-skx example:
<batch_system MACH="monsoon" type="slurm" >
<batch_submit>ssh monsoon.hpc.nau.edu cd $CASEROOT ; sbatch</batch_submit>
<submit_args>
<arg flag="--time" name="$JOB_WALLCLOCK_TIME"/>
<arg flag="-p" name="$JOB_QUEUE"/>
</submit_args>
<queues>
<queue walltimemax="48:00:00" nodemin="1" nodemax="256" default="true">normal</queue>
<queue walltimemax="02:00:00" nodemin="1" nodemax="8" >dev</queue>
</queues>
</batch_system>

After recreating the case, case.setup, case.build, and case.submit --resubmit-immediate:
Submitting job script ssh monsoon.hpc.nau.edu 'cd /scratch/jw2636/ctsm_test ; sbatch --time 48:00:00 -p normal .case.run --completion-sets-continue-run'
jw2636@monsoon.hpc.nau.edu's password:
ERROR: Command: 'ssh monsoon.hpc.nau.edu 'cd /scratch/jw2636/ctsm_test ; sbatch --time 48:00:00 -p normal .case.run --completion-sets-continue-run'' failed with error 'sbatch: error: invalid partition specified: normal
sbatch: error: Batch job submission failed: Invalid partition name specified' from dir '/scratch/jw2636/ctsm_test'

William, what are the correct partition names on Monsoon/Slurm (to replace normal/dev in the above example), and can we avoid the password prompt if sshing back to the login node is the way to go?

Erik, can you think of any other workaround, or a config file that maybe hasn't been set up correctly, that would cause Slurm's out-of-memory handler to kill jobs and keep Slurm from resubmitting? I attached our machines folder as well.

Thanks!
 

Attachments

  • cesm.log.12500.201120-104804 (1).txt (12.5 KB)
  • machines.zip (228.4 KB)

jedwards

CSEG and Liaisons
Staff member
I think that you need to remove the ssh from batch_submit and just use sbatch in that line.

If you really think you need ssh there, you should be able to do
```
ssh monsoon.hpc.nau.edu date
```
and get a date back without a password prompt.
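That would reduce the block to something like this sketch, with the queue names left as placeholders since "normal" was rejected (replace them with the partitions actually defined on Monsoon):

```xml
<batch_system MACH="monsoon" type="slurm">
  <batch_submit>sbatch</batch_submit>
  <submit_args>
    <arg flag="--time" name="$JOB_WALLCLOCK_TIME"/>
    <arg flag="-p" name="$JOB_QUEUE"/>
  </submit_args>
  <queues>
    <!-- placeholder queue names; use the partition names sinfo reports on Monsoon -->
    <queue walltimemax="48:00:00" nodemin="1" nodemax="256" default="true">normal</queue>
    <queue walltimemax="02:00:00" nodemin="1" nodemax="8">dev</queue>
  </queues>
</batch_system>
```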
 