Issue installing on CentOS 8 with Slurm and Lmod

jonwells04

Jon Wells
New Member
We ran scripts_regression_tests.py and found several failures (results attached). The good news is that most things seem to work.

The errors center on CMakeMacro and CXX compilation:
CMake Error in CMakeLists.txt:
No CMAKE_CXX_COMPILER could be found.
Tell CMake where to find the compiler by setting either the environment
variable "CXX" or the CMake cache entry CMAKE_CXX_COMPILER to the full path
to the compiler, or to the compiler name if it is in the PATH.

We tried several iterations of adding the path but still receive the same errors (attached).
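For anyone who lands on this thread with the same message, a quick sanity check before editing the XML is to confirm that the compilers named in config_compilers.xml are actually visible in the shell CIME builds from (a minimal sketch, assuming the GNU toolchain and MPI wrappers come from Lmod modules; the module names are only examples, not values from this thread):

$ which g++ mpicxx                 # the SCXX / MPICXX entries below
$ module list                      # check what is currently loaded
$ module avail gcc openmpi         # hypothetical module names -- adjust to your site
$ export CXX=$(which g++)          # CMake also honors the CXX environment variable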
Here are our current config_compilers.xml settings:
<compiler COMPILER="gnu">
<CFLAGS>
<base> -std=gnu99 </base>
<append compile_threaded="TRUE"> -fopenmp </append>
<append DEBUG="TRUE"> -g -Wall -Og -fbacktrace -ffpe-trap=invalid,zero,overflow -fcheck=bounds </append>
<append DEBUG="FALSE"> -O </append>
</CFLAGS>
<CPPDEFS>
<!-- Top (The GNU Fortran Compiler) -->
<append> -DFORTRANUNDERSCORE -DNO_R16 -DCPRGNU</append>
</CPPDEFS>
<CXX_LINKER>FORTRAN</CXX_LINKER>
<FC_AUTO_R8>
<base> -fdefault-real-8 </base>
</FC_AUTO_R8>
<FFLAGS>
<!-- -ffree-line-length-none and -ffixed-line-length-none need to be in FFLAGS rather than in FIXEDFLAGS/FREEFLAGS
so that these are passed to cmake builds (cmake builds don't use FIXEDFLAGS and FREEFLAGS). -->
<base> -fconvert=big-endian -ffree-line-length-none -ffixed-line-length-none </base>
<append compile_threaded="TRUE"> -fopenmp </append>
<!-- Ideally, we would also have 'invalid' in the ffpe-trap list. But at
least with some versions of gfortran (confirmed with 5.4.0, 6.3.0 and
7.1.0), gfortran's isnan (which is called in cime via the
CPRGNU-specific shr_infnan_isnan) causes a floating point exception
when called on a signaling NaN. -->
<append DEBUG="TRUE"> -g -Wall -Og -fbacktrace -ffpe-trap=zero,overflow -fcheck=bounds </append>
<append DEBUG="FALSE"> -O </append>
</FFLAGS>
<FFLAGS_NOOPT>
<base> -O0 </base>
</FFLAGS_NOOPT>
<FIXEDFLAGS>
<base> -ffixed-form </base>
</FIXEDFLAGS>
<FREEFLAGS>
<base> -ffree-form </base>
</FREEFLAGS>
<HAS_F2008_CONTIGUOUS>FALSE</HAS_F2008_CONTIGUOUS>
<LDFLAGS>
<append compile_threaded="TRUE"> -fopenmp </append>
</LDFLAGS>
<MPICC> mpicc </MPICC>
<MPICXX> mpicxx </MPICXX>
<MPIFC> mpif90 </MPIFC>
<SCC> gcc </SCC>
<SCXX> g++ </SCXX>
<SFC> gfortran </SFC>
<SUPPORTS_CXX>TRUE</SUPPORTS_CXX>
</compiler>

<compiler MACH="monsoon" COMPILER="gnu">
<SLIBS>
<append> -llapack -lblas </append>
</SLIBS>
<CXX_LIBS>
<append> -L/usr/bin/g++ -lfoo </append>
</CXX_LIBS>
<CXX_LDFLAGS>
<append> -cxxlib </append>
</CXX_LDFLAGS>
</compiler>
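One related thing worth double-checking alongside config_compilers.xml is the module block for this machine in config_machines.xml, since with Lmod the g++ and mpicxx wrappers only exist after the right modules are loaded, and CMake reports "No CMAKE_CXX_COMPILER could be found" when they are not. A rough sketch of the kind of entry involved (the init path and module names are site-specific assumptions, not values from this thread):

<module_system type="module">
  <init_path lang="sh">/usr/share/lmod/lmod/init/sh</init_path>  <!-- typical Lmod location; verify on your system -->
  <cmd_path lang="sh">module</cmd_path>
  <modules compiler="gnu">
    <command name="load">gcc</command>      <!-- hypothetical module name -->
    <command name="load">openmpi</command>  <!-- hypothetical module name -->
  </modules>
</module_system>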

Any suggestions?
 

Attachments

  • cesm2-test7.zip (17.5 KB)

jonwells04

Jon Wells
New Member
update on scripts_regression_tests:

After some trial and error with the configuration, and following the directions to run the scripts_regression_tests.py on the login node within the $CIME_HOME/scripts/tests folder, we have just 4 FAILS:

1) test_cime_case_test_custom_project
2) test_bless_test_results
3) test_run_restart
4) test_full_system

We also have a handful of skipped tests. Are these tests critical to pass before moving on to pre-alpha testing? Results are attached.

I've also attached our machine, compiler, and batch xml files for anyone who finds this thread in the future.
 

Attachments

  • config.zip (5.5 KB)
  • Reg_test_results.zip (6.9 KB)

jedwards

CSEG and Liaisons
Staff member
I wouldn't worry about #1 and #2, but #3 and #4 should probably pass.

For #4 you need to go to /scratch/jw2636/cesmoutput/scripts_regression_test.20210313_050515 and look into why the tests are failing, or you can run an individual test from test_full_system, for example:

$ ./create_test SMS_D_Ln9_Mmpi-serial.f19_g16_rx1.A.wind_gnu

from the scripts directory. It will give you the path to the generated case directory, which you can examine to discover the reason for the failure.
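The per-test detail ends up in that case directory; a rough sketch of where to look once the test has run (the case path is whatever create_test prints):

$ ./create_test SMS_D_Ln9_Mmpi-serial.f19_g16_rx1.A.wind_gnu
$ cd <case directory printed by create_test>
$ cat TestStatus        # PASS/FAIL summary for each phase of the test
$ less TestStatus.log   # detailed output for the phase that failed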
 

jonwells04

Jon Wells
New Member
Great, thank you!

It seems like the base-restart comparison is failing in both test_run_restart and test_full_system. The history files from the restarted run appear to be missing, but I'm not sure why. Are there any other places to look for errors to troubleshoot? I'm going to run the pre-alpha test to see if I can gain any new information.

From test_run_restart:
TestStatus (everything else passed):
FAIL NODEFAIL_P1.f09_g16.X.wind_gnu COMPARE_base_rest

TestStatus.log (setup, sharedlib_build, model_build, and submit passed):
comparing model 'xatm'
no hist files found for model xatm
comparing model 'xlnd'
no hist files found for model xlnd
comparing model 'xice'
no hist files found for model xice
comparing model 'xocn'
no hist files found for model xocn
comparing model 'xrof'
no hist files found for model xrof
comparing model 'xglc'
no hist files found for model xglc
comparing model 'xwav'
no hist files found for model xwav
comparing model 'siac'
no hist files found for model siac
comparing model 'sesp'
no hist files found for model sesp
comparing model 'cpl'
NODEFAIL_P1.f09_g16.X.wind_gnu.20210313_202450_o1jhok.cpl.hi.0001-01-01-19800.nc.base did NOT match NODEFAIL_P1.f09_g16.X.wind_gnu.20210313_202450_o1jhok.cpl.hi.0001-01-01-19800.nc.rest

From test_full_system:
TestStatus (everything else passed):
FAIL DAE.ww3a.ADWAV.wind_gnu COMPARE_base_da

TestStatus.log (setup, sharedlib_build, model_build, and submit passed):
comparing model 'satm'
no hist files found for model satm
comparing model 'slnd'
no hist files found for model slnd
comparing model 'sice'
no hist files found for model sice
comparing model 'socn'
no hist files found for model socn
comparing model 'srof'
no hist files found for model srof
comparing model 'sglc'
no hist files found for model sglc
comparing model 'dwav'
no hist files found for model dwav
comparing model 'siac'
no hist files found for model siac
comparing model 'sesp'
no hist files found for model sesp
comparing model 'cpl'
DAE.ww3a.ADWAV.wind_gnu.20210313_212752_6m6oad.cpl.hi.0001-01-05-00000.nc.base did NOT match DAE.ww3a.ADWAV.wind_gnu.20210313_212752_6m6oad.cpl.hi.0001-01-05-00000.nc.da
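For what it's worth, one way to dig further into a COMPARE failure like this is to diff the two coupler history files directly with cprnc, CIME's netCDF comparison tool (a sketch only; where cprnc lives depends on the machine setup, so the path below is a placeholder):

$ <path-to-cprnc>/cprnc \
    NODEFAIL_P1.f09_g16.X.wind_gnu.20210313_202450_o1jhok.cpl.hi.0001-01-01-19800.nc.base \
    NODEFAIL_P1.f09_g16.X.wind_gnu.20210313_202450_o1jhok.cpl.hi.0001-01-01-19800.nc.rest

cprnc prints the fields whose values differ, which narrows down whether the restarted run diverged everywhere or only in a few variables.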
 

ykp990521

ykp990521
Member
jonwells04 said:
Hi Erik,

Thanks so much for helping with the machine setup. I'm going to post what we've done so far and our current status.

The DIN_LOC_ROOT_CLMFORC setting you describe above allowed us to start downloading the data.

Once we could access the forcing data, we ran into the same error as in the following post:
Problem in downloading input data when submit the case(error: 'UNSET/atm_forcing.datm7..')

We deleted the same aerosol deposition file, redownloaded it, and the "ERROR: (shr_stream_verifyTCoord) ERROR: calendar dates must be increasing" message went away.

A new problem arose when running CLM5 with:

./create_newcase --compset I1850Clm50BgcCropG --res f45_g37 --machine monsoon --case /scratch/$USER/ctsm_test --run-unsupported

The case would submit and run for about 3-4 minutes and then error out. The run.ctsm_test file showed the following error:
run command is mpirun -np 8 /scratch/jw2636/cesm/scratch/ctsm_test/bld/cesm.exe >> cesm.log.$LID 2>&1
ERROR: RUN FAIL: Command 'mpirun -np 8 /scratch/jw2636/cesm/scratch/ctsm_test/bld/cesm.exe >> cesm.log.$LID 2>&1 ' failed
See log file for details: /scratch/jw2636/cesm/scratch/ctsm_test/run/cesm.log.37361198.210222-221913
slurmstepd: error: Detected 1 oom-kill event(s) in step 37361198.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

The cesm log is attached and shows strange "netCDF: Invalid dimension ID or name" and "netCDF: Variable not found" messages but no obvious error message as far as I could tell.
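Since the slurmstepd message points at the cgroup memory limit, a quick way to confirm how much memory the batch step actually used versus what was requested is Slurm's accounting query (a sketch; the job ID is taken from the log above, and sacct must be enabled on the cluster):

$ sacct -j 37361198 --format=JobID,JobName,State,ReqMem,MaxRSS,Elapsed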

I followed the post below and used ./case.submit --resubmit-immediate to try to get past the memory error:
Resubmit memory failure

But ./case.submit --resubmit-immediate didn't work (maybe I did it wrong?).

The other suggestion in that thread was to ssh back to the login node before resubmitting the job, so I added the following to config_batch.xml, based on the stampede-skx example:
<batch_system MACH="monsoon" type="slurm" >
<batch_submit>ssh monsoon.hpc.nau.edu cd $CASEROOT ; sbatch</batch_submit>
<submit_args>
<arg flag="--time" name="$JOB_WALLCLOCK_TIME"/>
<arg flag="-p" name="$JOB_QUEUE"/>
</submit_args>
<queues>
<queue walltimemax="48:00:00" nodemin="1" nodemax="256" default="true">normal</queue>
<queue walltimemax="02:00:00" nodemin="1" nodemax="8" >dev</queue>
</queues>
</batch_system>

After recreating the case and running case.setup, case.build, and case.submit --resubmit-immediate:
Submitting job script ssh monsoon.hpc.nau.edu 'cd /scratch/jw2636/ctsm_test ; sbatch --time 48:00:00 -p normal .case.run --completion-sets-continue-run'
jw2636@monsoon.hpc.nau.edu's password:
ERROR: Command: 'ssh monsoon.hpc.nau.edu 'cd /scratch/jw2636/ctsm_test ; sbatch --time 48:00:00 -p normal .case.run --completion-sets-continue-run'' failed with error 'sbatch: error: invalid partition specified: normal
sbatch: error: Batch job submission failed: Invalid partition name specified' from dir '/scratch/jw2636/ctsm_test'

William, what are the correct partition names on Monsoon/Slurm (to replace normal/dev in the example above), and can we avoid the password prompt if sshing back to the login node is the necessary approach?
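For reference, two quick checks along these lines (a sketch; both run on the login node): sinfo lists the partition names this Slurm instance actually defines, and an ssh key removes the password prompt when batch_submit sshes back to the login node:

$ sinfo -s                                   # summary view of available partitions
$ ssh-keygen -t ed25519                      # skip if a key already exists
$ ssh-copy-id jw2636@monsoon.hpc.nau.edu     # authorize the key for password-less ssh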

Erik, can you think of any other workaround, or a config file that perhaps hasn't been set up correctly, that would cause Slurm's out-of-memory handler to kill jobs and keep Slurm from resubmitting? I attached our machine folder as well.

Thanks!
Hello, do you know which compsets the f45_g37 resolution supports in 2.1.3 or 2.2? It seems that this low resolution is not supported by a large number of the compsets I've tried. Thanks a lot!
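One way to check is CIME's query_config script, which lists the grid aliases and compsets defined in the release you have checked out (a sketch, run from the scripts directory of a standard checkout):

$ ./query_config --grids | grep -A3 f45_g37   # confirm the alias exists and see its component grids
$ ./query_config --compsets                   # list the compsets defined in this release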
 