CESM2.3.beta08 porting issue

James King

Member
Hi all,

I'm porting CESM2.3.beta08 to the ARCHER2 HPC here in the UK. My colleagues have previously successfully ported CESM2.1.3 but we have a research need for more recent model features. We have a script to install the model and a module file to go with it (attached).

Following the usual steps I am able to download the model and create a test case. The error arises when invoking ./case.build in the test case:

Traceback (most recent call last):
File "/mnt/lustre/a2fs-work2/work/n02/n02/jking/cesm/CESM2.3.beta08/cases/F2000climo_test/./case.build", line 147, in <module>
_main_func(__doc__)
File "/mnt/lustre/a2fs-work2/work/n02/n02/jking/cesm/CESM2.3.beta08/cases/F2000climo_test/./case.build", line 140, in _main_func
success = build.case_build(caseroot, case=case, sharedlib_only=sharedlib_only,
File "/mnt/lustre/a2fs-work2/work/n02/n02/jking/cesm/CESM2.3.beta08/my_cesm_sandbox/cime/scripts/Tools/../../scripts/lib/CIME/build.py", line 570, in case_build
return run_and_log_case_status(functor, "case.build", caseroot=caseroot)
File "/mnt/lustre/a2fs-work2/work/n02/n02/jking/cesm/CESM2.3.beta08/my_cesm_sandbox/cime/scripts/Tools/../../scripts/lib/CIME/utils.py", line 1684, in run_and_log_case_status
rv = func()
File "/mnt/lustre/a2fs-work2/work/n02/n02/jking/cesm/CESM2.3.beta08/my_cesm_sandbox/cime/scripts/Tools/../../scripts/lib/CIME/build.py", line 568, in <lambda>
functor = lambda: _case_build_impl(caseroot, case, sharedlib_only, model_only, buildlist,
File "/mnt/lustre/a2fs-work2/work/n02/n02/jking/cesm/CESM2.3.beta08/my_cesm_sandbox/cime/scripts/Tools/../../scripts/lib/CIME/build.py", line 519, in _case_build_impl
logs = _build_libraries(case, exeroot, sharedpath, caseroot,
File "/mnt/lustre/a2fs-work2/work/n02/n02/jking/cesm/CESM2.3.beta08/my_cesm_sandbox/cime/scripts/Tools/../../scripts/lib/CIME/build.py", line 257, in _build_libraries
if comp_lnd == "clm" and "clm4_0" not in clm_config_opts:
TypeError: argument of type 'NoneType' is not iterable

Someone else has posted this error before on the forums but it doesn't appear to have been resolved. As this appears to be related to CLM I have attached the CLM namelist for the case as well. One thing to be aware of is that the CESM2.1.3 installation required modifying the default version of CIME (prior to running ./manage_externals/checkout_externals) to branch = maint-5.6 (which contains machine-specific settings for ARCHER2). I have done this again here but I am not sure if this older CIME version would work with CESM2.3.beta08.
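
For what it's worth, the final line of the traceback is plain Python behaviour: a membership test against None. Below is a minimal sketch, assuming clm_config_opts comes back as None because the older CIME looks up a setting that this version of the model no longer defines. That cause is my guess, and the function names are made up for illustration.

```python
def clm_needs_shared_build(comp_lnd, clm_config_opts):
    """Same condition as the failing line in the traceback."""
    return comp_lnd == "clm" and "clm4_0" not in clm_config_opts


def clm_needs_shared_build_guarded(comp_lnd, clm_config_opts):
    """Hypothetical variant that treats a missing value as an empty string."""
    opts = clm_config_opts or ""
    return comp_lnd == "clm" and "clm4_0" not in opts


if __name__ == "__main__":
    try:
        clm_needs_shared_build("clm", None)
    except TypeError as err:
        # This is the same class of error as in the traceback above.
        print("unguarded check failed:", err)
    print("guarded check:", clm_needs_shared_build_guarded("clm", None))
```

The guarded variant only shows why the error appears; it is not meant as a patch for build.py.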

Any advice on how we should proceed would be much appreciated.

Many thanks,

James
 

Attachments

  • 2.3.beta08.txt
    2 KB · Views: 2
  • setup_cesm23beta.txt
    5.7 KB · Views: 4
  • lnd_in.txt
    7.2 KB · Views: 1

jedwards

CSEG and Liaisons
Staff member
You are correct - the cime maint-5.6 branch is not appropriate for cesm2.3.
Our latest beta tag is cesm2_3_beta15 - I suggest that you update to that version.
Then you will need to port the cime master branch to archer2. You can use the maint-5.6 port as a guide,
but cut and paste won't work as the format of many of the xml files has changed.
 

James King

Member
Thanks for the quick response - I suspected this wasn't going to be a straightforward matter of repeating the process for the previous port. When you say 'port the cime master branch', is this just a case of applying the ARCHER2-specific machine settings (which are in maint-5.6) into the CIME master branch once it's been installed, or is there more to it than that?
 

jedwards

CSEG and Liaisons
Staff member
In cesm2.3, machine and compiler configuration files are no longer part of cime and are in their own repository.
Look in ccs_config/machines for config_machines.xml - you should be able to copy the settings from maint-5.6 into
that file, but there are a few format changes so be careful with cut and paste. Also, in cesm2.3 config_compilers.xml no
longer exists and has been replaced by ccs_config/machines/cmake_macros/. I think it should be clear how to translate your config_compilers.xml settings to cmake_macros settings, but let me know if it isn't.
 

James King

Member
Thanks for this Jim. In ccs_config/machines/cmake_macros/ I can see a list of files, some just with machine names (e.g. cheyenne.cmake) and some with both machine names and compilers (e.g. intel_cheyenne.cmake). Do I need to create both a file for archer2 (archer2.cmake) and then macro settings for its compiler (gnu_archer2.cmake)? I'm afraid it's not clear to me how to translate the ARCHER2 settings I have in config_compilers.xml (attached) into the cmake_macros directory (I am a mere scientist and not a software engineer).
 

Attachments

  • config_compilers_archer2.txt
    1.7 KB · Views: 7

James King

Member
In addition, having put the machine settings for ARCHER2 from the CESM2.1.3 installation into the new one, when attempting to build a case it fails with

cesm model version found: cesm2_3_beta08
Batch_system_type is slurm
job is case.run USER_REQUESTED_WALLTIME None USER_REQUESTED_QUEUE None WALLTIME_FORMAT %H:%M:%S
WARNING: No queue on this system met the requirements for this job. Falling back to defaults
ERROR: No queues found
EDIT - fixed this, ignore. Previous question about cmake macros still stands.
 

jedwards

CSEG and Liaisons
Staff member
Yes - from your config_compilers.xml file above you should create two new files, archer2.cmake and gnu_archer2.cmake.
But you should not need most of the content therein - it will be inherited from gnu.cmake.
It could be that all you need is:

string(APPEND SLIBS "-L/work/n02/shared/ESMF/esmf_8.2.0/lib/libO/Linux.gfortran.64.mpich.archer2/ -ldl")

I'm not sure why you have defined ESMF_LIBDIR; instead you should have an entry in the config_machines.xml,
something like:

<environment_variables COMPILER="gnu">
  <env name="ESMFMKFILE">/work/n02/shared/ESMF/esmf_8.2.0/lib/libO/Linux.gfortran.64.mpich.archer2/esmf.mk</env>
</environment_variables>
 

James King

Member
OK got it - I've created those two files. Should I put this line:

string(APPEND SLIBS "-L/work/n02/shared/ESMF/esmf_8.2.0/lib/libO/Linux.gfortran.64.mpich.archer2/ -ldl")

in both of them? And what lines from the config_compilers.xml file need to go in there too?

I'm not sure why that ESMF_LIBDIR line is there, it was put there by a software engineer at our end last year during the CESM2.1.3 installation. I've updated the config_machines file for the new installation as per your suggestion.
 

jedwards

CSEG and Liaisons
Staff member
No - that line is specific to both archer2 and gnu, so it belongs in the gnu_archer2.cmake file only.
Most of what you have in those config_compilers.xml entries is repeated from further up in the hierarchy.
You should make the minimal changes needed for your machine and use the inheritance provided to get the more general settings.
 

James King

Member
OK thanks. With this line in gnu_archer2.cmake, I can build and submit a case. However we're still having fun with compilers as the run fails with this line in the cesm.log:

/work/n02/n02/jking/cesm/CESM2.3.beta08/cesm_sims/runs/F2000climo_test/bld/cesm.exe: error while loading shared libraries: libnetcdf_parallel_gnu_91.so.18: cannot open shared object file: No such file or directory'

I have the module

cray-parallel-netcdf/1.12.3.1

loaded in my environment.
 

jedwards

CSEG and Liaisons
Staff member
The module cray-parallel-netcdf/1.12.3.1 is for pnetcdf/1.12.3.
I believe that the error you see is from the netcdf library. Can you find
that file libnetcdf_parallel_gnu_91.so.18 on your system?
You can use
module show cray-parallel-netcdf/1.12.3.1 (or whatever module name)
to see how each module affects your environment. The path to that shared
library should be provided by the cray-netcdf module.
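
If it helps, here is a rough Python sketch of that search. It assumes the relevant directories are listed in LD_LIBRARY_PATH or CRAY_LD_LIBRARY_PATH, and the function names are hypothetical, so adjust it to however your environment is set up.

```python
import os
from pathlib import Path


def find_library_versions(stem, search_dirs):
    """Return every file whose name starts with `stem` in the given directories."""
    matches = []
    for directory in search_dirs:
        path = Path(directory)
        if not path.is_dir():
            continue  # skip entries that do not exist on this system
        matches.extend(sorted(str(p) for p in path.glob(stem + "*")))
    return matches


def library_search_dirs(env=os.environ):
    """Collect directories from the loader-related environment variables."""
    dirs = []
    for name in ("LD_LIBRARY_PATH", "CRAY_LD_LIBRARY_PATH"):
        dirs.extend(d for d in env.get(name, "").split(":") if d)
    return dirs


if __name__ == "__main__":
    stem = "libnetcdf_parallel_gnu_91.so"
    for hit in find_library_versions(stem, library_search_dirs()):
        print(hit)
```

Running it with the modules loaded should list whichever versions of the library are actually visible to the loader.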
 

James King

Member
I can't find that specific file anywhere on the system, which makes sense given the error. The question is how to point the model to a file which does exist, and which file that should be. A keyword search turns up the following files with similar names:

opt/cray/pe/netcdf-hdf5parallel/4.9.0.1/gnu/9.1/lib/libnetcdf_parallel_gnu_91.a
opt/cray/pe/netcdf-hdf5parallel/4.9.0.1/gnu/9.1/lib/libnetcdf_parallel_gnu_91.so
opt/cray/pe/netcdf-hdf5parallel/4.9.0.1/gnu/9.1/lib/libnetcdf_parallel_gnu_91.so.19
opt/cray/pe/netcdf-hdf5parallel/4.9.0.1/gnu/9.1/lib/libnetcdf_parallel_gnu_91.so.19.1.0

I have these modules loaded:

cray-netcdf-hdf5parallel/4.9.0.1
cray-parallel-netcdf/1.12.3.1

The result of module show cray-netcdf-hdf5parallel/4.9.0.1 is:

------------------------------------------------------------------------------------------------------
/opt/cray/pe/lmod/modulefiles/hdf5-parallel/gnu/8.0/ofi/1.0/cray-mpich/8.0/cray-hdf5-parallel/1.12.2/cray-netcdf-hdf5parallel/4.9.0.1.lua:
------------------------------------------------------------------------------------------------------
family("netcdf")
conflict("PrgEnv-pathscale")
help([[Release info: /opt/cray/pe/netcdf-hdf5parallel/4.9.0.1/release_info]])
whatis("NetCDF (Network Common Data Form) is a set of interfaces for array-oriented data access and a collection of data access libraries for C, Fortran, and C++.")
prepend_path("PATH","/opt/cray/pe/netcdf-hdf5parallel/4.9.0.1/bin")
prepend_path("MANPATH","/opt/cray/pe/netcdf-hdf5parallel/4.9.0.1/share/man")
prepend_path("PE_PKGCONFIG_PRODUCTS","PE_NETCDF_HDF5PARALLEL")
prepend_path("PKG_CONFIG_PATH","/opt/cray/pe/netcdf-hdf5parallel/4.9.0.1/gnu/9.1/lib/pkgconfig")
setenv("PE_NETCDF_HDF5PARALLEL_PKGCONFIG_LIBS","netcdf_parallel")
setenv("PE_NETCDF_HDF5PARALLEL_FORTRAN_PKGCONFIG_LIBS","netcdf-fortran_parallel")
setenv("PE_NETCDF_HDF5PARALLEL_CXX_PKGCONFIG_LIBS","netcdf-cxx4_parallel")
setenv("CRAY_NETCDF_HDF5PARALLEL_DIR","/opt/cray/pe/netcdf-hdf5parallel/4.9.0.1")
setenv("PE_NETCDF_HDF5PARALLEL_DIR","/opt/cray/pe/netcdf-hdf5parallel/4.9.0.1")
setenv("CRAY_NETCDF_HDF5PARALLEL_VERSION","4.9.0.1")
setenv("CRAY_NETCDF_HDF5PARALLEL_PREFIX","/opt/cray/pe/netcdf-hdf5parallel/4.9.0.1/gnu/9.1")
setenv("NETCDF_DIR","/opt/cray/pe/netcdf-hdf5parallel/4.9.0.1/gnu/9.1")
prepend_path("CRAY_LD_LIBRARY_PATH","/opt/cray/pe/netcdf-hdf5parallel/4.9.0.1/gnu/9.1/lib")


And the result of module show cray-parallel-netcdf/1.12.3.1 is:

------------------------------------------------------------------------------------------------------
/opt/cray/pe/lmod/modulefiles/mpi/gnu/8.0/ofi/1.0/cray-mpich/8.0/cray-parallel-netcdf/1.12.3.1.lua:
------------------------------------------------------------------------------------------------------
family("parallel_netcdf")
family("pnetcdf")
help([[Release info: /opt/cray/pe/parallel-netcdf/1.12.3.1/release_info]])
whatis("Parallel I/O library for NetCDF file access")
prepend_path("PATH","/opt/cray/pe/parallel-netcdf/1.12.3.1/bin")
prepend_path("MANPATH","/opt/cray/pe/parallel-netcdf/1.12.3.1/share/man")
prepend_path("PKG_CONFIG_PATH","/opt/cray/pe/parallel-netcdf/1.12.3.1/gnu/9.1/lib/pkgconfig")
prepend_path("PE_PKGCONFIG_PRODUCTS","PE_PARALLEL_NETCDF")
setenv("PE_PARALLEL_NETCDF_PKGCONFIG_LIBS","pnetcdf")
setenv("CRAY_PARALLEL_NETCDF_DIR","/opt/cray/pe/parallel-netcdf/1.12.3.1")
setenv("PE_PARALLEL_NETCDF_DIR","/opt/cray/pe/parallel-netcdf/1.12.3.1")
setenv("CRAY_PARALLEL_NETCDF_VERSION","1.12.3.1")
setenv("CRAY_PARALLEL_NETCDF_PREFIX","/opt/cray/pe/parallel-netcdf/1.12.3.1/gnu/9.1")
setenv("PNETCDF_DIR","/opt/cray/pe/parallel-netcdf/1.12.3.1/gnu/9.1")
prepend_path("CRAY_LD_LIBRARY_PATH","/opt/cray/pe/parallel-netcdf/1.12.3.1/gnu/9.1/lib")

I can see that in the former, the path "CRAY_LD_LIBRARY_PATH" is pointing toward the directory containing the similarly named files listed above (.so.19, but not the missing .so.18). Am I heading in the right direction here? I doubt I have the permissions to edit the contents of module files, but might it be a matter of loading a different version of the module?
 

jedwards

CSEG and Liaisons
Staff member
Something in your stack was compiled with libnetcdf_parallel_gnu_91.so.18
instead of libnetcdf_parallel_gnu_91.so.19.
I'm not sure how to find it.
 

James King

Member
Fair enough - thanks for all your help. I'll see if tech support at ARCHER2 can help with this - it's a bit outside my expertise.
 

jedwards

CSEG and Liaisons
Staff member
Try ldd --verbose .. and if that doesn't give enough information, you could run export LD_DEBUG=libs before calling ldd on cesm.exe.
 

James King

Member
As a possible line of enquiry, the module file which my colleagues made to use every time CESM is used contains the following:

setenv ( "ESMF_NETCDF_INCLUDE" , "/opt/cray/pe/netcdf-hdf5parallel/4.7.4.3/gnu/9.1/include" )
setenv ( "ESMF_NETCDF_LIBPATH" , "/opt/cray/pe/netcdf-hdf5parallel/4.7.4.3/gnu/9.1/lib" )
setenv ( "ESMF_NETCDFF_INCLUDE" , "/opt/cray/pe/netcdf-hdf5parallel/4.7.4.3/gnu/9.1/include" )

There is no such directory as far as I can see; "/opt/cray/pe/netcdf-hdf5parallel/4.9.0.1/" is the current version. This may be because the work to install the older CESM version was done before a major operating system upgrade to the ARCHER2 system.
 

jedwards

CSEG and Liaisons
Staff member
That's why you should really try to avoid hardcoded paths. All of these ESMF_ variables
are set by ESMFMKFILE and they should not need to be set individually. Just make sure
that ESMFMKFILE is in the environment.
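
If you want to check what the ESMF build itself references, something along these lines might help. It is only a sketch: it assumes esmf.mk consists of simple NAME=value lines with -I and -L flags in the values, and the function names are hypothetical.

```python
import os
import re


def parse_esmf_mk(text):
    """Parse NAME=value lines from the text of an esmf.mk file, skipping comments."""
    variables = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        variables[key.strip()] = value.strip()
    return variables


def missing_paths(variables, exists=os.path.exists):
    """Return (variable, path) pairs for -I/-L paths that do not exist on disk."""
    missing = []
    for key, value in variables.items():
        # Only match -I or -L at the start of a whitespace-separated token.
        for path in re.findall(r"(?<!\S)-[IL]\s*(\S+)", value):
            if not exists(path):
                missing.append((key, path))
    return missing


if __name__ == "__main__":
    mkfile = os.environ.get("ESMFMKFILE")
    if mkfile and os.path.isfile(mkfile):
        with open(mkfile) as handle:
            for key, path in missing_paths(parse_esmf_mk(handle.read())):
                print(f"{key}: {path} does not exist")
    else:
        print("ESMFMKFILE is not set or does not point to a file")
```

Any path it reports would be a hint that ESMF was built against libraries that are no longer installed.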
 

James King

Member
Running ldd -v on cesm.exe gives:

libnetcdf_parallel_gnu_91.so.18 => not found

Doing so after setting LD_DEBUG=libs gives some more output:

218355: find library=libnetcdf_parallel_gnu_91.so.18 [0]; searching
218355: search path=/opt/cray/pe/parallel-netcdf/1.12.3.1/gnu/9.1/lib:/opt/cray/pe/netcdf-hdf5parallel/4.9.0.1/gnu/9.1/lib:/opt/cray/pe/hdf5-parallel/1.12.2.1/gnu/9.1/lib:/opt/cray/pe/libsci/22.12.1.1/GNU/9.1/x86_64/lib:/opt/cray/pe/mpich/8.1.23/ofi/gnu/9.1/lib:/opt/cray/pe/mpich/8.1.23/gtl/lib:/opt/cray/pe/dsmml/0.2.2/dsmml/lib:/opt/cray/pe/libsci/22.12.1.1/CRAY/9.0/x86_64/lib:/opt/cray/pe/perftools/22.12.0/lib64:/opt/cray/pe/python/3.9.13.1/lib (LD_LIBRARY_PATH)
218355: trying file=/opt/cray/pe/parallel-netcdf/1.12.3.1/gnu/9.1/lib/libnetcdf_parallel_gnu_91.so.18
218355: trying file=/opt/cray/pe/netcdf-hdf5parallel/4.9.0.1/gnu/9.1/lib/libnetcdf_parallel_gnu_91.so.18
218355: trying file=/opt/cray/pe/hdf5-parallel/1.12.2.1/gnu/9.1/lib/libnetcdf_parallel_gnu_91.so.18
218355: trying file=/opt/cray/pe/libsci/22.12.1.1/GNU/9.1/x86_64/lib/libnetcdf_parallel_gnu_91.so.18
218355: trying file=/opt/cray/pe/mpich/8.1.23/ofi/gnu/9.1/lib/libnetcdf_parallel_gnu_91.so.18
218355: trying file=/opt/cray/pe/mpich/8.1.23/gtl/lib/libnetcdf_parallel_gnu_91.so.18
218355: trying file=/opt/cray/pe/dsmml/0.2.2/dsmml/lib/libnetcdf_parallel_gnu_91.so.18
218355: trying file=/opt/cray/pe/libsci/22.12.1.1/CRAY/9.0/x86_64/lib/libnetcdf_parallel_gnu_91.so.18
218355: trying file=/opt/cray/pe/perftools/22.12.0/lib64/libnetcdf_parallel_gnu_91.so.18
218355: trying file=/opt/cray/pe/python/3.9.13.1/lib/libnetcdf_parallel_gnu_91.so.18
218355: search path=/opt/cray/pe/gcc-libs (RPATH from file ./cesm.exe)
218355: trying file=/opt/cray/pe/gcc-libs/libnetcdf_parallel_gnu_91.so.18
218355: search path=/opt/cray/pe/gcc/11.2.0/snos/lib64:/opt/cray/pe/papi/6.0.0.17/lib64:/opt/cray/libfabric/1.12.1.2.2.0.0/lib64 (LD_LIBRARY_PATH)
218355: trying file=/opt/cray/pe/gcc/11.2.0/snos/lib64/libnetcdf_parallel_gnu_91.so.18
218355: trying file=/opt/cray/pe/papi/6.0.0.17/lib64/libnetcdf_parallel_gnu_91.so.18
218355: trying file=/opt/cray/libfabric/1.12.1.2.2.0.0/lib64/libnetcdf_parallel_gnu_91.so.18
218355: search path=/work/y07/shared/utils/core/tk/8.6.10/lib/tls/x86_64/x86_64:/work/y07/shared/utils/core/tk/8.6.10/lib/tls/x86_64:/work/y07/shared/utils/core/tk/8.6.10/lib/tls/x86_64:/work/y07/shared/utils/core/tk/8.6.10/lib/tls:/work/y07/shared/utils/core/tk/8.6.10/lib/x86_64/x86_64:/work/y07/shared/utils/core/tk/8.6.10/lib/x86_64:/work/y07/shared/utils/core/tk/8.6.10/lib/x86_64:/work/y07/shared/utils/core/tk/8.6.10/lib (RUNPATH from file /work/n02/shared/ESMF/esmf_8.2.0/lib/libO/Linux.gfortran.64.mpich.archer2/libesmf.so)
218355: trying file=/work/y07/shared/utils/core/tk/8.6.10/lib/tls/x86_64/x86_64/libnetcdf_parallel_gnu_91.so.18
218355: trying file=/work/y07/shared/utils/core/tk/8.6.10/lib/tls/x86_64/libnetcdf_parallel_gnu_91.so.18
218355: trying file=/work/y07/shared/utils/core/tk/8.6.10/lib/tls/x86_64/libnetcdf_parallel_gnu_91.so.18
218355: trying file=/work/y07/shared/utils/core/tk/8.6.10/lib/tls/libnetcdf_parallel_gnu_91.so.18
218355: trying file=/work/y07/shared/utils/core/tk/8.6.10/lib/x86_64/x86_64/libnetcdf_parallel_gnu_91.so.18
218355: trying file=/work/y07/shared/utils/core/tk/8.6.10/lib/x86_64/libnetcdf_parallel_gnu_91.so.18
218355: trying file=/work/y07/shared/utils/core/tk/8.6.10/lib/x86_64/libnetcdf_parallel_gnu_91.so.18
218355: trying file=/work/y07/shared/utils/core/tk/8.6.10/lib/libnetcdf_parallel_gnu_91.so.18
 

jedwards

CSEG and Liaisons
Staff member
That's what I was afraid of - you get lots of information about where it's looking for the file but
nothing about which library requested it.
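
One way to narrow it down might be to read the NEEDED entries of each candidate file directly. The sketch below assumes readelf (from binutils) is available on the system; the function names are hypothetical, and the two candidate paths are simply the executable and the ESMF library that show up in the output above. I believe LD_DEBUG=files also prints "needed by" lines, but I haven't checked that on your system.

```python
import re
import shutil
import subprocess

# Matches dynamic-section lines such as:
#  0x0000000000000001 (NEEDED)   Shared library: [libfoo.so.1]
NEEDED_RE = re.compile(r"\(NEEDED\)\s+Shared library: \[(.+?)\]")


def needed_libraries(readelf_output):
    """Extract the NEEDED library names from `readelf -d` output."""
    return NEEDED_RE.findall(readelf_output)


def find_requesters(paths, target):
    """Return the files among `paths` that list `target` as a direct dependency."""
    requesters = []
    for path in paths:
        result = subprocess.run(
            ["readelf", "-d", path], capture_output=True, text=True, check=False
        )
        if target in needed_libraries(result.stdout):
            requesters.append(path)
    return requesters


if __name__ == "__main__":
    candidates = [
        "./cesm.exe",
        "/work/n02/shared/ESMF/esmf_8.2.0/lib/libO/Linux.gfortran.64.mpich.archer2/libesmf.so",
    ]
    if shutil.which("readelf") is None:
        print("readelf not found on PATH")
    else:
        print(find_requesters(candidates, "libnetcdf_parallel_gnu_91.so.18"))
```

If the ESMF library turns up in the result, that would suggest it was built against the older netcdf and needs rebuilding, but that is a guess until you have run the check.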
 