
Problems porting CESM on local machine with CentOS 8

gabriel2029

Gabriel Dengler
New Member
Hello,
I have some problems porting CESM to a local machine running CentOS 8. I have tried two compilers, Intel Parallelstudio 20up02 and GNU 8.3.1 in combination with OpenMPI 4.0.4, and created a machine definition for each of them. I built NetCDF from source; for completeness, I have attached all the important configuration files and scripts.
  1. The module_system in config_machines.xml works only with Parallelstudio, not with OpenMPI. If you load the modules on the command line using “module load”, they load fine on both setups. After some investigation, it turned out that the purge command seems to work but the load command does not (modules loaded manually with “module load” are available when the module system is disabled, but not when purge is defined in the corresponding configuration file). Why?
  2. When compiling with GNU and OpenMPI, linking the libraries fails with the following error: GNU-Compilation-Error.PNG. I also tried adding the compiler flag -mcmodel=medium, but this did not help.
  3. When compiling with Parallelstudio, the compilation sometimes completes and sometimes fails with the same error as with GNU. I have not found a clear pattern for when the compilation succeeds, but it seems to me that this happens when you compile different test cases with different compilers (though I may be wrong on this point). When the compilation succeeds, the run fails because the symbol __libm_feature_flag cannot be found in any library.
  4. Some strange behavior regarding the config files: currently, I replace only the necessary components in config_machines.xml and config_compilers.xml with the content of the attached files and leave the rest unchanged. If I instead replace the complete config_machines.xml and config_compilers.xml with the attached files, ./create_testcase fails with “ERROR: Expected one child”.
Thanks in advance,
Gabriel
 

Attachments

  • config_machines.xml.txt (4.8 KB)
  • config_compilers.xml.txt (4.6 KB)
  • install_netcdf_gnu.sh.txt (2.8 KB)
  • install_netcdf_intel.sh.txt (2.8 KB)

jedwards

CSEG and Liaisons
Staff member
In the config_machines.xml file you have <MPILIBS>mpich</MPILIBS> for both machines, but the gnu one should be openmpi. It's not clear to me what the intel one should be - impi? I'm not sure what you mean about the purge command. It may help to examine the file .env_machine_specific.sh in your case directory to see how the system is interpreting the xml file commands. If you source that file in your environment (or the file .env_machine_specific.csh if you are a csh/tcsh user), you will put your login shell in the same environment that the cesm build and run steps use. This is a helpful debugging tool. You may also want to remove the allow_error="true" from the module_system line, unless you know that you need it.
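
As a possible way to apply the debugging tip above without altering the login shell, the following sketch sources a script in a subshell and prints only the exported variables it adds or changes. The helper name env_delta is hypothetical and not part of CESM; the file name is the one quoted in this thread.

```shell
#!/bin/sh
# env_delta SCRIPT: print the exported variables that SCRIPT adds or changes
# when it is sourced. The sourcing happens in a subshell, so the calling
# shell's environment is left untouched. (Hypothetical helper, not part of CESM.)
env_delta() {
    script=$1
    before=$(mktemp) || return 1
    after=$(mktemp) || { rm -f "$before"; return 1; }

    # Snapshot the exported environment before and after sourcing the script.
    env | LC_ALL=C sort > "$before"
    ( . "$script" >/dev/null 2>&1; env | LC_ALL=C sort ) > "$after"

    # Lines present only in the "after" snapshot are new or changed variables.
    LC_ALL=C comm -13 "$before" "$after"
    rm -f "$before" "$after"
}
```

Run from the case directory, for example as `env_delta ./.env_machine_specific.sh`. If a module load is expected to extend PATH or LD_LIBRARY_PATH and neither appears in the output, the load probably did not take effect.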

Your config_compilers.xml file should not try to change the general intel and gnu compiler definitions; rather, you should use a MACH= attribute to modify the defaults for your particular machine. This will make it easier to merge updates later. Finally, your machine definition suggests that you have a single-node, 2-processor system, and your log above indicates that you want to build a B1850.f19_g17 case. This case is much too large for a system of this size, and the errors indicate that the compiled program exceeds the capacity of the system. Perhaps you should start by following the porting guide.
 

gabriel2029

Gabriel Dengler
New Member
First of all, thanks for your answer!

I was able to compile the test case with GNU and OpenMPI by increasing the max (MPI) tasks per node to 8, but, unsurprisingly, it then does not run. What would be an example of a smaller test case to try on a single node (other than the prealpha tests from the porting guide) before running on multiple nodes using Slurm? Where can I define the number of nodes, and how does CESM compute the -np {{ total_tasks }} parameter?
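
For orientation only, the relationship between a task count and a node count can be sketched as a ceiling division. This is plain arithmetic under the assumption that every task occupies one slot on a node; it does not reproduce how CESM itself derives total_tasks, and the function name nodes_needed is hypothetical.

```shell
# nodes_needed TOTAL_TASKS TASKS_PER_NODE: smallest number of nodes that can
# hold TOTAL_TASKS MPI tasks, assuming one slot per task (integer ceiling
# division). Hypothetical helper for illustration, not CESM's own logic.
nodes_needed() {
    total_tasks=$1
    tasks_per_node=$2
    echo $(( (total_tasks + tasks_per_node - 1) / tasks_per_node ))
}

nodes_needed 8 2   # prints 4
nodes_needed 8 8   # prints 1
```

With 8 tasks and 2 tasks per node this gives 4 nodes, whereas setting the per-node limit to 8 places all 8 tasks on one node, which oversubscribes a 2-processor machine.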

Regarding modules: what I mean by "modules work only with Parallelstudio but not with GNU and OpenMPI" is based on the following observations with GNU and OpenMPI:
  • When you disable the module system in config_machines.xml and load the modules (OpenMPI) on the command line, the compilation runs without any issues.
  • When you enable the module system in config_machines.xml, the compilation fails (make reports that it did not find some files to compile), whether or not you load the modules on the command line.
  • When you disable only the purge command in config_machines.xml, the compilation succeeds if and only if you load the modules on the command line.
However, the content of the file .env_machine_specific.sh is as follows, which suggests to me that openmpi/4.0.4 should be loaded:
Bash:
# This file is for user convenience only and is not used by the model
# Changes to this file will be ignored and overwritten
# Changes to the environment should be made in env_mach_specific.xml
# Run ./case.setup --reset to regenerate this file
source /mnt/nfs_shares/apps/Modules/init/sh
module load openmpi/4.0.4
export OMP_STACKSIZE=256M
export NETCDF_C_PATH=/scratch/netcdf/gnu/netcdf-c
export NETCDF_FORTRAN_PATH=/scratch/netcdf/gnu/netcdf-fortran
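
One way to test whether the listing above really makes OpenMPI available is to source it in a subshell and ask where a tool resolves afterwards. The sketch below is a hypothetical helper (tool_after_env is not a CESM command), and the choice of mpif90 in the usage line is an assumption about which MPI wrapper the build calls.

```shell
# tool_after_env SCRIPT TOOL: source SCRIPT in a subshell and print where
# TOOL resolves on the resulting PATH; returns non-zero if it is not found.
# (Hypothetical helper, not part of CESM.)
tool_after_env() {
    ( . "$1" >/dev/null 2>&1; command -v "$2" )
}
```

For example, `tool_after_env ./.env_machine_specific.sh mpif90 || echo "mpif90 not found"` would indicate whether the module load in the file had any effect on PATH.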

Best regards,
Gabriel
 