
CESM link error during build

Gabriel.Hes

Hes
Member
Hi all,
I am trying to run CESM2.2.0 on the pizdaint machine (CSCS, Switzerland) with the Intel compiler, and I am encountering a problem during the case build phase that blocks the generation of the .exe file.

The following is printed out:

Building cesm from /scratch/snx3000/ghes/cesm2.2.0/cime/src/drivers/mct/cime_config/buildexe with output to /scratch/snx3000/ghes/builds_cesm2.2.0/B1850_f09g17_daint_intel_4_1/bld/cesm.bldlog.220622-114855
ERROR: BUILD FAIL: buildexe failed, cat /scratch/snx3000/ghes/builds_cesm2.2.0/B1850_f09g17_daint_intel_4_1/bld/cesm.bldlog.220622-114855


When I check what is in the file /scratch/snx3000/ghes/builds_cesm2.2.0/B1850_f09g17_daint_intel_4_1/bld/cesm.bldlog.220622-114855, I see the following:

ERROR: Command gmake exec_se -j 12 EXEC_SE=/scratch/snx3000/ghes/builds_cesm2.2.0/B1850_f09g17_daint_intel_4_1/bld/cesm.exe MODEL=driver CIME_MODEL=cesm SMP=FALSE CASEROOT="/scratch/snx3000/ghes/cases/B1850_f09 g17_daint_intel_4_1" CASETOOLS="/scratch/snx3000/ghes/cases/B1850_f09g17_daint_intel_4_1/Tools" CIMEROOT="/scratch/snx3000/ghes/cesm2.2.0/cime" COMP_INTERFACE="mct" COMPILER="intel" DEBUG="FALSE" EXEROOT="/scrat ch/snx3000/ghes/builds_cesm2.2.0/B1850_f09g17_daint_intel_4_1/bld" INCROOT="/scratch/snx3000/ghes/builds_cesm2.2.0/B1850_f09g17_daint_intel_4_1/bld/lib/include" LIBROOT="/scratch/snx3000/ghes/builds_cesm2.2.0/B1850_f09g17_daint_intel_4_1/bld/lib" MACH="pizdaint" MPILIB="mpich" NINST_VALUE="c1a1l1i1o1r1g1w1i1e1" OS="CNL" PIO_VERSION="1" SHAREDLIBROOT="/scratch/snx3000/ghes/builds_cesm2.2.0/B1850_f09g17_daint_intel_4_1/bld" SMP_PRESENT="FALSE" USE_ESMF_LIB="FALSE" USE_MOAB="FALSE" CAM_CONFIG_OPTS="-phys cam6 -co2_cycle" COMP_LND="clm" COMPARE_TO_NUOPC="FALSE" CISM_USE_TRILINOS="FALSE" USE_TRILINOS="FALSE" USE_ALBANY="FALSE" USE_PETSC="FALSE" -f /scratch/snx3000/ghes/cases/B1850_f09g17_daint_intel_4_1/Tools/Makefile failed rc=2
out=cat: Srcfiles: No such file or directory

I cannot find the cause of this error. Has anyone already encountered it and managed to fix it?
 

jedwards

CSEG and Liaisons
Staff member
The Srcfiles file should have been generated in an earlier step. I can't tell from this snippet why it wasn't created; please attach the entire log next time.
However I do see something that may explain the problem. The EXEROOT variable is defined as EXEROOT="/scrat ch/snx3000/ghes/builds_cesm2.2.0/B1850_f09g17_daint_intel_4_1/bld"

It could just be a cut-and-paste issue, but it seems that there is a space in the file path.
 

Gabriel.Hes

Hes
Member
Thank you for your answer. I do think it is a copy-paste error, because the space is not present in the attached original file (cesm.bldlog). Does this file help in any way?
 

Attachments

  • cesm.bldlog.220622-114855.txt
    71.1 KB

jedwards

CSEG and Liaisons
Staff member
Yes, the actual error is at the end of the file.

/scratch/snx3000/ghes/cesm2.2.0/cime/src/drivers/mct/main/cime_comp_mod.F90:2602:(.text+0x1ede): relocation truncated to fit: R_X86_64_32S against symbol `seq_comm_mct_mp_iac_layout_' defined in COMMON section in /scratch/snx3000/ghes/builds_cesm2.2.0/B1850_f09g17_daint_intel_4_1/bld/intel/mpich/nodebug/nothreads/mct/mct/noesmf/c1a1l1i1o1r1g1w1i1e1/lib/libcsm_share.a(seq_comm_mct.o)

This error indicates that one or more static array declarations are too large for the memory model. It can often be solved by increasing the pelayout and recompiling.
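A sketch of how the pelayout might be changed with CIME's tools (the task count shown is a hypothetical example; appropriate values depend on the machine, and the commands only make sense inside a case directory):

```shell
# Sketch: from the case directory, spread the model over more MPI tasks so that
# per-task static data shrinks. NTASKS=720 is a hypothetical example value.
if [ -f env_mach_pes.xml ]; then
    ./xmlchange NTASKS=720      # raise the task count for all components
    ./case.setup --reset        # regenerate the layout-dependent files
    ./case.build --clean-all
    ./case.build
    msg="rebuilt with new pelayout"
else
    msg="run this from a CESM case directory"
fi
echo "$msg"
```

The guard simply makes the sketch safe to paste anywhere; the actual work happens only in a real case directory.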
 

Gabriel.Hes

Hes
Member
Thank you, I now know where to look. I will modify the pelayout configuration and let you know if that gets rid of the error.
 

Gabriel.Hes

Hes
Member
I managed to get rid of this error by not forcing any pelayout configuration. The code now compiles, but I get the following error during the run (time limit of 6 hours to simulate 1 day on 8 nodes).
End of cesm.log:
slurmstepd: error: *** STEP 39516187.0 ON nid03919 CANCELLED AT 2022-06-27T21:00:53 DUE TO TIME LIMIT ***
and the cpl.log ends abruptly this way:
(seq_timemgr_clockPrint) Intervl yms = 9999 0 0

tfreeze_option is mushy

(seq_mct_drv) : Initialize each component: atm, lnd, rof, ocn, ice, glc, wav, esp, iac
(component_init_cc:mct) : Initialize component atm
(component_init_cc:mct) : Initialize component lnd


Do you have any idea what could be causing this problem?
 

jedwards

CSEG and Liaisons
Staff member
It appears that the model is hanging during lnd model initialization. What is in the lnd log?
 

Gabriel.Hes

Hes
Member
I cannot find any explicit error in the lnd log files.
The end of the lnd.log is the following:

Interpolating: sabs_shadewall_dir => sabs_shadewall_dir: Copy levels
Interpolating: sabs_shadewall_dif => sabs_shadewall_dif: Copy levels
Interpolating: sabs_improad_dir => sabs_improad_dir: Copy levels
Interpolating: sabs_improad_dif => sabs_improad_dif: Copy levels
Interpolating: sabs_perroad_dir => sabs_perroad_dir: Copy levels
Interpolating: sabs_perroad_dif => sabs_perroad_dif: Copy levels
Interpolating: par240d => par240d: Copy levels
Interpolating: par24d => par24d: Copy levels
Interpolating: par240x => par240x: Copy levels
Interpolating: par24x => par24x: Copy levels
Interpolating: parsun => parsun: Copy levels
Interpolating: parsha => parsha: Copy levels
Interpolating: T_SOISNO => T_SOISNO:
Split levels: Copy snow-covered levels using SNLSNO + Interpolate using COL_Z


And the end of the lnd.bldlog is the following:
a - subgridRestMod.o
a - subgridWeightsMod.o
a - surfrdMod.o
a - surfrdUtilsMod.o
rm dynVarTimeUninterpMod.F90 ncdio_pio.F90 initInterp2dvar.F90 dynVarMod.F90 dynVarTimeInterpMod.F90 restUtilMod.F90 array_utils.F90

err:

cat: Srcfiles: No such file or directory
/scratch/snx3000/ghes/cesm2.2.0/components/clm/src/init_interp/initInterpMultilevelCopy.F90(47): warning #6178: The return value of this FUNCTION has not been defined. [CONSTRUCTOR]
type(interp_multilevel_copy_type) function constructor()
---------------------------------------------^
/scratch/snx3000/ghes/cesm2.2.0/components/clm/src/utils/restUtilMod.F90.in(79): warning #6843: A dummy argument with an explicit INTENT(OUT) declaration is not given an explicit value. [READVAR]
long_name, units, interpinic_flag, data, readvar, &
------------------------------------------------^
/scratch/snx3000/ghes/cesm2.2.0/components/clm/src/biogeophys/SoilWaterRetentionCurveClappHornberg1978Mod.F90(35): warning #6178: The return value of this FUNCTION has not been defined. [CONSTRUCTOR]
type(soil_water_retention_curve_clapp_hornberg_1978_type) function constructor()
---------------------------------------------------------------------^
/scratch/snx3000/ghes/cesm2.2.0/components/clm/src/biogeophys/SoilWaterRetentionCurveVanGenuchten1980Mod.F90(35): warning #6178: The return value of this FUNCTION has not been defined. [CONSTRUCTOR]
type(soil_water_retention_curve_vangenuchten_1980_type) function constructor()
-------------------------------------------------------------------^
/scratch/snx3000/ghes/cesm2.2.0/components/clm/src/biogeophys/SnowSnicarMod.F90(913): remark #8291: Recommended relationship between field width 'W' and the number of fractional digits 'D' in this edit descriptor is 'W>=D+7'.
write (iulog,"(a,e12.6,a,i6,a,i6)") "SNICAR ERROR: Energy conservation error of : ", energy_sum, &
------------------------------------^
/scratch/snx3000/ghes/cesm2.2.0/components/clm/src/biogeochem/NutrientCompetitionCLM45defaultMod.F90(54): warning #6178: The return value of this FUNCTION has not been defined. [CONSTRUCTOR]
type(nutrient_competition_clm45default_type) function constructor()
--------------------------------------------------------^
/scratch/snx3000/ghes/cesm2.2.0/components/clm/src/biogeochem/NutrientCompetitionFlexibleCNMod.F90(65): warning #6178: The return value of this FUNCTION has not been defined. [CONSTRUCTOR]
type(nutrient_competition_FlexibleCN_type) function constructor()
------------------------------------------------------^
/scratch/snx3000/ghes/cesm2.2.0/components/clm/src/main/clm_driver.F90(77): remark #6536: All symbols from this module are already visible due to another USE; the ONLY clause will have no effect. Rename clauses, if any, will be honored. [CLM_INSTMOD]
use clm_instMod , only : nutrient_competition_method
------^
ar: creating /scratch/snx3000/ghes/builds_cesm2.2.0/B1850_f09g17_daint_intel_8_1/bld/intel/mpich/nodebug/nothreads/mct/mct/noesmf/lib/libclm.a
 

sacks

Bill Sacks
CSEG and Liaisons
Staff member
I'd like to try to determine if this could be either (1) a compiler-specific problem or (2) a memory problem.

For (1): What version of the intel compiler are you using?

For (2): Can you try a coarse-resolution I compset (land-only, forced by a data atmosphere): --res f10_f10_mg37 --compset I1850Clm50BgcCrop? (Note: that standard compset automatically downloads a large amount of forcing data. If that is a problem, I can suggest an alternative with a lower data requirement.)
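Creating such a test case might look like the following sketch (CIMEROOT and the case path are hypothetical placeholders; --machine matches the port discussed in this thread):

```shell
# Sketch: create, build, and submit the coarse-resolution land-only test case.
# CIMEROOT and CASE below are hypothetical placeholders.
CIMEROOT=$HOME/cesm2.2.0/cime
CASE=$HOME/cases/I1850Clm50BgcCrop_f10_test
if [ -x "$CIMEROOT/scripts/create_newcase" ]; then
    "$CIMEROOT/scripts/create_newcase" --case "$CASE" \
        --res f10_f10_mg37 --compset I1850Clm50BgcCrop --machine pizdaint
    cd "$CASE" && ./case.setup && ./case.build && ./case.submit
    msg="case submitted"
else
    msg="create_newcase not found; set CIMEROOT to your CIME checkout"
fi
echo "$msg"
```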
 

Gabriel.Hes

Hes
Member
Yes, this would be good to know indeed.

For (1): These two modules are loaded: intel/2021.3.0 and PrgEnv-intel/6.0.10

For (2): I tried to compile and run the compset and resolution you proposed. Here is what I get in my run directory (there is no .log file; does this mean that the run completed successfully?):

atm_modelio.nml
CASEROOT
cism.config
cism_in
cpl_modelio.nml
datm_in
datm.streams.txt.CLMGSWP3v1.Precip
datm.streams.txt.CLMGSWP3v1.Solar
datm.streams.txt.CLMGSWP3v1.TPQW
datm.streams.txt.presaero.clim_1850
datm.streams.txt.topo.observed
drv_flds_in
drv_in
esp_modelio.nml
finidat_interp_dest.nc
glc_modelio.nml
I1850Clm50BgcCrop_f10_f10_mg37_daint_intel_8_1.cism.r.0001-01-02-00000.nc
I1850Clm50BgcCrop_f10_f10_mg37_daint_intel_8_1.cism.tavg_helper.0000-00-00-00000.nc
I1850Clm50BgcCrop_f10_f10_mg37_daint_intel_8_1.clm2.r.0001-01-02-00000.nc
I1850Clm50BgcCrop_f10_f10_mg37_daint_intel_8_1.clm2.rh0.0001-01-02-00000.nc
I1850Clm50BgcCrop_f10_f10_mg37_daint_intel_8_1.cpl.r.0001-01-02-00000.nc
I1850Clm50BgcCrop_f10_f10_mg37_daint_intel_8_1.datm.rs1.0001-01-02-00000.bin
I1850Clm50BgcCrop_f10_f10_mg37_daint_intel_8_1.mosart.r.0001-01-02-00000.nc
I1850Clm50BgcCrop_f10_f10_mg37_daint_intel_8_1.mosart.rh0.0001-01-02-00000.nc
iac_modelio.nml
ice_modelio.nml
inputdata_checksum.dat
lnd_in
lnd_modelio.nml
mosart_in
ocn_modelio.nml
rof_modelio.nml
rpointer.atm
rpointer.drv
rpointer.glc
rpointer.lnd
rpointer.rof
seq_maps.rc
timing
wav_modelio.nml


So do you think I am encountering a memory problem when running my other experiment?
 

jedwards

CSEG and Liaisons
Staff member
Log files and model output are in your archive directory - check the variable DOUT_S_ROOT for the location.
 

Gabriel.Hes

Hes
Member
Thank you: I checked my archive directory, and the log files are in the $DOUT_S_ROOT/$CASE/logs directory, so this shows that the simulation completed, doesn't it?

The lnd.log file ends like this, so there seems not to be any problem:
clm: completed timestep 47
clm: completed timestep 48
hist_htapes_wrapup : history tape 1 : no open file to close
writing restart file
./I1850Clm50BgcCrop_f10_f10_mg37_daint_intel_8_1.clm2.r.0001-01-02-00000.nc
for model date = 0001-01-02-00000

restFile_open: writing restart dataset at
./I1850Clm50BgcCrop_f10_f10_mg37_daint_intel_8_1.clm2.r.0001-01-02-00000.nc
at nstep = 48

htape_create : Opening netcdf rhtape
./I1850Clm50BgcCrop_f10_f10_mg37_daint_intel_8_1.clm2.rh0.0001-01-02-00000.nc
htape_create : Successfully defined netcdf restart history file 1
Successfully wrote local restart file
./I1850Clm50BgcCrop_f10_f10_mg37_daint_intel_8_1.clm2.r.0001-01-02-00000.nc
------------------------------------------------------------

(OPNFIL): Successfully opened file ./rpointer.lnd on unit= 93
Successfully wrote local restart pointer file
Successfully wrote out restart data at nstep = 48


The cesm.log ends like this:
Creating variable rofl_tavg
Writing to file I1850Clm50BgcCrop_f10_f10_mg37_daint_intel_8_1.cism.r.0001-01-02-00000.nc at time 0.000000000000000E+000
Setting mpi info: striping_factor=2
Setting mpi info: striping_unit=1048576
GPTLprint_memusage: Using Kbytesperpage=4
sysmem size=2233.3 MB rss=255.5 MB share=34.6 MB text=23.1 MB datastack=0.0 MB
Closing input file /scratch/snx3000/ghes/cesm_inputdata/glc/cism/Greenland/glissade/init/greenland_4km_epsg3413_c171126.nc
Closing output file I1850Clm50BgcCrop_f10_f10_mg37_daint_intel_8_1.cism.tavg_helper.0000-00-00-00000.nc
Some Stats
Maximum temperature iterations: 0


And the end of glc.log is like this:
Creating variable rofl_tavg
*******************************************************************************
Writing to file I1850Clm50BgcCrop_f10_f10_mg37_daint_intel_8_1.cism.r.0001-01-02-00000.nc at time 0.000000000000000E+000
(glc_final_mct) -------------------------------------------------------------------------
(glc_final_mct) GLC: end of main integration loop
(glc_final_mct) -------------------------------------------------------------------------
*******************************************************************************
Closing input file /scratch/snx3000/ghes/cesm_inputdata/glc/cism/Greenland/glissade/init/greenland_4km_epsg3413_c171126.nc
*******************************************************************************
Closing output file I1850Clm50BgcCrop_f10_f10_mg37_daint_intel_8_1.cism.tavg_helper.0000-00-00-00000.nc
Some Stats
Maximum temperature iterations: 0


And I do have restart files:
ghes@daint102:/scratch/snx3000/ghes/cesm2.2.0/archive/I1850Clm50BgcCrop_f10_f10_mg37_daint_intel_8_1/rest/0001-01-02-00000> ls
I1850Clm50BgcCrop_f10_f10_mg37_daint_intel_8_1.cism.r.0001-01-02-00000.nc I1850Clm50BgcCrop_f10_f10_mg37_daint_intel_8_1.cpl.r.0001-01-02-00000.nc rpointer.atm rpointer.lnd
I1850Clm50BgcCrop_f10_f10_mg37_daint_intel_8_1.clm2.r.0001-01-02-00000.nc I1850Clm50BgcCrop_f10_f10_mg37_daint_intel_8_1.mosart.r.0001-01-02-00000.nc rpointer.drv rpointer.rof
I1850Clm50BgcCrop_f10_f10_mg37_daint_intel_8_1.clm2.rh0.0001-01-02-00000.nc I1850Clm50BgcCrop_f10_f10_mg37_daint_intel_8_1.mosart.rh0.0001-01-02-00000.nc rpointer.glc
 

sacks

Bill Sacks
CSEG and Liaisons
Staff member
Great, thanks for running those tests. This tells me that the relevant code *can* work on your machine/compiler (I think we've had issues with some of the land interpolation code with old compilers), but it's running into problems in this particular case. My first guess would be a memory issue. Is it possible for you to use more nodes for this simulation? If so, I would suggest trying that.

One other thing you could do, if you can't or don't want to use more than 8 nodes for your full simulation: run a short case (e.g., 1 day) with compset I1850Clm50BgcCrop and resolution f09_g17. This should produce a finidat_interp_dest.nc file in your run directory. It's possible that this will work on only 8 nodes, since there are fewer components competing for memory, but you may need to run it on more than 8 nodes to get past this memory bottleneck. If you can successfully get past initialization with this configuration, then I think it will work to use that finidat_interp_dest.nc file in your full, B-compset simulation. To do that, copy the generated finidat_interp_dest.nc file into the run directory of your B-compset simulation and then add the following to the user_nl_clm file in your B case:

finidat = 'finidat_interp_dest.nc'
 

sacks

Bill Sacks
CSEG and Liaisons
Staff member
Actually, I just realized that my suggested workaround of running the I1850Clm50BgcCrop case to get a finidat_interp_dest.nc file probably won't do the right thing, because I think it will start from the wrong initial conditions file. You can do something similar, though: run your desired B1850 configuration for one day on more nodes to generate a finidat_interp_dest.nc file; then you can use that to start your actual case on however many or few nodes you want.
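That workflow can be sketched roughly as follows (both case directory names are hypothetical placeholders; the xmlchange/xmlquery calls assume standard CIME cases):

```shell
# Sketch of the workaround: run the B1850 case for one day on a larger node
# count, then reuse the interpolated CLM initial file in the smaller case.
# Both case directory names below are hypothetical placeholders.
BIGCASE=$HOME/cases/B1850_manynodes    # case with enough nodes to pass init
SMALLCASE=$HOME/cases/B1850_8nodes     # the case you actually want to run
if [ -d "$BIGCASE" ] && [ -d "$SMALLCASE" ]; then
    # in the big case, stop after one simulated day
    (cd "$BIGCASE" && ./xmlchange STOP_OPTION=ndays,STOP_N=1 && ./case.submit)
    # once it finishes, copy the interpolated initial file across
    bigrun=$(cd "$BIGCASE" && ./xmlquery --value RUNDIR)
    smallrun=$(cd "$SMALLCASE" && ./xmlquery --value RUNDIR)
    cp "$bigrun/finidat_interp_dest.nc" "$smallrun/"
    echo "finidat = 'finidat_interp_dest.nc'" >> "$SMALLCASE/user_nl_clm"
    msg="finidat staged"
else
    msg="run where both case directories exist"
fi
echo "$msg"
```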
 

Gabriel.Hes

Hes
Member
Thank you for this advice. I have already tried to run the B1850 case for 1 day with more nodes (e.g., 40), but I get the following error when submitting:
ERROR: Command: 'sbatch --time 00:30:00 -p normal --account sm62 .case.run --resubmit' failed with error 'sbatch: error: CPU count per node can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available' from dir '/scratch/snx3000/ghes/cases/B1850_f09g17_daint_intel_40_1'

This was done without specifying the pelayout, though.
 

jedwards

CSEG and Liaisons
Staff member
Please run ./preview_run for this case and send the output. This may be a system issue, or it may be something in the way you've defined the system in config_machines.xml or config_batch.xml. Can you run an MPI hello world program that spans multiple nodes?
 

Gabriel.Hes

Hes
Member
Here is the output of ./preview_run for case B1850_f09g17_daint_intel_40_1. It is surprising, because I set the number of cores to 40, but here it states 16:

CASE INFO:
nodes: 16
total tasks: 288
tasks per node: 18
thread count: 2

BATCH INFO:
FOR JOB: case.run
ENV:
module command is /opt/modules/default/bin/modulecmd python rm PrgEnv-intel PrgEnv-cray PrgEnv-gnu PrgEnv-pgi perftools-base
module command is /opt/modules/default/bin/modulecmd python load craype cray-mpich cray-netcdf-hdf5parallel cray-parallel-netcdf daint-gpu PrgEnv-intel craype cray-mpich cray-netcdf-hdf5parallel cray-parallel-netcdf daint-gpu perftools-base perftools-preload
Setting Environment OMP_STACKSIZE=64M
Setting Environment OMP_STACKSIZE=64M
Setting Environment OMP_NUM_THREADS=2

SUBMIT CMD:
sbatch --time 00:30:00 -p normal --account sm62 .case.run --resubmit

MPIRUN (job=case.run):
srun -prepend-rank --cpu_bind=rank --hint=nomultithread /scratch/snx3000/ghes/builds_cesm2.2.0/B1850_f09g17_daint_intel_40_1/bld/cesm.exe >> cesm.log.$LID 2>&1

FOR JOB: case.st_archive
ENV:
module command is /opt/modules/default/bin/modulecmd python rm PrgEnv-intel PrgEnv-cray PrgEnv-gnu PrgEnv-pgi perftools-base
module command is /opt/modules/default/bin/modulecmd python load craype cray-mpich cray-netcdf-hdf5parallel cray-parallel-netcdf daint-gpu PrgEnv-intel craype cray-mpich cray-netcdf-hdf5parallel cray-parallel-netcdf daint-gpu perftools-base perftools-preload
Setting Environment OMP_STACKSIZE=64M
Setting Environment OMP_STACKSIZE=64M
Setting Environment OMP_NUM_THREADS=2

SUBMIT CMD:
sbatch --time 00:20:00 -p normal --account sm62 --dependency=afterok:0 case.st_archive --resubmit
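For what it's worth, the node count in the CASE INFO block follows from the pelayout rather than from a directly requested node count: 288 tasks at 18 tasks per node gives 16 nodes, and with 2 threads each node must supply 36 CPUs. If the nodes cannot provide that many, it would be consistent with the earlier "CPU count per node can not be satisfied" message (an assumption, not a diagnosis). The arithmetic:

```shell
# Reproduce the arithmetic behind the CASE INFO block above.
total_tasks=288
tasks_per_node=18
thread_count=2

# ceiling division: nodes needed to hold all tasks
nodes=$(( (total_tasks + tasks_per_node - 1) / tasks_per_node ))
# CPUs each node must provide (tasks x threads)
cpus_per_node=$(( tasks_per_node * thread_count ))

echo "nodes=$nodes cpus_per_node=$cpus_per_node"   # nodes=16 cpus_per_node=36
```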
 

Gabriel.Hes

Hes
Member
I just reran a B1850 case specifying the following pelayout (https://csegweb.cgd.ucar.edu/timing..._084708_1w5oyo.2365516.chadmin1.180904-085437) and the code now runs correctly. This layout uses 720 PEs, and since my machine has 12 PEs per node, I ran the case with 60 nodes. This seems to solve the error I got previously ('sbatch: error: CPU count per node can not be satisfied').
 

Gabriel.Hes

Hes
Member
I am now trying to use the same config files and workflow in my script to run a land-atmosphere simulation with the CLM code provided here (Supplementary Material of "Impacts of a Revised Surface Roughness Parameterization in the Community Land Model 5.1" - Research Collection). However, I am encountering problems during the run. The error seems to come from PIO (see line 658 in the attached cesm.log, from the run directory). Interestingly, when I ran the previous CESM simulations I had PIO=1 by default, and now with my CLM simulation I have PIO=2 by default. Do you know why, and whether I have to change this?
 

Attachments

  • cesm.log.39711641.220706-084712.zip
    23.2 KB

jedwards

CSEG and Liaisons
Staff member
PIO=1 refers to the 1.x version of PIO; PIO=2 refers to PIO version 2.x.

The error you are getting is not due to PIO but to the netCDF library:

Abort with message NetCDF: Attempt to use feature that was not turned on when netCDF was built.

This means that you are trying to open an HDF5-format file without HDF5 support, or you are trying to open a CDF5 file with a very old version of netCDF. Figure out which file it is by looking at the end of the lnd.log file, and use ncdump -k filename to get the format of the file. You may need to build and install a newer version of netCDF.
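If ncdump happens to be unavailable on the compute system, the file flavor can also be read off the leading bytes: classic netCDF files start with "CDF" followed by a version byte (1, 2, or 5), while netCDF-4 files are HDF5 files and start with the HDF5 signature. A small sketch (the demo files are synthetic headers, not real datasets):

```shell
# Fallback when ncdump -k is unavailable: classify a netCDF file by magic bytes.
nc_kind() {
    magic=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
    case "$magic" in
        43444601) echo "classic (CDF-1)" ;;
        43444602) echo "64-bit offset (CDF-2)" ;;
        43444605) echo "cdf5 (CDF-5)" ;;
        89484446) echo "netCDF-4 / HDF5" ;;
        *)        echo "unknown" ;;
    esac
}

# demo on synthetic headers (a real file would be passed by path)
printf 'CDF\001' > /tmp/demo_cdf1
printf '\211HDF\r\n\032\n' > /tmp/demo_h5
nc_kind /tmp/demo_cdf1   # classic (CDF-1)
nc_kind /tmp/demo_h5     # netCDF-4 / HDF5
```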
 