Build Error while building CESM 1.2.2 using threads

nitkbhat@...
I am trying to build a threaded version of CESM 1.2.2. I intend to run OpenMP threads alongside MPI tasks on a 16-core node.

The component set is B and the resolution is f19_g16.

I set the BUILD_THREADED variable to true.

I changed env_mach_pes.xml to the following:

<entry id="NTASKS_ATM"   value="2"  />   
<entry id="NTHRDS_ATM"   value="2"  />   
<entry id="ROOTPE_ATM"   value="0"  />   
<entry id="NINST_ATM"   value="1"  />   
<entry id="NINST_ATM_LAYOUT"   value="concurrent"  />   

<entry id="NTASKS_LND"   value="2"  />   
<entry id="NTHRDS_LND"   value="1"  />   
<entry id="ROOTPE_LND"   value="2"  />   
<entry id="NINST_LND"   value="1"  />   
<entry id="NINST_LND_LAYOUT"   value="concurrent"  />   

<entry id="NTASKS_ICE"   value="2"  />   
<entry id="NTHRDS_ICE"   value="1"  />   
<entry id="ROOTPE_ICE"   value="4"  />   
<entry id="NINST_ICE"   value="1"  />   
<entry id="NINST_ICE_LAYOUT"   value="concurrent"  />   

<entry id="NTASKS_OCN"   value="2"  />   
<entry id="NTHRDS_OCN"   value="2"  />   
<entry id="ROOTPE_OCN"   value="6"  />   
<entry id="NINST_OCN"   value="1"  />   
<entry id="NINST_OCN_LAYOUT"   value="concurrent"  />   

<entry id="NTASKS_CPL"   value="1"  />   
<entry id="NTHRDS_CPL"   value="1"  />   
<entry id="ROOTPE_CPL"   value="8"  />   

<entry id="NTASKS_GLC"   value="1"  />   
<entry id="NTHRDS_GLC"   value="1"  />   
<entry id="ROOTPE_GLC"   value="9"  />   
<entry id="NINST_GLC"   value="1"  />   
<entry id="NINST_GLC_LAYOUT"   value="concurrent"  />   

<entry id="NTASKS_ROF"   value="1"  />   
<entry id="NTHRDS_ROF"   value="1"  />   
<entry id="ROOTPE_ROF"   value="10"  />   
<entry id="NINST_ROF"   value="1"  />   
<entry id="NINST_ROF_LAYOUT"   value="concurrent"  />   

<entry id="NTASKS_WAV"   value="1"  />   
<entry id="NTHRDS_WAV"   value="1"  />   
<entry id="ROOTPE_WAV"   value="11"  />   
<entry id="NINST_WAV"   value="1"  />   
<entry id="NINST_WAV_LAYOUT"   value="concurrent"  />   

<entry id="PSTRID_ATM"   value="1"  />   
<entry id="PSTRID_LND"   value="1"  />   
<entry id="PSTRID_ICE"   value="1"  />   
<entry id="PSTRID_OCN"   value="1"  />   
<entry id="PSTRID_CPL"   value="1"  />   
<entry id="PSTRID_GLC"   value="1"  />   
<entry id="PSTRID_ROF"   value="1"  />   
<entry id="PSTRID_WAV"   value="1"  />   

<entry id="TOTALPES"   value="16"  />   
<entry id="PES_LEVEL"   value="1r"  />   
<entry id="MAX_TASKS_PER_NODE"   value="16"  />   
<entry id="PES_PER_NODE"   value="$MAX_TASKS_PER_NODE"  />   
<entry id="COST_PES"   value="0"  />   
<entry id="CCSM_PCOST"   value="1"  />   
<entry id="CCSM_TCOST"   value="0"  />   
<entry id="CCSM_ESTCOST"   value="4"  />   

</config_definition>
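
As a rough cross-check of the layout above, the task and thread counts can be added up in a few lines of Python. This is only a sketch: it assumes ROOTPE counts MPI ranks and that each component occupies NTASKS consecutive ranks (PSTRID is 1 for every component in the file). The names LAYOUT, summarize and overlaps are made up for illustration and are not part of the CESM scripts.

```python
# Values copied from the env_mach_pes.xml entries above:
# component -> (NTASKS, NTHRDS, ROOTPE). PSTRID is 1 everywhere.
LAYOUT = {
    "ATM": (2, 2, 0),
    "LND": (2, 1, 2),
    "ICE": (2, 1, 4),
    "OCN": (2, 2, 6),
    "CPL": (1, 1, 8),
    "GLC": (1, 1, 9),
    "ROF": (1, 1, 10),
    "WAV": (1, 1, 11),
}


def summarize(layout):
    """Return (MPI tasks, sum of tasks*threads, max threads per task)."""
    # Assumption: a component uses ranks ROOTPE .. ROOTPE + NTASKS - 1.
    mpi_tasks = max(root + ntasks for ntasks, _, root in layout.values())
    slots = sum(ntasks * nthrds for ntasks, nthrds, _ in layout.values())
    max_threads = max(nthrds for _, nthrds, _ in layout.values())
    return mpi_tasks, slots, max_threads


def overlaps(layout):
    """List pairs of components whose assumed rank ranges intersect."""
    ranges = {
        name: set(range(root, root + ntasks))
        for name, (ntasks, _, root) in layout.items()
    }
    names = list(ranges)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if ranges[a] & ranges[b]
    ]


if __name__ == "__main__":
    tasks, slots, threads = summarize(LAYOUT)
    print("MPI tasks:", tasks)
    print("sum of NTASKS*NTHRDS:", slots)
    print("max threads per task:", threads)
    print("overlapping components:", overlaps(LAYOUT))
```

Under these assumptions the layout comes to 12 MPI tasks and 16 task-thread slots with no overlapping components, which is consistent with the TOTALPES value of 16 in the file.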

However, I get an error when I build the model.

[nitin@master B_f19_g16_1node_omp]$ ./B_f19_g16_1node_omp.build
-------------------------------------------------------------------------
 CESM BUILDNML SCRIPT STARTING
 - To prestage restarts, untar a restart.tar file into /home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/run
 infile is /storage/home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/Buildconf/cplconf/cesm_namelist
CAM writing dry deposition namelist to drv_flds_in
CAM writing namelist to atm_in
CLM configure done.
CLM adding use_case 2000_control defaults for var sim_year with val 2000
CLM adding use_case 2000_control defaults for var sim_year_range with val constant
CLM adding use_case 2000_control defaults for var use_case_desc with val Conditions to simulate 2000 land-use
CICE configure done.
POP2 build-namelist: ocn_grid is gx1v6
POP2 build-namelist: ocn_tracer_modules are  iage
 CESM BUILDNML SCRIPT HAS FINISHED SUCCESSFULLY
-------------------------------------------------------------------------
-------------------------------------------------------------------------
 CESM PRESTAGE SCRIPT STARTING
 - Case input data directory, DIN_LOC_ROOT, is /home/nitin/CESM_NEW/input_data
 - Checking the existence of input datasets in DIN_LOC_ROOT
 CESM PRESTAGE SCRIPT HAS FINISHED SUCCESSFULLY
-------------------------------------------------------------------------
-------------------------------------------------------------------------
 CESM BUILDEXE SCRIPT STARTING
rm: No match.
 COMPILER is intel
 - Build Libraries: mct gptl pio csm_share
Wed Feb 11 14:09:25 IST 2015 /home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/intel/mpich/nodebug/threads/mct.bldlog.150211-140918
Wed Feb 11 14:09:26 IST 2015 /home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/intel/mpich/nodebug/threads/gptl.bldlog.150211-140918
Wed Feb 11 14:09:26 IST 2015 /home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/intel/mpich/nodebug/threads/pio.bldlog.150211-140918
Wed Feb 11 14:09:27 IST 2015 /home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/intel/mpich/nodebug/threads/csm_share.bldlog.150211-140918
Wed Feb 11 14:09:27 IST 2015 /home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/atm.bldlog.150211-140918
Wed Feb 11 14:10:39 IST 2015 /home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/lnd.bldlog.150211-140918
Wed Feb 11 14:11:14 IST 2015 /home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/ice.bldlog.150211-140918
Wed Feb 11 14:11:48 IST 2015 /home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/ocn.bldlog.150211-140918
Wed Feb 11 14:13:27 IST 2015 /home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/glc.bldlog.150211-140918
Wed Feb 11 14:13:27 IST 2015 /home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/wav.bldlog.150211-140918
Wed Feb 11 14:13:28 IST 2015 /home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/rof.bldlog.150211-140918
Wed Feb 11 14:13:39 IST 2015 /home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/cesm.bldlog.150211-140918
ERROR: cesm.buildexe.csh failed, see /home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/cesm.bldlog.150211-140918
ERROR: cat /home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/cesm.bldlog.150211-140918

The error in the log is as follows:

mpiifort -o /home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/cesm.exe ccsm_comp_mod.o ccsm_driver.o mrg_mod.o seq_avdata_mod.o seq_diag_mct.o seq_domain_mct.o seq_flux_mct.o seq_frac_mct.o seq_hist_mod.o seq_map_esmf.o seq_map_mod.o seq_mctext_mod.o seq_rest_mod.o  -L/home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/lib/ -latm  -L/home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/lib/ -lice  -L/home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/lib/ -llnd  -L/home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/lib/ -locn  -L/home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/lib/ -lrof  -L/home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/lib/ -lglc  -L/home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/lib/ -lwav -L/home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/intel/mpich/nodebug/threads/MCT/noesmf/a1l1r1i1o1g1w1/csm_share -lcsm_share -L/home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/intel/mpich/nodebug/threads/lib -lpio -lgptl -lmct -lmpeu -L/storage/softwares/installedsoftware/netcdf_4.4.0/lib -lnetcdff -lnetcdf -lhdf5_hl -lhdf5 -lz -lm -L/opt/intel/impi/4.1.3.048/intel64/lib -lmpich  -L/storage/softwares/installedsoftware/netcdf_4.4.0/lib -lnetcdff -lnetcdf -lhdf5_hl -lhdf5 -lz -lm -openmp
ld: MPIR_Thread: TLS definition in /opt/intel/impi/4.1.3.048/intel64/lib/libmpi_mt.so section .tbss mismatches non-TLS definition in /opt/intel/impi/4.1.3.048/intel64/lib/libmpich.so section .bss
/opt/intel/impi/4.1.3.048/intel64/lib/libmpi_mt.so: could not read symbols: Bad value
gmake: *** [/home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/cesm.exe] Error 1

 

Please find attached the env_mach_pes.xml and the error log file.

Thanks. 

jedwards

This appears to be a system configuration error - please consult with your local system administrators.   

Have you tried a simple program like the mpi hello world in the users guide?   

CESM Software Engineer

nitkbhat@...

Yes, I tried running hybrid MPI and OpenMP programs (in addition to pure MPI programs). I didn't face any issues compiling or running them.

nitkbhat@...

The logs suggested that -lmpich and -lmpi were being added to the link line.

To build a multithreaded version, I had to remove -lmpich and -lmpi from the compiler options. I had not added them by hand.
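
A quick way to see which libraries end up on a link line like the one in the build log is to list its -l flags. The sketch below is illustrative only: the sample command is a shortened form of the link line from the log, the function names are hypothetical, and the set of library names treated as "explicit MPI libraries" (mpich, mpi, mpi_mt) is an assumption taken from the names that appear in this thread.

```python
import shlex
from collections import Counter


def linked_libs(link_cmd):
    """Return the library names given with -l, in order of appearance."""
    return [
        tok[2:]
        for tok in shlex.split(link_cmd)
        if tok.startswith("-l") and len(tok) > 2
    ]


def report(link_cmd, mpi_names=("mpich", "mpi", "mpi_mt")):
    """Return (explicit MPI libraries, libraries listed more than once)."""
    libs = linked_libs(link_cmd)
    counts = Counter(libs)
    explicit_mpi = [name for name in libs if name in mpi_names]
    repeated = sorted(name for name, n in counts.items() if n > 1)
    return explicit_mpi, repeated


# Shortened sample based on the link line in the build log above.
SAMPLE = (
    "mpiifort -o cesm.exe ccsm_driver.o -lcsm_share -lpio -lgptl -lmct -lmpeu "
    "-lnetcdff -lnetcdf -lm -L/opt/intel/impi/4.1.3.048/intel64/lib -lmpich "
    "-lnetcdff -lnetcdf -lm -openmp"
)

if __name__ == "__main__":
    explicit_mpi, repeated = report(SAMPLE)
    print("explicit MPI libraries:", explicit_mpi)
    print("libraries listed more than once:", repeated)
```

On the sample it flags mpich as an explicitly linked MPI library, which is the flag that clashed with libmpi_mt.so in the linker message.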

-lmpich was being added because the MPI_LIB_NAME variable in the Macros file was set to "mpich". Once I unset that variable, -lmpich was no longer added. Additionally, -lmpi was added because MPI_PATH pointed to impi.

Now I get successful builds for certain configurations. However, when I change the configuration to accommodate more threads (1 task for each component, each with multiple threads), I get the following error. I have attached the PE layout file along with the log file.

mpiifort -o /home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/cesm.exe ccsm_comp_mod.o ccsm_driver.o mrg_mod.o seq_avdata_mod.o seq_diag_mct.o seq_domain_mct.o seq_flux_mct.o seq_frac_mct.o seq_hist_mod.o seq_map_esmf.o seq_map_mod.o seq_mctext_mod.o seq_rest_mod.o  -L/home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/lib/ -latm  -L/home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/lib/ -lice  -L/home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/lib/ -llnd  -L/home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/lib/ -locn  -L/home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/lib/ -lrof  -L/home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/lib/ -lglc  -L/home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/lib/ -lwav -L/home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/intel/mpich/nodebug/threads/MCT/noesmf/a1l1r1i1o1g1w1/csm_share -lcsm_share -L/home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/intel/mpich/nodebug/threads/lib -lpio -lgptl -lmct -lmpeu -L/storage/softwares/installedsoftware/netcdf_4.4.0/lib -lnetcdff -lnetcdf -lhdf5_hl -lhdf5 -lz -lm  -L/storage/softwares/installedsoftware/netcdf_4.4.0/lib -lnetcdff -lnetcdf -lhdf5_hl -lhdf5 -lz -lm -openmp
ccsm_comp_mod.o: In function `ccsm_comp_mod_mp_ccsm_run_':
/storage/home/nitin/CESM_NEW/cesm1_2_2/models/drv/driver/ccsm_comp_mod.F90:(.text+0x2a67): relocation truncated to fit: R_X86_64_PC32 against symbol `seq_comm_mct_mp_num_inst_frc_' defined in COMMON section in /home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/intel/mpich/nodebug/threads/MCT/noesmf/a1l1r1i1o1g1w1/csm_share/libcsm_share.a(seq_comm_mct.o)
/storage/home/nitin/CESM_NEW/cesm1_2_2/models/drv/driver/ccsm_comp_mod.F90:(.text+0x2d91): relocation truncated to fit: R_X86_64_PC32 against symbol `seq_comm_mct_mp_num_inst_frc_' defined in COMMON section in /home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/intel/mpich/nodebug/threads/MCT/noesmf/a1l1r1i1o1g1w1/csm_share/libcsm_share.a(seq_comm_mct.o)
/storage/home/nitin/CESM_NEW/cesm1_2_2/models/drv/driver/ccsm_comp_mod.F90:(.text+0x3cba): relocation truncated to fit: R_X86_64_PC32 against symbol `seq_comm_mct_mp_num_inst_xao_' defined in COMMON section in /home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/intel/mpich/nodebug/threads/MCT/noesmf/a1l1r1i1o1g1w1/csm_share/libcsm_share.a(seq_comm_mct.o)
/storage/home/nitin/CESM_NEW/cesm1_2_2/models/drv/driver/ccsm_comp_mod.F90:(.text+0x3cd7): relocation truncated to fit: R_X86_64_PC32 against symbol `seq_comm_mct_mp_num_inst_frc_' defined in COMMON section in /home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/intel/mpich/nodebug/threads/MCT/noesmf/a1l1r1i1o1g1w1/csm_share/libcsm_share.a(seq_comm_mct.o)
/storage/home/nitin/CESM_NEW/cesm1_2_2/models/drv/driver/ccsm_comp_mod.F90:(.text+0x3e36): relocation truncated to fit: R_X86_64_PC32 against symbol `seq_comm_mct_mp_num_inst_xao_' defined in COMMON section in /home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/intel/mpich/nodebug/threads/MCT/noesmf/a1l1r1i1o1g1w1/csm_share/libcsm_share.a(seq_comm_mct.o)
/storage/home/nitin/CESM_NEW/cesm1_2_2/models/drv/driver/ccsm_comp_mod.F90:(.text+0x3e5d): relocation truncated to fit: R_X86_64_PC32 against symbol `seq_comm_mct_mp_num_inst_frc_' defined in COMMON section in /home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/intel/mpich/nodebug/threads/MCT/noesmf/a1l1r1i1o1g1w1/csm_share/libcsm_share.a(seq_comm_mct.o)
/storage/home/nitin/CESM_NEW/cesm1_2_2/models/drv/driver/ccsm_comp_mod.F90:(.text+0x5046): relocation truncated to fit: R_X86_64_PC32 against symbol `seq_comm_mct_mp_num_inst_frc_' defined in COMMON section in /home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/intel/mpich/nodebug/threads/MCT/noesmf/a1l1r1i1o1g1w1/csm_share/libcsm_share.a(seq_comm_mct.o)
/storage/home/nitin/CESM_NEW/cesm1_2_2/models/drv/driver/ccsm_comp_mod.F90:(.text+0x5324): relocation truncated to fit: R_X86_64_PC32 against symbol `seq_comm_mct_mp_num_inst_xao_' defined in COMMON section in /home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/intel/mpich/nodebug/threads/MCT/noesmf/a1l1r1i1o1g1w1/csm_share/libcsm_share.a(seq_comm_mct.o)
/storage/home/nitin/CESM_NEW/cesm1_2_2/models/drv/driver/ccsm_comp_mod.F90:(.text+0x5340): relocation truncated to fit: R_X86_64_PC32 against symbol `seq_comm_mct_mp_num_inst_frc_' defined in COMMON section in /home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/intel/mpich/nodebug/threads/MCT/noesmf/a1l1r1i1o1g1w1/csm_share/libcsm_share.a(seq_comm_mct.o)
/storage/home/nitin/CESM_NEW/cesm1_2_2/models/drv/driver/ccsm_comp_mod.F90:(.text+0x7840): relocation truncated to fit: R_X86_64_32S against symbol `seq_comm_mct_mp_cplocnid_' defined in COMMON section in /home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/intel/mpich/nodebug/threads/MCT/noesmf/a1l1r1i1o1g1w1/csm_share/libcsm_share.a(seq_comm_mct.o)
/storage/home/nitin/CESM_NEW/cesm1_2_2/models/drv/driver/ccsm_comp_mod.F90:(.text+0x7936): additional relocation overflows omitted from the output
gmake: *** [/home/nitin/CESM_NEW/cesm1_2_2/cases/B_f19_g16_1node_omp/cesm.exe] Error 1

Thanks.

jedwards

I think that in doing multiple builds you may have ended up with incompatible object files. Do a $CASE.clean_build all and rebuild. You still have a -lmpich on the link line as well.

CESM Software Engineer

nitkbhat@...

I was able to solve the build error (relocation truncated to fit) by adding the -mcmodel=medium flag to both the compiler flags and the linker flags in the $CASE. I can now build the model successfully.
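
For context, R_X86_64_PC32 and R_X86_64_32S relocations hold signed 32-bit values, so the linker reports "relocation truncated to fit" when a symbol lands too far from the code that refers to it. The following Python sketch tallies which symbols and relocation types a linker log complains about. The sample lines are shortened from the log earlier in this thread (paths trimmed), and the names PATTERN and tally are illustrative only.

```python
import re
from collections import Counter

# Matches the relocation type and the symbol name in an ld overflow message.
PATTERN = re.compile(
    r"relocation truncated to fit: (R_X86_64_\w+) against symbol `([^']+)'"
)


def tally(log_text):
    """Count overflow messages per (symbol, relocation type)."""
    return Counter((m.group(2), m.group(1)) for m in PATTERN.finditer(log_text))


# Shortened sample lines; file paths are trimmed for readability.
SAMPLE = """\
ccsm_comp_mod.F90:(.text+0x2a67): relocation truncated to fit: R_X86_64_PC32 against symbol `seq_comm_mct_mp_num_inst_frc_' defined in COMMON section in libcsm_share.a(seq_comm_mct.o)
ccsm_comp_mod.F90:(.text+0x3cba): relocation truncated to fit: R_X86_64_PC32 against symbol `seq_comm_mct_mp_num_inst_xao_' defined in COMMON section in libcsm_share.a(seq_comm_mct.o)
ccsm_comp_mod.F90:(.text+0x7840): relocation truncated to fit: R_X86_64_32S against symbol `seq_comm_mct_mp_cplocnid_' defined in COMMON section in libcsm_share.a(seq_comm_mct.o)
"""

if __name__ == "__main__":
    for (symbol, reloc), count in sorted(tally(SAMPLE).items()):
        print(f"{symbol}  {reloc}  x{count}")
```

Note that ld prints only the first few overflows and then "additional relocation overflows omitted from the output", so a tally like this is a lower bound on the number of affected references.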

 

However, I get a runtime error when I try to run the threaded model with just 2 threads per MPI task (please find the env_mach_pes.xml file attached). The model runs for some time and then gives the following error in the log (the cesm log file is also attached).

 

MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
(seq_domain_areafactinit) : min/max mdl2drv   0.999841513526222       1.00031732638246    areafact_a_ATM
(seq_domain_areafactinit) : min/max drv2mdl   0.999682774281628       1.00015851159572    areafact_a_ATM
(seq_domain_areafactinit) : min/max mdl2drv   0.999841513526350       1.00076245909423    areafact_l_LND
(seq_domain_areafactinit) : min/max drv2mdl   0.999238121806731       1.00015851159559    areafact_l_LND
(seq_domain_areafactinit) : min/max mdl2drv   0.999996826904345      0.999996826905162    areafact_r_ROF
(seq_domain_areafactinit) : min/max drv2mdl    1.00000317310491       1.00000317310572    areafact_r_ROF
(seq_domain_areafactinit) : min/max mdl2drv   0.999565456406962       1.00000000000000    areafact_o_OCN
(seq_domain_areafactinit) : min/max drv2mdl    1.00000000000000       1.00043473250326    areafact_o_OCN
(seq_domain_areafactinit) : min/max mdl2drv   0.999565456406962       1.00000000000000    areafact_i_ICE
(seq_domain_areafactinit) : min/max drv2mdl    1.00000000000000       1.00043473250326    areafact_i_ICE
(seq_mct_drv) : Initialize atm component phase 2 ATM
[48:node2] unexpected disconnect completion event from [77:node7]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 48
[56:node1] unexpected disconnect completion event from [77:node7]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
[52:node2] unexpected disconnect completion event from [77:node7]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 56
internal ABORT - process 52
[50:node2] unexpected disconnect completion event from [56:node1]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 50
[58:node1] unexpected disconnect completion event from [48:node2]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 58
[63:node1] unexpected disconnect completion event from [48:node2]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 63
[55:node2] unexpected disconnect completion event from [56:node1]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 55
[59:node1] unexpected disconnect completion event from [77:node7]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 59
[54:node2] unexpected disconnect completion event from [56:node1]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 54
[53:node2] unexpected disconnect completion event from [56:node1]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 53
[61:node1] unexpected disconnect completion event from [77:node7]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 61
[60:node1] unexpected disconnect completion event from [77:node7]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 60
[57:node1] unexpected disconnect completion event from [48:node2]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 57
[51:node2] unexpected disconnect completion event from [56:node1]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 51
[62:node1] unexpected disconnect completion event from [78:node7]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 62

APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
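
When a log contains many "unexpected disconnect" lines like the ones above, it can help to count which ranks are named as the disconnecting peer, since ranks that are blamed often but do not report an abort of their own are reasonable candidates for where the failure started. The Python sketch below does that count. It is only an illustration: the sample is four lines copied from the log above, the names EVENT and peers are hypothetical, and the reading of the counts is an assumption rather than a documented property of the MPI library.

```python
import re
from collections import Counter

# "[reporting_rank:node] unexpected disconnect completion event from [peer_rank:node]"
EVENT = re.compile(
    r"\[(\d+):(\w+)\] unexpected disconnect completion event from \[(\d+):(\w+)\]"
)


def peers(log_text):
    """Count how often each (rank, node) is named as the disconnecting peer."""
    return Counter(
        (int(m.group(3)), m.group(4)) for m in EVENT.finditer(log_text)
    )


# Four lines copied from the cesm log above.
SAMPLE = """\
[48:node2] unexpected disconnect completion event from [77:node7]
[56:node1] unexpected disconnect completion event from [77:node7]
[50:node2] unexpected disconnect completion event from [56:node1]
[58:node1] unexpected disconnect completion event from [48:node2]
"""

if __name__ == "__main__":
    for (rank, node), count in peers(SAMPLE).most_common():
        print(f"rank {rank} on {node}: named {count} time(s)")
```

On the sample, rank 77 on node7 is named most often, so the per-rank output of that process (if available) would be a natural place to look first.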

 

santos

Your env_mach_pes.xml is somewhat strange to me, since GLC and WAV are not on in most runs (what compset are you using?). However, that's probably not related to the error.

The error looks very much like a system message to me. My best guess would be that either you ran out of wall time (which would be strange, since it should have gotten farther as long as you gave the run at least a few minutes), or there was an error on your system. You might want to just try again, or to check the atm.log and see where the run stops in the atm initialization.

Sean Patrick Santos

CESM Software Engineering Group
