Failed to run cam4 with ifort 11.0 and netcdf 4

Hi, all,

Recently, I was trying to run CAM4 with ifort 11.0 and netCDF 4. It failed during the build. Here is the part of the Make.out file reporting the error. I hope someone on this forum can help me diagnose this issue. It seems to complain about netCDF, but I believe the versions of ifort and netCDF are recent enough.

"
/u/home/rum06003/ccsm4_0/models/atm/cam/src/control/error_messages.F90(92): error #7013: This module file was not generated by any release of this compiler. [NETCDF]
use netcdf
---------^
/u/home/rum06003/ccsm4_0/models/atm/cam/src/control/error_messages.F90(102): error #6404: This name does not have a type, and must have an explicit type. [NF90_NOERR]
if ( ret .ne. NF90_NOERR ) then
--------------------^
/u/home/rum06003/ccsm4_0/models/atm/cam/src/control/error_messages.F90(108): error #6404: This name does not have a type, and must have an explicit type. [NF90_STRERROR]
write(iulog,*) nf90_strerror( ret )
------------------------^
compilation aborted for /u/home/rum06003/ccsm4_0/models/atm/cam/src/control/error_messages.F90 (code 1)
gmake: *** [error_messages.o] Error 1
gmake: *** Waiting for unfinished jobs....
"

Thanks,

Rui
 
Hi, all,

The above problem is fixed; it was related to the netCDF version.
Now the error seems related to MPI. Here is the last portion of the message:

"ifort -o /data/scratch/rum06003/CTL/bld/cam BalanceCheckMod.o BareGroundFluxesMod.o BiogeophysRestMod.o Biogeophysics1Mod.o Biogeophysics2Mod.o BiogeophysicsLakeMod.o C13SummaryMod.o CASAMod.o CASAPhenologyMod.o CASAiniTimeVarMod.o CICE.o CICE_InitMod.o CICE_RunMod.o CNAllocationMod.o CNAnnualUpdateMod.o CNBalanceCheckMod.o CNC13FluxMod.o CNC13StateUpdate1Mod.o CNC13StateUpdate2Mod.o CNC13StateUpdate3Mod.o CNCStateUpdate1Mod.o CNCStateUpdate2Mod.o CNCStateUpdate3Mod.o CNDVEcosystemDynIniMod.o CNDVEstablishmentMod.o CNDVLightMod.o CNDVMod.o CNDecompMod.o CNEcosystemDynMod.o CNFireMod.o CNGRespMod.o CNGapMortalityMod.o CNMRespMod.o CNNDynamicsMod.o CNNStateUpdate1Mod.o CNNStateUpdate2Mod.o CNNStateUpdate3Mod.o CNPhenologyMod.o CNPrecisionControlMod.o CNSetValueMod.o CNSummaryMod.o CNVegStructUpdateMod.o CNWoodProductsMod.o CNiniSpecial.o CNiniTimeVar.o CNrestMod.o CanopyFluxesMod.o DUSTMod.o DryDepVelocity.o ESMF_AlarmClockMod.o ESMF_AlarmMod.o ESMF_BaseMod.o ESMF_BaseTimeMod.o ESMF_CalendarMod.o ESMF_ClockMod.o ESMF_FractionMod.o ESMF_Mod.o ESMF_Stubs.o ESMF_TimeIntervalMod.o ESMF_TimeMod.o FVperf_module.o FracWetMod.o FrictionVelocityMod.o GPTLget_memusage.o GPTLprint_memusage.o GPTLutil.o Hydrology1Mod.o Hydrology2Mod.o HydrologyLakeMod.o Meat.o QSatMod.o RtmMod.o RunoffMod.o SNICARMod.o STATICEcosysDynMod.o SnowHydrologyMod.o SoilHydrologyMod.o SoilTemperatureMod.o SurfaceAlbedoMod.o SurfaceRadiationMod.o TridiagonalMod.o UrbanInitMod.o UrbanInputMod.o UrbanMod.o VOCEmissionMod.o abortutils.o accFldsMod.o accumulMod.o advect_tend.o advnce.o aer_rad_props.o aerdepMod.o aerodep_flx.o aerosol_intr.o alloc_mod.o aoa_tracers.o areaMod.o atm_comp_mct.o benergy.o binary_io.o bnddyi.o boundarydata.o box_rearrange.o buffer.o cam3_aero_data.o cam3_ozone_data.o cam_comp.o cam_control_mod.o cam_diagnostics.o cam_history.o cam_logfile.o cam_pio_utils.o cam_restart.o camsrfexch_types.o ccsm_driver.o cd_core.o check_energy.o chem_surfvals.o chemistry.o cldconst.o cldinti.o cldsav.o cldwat.o clm_atmlnd.o clm_comp.o clm_driver.o clm_driverInitMod.o clm_initializeMod.o clm_mct_mod.o clm_time_manager.o clm_varcon.o clm_varctl.o clm_varorb.o clm_varpar.o clm_varsur.o clmtype.o clmtypeInitMod.o cloud_diagnostics.o cloud_fraction.o cloud_rad_props.o cloudsimulator.o cmparray_mod.o co2_cycle.o co2_data_flux.o comhd.o commap.o comspe.o comsrf.o constituent_burden.o constituents.o controlMod.o convect_deep.o convect_shallow.o cpslec.o ctem.o d2a3dijk.o d2a3dikj.o dadadj.o datetime.o debugutilitiesmodule.o decompInitMod.o decompMod.o decompinit.o decompmodule.o diag_dynvar_ic.o diag_module.o diffusion_solver.o do_close_dispose.o domainMod.o dp_coupling.o dryairm.o drydep_mod.o dust_intr.o dust_sediment_mod.o dycore.o dyn_comp.o dyn_grid.o dyn_internal_state.o dynamics_vars.o dynconst.o dynlandMod.o epvd.o error_function.o error_messages.o esinti.o f_wrappers.o fft99.o filenames.o fileutils.o fill_module.o filterMod.o flux_avg.o fv_control_mod.o fv_prints.o gauaw_mod.o geopk.o geopotential.o get_memusage.o get_zeits.o getdatetime.o gffgch.o ghg_data.o ghostmodule.o glc_comp_mct.o gptl.o gptl_papi.o gw_drag.o hb_diff.o histFileMod.o histFldsMod.o history_defaults.o history_scam.o hk_conv.o hycoef.o icarus_scops.o ice_FY.o ice_aerosol.o ice_age.o ice_atmo.o ice_blocks.o ice_boundary.o ice_broadcast.o ice_calendar.o ice_communicate.o ice_comp_mct.o ice_constants.o ice_diagnostics.o ice_distribution.o ice_domain.o ice_domain_size.o ice_dyn_evp.o ice_exit.o ice_fileunits.o ice_flux.o ice_forcing.o ice_gather_scatter.o 
ice_global_reductions.o ice_grid.o ice_history.o ice_history_fields.o ice_history_write.o ice_init.o ice_itd.o ice_kinds_mod.o ice_lvl.o ice_mechred.o ice_meltpond.o ice_ocean.o ice_orbital.o ice_pio.o ice_prescaero_mod.o ice_prescribed_mod.o ice_probability.o ice_probability_tools.o ice_read_write.o ice_restart.o ice_restoring.o ice_scam.o ice_shortwave.o ice_spacecurve.o ice_state.o ice_step_mod.o ice_therm_itd.o ice_therm_vertical.o ice_timers.o ice_transport_driver.o ice_transport_remap.o ice_work.o infnan.o iniTimeConst.o inicFileMod.o inidat.o initGridCellsMod.o initSurfAlbMod.o inital.o initcom.o initindx.o interpolate_data.o intp_util.o ioFileMod.o io_dist.o iobinary.o iompi_mod.o ionf_mod.o iop_surf.o lnd_comp_mct.o m_Accumulator.o m_AccumulatorComms.o m_AttrVect.o m_AttrVectComms.o m_AttrVectReduce.o m_ConvertMaps.o m_ExchangeMaps.o m_FcComms.o m_FileResolv.o m_Filename.o m_GeneralGrid.o m_GeneralGridComms.o m_GlobalMap.o m_GlobalSegMap.o m_GlobalSegMapComms.o m_GlobalToLocal.o m_IndexBin_char.o m_IndexBin_integer.o m_IndexBin_logical.o m_List.o m_MCTWorld.o m_MatAttrVectMul.o m_Merge.o m_MergeSorts.o m_Navigator.o m_Permuter.o m_Rearranger.o m_Router.o m_SortingTools.o m_SparseMatrix.o m_SparseMatrixComms.o m_SparseMatrixDecomp.o m_SparseMatrixPlus.o m_SparseMatrixToMaps.o m_SpatialIntegral.o m_SpatialIntegralV.o m_StrTemplate.o m_String.o m_TraceBack.o m_Transfer.o m_chars.o m_die.o m_dropdead.o m_flow.o m_inpak90.o m_ioutil.o m_mall.o m_mpif.o m_mpif90.o m_mpout.o m_rankMerge.o m_realkinds.o m_stdio.o m_zeit.o map_atmatm_mct.o map_atmice_mct.o map_atmlnd_mct.o map_atmocn_mct.o map_glcglc_mct.o map_iceice_mct.o map_iceocn_mct.o map_lndlnd_mct.o map_ocnocn_mct.o map_rofocn_mct.o map_rofrof_mct.o map_snoglc_mct.o map_snosno_mct.o mapz_module.o marsaglia.o mct_mod.o mct_rearrange.o mean_module.o memstuff.o metdata.o mkarbinitMod.o mo_constants.o mo_msis_ubc.o mo_regrider.o mo_solar_parms.o mo_util.o mod_comm.o molec_diff.o mp_assign_to_cpu.o mpishorthand.o mrg_x2a_mct.o mrg_x2g_mct.o mrg_x2i_mct.o mrg_x2l_mct.o mrg_x2o_mct.o mrg_x2s_mct.o msise00.o namelist_utils.o nanMod.o ncdio.o ncdio_atm.o ndepFileMod.o nf_mod.o ocn_comp.o ocn_comp_mct.o ocn_filenames.o ocn_spmd.o ocn_time_manager.o ocn_types.o ocnice_aero.o organicFileMod.o p_d_adjust.o par_vecsum.o par_xsum.o param_cldoptics.o parutilitiesmodule.o perf_mod.o perf_utils.o pfixer.o pft2colMod.o pft_module.o pftdynMod.o pftvarcon.o phys_buffer.o phys_control.o phys_debug.o phys_debug_util.o phys_gmean.o phys_grid.o phys_prop.o physconst.o physics_types.o physpkg.o pio.o pio_kinds.o pio_mpi_utils.o pio_nf_utils.o pio_quicksort.o pio_spmd_utils.o pio_support.o pio_types.o pio_utils.o piodarray.o piolib_mod.o pionfatt_mod.o pionfget_mod.o pionfput_mod.o pionfread_mod.o pionfwrite_mod.o pkez.o pkg_cld_sediment.o pkg_cldoptics.o pmgrid.o pnetcdfversion.o polar_avg.o ppgrid.o prescribed_aero.o prescribed_ghg.o prescribed_ozone.o prescribed_volcaero.o print_memusage.o progseasalts_intr.o pspect.o puminterfaces.o qneg3.o qneg4.o quicksort.o rad_constituents.o rad_solar_var.o radae.o radconstants.o radheat.o radiation.o radiation_data.o radlw.o radsw.o rayleigh_friction.o readinitial.o rearrange.o redistributemodule.o repro_sum_mod.o repro_sum_x86.o restFileMod.o restart_dynamics.o restart_physics.o rgrid.o runtime_opts.o scamMod.o scam_setlatlonidx.o scyc.o seq_avdata_mod.o seq_cdata_mod.o seq_comm_mct.o seq_diag_mct.o seq_domain_mct.o seq_drydep_mod.o seq_flds_indices.o seq_flds_mod.o seq_flux_mct.o seq_frac_mct.o seq_hist_mod.o 
seq_infodata_mod.o seq_io_mod.o seq_rearr_mod.o seq_rest_mod.o seq_timemgr_mod.o sgexx.o shr_alarm_mod.o shr_cal_mod.o shr_const_mod.o shr_date_mod.o shr_dmodel_mod.o shr_file_mod.o shr_flux_mod.o shr_infnan_mod.o shr_inputinfo_mod.o shr_isnan.o shr_jlcp.o shr_kind_mod.o shr_log_mod.o shr_map_mod.o shr_mct_mod.o shr_mem_mod.o shr_mpi_mod.o shr_msg_mod.o shr_ncio_mod.o shr_ncread_mod.o shr_orb_mod.o shr_pcdf_mod.o shr_scam_mod.o shr_strdata_mod.o shr_stream_mod.o shr_string_mod.o shr_sys_mod.o shr_tInterp_mod.o shr_timer_mod.o shr_vmath_fwrap.o shr_vmath_mod.o snowdp2lev.o solar_data.o spmdGathScatMod.o spmdMod.o spmd_dyn.o spmd_phys.o spmd_utils.o srchutil.o srfxfer.o sslt_rebin.o sst_data.o startup_initialconds.o stepon.o stratiform.o string_utils.o subgridAveMod.o subgridMod.o subgridRestMod.o sulchem.o surfrdMod.o sw_core.o system_messages.o te_map.o threadutil.o tidal_diag.o time_manager.o time_utils.o topology.o tp_core.o tphysac.o tphysbc.o tphysidl.o trac2d.o tracer_data.o tracers.o tracers_suite.o trb_mtn_stress.o tropopause.o trunc.o tsinti.o units.o upper_bc.o uv3s_update.o vertical_diffusion.o virtem.o vrtmap.o wetdep.o wrap_mpi.o wrap_nf.o wrf_error_fatal.o wrf_message.o wv_saturation.o xpavg_mod.o zenith.o zm_conv.o zm_conv_intr.o -L/opt/i/netcdf/lib -lnetcdf -static-intel -L/opt/i/mpich/lib -lmpich
cam_comp.o(.text+0x232): In function `cam_comp_mp_cam_final_':
: undefined reference to `mpi_wtime_'
cam_comp.o(.text+0x952): In function `cam_comp_mp_cam_run4_':
: undefined reference to `mpi_wtime_'
cam_comp.o(.text+0x27f2): In function `cam_comp_mp_cam_run1_':
: undefined reference to `mpi_wtime_'
cam_comp.o(.text+0x3c42): In function `cam_comp_mp_cam_init_':
: undefined reference to `mpi_wtime_'
ccsm_driver.o(.text+0xc902): In function `MAIN__':
: undefined reference to `mpi_wtime_'
ccsm_driver.o(.text+0xc912): more undefined references to `mpi_wtime_' follow
/opt/i/mpich/lib/libmpich.a(comm_split.o)(.text+0x1f2): In function `MPI_Comm_split':
: undefined reference to `PMPI_Allreduce'
/opt/i/mpich/lib/libmpich.a(context_util.o)(.text+0x72): In function `MPIR_Context_alloc':
: undefined reference to `PMPI_Allreduce'
/opt/i/mpich/lib/libmpich.a(context_util.o)(.text+0xd2): In function `MPIR_Context_alloc':
: undefined reference to `PMPI_Allreduce'
/opt/i/mpich/lib/libmpich.a(context_util.o)(.text+0x132): In function `MPIR_Context_alloc':
: undefined reference to `PMPI_Bcast'
/opt/i/mpich/lib/libmpich.a(context_util.o)(.text+0x1b2): In function `MPIR_Context_alloc':
: undefined reference to `PMPI_Sendrecv'
gmake: *** [/data/scratch/rum06003/CTL/bld/cam] Error 1
"

Thank you very much in advance for any clues.

Rui
 

eaton

CSEG and Liaisons
Your link line contains:

-L/opt/i/mpich/lib -lmpich

The error messages indicate that externals such as mpi_wtime_ and PMPI_Allreduce are not being found. I would look for these externals in other libraries in /opt/i/mpich/lib using the unix utility nm. If you find the externals in other libraries then add those libraries to the link line by using the -ldflags argument to configure. If you don't find the missing externals then the problem is in the mpi installation. Hopefully you can get a system administrator to help in that case.
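
For example, something along these lines should work (just a sketch; adjust it to the library names actually present in that directory, and use "nm -D" for shared libraries):

% nm -A /opt/i/mpich/lib/*.a | grep ' T mpi_wtime_'
% nm -A /opt/i/mpich/lib/*.a | grep ' T PMPI_Allreduce'

A "T" entry means that library defines the symbol, so it is a candidate to add to the link line with -ldflags.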
 
Hi, Brian,

First, I want to thank you for your correction and detailed clarification on the run with prescribed, varying SST forcing. That really helps me understand things much better.

Now I am still trying to figure out the MPI issue in this thread. I installed the latest MPICH2-1.2.1p1 (I also tried some older versions, back to 1.0.6) as a local user (no administrative access), and there were no error messages during the installation. The only thing I am concerned about is that during installation I used "./configure --prefix=/u/home/rum06003/mpi2 --with-device=ch3:sock 2>&1 | tee c.txt" instead of "./configure --prefix=/u/home/rum06003/mpi2 | tee c.txt" because of the different architecture of the SGI machine. I have attached the MAKE.out for my CCSM run below; I do not know what it is complaining about or what missing libraries I should look for. What is "pthread_getspecific"? Sorry that I have so many questions.

Another question: what references do you recommend if I want to learn more and systematically improve my skills in diagnosing MPI, netCDF, or CCSM issues related to running the model? I feel that when I am diagnosing these issues I am using a trial-and-error method, and I have only a weak understanding of the essence of the problem. Thank you for any advice.

Rui

"intr.o -L/opt/i/netcdf/lib -lnetcdf -static-intel -L/u/home/rum06003/mpi2/lib -lmpich
/u/home/rum06003/mpi2/lib/libmpich.a(commutil.o)(.text+0x13a2): In function `MPIR_Get_contextid':
: undefined reference to `pthread_getspecific'
/u/home/rum06003/mpi2/lib/libmpich.a(commutil.o)(.text+0x13f2): In function `MPIR_Get_contextid':
: undefined reference to `pthread_setspecific'
/u/home/rum06003/mpi2/lib/libmpich.a(commutil.o)(.text+0x2ad2): In function `MPIR_Get_intercomm_contextid':
: undefined reference to `pthread_getspecific'
/u/home/rum06003/mpi2/lib/libmpich.a(commutil.o)(.text+0x2b22): In function `MPIR_Get_intercomm_contextid':
etc.......
"
 
Hi, Brian,

I also posted this on the MPICH list; one piece of their advice is quoted below.
But I do not know where or how I should make the change. The only thing I know is that we need to set LIB_MPI and INC_MPI in the jobscript.

Thanks,

Rui

On Fri, Jul 16, 2010 at 9:29 AM, Anthony Chan wrote:


You should use MPI compiler wrapper (mpicc, mpif90...) to compile/link
your app. If not, you need to link with all MPICH2 supported libraries
by hand, e.g. -lpthread. Add -show to the MPI wrapper to see what the
supported libraries are.

A.Chan
 

eaton

CSEG and Liaisons
The advice from the MPICH list contains 2 suggestions:

1) use mpif90 to compile/link. Unfortunately the CAM Makefile isn't set up to allow you to use mpif90 as the compiler (this is on my short list of things to fix). But it is designed to allow mpif90 as the linker, and that's probably enough. To try this add "-linker mpif90" to your configure command. This will use mpif90 in the final link command instead of using ifort. The mpif90 wrapper knows all the libraries that need to be linked for an mpi build. You need to make sure that mpif90 is in your path before you issue the gmake command. From your output it appears the path is /u/home/rum06003/mpi2/bin.
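
For example, the configure command might look roughly like this (assuming -linker accepts a full path; otherwise put /u/home/rum06003/mpi2/bin in your PATH first and just use "-linker mpif90"):

% configure [your other arguments...] -linker /u/home/rum06003/mpi2/bin/mpif90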

2) Use "mpif90 -show" to determine the mpi libraries and then add them to the link command. This approach should work as well. In your case issue the command:

% /u/home/rum06003/mpi2/bin/mpif90 -show

The output from that command might contain something like this:

ifort -L/u/home/rum06003/mpi2/lib -lmpichf90 -lmpich -lpthread -lrt

The CAM Makefile automatically adds "-L/u/home/rum06003/mpi2/lib -lmpich" to the link command. So you would need to add "-lmpichf90 -lpthread -lrt". This can be done by modifying your configure argument as follows:

% configure [your other arguments...] -ldflags "-lmpichf90 -lpthread -lrt"

Regarding improving your skills for diagnosing problems: my only recommendation is to keep doing what you're doing. Asking questions on the appropriate mailing lists and bulletin boards, and searching for info on the web, are great resources.
 
Hi, Brian,

Thanks for your persistent help and advice; I really appreciate it. I tried both approaches. The first one does not work; item (2) below is the error message from MAKE.out. However, I am excited that the second one works, and now we have gotten past the build process. But the model then crashes while running; item (1) below is the cam log file. It complains about drv_in, and I then noticed that there are no atm_in, lnd_in, drv_in, etc. files in the folder. The jobscript is the same one I use on our local machines (with pgf90 on Ubuntu or Red Hat systems), and I did not change anything in build-namelist either. Item (3) below is part of the jobscript (sorry, I did not shorten the namelist part as you suggested in another thread).

Thanks,

Rui



(1) cam log for the second approach:

(t_initf) Read in prof_inparm namelist from: drv_in
forrtl: No such file or directory
forrtl: severe (29): file not found, unit 10, file /data/scratch/rum06003/CTL/drv_in
Image PC Routine Line Source
cam 4000000001706580 Unknown Unknown Unknown
cam 4000000001703E50 Unknown Unknown Unknown
cam 40000000016331F0 Unknown Unknown Unknown
cam 400000000154AA00 Unknown Unknown Unknown
cam 40000000015498D0 Unknown Unknown Unknown
cam 400000000156E680 Unknown Unknown Unknown
cam 4000000000F78890 Unknown Unknown Unknown
cam 400000000034E000 Unknown Unknown Unknown
cam 4000000000005610 Unknown Unknown Unknown
libc.so.6.1 2000000000189C50 Unknown Unknown Unknown
cam 4000000000005400 Unknown Unknown Unknown
rank 0 in job 9 linuxAltix_5465 caused collective abort of all ranks
exit status of rank 0: return code 29


(2) MAKE.out for the first approach:

"xpavg_mod.o zenith.o zm_conv.o zm_conv_intr.o -L/opt/i/netcdf/lib -lnetcdf -static-intel -L/u/home/rum06003/mpi/lib -lmpich
ifort: Command line warning: ignoring option '-static'; no argument required
IPO Error: unresolved : for_array_copy_in
Referenced in UrbanMod.o
Referenced in aerosol_intr.o
Referenced in convect_shallow.o
Referenced in dp_coupling.o
Referenced in dyn_comp.o
Referenced in parutilitiesmodule.o
Referenced in pionfget_mod.o
Referenced in radiation.o
Referenced in radlw.o
Referenced in stepon.o
Referenced in stratiform.o
Referenced in sw_core.o
Referenced in trac2d.o
Referenced in zm_conv_intr.o
IPO Error: unresolved : for_array_copy_out
Referenced in UrbanMod.o
Referenced in aerosol_intr.o
Referenced in convect_shallow.o
Referenced in dp_coupling.o
Referenced in dyn_comp.o
Referenced in parutilitiesmodule.o
Referenced in pionfget_mod.o
Referenced in radiation.o
Referenced in radlw.o
Referenced in stepon.o
Referenced in stratiform.o
Referenced in sw_core.o
Referenced in trac2d.o
Referenced in zm_conv_intr.o
UrbanMod.o(.text+0x21ac2): In function `urbanmod_mp_urbanradiation_':
: undefined reference to `for_array_copy_in'
UrbanMod.o(.text+0x21af2): In function `urbanmod_mp_urbanradiation_':
: undefined reference to `for_array_copy_in'
UrbanMod.o(.text+0x21b22): In function `urbanmod_mp_urbanradiation_':
: undefined reference to `for_array_copy_in'
UrbanMod.o(.text+0x21c92): In function `urbanmod_mp_urbanradiation_':
: undefined reference to `for_array_copy_out'
UrbanMod.o(.text+0x21cb2): In function `urbanmod_mp_urbanradiation_':
: undefined reference to `for_array_copy_out'
UrbanMod.o(.text+0x21cd2): In function `urbanmod_mp_urbanradiation_':
: undefined reference to `for_array_copy_out'
UrbanMod.o(.text+0x35262): In function `urbanmod_mp_urbanalbedo_':
: undefined reference to `for_array_copy_in'
UrbanMod.o(.text+0x35282): In function `urbanmod_mp_urbanalbedo_':
: undefined reference to `for_array_copy_in'
UrbanMod.o(.text+0x352a2): In function `urbanmod_mp_urbanalbedo_':
: undefined reference to `for_array_copy_in'
UrbanMod.o(.text+0x352c2): In function `urbanmod_mp_urbanalbedo_':
: undefined reference to `for_array_copy_in'
UrbanMod.o(.text+0x354e2): In function `urbanmod_mp_urbanalbedo_':
: undefined reference to `for_array_copy_out'
UrbanMod.o(.text+0x35502): In function `urbanmod_mp_urbanalbedo_':
etc.....


(3) part of the jobscript

cd $blddir
cat >! namelistfile
 

eaton

CSEG and Liaisons
It looks from your jobscript that the build-namelist output files should be in your $blddir. Check for them there.

If you execute build-namelist in the $blddir, then you need to copy all the output files to your $rundir, which means that your jobscript needs to know the names of all the files. For this reason I find it easier and more robust to run build-namelist in the $rundir so that it produces the namelist files in the location where they are used and no copying is needed. To do that modify your jobscript as follows:

# build the executable
...
# build the namelist
cd $rundir
cat >! namelistfile
 
Hi, Brian,

Thanks for staying with me on this issue. I tried your method and it fixed that problem. However, it gave me another error message; here is the last part of the log file. For this one I have no clues and do not even know where to start. Sorry for asking so many questions.

Thank you very much.

Rui

History File 1 write frequency MONTHLY
History File 7 write frequency YEARLY (INITIAL CONDITIONS)

Filename specifier for history file 1 = %c.cam2.h%t.%y-%m.nc
Filename specifier for history file 7 = %c.cam2.i.%y-%m-%d-%s.nc
Accumulation precision history file 1 = 8
Packing density history file 1 = 2
Number of time samples per file (MFILT) for history file 1 is
1
Accumulation precision history file 7 = 8
Packing density history file 7 = 1
Number of time samples per file (MFILT) for history file 7 is
1
Assertion failed in file helper_fns.c at line 337: 0
memcpy argument memory ranges overlap, dst_=0x600000000570def0 src_=0x600000000570def0 len_=4

internal ABORT - process 0
Assertion failed in file helper_fns.c at line 337: 0
memcpy argument memory ranges overlap, dst_=0x6000000005712088 src_=0x6000000005712088 len_=4

internal ABORT - process 2
Assertion failed in file helper_fns.c at line 337: 0
memcpy argument memory ranges overlap, dst_=0x6000000005770fa4 src_=0x6000000005770fa4 len_=4

internal ABORT - process 1
Assertion failed in file helper_fns.c at line 337: 0
memcpy argument memory ranges overlap, dst_=0x600000000571209c src_=0x600000000571209c len_=4

internal ABORT - process 3
Assertion failed in file helper_fns.c at line 337: 0
memcpy argument memory ranges overlap, dst_=0x60000000057427e4 src_=0x60000000057427e4 len_=4

internal ABORT - process 5
Assertion failed in file helper_fns.c at line 337: 0
memcpy argument memory ranges overlap, dst_=0x6000000005714180 src_=0x6000000005714180 len_=4

internal ABORT - process 4
Assertion failed in file helper_fns.c at line 337: 0
memcpy argument memory ranges overlap, dst_=0x600000000570f07c src_=0x600000000570f07c len_=4

internal ABORT - process 7
Assertion failed in file helper_fns.c at line 337: 0
memcpy argument memory ranges overlap, dst_=0x60000000057120a8 src_=0x60000000057120a8 len_=4

internal ABORT - process 6
rank 7 in job 1 linuxAltix_6629 caused collective abort of all ranks
exit status of rank 7: return code 1
rank 4 in job 1 linuxAltix_6629 caused collective abort of all ranks
exit status of rank 4: return code 1
rank 3 in job 1 linuxAltix_6629 caused collective abort of all ranks
exit status of rank 3: return code 1
rank 2 in job 1 linuxAltix_6629 caused collective abort of all ranks
exit status of rank 2: killed by signal 9
(seq_mct_drv) : Initialize lnd component
 

eaton

CSEG and Liaisons
This appears to be a system problem. To verify that, the first thing I'd do is check that the code runs in serial mode, i.e., execute configure with the arguments "-nospmd -nosmp" and remove the -ntasks and/or -nthreads arguments. Then execute cam without the mpirun job launcher. If you can successfully run CAM in serial mode, then you know the problem is related to mpi and/or threading. Your configure command from a previous post indicates you are trying to run in pure mpi mode (no threading). In that case, if the serial run is successful, the next step is to go back to the mpi executable and try running with 1 mpi task assigned to the job. The results from this run should be identical to the results from the serial run. If this is successful, then try another run with 2 mpi tasks. These results should also be identical to the previous two runs. Continuing in this fashion you should be able to determine where things are going bad.
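
For example, the serial test might look roughly like this (a sketch only; keep the rest of your usual configure arguments and drop -ntasks/-nthreads):

% configure -nospmd -nosmp [your other arguments...]
% gmake
% cd $rundir
% $blddir/cam

Then rebuild the mpi version and launch it with "mpirun -np 1", then "mpirun -np 2", comparing the results against the serial run each time.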
 
Hi, Brian,

I first tried a run without MPI and it was successful. Then I tested a run with 1 MPI task (configure -ntasks 1 ... mpirun -np 1 ...), and it gave me the same error as before when I was using 8 MPI tasks. So the problem is definitely caused by MPI. However, I tested CAM3.0 with this version of MPI and it was completely successful.

Thank you for any help.

Rui
 
Here is the error message for the run with MPI set to 1 task:

History File 1 write frequency MONTHLY
History File 7 write frequency YEARLY (INITIAL CONDITIONS)

Filename specifier for history file 1 = %c.cam2.h%t.%y-%m.nc
Filename specifier for history file 7 = %c.cam2.i.%y-%m-%d-%s.nc
Accumulation precision history file 1 = 8
Packing density history file 1 = 2
Number of time samples per file (MFILT) for history file 1 is
1
Accumulation precision history file 7 = 8
Packing density history file 7 = 1
Number of time samples per file (MFILT) for history file 7 is
1
(seq_mct_drv) : Initialize lnd component
Assertion failed in file helper_fns.c at line 337: 0
memcpy argument memory ranges overlap, dst_=0x600000000a772ed0 src_=0x600000000a772ed0 len_=4

internal ABORT - process 0
rank 0 in job 3 linuxAltix_8541 caused collective abort of all ranks
exit status of rank 0: killed by signal 9
 
Hi, Brian,

Since the log file says "Assertion failed in file helper_fns.c", I tried to locate this file, and I found the file name inside libmpich.a in the MPI lib folder. That is the only clue I have found, but I do not know what is wrong with it.

My intuition tells me the reason is that this is an SGI machine and the MPI installation is different from the installations on our local Linux machines (as I mentioned, the configure was different, with the option --with-device=ch3:sock).

I hope to get some advice from you.

Thank you very much.

Rui


 
Hi, Brian,

I just got some clues from the MPICH developers' group. This is his reply to my question:

"What version of mpich2 are you using?

In some older versions of mpich2, you could get this sort of message by passing the same buffer as the send and recv buffers for many collective operations. Very old versions often would not complain at all.

In newer versions of mpich2 you will almost always get a much more helpful error message instead.

In all cases, the fix is typically to either use the MPI_IN_PLACE option appropriately for the given collective, or to use separate buffers for sending and receiving."

However, I do not understand his fix. I hope this can help you help me with this issue. Thank you very much.

Rui
 

eaton

CSEG and Liaisons
I have just seen a bug report for the CLM component which addresses this exact issue. Here is the fix that was posted (it's in models/lnd/clm/src/main/spmdMod.F90):


Code:
[yong:lnd/clm/src] erik% svn diff
Index: main/spmdMod.F90
===================================================================
--- main/spmdMod.F90    (revision 24351)
+++ main/spmdMod.F90    (working copy)
@@ -80,10 +80,12 @@
 !EOP
     integer :: i,j         ! indices
     integer :: ier         ! return error status
+    integer :: mylength    ! my processor length
     logical :: mpi_running ! temporary
     integer, allocatable :: length(:)
     integer, allocatable :: displ(:)
     character*(MPI_MAX_PROCESSOR_NAME), allocatable :: procname(:)
+    character*(MPI_MAX_PROCESSOR_NAME)              :: myprocname
 !-----------------------------------------------------------------------

     ! Initialize mpi communicator group
@@ -109,12 +111,12 @@

     allocate (length(0:npes-1), displ(0:npes-1), procname(0:npes-1))

-    call mpi_get_processor_name (procname(iam), length(iam), ier)
-    call mpi_allgather (length(iam),1,MPI_INTEGER,length,1,MPI_INTEGER,mpicom,ier)
+    call mpi_get_processor_name (myprocname, mylength, ier)
+    call mpi_allgather (mylength,1,MPI_INTEGER,length,1,MPI_INTEGER,mpicom,ier)
     do i = 0,npes-1
        displ(i)=i*MPI_MAX_PROCESSOR_NAME
     end do
-    call mpi_gatherv (procname(iam),length(iam),MPI_CHARACTER, &
+    call mpi_gatherv (myprocname,mylength,MPI_CHARACTER, &
                       procname,length,displ,MPI_CHARACTER,0,mpicom,ier)
     if (masterproc) then
        write(iulog,100)npes
The problem was using the same buffer as the send and recv buffer in a collective operation. This is pretty old code, so apparently most mpi implementations let you get away with this.
 
Hi, Brian,

Good news: that was a nice fix. However, I got a new error message, shown below. It seems the error is caused by the thermo data read from the SST file, but the same run works fine on our local Linux machines. I also tried the other f19 SST file and it gave me the same error.

I really want to crack this and make the runs on our SGI machine successful. With the SGI resources I can speed up all my planned simulations. I hope to get your help fixing it. Thank you very much.

Rui

"(shr_dmodel_readLBUB) reading file: /u/home/rum06003/ccsm4_1/inputdata/atm/cam/sst/sst_HadOIBl_bc_1.9x2.5_clim_c040810.nc 12
(shr_dmodel_readLBUB) reading file: /u/home/rum06003/ccsm4_1/inputdata/atm/cam/sst/sst_HadOIBl_bc_1.9x2.5_clim_c040810.nc 1

Starting thermo, T > Tmax, layer 1
Tin= -3.504219267679251E-002 , Tmax= -3.505690024561692E-002
istep1, my_task, i, j: 1 5 2 10
qin -277027009.030427
istep1, my_task, iblk = 1 5 1
Global block: 6
Global i and j: 91 9
Lat, Lon: -74.8421052631579 225.000000000000
(shr_sys_abort) ERROR: ice: Vertical thermo error

Starting thermo, T > Tmax, layer 1
Tin= -3.504219267679251E-002 , Tmax= -3.505690024561692E-002
istep1, my_task, i, j: 1 2 2 14
qin -277027009.030427
istep1, my_task, iblk = 1 2 1
Global block: 3
Global i and j: 37 13
Lat, Lon: -67.2631578947369 90.0000000000000
(shr_sys_abort) ERROR: ice: Vertical thermo error
(shr_sys_abort) WARNING: calling shr_mpi_abort() and stopping
(shr_sys_abort) WARNING: calling shr_mpi_abort() and stopping

Starting thermo, T > Tmax, layer 1
Tin= -3.504219267679251E-002 , Tmax= -3.505690024561692E-002
istep1, my_task, i, j: 1 0 2 13
qin -277027009.030427
istep1, my_task, iblk = 1 0 1
Global block: 1
Global i and j: 1 12
Lat, Lon: -69.1578947368421 0.000000000000000E+000
(shr_sys_abort) ERROR: ice: Vertical thermo error
(shr_sys_abort) WARNING: calling shr_mpi_abort() and stopping
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
cam 40000000013EFEF1 Unknown Unknown Unknown
cam 4000000001081A70 Unknown Unknown Unknown
cam 40000000010F1F80 Unknown Unknown Unknown
cam 40000000007E08B0 Unknown Unknown Unknown
cam 4000000000071350 Unknown Unknown Unknown
cam 400000000006CCA0 Unknown Unknown Unknown
cam 400000000078E800 Unknown Unknown Unknown
cam 400000000033C9A0 Unknown Unknown Unknown
cam 4000000000005250 Unknown Unknown Unknown
libc.so.6.1 2000000000189C50 Unknown Unknown Unknown
cam 4000000000005040 Unknown Unknown Unknown"
 
I tried to track down this error and found it is related to ice_therm_vertical.F90 under models/ice/cice/src/source. When it checks Tin (the internal ice layer temperature), the value is out of bounds. This is very strange. These files are fine in the runs on all the other machines, and they were downloaded from NCAR, so they should be fine.

Rui
 

eaton

CSEG and Liaisons
You mentioned earlier in the exchange that you were able to run successfully in a serial mode. That would indicate that there is still an mpi related problem. Since your mpi2 library exposed a problem in the CCSM use of mpi, it makes me wonder whether there are more problems that we haven't found yet. The most powerful test of mpi we have is that the answers are independent of the number of mpi tasks used. And you should get identical answers between mpi runs and a serial run (this is only true for standalone CAM runs; in the fully coupled system the active ocean model, POP, does not have this property).

If this is a problem somewhere in the ccsm code, one way to isolate it is to try running different CAM configurations and see where the problem first occurs. For example, run CAM in adiabatic mode (configure -phys adiabatic). This will eliminate the surface components and basically just run the dynamical core. You can check that serial and mpi runs give identical (bit for bit) results. If that works then start adding complexity. An aqua_planet run adds the physics back in without the land or sea ice models.

Another possible approach would be to use the default CAM4 configuration you started with, but just back out the CICE component which is where the current problem seems to be occurring. The old CSIM4 thermodynamic only sea ice component that was used in CAM3 is still available as an option. Just configure with -ice csim4.
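
For example (sketches only; keep the rest of your usual configure arguments in each case):

% configure -phys adiabatic [your other arguments...]
% configure -ice csim4 [your other arguments...]

In each configuration, check that serial and mpi runs give bit-for-bit identical results before adding the next piece back in.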
 
Hi, Brian,

I re-downloaded the CCSM4.0 code; the previous one I was using was CCSM4.01, which is actually an older beta version (I originally thought it was newer). That fixed the problem. The only thing I am not satisfied with is that the model now takes longer to run: about three hours to finish the configure process and an hour and a half to produce one monthly file. On our local machines it takes about 10 minutes to configure and build, and 35 minutes to produce one monthly file.

But my previous experience tells me that SGI machines offer better computation speed, so this is really strange. I am not sure whether it is caused by compromises in the MPI installation, done under a local account with the special configure option. I would like to know your thoughts on this.

Thank you very much.


Rui
 