CESM 1.2.2 segmentation fault when running model

Hi all,
We want to run the CESM 1.2.2 model on our cluster. We downloaded it and compiled it with Intel 2013_sp1.1.16, netCDF 4.3.2, netCDF-Fortran 4.2, and Intel MPI 4.1.3.045, with no threading.
We used an example case from the guide:
Code:
./create_newcase -case ~/cesm/EXAMPLE_CASE -compset B_1850_CN -res 0.9x1.25_gx1v6 -mach userdefined

We then ported it to our cluster, and it "successfully" compiles. We then ran it with 40 cores, MPI only, to check that it really works, but it throws the error below:

...
Sw_lamult:Sw_ustokes:Sw_vstokes:Sw_hstokes
seq_flds_mod: seq_flds_w2x_fluxes= 

seq_flds_mod: seq_flds_x2w_states= 
Sa_u:Sa_v:Sa_tbot:Si_ifrac:So_t:So_u:So_v:So_bldepth
seq_flds_mod: seq_flds_x2w_fluxes= 

          40 pes participating in computation
 -----------------------------------
 TASK#  NAME
  0  ithaca01
  1  ithaca01
  2  ithaca01
  3  ithaca01
  4  ithaca01
  5  ithaca01
  6  ithaca01
  7  ithaca01
  8  ithaca02
  9  ithaca02
 10  ithaca02
 11  ithaca02
 12  ithaca02
 13  ithaca02
 14  ithaca02
 15  ithaca02
 16  ithaca03
 17  ithaca03
 18  ithaca03
 19  ithaca03
 20  ithaca03
 21  ithaca03
 22  ithaca03
 23  ithaca03
 24  ithaca04
 25  ithaca04
 26  ithaca04
 27  ithaca04
 28  ithaca04
 29  ithaca04
 30  ithaca04
 31  ithaca04
 32  ithaca05
 33  ithaca05
 34  ithaca05
 35  ithaca05
 36  ithaca05
 37  ithaca05
 38  ithaca05
 39  ithaca05
 Opened existing file b40.1850.track1.1deg.006.cam.i.0863-01-01-00000.nc
       65536
 Opened existing file 
 /share/data/udic/cesm/inputdata/atm/cam/topo/USGS-gtopo30_0.9x1.25_remap_c05102
 7.nc      131072
 NetCDF: Invalid dimension ID or name
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
cesm.exe           0000000001B20D99  Unknown               Unknown  Unknown
cesm.exe           0000000001B1F710  Unknown               Unknown  Unknown
cesm.exe           0000000001ABA282  Unknown               Unknown  Unknown
cesm.exe           0000000001A46133  Unknown               Unknown  Unknown
cesm.exe           0000000001A4FB4B  Unknown               Unknown  Unknown
libpthread.so.0    00002B82D5C38800  Unknown               Unknown  Unknown
libmpi.so.4        00002B82D6022D1E  Unknown               Unknown  Unknown
libmpi.so.4        00002B82D5FDDD56  Unknown               Unknown  Unknown
libmpi.so.4        00002B82D608638F  Unknown               Unknown  Unknown
libmpi.so.4        00002B82D607BFBE  Unknown               Unknown  Unknown
libmpi.so.4        00002B82D605EE50  Unknown               Unknown  Unknown
libmpi.so.4        00002B82D5F10ADC  Unknown               Unknown  Unknown
libmpi.so.4        00002B82D6124BE4  Unknown               Unknown  Unknown
libmpi.so.4        00002B82D6124983  Unknown               Unknown  Unknown
libmpigf.so.4      00002B82D64DB70A  Unknown               Unknown  Unknown
cesm.exe           0000000001980CC7  pio_spmd_utils_mp         345  pio_spmd_utils.F90.in
cesm.exe           000000000197A4A9  box_rearrange_mp_        1372  box_rearrange.F90.in
cesm.exe           000000000197184D  box_rearrange_mp_        1029  box_rearrange.F90.in
cesm.exe           000000000185941D  piolib_mod_mp_pio        1166  piolib_mod.F90
cesm.exe           00000000004E0031  cam_pio_utils_mp_         449  cam_pio_utils.F90
cesm.exe           0000000000978B65  ncdio_atm_mp_infl         279  ncdio_atm.F90
cesm.exe           000000000087BDFA  inidat_mp_read_in         230  inidat.F90
cesm.exe           0000000000606AE2  startup_initialco          54  startup_initialconds.F90
cesm.exe           00000000005177C4  inital_mp_cam_ini          51  inital.F90
cesm.exe           00000000004A8E91  cam_comp_mp_cam_i         164  cam_comp.F90
cesm.exe           00000000004A4DB0  atm_comp_mct_mp_a         276  atm_comp_mct.F90
cesm.exe           0000000000434AEB  ccsm_comp_mod_mp_        1058  ccsm_comp_mod.F90
cesm.exe           00000000004372E6  MAIN__                     90  ccsm_driver.F90
cesm.exe           0000000000413D56  Unknown               Unknown  Unknown
libc.so.6          00002B82D6D87C36  Unknown               Unknown  Unknown
cesm.exe           0000000000413C49  Unknown               Unknown  Unknown

As this is a first try with the model, we assigned 40 cores to each component. We are not expecting performance at this point, only checking that the compilation is correct.

Since we did not know where to look, we recompiled with DEBUG set to TRUE, but then the model finishes with no error.

It is confusing. I presume one of the debug flags changes the execution behaviour.

Does anyone have a clue on how to proceed?

Macros:

CPPDEFS+= -DFORTRANUNDERSCORE -DNO_R16 -Dlinux -DCPRINTEL 
SLIBS+=$(shell $(NETCDF_PATH)/bin/nc-config --flibs)
CFLAGS:= -O2 -fp-model precise 
CXX_LDFLAGS:= -cxxlib 
CXX_LINKER:=FORTRAN
FC_AUTO_R8:= -r8 
FFLAGS:= -fp-model source -convert big_endian -assume byterecl -ftz -traceback -assume realloc_lhs 
FFLAGS_NOOPT:= -O0 
FIXEDFLAGS:= -fixed -132 
FREEFLAGS:= -free 
MPICC:=mpiicc
MPICXX:= mpiicpc
MPIFC:= mpiifort 
MPI_LIB_NAME:=mpi
MPI_PATH:=/share/software/impi/4.1.3.045/intel64
NETCDF_PATH:=/share/software/netCDF/4.3.2-ictce-6.1.5
PNETCDF_PATH:=
SCC:= icc 
SCXX:= icpc 
SFC:= ifort 
SUPPORTS_CXX:=TRUE
ifeq ($(DEBUG), TRUE) 
   FFLAGS += -O0 -g -check uninit -check bounds -check pointers -fpe0 
endif
ifeq ($(DEBUG), FALSE) 
   FFLAGS += -O2 
endif
ifeq ($(compile_threaded), true) 
   LDFLAGS += -openmp 
   CFLAGS += -openmp 
   FFLAGS += -openmp 
endif

ifeq ($(MODEL), pop2) 
   CPPDEFS += -D_USE_FLOW_CONTROL 
endif

atm log:

  2.705257020083618E-004
 initcom: lat, clat, w          192   1.57079632679490     
  3.381742815944389E-005
 Number of longitudes per latitude =          288
 PHYS_GRID_INIT:  Using PCOLS=          16   phys_loadbalance=           2 
   phys_twin_algorithm=           1   phys_alltoall=          -1 
   chunks_per_thread=           1
 chem_surfvals_init: ghg surface values are fixed as follows
   co2 volume mixing ratio =   2.847000000000000E-004
   ch4 volume mixing ratio =   7.916000000000000E-007
   n2o volume mixing ratio =   2.756800000000000E-007
   f11 volume mixing ratio =   1.248000000000000E-011
   f12 volume mixing ratio =   0.000000000000000E+000
 INITIALIZE_RADBUFFER: ntoplw =           1  pressure:   354.463800000001     
 Creating new decomp:             2602192288

general log:

 CESM BUILDNML SCRIPT STARTING
 - To prestage restarts, untar a restart.tar file into /scratch/udic/ajornet/example9_2/run
 infile is /home/ajornet/cesm/example9_2/Buildconf/cplconf/cesm_namelist 
CAM writing dry deposition namelist to drv_flds_in 
CAM writing namelist to atm_in 
CLM configure done.
CLM adding use_case 1850_control defaults for var sim_year with val 1850 
CLM adding use_case 1850_control defaults for var sim_year_range with val constant 
CLM adding use_case 1850_control defaults for var stream_year_first_ndep with val 1850 
CLM adding use_case 1850_control defaults for var stream_year_last_ndep with val 1850 
CLM adding use_case 1850_control defaults for var use_case_desc with val Conditions to simulate 1850 land-use 
CICE configure done.
Getting init_ts_file_fmt from /share/data/udic/cesm/inputdata/ccsm4_init/b40.1850.track1.1deg.006/0863-01-01/rpointer.ocn.restart
POP2 build-namelist: ocn_grid is gx1v6 
POP2 build-namelist: ocn_tracer_modules are  iage 
 CESM BUILDNML SCRIPT HAS FINISHED SUCCESSFULLY
-------------------------------------------------------------------------
-------------------------------------------------------------------------
 CESM PRESTAGE SCRIPT STARTING
 - Case input data directory, DIN_LOC_ROOT, is /share/data/udic/cesm/inputdata
 - Checking the existence of input datasets in DIN_LOC_ROOT

Any files with "status unknown" below were not found in the
expected location, and are not from the input data repository.
This is informational only; this script will not attempt to
find these files. If CESM can find (or does not need) these files
at run time, no error will result.
Input Data List Files Found:
/home/ajornet/cesm/example9_2/Buildconf/cpl.input_data_list
/home/ajornet/cesm/example9_2/Buildconf/cice.input_data_list
/home/ajornet/cesm/example9_2/Buildconf/rtm.input_data_list
/home/ajornet/cesm/example9_2/Buildconf/clm.input_data_list
/home/ajornet/cesm/example9_2/Buildconf/pop2.input_data_list
/home/ajornet/cesm/example9_2/Buildconf/cam.input_data_list
File status unknown: b40.1850.track1.1deg.006.clm2.r.0863-01-01-00000.nc 
File status unknown: b40.1850.track1.1deg.006.clm2.r.0863-01-01-00000.nc 
File status unknown: b40.1850.track1.1deg.006.cam.i.0863-01-01-00000.nc 

 - Prestaging REFCASE (ccsm4_init/b40.1850.track1.1deg.006/0863-01-01) to /scratch/udic/ajornet/example9_2/run
 CESM PRESTAGE SCRIPT HAS FINISHED SUCCESSFULLY
-------------------------------------------------------------------------
Wed Nov 19 13:45:33 CET 2014 -- CSM EXECUTION BEGINS HERE
Wed Nov 19 13:45:46 CET 2014 -- CSM EXECUTION HAS FINISHED
Model did not complete - see /scratch/udic/ajornet/example9_2/run/cesm.log.141119-134455
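For context, the usual CESM 1.2 sequence for setting up, building, and running a case on a user-defined machine looks roughly like this (a sketch only; the hand edits to the env_*.xml files and Macros that "porting" implies are not shown):
Code:
./create_newcase -case ~/cesm/EXAMPLE_CASE -compset B_1850_CN -res 0.9x1.25_gx1v6 -mach userdefined
cd ~/cesm/EXAMPLE_CASE
# fill in the userdefined machine settings (compiler, MPI, paths) in the env_*.xml files
# and in Macros before running setup
./cesm_setup
./EXAMPLE_CASE.build
./EXAMPLE_CASE.submit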
 
More details on the error. This time it was compiled with DEBUG and launched with 64 cores.
 /home/ajornet/Downloads/cesm1_2_2/models/utils/pio/pionfwrite_mod.F90.in
 /home/ajornet/Downloads/cesm1_2_2/models/utils/pio/pionfwrite_mod.F90.in
         127      491520
         127      491520
 /home/ajornet/Downloads/cesm1_2_2/models/utils/pio/pionfwrite_mod.F90.in
         127      368640
 pionfwrite_mod::write_nfdarray_double: 0: done writing for self           3
 pionfwrite_mod::write_nfdarray_double: File%iosystem%comp_rank:           4
 pionfwrite_mod::write_nfdarray_double 1 receiving from            1      491520
 : relaying IOBUF for write size=      368640           1           1
forrtl: severe (193): Run-Time Check Failure. The variable 'pionfwrite_mod_mp_write_nfdarray_double_$I' is being used without being defined
Image              PC                Routine            Line        Source
cesm.exe           0000000006E222E2  piodarray_mp_writ         607  piodarray.F90.in
cesm.exe           0000000006DDE0C5  piodarray_mp_writ         207  piodarray.F90.in
cesm.exe           0000000006DF4F31  piodarray_mp_writ         277  piodarray.F90.in
cesm.exe           0000000005BFB792  io_netcdf_mp_writ        1223  io_netcdf.F90
cesm.exe           0000000005BCE459  io_mp_data_set_           174  io.F90
cesm.exe           0000000006430E79  hmix_aniso_mp_wri        1486  hmix_aniso.F90
cesm.exe           00000000063C0913  hmix_aniso_mp_ini         468  hmix_aniso.F90
cesm.exe           0000000005B687DB  horizontal_mix_mp         276  horizontal_mix.F90
cesm.exe           0000000005B9A41A  initial_mp_pop_in         493  initial.F90
cesm.exe           00000000055091B2  pop_initmod_mp_po         162  POP_InitMod.F90
cesm.exe           000000000502D91E  ocn_comp_mct_mp_o         284  ocn_comp_mct.F90
cesm.exe           0000000000433060  ccsm_comp_mod_mp_        1130  ccsm_comp_mod.F90
cesm.exe           00000000004C21D9  MAIN__                     90  ccsm_driver.F90
cesm.exe           0000000000413D36  Unknown               Unknown  Unknown
libc.so.6          00002B4010D4AC36  Unknown               Unknown  Unknown
cesm.exe           0000000000413C29  Unknown               Unknown  Unknown
 

jedwards

CSEG and Liaisons
Staff member
Hi Albert, I'm sorry but this error is a problem in the debug code (a loop variable i is used without being defined) and not the same problem that you are seeing with debug off. Looking back at your first message, this could be a problem with Intel MPI - we don't have a lot of experience with that library. Can you try again with mpich?
 
Hi jedwards, understood regarding the debug option. As you requested, CESM 1.2.2 has been compiled with MPICH2 and then launched.
  • In Macros: I introduced the -mcmodel=medium flag due to problems.
  • With MPICH2 it fails at the same point as with Intel MPI (interesting).
I attach the CESM logs and Macros.
CESM LOG:
 Opened existing file b40.1850.track1.1deg.006.cam.i.0863-01-01-00000.nc
       65536
 Opened existing file
 /share/data/udic/cesm/inputdata/atm/cam/topo/USGS-gtopo30_0.9x1.25_remap_c05102
 7.nc      131072
 NetCDF: Invalid dimension ID or name
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
cesm.exe           0000000001D0C259  Unknown               Unknown  Unknown
cesm.exe           0000000001D0ABD0  Unknown               Unknown  Unknown
cesm.exe           0000000001CA7FB2  Unknown               Unknown  Unknown
cesm.exe           0000000001C33E63  Unknown               Unknown  Unknown
cesm.exe           0000000001C3D87B  Unknown               Unknown  Unknown
libpthread.so.0    00002B494F1E9800  Unknown               Unknown  Unknown
cesm.exe           0000000001D19ADB  Unknown               Unknown  Unknown
libmpich.so.10     00002B494FA90116  Unknown               Unknown  Unknown
libmpich.so.10     00002B494FA8D9AC  Unknown               Unknown  Unknown
libmpich.so.10     00002B494FAA74D0  Unknown               Unknown  Unknown
libmpich.so.10     00002B494FAA7789  Unknown               Unknown  Unknown
libmpich.so.10     00002B494FA8CBE0  Unknown               Unknown  Unknown
libmpich.so.10     00002B494FB5A078  Unknown               Unknown  Unknown
libmpich.so.10     00002B494FB59D78  Unknown               Unknown  Unknown
libmpich.so.10     00002B494FA5569D  Unknown               Unknown  Unknown
cesm.exe           0000000001B60113  pio_spmd_utils_mp         345  pio_spmd_utils.F90.in
cesm.exe           0000000001B59382  box_rearrange_mp_        1372  box_rearrange.F90.in
cesm.exe           0000000001B5044D  box_rearrange_mp_        1029  box_rearrange.F90.in
cesm.exe           0000000001A27483  piolib_mod_mp_pio        1166  piolib_mod.F90
cesm.exe           00000000004FBCC4  cam_pio_utils_mp_         449  cam_pio_utils.F90
cesm.exe           00000000009E51FB  ncdio_atm_mp_infl         279  ncdio_atm.F90
cesm.exe           00000000008D83BC  inidat_mp_read_in         230  inidat.F90
cesm.exe           000000000063C182  startup_initialco          54  startup_initialconds.F90
cesm.exe           0000000000537824  inital_mp_cam_ini          51  inital.F90
cesm.exe           00000000004BF26F  cam_comp_mp_cam_i         164  cam_comp.F90
cesm.exe           00000000004BA914  atm_comp_mct_mp_a         276  atm_comp_mct.F90
cesm.exe           000000000043E7D7  ccsm_comp_mod_mp_        1058  ccsm_comp_mod.F90
cesm.exe           0000000000441AD0  MAIN__                     90  ccsm_driver.F90
cesm.exe           0000000000413AC6  Unknown               Unknown  Unknown
libc.so.6          00002B4950948C36  Unknown               Unknown  Unknown
cesm.exe           00000000004139B9  Unknown               Unknown  Unknown
Fatal error in MPI_Recv: A process has failed, error stack:
MPI_Recv(184).............: MPI_Recv(buf=0x10c6c5c0, count=1, dtype=0x4c000829, src=0, tag=13, comm=0xc400003b, status=0x78da0a0) failed
dequeue_and_set_error(888): Communication error with rank 1
Fatal error in MPI_Recv: A process has failed, error stack:
MPI_Recv(184).............: MPI_Recv(buf=0x10c6c310, count=1, dtype=0x4c000829, src=0, tag=1, comm=0xc400003b, status=0x78da0a0) failed
dequeue_and_set_error(888): Communication error with rank 1
Fatal error in MPI_Recv: A process has failed, error stack:
MPI_Recv(184).............: MPI_Recv(buf=0x114a81a0, count=1, dtype=0x4c000829, src=0, tag=6, comm=0xc400003b, status=0x78da0a0) failed
dequeue_and_set_error(888): Communication error with rank 1
Fatal error in MPI_Recv: A process has failed, error stack:
MPI_Recv(184).............: MPI_Recv(buf=0x114af900, count=1, dtype=0x4c000829, src=0, tag=11, comm=0xc400005c, status=0x78da0a0) failed
dequeue_and_set_error(888): Communication error with rank 1
Fatal error in MPI_Recv: A process has failed, error stack:
MPI_Recv(184).............: MPI_Recv(buf=0x1161b8c0, count=1, dtype=0x4c000829, src=0, tag=14, comm=0xc400003a, status=0x78da0a0) failed
dequeue_and_set_error(888): Communication error with rank 1
Fatal error in MPI_Recv: A process has failed, error stack:
MPI_Recv(184).............: MPI_Recv(buf=0x1149f040, count=1, dtype=0x4c000829, src=0, tag=8, comm=0xc400003b, status=0x78da0a0) failed
dequeue_and_set_error(888): Communication error with rank 1
Fatal error in MPI_Recv: A process has failed, error stack:
MPI_Recv(184).............: MPI_Recv(buf=0x1149df30, count=1, dtype=0x4c000829, src=0, tag=10, comm=0xc400003a, status=0x78da0a0) failed
dequeue_and_set_error(888): Communication error with rank 1
Fatal error in MPI_Recv: A process has failed, error stack:
MPI_Recv(184).............: MPI_Recv(buf=0x10c032d0, count=1, dtype=0x4c000829, src=0, tag=3, comm=0xc400003a, status=0x78da0a0) failed
dequeue_and_set_error(888): Communication error with rank 1
Fatal error in PMPI_Wait: A process has failed, error stack:
PMPI_Wait(180)............: MPI_Wait(request=0x7ffffde45c70, status=0x78e3360) failed
MPIR_Wait_impl(77)........:
dequeue_and_set_error(888): Communication error with rank 1
Fatal error in MPI_Recv: A process has failed, error stack:
MPI_Recv(184).............: MPI_Recv(buf=0x10c6c0a0, count=1, dtype=0x4c000829, src=0, tag=4, comm=0xc400003b, status=0x78da0a0) failed
dequeue_and_set_error(888): Communication error with rank 1
Fatal error in MPI_Recv: A process has failed, error stack:
MPI_Recv(184).............: MPI_Recv(buf=0x114a7e50, count=1, dtype=0x4c000829, src=0, tag=7, comm=0xc400003a, status=0x78da0a0) failed
dequeue_and_set_error(888): Communication error with rank 1
Fatal error in PMPI_Wait: A process has failed, error stack:
PMPI_Wait(180)............: MPI_Wait(request=0x7fff747332f0, status=0x78e3360) failed
MPIR_Wait_impl(77)........:
dequeue_and_set_error(888): Communication error with rank 1
Fatal error in PMPI_Wait: A process has failed, error stack:
PMPI_Wait(180)............: MPI_Wait(request=0x7fffa64017f0, status=0x78e3360) failed
MPIR_Wait_impl(77)........:
===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 174
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:5@ithaca07] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:5@ithaca07] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:5@ithaca07] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:6@ithaca06] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:6@ithaca06] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:6@ithaca06] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:4@ithaca20] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:4@ithaca20] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:4@ithaca20] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:2@ithaca16] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:2@ithaca16] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:2@ithaca16] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:9@ithaca26] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:9@ithaca26] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77):
[mpiexec@ithaca14] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
 

jedwards

CSEG and Liaisons
Staff member
I don't see any obvious problems - are you perhaps exceeding ulimits?  Exceeding the account stack limit often results in an error like this.   
 
We have set these:
limit coredumpsize unlimited
limit stacksize unlimited
We also applied the fix from this issue: https://bb.cgd.ucar.edu/fv-dycore-spmddynf90-and-intel-14x-compiler
And solved this as well: https://bb.cgd.ucar.edu/misstype-input-filename-b1850-cam
At this point it fails, but with no clear message. I attach the logs.
CESM logs:
NetCDF: Attribute not found
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(1525)......: MPI_Bcast(buf=0x7fff04345910, count=1, MPI_INTEGER, root=0, comm=0xc400001a) failed
MPIR_Bcast_impl(1369).:
MPIR_Bcast_intra(1160):
MPIR_SMP_Bcast(1077)..: Failure during collective

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0@ithaca45] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:0@ithaca45] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@ithaca45] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@ithaca45] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@ithaca45] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@ithaca45] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for completion
[mpiexec@ithaca45] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion
Ocean logs:
 Latitudinal Auxiliary Grid Choice is:
 ... based on flipped Southern Hemisphere latititude grid
 
Transport diagnostics include:
MOC
N_HEAT
N_SALT
 
 The following            6
  regions will be included in the n_transport_reg = 2 transports:
  Atlantic Ocean                     ( 6)
  Mediterranean Sea                  ( 7)
  Labrador Sea                       ( 8)
  GIN Sea                            ( 9)
  Arctic Ocean                       (10)
  Hudson Bay                         (11)
 

jedwards

CSEG and Liaisons
Staff member
I notice that you are running on 3 nodes and that you have an unequal number of tasks on each node (2, 8, 6). I'm not sure this is the problem, but you should try running with an equal number of tasks on each node.
 
I tried with an equal number of cores per node, with the same behaviour. I moved to -O0 to rule out any optimization-related error. Doing some research, I found out that the first MPI process eats all the memory on its node (48 GB) until it crashes. I launched with 128 cores. I attach the pes config file.
Testing case:
Code:
./create_newcase -case ~/cesm/example11 -compset B_1850_CAM5_CN -res 0.9x1.25_gx1v6 -mach ithaca
Any suggestion?
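For illustration, per-component task counts in a CESM 1.2 case are normally set in env_mach_pes.xml with xmlchange; a minimal sketch (the 128-task, no-threading values are illustrative and are not the attached pes configuration):
Code:
# set every component to 128 MPI tasks and 1 thread, then re-run setup
for comp in ATM LND ICE OCN CPL GLC ROF; do
  ./xmlchange -file env_mach_pes.xml -id NTASKS_$comp -val 128
  ./xmlchange -file env_mach_pes.xml -id NTHRDS_$comp -val 1
done
./cesm_setup -clean
./cesm_setup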
 

jedwards

CSEG and Liaisons
Staff member
Hi Albert, you should start with some simpler compsets and work your way up to the B compset. Try -compset A -res f19_g16_rx1; if you can run that successfully, then try a C or F compset. Once all of those work, then maybe you can try a B. Also, f09_g16 is a very high resolution to try to run on such a small number of tasks.
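For example, the A-compset test could be created with something like this (the case path is illustrative; the machine name is taken from the earlier post):
Code:
./create_newcase -case ~/cesm/example_A -compset A -res f19_g16_rx1 -mach ithaca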
 
Hi jedwards,
Compset A -res f19_g16_rx1 -> OK
Compset C -res f19_g16_rx1 -> Fails due to memory in the ocean component (POP2). It uses 64 cores. Always the same pattern: the process with the lowest id eats all the memory. So it has the same behaviour as my original test.
 

jedwards

CSEG and Liaisons
Staff member
Looks like a memory leak in the pop model - does it still occur if you compile with DEBUG=TRUE?   
 
With DEBUG=TRUE there is no error at all; it successfully finishes the simulation. For DEBUG=FALSE, I managed to pause the MPI process with the lowest pid (using idb) while the memory was filling up. This is where it stops, in ocn/pop2/source/diags_on_lat_aux_grid.F90:
Code:
REGION_MASK_LAT_AUX(:,:,1) =   &
                    merge( 1, 0, REGION_MASK_LAT_AUX(:,:,1) > 0 )
 

jedwards

CSEG and Liaisons
Staff member
Please try the following change:
Code:
-     REGION_MASK_LAT_AUX(:,:,1) =   &
-                    merge( 1, 0, REGION_MASK_LAT_AUX(:,:,1) > 0 )
+     where(REGION_MASK_LAT_AUX(:,:,1) > 0)
+        REGION_MASK_LAT_AUX(:,:,1) = 1
+     elsewhere
+        REGION_MASK_LAT_AUX(:,:,1) = 0
+     end where
 
Dear jedwards, I have a similar error too, but I am unable to access the bugzilla link; it seems broken. Can you post the file with the changes here, or the line number where the error is located, so the changes can be made?
 
I am unable to make sense of the bugfix. What do +++ or --- or + and - mean? Am I supposed to remove the lines marked with minus and add the lines marked with plus? How is the bug to be fixed?
 
Dear Software Engineer,
We're trying to run CESM 1.2.2 in FMOZ mode on our cluster, and we encountered similar issues with:
===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 134
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
The run simply stopped after this. The log files are attached, together with some env_*.xml files and namelists. I don't think it has to do with the ocean model, which we turned off, but we're not sure what could cause this because the log files don't really tell us. Any ideas? Your insights will be much appreciated.
 

jedwards

CSEG and Liaisons
Staff member
My guess is that the stack size limit for your user is too small. Check it with the ulimit command and increase it (try setting it to unlimited).
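For reference, a minimal way to check and raise the limit from a bash shell before launching the model (csh/tcsh users would use the limit commands quoted earlier in this thread):
Code:
# show the current limits for this shell
ulimit -a
# raise the stack size limit (bash/sh syntax)
ulimit -s unlimited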
 