Run T31_g37 B1850C5 compset error: malloc(): memory corruption (fast)

Hello there,

I am trying to run the T31_g37 B1850C5 compset, and run into this error:

Overflow: Ross Sea                 Product adjacent mask at global (ij)= 61    8
Overflow: Ross Sea                 Product adjacent mask at global (ij)= 61    9
Overflow: Ross Sea                 Product adjacent mask at global (ij)= 61   10
*** Error in `../bld/cesm.exe': malloc(): memory corruption (fast): 0x000000001148e240 ***
*** Error in `../bld/cesm.exe': malloc(): memory corruption (fast): 0x000000001148e250 ***
*** Error in `../bld/cesm.exe': malloc(): memory corruption (fast): 0x0000000011492810 ***

I have checked the related topic
https://bb.cgd.ucar.edu/cesm-runtime-error-netcdf-invalid-dimension-id-or-name-glibc-detected
and updated the file spmd_dyn.F90, but I still get the same error.

I am using the Intel compiler with MPT on an SGI ICE-XA machine.

I tried printing the variable "num_ovf"; its value is 4. Running with debug on, I get this traceback:

MPT: 0x00002aaaac065f19 in waitpid () from /lib64/libpthread.so.0
MPT: Missing separate debuginfos, use: debuginfo-install glibc-2.17-157.el7.x86_64 libbitmask-2.0-sgi716r63.rhel73.x86_64 libcpuset-1.0-sgi716r94.rhel73.x86_64 libcxgb3-1.3.1-8.el7.x86_64 libgcc-4.8.5-11.el7.x86_64 libhfi1-0.5-23.el7.x86_64 libibverbs-1.2.1-1.el7.x86_64 libmlx4-1.2.1-1.el7.x86_64 libmlx5-1.2.1-8.el7.x86_64 libmthca-1.0.6-13.el7.x86_64 libnl3-3.2.28-2.el7.x86_64 libnuma-3.0sgi-sgi716r61.rhel73.x86_64 libpsm2-devel-10.2.175-1.x86_64 numatools-2.0-sgi716r146.rhel73.x86_64 xpmem-1.6-sgi716r125.rhel73.x86_64
MPT: (gdb) #0  0x00002aaaac065f19 in waitpid () from /lib64/libpthread.so.0
MPT: #1  0x00002aaaaba9784c in mpi_sgi_system (command=,
MPT:     __statbuf=, __fd=) at sig.c:98
MPT: #2  MPI_SGI_stacktraceback (header=) at sig.c:339
MPT: #3  0x00002aaaaba98354 in first_arriver_handler (signo=6,
MPT:     stack_trace_sem=0x2aaab8e60500) at sig.c:488
MPT: #4  0x00002aaaaba985df in slave_sig_handler (signo=6, siginfo=,
MPT:     extra=) at sig.c:563
MPT: #5
MPT: #6  0x00002aaaac2a81d7 in raise () from /lib64/libc.so.6
MPT: #7  0x00002aaaac2a98c8 in abort () from /lib64/libc.so.6
MPT: #8  0x00002aaaac2e7f07 in __libc_message () from /lib64/libc.so.6
MPT: #9  0x00002aaaac2edda4 in malloc_printerr () from /lib64/libc.so.6
MPT: #10 0x00002aaaac2f0dc7 in _int_malloc () from /lib64/libc.so.6
MPT: #11 0x00002aaaac2f2fbc in malloc () from /lib64/libc.so.6
MPT: #12 0x00000000073897fd in for__get_vm ()
MPT: #13 0x0000000007352975 in for__add_to_lf_table ()
MPT: #14 0x00000000073c46db in for__open_proc ()
MPT: #15 0x00000000073599b2 in for__open_default ()
MPT: #16 0x00000000073a4ce4 in for_write_seq_lis ()
MPT: #17 0x000000000532fa91 in ovf_utils::ovf_init_groups ()
MPT:     at /lustre/whuang/wd4cesm1/mpt.T31_g37.B1850.288c.lnd72_ice144_ocn72.tpn36.omp1/bld/ocn/source/ovf_utils.F90:120
MPT: #18 0x000000000530d09d in overflows::ovf_hu (hu=..., hum=...)
MPT:     at /lustre/whuang/wd4cesm1/mpt.T31_g37.B1850.288c.lnd72_ice144_ocn72.tpn36.omp1/bld/ocn/source/overflows.F90:5791
MPT: #19 0x00000000053009ae in overflows::ovf_solvers_9pt ()
MPT:     at /lustre/whuang/wd4cesm1/mpt.T31_g37.B1850.288c.lnd72_ice144_ocn72.tpn36.omp1/bld/ocn/source/overflows.F90:5613
MPT: #20 0x00000000051dc082 in overflows::init_overflows3 ()
MPT:     at /lustre/whuang/wd4cesm1/mpt.T31_g37.B1850.288c.lnd72_ice144_ocn72.tpn36.omp1/bld/ocn/source/overflows.F90:1449
MPT: #21 0x0000000005ce9a01 in initial::pop_init_phase1 (errorcode=0)
MPT:     at /lustre/whuang/wd4cesm1/mpt.T31_g37.B1850.288c.lnd72_ice144_ocn72.tpn36.omp1/bld/ocn/source/initial.F90:345
MPT: #22 0x000000000564f16b in pop_initmod::pop_initialize1 (errorcode=0)
MPT:     at /lustre/whuang/wd4cesm1/mpt.T31_g37.B1850.288c.lnd72_ice144_ocn72.tpn36.omp1/bld/ocn/source/POP_InitMod.F90:102
MPT: #23 0x0000000005122cad in ocn_comp_mct::ocn_init_mct (eclock=..., cdata_o=...,
MPT:     x2o_o=..., o2x_o=..., nlfilename='drv_in', .tmp.NLFILENAME.len_V$2850=6)
MPT:     at /lustre/whuang/wd4cesm1/mpt.T31_g37.B1850.288c.lnd72_ice144_ocn72.tpn36.omp1/bld/ocn/source/ocn_comp_mct.F90:261
MPT: #24 0x000000000043d952 in ccsm_comp_mod::ccsm_init ()
MPT:     at /store/whuang/CESM/cesm1_2_2_1/models/drv/driver/ccsm_comp_mod.F90:1130
MPT: #25 0x00000000004fefb2 in ccsm_driver ()
MPT:     at /store/whuang/CESM/cesm1_2_2_1/models/drv/driver/ccsm_driver.F90:90
MPT: #26 0x0000000000418e9e in main ()
MPT: #27 0x00002aaaac294b35 in __libc_start_main () from /lib64/libc.so.6
MPT: #28 0x0000000000418da9 in _start ()
MPT: (gdb) A debugging session is active.
MPT:
MPT:    Inferior 1 [process 99237] will be detached.
MPT:
MPT: Quit anyway? (y or n) [answered Y; input not from terminal]
MPT: Detaching from program: /proc/99237/exe, process 99237
MPT: -----stack traceback ends-----
MPT: On host r1i7n7, Program /lustre/whuang/wd4cesm1/mpt.T31_g37.B1850.288c.lnd72_ice144_ocn72.tpn36.omp1/bld/cesm.exe, Rank 256, Process 99237: Dumping core on signal SIGABRT/SIGIOT(6) into directory /lustre/whuang/wd4cesm1/mpt.T31_g37.B1850.288c.lnd72_ice144_ocn72.tpn36.omp1/run
MPT ERROR: MPI_COMM_WORLD rank 256 has terminated without calling MPI_Finalize()
        aborting job
MPT: Received signal 6

The file /lustre/whuang/wd4cesm1/mpt.T31_g37.B1850.288c.lnd72_ice144_ocn72.tpn36.omp1/bld/ocn/source/ovf_utils.F90 has, around the line number in the traceback:

 115    integer (int_kind), pointer :: starts(:)
 116
 117    logical (log_kind) :: found, comm_master_present
 118    real (r8), dimension(:,:,:), pointer :: g_mask !the mask
 119
 120    write(0, *) 'num_ovf = ', num_ovf
 121
 122    allocate(ids(num_ovf))
 123    count = 0
 124
 125
 126
 127 !   print *, 'MYPROC: ', my_task, 'OVF_INIT_GROUPS '

(Note: I added the write(0, *) ... line.)

Thanks in advance for your help!

Wei
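For context on why the abort shows up inside the added write statement rather than where anything obviously goes wrong: glibc usually detects heap corruption only at some later malloc or free, and a Fortran write to an unconnected unit triggers exactly such an allocation inside the runtime (the for__open_default / for__get_vm frames in the traceback). The sketch below is a minimal, self-contained illustration of that delayed detection; it is not POP code, and the names and the size of the overrun are made up for the example.

   program delayed_corruption_demo
      implicit none
      integer, allocatable :: ids(:)   ! stands in for the array allocated in ovf_init_groups
      integer :: num_ovf, i

      num_ovf = 4
      allocate(ids(num_ovf))

      ! Deliberate bug: store well past the end of ids. Without bounds checking
      ! this silently overwrites heap metadata belonging to a neighbouring chunk.
      do i = 1, num_ovf + 8
         ids(i) = i
      end do

      ! The first write to an unconnected unit makes the Fortran runtime open the
      ! unit and allocate I/O buffers via malloc; glibc may only notice the earlier
      ! corruption here and abort with "malloc(): memory corruption", far from the
      ! loop that actually caused it.
      write(42, *) 'ids = ', ids(1:num_ovf)
   end program delayed_corruption_demo

Compiling such a case with bounds checking (for Intel, -check bounds -traceback) makes the bad store fail at its real location instead of surfacing later as a malloc abort.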
 

jedwards

CSEG and Liaisons
Staff member
Please provide instructions to reproduce the problem, including the CESM version, compiler version, MPI version, and PE layout. Note that the spmd_dyn.F90 file is for the FV dycore, but you are using the EUL dycore, so that change does nothing.
 

klindsay

CSEG and Liaisons
Staff member
I suspect that in your version of overflows.F90, the call to ovf_HU is preceded by the line

   HUM(:,:,:) = HU(:,:,:)

Please try replacing that with

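   ! Block-by-block copy of HU into HUM (replaces the whole-array assignment above)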
   !$OMP PARALLEL DO PRIVATE(iblock,i,j)
   do iblock = 1,numBlocksClinic
      do j=1,POP_nyBlock
         do i=1,POP_nxBlock
            HUM(i,j,iblock) = HU(i,j,iblock)
         enddo
      enddo
   enddo
   !$OMP END PARALLEL DO 

This change was introduced in later tags of POP with the comment "Fix memory corruption issue in overflows.F90".
 

aaudette

Alexandre Audette
New Member
Hi, I just had the same issue running CESM1.2 with POP2. It used to run fine on 320 cores (all sequential), but when I increased the number of cores to 800 (again all sequential), I ran into the same issue as Wei. I applied the recommended fix and the model now runs, but the throughput is worse than on 320 cores (POP went from 1.716 seconds per model day to 30.294 seconds per model day).

I understand that an explicit do-loop can be slower than an array assignment, but this is almost 20 times slower. Is this something that has been noticed with this fix?
 