
Malloc error

Hi,

I have installed CESM 1.2.2 and am trying to run the B1850 compset with the Intel compiler. (I have attached env_mach_specific, config_compilers.xml, config_machines.xml, and the cesm log file for reference.)

I am able to run F compsets successfully. For B compsets, however, the case builds successfully but I get the following error when running it:

cesm.exe: malloc.c:4048: _int_malloc: Assertion `(unsigned long) (size) >= (unsigned long) (nb)' failed.
malloc(): invalid size (unsorted)
malloc(): invalid size (unsorted)
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------

I tried setting 'ulimit -s unlimited', but I get the same error.
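
In case it matters, this is roughly where I applied the limit (a sketch of my job script only; the launcher line, task count, and log handling below are assumptions, not copies of the attached files):

Code:
# In the batch/job script, just before the model is launched (sketch; adjust to your launcher).
ulimit -s unlimited                  # remove the shell stack-size limit
ulimit -a                            # record the limits that actually took effect
mpirun -np 672 ./cesm.exe            # placeholder launch line (assumed OpenMPI launcher)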

I am not sure what the error means and would appreciate any guidance on how to fix it.

Thanks in advance...
 

Attachments

  • env_mach_specific.txt (1.2 KB)
  • cesm.log.txt (283.1 KB)
  • config_compilers.txt (38.4 KB)
  • config_machines.txt (28.5 KB)

mlevy (Michael Levy), CSEG and Liaisons, Staff member
The memory issue seems to be coming from the ocean component:

Code:
Image              PC                Routine            Line        Source
cesm.exe           00000000017BECCB  Unknown               Unknown  Unknown
libpthread-2.28.s  0000153DD3EF2DC0  Unknown               Unknown  Unknown
libucp.so.0.0.0    0000153DC072A0E9  ucp_worker_progre     Unknown  Unknown
mca_pml_ucx.so     0000153DC0B8C317  mca_pml_ucx_progr     Unknown  Unknown
libopen-pal.so.40  0000153DCF1D417C  opal_progress         Unknown  Unknown
libmpi.so.40.20.2  0000153DD4B9EDD5  ompi_request_defa     Unknown  Unknown
libmpi.so.40.20.2  0000153DD4BE6744  ompi_coll_base_ba     Unknown  Unknown
libmpi.so.40.20.2  0000153DD4BB4FB8  MPI_Barrier           Unknown  Unknown
libmpi_mpifh.so    0000153DD44C9463  MPI_Barrier_f08       Unknown  Unknown
cesm.exe           0000000001235E2B  broadcast_mp_broa         205  broadcast.F90
cesm.exe           000000000136F899  pressure_grad_mp_         134  pressure_grad.F90
cesm.exe           00000000013223F3  initial_mp_pop_in         356  initial.F90
cesm.exe           00000000011E7977  pop_initmod_mp_po         102  POP_InitMod.F90
cesm.exe           000000000112E883  ocn_comp_mct_mp_o         261  ocn_comp_mct.F90
cesm.exe           0000000000440BA3  ccsm_comp_mod_mp_        1130  ccsm_comp_mod.F90
cesm.exe           00000000004435C7  MAIN__                     90  ccsm_driver.F90
cesm.exe           00000000004169A2  Unknown               Unknown  Unknown
libc-2.28.so       0000153DD3B40873  __libc_start_main     Unknown  Unknown
cesm.exe           00000000004168AE  Unknown               Unknown  Unknown

I wonder if the ocean is trying to run with an inconvenient number of tasks -- how many cores are you running on?

Could you please provide the ocn.log file from your run directory as well as env_mach_pes.xml from your case directory? And ocn.bldlog from your build directory could also be useful.
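
If it helps, something along these lines should collect all three (just a sketch; the three directory variables are placeholders for wherever your machine entry actually puts them):

Code:
# Placeholder paths -- set these to your run, case, and build locations.
RUNDIR=/path/to/run/dir              # contains ocn.log.*
CASEROOT=/path/to/case/dir           # contains env_mach_pes.xml
EXEROOT=/path/to/build/dir           # the build tree containing ocn.bldlog.*
mkdir -p ~/forum_upload
cp "$RUNDIR"/ocn.log.* ~/forum_upload/
cp "$CASEROOT"/env_mach_pes.xml ~/forum_upload/
find "$EXEROOT" -name 'ocn.bldlog.*' -exec cp {} ~/forum_upload/ \;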
 
Hi Michael,

I am running all components sequentially on 672 cores.
I have attached the env_mach_pes.xml, ocn.log and ocn.bldlog.
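
For reference, this is how I checked the layout from the case directory (just a quick grep, nothing CESM-specific):

Code:
# In a fully sequential layout every ROOTPE_* is 0 and every NTASKS_* equals the
# total core count (672 here).
grep -E 'NTASKS_|ROOTPE_' env_mach_pes.xml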

Thanks.



Attachments

  • ocn.bldlog.200415-204439.gz (5.4 KB)
  • env_mach_pes.txt (5.8 KB)
  • ocn.log.txt (32.4 KB)

mlevy (Michael Levy), CSEG and Liaisons, Staff member
The ocean model sometimes struggles with unexpected task counts, so one thing to try would be 480 tasks for the ocean instead of the full 672: the grid is 320 x 384 cells, so with 480 tasks each task gets a 16x16 block (20x20 once it is padded by a halo region to reduce communication). You can do that either by creating a new case and running

Code:
$ ./xmlchange NTASKS_OCN=480

before running cesm_setup

or you can try cleaning up your current case and rebuilding:

Code:
$ ./${CASE}.clean_build all
$ ./cesm_setup -clean
$ ./xmlchange NTASKS_OCN=480
$ ./cesm_setup
$ ./${CASE}.build

Let me know if this works, and if you'd like more detail on how to pick task counts to try.
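
For reference, the arithmetic behind the 480 suggestion looks like this (a back-of-the-envelope sketch that just mirrors the numbers above; the block sizes POP actually uses are set at build time):

Code:
# Sanity check of the suggested ocean decomposition (320 x 384 ocean grid, 480 tasks).
nx=320; ny=384; ntasks=480
echo "cells per task: $(( nx * ny / ntasks ))"                                   # 256 = 16 x 16
echo "blocks: $(( nx / 16 )) x $(( ny / 16 )) = $(( (nx / 16) * (ny / 16) ))"    # 20 x 24 = 480
echo "block with 2-cell halo: $(( 16 + 2*2 )) x $(( 16 + 2*2 ))"                 # 20 x 20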
 