KILLED BY SIGNAL 9 (and Signal 6) with "malloc(): invalid size" during CLM regional spin-up using GSWP3 forcing

kylin
New Member
Hi everyone,

I am trying to run a regional CLM simulation (Qinling/Shanxi area) using CTSM5.2.005 with DATM (GSWP3v1) as forcing. The case builds successfully but fails immediately at runtime with `KILLED BY SIGNAL: 9`, and no output files are produced. I have already spent a lot of time debugging and would appreciate any suggestions.

What I did:

  1. Created case:

    ./create_newcase --case spinup06112 --res CLM_USRDAT --compset 2000_DATM%GSWP3v1_CLM50%SP_SICE_SOCN_MOSART_SGLC_SWAV --machine myintel --compiler intel --run-unsupported
  2. Set up custom grid and forcing (see attached user_nl_clm, user_datm.streams.xml, user_nl_datm and env_run.xml for details). Main settings:
    • CLM using a 0.01° unstructured mesh (186,837 elements) with mask file.
    • DATM streams use <meshfile> pointing to a coarse atm_mesh.nc (2,170 elements).
    • Forcing: GSWP3 0.5° data (years 1951-1999).
    • Spinup mode: accelerated spinup, cold start, 49 years per run.
  3. Built and submitted. The job dies within seconds.

What I have tried (all failed with the same SIGKILL):

  • Set NTASKS=1 (single core) – still killed.
  • Fixed nlevurb mismatch: my surface dataset originally had nlevurb=5, I extended it to 10 using NCO/Python.
  • Checked all input files exist and are readable; GSWP3 files appear normal.
  • Verified memory is not exhausted (node has 503 GB, ~320 GB available when the job runs).

Observations:

  • DATM opens the first solar forcing file successfully, but the process is killed shortly after, while setting up the I/O descriptor for variable FSDS. No data is actually read or interpolated.
  • The cesm.log shows `malloc(): invalid size (unsorted)` followed by termination with mixed signals (SIGABRT on rank 1, SIGKILL on the other ranks). This points to heap corruption rather than a simple out-of-memory condition.
  • No lnd.log output beyond the initialization header.

Attached files:

  • cesm.log, atm.log, lnd.log, drv.log, med.log
  • My detailed setup steps (covering user_nl_clm, user_nl_datm and user_datm.streams.xml)

My question:
What could cause the model to be killed immediately after reading the first forcing file, despite single-core mode and sufficient memory? Is it a mesh/domain mismatch, a library issue, or something else? Any help is greatly appreciated.

Thank you!
 


oleson

Keith Oleson
CSEG and Liaisons
Staff member
Thanks for all of this information. Looking at your datm.streams.xml, I see this for the meshfile:

<meshfile>/data/user/cesm/inputdata/lmwg/atm_mesh.nc</meshfile>

You indicated that this has 2,170 elements. If you are using the default GSWP3 data, as you indicated, then the default mesh file is:

clmforc.GSWP3.c2011.0.5x0.5.TPQWL.SCRIP.210520_ESMFmesh.nc

which has 259,200 elements, corresponding to a 360 x 720 grid (0.5-degree global forcing).

Is that the problem?
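
A quick way to check this is to compare the element count of the mesh file against the number of cells on the forcing grid. The following is only a minimal sketch: the function names are hypothetical, `read_element_count` assumes the `netCDF4` Python package is installed, and it assumes the mesh file follows the ESMF unstructured mesh convention with an `elementCount` dimension.

```python
def expected_element_count(nlat: int, nlon: int) -> int:
    """Number of cells (mesh elements) on a regular nlat x nlon grid."""
    return nlat * nlon


def mesh_matches_grid(element_count: int, nlat: int, nlon: int) -> bool:
    """True if a mesh with element_count elements covers every grid cell once."""
    return element_count == expected_element_count(nlat, nlon)


def read_element_count(mesh_path: str) -> int:
    """Read the element count from a mesh file.

    Assumption: the file has an 'elementCount' dimension (ESMF unstructured
    mesh convention). netCDF4 is imported here so the helpers above work
    without it.
    """
    from netCDF4 import Dataset

    with Dataset(mesh_path) as ds:
        return len(ds.dimensions["elementCount"])


# Numbers quoted in this thread: 0.5-degree global forcing is 360 x 720 cells.
print(expected_element_count(360, 720))    # 259200
print(mesh_matches_grid(2170, 360, 720))   # False: the 2,170-element mesh
```

Calling `read_element_count` on the mesh file named in `<meshfile>` and passing the result to `mesh_matches_grid` along with the lat/lon dimensions of the forcing files would show whether the two are consistent.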
 