
Bus error

samrabin

Sam Rabin
Member
Continuing my struggles to get a global 0.5° run working, I'm now specifying NTASKS: ['CPL:3600', 'ATM:3600', 'LND:3600', 'ICE:3600', 'OCN:3600', 'ROF:3600', 'GLC:3600', 'WAV:3600', 'ESP:100']. (Suggestions appreciated if any of those seem off to you!)
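For concreteness, I'm applying that layout with xmlchange; a sketch, assuming CIME's standard grouped NTASKS variable (which sets every component at once) plus a per-component override:
Code:
# Set all components to 3600 tasks, then override ESP
./xmlchange NTASKS=3600
./xmlchange NTASKS_ESP=100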

It works for a few years, but then crashes on October 29, 1980, after less than two real-time hours. I've tried twice, and it doesn't crash on the same timestep (once at 3600s and once at 72000s), which makes me think it might not be a problem with the code.

The first time it crashed, I got the following error in cesm.log:
Code:
18:MPT ERROR: Rank 18(g:18) received signal SIGBUS(7).
18:    Process ID: 13137, Host: r3i1n7, Program: /glade/scratch/samrabin/halfdeg_test_20220425/bld/cesm.exe
18:    MPT Version: HPE MPT 2.22  03/31/20 15:59:10
18:
18:MPT: --------stack traceback-------
-1:MPT ERROR: MPI_COMM_WORLD rank 30 has terminated without calling MPI_Finalize()
-1:    aborting job
MPT: Received signal 7

No error was given the second time it crashed.

Any ideas of where I should start looking for the issue? Thanks as always.
 

oleson

Keith Oleson
CSEG and Liaisons
Staff member
I'm not sure if it will make a difference, but you could try the PE layout from a recent 1/2-degree run, shown here:

/glade/u/home/dlawren/cases/clm5sp_hd_crujra2020_createICs/clm5sp_hd_crujra2020_createICs/env_mach_pes.xml

In general, we run with very few nodes for datm compared to clm.
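As an illustration, that kind of asymmetric layout might be set along these lines (the task counts here are made up; the real values are in the env_mach_pes.xml above):
Code:
# Hypothetical asymmetric layout: datm on far fewer tasks than clm
./xmlchange NTASKS_ATM=36
./xmlchange NTASKS_LND=1800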

If that doesn't help, I'd suggest compiling and running in debug mode.
 

samrabin

Sam Rabin
Member
Thanks, Keith. I'm still getting the error, unfortunately, and the debug output is either nonexistent or not helpful. I've begun the process of reverting my code piecewise toward the main branch to try to isolate the problem, but it's slow, since it takes about 8-10 real-time hours to hit the error.

In the meantime, I was speaking with someone from CISL, and she mentioned some compiler debugging flags that might help, either to flag the issue at compile time or to describe it better in the error output. Do you know what those are and how I can set them? I'm using the default compiler for Cheyenne, which I think is Intel.
 

oleson

Keith Oleson
CSEG and Liaisons
Staff member
Well, that reminds me: I had cloned your case, redone the PE layout, and run it, and then forgot to update you. It seems to crash consistently at:

atm : model date 19801101 5400

I then reran to get monthly restart files so that I could restart near the crash and try to get a traceback.
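(Monthly restarts can be requested along these lines; a sketch using the standard env_run.xml variables:)
Code:
# Write a restart file every simulated month
./xmlchange REST_OPTION=nmonths
./xmlchange REST_N=1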
I did get a traceback in the cesm log:

Code:
345:MPT: #6 0x00000000021c7adc in nutrientcompetitionflexiblecnmod::calc_plant_nitrogen_demand (this=0x2af1b021eb60, bounds=..., num_soilp=149, filter_soilp=...,
345:MPT: photosyns_inst=..., crop_inst=..., canopystate_inst=...,
345:MPT: cnveg_state_inst=..., cnveg_carbonstate_inst=...,
345:MPT: cnveg_carbonflux_inst=..., c13_cnveg_carbonflux_inst=...,
345:MPT: c14_cnveg_carbonflux_inst=..., cnveg_nitrogenstate_inst=...,
345:MPT: cnveg_nitrogenflux_inst=..., soilbiogeochem_carbonflux_inst=...,
345:MPT: soilbiogeochem_nitrogenstate_inst=..., energyflux_inst=..., aroot=...,
345:MPT: arepr=...)
345:MPT: at /glade/u/home/samrabin/ctsm/cime/../src/biogeochem/NutrientCompetitionFlexibleCNMod.F90:1352
345:MPT: #7 0x00000000021a4c12 in nutrientcompetitionflexiblecnmod::calc_plant_nutrient_demand (this=0x2af1b021eb60, bounds=..., num_soilp=149, filter_soilp=...,
345:MPT: photosyns_inst=..., crop_inst=..., canopystate_inst=...,
345:MPT: cnveg_state_inst=..., cnveg_carbonstate_inst=...,
345:MPT: cnveg_carbonflux_inst=..., c13_cnveg_carbonflux_inst=...,
345:MPT: c14_cnveg_carbonflux_inst=..., cnveg_nitrogenstate_inst=...,
345:MPT: cnveg_nitrogenflux_inst=..., soilbiogeochem_carbonflux_inst=...,
345:MPT: soilbiogeochem_nitrogenstate_inst=..., energyflux_inst=..., aroot=...,
345:MPT: arepr=...)

On the other hand, I'm not sure how useful this is, because it points to a subroutine call rather than to a specific line of code within the subroutine.
My case directory is here:

/glade/work/oleson/ctsm_runs/halfdeg_test_20220425

and my run directory is here:

/glade/scratch/oleson/halfdeg_test_20220425/run

With regard to debugging flags, I think I would ask @sacks and/or @erik.
 

sacks

Bill Sacks
CSEG and Liaisons
Staff member
To add debugging flags, run ./xmlchange DEBUG=TRUE before building. In an existing case, you will need to run ./case.build --clean-all first; alternatively, create a new case. I usually do the latter when a new case is relatively easy to set up, both to be sure I'm really getting a clean build and to avoid messing with an existing case. Note that a debug build will be slow to run, though. Ideally, get a restart file shortly before the crash, then create a new case with DEBUG=TRUE that starts from that restart file.
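Roughly, that workflow might look like this; a sketch in which the case name, compset, resolution, and reference date are placeholders, and the branch-run setup uses the standard CIME variables:
Code:
# Sketch: a fresh debug case that branches from a restart just before the crash.
# Compset/resolution are placeholders; RUN_REFCASE is the case from this thread.
cime/scripts/create_newcase --case dbg_halfdeg --compset <compset> --res <resolution>
cd dbg_halfdeg
./xmlchange DEBUG=TRUE
./xmlchange RUN_TYPE=branch
./xmlchange RUN_REFCASE=halfdeg_test_20220425
./xmlchange RUN_REFDATE=1980-11-01
./case.setup
./case.build      # in an existing case, run ./case.build --clean-all first
# (the restart files must be staged into the new run directory before submitting)
./case.submit
With Intel, my understanding is that DEBUG=TRUE turns on options along the lines of -O0 -g -traceback plus runtime bounds and uninitialized-variable checks; the exact set is CIME's choice, but that is what makes the run slow and the traceback informative.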

I recently made major changes to NutrientCompetitionFlexibleCNMod.F90 on master, so I could talk more with you about this if that seems to be the source of the issue.
 

samrabin

Sam Rabin
Member
Thanks, both!

@oleson, I did end up seeing that stack trace on one run, but I doubted it was really relevant because the crash was non-deterministic: it kept happening at different timesteps. When I did a simplified run, it went away and I started getting SIGNAL 7 again. I've now traced that back to a variable I had introduced but never deallocated, but going back to the original run settings I'm again getting the crash that you saw.

@sacks, my branch is only up to date with ctsm5.1.dev091, which was the commit before you merged in those changes. My next test will be bringing my code up to date to see if that magically fixes the problem, or at least results in a better stack trace.
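Roughly, I expect that update to look like this (a sketch; the "upstream" remote name is assumed, and the tag name is assumed from CTSM's dev-tag numbering):
Code:
# Sketch: merge the next CTSM tag into my branch ("upstream" remote assumed)
git fetch upstream --tags
git merge ctsm5.1.dev092
./manage_externals/checkout_externals    # refresh externals after the merge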
 

samrabin

Sam Rabin
Member
Oh, by the way, I'm wondering if the reason the debug output isn't very helpful is this:
Code:
1538:MPT: Attaching to program: /proc/9319/exe, process 9319
1538:MPT: (No debugging symbols found in /usr/lib64/libdplace.so)
1538:MPT: (No debugging symbols found in /glade/u/apps/opt/intel/2020u1/mkl/lib/intel64/libmkl_intel_lp64.so)
1538:MPT: (No debugging symbols found in /glade/u/apps/opt/intel/2020u1/mkl/lib/intel64/libmkl_cdft_core.so)
1538:MPT: (No debugging symbols found in /glade/u/apps/opt/intel/2020u1/mkl/lib/intel64/libmkl_scalapack_lp64.so)
1538:MPT: (No debugging symbols found in /glade/u/apps/opt/intel/2020u1/mkl/lib/intel64/libmkl_blacs_intelmpi_lp64.so)
1538:MPT: (No debugging symbols found in /glade/u/apps/opt/intel/2020u1/mkl/lib/intel64/libmkl_sequential.so)
1538:MPT: (No debugging symbols found in /glade/u/apps/opt/intel/2020u1/mkl/lib/intel64/libmkl_rt.so)
1538:MPT: Missing separate debuginfo for /glade/u/apps/ch/os/lib64/libpthread.so.0
1538:MPT: Try: zypper install -C "debuginfo(build-id)=9fa703da2a52d78cc49877c18843db7b1ad9049b"
1538:MPT: (No debugging symbols found in /glade/u/apps/ch/os/lib64/libpthread.so.0)
1538:MPT: [Thread debugging using libthread_db enabled]
1538:MPT: Using host libthread_db library "/glade/u/apps/ch/os/lib64/libthread_db.so.1".
1538:MPT: Missing separate debuginfo for /glade/u/apps/ch/os/lib64/libm.so.6
1538:MPT: Try: zypper install -C "debuginfo(build-id)=cc12cf31ea4a157ebc7ac7bdfc09d5bfa3e0f3e0"
1538:MPT: (No debugging symbols found in /glade/u/apps/ch/os/lib64/libm.so.6)
followed by a bunch more messages like those last three lines.
 

sacks

Bill Sacks
CSEG and Liaisons
Staff member
I feel like I commonly get messages like that, but they don't imply you won't get useful debug output. My interpretation (quite possibly wrong) is that the debug info just doesn't extend all the way down into some of the lower-level libraries.
 