Assertion failed at ibdev_multirail.c:4297: "0 <= chan->queued"

samrabin · Jul 12, 2022

I'm doing a 2-degree spinup of ctsm5.1.dev092. The 200-year CLM_ACCELERATED_SPINUP run went fine, as did the first 400 years of the second phase. Now I'm trying to do another 400 years, and I'm running into this error:

Code:

1229:MPT ERROR: Assertion failed at ibdev_multirail.c:4297: "0 <= chan->queued"
1229:MPT ERROR: Rank 1229(g:1229) is aborting with error code 0.
1229:   Process ID: 51000, Host: r1i5n32, Program: /glade/scratch/samrabin/spinup_ctsm5.1.dev092_I1850Clm50BgcCrop_f19-g17_pt2/bld/cesm.exe
1229:   MPT Version: HPE MPT 2.22  03/31/20 15:59:10

The first time this happened, I thought it might be a fluke and resubmitted. The resubmitted run successfully continued past the timestep where it crashed before, but about 100 years later it crashed again with the same failed assertion.

I'm wondering what my next troubleshooting steps should be. I could resubmit with DEBUG=TRUE, but I'm hesitant because the error isn't reliably reproducible. There is also a fair amount of traceback already; I'd paste it here, but it's too many characters. It's in

/glade/u/home/samrabin/cases_ctsm/spinup_ctsm5.1.dev092_I1850Clm50BgcCrop_f19-g17_pt2/logs/run_logs/cesm.log.4956786.chadmin1.ib0.cheyenne.ucar.edu.220711-233504

Any suggestions much appreciated.

samrabin · Jul 12, 2022

(I should say—that log is from the second failed run, whereas the error snippet in my post is from the first one.)

oleson · Jul 13, 2022

There was a post today about encountering the same error:

Previously successful run crashes after 18 years

Hi all, I'm running a coupled atmosphere-ocean compset in CESM2.2.0 on Cheyenne. I have successfully run this compset for 10 years - in the next run of 10 years it crashed at year 18. The error in the cesm.log is: WHL, oc_tavg_helper is already associated; reset the tavg fields 0: sysmem...

bb.cgd.ucar.edu

The suggestion was that it might be a machine (cheyenne) error and to resubmit.

I see that that error was reported on node r1i5n32, which is the same node your error was reported on, so I wonder if this is a bad node...

samrabin · Jul 13, 2022

Yes, resubmitting did work initially, but it's weird that it ended up happening again. But you're right, the second failure also happened on that same node! I'll try re-resubmitting…
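To make the "same node" comparison less manual, here is a minimal sketch that pulls the host name out of the "Process ID ..., Host: ..." lines that MPT prints when a rank aborts. The function name `failing_hosts` and the regular expression are illustrative assumptions based only on the log snippet quoted in this thread, not part of any CESM or MPT tooling.

```python
import re
from collections import Counter

# Assumed line shape, taken from the snippet in this thread:
# "1229:   Process ID: 51000, Host: r1i5n32, Program: /path/to/cesm.exe"
HOST_RE = re.compile(r"^\s*\d+:\s+Process ID:\s*\d+,\s*Host:\s*([^,\s]+)")


def failing_hosts(log_text):
    """Count host names found in 'Process ID ..., Host: ...' abort lines.

    Hypothetical helper: it only recognises the line shape shown above.
    """
    hosts = Counter()
    for line in log_text.splitlines():
        match = HOST_RE.match(line)
        if match:
            hosts[match.group(1)] += 1
    return hosts


# The error snippet from the first post, used as sample input.
sample = """\
1229:MPT ERROR: Assertion failed at ibdev_multirail.c:4297: "0 <= chan->queued"
1229:MPT ERROR: Rank 1229(g:1229) is aborting with error code 0.
1229:   Process ID: 51000, Host: r1i5n32, Program: /glade/scratch/samrabin/spinup_ctsm5.1.dev092_I1850Clm50BgcCrop_f19-g17_pt2/bld/cesm.exe
1229:   MPT Version: HPE MPT 2.22  03/31/20 15:59:10
"""

print(failing_hosts(sample))
```

Running the same function over the cesm.log of each failed run and comparing the resulting counts would show whether one host keeps turning up.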

minminfu · Nov 17, 2022

I just had a similar issue; I find that resubmission often fixes MPT errors. These are the Cheyenne nodes the failed job ran on:

exec_vnode = (r12i0n33:ncpus=36)+(r12i0n31:ncpus=36)+(r1i2n15:ncpus=36)+(r8
i1n27:ncpus=36)+(r2i6n4:ncpus=36)+(r3i1n2:ncpus=36)+(r7i0n22:ncpus=36)+
(r6i3n1:ncpus=36)+(r3i3n33:ncpus=36)+(r13i2n27:ncpus=36)+(r14i5n6:ncpus
=36)+(r14i5n25:ncpus=36)+(r1i1n24:ncpus=36)+(r12i4n11:ncpus=36)+(r10i4n
15:ncpus=36)+(r14i2n35:ncpus=36)+(r10i7n16:ncpus=36)+(r1i3n25:ncpus=36)
+(r3i4n7:ncpus=36)+(r1i2n5:ncpus=36)+(r6i2n20:ncpus=36)+(r1i5n27:ncpus=
36)+(r3i7n21:ncpus=36)+(r9i1n25:ncpus=36)

Assertion failed at ibdev_multirail.c:4297: "0 <= chan->queued"
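A wrapped exec_vnode listing like the one above is awkward to read by eye, because node names are split across lines. The sketch below joins the wrapped lines and extracts the node names so that a suspect node can be looked up directly. The helper name `parse_exec_vnode` is hypothetical, and the pattern assumes every entry has the `(name:ncpus=N)` form seen in this post.

```python
import re


def parse_exec_vnode(wrapped):
    """Return node names from a line-wrapped 'exec_vnode = (...)+(...)' field.

    Hypothetical helper: assumes each entry looks like '(name:ncpus=N)'.
    """
    # Undo the wrapping first; names such as "r8" + "i1n27" are split mid-token.
    joined = "".join(part.strip() for part in wrapped.splitlines())
    return re.findall(r"\(([^:()]+):ncpus=\d+\)", joined)


# First two lines of the listing from the post, used as sample input.
sample = """\
exec_vnode = (r12i0n33:ncpus=36)+(r12i0n31:ncpus=36)+(r1i2n15:ncpus=36)+(r8
i1n27:ncpus=36)+(r2i6n4:ncpus=36)+(r3i1n2:ncpus=36)+(r7i0n22:ncpus=36)+
"""

nodes = parse_exec_vnode(sample)
print(nodes)
print("r1i5n32" in nodes)
```

Applied to the full listing, the same call can be used to check whether a node reported in an earlier failure, such as r1i5n32, was part of this job.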
