
Assertion failed at ibdev_multirail.c:4297: "0 <= chan->queued"

Sam Rabin
Member
I'm doing a 2-degree spinup of ctsm5.1.dev092. The 200-year CLM_ACCELERATED_SPINUP run went fine, as did the first 400 years of the second phase. Now I'm trying to do another 400 years, and I'm running into this error:
Code:
1229:MPT ERROR: Assertion failed at ibdev_multirail.c:4297: "0 <= chan->queued"
1229:MPT ERROR: Rank 1229(g:1229) is aborting with error code 0.
1229:   Process ID: 51000, Host: r1i5n32, Program: /glade/scratch/samrabin/spinup_ctsm5.1.dev092_I1850Clm50BgcCrop_f19-g17_pt2/bld/cesm.exe
1229:   MPT Version: HPE MPT 2.22  03/31/20 15:59:10
The first time this happened, I thought it might be a fluke and resubmitted. The resubmitted run successfully continued past the timestep where it crashed before, but about 100 years later it crashed again with the same failed assertion.

I'm wondering what my next troubleshooting steps should be. I could resubmit with DEBUG=TRUE, but the trouble is that the error isn't reliably reproducible. There's also already a fair amount of traceback; I'd paste it here but it's too many characters. It's in /glade/u/home/samrabin/cases_ctsm/spinup_ctsm5.1.dev092_I1850Clm50BgcCrop_f19-g17_pt2/logs/run_logs/cesm.log.4956786.chadmin1.ib0.cheyenne.ucar.edu.220711-233504.
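In the meantime, one quick sanity check on the bad-node theory is to tally which hosts the MPT aborts name across the run logs. This is just a sketch; the `cesm.log.*` glob is an example, so point it at your case's actual run-log directory:

```shell
# Sketch: count how many MPT abort messages name each host, so crashes
# that cluster on a single node stand out. Adjust the glob to your logs.
grep -h "Host:" cesm.log.* \
  | sed 's/.*Host: \([^,]*\),.*/\1/' \
  | sort | uniq -c | sort -rn
```

If one node dominates the counts across otherwise-unrelated failures, that's a point in favor of reporting it to CISL as a hardware problem.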

Any suggestions much appreciated.

Sam Rabin
Member
(I should say—that log is from the second failed run, whereas the error snippet in my post is from the first one.)

Keith Oleson
CSEG and Liaisons
Staff member
There was a post today about encountering the same error:


The suggestion was that it might be a machine (Cheyenne) error and to resubmit.

I see that error was reported on node r1i5n32, the same node your error was reported on, so I wonder if this is a bad node...

Sam Rabin
Member
Yes, resubmitting did work initially, though it's odd that it ended up happening again. And you're right: the second failure also happened on that same node! I'll try re-resubmitting…

minminfu

Member
I just had a similar issue; I find resubmission often fixes MPT errors. On Cheyenne, these are the nodes my failed job ran on:

Code:
exec_vnode = (r12i0n33:ncpus=36)+(r12i0n31:ncpus=36)+(r1i2n15:ncpus=36)+
(r8i1n27:ncpus=36)+(r2i6n4:ncpus=36)+(r3i1n2:ncpus=36)+(r7i0n22:ncpus=36)+
(r6i3n1:ncpus=36)+(r3i3n33:ncpus=36)+(r13i2n27:ncpus=36)+(r14i5n6:ncpus=36)+
(r14i5n25:ncpus=36)+(r1i1n24:ncpus=36)+(r12i4n11:ncpus=36)+(r10i4n15:ncpus=36)+
(r14i2n35:ncpus=36)+(r10i7n16:ncpus=36)+(r1i3n25:ncpus=36)+(r3i4n7:ncpus=36)+
(r1i2n5:ncpus=36)+(r6i2n20:ncpus=36)+(r1i5n27:ncpus=36)+(r3i7n21:ncpus=36)+
(r9i1n25:ncpus=36)
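For anyone wanting to cross-check their own jobs, the PBS exec_vnode string can be flattened into a plain node list that's easy to scan for a known-bad node. A sketch, with the value shortened to the first three entries above:

```shell
# Sketch: turn a PBS exec_vnode string into one node name per line.
# The value here is a shortened example of the full list above.
exec_vnode='(r12i0n33:ncpus=36)+(r12i0n31:ncpus=36)+(r1i2n15:ncpus=36)'
echo "$exec_vnode" | tr '+' '\n' | sed 's/[()]//g; s/:.*//'
# prints r12i0n33, r12i0n31, r1i2n15, one per line
```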