
Run fails after 5 months with unfamiliar error

James King
Member
Hi all,

I'm working on a regionally-refined spectral element grid and my case has failed with what looks like a fairly general MPT error, but it's not one I've seen before, and the usual 'search for the error message in the CESM forum' approach hasn't been useful. The case runs OK for 3 months. After I resubmit for another 3 months, the model writes output files for months 4 and 5, and then crashes. There are no error messages in any of the component logs. The cesm.log is too large to attach here, but the error appears to be this:

Code:
244:MPT: rank 244 dropping unexpected RC packet from 256 (45523:45524), presumed failover - 2880 0 2816
383:MPT Warning: 09:56:19: rank 383: r12i7n17 HCA mlx5_0 port 1 had an IB
383: timeout with communication to r12i4n34. Attempting to rebuild this
383: particular connection.
1000:MPT Warning: 09:56:19: rank 1000: r13i6n3 HCA mlx5_0 port 1 had an IB
1000: timeout with communication to r13i5n23. Attempting to rebuild this
1000: particular connection.
751:MPT Warning: 09:56:25: rank 751: r11i4n26 HCA mlx5_0 port 1 had an IB
751: timeout with communication to r13i6n3. Attempting to rebuild this
751: particular connection.
251:MPT: rank 251 dropping unexpected RC packet from 255 (20685:35723), presumed failover - 64 0 -4696321144358065024
251:MPT: rank 251 dropping unexpected RC packet from 255 (59818:35723), presumed failover - 18 0 4469130340749704337
251:MPT ERROR: Extracting flags from IB packet of unknown length
251:MPT ERROR: Rank 251(g:251) is aborting with error code 0.

This looks like an MPT error to me, but I'm not sure what's causing it. This case configuration has run OK previously. I'm running CESM2.2.0 on Cheyenne here. Any pointers would be most welcome.

Thanks,

James
 

mlevy

Michael Levy
CSEG and Liaisons
Staff member
Can you please share your case directory and run directory? Also, is this error reproducible? The log makes it appear to be a hardware issue, so I wouldn't be surprised if you could resubmit and run successfully. (It might also be useful to share these errors with CISL via help@ucar.edu)
 

James King
Member
Hi Michael,

Sure thing! The case is

/glade/work/jamesking/cases/FHIST_MUSICA_UK_grid2_test4_newtopo

with run dir

/glade/scratch/jamesking/FHIST_MUSICA_UK_grid2_test4_newtopo/run

The error is reproducible in the sense that I have resubmitted twice today and got the same error both times, at roughly the same point in the run (after history files for 2 months have been written to the case dir).

Thanks,

James
 

mlevy

Michael Levy
CSEG and Liaisons
Staff member
The runs are failing at two different points in the model; the first run made it to June 27th

Code:
tStamp_write: model date =   20090627       0 wall clock = 2023-05-16 02:47:39 avg dt =   290.84 dt =   426.94

but the second run only made it to June 3rd

Code:
tStamp_write: model date =   20090603       0 wall clock = 2023-05-16 09:52:52 avg dt =   296.66 dt =   291.81

This feels more and more like a hardware issue, although there don't seem to be common nodes failing in both runs... but I would start with emailing CISL before we dig too deeply into code.
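
(For what it's worth, one quick way to check how far each run got is to grep the coupler logs for those timestamp lines; this is just a sketch, and the exact log file names in your run directory may differ:)

Code:
# assumes the tStamp_write lines are in the coupler logs (cpl.log.*) in the run directory
$ grep "tStamp_write" cpl.log.* | tail -n 5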
 

James King
Member
Looks like it is some sort of hardware error, though trying the run again with the same settings causes it to crash with the same error after only a few days of running, so the issue is getting worse, not better! I've reported it to CISL.
 

mlevy

Michael Levy
CSEG and Liaisons
Staff member
Another option that I should have mentioned last week is that you can try running in DEBUG mode. To do that:

Code:
$ ./case.build --clean-all
$ ./xmlchange DEBUG=TRUE
$ ./case.build
$ ./case.submit

Note that this will run significantly slower than before, as we reduce the compiler optimization level and increase the amount of data written to the log files. But if the run crashes, there should be far more information pointing to where the problem occurred. If the model is crashing at different points in the run due to bad memory management (out-of-bounds indexing, pointing to deallocated memory, etc.), then DEBUG should catch it.
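
If you want to double-check that the setting took effect before submitting, you can query it from the case directory (output formatting may vary slightly between CIME versions):

Code:
$ ./xmlquery DEBUG
	DEBUG: TRUE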
 

James King
Member
Hi Michael,

I've been in communication with CISL and they're investigating. Infuriatingly, this happens with some runs and not others, even when the model configs are identical apart from a few different input files. I wondered if anyone else is seeing this?

James
 

akhtert

Tanjila Akhter
New Member
Hi James, I am having a similar problem with my global run. The same model config and setup ran for 9 years and then stopped running. The problem has been there for about a month now! I have resubmitted and re-created the case about 15 times, and the same thing happens. However, the same settings run if I just use a cold start, but not for a run with initial data. Do you have any update from CISL? I am planning to contact CISL.

best
Tanjila
 

James King
Member
Hi Tanjila,

I've been in contact with CISL and they're investigating. In the meantime, you might find this patch works: in your CESM root there should be a script which generates a .case.run script at model runtime:

Code:
<cesmroot>/cime/config/cesm/machines/template.case.run

Add the following line near the start of the script, below the # Batch system directives comment:

Code:
#PBS -l place=group=rack

This forces the model to run within a single server rack on Cheyenne, which may help, as CISL think the issue is related to the hardware interconnects.
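
If it's useful, one way to double-check that the directive actually ends up in the generated script is something like this from the case directory (just a sketch; whether .case.run needs to be regenerated with case.setup --reset first may depend on your CIME version):

Code:
$ ./case.setup --reset
$ grep "place=group=rack" .case.run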

Hope that helps - it's worked with some of my runs.

James
 

akhtert

Tanjila Akhter
New Member
Hi James,

I tried this but it did not work for me. May I know how many nodes you are using for the cases that ran?

Thank you again
Tanjila
 

James King
Member
Hi Tanjila,

It did take a few attempts to get this to work, rebuilding the case each time and decreasing the number of resubmissions. This worked for me on a CLM-only case which used 51 nodes. In theory it should work for <250 nodes.

Hope that helps,

James
 

Ming Chen
New Member
Hi,
I added a high-resolution grid (60-3 km) to CESM2.2.0 and tried to run an I compset to produce CLM initial data for my coupled CAM-MPAS run.
However, the job stopped after 21 days of integration with the error message below:

Code:
MPT: rank 3110 dropping unexpected RC packet from 3238 (406:854), presumed failover - 400 0 336
MPT: Received signal 15

Does anyone have insights into what is wrong?

My case is located at /glade/scratch/chenming/JJ/run
 

James King
Member
Ming Chen - it's a hardware problem which should be fixed with some repair work in August. Per the ARC Daily Bulletin from 26th June:

"Users may continue to experience a higher rate of job failures than typical, particularly at large node counts. The following error messages are likely related to this network path error:

  • ERROR: Extracting flags from IB packet of unknown length
  • Transport retry count exceeded on mlx5_0:1/IB
  • MPT: rank XXX dropping unexpected RC packet from YYY …, presumed failover
  • Hung applications that eventually time out
  • (no error message, but no output, either)

These messages may occur in application logs, and the failure modes can include immediate job termination or application hangs.

Until the network is repaired, current remediation options remain limited. Users are encouraged to resubmit failed jobs and include the PBS directive “#PBS -l place=group=rack” in their batch scripts when requiring 250 nodes or less. This will request PBS to select nodes from the same rack, perhaps reducing – but likely not eliminating – the impact of the failed switches. Users are also encouraged to reach out at the NCAR Research Computing (RC) Helpdesk to request core-hour refunds if significantly impacted by these ongoing disruptions."
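
(For reference, for jobs submitted directly with PBS rather than through CIME, the directive just goes in the job script header alongside the other #PBS lines; this is only an illustrative sketch, and the job name, project code, and resource selection are placeholders:)

Code:
#!/bin/bash
#PBS -N my_run                          # placeholder job name
#PBS -A PROJECT_CODE                    # placeholder project/account code
#PBS -l walltime=01:00:00
#PBS -l select=4:ncpus=36:mpiprocs=36
#PBS -l place=group=rack                # ask PBS to place all nodes in one rack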
 

akhtert

Tanjila Akhter
New Member
Hi James,

I was wondering if you can still run your high-resolution cases using #PBS -l place=group=rack. No solution has worked for me since!

Best
Tanjila
 

James King
Member
Hi Tanjila,

I've successfully run high-res CAM-SE (MUSICA), but only for 5 days because I needed to test some input files. We've done production runs at high resolution (0.25 deg) with CLM5 only, with no problems (yet). I'm holding off on doing anything ambitious with an atmosphere on Cheyenne until the repairs in August.

James
 