Hi all,
I'm working on a regionally refined spectral element grid, and my case has failed with what looks like a fairly general MPT error. It's not one I've seen before, and the usual approach of searching the CESM forum for the error message hasn't turned anything up. The case runs OK for 3 months. After I resubmit for another 3 months, the model writes output files for months 4 and 5 and then crashes. There are no error messages in any of the component logs. The cesm.log is too large to attach here, but the error appears to be this:
244:MPT: rank 244 dropping unexpected RC packet from 256 (45523:45524), presumed failover - 2880 0 2816
383:MPT Warning: 09:56:19: rank 383: r12i7n17 HCA mlx5_0 port 1 had an IB
383: timeout with communication to r12i4n34. Attempting to rebuild this
383: particular connection.
1000:MPT Warning: 09:56:19: rank 1000: r13i6n3 HCA mlx5_0 port 1 had an IB
1000: timeout with communication to r13i5n23. Attempting to rebuild this
1000: particular connection.
751:MPT Warning: 09:56:25: rank 751: r11i4n26 HCA mlx5_0 port 1 had an IB
751: timeout with communication to r13i6n3. Attempting to rebuild this
751: particular connection.
251:MPT: rank 251 dropping unexpected RC packet from 255 (20685:35723), presumed failover - 64 0 -4696321144358065024
251:MPT: rank 251 dropping unexpected RC packet from 255 (59818:35723), presumed failover - 18 0 4469130340749704337
251:MPT ERROR: Extracting flags from IB packet of unknown length
251:MPT ERROR: Rank 251(g:251) is aborting with error code 0.
This looks like an MPT error to me, but I'm not sure what's causing it. This case configuration has run OK previously. I'm running CESM2.2.0 on Cheyenne here. Any pointers would be most welcome.
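Since the full cesm.log is too large to attach, a condensed summary of the MPT messages may be easier to share. Below is a rough Python sketch for that. It is not part of CESM or MPT: the function name `summarize_mpt` and the regular expressions are assumptions based only on the line formats in the excerpt above, so they may need adjusting for other MPT message types.

```python
import re
from collections import Counter

# Patterns inferred from the excerpt above (assumptions, not an MPT spec).
MPT_LINE = re.compile(r"^(\d+):MPT( Warning| ERROR)?:")
LOCAL_HOST = re.compile(r"rank \d+: (\S+) HCA")
REMOTE_HOST = re.compile(r"timeout with communication to (\S+?)\.")


def summarize_mpt(lines):
    """Count MPT messages by severity and count hosts named in IB timeouts."""
    severity = Counter()
    hosts = Counter()
    for line in lines:
        m = MPT_LINE.match(line)
        if m:
            # Lines with no Warning/ERROR tag are counted as "Info".
            severity[(m.group(2) or " Info").strip()] += 1
        # Host names can appear on the first line of a warning (local host)
        # or on its continuation line (remote host).
        for pattern in (LOCAL_HOST, REMOTE_HOST):
            h = pattern.search(line)
            if h:
                hosts[h.group(1)] += 1
    return severity, hosts


# A few lines copied from the excerpt above, used as sample input.
SAMPLE = [
    "244:MPT: rank 244 dropping unexpected RC packet from 256 (45523:45524), presumed failover - 2880 0 2816",
    "383:MPT Warning: 09:56:19: rank 383: r12i7n17 HCA mlx5_0 port 1 had an IB",
    "383: timeout with communication to r12i4n34. Attempting to rebuild this",
    "1000:MPT Warning: 09:56:19: rank 1000: r13i6n3 HCA mlx5_0 port 1 had an IB",
    "1000: timeout with communication to r13i5n23. Attempting to rebuild this",
    "751:MPT Warning: 09:56:25: rank 751: r11i4n26 HCA mlx5_0 port 1 had an IB",
    "751: timeout with communication to r13i6n3. Attempting to rebuild this",
    "251:MPT ERROR: Extracting flags from IB packet of unknown length",
    "251:MPT ERROR: Rank 251(g:251) is aborting with error code 0.",
]

if __name__ == "__main__":
    severity, hosts = summarize_mpt(SAMPLE)
    print("Messages by severity:", dict(severity))
    print("Hosts named in IB timeouts:", hosts.most_common())
```

To run it on the real log, pass an open file handle instead of `SAMPLE`, for example `summarize_mpt(open("cesm.log"))`. A host that shows up repeatedly in the timeout warnings (r13i6n3 appears twice in the excerpt) might be worth mentioning when reporting the problem, though a count like this does not by itself identify the cause.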
Thanks,
James