Slough outage

Incident Report for Simwood

Resolved

The Brocade switch stack has been stable since recovery, the Slough nodes of various clusters have now recovered, and billing has caught up.

The cause was a software failure in the stack of switches sitting between the routers in Slough and all hosts. The switches were alive but unresponsive, even over a directly connected serial cable. Equipment is intentionally connected to multiple switches to survive a hardware failure, each switch in the stack being distinct hardware, but this failure rendered the entire stack inoperable, and it could not be recovered with a simple power-cycle. We believe this to be an uptime-related software bug that corrupted the stack configuration. Our previously announced plan to replace Brocade equipment with Arista and to simplify the network architecture (i.e. remove the separate access layer and any kind of stacking) continues.

Whilst real-time telemetry was impaired, retrospectively we saw a substantial uptick in traffic at other sites, suggesting most customers were correctly failing over and voice service remained fully operational. This failover was partly triggered manually by us amending DNS records to swing traffic directed at Slough over to other sites. Three customers have, however, reported errors from traffic directed to London exclusively without failover elsewhere; we believe we have identified the cause of those and will follow up with those affected. We also know (and reported here) that some inbound calls would have seen elevated connection times as they arrived in Slough and were rejected to other sites, but they were connecting correctly. We tried unsuccessfully to reach BT's NMC to re-route these. Overall though, voice volumes remained normal despite the loss of our most significant site, and that is our primary goal in situations such as this.

We'd like to apologise to those customers who were affected by this. Our Operations Desk will happily work with customers on any interop issues this brought to light.

Posted Mar 18, 2017 - 14:35 UTC

Monitoring

The Slough switch stack was completely recovered at 12.49. We are presently monitoring service restoration.

Posted Mar 18, 2017 - 12:58 UTC

Update

We have partially recovered the Slough switch stack, which has restored connectivity to many services there. Calls remain failed over to other sites as we attempt to bring the remaining units back into service.

Posted Mar 18, 2017 - 12:39 UTC

Update

Remote hands are on site.

Outbound calls configured per our interop are unaffected but 50% of inbound calls might experience elevated connect times.

Posted Mar 18, 2017 - 11:46 UTC

Identified

We requested remote hands intervention at 11.23.

Posted Mar 18, 2017 - 11:36 UTC

Investigating

We appear to have lost an access switch stack in Slough.

We are investigating.

Posted Mar 18, 2017 - 11:25 UTC