Our initial diagnosis is that this was a software failure within the switch stack, which halted all functions involving the CPU. The hardware was unaffected, so the switches continued passing packets. The erratic nature of certain flows to hosts behind these switches could be explained by switch-stack hashing and, we believe, an inability to hash new flows. Unfortunately, there are no logs on or off the device that show any useful activity before or after the failure. Whilst this may have looked like a routing issue, routing to the site (from both on- and off-net) was unimpeded; certain flows simply did not make it across the switch stack.
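To illustrate why established flows could keep working while new ones behaved erratically: stacks and LAGs typically pin each flow to a member link by hashing its 5-tuple in hardware, independently of the CPU. The sketch below is purely illustrative; the function name and modulo scheme are our own, not the vendor's actual hash.

```python
import hashlib

def member_link(src_ip, dst_ip, src_port, dst_port, proto, n_links):
    """Map a flow's 5-tuple onto one of n_links stack/LAG members.

    Real switches use vendor-specific hardware hashes; this SHA-256
    plus modulo scheme is an illustrative stand-in only.
    """
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_links

# The same 5-tuple always lands on the same link, so traffic for
# established flows keeps forwarding in hardware even while the CPU
# (which would program state for new flows) is hung.
first = member_link("192.0.2.1", "198.51.100.7", 5060, 5060, "udp", 2)
again = member_link("192.0.2.1", "198.51.100.7", 5060, 5060, "udp", 2)
assert first == again
```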
Given we have at least two of everything, the separate switches in each site, just like the separate routers, should have mitigated this; the failure appears to be in the software that stacks them together. This affirms the rationale for the architectural changes we're making elsewhere in the network: a stack cannot fail if it doesn't exist.
A reboot of the stack (combined with the roll-back of remedial config alterations elsewhere) restored normal service at the Manchester site.
In terms of the timeline: our monitoring triggered at 4.02, indicating multiple hosts in Manchester were down. One engineer responded immediately, joined by a second at 4.06. Initial signs (and subsequent monitoring) suggested this might be a wider issue, but it was confirmed as Manchester-specific at 4.10.
By 4.13 we'd identified that the entire network was sound at layer 3, notably that routing into Manchester and the pair of active/active routers were operational, and an initial status update was made at 4.15. Concurrently, manual tests of core functions such as voice calls were made through other sites to confirm the diagnosis and eliminate unknown dependencies.
By 4.30, we had identified that the access switch stack in Manchester was unavailable because the LACP-bonded ports facing down to the switches in the stack from each of the two routers were UP but BLOCKED. With no obvious reason for this, we tried repeatedly to access the switch stack over out-of-band serial to investigate, but there was no output from the switch it was connected to.
At 4.43 a decision was made to dispatch an engineer to site to repatch the console cable, and potentially power-cycle the switches, whilst the other engineer worked to fully fail over services such as the API to other sites.
At 4.55, however, this plan was postponed when we removed LACP from the router side of the switch links, disabling everything other than the first link on each, and this enabled the ports to come up. The following hour was spent investigating the level of connectivity as off- and on-net monitoring continued to flap, whilst manual testing confirmed intermittent behaviour and packet loss across the switch stack.
At 6.00 we concluded it was too late to get on-site before the business day started, and asked Equinix to repatch the out-of-band console cable to another switch in the stack; owing to unnoticed error messages, the formal submission time was 6.17.
In the intervening period, we experimented with swapping which router port was active into which part of the stack, with variable impacts on traffic but no resolution. We proceeded with action to ensure clusters spanning other sites were production-ready without Manchester, given the possibility it wouldn't be back in time for the business day, whilst also investigating the erratic availability of the Manchester hosts we were seeing.
At 8.40, our Equinix ticket was escalated to the Customer Service Manager; at 9.10 we directly contacted the Equinix Manchester operations team, and at 10.13 we opened a further support ticket about the unacknowledged Smart Hands request.
At the time of writing, we've had no response to these or the original ticket.
The business day was now well under way and our call volumes were normal, confirming our belief that the issue was being handled by the action we'd taken and the systems in place.
At 11.14 a local network engineer was admitted to Equinix and, within a few seconds, performed the cable swap we'd requested at 6.17, but regrettably it made no difference. At 11.29, with a plan in place for total failure of the switch units, we asked him to power-cycle them, and they came back up at 11.40. By 11.44 we had removed the remedial changes, enabling the LAGs to reform from the routers; from there, the Manchester hosts, including connectivity customers behind the same switches, were fully available again.
As is hopefully clear, this issue was largely mitigated by both our architecture and our actions, and involved no outage to voice services to this point, whilst resolution itself was a matter of a very few minutes' work.
The extreme frustration is the failure of Equinix to enact our request or respond to our escalations. We have complained about their "Service Desk" in Slough before; whilst we need external assistance very, very rarely, in Telecity days, just as in other datacentres such as Telehouse, we would email the engineers directly and they would act in under 5 minutes.
We strongly believe this issue would have been resolved before 6.30am had it occurred in a non-Telecity datacentre, or prior to Equinix buying Telecity. It would be premature to conclude what action is required, but evidently we cannot rely on timely resolution of issues requiring on-site assistance from Equinix. That is a great shame, as in every other way they are operationally supreme; we have pointed this issue out repeatedly, but previously our only Equinix site was Slough.
Whilst call routing and voice services were unaffected, as we've said above, there was one issue that arose as a consequence of restoring services in Manchester. Our call-routing nodes each query a RAM-based dataset which is a slave of a master residing in Manchester. These are designed to function without the master being present, allowing call routing in each site to be wholly independent.
They tolerated the master being intermittently available, but the power-cycling of the switches and their return at 11.44 caused a full resync (rather than a resumption of replication). Resyncing caused the nodes to block for approximately 30 seconds as they applied the completely refreshed dataset, during which time they cannot be queried for calls. However, each voice gateway can query multiple call-routing nodes and fails over in the event of a time-out. Three customers have reported a high proportion of calls failing between 11.45 and 11.47, which suggests that more than one of the call-routing nodes their gateway tried was blocking at the same time. We will take away an action to investigate and remedy this.
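The fail-over behaviour described above can be sketched as follows. The class, node names, and API are wholly hypothetical (the real gateways query nodes over the network rather than in-process), but it shows why calls only fail when every node a gateway tries is blocking at once, as during a simultaneous resync.

```python
QUERY_TIMEOUT = 2.0  # assumed per-node time-out, in seconds

class RoutingNode:
    """Stand-in for a call-routing node; `blocking` models the ~30s resync stall."""

    def __init__(self, name, blocking=False):
        self.name = name
        self.blocking = blocking

    def query(self, number):
        if self.blocking:
            # A blocked node gives no answer within the gateway's time-out.
            raise TimeoutError(f"{self.name}: no reply within {QUERY_TIMEOUT}s")
        return f"route for {number} via {self.name}"

def route_call(number, nodes):
    """Gateway logic: try each call-routing node in order, failing over on time-out."""
    for node in nodes:
        try:
            return node.query(number)
        except TimeoutError:
            continue  # fail over to the next node
    # Only reached if every node blocked, which is when callers see failures.
    raise RuntimeError("all call-routing nodes blocked: call fails")

nodes = [RoutingNode("mcr", blocking=True), RoutingNode("lon")]
print(route_call("+441612345678", nodes))  # fails over to the "lon" node
```

The remedial action is then clear in these terms: stagger the resyncs (or keep serving the stale dataset while the new one loads) so that no gateway's full node list is ever blocking simultaneously.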
Finally, we'd like to apologise for any concern or inconvenience this may have caused. We do need to stress though that this was isolated to a single site and do not at this stage think our own response could have been improved. We will, of course, be having a full debrief and taking any lessons from this.