Notification Type
Emergency Maintenance
Service Affecting
Yes
Message
Summary: We're currently experiencing a network problem. We're working to diagnose it and get everything that is down back up.
Initial findings: We've found some MAC-related log entries on some of our core network switches. These point to a layer 2 loop within the network, and they began at 12:48. The knock-on effect was layer 2 port flaps followed by BGP and OSPF flaps localised to that data centre. Any traffic passing through that data centre via Cogent or INEX LAN2, in or out, would have been dead in the water.
At 12:56 the flaps stopped and the routers began to stabilise. It took a few minutes after this for everything to calm down. We're still investigating this right now and we'll post more information as it's available.
RFO: This outage was caused by an ethernet loop within our network in the InterXion data centre. A new piece of equipment was introduced to the network earlier this week fully configured, and as such it caused no issues. While this equipment was being worked on, its configuration was wiped, and upon reboot it appeared on the network and caused a network loop. This caused a spanning tree event in our core switching fabric in InterXion. The result was the network outage that was observed. Events like this normally can't occur, and we have strict provisions in place to prevent them; in this instance, however, a third-party piece of equipment caused the issue. It wasn't immediately evident that this would occur, as the device in question had been tested in our lab for 4 weeks prior to its deployment within the data centre. We take every precaution when working on our network, but events like this can still occasionally occur. It should not happen in future, and we're working on ways to prevent it happening again. At the very least we hope to contain issues to individual racks rather than the entire ethernet fabric in that data centre.
Any explanation on cause of outage?
Hi Dermot,
We'll have a report once the engineers have fully analysed all the logs. Initial findings are that we saw a layer 2 loop in one of our data centres, which caused a cascading layer 2 network problem. That in turn caused layer 3 flaps for BGP and OSPF, which meant we had a heap of equipment doing route re-calcs, and while they were doing this they were forwarding no traffic.
I'm going to update the blog post with some of the above info now.
Paul