Summary: At 09:04 we had a 30-second blip in InterXion. This happened while we were bringing a series of new switches online. Ordinarily this operation is not service affecting; however, for some reason it caused a problem on a number of VLANs.
All service was restored on or before 09:05 and the case is closed.
Summary: Today at 16:39 customers on several firewall segments were impacted by high packet loss. This was the result of an as yet unknown event; however, we were able to restore service in a timely manner by forcing the "active" firewall into standby, thus forcing the standby firewall to take over.
We're currently investigating the cause further and examining syslogs, port graphs, CPU graphs etc. We will post a conclusion to the issue once our engineering team have completed their investigations.
Current situation as of 17:10:
Network connectivity was restored within 10 minutes of the actual event occurring. However, some services, like e-mail on our shared hosting cluster, will take some time to recover, as there is an extraordinary number of inbound connections into the cluster and the load balancer is doing its best to divvy out connections to the mail servers.
Update: 18:10
Mail services should be back to normal, but if there are any issues let us know.
Summary: We're currently experiencing a network-wide outage that has cascaded from a fibre failure in the Dublin area. A brief fibre outage on one of our metro rings seems to have caused some internal routing issues and in turn caused some of our BGP sessions with carriers in InterXion to drop or become non-responsive.
This issue is currently being resolved. We'll post an RFO once more information is gathered from logs etc.
Current status: All Clear
We experienced an issue today where one of our transit routers had a problem due to the carrier's router flapping its BGP session. We've shut down this BGP session and had a conversation with the carrier about the issue. While BGP was reconverging, some people may have noticed some latency or slowness to connect. This was due to the routes moving from one ISP to another.
Currently Global Crossing and Cogent are carrying our traffic. Level(3) are out of the picture for the moment.
Further updates will be posted as we have them.
Update: 17:05
Packet Exchange have acknowledged an issue on their network. One of their customers was advertising the Level(3) router IP into the same VLAN we're in. So Level(3) and this other customer were fighting for the IP. This session is still down.
As an interim measure, to ensure we continue to offer the same high level of service that our customers are used to, we're putting Tiscali back into the loop. This brings us back to 3 live carriers. This is happening now and Tiscali should be starting to take traffic away from Cogent and Global Crossing.
Further updates will be posted until this issue is resolved.
Update: 14:55 May 21st
Packet Exchange report that this issue is resolved. We've also received an RFO. With this in mind we'll be turning Level(3) back up at 18:30 this evening. This will cause some brief latency while routes reconverge. A final update will be posted once our engineering team confirm everything is OK after the circuit is brought back up.
Update: 20:30 May 21st
Level(3) has been live for the past 2 hours and all is looking well. This is the last update on this particular ticket. We're closing it now.
Summary: At approx 20:12 this evening we had a network event that caused us to lose peering with Packet Exchange eXpress and Cogent. This represents approx 60% of our external network traffic to the internet. We were able to see that it originated from 1 VLAN in Data Electronics and we're working to find a cause.
Connectivity is restored and has been for some time, but we've had reports of patchy connectivity.
Update: A sequence of events caused a number of our internal and external peers to drop during an 8-minute window last night.
The events went as follows:
20:12 Global Crossing peering goes down, reconvergence begins and our route reflectors recalculate best routes to the internet.
20:12 30 seconds after the GC event, our Cogent peering on the same router flaps. Again our route reflectors recalculate the best routes to the internet.
20:14 One of our core routers in InterXion bounces its internal iBGP peerings and some OSPF peerings. This caused another cascaded recalculation of routes and at this point one of our route reflectors in DEG crashed. At this time all traffic traversing DEG (as this was the live router) stopped routing. Approx 30 seconds later its BGP partner took the load and traffic started to flow again.
20:15 Several INEX peerings on LAN#2 flapped and we also saw 13 flaps of our peerings with the Packet Exchange eXpress route servers.
All the above BGP peering flaps caused reconvergence within our network. From some vantage points this made it look as though the network was down, but the internet and ourselves were just figuring out the best path in and out to us. The network stabilised at 20:20 and has been fine since.
We're investigating further for the root cause of this issue.