The hardware node for the MySQL servers requires an emergency reboot. The following MySQL servers are affected:
mysql873.cp.blacknight.com
mysql870.cp.blacknight.com
We anticipate they will return in 5-10 minutes, but will update this post should there be any further developments.
Update @ 09:38: Both servers are now fully back up. Apologies for any inconvenience.
Server mysql71.cp.blacknight.com with IP 81.17.254.45 is being rebooted to resolve a load issue.
Update @ 10:55: The container has had to be taken offline for a RAID resync. We will update with more information shortly.
Update @ 11:25: We are going to migrate the server to another hardware node, as this will provide the quickest solution. This will take approximately 45 minutes.
Update @ 13:25: The server is now coming back online on the original hardware. We will look at moving it to new hardware in the coming days.
Server mysql71.cp.blacknight.com with IP 81.17.254.45 rebooted and is running a file system check.
Once it is back up we will update this page.
Update: This post previously referenced the wrong server. It has been updated to reflect the correct MySQL server affected.
Update @ 15:10: This server is now back up.
The MySQL server mysql452.cp.blacknight.com with IP address 81.17.254.3 will be offline for approximately 30 minutes from 7am Tuesday the 4th of October for scheduled maintenance.
Summary: pemvzmps42 and pemvzmps63 require a reboot to facilitate maintenance. The servers will be offline from 8am Monday the 25th of July for approximately 1 hour.
The following VPS servers will be down during this operation:
pemlinweb38.blacknight.com
mysql519.cp.blacknight.com
pemlinweb64.blacknight.com
pemlinweb63.blacknight.com
Update @ 08:50: Completed successfully. The two nodes were down until 08:40.
Summary: Customers have reported slow database access to us. We're investigating this at present. It doesn't appear to be a problem on the MySQL servers themselves or a network issue. It may well be a problem with DNS. We'll post further updates when we have more information.
Update @ 09:40: We believed that we had found a fix for this and notified customers accordingly; however, we're still working on this issue. It is most certainly a DNS problem. It appears to be related to the stateful firewall on the client DNS servers.
Update @ 09:51: This issue is resolved for the moment. We're reviewing a number of factors that may have caused this issue. It _is_ DNS-related, but we haven't found the exact cause just yet.
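For customers who want to check whether slow database connections on their side are down to name resolution rather than the MySQL server itself, the following is a minimal sketch in Python. It is not the procedure we used internally. The function names `time_lookup` and `classify`, the default port 3306 and the 1 second threshold are all assumptions made for illustration.

```python
import socket
import time


def time_lookup(hostname, resolver=socket.getaddrinfo, port=3306):
    """Time a single name lookup; return (seconds_taken, error_or_None).

    `resolver` defaults to the system resolver but can be swapped out,
    which makes the function easy to exercise without a network.
    """
    start = time.perf_counter()
    try:
        resolver(hostname, port)
        error = None
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        error = exc
    return time.perf_counter() - start, error


def classify(lookup_seconds, threshold=1.0):
    """Label a lookup time; the 1 second threshold is an arbitrary assumption."""
    return "dns-slow" if lookup_seconds >= threshold else "dns-ok"
```

Calling `time_lookup` with the hostname of your MySQL server and passing the elapsed time to `classify` gives a rough indication only. A fast lookup followed by a slow connection would point away from DNS.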
Summary: We're currently experiencing a network problem. We're working to diagnose it and get everything that is down back up.
Initial findings: We've found some MAC-related log entries in some of our core network switches. These point to a layer 2 loop within the network, beginning at 12:48. The knock-on effect was layer 2 port flaps followed by BGP and OSPF flaps localised to that data centre. Any traffic passing through that data centre via Cogent or INEX LAN2, inbound or outbound, would have been dead in the water.
At 12:56 the flaps stopped and the routers began to stabilise. It took a few minutes after this for everything to calm down. We're still investigating this right now and we'll post more information as it's available.
RFO: This outage was caused by an Ethernet loop within our network in the InterXion data centre. A new piece of equipment was introduced to the network earlier this week fully configured, and as such it caused no issues. While this equipment was being worked on, its configuration was wiped, and upon reboot it appeared on the network and caused a network loop. This caused a spanning tree event in our core switching fabric in InterXion. The result was the network outage that was observed. Typically events like this can't occur, and we put strict provisions in place to prevent them; however, in this instance a third-party piece of equipment caused the issue. It wasn't immediately evident that this would occur, as the device in question had been tested in our lab for 4 weeks prior to its deployment within the data centre. We take every precaution when working on our network, but events like this can still sometimes occur. It should not happen in future, and we're working on ways to prevent it happening again. At the very least we hope to contain issues to individual racks rather than the entire Ethernet fabric in that data centre.
We're currently experiencing an issue with mysql71.cp.blacknight.com/mysql71int.cp.blacknight.com. We are investigating at the moment.
Update @ 22:14: This issue is now resolved. It appears to have been an issue we've seen in the past in Virtuozzo, where the container swaps constantly and generates extensive I/O. Once we updated the kernel and rebooted the machine, it came back. We're monitoring this node closely to see if the issue recurs.
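As an illustration of how constant swapping can be spotted on a Linux host, the sketch below compares two samples of the `pswpin` and `pswpout` counters from `/proc/vmstat`. It is not the monitoring we run on this node, and counters seen inside a Virtuozzo container may not reflect the hardware node. The function names and the sampling interval are assumptions.

```python
def parse_swap_counters(vmstat_text):
    """Extract cumulative pages swapped in/out from /proc/vmstat-style text."""
    counters = {"pswpin": 0, "pswpout": 0}
    for line in vmstat_text.splitlines():
        parts = line.split()
        # Each line is "<name> <value>"; keep only the two swap counters.
        if len(parts) == 2 and parts[0] in counters:
            counters[parts[0]] = int(parts[1])
    return counters


def swap_pages_per_second(before, after, interval_seconds):
    """Pages moved to or from swap per second between two samples."""
    moved = (after["pswpin"] - before["pswpin"]) + (
        after["pswpout"] - before["pswpout"]
    )
    return moved / interval_seconds
```

In use, the text of `/proc/vmstat` would be read twice a known number of seconds apart and both samples passed to `swap_pages_per_second`. A rate that stays well above zero over several samples suggests sustained swapping.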