November 2008 Archives

Summary:

At approximately 12:10 and 19:06, websites for some customers on two web servers on our new system were showing MySQL errors.

We've traced this issue back to a VLAN interface for a VPN that was misconfigured with the wrong master and secondary IP addresses. In each case the interface was shut down, and it remains shut down.

The affected machines were pemlinweb08 and pemlinweb09. Now that we've found the cause of the problem, we've permanently fixed it, so it won't recur when we bring the interface online again. We'll be monitoring internal communication between the web and DB servers for a couple of days to ensure that there are no other issues.

We will be updating the software on our Windows VPS platform on Wednesday evening. This will require each hardware node to be rebooted.

As each VPS will need to be manually stopped and started on either side of the reboot, this will result in a downtime window of approximately 90 minutes.

 

Update: 27/11/2008 00:08 This update is now complete, and the maintenance window is closed.

The shared server "Rivalin" is currently experiencing issues.

Our technical team are aware of the issues and are working on a resolution.

Update 17:20: The server in question has been rebooted and services should be coming back to normal shortly. If anyone has any issues, please let us know.

You can check server status here
When: Wednesday 19th of November @ 04:00 - 08:00
What: Core provisioning system major upgrade from version 2.7 HF05 to 2.8 HF01 (HF means Hot Fix)

Service hits:

Our CCP (https://cp.blacknight.com) will be down during this maintenance window. All core nodes will be getting software upgrades during this window. There will be hits of up to 4 minutes per shared hosting node, VPS hardware node and some other critical service delivery systems.

This upgrade includes a number of bug fixes that we've been waiting for, as well as some additional features that we've wanted for some time. We'll post more details of these once the upgrade has been performed.
Our domain order backend experienced issues earlier this evening; however, it should now be functioning as normal.


We are currently experiencing some technical difficulties with the nameservers for our older system:

ns.blacknightsolutions.com
ns2.blacknightsolutions.com

Any domains set up on these nameservers will not resolve at the moment. Domains that use any of our other services but not these nameservers will be fine.

We hope to have service restored fully as soon as possible. 

UPDATE 16:00: Service should be fully restored now.  Service might be a little slow until the nameservers fully recover but they are back and functioning now.

Update 23:49 November 20th

This outage was the result of two events.

Event 1)

Network issues between NS2 and our Dublin DB cluster. This caused the NS2 scripts to open multiple connections to the DB server, lock tables, and then fail to close the connections because of the communication issues.

Event 2)

The scripts on NS couldn't access the database because the required tables were locked, so the script wiped the BIND include file that it writes, which contains all the information for all our forward DNS.

Why this happened:

The code base for this system was written in 2004, when we had 500-odd domain names. Today this system serves DNS for close to 40k domain names. It was never built with this scope in mind. It was also never built to deal with partial failures. It was able to deal with not being able to reach the DB server, but not with connections opening and then subsequently failing.

What we've done to prevent this from recurring:

We've spent the guts of a week rewriting the code from the ground up. In doing this we've put several levels of protection in place that will prevent network issues, partial network issues or any other transient problem from affecting the BIND includes. Essentially, the scripts won't touch a file until they have successfully completed the transaction with the DB server. We've built in locking to prevent runs of the script from overlapping, and we've also fixed several bugs in the code that were causing other, non-service-affecting problems. Finally, we've built in a level of monitoring previously unavailable to us, so we'll be alerted immediately should the system have any problems writing out to files, locking files or even connecting to the DB cluster.
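
For anyone curious what that kind of protection can look like, here is a minimal sketch in Python. It is an illustration only and not our production code: the function names, file paths and the fetch_zones callable are all hypothetical, and it assumes a POSIX system. It shows the three ideas described above: a lock so two runs can't overlap, reading everything from the database before any file is touched, and swapping the new file into place in one step so a failure part-way through leaves the old file intact.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def single_run_lock(lock_path):
    """Hold an exclusive, non-blocking lock so two runs cannot overlap."""
    with open(lock_path, "w") as lock_file:
        try:
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            raise RuntimeError("another run is still in progress")
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def write_include_atomically(path, lines):
    """Write to a temporary file, then rename it over the live file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".include-")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write("\n".join(lines) + "\n")
            tmp.flush()
            os.fsync(tmp.fileno())
        # Atomic on POSIX when both paths are on the same filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file; the live file was never modified.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def rebuild_include(path, lock_path, fetch_zones):
    """Regenerate the include file only after the database read succeeds.

    fetch_zones is a hypothetical callable returning the include lines;
    if it raises (for example on a dropped connection), nothing is written.
    """
    with single_run_lock(lock_path):
        zones = list(fetch_zones())
        if not zones:
            raise RuntimeError("refusing to write an empty include file")
        write_include_atomically(path, zones)
```

The refusal to write an empty result is a deliberate choice in this sketch: an empty answer from the database is far more likely to be a fault than a genuine instruction to remove every zone.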
One of our older nameservers is causing some transitory issues at present.

Our technical team are aware of the issue and are working on a resolution.

UPDATE: This issue has been resolved.
The shared server 'bors' is currently having issues. We're working to get this back now. It's one of the older shared servers that we have and it's time for it to be replaced. We'll begin contacting customers soon to move them over to the new system.

Update: Nov 11, 10:31

At 15:39 on the 10th of Nov this server came back fully. Sorry for not updating the post.
The above named server is currently experiencing issues. We're working to resolve them at the moment and it should be back working shortly. Please address support queries to our support team via e-mail support (at) blacknight.com or via the web https://support.blacknight.ie

Update: 12:50

This server is back now. Similar to 'iseults' issue last week, we can attribute this downtime to an attack on a customer's website.

We're monitoring the situation closely.
We're currently working on an issue with this machine. It appears that the load shot up and that the machine has become unresponsive. We believe this is due to one customer's site being effectively DoS'd. More information will be posted when we have it.

Update: 10:12

This machine is back up now; it was the same issue as Monday. We're contacting the customer on the receiving end to see what we can do to mitigate this issue in future.
Your billing tab in your control panel (https://cp.blacknight.com) is currently unavailable. We're working on a problem with it at the moment and currently we don't have an ETA for a fix.

Hopefully we'll get it back up and running shortly.

Resolution: One of the containers on our BM node was core dumping. This turned out to be a log file issue, which has been raised with our development people to ensure that a permanent fix is put in place so this doesn't recur.

Timeline: 09:30 BM goes down - 10:28 BM comes back up
If you're on one of our new shared Linux hosting plans (minimus, medius or maximus), you may be pleased to learn that we have upgraded PHP5 to include some extra modules.

This change means that we can now support Magento, which is proving to be a very popular option for ecommerce sites.

SugarCRM users will also be able to take advantage of IMAP and IMAPS functionality.

If you have any queries please let us know