Notification Type
Emergency Maintenance
Service Affecting
Yes
Message
We are currently experiencing some technical difficulties with the nameservers for our older system:
ns.blacknightsolutions.com
ns2.blacknightsolutions.com
Any domains set up on these nameservers will not resolve at the moment. Domains that use any of our other services, but not these nameservers, will be fine.
We hope to have service restored fully as soon as possible.
UPDATE 16:00: Service should be fully restored now. Service might be a little slow until the nameservers fully recover but they are back and functioning now.
Update 23:49 November 20th
This outage was the result of two events.
Event 1)
Network issues between NS2 and our Dublin DB cluster. These caused the NS2 scripts to open multiple connections to the DB server and lock tables. Because of the communication issues, those connections never closed.
Event 2)
The scripts on NS couldn't access the database because the required tables were locked, so the script wiped the BIND include file that contains all the information for all our forward DNS.
Why this happened:
The code base for this system was written in 2004, when we had 500-odd domain names. Today this system serves DNS for close to 40k domain names. It was never built with this scale in mind. It was also never built to deal with partial failures: it could cope with not being able to reach the DB server at all, but not with connections that opened and then subsequently failed.
What we've done to prevent this from re-occurring:
We've spent the guts of a week re-writing the code from the ground up. In doing this we've put several levels of protection in place that will prevent network issues, partial network issues or any other transient problem from affecting the BIND includes. Essentially, the scripts won't touch a file until they have successfully completed the transaction with the DB server. We've built in locking to prevent runs of the script from overlapping, and we've also fixed several bugs in the code that were causing other, non-service-affecting problems. Finally, we've built in a level of monitoring previously unavailable to us, so we'll be alerted immediately should the system have any problems writing out to files, locking files or even connecting to the DB cluster.
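For readers curious what this kind of protection can look like, here is a minimal sketch in Python. It is not our production code. The names fetch_zone_lines, write_include_atomically and regenerate, and the file paths, are hypothetical, and the database query is replaced by a stand-in. It illustrates two of the ideas above: only one run at a time (a non-blocking file lock), and never touching the live include file until the data has been fetched in full (write to a temporary file, then rename over the original).

```python
import fcntl
import os
import tempfile


def fetch_zone_lines():
    """Hypothetical stand-in for the DB query; a real one raises on failure."""
    return ['zone "example.com" { type master; file "example.com.db"; };']


def write_include_atomically(path, lines):
    """Write lines to a temp file in the same directory, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".include-")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write("\n".join(lines) + "\n")
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the data is on disk first
        os.replace(tmp_path, path)  # atomic swap on POSIX filesystems
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def regenerate(path, lock_path, fetch=fetch_zone_lines):
    """Return True if the include file was rewritten, False if the run was skipped."""
    with open(lock_path, "w") as lock:
        try:
            # Non-blocking exclusive lock: a second run gives up immediately.
            fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return False
        try:
            lines = fetch()
        except Exception:
            return False  # DB trouble: leave the existing file untouched
        if not lines:
            return False  # never replace a good file with an empty one
        write_include_atomically(path, lines)
        return True
```

With this shape, a failed or half-finished database read leaves the previous include file in place, so the nameserver keeps serving the last known good data. A real script would also log and alert on each of the skipped-run cases, and would set the file permissions the nameserver needs, since mkstemp creates files readable only by their owner.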