DNS.be, the .be registry operator, has informed us of scheduled maintenance next week.
When?
Tuesday 24 November 2009
What time?
1730 - 2200 CET (Central European Time)
What will be affected?
All registration services and whois.
Existing .be domain names will not be impacted.
ns2.blacknightsolutions.com (82.96.97.64) is being moved to new hardware. The new server has already been set up and tested. All that's left to do is to move it to its proper IP address.
There should be no visible downtime during the swap.
The .be registry (dns.be) is conducting some scheduled maintenance this afternoon from 5pm onwards.
The maintenance is expected to last about 10 minutes.
It will only affect new .be domain registrations that are being processed.
We are currently experiencing some technical difficulties with the nameservers for our older system:
ns.blacknightsolutions.com
ns2.blacknightsolutions.com
Any domains set up on these nameservers will not resolve at the moment. Domains that use our other services but not these nameservers are unaffected.
We hope to have service restored fully as soon as possible.
UPDATE 16:00: Service should now be fully restored. Responses might be a little slow until the nameservers fully recover, but they are back up and functioning.
Update 23:49 November 20th
This outage was caused by two events.
Event 1)
Network issues between NS2 and our Dublin DB cluster caused the scripts on NS2 to open multiple connections to the DB server and lock tables. Because of the communication problems, those connections never closed.
Event 2)
The scripts on NS couldn't access the database because the tables they needed were locked, so the script wiped the BIND include file it writes, which contains all the information for all of our forward DNS.
Why this happened:
The code base for this system was written in 2004, when we had 500-odd domain names; today the system serves DNS for close to 40,000 domain names. It was never built with this scale in mind. Nor was it built to deal with partial failures: it could cope with being unable to reach the DB server, but not with connections that opened and then subsequently failed.
What we've done to prevent this from recurring:
We've spent the guts of a week rewriting the code from the ground up. In doing so we've put several levels of protection in place that will prevent network issues, partial network issues or any other transient problem from affecting the BIND includes. Essentially, the scripts won't touch a file until they have successfully completed the transaction with the DB server. We've built in locking to prevent script runs from overlapping, and we've also fixed several bugs in the code that were causing other, non-service-affecting problems. Finally, we've built in a level of monitoring previously unavailable to us, so we'll be alerted immediately should the system have any problems writing out files, locking files or even connecting to the DB cluster.
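For readers curious what "won't touch a file until the transaction succeeds" plus "locking to prevent overlapping runs" can look like, here is a minimal sketch. It is not our actual code: the function names, the lock file, the `fetch_rows` callback standing in for the DB query and the one-line zone stanza format are all hypothetical. It assumes a Unix host (it uses `fcntl.flock`) and shows the general pattern: take a non-blocking exclusive lock, fetch everything first, refuse to write on any error or an empty result, then write to a temporary file and atomically swap it into place with `os.replace`.

```python
import fcntl
import os
import tempfile


def render_zone_includes(rows):
    """Render one BIND zone stanza per (domain, zone_file) row (hypothetical format)."""
    return "".join(
        f'zone "{domain}" {{ type master; file "{zone_file}"; }};\n'
        for domain, zone_file in rows
    )


def regenerate_include(include_path, fetch_rows, lock_path):
    """Rewrite include_path only if the DB fetch fully succeeds (sketch).

    fetch_rows is a hypothetical callable standing in for the DB query;
    it should raise on any connection or query failure.
    Returns True if the file was replaced, False if it was left untouched.
    """
    with open(lock_path, "w") as lock_file:
        # Non-blocking exclusive lock: if another run holds it, skip this run
        # instead of overlapping with it. Closing the file releases the lock.
        try:
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return False

        # Finish talking to the database before touching any file.
        try:
            rows = fetch_rows()
        except Exception:
            return False  # DB unreachable or failed mid-way: keep the old include
        if not rows:
            return False  # never replace a working include with an empty one

        content = render_zone_includes(rows)

        # Write to a temp file in the same directory, flush it to disk,
        # then atomically rename it over the live include.
        directory = os.path.dirname(os.path.abspath(include_path))
        fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".include-")
        try:
            with os.fdopen(fd, "w") as tmp:
                tmp.write(content)
                tmp.flush()
                os.fsync(tmp.fileno())
            os.replace(tmp_path, include_path)
        except BaseException:
            os.unlink(tmp_path)  # clean up the partial temp file
            raise
        return True
```

The key design choice is ordering: a failed or half-open DB connection can only ever cause a run to be skipped, never an empty or partial include to be written, because the live file is replaced in a single atomic rename only after all the data is in hand.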
One of our older nameservers is causing some transitory issues at present.
Our technical team are aware of the issue and are working on a resolution.
UPDATE: This issue has been resolved
The "DEG Mesh", i.e. the network that Data Electronics provide to customers, has been experiencing intermittent issues today. We've escalated the issue to them and are awaiting a response.
It is not currently causing DNS failures, but DNS responses may take a few hundred milliseconds longer than normal.
Update: 9:45
The issue appears to be resolved, although we've had no formal notification of a fix as yet. I'll give an update once we've received a formal RFO (reason for outage).