July 2010 Archives

Nominet Scheduled Maintenance - 3 August 2010


Nominet have informed us of scheduled maintenance work on August 3rd 2010 between 0700 and 0800

During this period online services and updates will not be available for *.uk domain names

Fully registered domain names will not be impacted

Emergency Reboot - pemmysql05


The above MySQL node had to be rebooted. It is on its way back, but is doing a file system check, which may take up to 10 minutes.

 

Update: 07:59 - This server has now returned to production.

Com / Net Scheduled Maintenance

Verisign, the registry operator for .com and .net, will be conducting routine maintenance between 0100 and 0145 UTC tomorrow morning, July 25 2010

During this window we will be unable to process new domain registrations or updates for .com and .net

Registered .com and .net domains will not be impacted

Transient Issue With .Co

There is currently a transient issue affecting .co registrations.

We are currently unable to process new registrations and / or updates

UPDATE 0056 - this issue has been resolved


Mail Issues


Our technical team are currently working to resolve an issue with the mail cluster.

This issue is impacting some people's ability to log in.

More details once we have them

UPDATE 1045: the issue appears to be related to LDAP

UPDATE 1052: We have a ticket open with software vendors. Our technical team continues to work on the issue as well

UPDATE 1107: The technical team continue to work on the issue. Some people are reporting continued issues, while others appear to have full service

UPDATE 1130: We've made some adjustments to the mail cluster configuration. The technical team are still working on it

UPDATE 1154: The mail cluster is not stable. We are working on it. Once we are confident that the issue has been resolved we will update this blog post

We understand that people are frustrated with the mail issues this morning, but we are working on the issue and have been since we first became aware of it.

We do not know when the issue will be resolved as yet.

UPDATE 1210: In response to some of the queries and comments people have posted: the issue is specifically related to authentication, i.e. logging in to mail to send or receive. Mail sent to you from outside should not be affected.

Unfortunately, with the number of people trying to collect and send emails simultaneously, the servers are under high load, so service may be slower than normal.

UPDATE 1400: Email service is currently stable. If you are still having issues please contact our help desk

Emergency Reboot - PEMVZMPS13

We are scheduling an emergency reboot of the hardware node PEMVZMPS13. The kernel on the server is throwing some strange errors. We need to reboot it to unload some modules and fix them.

What's affected?
Two pemlinweb Linux shared hosting nodes:
81.17.254.89    pemlinweb08.blacknight.com     
81.17.254.91    pemlinweb09.blacknight.com

How long?
About 10 minutes from now, so it will be fully back online by 10:05AM.

We'll update this blog post once completed.

UPDATE 10:16 - The server has not come back online after the reboot. We have engineers en route to the server now to diagnose. ETA 15 minutes until they arrive.

UPDATE 10:27 - The issue has now been resolved.

PEMLINWEB02 And PEMLINWEB06 Unresponsive

The hardware node responsible for the above two pemlinweb nodes has become unresponsive due to high load. Engineers are currently on the way to investigate and bring the linwebs back up.

Update 18:10: The hardware node is back up, however we're now waiting for the "Quota Checks" to complete before it will bring up the linweb nodes. No ETA at the moment.

Update 18:25: Both pemlinweb02 and pemlinweb06 are back up and running. We still haven't tracked down what caused the load to spike so badly, but it is being investigated.

pemmysql06 experiencing problems

Summary: For the past couple of days we've been seeing high loads and extremely odd behaviour on this node. Today, simple things like repairing a table are causing segmentation faults (not good!). To make things slightly worse, the backups for this node haven't been working since July 18th, presumably because of the same issue.

Action: We're restoring this node to new hardware from the last known good backup, which was taken on July 17th. We'll take this node offline and put the new node live. We will then try to fix the old node offline, and if we're successful we'll be able to restore more recent database information for people.

We expect to have to take this node offline today at least once or twice. This downtime is completely unavoidable unfortunately.

What's affected?

All databases where your connection string will be:

mysql360.cp.blacknight.com
mysql360int.cp.blacknight.com
81.17.254.61
172.16.4.23
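
To check whether a site is affected, compare the database host in your application's connection settings against the endpoints listed above. The following is a minimal sketch in Python; the helper name `is_affected` is hypothetical and used only for illustration, and it simply compares strings without contacting the server.

```python
# Endpoints listed in this post for pemmysql06.
AFFECTED_HOSTS = {
    "mysql360.cp.blacknight.com",
    "mysql360int.cp.blacknight.com",
    "81.17.254.61",
    "172.16.4.23",
}


def is_affected(db_host: str) -> bool:
    """Return True if db_host (optionally "host:port") is a listed endpoint.

    Hypothetical helper for illustration only: it normalises the string
    and looks it up in AFFECTED_HOSTS. It does not connect to anything.
    """
    host = db_host.strip().lower()
    # Drop a trailing ":port" if one is present.
    if ":" in host:
        host = host.rsplit(":", 1)[0]
    return host in AFFECTED_HOSTS


if __name__ == "__main__":
    print(is_affected("mysql360.cp.blacknight.com:3306"))
    print(is_affected("localhost"))
```

If the check returns True for the host your site uses, the downtime described in this post applies to your databases.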

We will post further updates as we have them available. FYI, service is currently unaffected, but we expect the issue to get worse.

Update: PEMMYSQL06 is going to be brought down tonight at around 02:30 while we attempt to get the most up to date data off the current hardware node onto the new node. This has to be done while the server is offline. Unfortunately, we have no way of knowing how long this is likely to take. If this works, the most up to date data will be available on the new server.

Update: It looks like the box has died. An engineer is on the way to see if it can be brought back up.

Update 2217: Engineers are onsite and working on the server

Update 23:40:
The SQL server has been brought up on new hardware with no issues and with fully up to date data, so there has been no loss. We are monitoring the server to make sure there are no further issues, but so far all is looking good.

The maintenance scheduled for 02:30 has been cancelled as it's now not required. 

Shared Hosting Linux - Ector

We are currently experiencing issues with our shared hosting server Ector. Our engineers are working on resolving this asap.

3:00PM - This issue is now resolved.

FTP Issues to Windows and Linux hosting packages (Resolved)

Summary: After last night's maintenance window, FTP was not working properly from some IP ranges (not all). This was due to a missing policy-map on the firewall that instructs it to inspect all FTP traffic and track stateful and passive connections.

This policy-map was put back in place this morning at around 08:40, and since then FTP has been working from all NAT'd connections. Non-NAT'd connections wouldn't have experienced any connection problems to FTP.
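
For background on why FTP depends on this inspection: in passive mode the server announces the address and port for the data connection inside a control-channel reply of the form `227 Entering Passive Mode (h1,h2,h3,h4,p1,p2)` (RFC 959), so a stateful firewall or NAT device has to read that reply to know which data connection to allow. The sketch below shows how the data port is derived from such a reply. The function name `parse_pasv_reply` and the sample reply are made up for illustration; this is a picture of the protocol, not of the firewall configuration itself.

```python
import re
from typing import Tuple

# Six comma-separated decimal numbers in parentheses: four address
# octets followed by the high and low bytes of the port.
_PASV_RE = re.compile(r"\((\d+),(\d+),(\d+),(\d+),(\d+),(\d+)\)")


def parse_pasv_reply(reply: str) -> Tuple[str, int]:
    """Extract (host, port) from an FTP 227 passive-mode reply.

    Hypothetical helper for illustration. The port is p1 * 256 + p2,
    as defined for the PASV reply in RFC 959.
    """
    match = _PASV_RE.search(reply)
    if match is None:
        raise ValueError("not a passive-mode reply: %r" % reply)
    numbers = [int(group) for group in match.groups()]
    if any(n > 255 for n in numbers):
        raise ValueError("field out of range in reply: %r" % reply)
    host = ".".join(str(n) for n in numbers[:4])
    port = numbers[4] * 256 + numbers[5]
    return host, port


if __name__ == "__main__":
    # Sample reply using a documentation-only address (192.0.2.0/24).
    print(parse_pasv_reply("227 Entering Passive Mode (192,0,2,10,195,80)"))
```

Because the data port is only known from the contents of this reply, a firewall that does not inspect the control channel cannot associate the follow-up data connection with the session, which matches the symptoms seen on NAT'd connections.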

This issue is resolved.


Network issue affecting all firewalled services in InterXion

Summary: We're currently experiencing a problem on our network. Our network engineers are onsite looking at the issue.

Once we have further information we'll update this post.

ETA to fix: 5 minutes

Update: 17:18

A switch port on a core switch was made active during the planning phase of upcoming maintenance. This port has been shut down to prevent the connected device causing loss of connectivity. Total downtime was approx. 4 minutes.

Mail Cluster Issues

We're currently experiencing a major issue with authentication on the shared qmail cluster. The issue is being investigated, and a ticket has been opened with the vendor to try to get things back up and running as quickly as possible.

Update 1450
Apologies for not updating this sooner
The issue was resolved a few hours ago, but if you are still having issues please let our helpdesk know

IEDR Suspensions Deletions Scheduled For Monday


The IEDR inform us that today's suspension and deletion run has been postponed until Monday at 1200

Old Linux Shared Hosting Node - Arthur - Network Issues

One of our Linux shared hosting nodes located in the UK is having networking issues. It's dropping packets intermittently.

Our engineers are liaising with the data center techs where the server is located to resolve this issue asap.

Windows VPS Hardware Node: PEMVZWIN02 Issues

We are currently experiencing issues with one of our Windows VPS hardware nodes, PEMVZWIN02.

Our engineers are resolving this issue as quickly as they can.

The affected VPS nodes are:
78.153.208.116  VPS-238
78.153.208.123  VPS-258
78.153.208.16   VPS-262
78.153.208.127  VPS-282
78.153.208.28   VPS-287
78.153.209.211  VPS-288
78.153.208.154  VPS-317
78.153.208.168  VPS-331
78.153.208.149  VPS-337
78.153.208.170  VPS-341
78.153.208.172  VPS-342
78.153.208.169  VPS
78.153.208.171  VPS-344
78.153.209.151  VPS-346
78.153.208.15   VPS-347
78.153.208.62   VPS-353
78.153.208.176  VPS-354
78.153.208.177  VPS-355
78.153.208.179  VPS-357
78.153.208.184  VPS-362
78.153.210.75   VPS-387
78.153.208.205  VPS-390
78.153.208.218  VPS-407
78.153.208.222  VPS-409
78.153.208.223  VPS-410
78.153.209.118  VPS-620
78.153.209.160  VPS-664
78.153.209.164  VPS-667
78.153.209.176  PARTYCENTRAL
78.153.209.107  VPS-710

Once any information is available we'll have it posted here for you.

UPDATE 08:35: The node is now back online. The VPSs on the node are now booting up. All services should be fully restored shortly. I'll continue to keep this post up to date.

UPDATE 08:44: All VPSs are now back online.

Old Linux Shared Hosting Node: Gorlois - Disk Replacement

Due to a failed disk in the Linux shared hosting server Gorlois, we will be replacing the disk tonight.

The node will be offline for approx. 15 minutes at 21:00 tonight, 14/07/10.

I will update this blog post once the work has been completed.

UPDATE 21:05 - The new disk has been added into the RAID. The server is currently rebuilding the RAID and seems to be completing rather quickly, which is great. ETA until it's back online - 10 mins.

UPDATE 21:25 - The RAID is fully rebuilt and the server is now back online. Sorry this took 10 minutes longer than expected. Normal operations are resumed!

Firewall Upgrades

In order to add more redundancy into the network, we're moving certain segments of the network over to their own dedicated pairs of firewalls.

The following services will be affected:
  • Minimus, Medius and Maximus Windows and Linux shared hosting
  • DirectAdmin shared hosting
  • Windows Helm
  • cp.blacknight.com
  • Hosted Exchange
In each case, there should be minimal downtime as we're just shutting down the interface on the old firewalls and bringing it up on the new firewalls.

Update: 23:52

The above work is mostly completed. Due to time constraints we were unable to finish all of it; however, cp.blacknight.com and all our Minimus, Medius and Maximus hosting packages have been moved to the new firewall infrastructure.

We'll complete the remaining work during another maintenance window. The current window is now closed.

Shared Hosting: Ector

We are currently experiencing issues with Ector, a server on our older Linux shared hosting platform.

Our engineers are working on resolving this asap and will update this blog post once completed.

UPDATE 02:02AM: The server is having write issues with a partition on its disk. fsck is currently being run manually and is checking for orphaned inodes. ETA is 20 minutes.

UPDATE 02:09AM: The file system has been fully checked and CentOS is happy to boot. The server is now back online.

Nameserver Issue ns2.blacknightsolutions.com

We are experiencing some issues with one of our offsite name servers: ns2.blacknightsolutions.com

As the server is off site (in Germany) our engineers are liaising with the third party data center to resolve the issue asap.

This outage will not cause any downtime or issues for our users, as our nameservers are fully redundant.

As always, we will update this post as soon as more information is available and/or the issue is resolved.

UPDATE 12:05 This issue is now resolved. We're just awaiting a Reason for Outage document from the third party data center.

Linux Shared Hosting Issues - Ector

We are currently experiencing issues with Ector, a server on our older Linux shared hosting platform.

Our engineers are working on resolving this asap and will update this blog post once completed.

UPDATE 11:40 - This issue has now been resolved.



pemwinweb01 throwing .net errors

Summary: pemwinweb01 ran out of disk space on the C: drive, and because of this it started throwing .NET errors and other temporary file creation errors from PHP, Perl or Python.

Resolution: We removed old IIS error logs and moved the system page file from C: to D:, freeing up 40% of disk space. The machine was also rebooted at 08:30 this morning to ensure that swap was functioning correctly.
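
A simple free-space check can give early warning before a drive fills up like this. The following is a minimal sketch using Python's standard `shutil.disk_usage`. The function names and the 10% threshold are assumptions chosen for illustration, not a description of the monitoring actually in use.

```python
import shutil

# Assumed threshold for illustration: warn when less than 10% is free.
LOW_SPACE_THRESHOLD = 0.10


def free_fraction(path: str) -> float:
    """Return the free space on the volume containing path, as 0.0 to 1.0."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Avoid dividing by zero on unusual or virtual filesystems.
        return 0.0
    return usage.free / usage.total


def is_low_on_space(path: str, threshold: float = LOW_SPACE_THRESHOLD) -> bool:
    """Return True when the free fraction is below the threshold."""
    return free_fraction(path) < threshold


if __name__ == "__main__":
    path = "."  # on the server in this post this would be "C:\\"
    print("%.1f%% free on %s" % (free_fraction(path) * 100, path))
    if is_low_on_space(path):
        print("WARNING: low disk space")
```

Run on a schedule, a check like this would flag a drive while there is still time to clear logs or move files.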

Reboot of Windows 2003 Shared Hosting Servers


All of our Windows 2003 Shared Hosting Servers that are accessed under cp.blacknight.com will be rebooted at 23:00 tonight to apply a Microsoft Hotfix.

Downtime will be no more than 10 minutes.