October 2010 Archives

MySQL database server having issues again this morning

Update 28th Oct @ 10:10: This issue has been traced back to a memory leak in the version of Virtuozzo on this machine, which will have to be upgraded in the near future. This morning we rebooted the physical server that both of the VMs below reside on. This appears to have stabilised the system somewhat, so we're happy with its performance at the moment.

Rest assured that we're working on this as a top priority and that we're monitoring it very closely. The main cause was busy sites hammering the MySQL servers, which uncovered the memory leak mentioned above. We'll move one of these DBs to a newer server today, which should keep this node more stable until we can schedule a maintenance window to run the updates to the VM platform.

The following database servers are having issues again this morning:

mysql637.cp.blacknight.com
mysql637int.cp.blacknight.com

and

mysql640.cp.blacknight.com
mysql640int.cp.blacknight.com

Our engineers have been working on this since the servers first started to show problems. We hope to have this resolved as soon as possible.

I have asked our engineers to update this status post with more information as soon as they can.

MySQL Issues

We are currently experiencing issues with two of our MySQL nodes:

mysql637.cp.blacknight.com
mysql637int.cp.blacknight.com

and

mysql640.cp.blacknight.com
mysql640int.cp.blacknight.com

Our engineers are currently working to resolve this.


Billing Upgrade on cp.blacknight.com

Summary: As part of our continued efforts to provide the best services, control panel and so on to our customers, we're upgrading our billing system tonight / tomorrow morning (Oct 26th / Oct 27th), starting at around 3am.

When: Starting at 3am Irish time on Wednesday 27th of October until approx 9am.

What:
Our store / billing system will be offline during this process, so you won't be able to buy new domains, update contacts on domains, manage name servers etc. You will, however, still be able to log in to cp.blacknight.com and manage your hosting, e-mail, SharePoint and VPS servers.

This is a _major_ change in our software. It moves our backend DB from MySQL to PostgreSQL and brings lots of UI changes. There'll be a new payment wizard to allow people to pay for renewal orders etc. Screenshots of this were put up on our main company blog.

Update 26/10 @ 22:00: Due to unforeseen circumstances we need to postpone this upgrade until the 10th of November.

pemvzwin02 emergency reboot


For reasons beyond our control, it has become necessary to reboot the above Windows VPS hardware node immediately. All VPSs on this node will also reboot.

We apologise for any inconvenience this may cause.

mail.blacknight.com / smtp1r.cp.blacknight.com information notice

Summary: We've added further RBLs to our mailservers. We do operate an SMTP AUTH bypass mechanism, so if you're having trouble sending e-mail please turn on SMTP authentication in your e-mail client.

The following KB article will give you instructions on how to do this for most clients:

https://support.blacknight.ie/472/smtp-authentication.html


Multiple Legacy Shared Linux Hosting reboots


We are currently in the process of upgrading the kernel modules on all of our legacy (DirectAdmin) Linux hardware nodes to include the latest and greatest backup module. We need to reboot some of these nodes in order to do this.

As snapshots and backups are very important to us, we would like to get this done ASAP.

When - The downtime window is tonight, the 18th of Oct 2010, between 22:00 and 23:00. Although the window is an hour long, the estimated downtime is about 10 minutes at most per hardware node.

What's affected - The following hardware nodes will be affected.

  • Balin
  • Camelot
  • Da-Server1
  • Ector
  • Gorlois
  • Igraine
  • Morgana

As always, we are sorry that downtime must occur but we need to ensure we have server snapshots of these critical hardware nodes.

Once these hardware nodes are upgraded to the latest module version, it will not be necessary in the future to reboot in order to upgrade.

This blog post will be updated once the maintenance is completed.

Update @ 22:25: All servers have been rebooted, and services have returned to normal.

Network issues due to DDOS

Summary: On Sunday Oct 17th, starting at approx 11:00, we began seeing large volumes of traffic from multiple sources, which is currently disrupting services. We're working with our carriers to resolve this as quickly as possible.

Update 11:35: This issue is completely resolved.

Emergency Reboot - pemvzwin12


The above Windows Hardware node needs to be rebooted as a matter of urgency. All Windows VPSs currently hosted on this node will also be rebooted at that time.

We will be rebooting it at 17:00; downtime is expected to be less than 10 minutes.

pemvzwin12 emergency reboot

Summary: The virtualisation platform on this node is behaving very strangely and causing people's VPSs to go down. We've rebooted the node now and it should be fully back up and running by 09:10 or 09:15.

Multiple hardware node reboot notification

We are currently in the process of upgrading the kernel modules on all of our Parallels Linux hardware nodes to include the latest and greatest backup module.

Of our current 61 hardware nodes, we need to reboot 6 of the older ones in order to do this.

As snapshots and backups are very important to us, we would like to get this done ASAP.

When - The downtime window is tonight, the 13th of Oct 2010, between 21:00 and 22:00. Although the window is an hour long, the estimated downtime is about 10 minutes at most per hardware node.

What's affected - The following hardware nodes and their corresponding VEs (also listed) will be affected.

  • PEMVZMPS19
    • pemlinweb21.blacknight.com (81.17.254.58)
    • pemlinweb22.blacknight.com (81.17.254.59)
  • PEMVZMPS21
    • pemlinweb23.blacknight.com (81.17.254.62)
    • pemlinweb24.blacknight.com (81.17.254.63)
  • PEMVZMPS23
    • pemlinweb05.blacknight.com (81.17.254.86)
  • PEMVZMPS30
    • pemlinweb27.blacknight.com (81.17.254.68)
    • pemlinweb28.blacknight.com (81.17.254.69)
  • PEMVZMPS31
    • pemlinweb12.blacknight.com (81.17.254.94)
  • PEMVZMPS33
    • pemlinweb32.blacknight.com (81.17.254.44)
    • pemlinweb33.blacknight.com (81.17.254.48)

As always, we are sorry that downtime must occur but we need to ensure we have server snapshots of these critical hardware nodes.

Once these hardware nodes are upgraded to the latest module version, it will not be necessary in the future to reboot in order to upgrade.

This blog post will be updated once the maintenance is completed.


UPDATE 21:25 - Everything has been completed; however, we are still waiting for PEMVZMPS21 to come back online. It is currently running a disk check.

UPDATE 22:32 - All maintenance is complete

cp.blacknight.com control panel maintenance

Summary: Some people may have noticed that cp.blacknight.com has been getting slower and slower over the past 2 months. This is despite us spending a lot of money on new hardware just for this application. Tonight / tomorrow morning (Oct 13th @ 3am) we're going to be working on the back end database to reduce its size, ensure all indexes are in place and vacuum the tables.

Update 08:10 13/10/10: This work is completed and was 100% successful. The database is now a tenth of its original size, which is much more manageable.

When:
Tonight / Tomorrow morning from 2am Irish time on October 13th until 6am October 13th. The work will occur inside this window.

Details: This UI is split into two distinct applications, each with its own dedicated hardware.

1) CP, which runs a Java application that provides the cp interface via Apache/mod_jk
2) CORE node, which runs the back end for the control panel and handles provisioning and all the magic that makes the cp tick.

On the core node we've noticed that a number of tables in the DB have grown to unmanageable sizes. For example, one table that people use every day is now 24GB in size, and there are hundreds of tables in the database.

Tonight, starting at 3am Irish time, Parallels engineers are going to perform this maintenance. They've estimated around 30 minutes, but we suspect it may take up to 2 hours due to the sheer size of the database in question.

Services affected: cp.blacknight.com only. No customer services (web services, email etc.) will be affected by this. You simply won't be able to log in and perform any actions in your account until this maintenance window has concluded.

Sorbs Issue

We are aware of an issue with Sorbs at the moment where a lot of our IP addresses, including those of our Qmail cluster, are being listed incorrectly as dynamic. We are in contact with Sorbs and trying to get it sorted; however, they seem to be having serious issues with their site at the moment, which may be related. Note: our Hosted Exchange service is not affected.

There is nothing we can do to correct this at the moment; we have to wait for Sorbs to get the issue sorted.

UPDATE 15:45: It seems that Sorbs have now emptied the blacklist that was causing the issue. The TTL (Time-To-Live) for those records is one hour, so hopefully we will see the last of the blocks within the next hour.
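
For anyone wondering how a listing like this is checked and why the TTL matters, here is a minimal sketch. It assumes the usual DNSBL convention (the IPv4 octets are reversed and the blacklist zone is appended to form a DNS name) and it only builds the name and does the TTL arithmetic, with no network access. The zone name dul.dnsbl.sorbs.net, the example address 192.0.2.1 (a documentation address, not one of ours) and the calendar date are illustrative assumptions, not details taken from this notice.

```python
from datetime import datetime, timedelta
import ipaddress


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name that is looked up to check `ip` against a DNSBL `zone`.

    Convention assumed here: reverse the IPv4 octets, then append the zone.
    """
    # IPv4Address validates the input and raises ValueError on bad addresses.
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def latest_cache_expiry(cleared_at: datetime, ttl_seconds: int) -> datetime:
    """Latest time a resolver could still serve a record cached just before removal."""
    return cleared_at + timedelta(seconds=ttl_seconds)


# Hypothetical zone and address, for illustration only.
name = dnsbl_query_name("192.0.2.1", "dul.dnsbl.sorbs.net")

# The date is a placeholder; only the 15:45 time and the one-hour TTL come from the update.
expiry = latest_cache_expiry(datetime(2010, 10, 1, 15, 45), ttl_seconds=3600)
```

A resolver that cached a listing just before the list was emptied can keep answering from that cache until the TTL runs out, which is why blocks can linger for up to an hour after the fix.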

qmail hosting (smtp1r.cp.blacknight.com / mail.blacknight.com) slow mail delivery

Summary: The mail queues on our cluster are climbing. As such, e-mail sent to you will be delayed for some time. Mail sent to outside e-mail addresses is not affected by this as it is in a different queue.

There are 2 reasons for this problem.

1) The number of POP/IMAP connections is at a critical level, which is causing high IO wait on our storage backend.

2) In relation to point 1, this high IO wait is in turn slowing down delivery of mail.

In order to rectify this problem we've been looking at a number of solutions. However, we've not yet pinpointed a scalable solution that can cope with our current requirements and allow for future growth. The new storage platform put in place less than 12 months ago is at the point where it can't really deal with more IO requests. As such, we need to find a solution that will last more than 12 months and have enough data capacity to allow for growth.

We apologise for the lengthy delay in e-mail delivery. Please understand that a solution will be put in place sooner rather than later, but that it will take some time.

Further updates will be posted here. We'll also announce in another post when we are putting the new storage backend in place. We hope to announce a timeline this week.


Qmail cluster unavailable

Summary: Qmail mail services are currently not functioning correctly. We're working on this issue at the moment and hope to have it resolved shortly. It looks like the NFSD on the storage node may have had a kernel panic for some reason and this looks to be the cause of the problem.

Update: 03:00 - The main issue, as mentioned above, was a kernel panic. The reason for the prolonged downtime was the disk check required. There are 14 million e-mails on the file system, so it takes a while for the disk check to complete. It's been back since around 02:30.

No e-mail will be lost during this maintenance window.

Further updates will be posted here about this.

Problems with pemwinweb12


Shared Windows server pemwinweb12 - 81.17.250.46 - is currently experiencing difficulties.

An engineer is en route to try and rectify the problem.

Update: All services returned to normal at 00:15.