September 2011 Archives

mail.blacknight.com / smtpr1.cp.blacknight.com mail delivery delays

Summary: We're noticing some strange behaviour in the mail system right now. For no apparent reason, the load balancer is not passing connections through to the mail server on port 25. We're currently investigating this and hope to fix it shortly.
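
For readers who want to check this kind of symptom from their own side, one simple test is to attempt a plain TCP connection to port 25 and see whether it completes. The following is a minimal sketch under stated assumptions, not the check used during this incident; the function name port_is_open and the 5 second default timeout are illustrative choices.

```python
import socket


def port_is_open(host: str, port: int = 25, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False
```

Calling port_is_open("mail.blacknight.com", 25) returns True only if the connection is accepted; the result depends on the network the check is run from, and a successful connection says nothing about whether mail is then processed normally.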

Additionally, there are around 6,000 mails in the "local" delivery queue, which means that inbound e-mail delivery is slower than normal. Again, we're investigating this issue, but it may be related to the issue mentioned above.

We will post further updates on this when the information is available to us.

pemlinweb19 & pemlinweb20 Outage

pemlinweb19 with IP 81.17.254.79
pemlinweb20 with IP 81.17.254.57
are offline. We are rebooting the node and they should be online shortly.

Update 23:00
Servers are back online. We will continue to monitor them.

pemlinweb14 & pemlinweb13 Outage

pemlinweb14 & pemlinweb13 are offline. We are investigating the issue and will update ASAP.

Update 14:40

The servers are back online. We are still investigating the root cause.

Linux VPS node having difficulty - Pemvzlin03

We are currently experiencing some technical difficulties with one of our VPS nodes - pemvzlin03. We have dispatched an engineer on-site to look into this right away and we will update this post as soon as we have more information.

The primary IP addresses of the affected VPSs are:

78.153.208.128
78.153.208.130
78.153.208.142
78.153.208.150
78.153.208.164
78.153.208.147
78.153.208.173
78.153.208.182
78.153.208.192
78.153.209.91
78.153.208.195
78.153.209.112
78.153.209.125
78.153.209.137
78.153.209.146
78.153.208.71
78.153.209.14
78.153.209.74
78.153.209.89
78.153.210.129
78.153.210.138
78.153.208.152

Update: 08:45: The node was completely unresponsive, so unfortunately we had to reboot it. It's currently running an fsck on the /vz partition (around 700-800GB), which is at approximately 32%. We estimate that it'll be back by around 09:30. Sorry for the inconvenience.

Update: 09:20: The disk check had to be restarted, however it's already at 54% so it's going well. ETA now is 09:45

Update: 09:40: The disk check is at 65% currently and going well. We estimate that it'll be between 10:00 and 10:20 when it'll finish.

Update: 09:50: It has restarted from the beginning again and we're keeping a close eye on it. We have a box built and prepared, and we can restore the backup from around 2am on the 21st if we have to.

Update: 10:25: It's still going. In the meantime we're doing a bare metal restore of the box to a spare box that we have in place for just such an occasion. We'll know more in the next 30 minutes on what the ETA will be.

Update: 10:50: OK, the file system check finished successfully, the machine has rebooted as normal and it's now back online. All the containers are running disk checks themselves, so they're starting slowly. When they're all back online we'll post one further update.

Update: 11:50: All the Containers on this node are now fully back up and running.

Billing & Store Scheduled Maintenance

The Store and Billing sections of cp.blacknight.com will be offline for scheduled maintenance from 6:30am on Wednesday 21/09/2011 for approximately 60 minutes.


Update 07:00am

The Store and Billing sections are now online again.

pemlinweb65 and pemlinweb66 currently offline

Summary: pemlinweb65 and pemlinweb66 are currently down. Sites on the following IP addresses are affected:

pemlinweb65:

78.153.215.156

pemlinweb66:

78.153.215.157
78.153.214.131
78.153.214.135
78.153.214.148
78.153.214.152
78.153.214.170
78.153.215.13

We're working to resolve this at the moment.

Update 12:30:

This server is experiencing some sort of strange power problem: it boots but then complains about a low power situation.

Our options include removing PCI-E cards and the iDRAC, and replacing the motherboard and/or power supplies. We're still working on this and it currently has top priority.

Update: 13:00:

We're currently moving this server to new hardware.

Update: 14:00:

We need to boot the machine from a rescue CD to edit MAC addresses in configuration files in order to get the machine fully online. This is taking a bit longer than we would like, but we hope to have it resolved shortly. A MAC address is the hardware address of a network interface card in the server, and these addresses are hard-coded in the network configuration. The current ETA to have your sites back online is 15:00. We're doing everything we possibly can to get this done ASAP.

Possible Issues With .Co Registrations / Updates

We have been informed by our partners in Colombia that they are conducting emergency maintenance on their EPP systems for the next few hours.

Existing .co domain names will not be impacted, i.e. they will resolve as normal.

Possibly impacted:
- new registrations
- updates


mysql452 Outage

The MySQL server mysql452.cp.blacknight.com, with IP address 81.17.254.3, went offline.

We are investigating at the moment.

Update 14:44: The server has been rebooted and is now back online.

Getting Business Online Sign Ups are currently disabled

Due to a recent upgrade of our control panel software, sign-ups for the GettingBusinessOnline.ie website are currently disabled. Developers are looking into the issue and hope to have it resolved shortly.

This issue was resolved early Friday afternoon

Issue with 'billing' section of control panel

Some customers may be experiencing issues with clicking on the 'billing' section of their control panel at http://cp.blacknight.com/.

To fix this, please clear your browser's cache and cookies, and restart your browser.

The billing section will then be accessible.

For instructions on how to clear your browser's cache, please see http://www.wikihow.com/Clear-Your-Browser%27s-Cache.

For instructions on how to clear your browser's cookies, please see http://www.wikihow.com/Clear-Your-Browser%27s-Cookies.

Scheduled Maintenance mysql452

The MySQL node mysql452.cp.blacknight.com (IP address 81.17.254.39) will be offline from 9pm for approximately 1 hour to allow for migration to new hardware.

Update 21:30
Server back online

Scheduled Maintenance of Shared Hosting Nodes

The shared hosting nodes listed below will be rebooted at 8:00 am Friday the 16th September for scheduled updates.

81.17.254.64            pemlinweb25.blacknight.com
81.17.254.67            pemlinweb26.blacknight.com

Scheduled Maintenance mysql452 & pemlinweb31

The nodes listed below will be offline for a period of approximately 60 minutes from 22:00 this evening for migration to new hardware.

The MySQL node mysql452.cp.blacknight.com IP address 81.17.254.39
Linux Shared Hosting node pemlinweb31.blacknight.com, IP address 81.17.254.38

UPDATE 21:00

The scheduled maintenance has been postponed until tomorrow the 14th at 22:00

Update 22:45 14th Sept
Servers remain offline. We are working to get them back online ASAP.

Update 00:00
Server is back online; pemlinweb31 is running checks and will be online in approximately 20 minutes.

Update 00:10
Both servers now back online

mail.blacknight.com offline for emergency maintenance

Summary: In order to get the system stable we're taking all services offline for about 1 hour to do some configuration that we hope will speed things up.

All services are affected, but no e-mail should be lost during this time.

Update: 21:24: The NFS server that provides file storage for e-mail is currently running a disk check, which will take some time. It's currently at around 36%, and I estimate another 60-80 minutes before it completes. That means that mail should be back up and running at around 22:30 - 22:50, give or take. I'll post further updates if it looks like it won't be back by then.

Update: 22:14: I can tell from the progress that this probably won't be completed by 22:50 as I thought it would be. It's at 48.4%. I don't expect 70% to 100% to take too long, but getting from 48% to 70% could take another 1 hour 30 minutes. I'll post another update in 45-60 minutes with an ETA.
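
As a rough illustration of how an ETA like this can be derived, the two progress readings above (about 36% at 21:24 and 48.4% at 22:14) can be extrapolated in a straight line. This is only a sketch under the assumption that progress is linear, which the update itself says is not the case for a disk check; the function name linear_eta is hypothetical and the calendar date in the example is a placeholder.

```python
from datetime import datetime


def linear_eta(t0: datetime, pct0: float, t1: datetime, pct1: float) -> datetime:
    """Extrapolate a completion time from two (time, percent-complete) samples."""
    if pct1 <= pct0:
        raise ValueError("progress must increase between the two samples")
    # Average time needed per percentage point over the observed window.
    per_pct = (t1 - t0) / (pct1 - pct0)
    return t1 + per_pct * (100.0 - pct1)


# Readings from the two updates above; the calendar date is a placeholder.
first = datetime(2011, 9, 1, 21, 24)
second = datetime(2011, 9, 1, 22, 14)
eta = linear_eta(first, 36.0, second, 48.4)
print(eta.strftime("%H:%M"))  # prints 01:42 (the following day)
```

A straight-line estimate like this is pessimistic if the later phases of the check run faster than the earlier ones, and it cannot account for a check that fails or restarts.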

Update: 23:05: The disk check is at 87.5% now as I type. E-mail services should be back online before 00:00 (midnight!).

To answer some users' questions: the maintenance earlier should have taken 3 minutes (i.e. a reboot), however an unforeseen problem occurred and a disk check of the entire mail store ran. There is literally no way to bypass this. So no, it wasn't necessary, however we felt it important to have mail back as stable as possible ASAP and this work should help.

Update: 23:20: Ok unfortunately the disk check failed at around 89% and has to be rerun with an extra flag that will fix errors that it finds. While this will take another few hours, it won't take as long as the initial check.

Also, we're putting a plan of action together to have a new storage system in place by very early next week, possibly before Monday, time permitting. More updates will be posted as we have more information available.

Update: 00:10: The second run of the disk checker is progressing, albeit slower than I would like, but I can't influence its speed. It's approximately 1/3 of the way through now.

For people wondering why it's not in a failover situation: mail is very disk intensive, so at the moment there is one NFS (network file system) server that houses this data, with very fast disks and loads of RAM to cache files etc. The file system on the server needed a disk check as many millions of files have been written, deleted and rewritten since its last reboot. This is a check that is forced on file systems to keep them intact. The reason there's no failover in this instance is that the file system isn't fully healthy. Highly available NFS is not easy to do right, and you would lose performance. In the 4 years we've been running Qmail we've upgraded the storage platform twice and we're about to go for a third. Thankfully the third will be its final upgrade, as the SAN we're moving it to can scale both in size and performance.

Update: 01:10: We're about 50% of the way through fsck number 2 and we haven't hit any of the corrupt files from the previous run yet. I don't expect that until around 89-90%. The ETA at this stage is around 3am. I'll post one more update between now and 3am.

Update: 02:30: The mail storage system is back online, as is the Qmail cluster itself. Only about half a dozen files or directory entries were corrupt, and these were found on the second pass of fsck, which deals with directories etc. Thank you for your patience. This will be the final update for this issue.

mail.blacknight.com / smtpr1.cp.blacknight.com connection problems

Summary: As a knock-on from Monday's major issues we've been experiencing intermittent high load on our Qmail cluster. Symptoms may include disconnections on POP, timeouts while sending e-mail and general mail slowness. The queue levels are perfectly fine, so delivery of e-mail isn't being badly affected.

We understand that this is frustrating and we are working towards a fix. Most of the issues stem from each slight "outage", be it 1 or 2 minutes or a little more, during which connections aren't accepted. Once the system starts accepting e-mail again we effectively end up getting DDoSed by our customers. We believe there is an underlying bug that a recent Qmail upgrade may have introduced. We hope to get a fix for this today.
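
The reconnect surge described above is a general pattern: when many clients retry at the same moment after an outage, they arrive together. A common client-side mitigation is exponential backoff with random jitter, sketched below. This is a generic illustration, not something applied in this incident, and mail clients are not under the operator's control; the function name backoff_delays and its default values are assumptions.

```python
import random


def backoff_delays(attempts: int, base: float = 1.0, cap: float = 300.0, rng=None):
    """Return retry delays in seconds, each uniform in [0, min(cap, base * 2**n)]."""
    rng = rng or random.Random()
    delays = []
    for n in range(attempts):
        # The upper bound doubles on every attempt until it reaches the cap.
        ceiling = min(cap, base * (2 ** n))
        # Picking a random point below the bound spreads clients out in time.
        delays.append(rng.uniform(0.0, ceiling))
    return delays
```

Because every client draws its own random delay, retries after a short outage are spread over a window instead of landing in the same second.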

pemvzlin20 reboot

Summary: pemvzlin20 had some issues doing its backup snapshot during the night due to high IO. In order to remedy this we've just installed the latest Virtuozzo kernel and we're going to reboot the node shortly.

The following VPS IPs will be down for the duration of the reboot:

78.153.210.173
78.153.210.74
78.153.209.45
78.153.209.65
78.153.209.100
78.153.209.128
78.153.209.109
78.153.209.178
78.153.209.179
78.153.209.194
78.153.209.196
78.153.209.198
78.153.210.12
78.153.210.37
78.153.210.48
78.153.208.98
78.153.210.63
78.153.210.71
78.153.210.104
78.153.210.119
78.153.211.91
78.153.211.119
78.153.211.164
78.153.211.167
78.153.211.168
78.153.211.171
78.153.211.184
78.153.211.89

It should be back by 09:00

Update: 09:20: All the containers on this node were back online before 09:00.

mail.blacknight.com / smtpr1.cp.blacknight.com mail delivery delays

Summary: All domains serviced by the above mail server cluster are experiencing delays in inbound e-mail delivery. Mail being delivered to external addresses via the same system is not affected by the issue.

We suspect it's a problem with the SpamAssassin servers and we're checking the configurations to see whether we can find the source of the problem. We'll post further updates as we have them.

Update: 12:50: We've found the source of the problem. We are experiencing some internal packet loss between the SA servers and our resolvers in their local data centre. We should be able to fix this issue in the next 30 minutes and then the mail servers should be able to catch up.

Update: 13:50: The mail queues have reached their peak and they're now starting to go down quite quickly. The issue was indeed caused by a DNS problem resulting from packet loss on our SA nodes. The packet loss was due to some limits that were put in place on a number of virtualised servers. We've raised these limits and we're seeing much higher throughput now. We expect the queues to have returned to normal by approximately 15:00. We'll post one more update closer to 15:00.
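
Packet loss between a server and its resolvers typically shows up as slow or failed name lookups. One way to observe that from the affected machine is to repeat a lookup and count failures, as in the sketch below. This is an illustrative assumption about how such a check could look, not the diagnostic used here; the function name lookup_stats is hypothetical, and a local DNS cache on the machine would mask the problem.

```python
import socket
import time


def lookup_stats(hostname, attempts=20, resolver=socket.getaddrinfo):
    """Resolve hostname repeatedly; return (failure_count, slowest_lookup_seconds)."""
    failures = 0
    slowest = 0.0
    for _ in range(attempts):
        start = time.monotonic()
        try:
            resolver(hostname, None)
        except OSError:
            # socket.gaierror, raised on lookup failure, is a subclass of OSError.
            failures += 1
        slowest = max(slowest, time.monotonic() - start)
    return failures, slowest
```

The resolver parameter exists so the function can be exercised with a stand-in during testing; by default it uses the system resolver through socket.getaddrinfo.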

Update: 15:00: Unfortunately the issue we found earlier hasn't completely fixed the problems that the mail cluster is experiencing. We've asked our software vendor to have a closer look at the configuration for us as it hasn't really changed since May. They provide all the components involved so they might be able to assist us further.

Update: 16:00: The mail queues are going down slowly however the cluster hasn't completely stabilised as of yet. We may require an outage window of about 1 hour tonight from 23:00 to 00:00 in order to fully rectify the situation.

Update: 17:00: Parallels have now made some changes to the concurrency of the delivery daemon on the mail servers. The default setting appears quite low, so it is being raised, which should get e-mail delivered more quickly. We still have to find the root cause of the problem and we'll continue working on this to get to a resolution. The next update will be at 20:00.
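
To see why delivery concurrency matters for clearing a backlog, a back-of-the-envelope calculation helps: the queue shrinks at the combined delivery rate of all workers minus the rate at which new mail arrives. The sketch below uses entirely hypothetical figures; none of the numbers come from the mail cluster, and the function name drain_minutes is an assumption.

```python
def drain_minutes(queued: int, workers: int, per_worker_rate: float,
                  arrival_rate: float = 0.0) -> float:
    """Minutes needed to empty a queue, given per-minute delivery and arrival rates."""
    net_rate = workers * per_worker_rate - arrival_rate
    if net_rate <= 0:
        # Deliveries do not outpace arrivals, so the queue never empties.
        return float("inf")
    return queued / net_rate


# Hypothetical figures, purely for illustration.
print(drain_minutes(9000, workers=10, per_worker_rate=5.0, arrival_rate=20.0))  # 300.0
print(drain_minutes(9000, workers=40, per_worker_rate=5.0, arrival_rate=20.0))  # 50.0
```

In this simplified model, raising concurrency shortens the drain time more than proportionally when arrivals eat up a large share of the delivery capacity. It ignores any slowdown from the extra load that additional workers place on storage.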

Update 21:00: All mail queues were successfully cleared by approximately 18:30 this evening. We're now concentrating on the duplicate e-mail issue that people have reported. We hope to get this resolved overnight so we don't have a recurrence of today's problems.