
I can't send e-mail, what's up?

Summary: Nothing is wrong. After the downtime on Monday and Tuesday, we put a new mail system in place that handles POP and IMAP. It's more robust and faster for both you and us, and it's better than the old one in many other ways.
There is one slight drawback: you can no longer send e-mail without explicitly turning on SMTP authentication in Outlook, Apple Mail, etc.

We have gathered a few articles to help you with this:

1) https://support.blacknight.ie/index.php?_m=knowledgebase&_a=viewarticle&kbarticleid=472
    - this gives you instructions on how to enable SMTP authentication in most common e-mail clients.

OR

2) http://wiki.blacknight.com/index.php/SMTP
    - this gives you a list of SMTP servers for most Irish ISPs. At the bottom is a link to a page listing SMTP servers for many ISPs outside Ireland.
An ISP is the company that provides your broadband, cable, DSL, dial-up, satellite or other internet connection.
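If you send mail from your own scripts rather than from a desktop client, the same requirement applies: the script has to authenticate before it sends. The sketch below uses Python's standard smtplib to show where that authentication step happens. The host name, port 587 and the use of STARTTLS are assumptions for illustration only, not settings taken from this notice; use the values in the knowledge-base article above.

```python
import smtplib
import ssl
from email.message import EmailMessage

# Hypothetical placeholders: replace with the SMTP server and port from
# the knowledge-base article or your ISP's list. These are assumptions.
SMTP_HOST = "smtp.example.com"
SMTP_PORT = 587  # common mail submission port; an assumption here


def build_message(sender: str, recipient: str, subject: str, body: str) -> EmailMessage:
    """Build a plain-text message; no network access happens here."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send_authenticated(msg: EmailMessage, username: str, password: str,
                       host: str = SMTP_HOST, port: int = SMTP_PORT) -> None:
    """Send msg after logging in, which is what 'SMTP authentication' means.

    Assumes the server offers STARTTLS on this port; without login(),
    a server that requires authentication will refuse the message.
    """
    context = ssl.create_default_context()
    with smtplib.SMTP(host, port, timeout=30) as server:
        server.starttls(context=context)   # encrypt before sending credentials
        server.login(username, password)   # the SMTP AUTH step
        server.send_message(msg)
```

Typical use would be `send_authenticated(build_message("you@yourdomain.ie", "friend@example.com", "Hello", "Hi there"), "you@yourdomain.ie", "your-password")`, with the address and password being your own mailbox credentials. Message building is kept separate from sending so it can be checked without connecting to a server.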

I've asked for the above information to be included as a technical notice in the next few runs of the customer newsletter.

Update: 09:35

Webmail users with Firefox 4, or IE 7 and above, can use our "alternative" webmail application here: https://altmail.blacknight.com

mail delivery delay last night Nov 3rd

Summary: We have had some complaints that mail delivery was delayed last night. We have found that the cause is similar to an issue we've seen on other machines with very new 10GigE interfaces. We will apply a new driver to all the machines involved, which will prevent the issue from occurring again.

This affected inbound e-mail into our spam scanning server in the Cloud. No e-mail was lost during this window. The issue was resolved at around 00:00 last night.

mail.blacknight.com / smtpr1.cp.blacknight.com brief outage notification

Summary: At 23:00 tonight, Friday 21st of October, we're going to update the drivers for the 10GigE network cards that provide connectivity to the SAN. To do this we have to take mail down for approximately 15 minutes.

This should hopefully resolve a bug that has shown up a number of times since the mail storage platform was upgraded last Friday night. Once this is complete, we hope the mail system will run better going forward.

Update 23:30: The driver upgrade was successful and mail was down for approx 20 minutes. We're monitoring this situation very closely. 

mail.blacknight.com / smtpr1.cp.blacknight.com mail delivery delays

Summary: 2 of the 4 servers that deliver mail to mailboxes currently have a few thousand messages in their queues. We expect this to clear within the next 60-90 minutes. It was caused by a number of large mailings from various companies hosted outside the qmail cluster, which has left the spam scanning servers very busy.

Additionally, we're moving the primary spam scanning server to new hardware, which should improve performance tenfold and reduce delivery times during periods of high inbound e-mail load.

Update:
The upgrade has greatly improved performance.

mail.blacknight.com

We're currently experiencing issues with our qmail cluster, mail.blacknight.com. Our engineers are working to resolve this ASAP, and all updates will be posted here as they become available.

UPDATE 11:32 - All services have been resumed. 

mail.blacknight.com / smtpr1.cp.blacknight.com major maintenance

Summary: Following on from this week's earlier maintenance, we're going to do the final move of e-mail to the new storage platform tonight, October 14th.

We'll take mail offline again at 23:00 for approximately 2 hours, until 01:00. This move is essentially just an rsync of data from the old system to the new one. We've already seeded the new platform from a recent backup, so there isn't a huge amount of data left to copy. After some essential tests, we will then re-mount the new drives on the mail servers and restart mail flow.

Tomorrow, Saturday 15th, we'll monitor everything closely to see how it performs.

Update: 01:20: We've overrun on this because the difference between the restored backup and the data on the old mail storage node is more significant than we expected. The copy is approximately 50% done, and we estimate another one to two hours to completion.

Update: 03:30: The copy is going very well and shouldn't take much longer. Current estimated completion is approximately 04:15.

Update: 04:15: It's very close to being completed right now. There's probably another 30-40 minutes max left. ETA is now 05:00.

Update: 04:45: Mail has been back and stable for the past 10 minutes. We now have full visibility of the storage platform's performance, and everything is well within specified parameters.

mail.blacknight.com / smtpr1.cp.blacknight.com mail download slowness

Summary: Since approximately 11:30 this morning, people have been experiencing delays downloading e-mail: it downloads very slowly and sometimes times out. Between 11:30 and approximately 13:30 is the busiest time of day for the mail servers, and we suspect a RAID array rebuild is causing this issue. The array should be fully rebuilt by mid to late afternoon today, at which point things should improve dramatically. Unfortunately, there is nothing we can do to speed this up in the meantime.

Additionally, once we perform the major storage backend upgrade later this week, we expect mail to function properly and perform much better than it has been.

Update @ 16:45: The mail RAID rebuild is still in progress.

Update: 19:55: The RAID array that powers the current mail cluster finished rebuilding at 18:45:54 this evening. We expect its performance to return to normal now.

mail.blacknight.com / smtpr1.cp.blacknight.com major maintenance

Summary: At 23:00 on Monday 10th of October, we're going to take the mail cluster completely offline for approximately 4 hours, until 03:00. This is to allow the following:

1) Upgraded storage space for the mail store
2) Upgraded performance for the mail store
3) Upgraded file system from ext3 to ext4 for performance reasons

We're doing this in stages.

Stage 1)

We've restored the existing mail system from a recent backup to a new location. We'll begin copying that data to the new mail storage node, which has 20Gbit/s of connectivity to our SAN. We estimate this will take approximately 24-48 hours to complete.

Stage 2)

We will then take the mail system offline again in a couple of nights to do a final sync of the changes to the new system. The timing depends largely on how long the copy to the SAN takes; estimates range from approximately 36 to 50 hours.

Tonight we're simply relocating the servers to a new rack close to our SAN. While they're down, we're also cleaning up old mailboxes and purging data that is no longer required.

Update: 03:30: Everything went as planned tonight. The servers are now in their new home. This week we hope to swap the storage backends, and we'll schedule a further maintenance window for this, currently expected for Friday night this week.

mail.blacknight.com / smtpr1.cp.blacknight.com mail delivery delays

Summary: We're noticing some strange behaviour in the mail system right now. For no apparent reason, the load balancer is not passing connections through to the mail server on port 25. We're currently investigating this and hope to fix it shortly.

Additionally, there are around 6,000 messages in the "local" delivery queue, which means inbound e-mail delivery is slower than normal. We're investigating this as well, but it may be related to the issue mentioned above.

We will post further updates on this when the information is available to us.

mail.blacknight.com offline for emergency maintenance

Summary: To stabilise the system, we're taking all services offline for about 1 hour to make some configuration changes that we hope will speed things up.

All services are affected; no e-mail should be lost during this time.

Update: 21:24: The NFS server that provides file storage for e-mail is currently running a disk check, which will take some time. It's at around 36%, and I estimate another 60-80 minutes before it completes. That means mail should be back up and running at around 22:30-22:50, give or take. I'll post further updates if it looks like it will take longer than this.

Update: 22:14: I can tell from the progress that this probably won't be completed by 22:50 as I thought it would be. It's at 48.4%. I don't expect 70% to 100% to take too long, but 48% to 70% could take another 1h 30m. I'll post another update with an ETA in 45-60 minutes.

Update: 23:05: The disk check is now at 87.5%. E-mail services should be back online before 00:00 (midnight).

To answer some users' questions: the maintenance should have taken 3 minutes (i.e. a reboot); however, an unforeseen problem occurred and a disk check of the entire mail store ran. There is no way to bypass this. So no, it wasn't necessary, but we felt it was important to get mail back as stable as possible ASAP, and this work should help.

Update: 23:20: Unfortunately, the disk check failed at around 89% and has to be rerun with an extra flag that fixes the errors it finds. While this will take another few hours, it won't take as long as the initial check.

We're also putting a plan of action in place to have a new storage system ready very early next week, possibly before Monday, time permitting. More updates will be posted as we have more information available.

Update: 00:10: The second run of the disk checker is progressing, albeit more slowly than I would like, but I can't influence its speed. It's approximately one third of the way through now.

For people wondering why there's no failover: mail is very disk-intensive, so at the moment a single NFS (Network File System) server houses this data, with very fast disks and plenty of RAM to cache files. The file system on that server needed a disk check because many millions of files have been written, deleted and rewritten since its last reboot; this check is forced on file systems to keep them intact. There's no failover in this case because the file system itself isn't fully healthy. Highly available NFS is also not easy to do right, and you would lose performance. In the 4 years we've been running Qmail, we've upgraded the storage platform twice and we're about to do so a third time. Thankfully, the third will be its final upgrade, as the SAN we're moving it to can scale in both size and performance.

Update: 01:10: We're about 50% of the way through fsck number 2 and haven't hit any of the corrupt files from the previous run yet. I don't expect to reach them until around 89-90%. The ETA at this stage is around 03:00, and I'll post one more update before then.

Update: 02:30: The mail storage system is back online, as is the Qmail cluster itself. Only half a dozen files or directory entries were corrupt, and these were found on the second pass of fsck, which checks directories. Thank you for your patience; this will be the final update for this issue.