mail.blacknight.com / smtpr1.cp.blacknight.com performance issues

Notification Type

Emergency Maintenance

Service Affecting

Yes

Message

Summary: This mail cluster is having some performance issues this morning. We're working on a fix right now.

Update: 10:45: We have been busy working away trying to resolve this issue. At the moment, however, the cause isn't at all clear, and as such it's proving difficult to find a fix. This system has been stable since the last round of hardware updates we put in place a couple of weeks ago. The only thing that has changed is that Compellent, our SAN vendor, swapped out the iSCSI cards in the two SAN controllers yesterday. This should not have had a negative impact on the system, but it appears that it has, so we're working with them to find the cause of the problem.

Update: 11:40: We are still working on this issue. It's the top priority for our engineering and support teams this morning.

Update: 13:00: This issue is still ongoing. Unfortunately we've not made any progress in finding a cause for the slowdown.

Update: 13:25: We're having people getting abusive to our helpdesk staff. This is not helpful for anyone. The issue at the moment is that while the mail servers are up and running, they're not doing their job, and that is causing this service issue for you all. We are still working on this and are investigating all avenues, including blocking certain services to see if it's some sort of inbound attack on the mail servers.

Update: 14:30: I've removed some of the previous commentary from this thread as it was causing people issues. I'm sorry about that. Right now we're on the phone to Compellent and we're hoping that their escalation team has found something in the logs we sent them.

Update: 14:55: We have taken the entire system offline and we're examining each part. The Qmail cluster is made up of 4 service groups.

1) SAN + NFS server
2) POP/IMAP/SMTP servers
3) Authentication - LDAP and WHOSOND
4) Mail Scanning / Anti-Spam

We're fairly confident that Groups 3 and 4 are functioning perfectly, as we're not seeing the type of problems you would see if they were at fault. So that leaves the POP/IMAP and SAN systems. The SAN system had some cards replaced yesterday by Compellent, so we immediately thought that this was the cause of the problem and asked them to give us back the old cards. They've told us this isn't possible. We've had two hour-long phone calls with them so far today where we went over all the metrics on the SAN: disk latency, network latency, volume latency, IO throughput etc. Everything on the SAN looks normal. So that leaves the NFS server + NFS clients. We would normally see upward of 300Mbit/s of traffic between the clients and the server; today this is showing as 10-20Mbit/s, so it's fairly obvious that the problem is centred around NFS. This is where we are now concentrating all of our efforts, to figure out what is causing this and to fix it.
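To put those traffic figures in perspective, here is a rough sketch of the arithmetic. It uses only the numbers quoted above (a normal level of about 300Mbit/s and today's 10-20Mbit/s); the function name is ours, purely for illustration, and this is not a tool we actually ran.

```python
def percent_of_baseline(observed_mbit: float, baseline_mbit: float) -> float:
    """Return observed throughput as a percentage of the normal baseline."""
    if baseline_mbit <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * observed_mbit / baseline_mbit


# Figures quoted in the update above; 300 is a lower bound ("upward of").
BASELINE_MBIT = 300.0

for observed in (10.0, 20.0):
    pct = percent_of_baseline(observed, BASELINE_MBIT)
    print(f"{observed:.0f} Mbit/s is {pct:.1f}% of the normal {BASELINE_MBIT:.0f} Mbit/s")
```

In other words, NFS traffic is running at well under a tenth of its usual level, which is why we are focusing there.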

Update: 15:45: A number of people who forward their email on to Gmail / Hotmail etc. have been getting their email all day. This is expected. Inbound SMTP, i.e. mail delivery from others to us, is working OK. The problem is with the POP/IMAP connections from your e-mail clients. For those that asked, all the servers are back online now. We're still seeing the performance issue after the tweaks / changes we've made, but forwarding should be working OK right now. Again, please accept our sincerest apologies for the issues this is causing you all.

Update: 16:50: Sorry about the previous comment. It was a direct response to some customers having issues with forwarding. E-mail is still down but no e-mail will be lost. Again, sorry about this outage. It's the single longest outage we have ever had. It is the number 1 priority and has been all day.

Update: 18:45: Sorry about the delay since our last update. We believe we have identified the cause of the issue. We're not sure exactly where the problem lies, but we can see some weird network traffic between the NFS server and the SAN. We're in discussions with Compellent now to get them to shed some light on the situation.

Update: 20:50: We have spent most of the evening with Compellent, and they did find a problem with the Write Cache on one of the controllers. This happens to be the primary controller for the mail storage system. That problem has now been resolved. It hasn't fixed the issue completely, but we turned e-mail back on for 30 minutes and saw a lot more traffic over the network, so it looks like we're quite close.

At this point we're going to begin syncing e-mail back to the old mailstore in order to have a fallback. This will also allow us to eliminate the current server as the problem, if that is what has been causing the issues.

Update: 21:15: We're instigating a rollback plan to the old mail storage box until we can nail down what is causing the issues on the newer one.

Update: 23:30: The rollback plan is going to take a number of hours to put in place. Currently e-mail is syncing back to the old mail store and it's about 25% done right now. Although Compellent found an issue with the cache settings on the SAN controller for this volume, fixing it didn't have a positive impact on mail performance. So mail is currently completely switched off.

Day Changed to November 8th:

Update: 03:30: The data copy back to the old storage node is progressing well. Will check in on it again at 06:00.

Update: 06:15: The copy to the old storage node is almost complete. The ETA still stands at 09:00 to have mail back up and running.

Update: 06:56: The 9am ETA is this morning Tuesday 8th of November.

Update: 08:45: POP and IMAP have been switched back on. During the night we moved back to mailstore1 and we also converted the mail system from Courier-IMAP to Dovecot. We hope this change brings significant performance improvements through better indexing and logging. SMTP will take a while to turn back on unfortunately. ETA for SMTP is now 11am.

Update: 09:15: People are saying to our helpdesk that they're having problems with IMAP connections. They can't sync folders. We're investigating this now.

Update: 09:45: POP3 seems to be working ok for most customers. IMAP is intermittent and we're trying to figure that out. Webmail relies heavily on IMAP, so when IMAP is fully working so will Webmail.

Update: 10:44: We are working our way through some file permission issues. Once we get these sorted we'll have everything back up. The main issue right now is e-mail delivery and IMAP/Webmail access. We are not going to make the 11am deadline on this unfortunately. The ETA is being pushed out to midday.

Update: 11:55: Right now we have e-mail flowing from the general internet and our inbound scanning boxes into Qmail. So people who are able to get onto POP3 will begin receiving email in the next while. We estimate around 1,000,000 or so e-mails are queued for delivery, a lot of which will bounce because they're spam messages. So far we've seen around 250k of these go into the local delivery queues on the mail servers. So things are progressing, albeit slower than you would like. The reason for this is that we have an abnormally large number of users trying to get their e-mail because of the prolonged outage.
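As a rough illustration of where that leaves the queue, the sketch below works only from the two estimates quoted above (about 1,000,000 queued and about 250k moved into the local delivery queues). Both are approximations, the helper name is just for illustration, and it makes no attempt to account for messages that will bounce as spam.

```python
def queue_progress(processed: int, total: int) -> tuple[float, int]:
    """Return (percent processed, messages remaining) for a mail queue."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * processed / total, total - processed


# Approximate figures from the update above.
ESTIMATED_QUEUED = 1_000_000
MOVED_TO_LOCAL_QUEUES = 250_000

percent, remaining = queue_progress(MOVED_TO_LOCAL_QUEUES, ESTIMATED_QUEUED)
print(f"About {percent:.0f}% processed, roughly {remaining:,} still to go")
```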

Update: 12:45: We have been working with Parallels to get Dovecot working properly. Dovecot is built to work with NFS storage and is programmed in such a way that it is NFS friendly. We have got it working on 2 of the 4 mail servers currently and we've processed well over 500k mails and delivered them to your inboxes. Some of you may also have noticed that SMTP is working. It's still a little patchy due to the high volume of inbound e-mail, but it's not as bad as it was at 11:30. There is still a fair bit of e-mail to get through right now but the system is handling it very well.

Update: 14:10: All e-mail has been delivered to its respective mailboxes at this stage. POP3 is working but not on SSL. IMAP and SMTP are still intermittent but we're close to having those resolved. Also, as mentioned earlier, IMAP being offline or not working fully means webmail isn't working yet. ETA for full restoration is another 2 hours unfortunately.

Calling support to look for an update is futile, as the engineering team are putting the updates here first and passing the URL on to support. They do not know any more than the information being posted here.

Update: 16:15: We believe we have nailed down the right combination of limits for IMAP to be stable. We made some changes about 15 minutes ago and we're monitoring connections to it right now. Once we deem it stable we'll turn webmail back on as we're acutely aware that a number of customers only use Webmail.

Update: 17:10: We turned webmail back on at 16:20 this evening. We've been monitoring it closely and so far we're happy with the performance. As of now this issue is finally resolved.

A few points to note:

1) If you used POP to collect mail and left it on the server, you'll have to re-download all your e-mail. This is unfortunately unavoidable.
2) We have moved away from Courier-IMAP to Dovecot. Dovecot does some very smart caching on the mail server and this appears to be doing great things for performance.
3) POP-before-SMTP is no longer supported. We appreciate that this might cause issues for customers but unfortunately we can't turn it back on.

We will post an update on our main company blog and here on the status blog with further information about this issue once we've had time to diagnose it fully and produce a report for the management team here.

All services should be functioning normally as of 16:20 this evening.