We are currently experiencing load problems on two MySQL nodes: mysql870 and mysql873.
UPDATE 15:50: It seems there was a memory leak on the hardware node. We've got that under control and both MySQL servers are behaving as expected now. We have a ticket open with our vendors to see if this can be fixed.
We have had to temporarily shut down pemlinweb32 and pemlinweb33 as the hardware node they're on is having issues. The RAID is currently rebuilding, and while the pemlinwebs were running, the rebuild was taking far too long and significantly affecting the performance of the web servers. By stopping the servers, we're able to rebuild the RAID a lot faster and get the web servers working properly again.
UPDATE 00:30: This has been completed.
PIR, the .org registry, is conducting scheduled maintenance on 19 February 2011 between 15:00 and 19:00 UTC.
During this period, whois, updates and new registrations will not be available.
Existing .org domain names will continue to resolve as normal.
The above servers became unresponsive and have been rebooted. They are currently running a quota check, which may take some time to complete. Once it completes, the servers will return to normal.
Summary: This morning we had a brief outage on our LDAP servers, which caused authentication issues and some issues for people trying to send e-mail.
It started at 10:35:08 and ran until 10:42:10. This issue happens from time to time under high load, when LDAP's indexes become corrupt. Typically it only lasts a couple of minutes. There are newer versions of OpenLDAP where the issue is fixed, however the upgrade isn't currently supported by our software vendor. We've been discussing this with them for some time.
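For reference, the length of the outage can be worked out from the two timestamps above. This is only a minimal sketch in Python: the notice doesn't give a date, so only the times of day are used, and the variable names are purely illustrative.

```python
from datetime import datetime

# Times of day taken from the notice above; no date is given there,
# so strptime's default date is used for both and cancels out.
FMT = "%H:%M:%S"
start = datetime.strptime("10:35:08", FMT)
end = datetime.strptime("10:42:10", FMT)

outage = end - start
minutes, seconds = divmod(int(outage.total_seconds()), 60)
print(f"Outage lasted {minutes} min {seconds} s")  # Outage lasted 7 min 2 s
```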
From 19:00 to 20:00 this evening our Helpdesk will be down while we upgrade the backend software. As part of this, all inbound e-mail will be queued for the duration of the maintenance period.
Out-of-hours support for dedicated and colo customers will still be available on the on-call number.
The MySQL servers at 81.17.254.34/172.16.4.247 and 81.17.254.35/172.16.4.248 are currently down. The load got so high that no access was possible and we were forced to reboot the hardware node. The server is on the way back up, but we're waiting for a file system check to complete.
ETA is currently about 15 to 20 mins.
Update 18:05: Both servers are back up and running.
Summary: This evening we're going to go ahead and do the work we said that we'd do in http://blacknig.ht/18d .
We'll start at 19:00 and it shouldn't take longer than an hour or so.
Update 20:00: The mail is syncing between the old and new storage. It's about 50% done. We suspect it won't take more than another hour or so.
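The "another hour or so" figure is a simple linear extrapolation, and it can be sketched as below. This is a rough illustration only: it assumes the sync began at 19:00 and copies data at a constant rate, neither of which this notice confirms, and the function name is hypothetical.

```python
def estimate_remaining_minutes(elapsed_minutes: float, fraction_done: float) -> float:
    """Linearly extrapolate the time left from the progress so far.

    Assumes a constant transfer rate, which real mail syncs rarely keep.
    """
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in the range (0, 1]")
    return elapsed_minutes * (1 - fraction_done) / fraction_done


# Assumed inputs: 60 minutes elapsed (19:00 to 20:00) and about 50% done.
remaining = estimate_remaining_minutes(60, 0.5)
print(f"Roughly {remaining:.0f} more minutes")  # Roughly 60 more minutes
```

An estimate like this is only as good as the constant-rate assumption, so it is best treated as a lower bound while mail is still arriving.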
Update 20:19: FYI, this affects all Qmail services, so pop/imap and smtp, including pop33r.cp.blacknight.com etc.
Update 20:45: This isn't quite complete yet. It might run past 21:00 GMT and if it does we'll roll back the changes and try again later tonight. We'll leave the sync running while mail continues to flow and put pop3/imap services back live.
Update 21:00: We've backed out of the move for now. We thought that the sync of 24 hours of data would only take 45-60 minutes, but unfortunately it's still going. We'll go again in a few hours when fewer people are waiting on e-mail.
Update Saturday 5th @ 00:05: The data sync has been ongoing for the past few hours and is almost complete. At this stage we're going to block inbound SMTP e-mail and leave pop3/imap alive until we're ready for the final move to the new platform. By disabling inbound SMTP, we won't have to re-sync all the newly arrived e-mail from scratch again.
Update 02:20: The data sync is complete. People may find the odd e-mail from earlier this evening marked as unread, or it may download again via pop3. This was unavoidable. The good news is that we've moved over to the new storage platform and so far it's performing far better, although this is a quiet time for the cluster.
Webmail/pop/imap/smtp access has now been fully restored.
Summary: In order to provide a better quality of service to our customers we've decided to step up the installation of the new mailstorage system. We're going to go ahead and do it this evening.
What and When: At 22:30 we'll shut down inbound e-mail, imap and pop3 access to the qmail cluster. We'll spend around 15-30 minutes confirming that the configuration is synchronized between the two machines, and finally we'll run one last sync of the data. We kicked off the restore of last night's backup onto the new platform, which has taken 10 hours or thereabouts to complete. We're hoping this will bring huge improvements to e-mail delivery, imap and pop access and especially webmail access.
We've an upgrade of the webmail planned for the near future to a newer version of Atmail, which has better caching support for folders and e-mail.
Update 23:30: Because the restore has not completed on time, we'll postpone this until tomorrow night around the same time. We'll post a fresh maintenance window for it tomorrow morning.
Summary: Due to a large volume of inbound e-mail from certain sources, there's currently a delay of around 15-20 minutes on inbound e-mail. Outbound is unaffected at this time.
This should clear itself up by around 13:30-14:00. In the meantime, services continue to function, just a little slower than normal.