Notification Type
Emergency Maintenance
Date
September 8, 2011 8:20 PM
Service Affecting
Yes
Message
Summary: In order to get the system stable, we're taking all services offline for about 1 hour to make some configuration changes that we hope will speed things up. All services are affected, but no e-mail should be lost during this time.
Update: 21:24: The NFS server that provides file storage for e-mail is currently running a disk check, which will take some time. It's currently at around 36%, and I estimate another 60-80 minutes before it completes. That means mail should be back up and running at around 22:30 - 22:50, give or take. I'll post further updates if it looks like it won't be back by then.
Update: 22:14: I can tell from the progress that this probably won't be completed by 22:50 like I thought it would be. It's at 48.4%. I don't expect 70% to 100% to take too long, but getting from 48% to 70% could take another 1 hour 30 minutes. I'll post another update in 45-60 minutes with an ETA.
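For anyone curious how a rough figure like the 1 hour 30 minutes above can be arrived at, here is a minimal sketch. It assumes the disk check progresses at a constant rate between two readings, which is only an approximation (the 21:24 reading was "around 36%", and fsck phases do not all run at the same speed). The function name `minutes_to_reach` is made up for illustration; it is not necessarily how the estimate in this update was produced.

```python
from datetime import datetime, timedelta


def minutes_to_reach(t1, p1, t2, p2, target):
    """Linearly extrapolate how many minutes after t2 progress reaches target (%).

    Assumption: progress between the two readings continues at the same rate.
    """
    elapsed = (t2 - t1).total_seconds() / 60.0
    rate = (p2 - p1) / elapsed  # percent per minute
    if rate <= 0:
        raise ValueError("progress must increase between readings")
    return (target - p2) / rate


# The two progress readings quoted in the updates above.
t1 = datetime(2011, 9, 8, 21, 24)  # around 36%
t2 = datetime(2011, 9, 8, 22, 14)  # 48.4%

remaining = minutes_to_reach(t1, 36.0, t2, 48.4, 70.0)
eta = t2 + timedelta(minutes=remaining)
print(f"about {remaining:.0f} minutes to reach 70%, i.e. around {eta:%H:%M}")
```

With these two readings the sketch gives roughly 87 minutes to get from 48.4% to 70%, which is in line with the estimate above. Treat any number like this as a rough guide only, since the rate can change as the check moves through different parts of the file system.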
Update: 23:05: The disk check is at 87.5% now as I type. E-mail services should be back online before 00:00 (midnight!).
To answer some users' questions: the maintenance earlier should have taken 3 minutes (i.e. a reboot), however an unforeseen problem occurred and a disk check of the entire mail store ran. There is literally no way to bypass this. So no, it wasn't necessary, however we felt it important to have mail back as stable as possible ASAP, and this work should help.
Update: 23:20: OK, unfortunately the disk check failed at around 89% and has to be rerun with an extra flag that will fix the errors it finds. While this will take another few hours, it won't take as long as the initial check.
Also, we're putting a plan of action together to have a new storage system in place by very early next week, possibly before Monday, time permitting. More updates will be posted as we have more information available.
Update: 00:10: The second run of the disk checker is progressing, albeit slower than I would like, but I can't influence its speed. It's approximately 1/3 of the way through now.
For people wondering why it's not in a failover situation: mail is very disk intensive, so at the moment there is one NFS (network file system) server that houses this data, with very fast disks and loads of RAM to cache files etc. The file system on the server needed a disk check because many millions of files have been written, deleted and rewritten since its last reboot. This is a check that is forced on file systems to keep them intact. The reason there's no failover in this instance is that the file system isn't fully healthy. Highly available NFS services are not easy to do right, and you would lose performance. In the 4 years we've been running Qmail we've upgraded the storage platform twice, and we're about to go for a third. Thankfully the third will be its final upgrade, as the SAN we're moving it to can scale in both size and performance.
Update: 01:10: We're about 50% of the way through fsck number 2, and we haven't hit any of the corrupt files from the previous run yet. I don't expect that until around 89-90%. The ETA at this stage is around 3am. I'll post one more update between now and 3am.
Update: 02:30: The mail storage system is back online, as is the Qmail cluster itself. Only about half a dozen files or directory entries were corrupt, and these were found on the second pass of fsck, which deals with directories etc. Thank you for your patience. This will be the final update for this issue.