pemvzlin16 issue - all vps down


Notification Type: Emergency Maintenance

Date: July 21, 2011 8:59 AM

Service Affecting: Yes

Message


pemvzlin16 is down at the moment and we're looking at it to see what the problem might be. An engineer is on site with it now.

Update: 10:00: We are continuing to work on this server and hope to have it back online by approximately 11:00 AM. Further updates will follow.

Update: 11:20: This machine is still being worked on. We've run over our last ETA, sorry about that. We hope to get the node back up and running as soon as possible. The current situation is that we're waiting for a RAID array to rebuild. This is a time-consuming process, but we want to be sure that it completes before booting the machine.

Update: 12:00: Several attempts to run a manual disk check on this server have been unsuccessful. The data on the /vz partition, where the VPS servers reside, appears to be badly corrupted. At this juncture we're looking at doing a restore, which could take up to 48 hours to complete, or maybe longer, due to the sheer volume of data contained on the node. We're talking several TB of data.

If customers have their own offsite backups that they wish to restore, let us know and we can re-provision your VPS on another node. Otherwise you'll have to wait 48 hours, or possibly longer, before we will be able to get you back up and running. We apologise for this but it was outside of our control. RAID cards can sometimes do very weird things, and in this instance the card has somehow corrupted the filesystem on one of the arrays in this machine.

Update: 22nd July 14:16: Currently the restoration process is going OK. Our engineers have built a new array and begun restoring data onto it. We have encountered some problems with the data we are restoring, as there is some level of corruption. We have already raised this with the vendors of R1Soft CDP (the backup software we use) but are continuing to work on the data in the meantime. We will not have a better picture of the level of corruption, or of how much data is restorable, until tomorrow afternoon at the earliest. We hope to have another update at that time, but we are looking at a few days until relatively full service is restored.

Update: 23rd July 15:05: Unfortunately the file transfer process is taking longer than expected due to the volume of small files. We now hope to be able to fully install and test the recovered data sometime between 8 and 10 PM tonight. We can see that the level of corruption on files is low, around 10%, though we do not yet know if any vital files are corrupted. Once we are able to test the recovered data we will see if there are any further ways to recover or repair any corrupted files.

Update: 23rd July 23:15: Our engineers are now restoring the data in VPS format onto a new VPS node. This restore will run overnight. In the meantime, if you have provisioned a new VPS and/or pointed your domains elsewhere during this outage, please let us know by sending an email to our Support team (or replying to any ticket you may already have open). We will be sure to help you get your data onto your new VPS where possible, or get your domains pointed back to the restored data.

Update: 24th July 19:11: The restore has been running since last night and has been very slow due to the large number of small files involved.

The following VPS are affected:

78.153.211.72
78.153.208.216
78.153.211.82
78.153.211.85
78.153.211.90
78.153.211.97
78.153.209.218
78.153.210.170
78.153.211.100
78.153.211.105
78.153.211.106
78.153.211.108
78.153.211.111
78.153.211.114
78.153.211.122
78.153.210.184
78.153.211.128
78.153.211.109
78.153.211.139
78.153.211.140
78.153.211.142
78.153.211.143
78.153.209.206
78.153.208.56
78.153.208.144
78.153.208.201
78.153.208.220
78.153.208.232
78.153.211.169
78.153.211.179
78.153.211.181
78.153.208.31
78.153.211.214
78.153.208.215
78.153.209.18
78.153.209.127
78.153.209.195
78.153.210.45
78.153.210.219
78.153.210.238
78.153.211.11

UPDATE 25/07/11 - 9:54AM: We have fully recovered the data from all VPSs. The data is in good condition; however, we are currently unable to boot the old VPSs. We are trying to boot them from previous backups.

If you would like us to generate a new VPS for you and dump the data from your previous VPS into a folder on your new one we can do that easily. Please email support@blacknight.com if you would like us to do this.

In the interim we will continue to try to boot the old restores.

UPDATE 26/7/11 - 10:30AM: We've worked through the night to try and restore all the VPS to their former state. However, we can now say with some certainty that this is not going to be possible, and I'll explain why. Virtuozzo doesn't use a traditional image-based file system like Xen, KVM, Hyper-V, VMware etc. It has a template-based system with a master template for each OS type, e.g. CentOS. When a VPS gets installed, most of the system binaries etc. are symlinks which link to this template. The restore process didn't understand these symlinks and so was unable to restore them. As a result the VPS can't boot, because most of the operating system is missing.

Symlinks are pointers: to most applications they look like a real file, but at the file system level they point to the real file. When a symlinked file gets updated, say you upgrade Apache via yum or apt, the symlink gets replaced with the real file.
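To make the symlink behaviour described above concrete, here is a minimal Python sketch. It uses ordinary POSIX symlinks as a stand-in; it is not Virtuozzo's actual template mechanism, and the directory layout (template/centos, vps/101) and the file name httpd are hypothetical, chosen only for illustration.

```python
import os
import tempfile


def build_demo(root):
    """Create a shared template file and a VPS directory that links to it."""
    template_dir = os.path.join(root, "template", "centos")  # hypothetical layout
    vps_dir = os.path.join(root, "vps", "101")                # hypothetical VPS id
    os.makedirs(template_dir)
    os.makedirs(vps_dir)
    real = os.path.join(template_dir, "httpd")
    with open(real, "w") as f:
        f.write("template binary v1\n")
    link = os.path.join(vps_dir, "httpd")
    os.symlink(real, link)  # the VPS holds only a pointer to the template file
    return real, link


def upgrade(link, new_content):
    """Replace the symlink with a private real file, as a package upgrade would."""
    os.remove(link)  # removes the link itself, not the template file
    with open(link, "w") as f:
        f.write(new_content)


def read(path):
    with open(path) as f:
        return f.read()


if __name__ == "__main__":
    real, link = build_demo(tempfile.mkdtemp())
    print(os.path.islink(link), read(link).strip())  # a link, showing template content
    upgrade(link, "upgraded binary v2\n")
    print(os.path.islink(link), read(link).strip())  # now a real file with new content
    print(read(real).strip())                        # the template is unchanged
```

A restore tool that skips or mishandles such links would leave the VPS directory without those files at all, which matches the failure described above.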

It's not possible to fix these VPS servers right now, and we believe our efforts are better spent restoring your data in a new VPS and helping you to get everything back online. We do have Parallels working with us to try to restore the automatic backups that we perform each night, but we haven't had any success with this yet. If this proves fruitful we'll let everyone know.

All customer data, including modified files, web pages, email, databases, log files etc., is restorable, and we've been creating new VPS for all customers and putting the old data in them.

To reassure everyone at this stage: we're going to do better backup checking in future. We will perform weekly and daily test restores of all VPS hardware nodes to ensure that a) the backups are working properly and b) the restored data works as we would expect. We will also most likely be discontinuing this product line. By this I mean we're simply going to replace it with a better product, one that will be more flexible and will actually operate like a real server, e.g. Xen, KVM or Hyper-V. The new product won't have any single points of failure and will be "Cloud" based, i.e. it'll be clusters of hardware nodes, and no single VM will live on a single RAID array; rather, it'll be stored on our new cloud storage platform.
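As a rough illustration of what an automated test-restore check could look like, the Python sketch below compares a restored directory tree against its source and reports missing files, changed content, and symlinks that did not come back as symlinks. It is only a sketch under stated assumptions: the function names sha256_of and verify_restore are hypothetical, and it says nothing about how R1Soft CDP or Virtuozzo backups are actually verified.

```python
import hashlib
import os


def sha256_of(path):
    """Hash a file in chunks so large files do not need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_restore(source, restored):
    """Return a sorted list of (relative_path, problem) for bad restores."""
    problems = []
    # os.walk does not follow symlinks by default, so links are checked as links.
    for dirpath, dirnames, filenames in os.walk(source):
        for name in dirnames + filenames:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, source)
            dst = os.path.join(restored, rel)
            if os.path.islink(src):
                if not os.path.islink(dst):
                    problems.append((rel, "symlink missing or not a symlink"))
                elif os.readlink(src) != os.readlink(dst):
                    problems.append((rel, "symlink target differs"))
            elif os.path.isdir(src):
                if not os.path.isdir(dst):
                    problems.append((rel, "directory missing"))
            elif not os.path.isfile(dst):
                problems.append((rel, "file missing"))
            elif sha256_of(src) != sha256_of(dst):
                problems.append((rel, "content differs"))
    return sorted(problems)
```

An empty result means every entry in the source tree was found in the restored tree with the same type and content. A check like this only confirms the files match; booting the restored VPS would still be needed to confirm it actually works.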

UPDATE 26/7/11 - 11:30AM: Some customers have asked how we are providing them with their data. Basically, we'll give you a new VPS and drop the restored data into a folder for you. If you haven't already contacted support, please do so immediately. We are re-creating VPS servers at customers' request, and we will place your old data in /restored/ on the VPS once it is created. Also, if you are on Ubuntu 9.x or lower, we'll create the new VPS using 10.04 LTS.


5 Comments

Hey Paul,

Sounds like a pain in the proverbial. Best of luck with getting back up ASAP. Just to let you know, 211.71 is affected also.

Cheers,
Alastair.

Juyong Kim on July 22, 2011 1:37 PM

Hi

Are there any updates on this?

Our clients are getting quite irate now.

Thanks

Can you please estimate how long it will take to complete? I.e., if it is 50% done and took 48 hours, you should be able to estimate how long it will be.

And I am getting irate too.

Shane

Hi,

While I appreciate this update, I still don't know what you plan to do about it.

When and how and where will my data be restored?

Shane

Shane

Please contact our support desk.

Regards

Michele
