VMware ESXi CBT Bug – Your data may be at risk

Changed Block Tracking (CBT) is a great feature, introduced in ESXi 4.0, that effectively brought incremental backup technology to our virtual machines. Third parties could leverage it so that backups only need to process the blocks that have changed rather than every block.

This saved inordinate amounts of time, data and effort while allowing vendors to support flexible backup routines.

That was cool, until some of your data went AWOL.

VMware have quietly released a knowledgebase article (KB: 2090639) relating to a CBT issue that may compromise recoverable data under the following conditions:

  • CBT is enabled
  • A virtual machine disk (vmdk) has been extended to a size greater than 128GB
  • An incremental backup is taken using the specific QueryChangedDiskAreas method
  • The VM is restored from an incremental backup

All of these conditions are common, not edge cases. Attempting to restore affected vmdks can result in corruption and data loss. That’s a lot of potential for data loss, over a long period. How will this affect the “Incremental forever” and “Synthetic Full” technologies? Time will tell.

All versions since CBT was implemented are affected. Yes, that’s ESXi 4.x through to the current 5.x build, and guess what? There’s no fix yet.

While a fix is being developed (and let’s be honest, we have no idea what that might entail), I think it’s prudent to start looking at reducing exposure and mitigating risk. I take data integrity very seriously, and will always take precautions.

It’s been stated (see the knowledgebase) that disabling and then re-enabling CBT for affected VMs will draw a “line in the sand”, so once a full backup has been taken after the change, data should be OK.

Phew.

There are numerous PowerCLI scripts to reset CBT on VMs, but they seem to target every VM in a datacenter, cluster and so on. In even moderately sized environments that can cause (what I deem to be) unnecessary resource usage, the resources being time, disk space and compute cycles. I’d prefer to be precise and only change objects that require changing.

With that in mind, I tweaked existing scripts to only change the CBT settings for VMs that a) have CBT enabled already and b) have vmdks larger than 120GB (to be safe until the total size/scope is confirmed).
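Before changing anything, it can help to see which VMs those two criteria would actually pick up. The following is a minimal, read-only sketch and not the script from this post: it assumes an existing PowerCLI session (opened with Connect-VIServer), and $thresholdGB is a name chosen here purely for illustration.

```powershell
# Read-only report: VMs with CBT enabled AND at least one disk over the threshold.
# Assumes you are already connected via Connect-VIServer.
$thresholdGB = 120   # hypothetical variable; the post's safety margin below 128GB

Get-VM |
    Where-Object { $_.ExtensionData.Config.ChangeTrackingEnabled } |
    ForEach-Object {
        $vm = $_
        # Disks on this VM that exceed the threshold
        $bigDisks = @($vm | Get-HardDisk | Where-Object { $_.CapacityGB -gt $thresholdGB })
        if ($bigDisks.Count -gt 0) {
            New-Object PSObject -Property @{
                Name          = $vm.Name
                BigDiskCount  = $bigDisks.Count
                LargestDiskGB = ($bigDisks | Measure-Object -Property CapacityGB -Maximum).Maximum
            }
        }
    }
```

Nothing is modified here, so it is safe to run as a dry run ahead of any reset.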

*** USUAL WARNINGS *** Before you run this: Make sure there are no backups running and delete/commit any snapshots. Run at your own risk and test before executing.

The script will only DISABLE CBT. It commits the change by “stunning” the VM, that is, by creating and then removing a snapshot.

The process executes against each VM sequentially, and you will see a storm of task activity.
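For readers who cannot see the embedded script, here is a minimal sketch of the approach described above, under stated assumptions rather than the exact script used: it assumes an existing Connect-VIServer session, and $thresholdGB and the snapshot name 'CBT-reset-stun' are illustrative names. VMs that already have snapshots are skipped with a warning, in line with the advice to commit snapshots first.

```powershell
# Sketch: disable CBT only on VMs that have it enabled and a disk > threshold.
# Assumes an existing PowerCLI session (Connect-VIServer) and no running backups.
$targetvms   = Get-VM
$thresholdGB = 120                      # hypothetical name; margin below 128GB

# Config spec that switches change tracking off
$spec = New-Object VMware.Vim.VirtualMachineConfigSpec
$spec.ChangeTrackingEnabled = $false

foreach ($vm in $targetvms) {
    # Skip VMs where CBT is not enabled
    if (-not $vm.ExtensionData.Config.ChangeTrackingEnabled) { continue }

    # Skip VMs with no disk above the threshold
    $bigDisks = @($vm | Get-HardDisk | Where-Object { $_.CapacityGB -gt $thresholdGB })
    if ($bigDisks.Count -eq 0) { continue }

    # Existing snapshots should be committed first, so leave these VMs alone
    if (Get-Snapshot -VM $vm) {
        Write-Warning "Skipping $($vm.Name): existing snapshot(s) found"
        continue
    }

    # Apply the setting, then stun the VM with a snapshot create/remove cycle
    $vm.ExtensionData.ReconfigVM($spec)
    $snap = New-Snapshot -VM $vm -Name 'CBT-reset-stun'
    Remove-Snapshot -Snapshot $snap -Confirm:$false
    Write-Host "CBT disabled on $($vm.Name)"
}
```

Test it against a single non-critical VM first by setting $targetvms to, for example, Get-VM -Name with a test VM's name.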

If you want to further limit the scope, change the $targetvms variable accordingly. For example, to limit to a cluster:

To limit to a resource pool:

…you get the idea.
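A one-line sketch of the cluster case, assuming a connected PowerCLI session; 'Cluster01' is a placeholder name, not one taken from this environment:

```powershell
# Only VMs in one cluster ('Cluster01' is a placeholder)
$targetvms = Get-Cluster -Name 'Cluster01' | Get-VM
```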
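Similarly, a sketch for the resource pool case, again assuming a connected session and using 'Pool01' as a placeholder name:

```powershell
# Only VMs in one resource pool ('Pool01' is a placeholder)
$targetvms = Get-ResourcePool -Name 'Pool01' | Get-VM
```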

Note that after this change is made, the next backup run will effectively be a full backup, as the changed-block records will be empty.

It seems to be common practice for backup vendors to automatically enable CBT (if it’s not already enabled) during VM backups. This is confirmed for Commvault, Veeam and others.

If not, and you feel you need to enable it again manually, run this prior to your backup to enable CBT.
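A minimal sketch of the manual re-enable, mirroring the disable approach and making the same assumptions (existing Connect-VIServer session, no running backups, illustrative snapshot name). It is not necessarily identical to the script originally embedded here.

```powershell
# Sketch: re-enable CBT on the chosen VMs where it is currently disabled.
# Assumes an existing PowerCLI session (Connect-VIServer).
$targetvms = Get-VM     # narrow this to the VMs you reset earlier

# Config spec that switches change tracking on
$spec = New-Object VMware.Vim.VirtualMachineConfigSpec
$spec.ChangeTrackingEnabled = $true

foreach ($vm in $targetvms) {
    # Nothing to do if CBT is already on
    if ($vm.ExtensionData.Config.ChangeTrackingEnabled) { continue }

    # Existing snapshots should be committed first, so leave these VMs alone
    if (Get-Snapshot -VM $vm) {
        Write-Warning "Skipping $($vm.Name): existing snapshot(s) found"
        continue
    }

    # Apply the setting, then stun the VM so the change takes effect
    $vm.ExtensionData.ReconfigVM($spec)
    $snap = New-Snapshot -VM $vm -Name 'CBT-enable-stun'
    Remove-Snapshot -Snapshot $snap -Confirm:$false
    Write-Host "CBT enabled on $($vm.Name)"
}
```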

Spread the word. Lost data is not fun for anyone.

References:

Veeam thread (WIP): http://forums.veeam.com/vmware-vsphere-f24/vmware-cbt-bug-kb-2090639-t23757.html

Veeam have now automated a workaround into V8 of B&R.

PowerCLI disable CBT script foundation: http://www.itwalkthru.com/2012/03/disabling-or-enabling-vmware-change.html
