Thursday, January 8, 2009

Death by Snapshot

One of the greatest features in VMware's many virtualization technologies is the ability to take snapshots of virtual machines while they are powered on. Put simply, a VMware snapshot saves the system state of the virtual machine, providing a restore point. This technology is available and works very well on a variety of VMware platforms including Workstation, Server, and ESX. Unfortunately this feature carries its own risk to the performance and stability of your environment. In this entry I will discuss the destructive event of a snapshot consuming the remaining available space on a datastore.



A recovery outline appears at the end of this post.


How snapshots work

In preparing for this entry I came across an extremely helpful post by Eric Siebert where he very clearly explains the many aspects of snapshots and how they work. I'll summarize a few important points in this section but highly recommend reading his full post.


Snapshots can be initiated and managed through either the VI Client or the command line. Once a snapshot is taken, the base VMDK becomes read-only and all changes are written to a new file. Each new differential (delta) file starts at 16MB and grows in 16MB increments as changes are made to the VM, but naturally cannot exceed the size of the original base file. These files will continue to grow until the snapshot is committed (deleting the snapshot applies its changes back to the base VMDK) or disk space is depleted, which makes for a stressful afternoon.
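To make the growth rule concrete, here is a minimal Python sketch of the arithmetic. It assumes only what is described above: the delta file starts at 16MB, grows in whole 16MB steps, and is capped at the size of the base disk. The function name and parameters are my own, not anything VMware provides.

```python
import math

INCREMENT_MB = 16  # starting size and growth step described above


def delta_file_size_mb(changed_mb, base_disk_mb):
    """Approximate on-disk size of a snapshot delta file, in MB.

    Assumption: the file starts at one 16MB increment, grows one whole
    increment at a time, and never exceeds the size of the base disk.
    """
    increments = max(1, math.ceil(changed_mb / INCREMENT_MB))
    return min(increments * INCREMENT_MB, base_disk_mb)


# 100MB of changed blocks on a 20GB (20480MB) base disk
print(delta_file_size_mb(100, 20480))  # prints 112
```

This is a simplification that ignores any file overhead, but it shows why a busy server can quietly eat a datastore: the only ceiling is the size of the base disk.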


Drive space is gone!

I have seen it happen several times. Someone, or something, initiates a snapshot in your ESX environment and forgets to remove that snapshot when the task is complete. In my experience this typically becomes a problem when backup software uses snapshots and a snapshot isn't merged back. Using snapshots this way is common for backing up VMs, since it places the full disk in a read-only state so that the copy can proceed without interrupting the machine's ability to operate.


A live snapshot delta file for most VMs is unlikely to grow very quickly, except on servers with higher amounts of disk I/O such as Exchange, SQL, or file shares (especially when using Windows Volume Shadow Copy). If a growing snapshot file isn't discovered and either committed or discarded, it can consume the remaining available storage. Once this happens the VM with the active snapshot can no longer write its changes to the delta file, causing the server to stop. Additionally, any other VMs that need to write new data to the filled datastore will also be forced to shut down. Fortunately, any virtual machines on the datastore that have neither a snapshot nor an active swap file will continue to run.
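Since the real danger is the datastore running out of room, a free-space check is a cheap early warning. The sketch below is only an illustration: it assumes the datastore is visible as a mounted filesystem path wherever the script runs, and the 10% threshold and function names are hypothetical choices of mine.

```python
import shutil


def free_fraction(path):
    """Return the fraction of the filesystem at `path` that is still free."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.free / usage.total


def is_low_on_space(path, min_free_fraction=0.10):
    """True when free space falls below the chosen threshold (assumed 10%)."""
    return free_fraction(path) < min_free_fraction
```

Run on a schedule against each datastore path, a check like this gives you a chance to find the growing delta file before the VMs stop.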


Note: If you are dealing with a single snapshot then no additional space is required to commit that snapshot to the original VMDK file, but I personally feel better when I have some extra room to move.


Make Room

If you are like me and have storage claustrophobia, you will want to make some room on your datastore. If the VM is a critical server it would be advisable to move another VM to a different datastore, so that the afflicted VM can be restarted more quickly and you avoid any risk from trying to move a VM with a snapshot. I've seen mixed information about migrating with snapshots and its feasibility. I'll leave that decision to the reader in their own situation, but my suggestion is to play it safe and move another system that has no running snapshots.


Storage VMotion immediately comes to mind for this situation. Unfortunately, as Chad Sakac clearly explains in his blog post, Storage VMotion requires creating a snapshot in order to operate. Consequently, with no drive space, we're left with the horrid task of intentionally taking down a server to make room. In my situation we took down our intranet server since it was one of the smaller servers, would take the least amount of time to migrate, and would cause the least impact on employee productivity. The time to complete this task will depend on several factors, particularly the type and speed of storage that you are using. It took us about 15 minutes to migrate our server to a new datastore.


Apply Snapshots

Once drive space has been created you should be allowed to start up the VM and commit the snapshots. Don't forget to turn the migrated server back on!


I have yet to find a consensus on whether applying the snapshot on a live machine is better than leaving it powered off, but if the server in question is a main production box then it may be worth giving it a shot. You can expect the process to take longer on a live machine. Either way, the server will certainly be performing better than it was a few minutes ago!


Applying the snapshots can be a long, grueling ordeal depending on the amount of space you have consumed. Be patient and do not be surprised if you see your task time out in the VI client. VirtualCenter will time out any task at 15 minutes, but your process will still be running. Check whether the process is complete by keeping an eye on the datastore browser in the VI client. You are looking for the delta files to disappear, and you will also see storage become available in your datastore again. You may need to hit refresh occasionally in order to witness the disappearing files.


Over the course of 3 days our environment accumulated more than 70GB of snapshot files, which took approximately 90 minutes to apply. Eric Siebert speaks to this in the second part of his snapshot post where he states that "A 100 GB snapshot can take 3-6 hours to merge back into the original disk." Suffice it to say that the larger the snapshot, the longer it will take. The more storage "cushion" you have on the datastore, the larger a forgotten snapshot can grow, and the greater your risk of a long wait.
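If you want a rough idea of how long you will be waiting, you can scale from a known data point. The sketch below assumes commit time grows linearly with snapshot size, which is my simplification and not something either source claims; real times depend heavily on storage speed and load. The function name is hypothetical, and the only figures used are the ones quoted above.

```python
def estimate_commit_minutes(snapshot_gb, reference_gb, reference_minutes):
    """Scale a known commit time linearly to a different snapshot size.

    Assumption: time is proportional to size. Treat the result as a
    ballpark figure only.
    """
    if reference_gb <= 0:
        raise ValueError("reference_gb must be positive")
    return snapshot_gb * reference_minutes / reference_gb


# Half the size of our 70GB / 90 minute commit
print(estimate_commit_minutes(35, 70, 90))  # prints 45.0

# Range for a 70GB snapshot using the quoted 100GB in 3-6 hours figure
low = estimate_commit_minutes(70, 100, 180)
high = estimate_commit_minutes(70, 100, 360)
print(f"{low:.0f} to {high:.0f} minutes")
```

Our 90 minutes came in under the range the quoted figure suggests for 70GB, which is a good reminder that these numbers vary widely between environments.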


Recovery

Once the snapshots have merged back into the original file you can get the VM back up and running (if you haven't done so already) and then Storage VMotion the server you moved previously back to its original datastore, if that feature is available in your environment.


Prevention

VMware does not provide any tools natively for monitoring active snapshots in your ESX environment. Third-party applications are available to help automate the process of finding these active snapshots. I have not personally used them yet, but Jason Boche mentions a few of them in his blog, where he briefly shows Xtravirt SnapHunter, RVTools, and Hyper9.


I will probably get my hands on a couple of these in the coming weeks and will certainly provide some posts. If you are in a fix to get some monitoring on your snapshots it looks as though SnapHunter can notify you via email when you have snapshots or even commit them if you so choose.


If you want to go low tech and only manage a few machines, you can check for snapshots by looking in the VI client or keeping an eye out for delta files in the datastore browser. Be vigilant regardless of your method for tracking active snapshots. It certainly doesn’t look good to the bosses when your highly robust ESX environment fails your company, especially when it can be easily prevented.
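The low-tech check for delta files is easy to script. Below is a minimal Python sketch that walks a directory tree and lists every file ending in -delta.vmdk, largest first. It assumes the script runs somewhere the datastores are visible as ordinary directories (on ESX they are mounted under /vmfs/volumes) and that a Python interpreter is available there. The function name is my own.

```python
import os


def find_delta_files(root):
    """Return (path, size_in_bytes) for each snapshot delta file under root.

    Results are sorted largest first so the worst offender is on top.
    """
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith("-delta.vmdk"):
                path = os.path.join(dirpath, name)
                found.append((path, os.path.getsize(path)))
    return sorted(found, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Assumed mount point for datastores; adjust for your environment.
    for path, size in find_delta_files("/vmfs/volumes"):
        print(f"{size / 1024**3:8.2f} GB  {path}")
```

Scheduled to run daily with the output mailed to you, something like this is a poor man's version of what the tools above automate.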


Despite the agony that can be caused by an unchecked snapshot, VMware's snapshot feature is a true saving grace for the administrator and should be used without too much trepidation. The ability to apply a patch, test a deployment, or change a configuration and then quickly revert the system is more than I'd be willing to give up. Just keep your eyes open to the snapshots that are out there and everything should run smoothly and optimally, which the bosses definitely appreciate.


Recovery Outline

  1. Identify the server(s) affected and determine priority on bringing them back online.
  2. Shut down and cold migrate another virtual machine from the filled datastore to a new location. This is not always necessary if you have only a single snapshot, since applying a single snapshot requires no additional disk space.
  3. Apply snapshots to the affected server. You may power on the server first if you prefer, but this will have an adverse effect on performance and cause this step to take longer.
  4. Be patient. A 100GB snapshot could take 3-6 hours to commit. VirtualCenter will time out your task after 15 minutes, so don't panic.
  5. Monitor the Datastore Browser in the VI client and wait for the delta file(s) to disappear. You will likely need to refresh occasionally, which may take a moment to process each time.
  6. Once the snapshot is committed you can safely turn on the VM (if you haven't already) and hopefully breathe a sigh of relief.
  7. Play it safe and set up a system for monitoring your VMs for active snapshots, using either an automated tool like SnapHunter from Xtravirt or simply watching for delta files in the datastore browser.
