FTF Veeam v6 Replication – Optimizing Snapshots
You don't really know your VMs until you try to back up or replicate them on a schedule. You may think you do, but you really don't. I'll even broaden the claim: You don't really know your virtual infrastructure (V.I.) until you start creating and committing snapshots on a schedule.
Fact is, if you want cost-effective replicas of your VMs for disaster recovery fail-over using hypervisor-based technologies, you are about to ask your V.I. to “turn your head and cough”. Reliable, fast and efficient replication in a minimal window starts with good health in your storage, hosts, and virtual machines. You will probably have to make some changes to your architecture and design. As I've pointed out before, few planned and then actually built their virtual datacenter with a specific tool for disaster recovery in mind. The good news is that by optimizing snapshots you will probably fix many of those nagging performance issues you've struggled with for a long time.
I know. I know. I and others have said in the past that snapshots in your virtual datacenter are a bad idea. Well, the context of the discussion before was not to leave running snapshots open. That is not the context here. In fact, hypervisor-based replication should never leave behind snapshots in your environment. Veeam Backup and Replication does not leave behind snapshots, nor do any of the alternatives that I know about. Sure, things can go wrong, but that is usually because of abnormal events in the environment (connectivity loss, Veeam architecture server reboots during jobs, etc). Things go wrong with snapshots without Veeam in the mix all the time. More on this later in the post.
BTW, Veeam is also smart enough to clean up left-behind snapshots from previous jobs, and smart enough to leave alone a snapshot that Veeam did not create. There is never a “delete all” command issued to the hypervisor. If a snapshot already exists then Veeam does its thing with a new snapshot without touching the existing one.
The point is snapshots play a major role in hypervisor based replication and you can optimize their performance. An unhealthy snapshot process means that the replication job will take excessively long or will fail. This post explores what can be done.
Snapshot Best Practices
Both VMware and Microsoft have snapshot best practices. For starters, both vendors plainly state in almost the same words: “do not use snapshots as backups”. In fact, both state that the issue with snapshots is not creating or using them, but the risk of filling up a datastore and of committing the data accumulated inside a snapshot back to the production VM.
We do not recommend using virtual machine snapshots as a permanent data or system recovery solution. Even though virtual machine snapshots provide a convenient way to store different points of system state, data, and configuration, there are some inherent risks of unintended data loss if they are not managed appropriately. A backup solution helps provide protection that is not provided by snapshots.
Since Veeam does not use Hyper-V snapshots to back up or replicate VMs, I will focus on VMware for the rest of this post.
Use no single snapshot for more than 24-72 hours.
This prevents snapshots from growing so large as to cause issues when deleting/committing them to the original virtual machine disks. Take the snapshot, make the changes to the virtual machine, and delete/commit the snapshot as soon as you have verified the proper working state of the virtual machine.
Be especially diligent with snapshot use on high-transaction virtual machines such as email and database servers. These snapshots can very quickly grow in size, filling datastore space. Commit snapshots on these virtual machines as soon as you have verified the proper working state of the process you are testing.
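The 24-72 hour guidance quoted above is easy to check against an inventory list. Here is a minimal Python sketch, assuming you already have a list of open snapshots with their creation times from your own inventory export. The record layout, the function name `stale_snapshots`, and the sample VM names are hypothetical and only for illustration; nothing here talks to vSphere or Veeam.

```python
from datetime import datetime, timedelta

# Hypothetical record layout: (vm_name, snapshot_name, created_at).
# Real data would come from your own inventory export.
def stale_snapshots(snapshots, now, max_age_hours=72):
    """Return snapshots that have been open longer than max_age_hours, oldest first."""
    limit = timedelta(hours=max_age_hours)
    stale = [s for s in snapshots if now - s[2] > limit]
    return sorted(stale, key=lambda s: s[2])

if __name__ == "__main__":
    # Made-up sample data for illustration only.
    now = datetime(2012, 3, 10, 12, 0)
    inventory = [
        ("mail01", "pre-patch", datetime(2012, 3, 5, 9, 0)),
        ("sql01", "test-upgrade", datetime(2012, 3, 10, 8, 0)),
    ]
    for vm, name, created in stale_snapshots(inventory, now):
        print(f"{vm}: snapshot '{name}' open since {created}")
```

The 72-hour default is the upper end of the quoted range; for email and database VMs you may want to pass a much lower `max_age_hours`.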
Read the entire best practices for both hypervisors using the links above.
Problems when using snapshots
VMware has a KB article that lists 9 things that can cause headaches with snapshots. Here's a quick summary in my own words: Snapshots are special virtual disks that need space on your datastores, and if you have permissions to create and commit them they will cause extra I/O that might have a negative impact. Make sure you have resources to support them.
Review the KB article for yourself, and, especially if you have had problems in the past, manually create and commit some snapshots to get an idea of where you stand before you start to replicate VMs.
Improving snapshot commit times to minimize replication windows
After you've got your snapshots to work and have successfully run your replication jobs, examine the replication job report to see the total time the snapshot creates and commits take. With Veeam you can click on each VM in the Statistics window and find the lines where Veeam is waiting for the snapshots. You could also look at the tasks in the vSphere Client to see the snapshot progress.
Excessive time to commit a VMware snapshot, especially with a database or messaging VM, is often the result of too much I/O for the underlying datastore. You've got the normal production I/O of the VM combined with the extra I/O of committing the changes in the 0001.vmdk (for example) made during the time the snapshot was open (the job).
If you are using vSphere 4 or earlier, the snapshot is actually created on the datastore where the VM's .vmx file lives. I've seen the time it takes to commit snapshots drastically reduced just by moving the VM's .vmx file to another datastore. Oversimplified, you are then dividing the production and snapshot I/O across the different datastores. If you have different spindles and controllers for the different datastores, performance gets even better. Beware – make sure the datastore you move the .vmx to has a large enough block size, and enough free space, to handle the snapshot. I've known customers to build a 1TB datastore just for .vmx files so VMware snapshots would perform better.
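The block size and free space warning can be turned into a quick pre-move check. The Python sketch below rests on two assumptions of mine that the post does not spell out: that the target is a VMFS-3 datastore, where the block size caps the maximum file size (nominally 256GB, 512GB, 1TB and 2TB for 1, 2, 4 and 8MB blocks, overhead ignored), and that a snapshot file may in the worst case need to grow as large as the VM's largest virtual disk. The function name and parameters are hypothetical, and the expected snapshot growth is a number you have to estimate yourself.

```python
# Nominal VMFS-3 maximum file size in GB per block size in MB (overhead ignored).
# Assumption: the target datastore is VMFS-3.
VMFS3_MAX_FILE_GB = {1: 256, 2: 512, 4: 1024, 8: 2048}

def can_host_snapshots(block_size_mb, free_gb, largest_vmdk_gb, expected_delta_gb):
    """Return (ok, reasons) for a candidate datastore that will hold the snapshot files."""
    reasons = []
    max_file_gb = VMFS3_MAX_FILE_GB.get(block_size_mb)
    if max_file_gb is None:
        reasons.append(f"unknown VMFS-3 block size: {block_size_mb} MB")
    elif largest_vmdk_gb > max_file_gb:
        reasons.append(
            f"largest disk ({largest_vmdk_gb} GB) exceeds the "
            f"{max_file_gb} GB file limit of a {block_size_mb} MB block size"
        )
    if expected_delta_gb > free_gb:
        reasons.append(
            f"expected snapshot growth ({expected_delta_gb} GB) "
            f"exceeds free space ({free_gb} GB)"
        )
    return (not reasons, reasons)

if __name__ == "__main__":
    # Made-up numbers: a 1 MB block datastore and a VM with a 500 GB disk.
    ok, reasons = can_host_snapshots(1, 300, 500, 40)
    print("OK" if ok else "Do not move the .vmx here:")
    for reason in reasons:
        print(" -", reason)
```

Treat a passing result as a sanity check only; it says nothing about whether the datastore can absorb the extra I/O.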
If you are using vSphere 5 or newer then the snapshots are now created on each datastore where there is a vmdk. If you want to, you can still specify a working directory for snapshots even in ESXi 5.
Of course you can also improve the disk performance of your datastores. The number of disks in the array, the RAID level, or even the type and number of disks all make a difference. SIOC might even help out.
Identifying the VMs that have to wait on poor snapshot performance is part of the 80-20 VM adjustment phase I talked about in the previous post of this series. If you can't make any changes to your VMs or your datastores right away then consider separating these VMs from the ones that finish quickly by creating a new job. As long as jobs are finishing and the problem seems to be isolated to a few VMs then you can live with this for a while.
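Splitting the slow VMs into their own job is simple bookkeeping once you have written down how long each VM waited on snapshots. A minimal Python sketch, assuming you have copied those wait times by hand from the job statistics into a dictionary: the function name `split_jobs`, the 10-minute threshold, and the sample VM names are all hypothetical, and the code does not read anything from Veeam itself.

```python
def split_jobs(snapshot_wait_seconds, threshold_seconds=600):
    """Split VMs into (fast, slow) lists by total snapshot create + commit wait time."""
    fast, slow = [], []
    # Sort by name so the resulting job lists are stable between runs.
    for vm, seconds in sorted(snapshot_wait_seconds.items()):
        if seconds > threshold_seconds:
            slow.append(vm)
        else:
            fast.append(vm)
    return fast, slow

if __name__ == "__main__":
    # Made-up wait times in seconds, for illustration only.
    waits = {"web01": 45, "file01": 120, "sql01": 2700, "mail01": 1900}
    fast, slow = split_jobs(waits)
    print("Keep in the main job:", fast)
    print("Move to a separate job:", slow)
```

Pick the threshold from your own replication window rather than from this example.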
Be sure to check out the other posts from this series!