FTF Veeam v6 Replication – Phased Implementation Methodology
If you've got the time to do a readiness assessment or an environment health check, then great. More often than not, though, a company needs to get the DR solution up and running and figure out the weakest links in the chain along the way. This post offers some suggestions for standing up hypervisor-based replication jobs and then adjusting them as you go.
I suggest that you give yourself at least 30 to 45 days to optimize and finalize your replication design and schedule. Breaking it down into phases, I would recommend an approach like this, using Veeam v6.x as your solution:
- Phase 1 (2 weeks) – install, configure, and get jobs running
- Phase 2 (2 weeks) – The 80/20 VM(s) Adjustment
- Phase 3 (1 week) – Final Tweaks
- Ongoing monitoring (forever!)
I am not diving into the technical details in this series. There are already many posts on installing Veeam v6, creating Proxies, and creating replication jobs, or you can follow the Evaluator's Guides for vSphere or Hyper-V. I've also linked to various blogs and helpful info throughout this post.
Let's expand on what to do and what to expect in each phase.
Phase 1 – Install, configure, and get jobs running
Phase 1 should be looked at as a proof of concept or a test phase, regardless of whether it is implemented in production and regardless of whether you plan to keep the initial Veeam install. In fact, agentless replication via the hypervisor API makes it easy to start in production and adjust later. Why? Because the infrastructure you introduce to the production environment is minimal, and unless your utilization is pushing 80% across your clusters you should not see a significant negative impact.
Before you start, however, I recommend you think about the last time you committed a snapshot. Did it cause any hiccups? I'll talk more about snapshots in another post of this series, but on the production side snapshot trouble will be the biggest reason not to move forward with running replica jobs.
At the target, or fail-over location, you need a minimal number of hosts with configured datastore(s). Veeam replica VMs will be created powered off, so the storage is really the key until you are ready to test or run the replicas. You do have the option to create thin provisioned replica VMs, but I would recommend matching available storage at the target to the amount of storage at the source. You don't need to worry about similar storage (vendor, type, etc). You just need to have the room to store the replicas.
The following is an oversimplified whiteboard of the minimal Veeam architecture needed to get jobs started.
Summary install and configure points
- Veeam is only installed once at the target location (more on picking an install location later in this series).
- A Veeam proxy is required at the source location. The default Veeam install provides the required proxy at the target location.
- All Veeam servers must be able to resolve and communicate with each other across both locations.
- vCenter or SCVMM is not needed (but certainly can be used).
- Veeam architecture can be all virtual, all physical, or any combination you want.
Once Veeam is installed and the Proxies are deployed, it's time to think about running jobs.
I always suggest running backup jobs first. Running local backups establishes a Veeam processing baseline and allows you to create a set of files that can be used as seeding for the replication jobs. Run the backup jobs using a portable disk as a backup repository: a USB drive, a server with local storage, a notebook, whatever. You just need some way to get the backup files to the target site.
Here's the whiteboard updated to show how a disk drive was configured first as a Veeam Repository at the source for backups and then moved to be a Repository at the target for seeding. Note that Proxies at each location need to use this Seeding target.
Seeding is not required if your WAN is fast enough to transfer replicas of your VMs in a reasonable window, or if you have the patience and can wait for the first full replication to complete across your WAN – no matter how long it takes!
Another benefit of running a backup first is that it gives you a better idea of how much data you actually need to move, and then what rate of change you might expect in your VMs. ReplicaCalc is a tool you can use to guesstimate the time your actual replication jobs will take based on info from Veeam backup jobs combined with your available bandwidth between locations.
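To make the idea concrete, here is a minimal back-of-the-envelope sketch in Python. It is not ReplicaCalc's formula, which I am not reproducing here; it is plain size-over-bandwidth arithmetic. The function name, the 70% link efficiency, and all the sample numbers are assumptions for illustration only.

```python
def transfer_hours(data_gb: float, wan_mbps: float, efficiency: float = 0.7) -> float:
    """Estimate hours needed to move data_gb gigabytes over a wan_mbps link.

    efficiency is an assumed fraction of the link that replication traffic
    really gets after protocol overhead and competing traffic.
    """
    if data_gb < 0 or wan_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("need data_gb >= 0, wan_mbps > 0 and 0 < efficiency <= 1")
    megabits = data_gb * 8 * 1000          # decimal units: 1 GB = 8000 megabits
    seconds = megabits / (wan_mbps * efficiency)
    return seconds / 3600


# Hypothetical inputs: sizes taken from a backup job report, 100 Mbps WAN.
full_backup_gb = 2000        # size of the first full backup
daily_change_rate = 0.05     # incremental size / full size, from later runs

print(f"first full:  {transfer_hours(full_backup_gb, 100):.1f} h")
print(f"incremental: {transfer_hours(full_backup_gb * daily_change_rate, 100):.1f} h")
```

With these made-up numbers the first full would need well over two days on the wire while the nightly incremental fits in a few hours, which is exactly the situation where seeding pays off. Compression and deduplication are ignored, so treat the result as a rough upper bound.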
Eventually, you can then divide your VMs into replication jobs. The strategy I suggest is that you split your VMs into groups that create replication jobs that will take 2 – 4 hours each. Then you want to schedule the jobs to run one at a time and on a staggered schedule so all jobs will fit in your replication window.
It doesn't matter how many VMs are in a single job or how many jobs you create. Just make sure you pay attention to the total time it takes to run a job. You only run into trouble when you schedule many jobs simultaneously, and I am not suggesting that for the first phase.
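As a rough illustration of the grouping and staggering idea, the sketch below packs VMs into jobs capped at 4 hours and then starts each job when the previous one is expected to finish. It assumes you already have a per-VM time estimate in minutes and that VMs inside a job are processed one after another. The function names and the greedy packing approach are my own choices for the example, not something Veeam does for you.

```python
from datetime import datetime, timedelta


def group_into_jobs(vm_minutes: dict, max_job_minutes: int = 240) -> list:
    """Greedy packing: longest VMs first, each into the first job with room."""
    jobs, totals = [], []
    ordered = sorted(vm_minutes.items(), key=lambda kv: kv[1], reverse=True)
    for vm, minutes in ordered:
        for i, total in enumerate(totals):
            if total + minutes <= max_job_minutes:
                jobs[i].append(vm)
                totals[i] += minutes
                break
        else:
            # No existing job has room, so open a new one.
            jobs.append([vm])
            totals.append(minutes)
    return jobs


def stagger(jobs: list, vm_minutes: dict, window_start: datetime) -> list:
    """Run jobs one at a time: each starts when the previous one should end."""
    schedule, start = [], window_start
    for job in jobs:
        duration = timedelta(minutes=sum(vm_minutes[vm] for vm in job))
        schedule.append((start, start + duration, job))
        start += duration
    return schedule


# Hypothetical per-VM estimates in minutes.
estimates = {"sql01": 200, "exch01": 100, "file01": 100, "web01": 50}
plan = stagger(group_into_jobs(estimates), estimates, datetime(2012, 1, 1, 22, 0))
for start, end, vms in plan:
    print(f"{start:%H:%M} to {end:%H:%M}: {', '.join(vms)}")
```

If the last job ends after your replication window closes, that is the signal to seed, add bandwidth, or revisit which VMs really need nightly replication.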
Run your jobs! If you use seeding then Veeam will use the files you put at the target site to do restores there and then only replicate the current changes across the WAN. Problems on the first run usually are the result of connectivity, DNS resolution, permissions, or other environmental scenarios. Look again at the whiteboards earlier in this post. Often the Veeam replication infrastructure stretches across domains, DNS servers, subnets, firewalls, and even virtual infrastructures with separate permissions. More on these gotchas later in the series.
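Since name resolution is one of the usual first-run culprits, a quick pre-flight check can save a failed job. The sketch below is a generic Python helper, not a Veeam tool, and any host names you feed it are your own. Run it from each Veeam server and proxy, because resolution can differ per site.

```python
import socket


def unresolved_hosts(hostnames, resolve=socket.gethostbyname):
    """Return {hostname: error text} for every name that fails to resolve.

    resolve is injectable so the check can be exercised without real DNS.
    """
    failures = {}
    for name in hostnames:
        try:
            resolve(name)
        except OSError as exc:   # socket.gaierror is a subclass of OSError
            failures[name] = str(exc)
    return failures
```

For example, calling `unresolved_hosts(["veeam01.siteA.example", "proxy01.siteB.example"])` (hypothetical names) on the source proxy should return an empty dictionary before you schedule anything. It only covers DNS; firewall ports and permissions still need their own checks.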
Phase 2 – The 80/20 VM(s) Adjustment
I have seen very few customers get their replication job / schedule / proxy mix right the first time. It doesn't matter if a customer is an SMB or an Enterprise, the initial replication jobs are rarely the final, go-forward configuration. What I do see is that for 80% of the VMs the first runs are fine. Then there is always that other 20%.
Databases, messaging, or any other applications that generate significant I/O during the replication window usually result in VMs that take significantly longer to replicate than the rest of the VMs. Look at the Statistics window or html report of each job and you will see how much time each VM needed to complete. Furthermore, look at the Statistics of each VM (click on the VM in the Statistics window) and see what part of the job took the longest for that VM. Be sure to analyze your jobs after you have successfully run them more than once. Get a feel for the average time each one of your VMs takes to replicate incrementally and not just during that first full (regardless of seeding).
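If you copy the average per-VM times out of the job reports, picking out the 20% can be as simple as comparing each VM to the median. The sketch below assumes you have entered those averages by hand as minutes; the function name and the 3x threshold are arbitrary choices for illustration.

```python
from statistics import median


def slow_vms(avg_minutes: dict, factor: float = 3.0) -> list:
    """Return VMs whose average incremental time exceeds factor x the median,
    slowest first."""
    if not avg_minutes:
        return []
    typical = median(avg_minutes.values())
    flagged = [vm for vm, minutes in avg_minutes.items() if minutes > factor * typical]
    return sorted(flagged, key=lambda vm: avg_minutes[vm], reverse=True)


# Hypothetical averages (minutes) collected over several incremental runs.
averages = {"web01": 5, "web02": 5, "file01": 6, "sql01": 80, "exch01": 85}
print(slow_vms(averages))
```

The median is used rather than the mean so that a couple of very slow VMs do not drag the "typical" value up and hide themselves.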
A simple adjustment example
A customer has 2 replication jobs, each with 20 VMs. Looking at the Statistics of Job 1, they see the job usually takes around 4 hours and 15 minutes. Looking at the time each VM in that job takes, they see that 18 VMs usually finish in 5 minutes each (90 minutes total) but the other 2 VMs take 2:45 between them. For simplicity let's say Job 2 has the same results.
The customer started with 1 proxy at each location (like the whiteboard). Each proxy has 4 CPUs and can run 2 concurrent tasks (jobs).
Job 1 is scheduled to start at 10 pm. Job 2 is scheduled to start at 2 am. Job 2 is finishing between 6:30 and 7 am every morning.
As you can see, the job results show that 36 VMs finish quickly and 4 VMs do not. We won't worry about why they don't for this example. That's for later in the series. Let's focus on what you can do in Veeam to make these jobs finish faster.
If you removed the 4 VMs taking the longest from their current jobs, Job 1 and Job 2 would each finish in 90 minutes. Since the proxy can handle 2 jobs concurrently and only 1 job is running at 10 pm, we have room for another job. Taking the 2 VMs that used to be in Job 1 and creating a Job 3 would be my advice. Doing the same to create a Job 4 at 2 am would also make sense. Now when the jobs run, Job 1 is not held up by the VMs in Job 3, and likewise for Job 2 and Job 4. The net result is that Jobs 1 and 3 are finished by 12:45 am and Jobs 2 and 4 are finished by 4:45 am.
Taking this one step further, nothing is happening between 12:45 am and 2:00 am. In fact, the Veeam Proxy has an open task after 11:30 pm because Job 1 only took 90 minutes. The customer could move Job 2's start time to 11:45-ish and Job 4's start time to 12:45-ish, when Job 3 frees up the other task. Every job could be complete by 4 am.
Remember the Seeding option we used to establish the first fulls for Jobs 1 and 2? You can do the same for the new Jobs 3 and 4. This time you could even use the existing replicas of the 4 VMs instead of making new backups. The Veeam job also lets you choose existing VMs at the target site as Seeds.
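The timeline arithmetic in this example is easy to check with a few lines of Python. The sketch below uses only the average durations quoted above (90 minutes for the 18 fast VMs, 2:45 for the 2 slow ones) and confirms that the tightened schedule never needs more than the proxy's 2 concurrent tasks. The calendar date and the function names are arbitrary, and real runs will drift from the averages.

```python
from datetime import datetime, timedelta

NIGHT = datetime(2012, 1, 1)  # arbitrary date; only the clock times matter


def finish_times(schedule: dict) -> dict:
    """schedule maps job name -> (start, duration in minutes)."""
    return {job: start + timedelta(minutes=mins) for job, (start, mins) in schedule.items()}


def max_concurrency(schedule: dict) -> int:
    """Largest number of jobs running at the same moment."""
    events = []
    for start, mins in schedule.values():
        events.append((start, 1))
        events.append((start + timedelta(minutes=mins), -1))
    running = peak = 0
    for _, delta in sorted(events):  # at equal times an end (-1) sorts before a start (+1)
        running += delta
        peak = max(peak, running)
    return peak


original = {
    "Job 1": (NIGHT + timedelta(hours=22), 255),  # 90 min + 2:45
    "Job 2": (NIGHT + timedelta(hours=26), 255),  # 2 am the next day
}
adjusted = {
    "Job 1": (NIGHT + timedelta(hours=22), 90),
    "Job 3": (NIGHT + timedelta(hours=22), 165),
    "Job 2": (NIGHT + timedelta(hours=23, minutes=45), 90),
    "Job 4": (NIGHT + timedelta(hours=24, minutes=45), 165),
}

for name, plan in (("original", original), ("adjusted", adjusted)):
    last = max(finish_times(plan).values())
    print(f"{name}: last job ends {last:%H:%M}, peak concurrency {max_concurrency(plan)}")
```

On paper the adjusted plan wraps up at 3:30 am, which leaves some slack before the 4 am target.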
Of course there are many other factors that influence the speed and efficiency of replication jobs, but the point here is that by running replication jobs for a while and getting a true understanding of average job times, you can make intelligent adjustments to the Veeam jobs and Proxies. The results are vastly different for each customer. There is no easy best practice design or deployment guide document to refer to. Running jobs and adjusting is the best way to finalize the replication infrastructure design and DR solution.
BTW, I realize the math makes this a 90-10 example. You get the point!
Phase 3 – Final Tweaks
After the adjustments, run your jobs and make sure things happen as expected. Using our example, what if all the jobs were over at 6 am instead of 4 am? This is better than before, but why could this be happening?
Perhaps the Veeam Proxies, the “muscle” for moving all data across the replication path, are undersized or resource constrained. Maybe the Proxy OS is pegged at 100% CPU during the replication jobs. If the proxies are VMs, maybe the hosts are resource constrained and can't dedicate the resources needed to the Proxies. If you have to, add more Veeam proxies or switch from VMs to physical servers. Maybe your network or even your disk arrays (datastores) are the bottleneck.
Ongoing monitoring (forever!)
Even though you've worked through all the jobs, made the adjustments, fixed the bottlenecks, and your jobs have run smoothly for weeks, that doesn't mean things can't suddenly change. Using our example again:
- 1 of the 36 VMs now takes 1 hour to replicate
- 20 new VMs have been added and need to be replicated
- 20 new VMs have “stolen” resources from Proxies during the replication window
- The virtual infrastructure (V.I.) has been upgraded / changed in a way that could impact jobs
There are many, many environmental factors that can cause unforeseen disruptions in your replication jobs. Be sure to keep an eye on your jobs and “rinse and repeat” the adjustments as needed.
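One lightweight way to keep an eye on things is to compare the latest run against the averages you established in Phase 2. The sketch below assumes you record per-VM minutes yourself, for example from the job reports. The function name and the 2x tolerance are hypothetical choices, not Veeam settings.

```python
def check_drift(baseline: dict, latest: dict, tolerance: float = 2.0):
    """Compare latest per-VM minutes to baseline averages.

    Returns (regressed, new): VMs now slower than tolerance x their baseline,
    and VMs that have no baseline yet.
    """
    regressed = sorted(
        vm for vm, minutes in latest.items()
        if vm in baseline and minutes > tolerance * baseline[vm]
    )
    new = sorted(vm for vm in latest if vm not in baseline)
    return regressed, new


# Hypothetical data: web07 used to take 5 minutes, and app21 was just added.
baseline = {"web07": 5, "sql01": 80}
latest = {"web07": 60, "sql01": 85, "app21": 12}
print(check_drift(baseline, latest))
```

Anything in either list is a prompt to go back to the Phase 2 adjustments: move a VM to its own job, shift a start time, or add proxy capacity.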
Be sure to check out the rest of the posts in this series.