Planning ESX host capacity
How many VMs should run on each ESX host? The answer is determined mostly by the physical resources of the host’s platform (storage, RAM, CPU, etc.). Before VI3 introduced ESX Clusters with DRS and HA, squeezing as many VMs onto each ESX host as possible was acceptable. Today it’s not just ESX host capacity that matters; ESX Clusters need to be taken into consideration as well. Planning Cluster capacity means ensuring the availability of VMs while maintaining acceptable host performance in failover scenarios.
First, what is a failover scenario? The first thing that comes to mind is a failure: one or more of your ESX hosts crashes unexpectedly. This is considered unplanned downtime. Another failover scenario to consider is planned downtime, such as rebooting after applying ESX patches. In both types of scenario you want to make sure your VMs stay online.
VMware’s solution for planned downtime is VMotion; its solution for unplanned downtime is the HA feature of ESX Clusters. When determining your ESX capacity, be sure to leave room to leverage these features.
VMotion migrates a running VM to a different ESX host without users losing connectivity. Evacuating an ESX server with VMotion lets you reboot it for planned downtime. The ESX HA feature detects when an ESX host is no longer online and communicating with the other ESX hosts in the cluster (or is considered isolated) and automatically restarts that host’s VMs on other available ESX servers that have the capacity to run them. Making sure that the extra capacity is available is the trick.
I’ve heard of a lot of companies adopting an N + 1 mentality recently. It is typically based on the results of a consolidation estimate: if the capacity analysis study determined that 4 ESX hosts are needed, include 1 extra ESX host just for added capacity and failover resources. That’s a good start, but it might not be enough.
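The N + 1 sizing rule can be sketched as a small calculation. This is a minimal sketch under stated assumptions, not a VMware tool: all hosts are identical, the combined VM demand is expressed as a percentage of one host’s capacity in whichever resource (CPU or RAM) runs out first, and the function name `hosts_for_workload` is hypothetical.

```python
import math


def hosts_for_workload(total_load_pct: float, target_util_pct: float, spare_hosts: int = 1) -> int:
    """Return the number of identical hosts needed to carry a workload.

    total_load_pct is the combined VM demand as a percentage of ONE host's
    capacity (e.g. 4 hosts at 80% = 320). The workload is packed onto hosts
    at no more than target_util_pct each, then spare_hosts are added for
    failover (spare_hosts=1 gives the N + 1 design).
    """
    if not 0 < target_util_pct <= 100:
        raise ValueError("target_util_pct must be in (0, 100]")
    if spare_hosts < 0:
        raise ValueError("spare_hosts must be non-negative")
    n = math.ceil(total_load_pct / target_util_pct)  # the "N" from the capacity study
    return n + spare_hosts


print(hosts_for_workload(320, 80))     # N = 4, plus 1 spare -> 5
print(hosts_for_workload(320, 80, 2))  # N + 2 -> 6
```

Rounding up with `math.ceil` matters: a workload that needs 4.1 hosts at the target utilization really needs 5 before any spare is added.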
Remember that a capacity analysis project is centered around the “WOW! factor” of demonstrating how many servers can be consolidated onto ESX hosts. The ESX host utilization target is often 80% or higher, meaning you host as many VMs as possible until the host’s CPU and RAM are 80% in use. With the N + 1 strategy, losing one host works fine: the 5th ESX host lets the 4 surviving hosts maintain the original 80% load. What if you need to allow for 2 ESX host failures? In that case there is a problem. Here’s some math to demonstrate why:
- The original design was 4 ESX servers at 80% = 320% of one host’s capacity. If you lose 2 of the 5 ESX hosts, even with the N + 1 server, the 3 remaining hosts must each run at 320% / 3 ≈ 106.7% utilization.
That may be acceptable for a short period of time, but an administrator should be ready to add another ESX server ASAP to restore normal performance levels. It gets worse when you lose more ESX hosts simultaneously. A consolidation design where ESX hosts are only 65% utilized helps the numbers, but it also means more hosts.
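The failure math above can be checked with a few lines of code. This sketch assumes identical hosts and that every VM from a failed host is restarted and spread evenly across the survivors (an idealized view of HA plus DRS); `utilization_after_failures` is a hypothetical helper. It also runs the 65% case, assuming the same 320% workload is re-planned at 65% per host, which needs ceil(320 / 65) = 5 hosts plus 1 spare.

```python
def utilization_after_failures(total_load_pct: float, hosts: int, failed: int) -> float:
    """Average utilization of the surviving hosts when `failed` of `hosts`
    identical hosts are down and all of their VMs are restarted and spread
    evenly over the survivors."""
    survivors = hosts - failed
    if survivors <= 0:
        raise ValueError("no surviving hosts")
    return total_load_pct / survivors


# 80% design: 320% of workload on 4 + 1 = 5 hosts
for failed in (1, 2):
    print(f"80% design, {failed} host(s) down: "
          f"{utilization_after_failures(320, 5, failed):.1f}%")

# 65% design: the same 320% needs ceil(320 / 65) = 5 hosts, plus 1 spare = 6
print(f"65% design, 2 host(s) down: {utilization_after_failures(320, 6, 2):.1f}%")
```

This prints 80.0% and 106.7% for the 80% design, and 80.0% for the 65% design with two hosts down: the lower-density design rides out two failures at the original 80% level, at the cost of a sixth host.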
The reality of the IT budget and the reliability of modern hardware keep most companies from planning for more than 1 ESX host failure. Maybe that’s why the N + 1 strategy is so popular? If this is your situation, you may still be able to plan for more than 1 host failure by categorizing your VMs as critical and non-critical. Create 2 separate ESX Clusters: enable DRS and HA on the Cluster with the critical VMs, and enable only DRS on the non-critical Cluster. The N + 1 design already gives you enough capacity for planned downtime (by rotating maintenance through one ESX host at a time), and by limiting the number of VMs that HA will restart, you might be able to maintain critical VM availability through multiple unplanned host failures. You can achieve the same design with a single cluster by setting the HA restart behavior of the non-critical VMs to not restart them, in the properties of the Cluster.
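The effect of restarting only critical VMs can be estimated with a simplified model. This is a sketch, not how VMware HA actually places VMs: it ignores per-host placement, reservations, and HA admission control, and assumes DRS can spread the load evenly across survivors. The `VM` class, the `survivor_utilization` helper, and the host and VM names and loads are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class VM:
    name: str
    load_pct: float  # demand as a percentage of one host's capacity
    critical: bool   # True: HA restarts it; False: left powered off after a failure


def survivor_utilization(hosts: dict[str, list[VM]], failed: set[str]) -> float:
    """Average utilization of surviving hosts after the `failed` hosts go down.

    VMs already on surviving hosts keep running; from the failed hosts only
    the critical VMs are restarted. Assumes the load is spread evenly.
    """
    unknown = failed - set(hosts)
    if unknown:
        raise KeyError(f"unknown hosts: {sorted(unknown)}")
    survivors = [h for h in hosts if h not in failed]
    if not survivors:
        raise ValueError("no surviving hosts")
    running = sum(vm.load_pct for h in survivors for vm in hosts[h])
    restarted = sum(vm.load_pct for h in failed for vm in hosts[h] if vm.critical)
    return (running + restarted) / len(survivors)


# Hypothetical N + 1 cluster: 5 hosts, each running one critical and one
# non-critical VM at 32% apiece (320% of workload in total, 64% per host).
cluster = {
    f"esx{i}": [VM(f"crit{i}", 32, True), VM(f"noncrit{i}", 32, False)]
    for i in range(1, 6)
}
print(f"{survivor_utilization(cluster, {'esx1', 'esx2'}):.1f}%")
```

With two hosts down this prints 85.3%, compared with the roughly 106.7% the survivors would carry if every VM were restarted.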
Monitoring your ESX Clusters is critical to maintaining the design’s expected availability. As your number of VMs increases, you will quickly reach a point where the number of ESX hosts needs to increase too.
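One way to turn monitoring data into a “time to add a host” signal is to compute how much room is left before the cluster can no longer absorb the failures it was designed for. This is a minimal sketch under stated assumptions: identical hosts, utilization measured in the most constrained resource, evenly spread load after a failure, and a typical VM size you estimate yourself. The function name, the 80% ceiling, and the sample measurements are hypothetical.

```python
from __future__ import annotations

import math


def vms_until_new_host(host_utils_pct: list[float], failures_to_tolerate: int,
                       ceiling_pct: float, avg_vm_pct: float) -> int:
    """How many more average-sized VMs fit before the cluster can no longer
    absorb `failures_to_tolerate` host failures without the survivors
    exceeding `ceiling_pct` utilization. 0 means it is time to add a host."""
    survivors = len(host_utils_pct) - failures_to_tolerate
    if survivors <= 0:
        raise ValueError("cannot tolerate that many failures with this many hosts")
    if avg_vm_pct <= 0:
        raise ValueError("avg_vm_pct must be positive")
    headroom = survivors * ceiling_pct - sum(host_utils_pct)
    return max(0, math.floor(headroom / avg_vm_pct))


# Hypothetical monitoring snapshot of a 5-host N + 1 cluster
measured = [60, 55, 70, 65, 50]  # % utilization per host (300% in total)
print(vms_until_new_host(measured, 1, 80, 4))  # one failure, survivors at <= 80%
print(vms_until_new_host(measured, 2, 80, 4))  # two failures: already short
```

For this snapshot the cluster has room for 5 more 4% VMs while still tolerating one failure, but it already lacks the capacity to absorb two failures at 80%.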
For more information about HA, check out VMware’s HA page. For more information on VMotion, check out VMware’s VMotion page.









