I got an email today about a problem I had not seen in a while. This company is still using local storage only and has not migrated to shared storage, so unfortunately they have not been able to leverage DRS yet!
Here’s a cut and paste from my customer’s original email.
The process vmmemctl went crazy today for 30 seconds or so and made the machine unusable; after that, kswapd went nuts for about 30 seconds. Then things were back to normal. What’s up with that stuff? It seems every VMware virtual machine we’ve seen these kinds of problems on. They’re pretty annoying on a development machine, and really problematic on a production machine.
Here’s my reply:
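It’s been a while since I’ve seen that! This problem used to occur more often in the ESX 2.x days before VI3, before shared storage, VMotion, and DRS became the norm. Back then it always surfaced when an ESX host’s physical resources were overcommitted.
The reason is that the guest VMs on your ESX server are battling over RAM. The way ESX manages that (without DRS in VI3 Enterprise) is to inflate the balloon driver, vmmemctl, inside the guest. The balloon claims guest memory so ESX can hand the physical RAM to other VMs, which forces the guest OS to page out to its own swap (that is the kswapd activity you saw). If ballooning does not free enough memory, ESX swaps VM memory out to a swap file on the VMFS LUN. Unfortunately that process zaps the VM(s) and spikes the ESX CPUs.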
Here are some quick links with more about this:
http://communities.vmware.com/thread/55488
http://communities.vmware.com/message/769479#769479
http://www.vmware.com/pdf/vi3_esx_resource_mgmt.pdf (check out page 132 for vmmemctl info!)
You can try to work around this by setting a memory reservation on each VM equal to 50% of its assigned memory. For example, if your VM has 1 GB of RAM, create a memory reservation of at least 512 MB. That was done by default back in ESX 2.x, but it is no longer done with VI3. This will quickly limit how many VMs you can host, though! Maybe start with the VMs that seem to be affected the most? I would also look closely at all of your VMs and scale back virtual RAM where possible. Do all of your VMs really need all the RAM they were created with?
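To put rough numbers on the 50% rule, here is a minimal sketch in Python. It is only back-of-the-envelope arithmetic, not anything ESX runs: the VM names, the memory sizes, and the 4096 MB of usable host memory are made-up assumptions, and the check ignores per-VM overhead and service console memory.

```python
def half_reservation_mb(assigned_mb):
    """Reservation equal to 50% of a VM's assigned memory, in MB."""
    return assigned_mb // 2


def total_reserved_mb(vm_assigned_mb):
    """Sum of the 50% reservations for a dict of {vm_name: assigned_mb}."""
    return sum(half_reservation_mb(mb) for mb in vm_assigned_mb.values())


def fits_on_host(vm_assigned_mb, host_usable_mb):
    """True if the host can back every reservation with physical RAM."""
    return total_reserved_mb(vm_assigned_mb) <= host_usable_mb


# Hypothetical inventory: names and sizes are illustrative only.
vms = {"dev01": 1024, "build01": 2048, "db01": 4096}
host_usable_mb = 4096  # assumed usable RAM on one ESX host

for name, assigned in vms.items():
    print(name, "assigned", assigned, "MB, reserve", half_reservation_mb(assigned), "MB")

print("total reserved:", total_reserved_mb(vms), "MB")
print("fits on host:", fits_on_host(vms, host_usable_mb))
```

With these assumed numbers the three reservations add up to 3584 MB, so they fit on the 4096 MB host, but adding one more 2 GB VM would push the total past what the host can guarantee. That is the "quickly limit how many VMs you can host" effect.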
Of course, you can always add more ESX servers and spread the VMs out across more hosts. Finally, once you get to shared storage, DRS will manage contention for you by automatically VMotioning VMs between hosts, but you will probably still need more ESX servers!