My thoughts on the reactions to the ESX 3.5 Update 2 BUG

The product expiration time bomb that was mistakenly left in the first versions of the ESX 3.5 and ESXi 3.5 Update 2 download media is no doubt an embarrassing and horrible mistake by VMware. The timing of this disaster couldn’t be worse with Microsoft Hyper-V, Citrix XenServer, and others starting to be considered as an alternative virtual infrastructure platform for companies just beginning to explore the benefits of virtualization. How could this have happened and what are some lessons to be learned, not just for VMware, but for VI administrators around the world?

First, this was an internal VMware blunder

At some point over the last 2 days I read that VMware publicly admitted they had a hole in their regression testing process. I can’t seem to find that comment today. I assume the KB articles have since been revised to carry newer and more important information for VMware customers. I know I read it because I am not a developer and I turned to Wikipedia to understand what regression testing is.

Regression testing is any type of software testing which seeks to uncover software regressions. Such regressions occur whenever software functionality that was previously working correctly stops working as intended. Typically regressions occur as an unintended consequence of program changes.

Common methods of regression testing include re-running previously run tests and checking whether previously fixed faults have re-emerged.

Obviously this process did not work correctly at VMware for Update 2, and to their credit they have been open and honest about that fact. Evidence of the continued honesty can be found in their recent communications.

From the FAQ section at the end of the email quoted in my post yesterday, Patch for ESX 3.5 U2 BUG promised by 6:00 PM today:

We are making improvements on all fronts. The product team had endeavored to deliver a release with support customers deem important. But we fell short and we are deeply sorry about all the disruption and inconveniences we have caused. We have identified where the holes are and they will be addressed to restore customers’ confidence.

From the current version of the original KB article about this issue at http://kb.vmware.com/kb/1006716:

The problem is caused by a build timeout that was mistakenly left enabled for the release build.
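As a rough illustration of the kind of release check that catches a timeout left enabled, here is a minimal Python sketch. It is not VMware's build system or QA process. `BuildConfig`, `release_gate_errors`, `is_expired`, and the dates are all hypothetical names and values made up for this example.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class BuildConfig:
    """Hypothetical build metadata, invented for this illustration."""
    build_type: str              # "beta", "rc", or "release"
    expires_on: Optional[date]   # None means the build never expires


def release_gate_errors(config: BuildConfig) -> List[str]:
    """Return the reasons a build must not ship as GA (empty list means OK)."""
    errors = []
    # Pre-release builds may carry a timeout; a release build must not.
    if config.build_type == "release" and config.expires_on is not None:
        errors.append(
            f"release build still has an expiration date: {config.expires_on}"
        )
    return errors


def is_expired(config: BuildConfig, today: date) -> bool:
    """True once the build's timeout date has been reached."""
    return config.expires_on is not None and today >= config.expires_on


# Illustrative date only; it does not come from the KB article.
rc_build = BuildConfig("rc", date(2030, 1, 1))
bad_release = BuildConfig("release", date(2030, 1, 1))
good_release = BuildConfig("release", None)

for build in (rc_build, bad_release, good_release):
    print(build.build_type, release_gate_errors(build))
```

The point of the sketch is that the gate inspects the build's configuration instead of waiting for the date to arrive, so it flags the problem on the day the release is cut.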

I am satisfied with VMware’s attitude, intentions, and actions to support customers worldwide and resolve this issue. Frankly, I am a bit surprised by the unrealistic expectations and sense of injustice expressed by some VMware partners and customers who anonymously left comments on the VMware Communities thread VMware Communities: BIG bug in ESX 3.5 Update 2 – If you’re …

Secondly, Update 2 was widely implemented because of exciting, new features in ESX 3.5 and ESXi 3.5.

Although Update 2 contained patches, the mass appeal that motivated IT departments around the globe to apply the upgrade was the new features. There has been a lot of rhetoric on numerous posts and threads with a theme similar to “how could smart administrators roll this out in production so quickly? Shame on them!”. Don’t kid yourself. Update 2’s appeal was not the patching. Furthermore, VMware has made the upgrade / patching process so simple that we all let our guard down. What other product can you think of that allows you to migrate production workloads over to other hosts so easily and lets you make these changes during business hours? I have to believe that this bug’s impact is so widespread because it has been so easy to upgrade and patch ESX in the past without consequence.

I also want to point out that the bug is not a result of current or new ESX functionality.

Look in the mirror and examine how we all helped make this happen.

The first thing we all need to realize is that although VMware ESX and ESXi are operating systems and can be considered software, when we migrate all of our production systems to VI 3.5 Enterprise, Microsoft Hyper-V, Citrix XenServer, etc. it is no longer that simple. If you haven’t already, realize now that your entire business infrastructure is in a virtual data center created by this software. The implications of this are painfully obvious today. Yes, VMware’s success has made this possible, but is it any different for any other virtualization vendor’s products?

Go ahead, be frustrated with VMware, but be ANGRY with yourself. Use that emotional energy to make sure this doesn’t happen again regardless of the virtualization platform you use. Get your internal change control process in check.

Another good opinion blog post about the reaction to this bug was written by Matthijs Haverink at Virtualfuture.info:

Sure, VMWare made a (critical) mistake, but what’s all the fuzz about ? | Virtualfuture.info


  • Chris
    I couldn't agree more on the "Look at yourself" sentiment.

    Every IT department should be doing some level of regression testing internally. There's no good reason not to have a test ESX cluster, especially now that ESXi is free.
  • Hey Rich,

    Great post too!

    I totally share your opinion and agree strongly with the point you make: VMware can make updating so easy, but you have to bear in mind that this is not just a piece of software, this is your virtual hardware!

    Administrators and IT organisations should really learn how to handle virtual environments, because it all slipped in so easily but it cannot be treated like any other object within your IT infrastructure! And let that indeed be a lesson learned for all IT infrastructure managers and administrators!
  • Phil
    If you're going to deliver a product which has time-expiry in ANY of its incarnations make very sure that users are given prominent warning well before the licence or product expires. Preferably at least a month in advance.
  • Phil,

    VMware did not intend to deliver the GA version of ESX/ESXi 3.5 with a time expiration. CEO Paul Maritz explained in his blog that the expiration was apparently there because preview / RC versions were available and in use. Paul and VMware have been very open and honest about the fact that their QA process failed to catch that the expiration had not been removed or disabled.

    Here's my post about Paul Maritz's blog entry:
    http://vmetc.com/2008/08/13/vmware-ceo-paul-mar...
  • James Shelton
    I'd just like to add that no amount of "Change Control" will prevent all date triggered catastrophes like this. This date could have just as easily been many weeks or months in the future and could have been rolled out to production by even some of the most conservative of procedures and been subsequently nailed with the bug anyway. I guess the old bumper sticker is true...it just "Happens".
  • James,

    I'm not a developer, and I've had limited personal experience administering infrastructure for product development, but it's my understanding that tests to determine whether the licenses expire at some date would not be attempted. What should catch something like this is a QA process that tracks whether the release candidate's expiration code has been disabled in the final version that goes to general availability.
  • James Shelton
    Rich,

    I'm no professional developer either...but I have extensive experience installing and supporting infrastructure for several Fortune 500's. It's only logical that such date-triggered bugs would be practically impossible to properly vet in any testing process (I can think of other errors that would be equally difficult to detect and/or test for). Sure...in this particular case it was an actual expiring license....but what if it had been an actual programming error that triggered upon reaching some arbitrary date or time? My point is that there are some things that you cannot stop no matter your testing process. There is only one process that will ensure that you never are hit by these newly introduced errors or bugs...sit on your hands and never upgrade anything. Of course...there is a business cost to making that decision as well...it just takes longer to realize that problem...

    The fact is that software is only becoming more complex, more powerful, more difficult to fully test, more difficult to track, (you can see where I'm going here...). Companies should develop strong change management procedures to attempt to mitigate issues like this...but no one should confuse the word "mitigate" with the word eliminate...because I can assure you...no process will eliminate the introduction of bugs, errors, or vulnerabilities into any company's production systems.
  • Our support team practically begged us to sanction this patch in production. I told them that beyond the issue of having to test it in our environment, we never deploy any non-security updates within the first thirty days. We usually wait for the BEEs (Bleeding Edge Engineers) to test it and post on the forums. I still find it hard to believe people took this from being posted on the VMware site to production within three weeks.

    Great point about the reputation damage at a time when the Microsoft marketing machine has the volume turned to eleven. I received many calls from people I know all over the country asking me if we were impacted. I can't remember the last time there was that much negative buzz about a product that we actually use.