The Virtualization Management Opportunity
A very smart and experienced executive in the systems management industry once remarked to me that innovations in platforms always outstrip the ability of the major management vendors to keep up with them. The world needs management startups because corporate IT adopts new platforms before they can be managed by the incumbent frameworks from the major vendors like CA, IBM/Tivoli, HP/Mercury, and BMC. If you think about the major applications architecture and platform innovations that have occurred in our industry in the last 20 years, each has required and created a new set of management vendors. Some examples are:
- Client/Server computing – Tivoli
- Windows Servers for production applications – NetIQ
- Web application response time management – Keynote
- J2EE Web Application Management – Wily (now part of CA)
- HTTP as the standard application level protocol – Coradiant, and the other web appliances
- TCP/IP as the standard transport – NetQoS, Network General, and the other TCP/IP appliances
So, what is the next one of these, and what are its implications? The answer is virtualization. VMware in particular has been widely adopted in corporate enterprise IT, and with this adoption has come a big problem for IT and an even bigger opportunity for new application management vendors. However, this opportunity is not just about a new class of applications or a new applications architecture; it is about changes to the management of every application that has been built and deployed since the start of business computing. That is because VMware is fundamentally changing things that have not changed in a long time, if ever. As a result of these changes, VMware raises the following questions:
- Who budgets for and controls server capacity?
- Who is responsible for applications performance?
- How does the dynamic nature of virtualized environments change application performance management?
- What metrics about server performance can be trusted?
- How does virtualization impact root cause analysis?
- What approaches to applications performance management stand a chance of working (and which ones do not)?
Let’s address each one of these in order:
Who budgets for and controls server capacity?
In most enterprise IT organizations, server budgets are split. IT controls the budget and capacity for commodity servers that provide horizontal services like file, print, and email. But the business units that own business critical applications (like SAP, CRM, etc.) own the budget for the servers that run their mission critical applications. Each business unit buys servers for “their” set of mission critical applications, which creates massive silos of application specific capacity within an enterprise.
With virtualization, this dynamic changes. As opposed to silos of servers that support each application (which is an incredible waste of server resources), IT provides a shared resource pool of server capacity. Each business unit's applications still run in their own OS, but instead of the OS being locked to one instance of server hardware, the interface between the OS and the server hardware is virtualized, and one server can support many instances of different types of operating systems. This allows server utilization to rise dramatically, and allows IT Operations to buy server capacity incrementally across all of the supported applications.
The change in who buys server capacity also creates a problem for both IT Operations and the application owner. There needs to be a rational way to know how much server capacity to buy to virtualize the next application, and to know when a lack of capacity is hurting applications performance. Related to this, IT Operations needs to be able to prove to the application owner that the application will perform at least as well once virtualized as it did when it was running on its own hardware. These new problems arise because virtualization invalidates the traditional metrics used in capacity planning (question #4).
Who is responsible for applications performance?
In the physical world of one application and one OS per server, there was often a clear line between IT Operations and the applications owner as to who was responsible for what. IT Operations was responsible for supporting the platform for the application (the hardware and the Windows OS), and the application owner was responsible for ensuring the performance of their own application. Since the application owner could add capacity whenever they wanted to, the application owner felt secure that sufficient resources were available to allow their application to run effectively. When there were problems with applications performance, IT Operations had a well understood process to prove their innocence, and to dump the problem in the lap of the applications owner. The process was basically, “My metrics prove that the OS is running fine, and that your application is using no more resources than it should, so it has to be a problem with your application, not my platform”.
Once an application is virtualized, IT Operations loses this defense, since the hardware is now shared between multiple instances of the OS, running multiple different applications. When the application does not perform well, the application owner immediately points the finger at the new virtualization layer, and IT Operations does not have the tools or the metrics to defend itself. This brings up another problem created by the virtualization of applications: tools do not exist that allow IT Operations to prove in a defensible manner (with numbers that people believe in) that the environment (including the impact upon application A of application B running in a different VMware Guest) is not at fault. In other words, IT Operations has lost its ability to defend itself in the blamestorming meetings that inevitably occur when applications do not perform well.
A related problem is that neither IT Operations nor the applications owners have any tools that can credibly compare the performance of applications across physical and virtual implementations. The primary reason is that the resource based applications performance metrics used by most application performance management (APM) vendors in the physical world do not work in the virtual world, and therefore cannot be compared across applications running in physical and virtual environments.
How does the dynamic nature of virtualized environments change application performance management?
In physical implementations (dedicated implementations of physical servers, operating systems, and applications), APM products assume a specific and fixed set of hardware provides the resources that are used by the application. These same APM products assume via configuration that the web layer of an application is talking to a specific middle tier layer, which in turn is talking to a specific database server.
When problems are reported about the performance of an application, that report contains references to the physical environment of the application, like the name of the server, its CPU rating, the total amount of memory in the server, its IP address, etc. APM products assume that these physical elements are a fixed reference point through time against which utilization of resources and performance can be judged. These products often create baselines, or statistical representations of what is “normal”, based upon how much of a resource an application is using at a point in time. These products also assume that relationships between layers of an application system are fixed, instead of dynamic.
The dynamic nature of virtualized applications creates two more points of pain. One is that baselines for normal performance related to the specific hardware upon which a portion of an application is running at a moment in time are invalidated. The other is that products that rely upon manual configuration to understand the mappings between applications components are just too brittle to be able to deal with the rate of change in the virtual platform underneath virtualized application components.
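To make the first point of pain concrete, here is a minimal sketch of a hardware-bound baseline. It assumes a simple "mean plus or minus three standard deviations" rule, and all sample values and function names are hypothetical; real APM products use their own statistical models.

```python
from statistics import mean, stdev

def build_baseline(samples):
    """Summarize 'normal' as the mean and standard deviation of past samples."""
    return mean(samples), stdev(samples)

def is_abnormal(value, baseline, k=3.0):
    """Flag a value that lies more than k standard deviations from the mean."""
    mu, sigma = baseline
    return abs(value - mu) > k * sigma

# Hypothetical CPU % samples collected while the guest ran on one host.
cpu_on_original_host = [40.0, 42.0, 38.0, 41.0, 39.0]
baseline = build_baseline(cpu_on_original_host)

# Hypothetical reading for the same workload after the guest moved to a
# slower or busier host. The application did not change, only the hardware.
cpu_after_move = 65.0

print(is_abnormal(41.0, baseline))            # False: within the old normal
print(is_abnormal(cpu_after_move, baseline))  # True: flagged, yet nothing is wrong with the app
```

The baseline is only meaningful relative to the hardware it was learned on, so every move of a guest silently turns it into a source of false alarms.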
What metrics about server performance can be trusted?
The answer to this question needs to start with a discussion of what cannot be trusted, and then see what is left. The basic problem is that an operating system that measures the performance and resource utilization of its own processes, and of the applications running on it, assumes that it is the sole user of the hardware resources on the server or workstation it is running on. For any resource that involves time (CPU %, Disk Time, Page Faults per Second, Context Switches per second), the OS assumes that it is the sole user of the system clock. So, when the OS measures how much CPU a process (an application) has used in the last N milliseconds, the OS assumes that it is the only user of the CPU.
Once you virtualize an OS, all time based metrics collected by the OS about itself and the applications running on top of it are shifted by the degree to which that OS is now one of many OSes sharing the same hardware. If there are 5 guests on a server, with one application running in each guest, and all are doing equal work at that moment in time, then the metrics reported by a guest OS will be off because each guest only sees one-fifth of the hardware resources at that moment in time. Of course, if a guest is shut down in the next second, then the degree of time shift changes.
The first conclusion about metrics in virtualized environments is that any metric based upon the use of a resource over time, which includes all of the ones listed directly above, is invalidated (made irrelevant and untrustworthy) by virtualization. The only resource based metrics that remain valid are ones like how many bytes of memory an individual process is using, and how big a file or database is in actual bytes.
However, the problem gets worse from here. The holy grail of applications performance metrics, response time, is also impacted by this time-shifting. If an APM product reports that transaction A, as measured by the elapsed clock time from action B to response C, took 0.5 seconds in a physical implementation, then it really took 0.5 seconds. If that same measurement occurs within a VMware Guest that is one of 10 guests on a server, and all guests are equally busy, then the response time monitor could well again report 0.5 seconds, but the actual clock time that elapsed could well have been 5 whole seconds. This is because the Guest OS only knows about the clock ticks that it gets, and as opposed to getting all of them in a physical environment, that guest is only getting a variable share of those ticks at each moment in time. So, virtualization can also invalidate the most valuable and credible of APM metrics (response time) if those metrics are collected from within a virtualized guest OS.
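The arithmetic in the five-guest example can be sketched as follows. This is only an illustration of the simplified model used above (equally busy guests, each receiving an equal slice of the host); the function name is hypothetical, and a real hypervisor shares hardware far less evenly.

```python
def host_cpu_percent(guest_cpu_percent: float, busy_guests: int) -> float:
    """Rescale a guest-reported CPU % to a percentage of the whole host.

    Assumes each busy guest receives an equal 1/busy_guests slice of the
    host, which is the simplification used in the text.
    """
    if busy_guests < 1:
        raise ValueError("busy_guests must be at least 1")
    return guest_cpu_percent / busy_guests

reported_by_guest = 80.0  # hypothetical value the guest OS reports for itself

# Five equally busy guests: the guest's "80%" is 16% of the real hardware.
print(host_cpu_percent(reported_by_guest, 5))  # 16.0

# One guest shuts down: the same reported number now means something else.
print(host_cpu_percent(reported_by_guest, 4))  # 20.0
```

The same guest-side reading maps to a different amount of real hardware as soon as the number of active guests changes, which is why the guest's own number cannot be compared from one moment to the next.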
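The response time example above works out as follows. This sketch assumes the simplified model described in the text, where a guest's clock only advances for the share of ticks it receives; `tick_share` and the function name are hypothetical, and in practice the share varies from moment to moment.

```python
def wall_clock_seconds(reported_seconds: float, tick_share: float) -> float:
    """Estimate real elapsed time from a guest-measured duration.

    tick_share is the fraction of host clock ticks the guest received
    while the measurement ran (1.0 means a dedicated physical server).
    """
    if not 0.0 < tick_share <= 1.0:
        raise ValueError("tick_share must be in the range (0, 1]")
    return reported_seconds / tick_share

# Physical server: the guest sees every tick, so 0.5 s really is 0.5 s.
print(wall_clock_seconds(0.5, 1.0))

# One of 10 equally busy guests: the same reported 0.5 s is about 5 s for the user.
print(round(wall_clock_seconds(0.5, 0.1), 3))
```

Because the share is not visible from inside the guest, the correction cannot be applied there; it needs a measurement taken from outside the guest.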
How does virtualization impact root cause analysis?
Virtualization makes traditional root cause analysis much more difficult for all of the reasons mentioned above. By invalidating many of the metrics and their baselines that server and application support teams rely upon, virtualization makes it much more difficult to use those metrics to pinpoint abnormal behavior.
Virtualization also creates a whole new root cause problem. That problem is to answer the question, "Why does this application perform poorly when virtualized, but perform just fine when it is using completely normal amounts of CPU and memory on a physical server?" The inability to know how well a particular application will perform once virtualized means that the only feasible method is to "try it and see how well it works (or does not)". Since IT does not inspire confidence with many business units and applications teams, forcing IT to rely on a "trust me, it will work" promise is a major roadblock to virtualizing the 80% of the applications that really matter in a corporate enterprise.
Which approaches to applications performance management stand a chance of working (and which ones do not)?
Before virtualization, IT Operations had resource based metrics to fall back upon when questions of applications performance arose. Now these metrics are either unavailable or not credible. The shared and dynamic nature of virtualized environments makes APM approaches based upon how much CPU or Disk I/O an application is generating at a moment in time untrustworthy and invalid. Furthermore, approaches that gather response time data via scripted synthetic agents or real time passive agents are also impacted when those metrics are collected inside a guest OS.
So, what works today, and what new approaches are needed? There are two approaches that stand a chance of working. One is to rationalize the resource based metrics by allowing metrics collected by the host OS to be combined at any moment in time with metrics from inside the Guest to provide a true picture of resource utilization. This requires either that VMware collect and publish the host metrics in a usable form, or that third party agents run in the host OS (something VMware is reluctant to get behind since it wants to keep the host OS as thin and efficient as possible). The second approach is to rely upon actual transaction response time metrics collected at actual end user workstations, from within the actual applications, or upon response times collected by network appliances connected to mirror ports on switches. This has been the holy grail of APM for several years now, and this approach has been made all the more valuable by the demise of the traditional resource based metrics. Whichever approach is used, it will have to be combined with an ability to dynamically understand changes to the environment of a guest OS, and to the components of an applications system at a point in time. So, changes in the environment of an application need to be constantly auto-discovered by APM tools in order for the tools to be able to provide relevant information.
The race has started to allow IT Operations and applications owners to know how well their virtualized applications are really performing, and to be able to quantify that performance in ways that allow more applications to be virtualized in denser implementations than are now possible (since no one knows how densely you can pack guests into a host without causing problems). Successful vendors of response time based APM solutions are effective today when they sell to applications owners. Virtualization creates many additional pain points for applications owners and IT Operations. The successful vendors will figure out how to make their products dynamic and based upon credible metrics, and how to tune their sales and marketing approaches to address these new points of pain and target audiences.
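As one illustration of what constant auto-discovery might involve, the sketch below compares two snapshots of which guest runs on which host and reports what changed. The snapshot format, the guest and host names, and the function name are all hypothetical; how a tool actually obtains such snapshots depends on what the virtualization platform exposes.

```python
def diff_placements(previous: dict, current: dict):
    """Compare two {guest_name: host_name} snapshots.

    Returns three dicts: guests that appeared, guests that disappeared,
    and guests that moved (mapped to an (old_host, new_host) pair).
    """
    added = {g: h for g, h in current.items() if g not in previous}
    removed = {g: h for g, h in previous.items() if g not in current}
    moved = {
        g: (previous[g], h)
        for g, h in current.items()
        if g in previous and previous[g] != h
    }
    return added, removed, moved

# Hypothetical snapshots taken a few minutes apart.
earlier = {"web-01": "host-a", "app-01": "host-a", "db-01": "host-b"}
later = {"web-01": "host-c", "app-01": "host-a", "web-02": "host-b"}

added, removed, moved = diff_placements(earlier, later)
print(added)    # {'web-02': 'host-b'}
print(removed)  # {'db-01': 'host-b'}
print(moved)    # {'web-01': ('host-a', 'host-c')}
```

A tool that runs a comparison like this on every polling cycle can re-anchor its metrics and application maps whenever a guest appears, disappears, or moves, instead of relying on manual configuration.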
Bernd Harzog
CEO
Application Performance Management Experts
770-475-4249
bernd.harzog@apmexperts.com
http://www.apmexperts.com/
