Friday, September 30, 2005

Shrek, Onions, & Virtualized Application Architectures

What do the three things in the title have in common? Answer: they all have “layers”. If you do not have kids you may not know that Shrek, the ogre in the movie of the same name, has (contrary to popular opinion about ogres) layers to his personality. Onions have layers, and so do applications deployed via application virtualization technologies like Citrix, VMware, Microsoft Virtual Server, and Softricity. Why is this interesting or relevant to the topic of Applications Performance Management?

Because when you use a virtualization technology to simplify some aspect of managing your applications, you at a minimum inject another layer or two of software into the mix. Let’s take VMware as a case in point. Before VMware, you have the OS (Windows Server 200X) talking directly to the hardware. Applications talk more or less directly to the OS. So you have the hardware, the OS, and the applications to worry about.

Again using Windows as an example, let’s see what happens when you deploy these applications in virtual machines hosted on VMware ESX. The OS that sits on top of the hardware is now a slimmed-down version of Linux, upon which the VMware ESX host OS resides. Each VM consists of a VMware guest supporting a guest operating system. So now your layers consist of hardware, Linux, the VMware Host, and N instances of a VM Guest with the Guest OS and the applications. So you have gone from three layers to six. Do you feel as layered as an onion or an ogre yet? Or perhaps having to support all of this complexity makes you want to turn into an ogre, so that you can deal with people who do not understand all of this the way that Shrek did (he could growl at people and they would all run away).

But seriously now, the move to virtualized application architectures, while laudable for reasons of flexibility and disaster recovery, has injected a new level of complexity into the mix. And this new level of complexity is not just due to the fact that one more vendor’s product is running on your servers. It is due to two facts:

First, VMware changes how certain aspects of Windows work. Specifically, if you are monitoring how a Windows OS is using resources to do performance management or capacity planning, you cannot compare numbers gathered from a Windows OS running natively on its own hardware to numbers gathered from a Windows OS running inside of a VM. If you are running a web-based application under IIS that typically used 50% of CPU on a dual-CPU server with Windows deployed on native hardware, you cannot compare that number to any number you get from the same application running on the same copy of IIS running on the same version of Windows Server inside of a VM. The reason is that the first number is the percentage of the total CPU available on the server that is being used by the application. The second number is the percentage of the CPU that VMware has given that guest that is being used by the application. So you cannot make apples-to-apples, physical-to-virtual comparisons with Windows Perfmon counters that measure how much of a given resource is used, or how much activity is generated, per unit of time. In other words, all of the data generated by Windows Perfmon counters that pertain to time-based resources (CPU utilization per unit of time, page faults per second, etc.) is useless when that Windows OS is hosted inside of a VMware VM.
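To make the arithmetic concrete, here is a minimal sketch in Python. It assumes a deliberately simplified model in which the number reported inside the guest is a fraction of whatever share of the physical CPU the host actually delivered to that guest during the sample interval. The function name and the `granted_fraction` input are hypothetical; they are not part of any VMware or Windows API, and real hypervisor scheduling is more complicated than a single multiplier.

```python
def host_cpu_fraction(guest_reported_util, granted_fraction):
    """Convert guest-relative CPU utilization into a fraction of the physical host.

    guest_reported_util: utilization as seen inside the guest (0.0 to 1.0)
    granted_fraction:    share of the host's physical CPU delivered to the
                         guest over the same interval (0.0 to 1.0); assumed
                         to be obtainable from the host, not from the guest
    """
    for value in (guest_reported_util, granted_fraction):
        if not 0.0 <= value <= 1.0:
            raise ValueError("fractions must be between 0.0 and 1.0")
    return guest_reported_util * granted_fraction


# Native Windows: 50% really means half of the physical server.
native_util = 0.50

# Same application in a VM that was given a quarter of the host:
# the guest still reports 50%, but that is 50% of a much smaller pie.
vm_util_on_host = host_cpu_fraction(0.50, 0.25)

print(f"native: {native_util:.1%} of the server")
print(f"in a VM: {vm_util_on_host:.1%} of the server")
```

The two readings of "50%" differ by a factor of four in this example, and the guest has no way of knowing that on its own.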

The second reason is that systems management tools and performance management tools have not (with some very rare exceptions) woken up to the fact that the world is getting more layered. Core infrastructure management products from vendors like HP, CA, IBM, and NetIQ are built around the assumption that there is an OS, some OS-related middleware (IIS, COM+, .Net, etc.) and an application to manage. These products look at resources in a very simple way, apportioning them among the OS, some middleware, and an instance of an application. These products do not understand the proliferation of layers that has occurred in our production server environments, nor do they understand how those layers are related to each other. It is the relationship between the layers that will prove to be the greatest challenge. In other words, what we need to know about a VMware-hosted applications environment is not the VMware Guest’s opinion of how much of its resources an application is using, but the facts of how the server hardware resources and their utilization are being apportioned between all of these layers and the key applications running in each Guest OS. This requires a substantial change in how systems management data is gathered and presented, since neither the Windows OS nor the systems management products are built around the assumptions of these layered and virtualized environments.
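As an illustration of the kind of view that is missing, the following Python sketch rolls per-guest numbers up into a single picture of the physical server. Everything in it is an assumption made for the example: the `Guest` class, the `apportion_host_cpu` function, the idea that each guest's share of host CPU and the hypervisor's own overhead are available as inputs, and the sample figures. It is not how any particular management product works.

```python
from dataclasses import dataclass, field


@dataclass
class Guest:
    name: str
    # Share of physical host CPU delivered to this guest (hypothetical input
    # that would have to come from the host layer, not from inside the guest).
    granted_fraction: float
    # Application name -> CPU utilization as reported inside the guest.
    app_util: dict = field(default_factory=dict)


def apportion_host_cpu(guests, hypervisor_overhead=0.0):
    """Return {label: fraction of physical host CPU} across all layers."""
    usage = {"hypervisor": hypervisor_overhead}
    for guest in guests:
        for app, util in guest.app_util.items():
            # Rescale the guest's opinion into host terms.
            usage[f"{guest.name}/{app}"] = util * guest.granted_fraction
    # Whatever is not accounted for above is treated as idle capacity.
    usage["idle"] = 1.0 - sum(usage.values())
    return usage


guests = [
    Guest("vm1", granted_fraction=0.50, app_util={"iis": 0.50}),
    Guest("vm2", granted_fraction=0.25, app_util={"sql": 0.50}),
]
usage = apportion_host_cpu(guests, hypervisor_overhead=0.125)

for label, fraction in usage.items():
    print(f"{label:12s} {fraction:.1%}")
```

Both guests report 50% for their application, yet in host terms one application consumes twice as much of the server as the other. That relationship between layers is exactly what a guest-only view cannot show.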

If you are experiencing challenges in managing your virtualized applications environment, I would like to hear from you. Please send me an email at bernd.harzog@apmexperts.com, and let me know what your challenges are. Perhaps together we can work through them.

Bernd Harzog
CEO
Applications Performance Management Experts
www.apmexperts.com
bernd.harzog@apmexperts.com
770-475-4249

Monday, September 12, 2005

Beware of the Franken-Monitor

As I have mentioned many times in my various blogs, it continues to amaze me that we as an industry have gone so far down the road of building and deploying management tools, and we have yet to solve the most basic and most important problem. That problem is knowing when a user of a business critical application is having a problem with that application, and then combining that knowledge with knowledge of the state of the application’s infrastructure to find the likely source of the problem.

Let’s focus on the response time part of that challenge first. Here is a little ad hoc list of vendors who in some way or another measure response time:

  • Mercury can measure response time via scripted synthetic agents (an approach that I am not a fan of), and can also measure response time at the HTTP level via its Real User Manager (RUM).
  • Wily can measure the response time from the perspective of a browser used by a real production user via its Browser Response Time Adaptor, which is a companion product to Introscope, its J2EE deep dive product.
  • Cesura (formerly VEIO) can measure the response time for URL classes at the web server (this misses the part of the round trip out to the client, but is valuable nonetheless).
  • Premitech can measure the response time of the network (network latency) between a Citrix server and the network out to the clients.
  • A variety of network management vendors (NetQos, Visual Networks, NetScout) can measure network latency in various hops of the network that supports the application in question.

Now for the second part of the challenge, figuring out where the source of the problem is. Unfortunately most enterprises have deployed infrastructure management products from vendors like IBM/Tivoli, HP, CA, BMC, and NetIQ that collect a variety of metrics about how servers are using various resources and use this information to report availability (which they are pretty good at), and to infer something about performance (which they are terrible at since you cannot infer response time of an application from its underlying resource utilization statistics).

So, if you have a utilization monitor, and if you have a separate response time monitoring tool (you should) then you are well on your way towards building and supporting your own Franken-Monitor, a hodge-podge of ad-hoc tools each designed to solve one very specific problem, which do not put their information into a common database, and which therefore cannot work together to help you really find the source of your response time problems.
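To show what a "common database" would buy you, here is a small Python sketch that joins response time samples and utilization samples on a shared time bucket, then lists the resource readings for the intervals in which users were actually waiting. The function names, the tuple layouts, the one-minute bucket, and the sample data are all hypothetical; no vendor's tool or export format is being described.

```python
from collections import defaultdict


def merge_samples(response_times, utilization, bucket_seconds=60):
    """Join two independently collected sample streams on a time bucket.

    response_times: iterable of (unix_ts, seconds)
    utilization:    iterable of (unix_ts, server_name, cpu_fraction)
    If several samples land in one bucket, the last one wins.
    """
    merged = defaultdict(dict)
    for ts, seconds in response_times:
        bucket = ts // bucket_seconds * bucket_seconds
        merged[bucket]["response_s"] = seconds
    for ts, server, cpu in utilization:
        bucket = ts // bucket_seconds * bucket_seconds
        merged[bucket].setdefault("cpu", {})[server] = cpu
    return dict(sorted(merged.items()))


def slow_intervals(merged, threshold_s):
    """Return (bucket, per-server CPU) for buckets slower than the threshold."""
    return [
        (bucket, row.get("cpu", {}))
        for bucket, row in merged.items()
        if row.get("response_s", 0.0) > threshold_s
    ]


response_times = [(0, 1.2), (60, 4.8), (125, 1.0)]
utilization = [
    (5, "web1", 0.40),
    (65, "web1", 0.45),
    (70, "db1", 0.95),
    (130, "web1", 0.35),
]

merged = merge_samples(response_times, utilization)
for bucket, cpu_by_server in slow_intervals(merged, threshold_s=3.0):
    print(bucket, cpu_by_server)
```

The join starts from the response time problem and only then looks at utilization, rather than trying to infer response time from utilization. With the tools kept in separate silos, this lookup has to be done by hand.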

If you feel bad about having your own custom Franken-Monitor that you have carefully assembled over the last ten years, do not despair; there are vendors who would happily sell you one. Vendors that have grown their portfolios of systems management products through acquisition are especially guilty of this. Let’s pick on some people to make an example of them:

  • Mercury acquired SiteScope (an agentless infrastructure monitor), acquired Real User Manager (an HTTP level response time monitor), acquired its J2EE Deep Diagnostics product, and then converted its WinRunner test tool into Topaz to do scripted response time monitoring of web-based applications. Guess what? None of these products can put their respective information into a common unified database to support an integrated problem resolution process.
  • IBM/Tivoli will happily sell you a combination of OMEGAMON for WebSphere Applications Server, Tivoli Web Response Monitor, and a variety of Tivoli monitors for your server infrastructure, but, guess what? Again, none of these products integrate their data with each other, nor do they feed any kind of integrated problem resolution process.
  • Compuware has acquired Adlex to bolster its ability to measure response time at the network level, and has recently acquired a J2EE deep dive tool as well. Do these new products integrate with Compuware Vantage, which is the core systems management product from Compuware? Of course not.

So, is there a solution to this mess? Is there one thing you can buy that monitors true end user response time, collects the infrastructure metrics for the application system, and feeds all of that into some sort of automated process for finding the likely source of the problem? I know of a couple of vendors who are working on various aspects of this problem, so there is hope.


If you are an enterprise struggling with these issues, I would like to hear from you. If you send me an email, we can set up a one hour brainstorming session at no cost to you, and discuss your environment, your problems, and how they might be best addressed.

Best Regards,

Bernd Harzog
CEO
Applications Performance Management Experts
www.apmexperts.com
bernd.harzog@apmexperts.com
770-475-4249