Monday, September 12, 2005

Beware of the Franken-Monitor

As I have mentioned many times in my various blogs, it continues to amaze me that we as an industry have gone so far down the road of building and deploying management tools, and we have yet to solve the most basic and most important problem. That problem is knowing when a user of a business critical application is having a problem with that application, and then combining that knowledge with knowledge of the state of the application’s infrastructure to find the likely source of the problem.

Let’s focus on the response time part of that challenge first. Here is a little ad hoc list of vendors who in some way or another measure response time:

  • Mercury can measure response time with scripted synthetic agents (an approach that I am not a fan of), and also measure response time at the HTTP level via its Real User Manager (RUM)
  • Wily can measure the response time from the perspective of a browser used by a real production user via its Browser Response Time Adaptor, which is a companion product to Introscope, its J2EE deep dive product.
  • Cesura (formerly VEIO) can measure the response time for URL classes at the web server (this misses the part of the round trip out to the client, but is valuable nonetheless)
  • Premitech can measure the response time of the network (network latency) between a Citrix server and the network out to the clients
  • A variety of network management vendors (NetQoS, Visual Networks, NetScout) can measure network latency across the various hops of the network that supports the application in question.

Now for the second part of the challenge: figuring out where the source of the problem is. Unfortunately, most enterprises have deployed infrastructure management products from vendors like IBM/Tivoli, HP, CA, BMC, and NetIQ. These products collect a variety of metrics about how servers are using various resources, and use this information to report availability (which they are pretty good at) and to infer something about performance (which they are terrible at, since you cannot infer the response time of an application from its underlying resource utilization statistics).

So, if you have a utilization monitor, and if you have a separate response time monitoring tool (you should), then you are well on your way towards building and supporting your own Franken-Monitor: a hodge-podge of ad hoc tools, each designed to solve one very specific problem, which do not put their information into a common database, and which therefore cannot work together to help you really find the source of your response time problems.
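To make the "common database" point concrete, here is a minimal Python sketch of what becomes possible once response time and utilization samples share one time-indexed store. Everything in it is hypothetical and for illustration only: the `CommonStore` class, the metric names, and the choice of a simple Pearson correlation as the ranking method are my assumptions, not a description of any vendor's product. Correlation only points at a likely suspect; it does not prove a cause.

```python
from collections import defaultdict
from math import sqrt


class CommonStore:
    """Hypothetical shared store: every monitor writes samples here."""

    def __init__(self):
        # metric name -> {timestamp: value}
        self._series = defaultdict(dict)

    def record(self, metric, timestamp, value):
        self._series[metric][timestamp] = value

    def metrics(self):
        return list(self._series)

    def aligned(self, metric_a, metric_b):
        """Return the two series restricted to timestamps they share."""
        a, b = self._series[metric_a], self._series[metric_b]
        shared = sorted(a.keys() & b.keys())
        return [a[t] for t in shared], [b[t] for t in shared]


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when it is undefined."""
    n = len(xs)
    if n < 2:
        return 0.0
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def likely_sources(store, response_metric):
    """Rank infrastructure metrics by how closely they track response time."""
    ranked = []
    for name in store.metrics():
        if name == response_metric:
            continue
        xs, ys = store.aligned(response_metric, name)
        ranked.append((name, pearson(xs, ys)))
    return sorted(ranked, key=lambda pair: abs(pair[1]), reverse=True)


if __name__ == "__main__":
    store = CommonStore()
    # Made-up samples: response time in ms, CPU utilization in percent.
    samples = [(100, 20, 50), (120, 25, 50), (300, 90, 51), (310, 95, 50), (105, 22, 50)]
    for t, (resp_ms, db_cpu, web_cpu) in enumerate(samples):
        store.record("response_ms", t, resp_ms)
        store.record("db_cpu", t, db_cpu)
        store.record("web_cpu", t, web_cpu)
    for name, r in likely_sources(store, "response_ms"):
        print(f"{name}: r={r:.2f}")
```

The ranking logic is trivial. What matters is that `aligned` can only work when both kinds of data live in the same place with comparable timestamps, which is exactly what a Franken-Monitor cannot offer.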

If you feel bad about having your own custom Franken-Monitor that you have carefully assembled over the last ten years, do not despair; there are vendors who would happily sell you one. Vendors that have grown their portfolios of systems management products through acquisition are especially guilty of this. Let’s pick on a few of them as examples:

  • Mercury acquired SiteScope (an agentless infrastructure monitor), acquired Real User Manager (an HTTP-level response time monitor), acquired its J2EE Deep Diagnostics product, and then converted its WinRunner test tool into Topaz to do scripted response time monitoring of web-based applications. Guess what? None of these products can put their respective information into a common unified database to support an integrated problem resolution process.
  • IBM/Tivoli will happily sell you a combination of OMEGAMON for WebSphere Application Server, Tivoli Web Response Monitor, and a variety of Tivoli monitors for your server infrastructure, but, guess what? Again, none of these products integrate their data with each other, nor do they feed any kind of integrated problem resolution process.
  • Compuware has acquired Adlex to bolster its ability to measure response time at the network level, and has recently acquired a J2EE deep dive tool as well. Do these new products integrate with Compuware Vantage, which is the core systems management product from Compuware? Of course not.

So, is there a solution to this mess? Is there one thing you can buy that monitors true end user response time, collects the infrastructure metrics for the application system, and feeds all of that into some sort of automated process for finding the likely source of the problem? I know of a couple of vendors who are working on various aspects of this problem, so there is hope.


If you are an enterprise struggling with these issues, I would like to hear from you. If you send me an email, we can set up a one hour brainstorming session at no cost to you, and discuss your environment, your problems, and how they might be best addressed.

Best Regards,

Bernd Harzog
CEO
Applications Performance Management Experts
www.apmexperts.com
bernd.harzog@apmexperts.com
770-475-4249