Open Telemetry (OTEL) is a CNCF project that aims to provide a vendor neutral way to collect application metrics, traces and logs, independent of any APM vendor. This is an ambitious undertaking, and it is not yet certain to bear fruit.
The Current State of Open Telemetry (OTEL)
Open Telemetry is a Cloud Native Computing Foundation (CNCF) open source project. The goal is to create a standard “agent” that collects application behavior telemetry in a standard way. If this works, it would relieve vendors of the cost of managing the ongoing development of their own agents, and make it easy for customers to switch the back ends that store and process this data. So the idea is a “free” agent that collects the data and a choice of open source or vendor provided back ends that process the data and make it useful (find problems and help customers solve those problems). So, what is not to like? Below is a list of concerns.
Whose Throat Are You Going to Choke?
One of the characteristics of application monitoring is that, unlike infrastructure and cloud monitoring, there is no API to call to get the application level data. The only way to get the data is to insert some code (either Open Telemetry libraries or a vendor provided agent) into your application. This is like inserting a “good virus” into your code. The job of this “good virus” is to collect the data you want and need about the behavior of the application WITHOUT creating any performance, throughput, error or security problems in the process. Customers have learned the hard way that they need to test every vendor agent against every application, every language and every runtime to make sure that the data collection process does not create problems. When problems occur, customers rely upon their vendors to fix them. With agents provided by Observability and APM vendors, it is clear who supports and stands behind the agent. With an open source project, where you may not have a maintenance and support agreement in place for the agent, it is not so clear who will fix it for you when it does not work. Let’s be clear: the risk of the Observability agent not working is that it breaks your application in some way. This concern is amplified by the fact that Open Telemetry is an “incubating” project, not a “graduated” project, at the CNCF.
Open Telemetry Requires Code Instrumentation
If you (a customer) download the Open Telemetry bits from GitHub, what you will get is a library that you or one of your developers needs to integrate with your applications. The Open Telemetry libraries will do what your developer tells them to do. And therein lie two problems. The first is that no one has enough developers or development capacity to build and maintain the applications that implement their core business processes in software (Digital Transformation). Given that shortage, it just does not make sense to spend limited development capacity on building and maintaining monitoring. The second problem is that each of your developers (or development teams) will, by their nature, implement Open Telemetry differently. This will make it very hard, if not impossible, to compare data collected by Open Telemetry across applications.
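To make the second problem concrete, here is a minimal sketch in plain Python. It deliberately does not use the Open Telemetry API; the span dictionaries, attribute names and the `spans_for_customer` helper are all hypothetical and exist only to illustrate how two teams can record the same operation in ways that no longer line up.

```python
from collections import Counter

# Hypothetical spans emitted by two teams for the same kind of operation.
# Attribute names are illustrative assumptions, not an Open Telemetry schema.
team_a_spans = [
    {"name": "checkout", "attributes": {"customer.id": "c-17", "duration_ms": 120}},
    {"name": "checkout", "attributes": {"customer.id": "c-42", "duration_ms": 95}},
]
team_b_spans = [
    # Same operation and same customer, but a different key and a different unit.
    {"name": "Checkout", "attributes": {"custId": "c-17", "elapsed": 0.130}},
]


def spans_for_customer(spans, customer_id, key="customer.id"):
    """Return the spans whose attributes carry customer_id under the given key."""
    return [s for s in spans if s["attributes"].get(key) == customer_id]


all_spans = team_a_spans + team_b_spans

# A query written against team A's convention silently misses team B's span.
matches = spans_for_customer(all_spans, "c-17")

# Counting attribute keys across all spans exposes the divergence.
attribute_keys = Counter(k for s in all_spans for k in s["attributes"])
```

In this sketch `matches` holds only team A's span for customer `c-17`, even though team B recorded a checkout for the same customer. Nothing fails loudly; the data is simply not comparable until someone agrees on names and units.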
There is only one way to get around this code instrumentation problem, which is to work with a vendor that has packaged Open Telemetry into an installable agent. For example, the Splunk Observability Cloud product includes agents based upon Open Telemetry. But the result of this is still language specific agents (see below) that require the customer to know what is running in each container, and to install the correct agent in each case.
Open Telemetry is Language Specific
Both the code that you download from GitHub and the agents that you get from vendors who support Open Telemetry are language specific. This means that as a customer using Open Telemetry you have to figure out which of the 11 languages supported by Open Telemetry is running in each of your thousands of containers, and either install the correct agent or integrate the correct libraries. This makes Open Telemetry a problematic approach for large scale cloud projects that have thousands of containers and a CI/CD process that rapidly changes the applications (often multiple times a day). Leading edge agents like the OneAgent from Dynatrace, the IBM/Instana agent and the Pixie agent from New Relic are language independent. You install them into the underlying OS or as a DaemonSet in Kubernetes, and they figure out what language is running and insert the correct instrumentation accordingly.
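The detection step that a language independent agent performs can be pictured with a small sketch. This is not how any of the vendor agents named above actually work; it is a simplified illustration that guesses the language from a process command line, and the mapping table, function names and container names are all hypothetical.

```python
import shlex
from pathlib import PurePosixPath

# Hypothetical mapping from executable name to a language label.
RUNTIME_BY_EXECUTABLE = {
    "java": "java",
    "node": "javascript",
    "python": "python",
    "python3": "python",
    "dotnet": "dotnet",
    "ruby": "ruby",
}


def detect_language(command_line: str) -> str:
    """Guess the language from the first token of a process command line."""
    tokens = shlex.split(command_line)
    if not tokens:
        return "unknown"
    executable = PurePosixPath(tokens[0]).name  # strip any directory prefix
    return RUNTIME_BY_EXECUTABLE.get(executable, "unknown")


def plan_instrumentation(containers: dict) -> dict:
    """Map each container name to the language its instrumentation must target."""
    return {name: detect_language(cmd) for name, cmd in containers.items()}


# Hypothetical container inventory: name -> main process command line.
plan = plan_instrumentation({
    "cart": "/usr/bin/java -jar cart.jar",
    "web": "node server.js",
    "orders": "./orders-service --port 8080",
})
```

Here `orders` comes back as `unknown` because a compiled binary says nothing about its language on the command line. That gap is the work a customer takes on by hand when every agent or library is language specific.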
Open Telemetry is Missing Observability Functionality
There is general agreement in the APM and Observability industries that the minimum level of functionality for Observability is that metrics, traces, and logs are collected by the agents from every application and for every transaction of interest. Open Telemetry grew out of the OpenTracing project (merged with OpenCensus), so support for tracing was the first thing to happen. Metrics were added next and logs are in the process of being added. Leading edge products from the Observability vendors go further: they collect the full topology for a transaction, including where it runs and what it depends upon, they collect what has changed in the environment, and they collect security issues that might be affecting the application or that are being introduced by the application itself. So at this point the agents from the Observability vendors include important functionality which is missing from the Open Telemetry libraries and agents.
Open Telemetry Does Not Concern the Back End
It is a “feature” of Open Telemetry to focus upon the collection of the application level data and not upon how this data is stored and processed. The problem is that turning the flood of Observability data into useful and actionable information and answers is a big problem that cannot be solved without paying attention to how the data is collected in the first place. If you want to get the metrics, traces, logs, and topology for a transaction when you query the back end for that transaction, then you had better collect all of these data types “in context” and store them in the back end in a related manner. So the question of how to make this data useful for solving and preventing problems is ignored by Open Telemetry, as is the cost of storing and processing this data.
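What “in context” means for a back end query can be sketched as follows. The record layout, the `trace_id` field and the `index_by_trace` helper are assumptions made for illustration; they do not describe any vendor's storage format or the Open Telemetry data model.

```python
from collections import defaultdict

# Hypothetical telemetry records; field names are assumptions for illustration.
records = [
    {"kind": "trace", "trace_id": "t1", "span": "GET /cart", "duration_ms": 180},
    {"kind": "log", "trace_id": "t1", "message": "cart lookup slow"},
    {"kind": "metric", "trace_id": "t1", "name": "db.query.time_ms", "value": 150},
    {"kind": "log", "trace_id": None, "message": "disk usage at 91%"},  # collected without context
    {"kind": "trace", "trace_id": "t2", "span": "GET /home", "duration_ms": 20},
]


def index_by_trace(records):
    """Group records by trace_id, then by kind; records without an id become orphans."""
    index = defaultdict(lambda: defaultdict(list))
    orphans = []
    for record in records:
        if record["trace_id"] is None:
            orphans.append(record)
        else:
            index[record["trace_id"]][record["kind"]].append(record)
    return index, orphans


index, orphans = index_by_trace(records)

# Everything known about transaction t1, retrievable with a single lookup.
t1 = index["t1"]
```

The log line that was collected without a transaction identifier ends up in `orphans`: it is stored, but no query for a transaction will ever return it. Whether data can be related at query time is decided at collection time.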
Dynatrace has built Grail, New Relic has built the NRDB, Splunk has acquired Omnition and Instana has acquired BeeInstant specifically because each of those vendors concluded that they could not cobble together a viable Observability back end out of open source databases. In these cases viability concerned both the functionality of the back end and the cost of supporting the vendor back end in a public cloud. With this much focus being applied to the back ends of Observability vendors, it is only natural that the data collection process be customized to fit how the data is stored and processed.
The Pace of Innovation is too High for Open Telemetry to Keep Up
Go back five years. The cloud was not the dominant place to run new applications. Microservices were not yet an important (some say dominant) application architecture. Containers were not the dominant application deployment platform. The list of languages was shorter. CI/CD was not used to automate much of the process of delivering new code into production, and application code was not changed in production multiple times per day. The list of supporting software and runtimes was simpler and shorter. Kubernetes was not used to automatically make changes to applications running in production. All this has changed in the last five years, resulting in a new set of requirements for how applications are monitored in production. An open source project like Open Telemetry does not have the development and product leadership to be able to “skate to where the puck is going to be” and stay in a leadership position in the Observability industry. Things are likely to change as much in the next five years as they have changed in the last five years, and it will be crucial for the leading Observability vendors to keep up with those changes.
Can Open Telemetry Be Successful?
Yes, Open Telemetry can be successful. But only if the following things happen:
- The Open Telemetry project produces agents instead of libraries of code that require developer attention on the part of the customer to integrate with their applications.
- The Open Telemetry agent becomes language independent like the Dynatrace agent, the IBM (Instana) agent, and the New Relic Pixie agent are now.
- The Open Telemetry agent standardizes upon a set of functionality, and lets each vendor add the things to it that they need to differentiate.
- The leaders in Observability (Datadog, Dynatrace, and New Relic) agree to use the standard agent and to add their differentiation to it.
- The manner in which the Open Telemetry Agent collects data and sends it to the back end becomes aware of and sensitive to the cost of processing that data and storing that data in the various back ends of the Observability vendors in their respective clouds.
- The risk of running Open Telemetry in production becomes managed either by existing Observability vendors or by including the Open Telemetry agents into something else for which enterprises already have a support process in place.
- The entire Open Telemetry chain of delivery of functionality to customers becomes superior to the existing chain based upon vendor developed agents in keeping up with the pace of innovation in the application development and deployment industries.
While Open Telemetry may seem attractive because it is “free” and “open”, it currently comes with limitations that prevent its use in rapidly evolving, large scale enterprise cloud deployments. If those limitations are addressed, then Open Telemetry may become a standard core of the application data collection landscape. But this will depend upon significant cooperation by vendors who compete aggressively with each other, which is by no means guaranteed to occur.