Observability Takes the Driver’s Seat…

Monday, May 2nd, 2022

For years, software solutions offering centralized management of heterogeneous systems have received little respect from the market. Offerings from major systems vendors were viewed as doing nothing more than patching holes in the vendor’s underlying products, while competing offerings from independent solution providers often struggled to gain traction or achieve sufficient margins. The investor multiples assigned to these companies reflected the market’s opinion that this was software that a newer market entrant could easily rewrite.

Each Hyperscaler provides a rich management suite for its own cloud, the recent operating landscape has been dominated by VMware (with vSphere management), and Kubernetes is flush with competing open-source monitoring solutions. Given all that, a 2021 forecast that this situation was not going to change anytime soon would have seemed reasonable.

Oh, how wrong that forecast would have been! The converging forces of distributed applications, multi-cloud adoption, the intelligent edge, the distributed workforce, and the need to continue supporting virtual machine environments are rapidly establishing global management as one of the most critical decisions that the Enterprise will need to make as it standardizes its modern computing architecture. Customers have moved from landing in the cloud and innovating in the cloud to scaling their production in the cloud(s). Early adopters are seeing that managing all this distributed and heterogeneous activity at scale is not easy. Choosing a partner (or two) to enable production at scale is going to be a critical IT decision in 2022.

Enterprise decision makers will have a critical choice to make. Let’s quickly review a few of the top considerations.

Monitoring or Observability?

Commercial Kubernetes distributions include monitoring (along with metrics and logging) for individual Kubernetes clusters and all their namespaces, typically delivered by integrating several hardened open-source programs. Atop these distributions, almost all container management systems include (or offer the ability to add) integrated global monitoring of multiple clusters, with drill-down to the individual namespace. Hyperscalers each offer a similar capability for their individual clouds and may also include some network and edge monitoring.

But monitoring is not enough. Monitoring is the visualization of (or alerting on) specific information that you (or the provider) defined in advance. In a complex distributed environment, you need observability. Observability extends beyond monitoring, allowing you to determine what’s important by watching how the system performs over time (and asking relevant questions about it). As with traditional monitoring, the outputs may include visualizations of current state, system history, or even trend analysis to forecast a future state, but the questions (and answers) morph according to the actual state and history of the system. Observability systems are often “intelligent,” able to tie together inputs from various sources in the software/hardware stack to give the user added insight, for example to identify the source of a problem (or warn of a future one). Depending on the nature of the information being delivered, you may have the option to see it on a dashboard, receive it as an alert, or export it to other presentation tools for reporting.
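To make the distinction concrete, here is a minimal sketch in Python. It is an illustration only, not how any product named in this article works: the event records, field names, and the functions `monitor_error_rate` and `explore` are all hypothetical. The first function is monitoring in the sense described above (one question, fixed in advance); the second stands in for the observability style of asking new questions of the same raw data after the fact.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw events; the field names are assumptions for illustration.
EVENTS = [
    {"service": "checkout", "region": "us-east", "latency_ms": 120, "error": False},
    {"service": "checkout", "region": "eu-west", "latency_ms": 950, "error": True},
    {"service": "checkout", "region": "eu-west", "latency_ms": 870, "error": True},
    {"service": "search", "region": "us-east", "latency_ms": 80, "error": False},
    {"service": "search", "region": "eu-west", "latency_ms": 95, "error": False},
]


def monitor_error_rate(events, threshold=0.25):
    """Monitoring: one check, defined in advance, that fires or does not."""
    rate = sum(1 for e in events if e["error"]) / len(events)
    return rate > threshold


def explore(events, group_by, metric):
    """Observability-style query: average a metric over any grouping
    of fields, chosen by the operator at investigation time."""
    groups = defaultdict(list)
    for e in events:
        key = tuple(e[field] for field in group_by)
        groups[key].append(e[metric])
    return {key: mean(values) for key, values in groups.items()}


if __name__ == "__main__":
    # The predefined alert says only that something is wrong.
    print("alert fired:", monitor_error_rate(EVENTS))
    # The ad hoc query narrows down where the problem lives.
    for key, value in explore(EVENTS, ["service", "region"], "latency_ms").items():
        print(key, value)
```

The alert tells you that errors are elevated; slicing latency by service and region points at checkout in eu-west. A real observability platform does far more (correlation across the stack, history, forecasting), but the shift from a fixed question to an open-ended one is the core idea.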

Sources – Coverage; Investment and Time to Production?

A successful observability system needs access to data from the right sources: the software, hardware systems, and even components that are critical to a customer’s system availability, security, and performance. Critical sources can range from the component level to an entire ecosystem. A customer using specialized compute processors (as a product or as a service) for graphics, video, or training will need to be certain that data collectors are available for the specific components driving system performance. At the system level, a customer who has standardized on a particular storage solution (e.g., NetApp) across both on-premises and the cloud may wish to leverage an observability system (e.g., NetApp Cloud Insights) that can offer detailed analytics about how this storage, whether on-prem or in the cloud, is performing and being used. At a full ecosystem level, customers who decide to use VMware Cloud across on-premises and multiple disparate cloud environments may find Tanzu Mission Control with Tanzu Observability by Wavefront to be an interesting alternative for managing the combination of container and virtual machine environments. Independent observability solutions with a more extensive history in the space (e.g., Dynatrace, Sumo Logic, Datadog, etc.) may offer a broader set of pre-packaged collectors on their menu. In addition to pre-packaged collectors, some of these independent observability providers offer users the option to write their own collectors. This last alternative offers the flexibility to cover the broadest set of required inputs but requires a higher upfront investment to put the system into production (and to maintain it over time). Customers will want to strike a balance between breadth of coverage and the investment and time needed to reach production.

Beyond Observability – AIOps

Perhaps the most critical consideration for a buyer selecting a global management solution is its current (and future) capabilities for AIOps. Observability has rapidly become table stakes for providers offering global management of container and multi-cloud environments. But automation is rapidly becoming essential to the successful operation of a scaled-out distributed and multi-cloud environment. The volume of tasks that must be managed, and the speed at which events must be handled, demand an autonomic response. This is already a visible requirement in managing container security: given the new attack surfaces, and the speed at which different services and images are established (and shut down), an Enterprise system cannot wait for a human to make decisions; a machine must make them on the human’s behalf. But this requirement is rapidly becoming relevant for the entire computing ecosystem. Distributed container and multi-cloud systems must be adaptive, adjusting to changing conditions without the need for human intervention. This automation requires artificial intelligence for IT operations (AIOps). AIOps is the application of artificial intelligence (AI), such as machine learning, to automate and streamline operational workflows. Enterprise buyers intending to scale a modern computing architecture need to select solutions that are capable (now or in the near future) of providing this functionality. In a universe where microservices drive the need to manage tens to hundreds of thousands of “systems,” you will need AIOps to automatically (and securely) address changing system conditions without the need for human intervention.
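The detect-then-act loop behind that idea can be sketched in a few lines of Python. This is a deliberately simplified illustration under stated assumptions: a z-score against recent history stands in for the machine-learning models a real AIOps product would use, and the metric names, the playbook in `choose_action`, and the action labels are all hypothetical rather than drawn from any vendor’s API.

```python
from statistics import mean, stdev


def is_anomalous(history, value, z_threshold=3.0):
    """Flag a reading that sits far outside the recent baseline.

    A plain z-score is used here as a stand-in for a learned model.
    """
    if len(history) < 2:
        return False  # not enough data to establish a baseline
    baseline, spread = mean(history), stdev(history)
    if spread == 0:
        return value != baseline
    return abs(value - baseline) / spread > z_threshold


def choose_action(metric, value, baseline):
    """Hypothetical playbook mapping an anomaly to an automated response."""
    if metric == "cpu_percent" and value > baseline:
        return "scale_out"
    if metric == "error_rate" and value > baseline:
        return "restart_service"
    return "open_ticket"  # no automated fix known: hand off to a human


def auto_respond(metric, history, value):
    """Return the action to take, or None when the reading looks normal."""
    if not is_anomalous(history, value):
        return None
    return choose_action(metric, value, mean(history))


if __name__ == "__main__":
    cpu_history = [40, 42, 41, 39, 43]
    print(auto_respond("cpu_percent", cpu_history, 42))  # within baseline
    print(auto_respond("cpu_percent", cpu_history, 95))  # far above baseline
```

Note the fallback: when the playbook has no safe automated response, the sketch opens a ticket rather than guessing. Deciding which actions a machine may take on its own, and which still go to a person, is part of what a buyer should probe when evaluating these capabilities.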