Why Observability Needs Both AI and Humans
While traditional monitoring approaches struggle with complexity, AI-powered observability is helping teams prevent incidents before they happen.
I've been closely tracking developments in system monitoring and observability over the past few years, and we're at a fascinating turning point.
What used to be a manageable flow of alerts and metrics has turned into a tsunami of data that overwhelms even the most experienced teams.
But we're witnessing something remarkable: the integration of AI into observability. It isn't just another tech trend—it's altering how we think about system monitoring.
From automating alert correlation to predicting potential failures before they happen, AI is helping teams move from reactive firefighting to proactive system management.
In today's newsletter, we'll explore:
How AI is changing enterprise observability
How to build a culture for AI-driven observability
How Intuit and Meta are using AI in observability and SRE
Let’s get started!
How AI is changing enterprise observability
According to recent studies, 75% of consumers worry about AI misinformation, while 35% of organizations identify security as their primary concern in AI adoption. As AI systems become more prevalent, you'll need to change how you monitor and maintain them.
When you look at enterprise observability today, you'll find three interconnected factors driving its transformation:
Your traditional monitoring approaches may be overwhelmed by the growing complexity of modern distributed systems. Current AIOps solutions struggle to effectively process the sheer volume of telemetry data, leading to delayed responses and missed insights.
With Generative AI and large language models (LLMs), you can transform your data analysis by simultaneously processing diverse data types (logs, metrics, traces).
This provides contextualized insights previously impossible to obtain, helping your teams identify and resolve issues before they impact business operations.
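To make the idea concrete, here's a minimal sketch of what "simultaneously processing diverse data types" can look like in practice: assembling logs, metrics, and traces into a single prompt so one LLM call can reason across all three signal types at once. The prompt shape, section names, and sample telemetry are illustrative assumptions, not any vendor's actual API.

```python
def build_correlation_prompt(logs, metrics, traces, window="last 15m"):
    """Assemble logs, metrics, and traces into one prompt so a single
    LLM call can correlate across all three signal types.

    Illustrative sketch only: section names and prompt wording are
    assumptions, not a specific product's schema."""
    sections = [("Logs", logs), ("Metrics", metrics), ("Traces", traces)]
    parts = [
        "You are an observability assistant. Correlate the telemetry "
        f"below ({window}) and suggest the most likely root cause."
    ]
    for name, items in sections:
        parts.append(f"## {name}")
        parts.extend(f"- {item}" for item in items)
    return "\n".join(parts)

# Hypothetical telemetry for illustration:
prompt = build_correlation_prompt(
    logs=["ERROR payment-svc: connection refused to db-3"],
    metrics=["db-3 CPU 97% (baseline 40%)"],
    traces=["checkout -> payment-svc -> db-3 (timeout 5000ms)"],
)
```

The payoff is cross-signal context: the model sees the error log, the CPU spike, and the slow trace side by side, which is what lets it connect symptoms a per-signal dashboard would show in isolation.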
As you scale your AI implementations across business functions, you'll need greater transparency and control.
This drives the need for intelligent observability solutions that give you clear visibility into AI system behavior while ensuring reliability and performance.
Foundation Capital's Ashu Garg and Jaya Gupta recently highlighted this shift in their blog "Goodbye AIOps: Welcome AgentSREs—The Next $100B Opportunity".
Their research suggests that your traditional AIOps approaches need to evolve to meet modern observability demands.
The adoption of intelligent observability varies across industries. Companies like Apica lead the charge by embracing Generative AI and incorporating advanced techniques like Retrieval Augmented Generation (RAG).
This approach not only improves accuracy but also provides deeper insights into system behavior.
Ranjan Parthasarathy, Apica's Chief Product and Technology Officer, explains, "GenAI can provide you with a more robust and flexible approach to addressing the challenges of modern observability."
Other organizations are taking a more measured approach, combining traditional AIOps with newer AI capabilities to create hybrid solutions that balance innovation with proven methodologies.
This diversity in approaches reflects different organizations' varying needs and maturity levels.
You'll find intelligent observability particularly valuable if you're:
1. Deploying large-scale AI applications
2. Requiring real-time insights into system behavior
3. Managing complex, distributed systems
4. Focusing on proactive incident prevention rather than reactive problem-solving
How to build a culture for AI-driven observability
When you adopt AI-powered observability, you'll face challenges beyond technology adoption. Creating a culture that effectively uses these advanced capabilities requires you to shift how your teams approach monitoring and incident response.
To successfully implement intelligent observability in your organization, you'll need to focus on three critical areas:
Developer empowerment and responsibility
Your AI-powered observability tools provide developers with unprecedented system insights. However, with this visibility comes new responsibilities.
You'll need your developers to move beyond writing code to understanding its real-world impact.
This means taking ownership throughout the code lifecycle and participating in on-call rotations—not just for incident response but also to better understand how their changes affect your production systems.
Knowledge management
While your AI systems can process vast amounts of telemetry data and identify patterns, your team's expertise remains crucial.
You'll need to prioritize documentation and knowledge sharing that complements your AI capabilities.
Here's what you should focus on:
Document incident patterns your AI systems identify
Create clear escalation paths that combine your automated and human responses
Share insights about system behavior to help train and improve your AI models
Maintain contextual information to help validate your AI-generated insights
Balance automation with human oversight
As you integrate AI into more observability tasks, you'll need to determine the right balance between automated and human-driven processes. Consider these key actions:
Set up reasonable on-call processes that use your AI for initial detection while maintaining human judgment for critical decisions
Focus your team's attention on business-critical issues rather than every alert
Create an environment where your teams can safely experiment with new AI capabilities
Build trust in your AI-powered observability tools while maintaining healthy skepticism
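The first of those actions, AI for initial detection with human judgment reserved for critical decisions, can be sketched as a simple triage gate. The thresholds, severity labels, and routing destinations below are illustrative assumptions, not a prescribed policy.

```python
def route_alert(alert, anomaly_score):
    """Route an alert based on a model's anomaly score plus severity.

    Sketch of AI-first triage with a human backstop; thresholds and
    labels are illustrative assumptions, tune them to your environment."""
    if alert.get("severity") == "critical":
        return "page-human"        # human judgment for critical decisions
    if anomaly_score >= 0.9:
        return "auto-remediate"    # high-confidence, well-understood fixes
    if anomaly_score >= 0.5:
        return "queue-for-review"  # human review, not an immediate page
    return "suppress"              # below threshold: likely noise
```

The point of the structure is the first branch: no matter how confident the model is, critical alerts never bypass a human, which keeps automation from eroding the oversight the previous list calls for.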
Start testing these practices early in your journey, ideally before production deployment.
By implementing observability practices in your pre-production environments, you can iteratively refine your approach, ensuring your technical and cultural elements are ready for production challenges.
How Intuit and Meta are using AI in observability and SRE
Intuit's AI-powered Kubernetes observability
With over 325 Kubernetes clusters supporting more than 7,000 applications and services, Intuit faced significant complexity in maintaining cluster health.
Their platform's rapid growth and frequent cluster changes led to alert fatigue among on-call engineers, complicating issue detection and remediation.
To address these challenges, Intuit developed a multi-layered solution. At its core, they implemented "Cluster Golden Signals" to provide consolidated health views, mirroring the concept of service golden signals.
This system filters out noise and focuses on critical signals for alerting, allowing engineers to quickly isolate problematic clusters and determine whether issues are service-related or platform-related.
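The rollup idea behind Cluster Golden Signals can be sketched as follows, assuming each cluster reports a handful of named boolean health signals. The signal names and fleet data are invented for illustration, not Intuit's actual schema.

```python
def cluster_health(signals):
    """Collapse per-signal statuses into one health verdict per cluster."""
    failing = [name for name, ok in signals.items() if not ok]
    return {"healthy": not failing, "failing_signals": failing}

def unhealthy_clusters(fleet):
    """Filter noise: surface only clusters whose golden signals are red."""
    return {
        cluster: cluster_health(signals)
        for cluster, signals in fleet.items()
        if not cluster_health(signals)["healthy"]
    }

# Hypothetical fleet and signal names for illustration:
fleet = {
    "cluster-a": {"api_server_latency_ok": True, "nodes_ready": True},
    "cluster-b": {"api_server_latency_ok": False, "nodes_ready": True},
}
```

With hundreds of clusters, an on-call engineer looks at the filtered map rather than every raw alert, which is exactly the noise reduction the consolidated health view is meant to provide.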
The team further improved its capabilities by integrating K8sGPT, an open-source tool that scans Kubernetes clusters to diagnose and triage issues.
K8sGPT leverages knowledge codified by site reliability engineers and uses resource-specific analyzers to extract relevant error messages from clusters, then enriches them with AI-generated insights.
For remediation, Intuit developed a proprietary GenAI operating system (GenOS) with retrieval-augmented generation. It addresses the limitations of public LLMs, which lack context about their specific platform configurations.
This implementation has streamlined their detection and debugging processes, significantly reducing mean time to detection.
Meta's AI-powered incident response
Meta's journey with AI in system reliability focused on streamlining its investigation processes, particularly within its monolithic repositories.
The complexity of its systems made investigating anomalies time-consuming, with responders struggling to build context around broken elements, affected systems, and impact scope.
Their solution combines heuristic-based retrieval with large language model-based ranking to accelerate root cause identification. The system first reduces the search space from thousands of changes to a few hundred using code and directory ownership data and runtime code graph analysis.
Meta then leverages a fine-tuned Llama model to identify the most likely root causes. Their innovative ranking approach uses an election-style system, processing 20 changes at a time and aggregating results to identify the top candidates.
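The election-style aggregation described above can be sketched as batching candidates, collecting the model's per-batch "votes," and tallying them. The scoring callable here is a stand-in stub, not the fine-tuned Llama model, and the helper names are hypothetical.

```python
from collections import Counter

BATCH_SIZE = 20  # matches the batch size described in the text

def rank_root_causes(changes, score_batch, top_k=5):
    """Election-style ranking sketch: score candidate changes in batches,
    then aggregate the per-batch votes to surface top root-cause candidates.

    changes:      candidate change IDs, already narrowed by heuristics
                  (ownership data, code graph analysis)
    score_batch:  callable that, given up to BATCH_SIZE changes, returns
                  the ones the model picks as likely culprits (its votes)
    """
    votes = Counter()
    for i in range(0, len(changes), BATCH_SIZE):
        votes.update(score_batch(changes[i:i + BATCH_SIZE]))
    return [change for change, _ in votes.most_common(top_k)]
```

Batching sidesteps the model's context limits: no single call sees all few hundred candidates, but the aggregated tally still produces a global ranking.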
The model underwent extensive training using historical investigations and internal data, including limited approved wikis, Q&As, and code.
This approach has yielded impressive results, achieving 42% accuracy in identifying root causes at investigation creation time for their web monorepo.
The system significantly reduces investigation time and improves responder decision-making capabilities.