Generative AI Gets to the Root of the MTTR Problem

A new set of tools uses conversational language to help developers quickly triage and investigate problems the moment they’re detected

Steve Barrett, VP EMEA, Datadog

November 4, 2024

Developers have a wide array of generative AI tools to choose from in their work. But while solutions such as Google Gemini and GitHub Copilot help them write functional code, can generative AI also help reduce mean time to recovery (MTTR) when problems occur?
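For readers less familiar with the metric, here is a minimal sketch of how MTTR might be calculated. It assumes recovery time is measured from detection to resolution (some teams start the clock at the onset of the failure instead), and the incident timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average (resolved - detected) over a list of (detected, resolved) pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((resolved - detected for detected, resolved in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incident log: (detected, resolved) timestamps.
incidents = [
    (datetime(2024, 11, 1, 9, 0), datetime(2024, 11, 1, 9, 45)),     # 45 minutes
    (datetime(2024, 11, 2, 14, 0), datetime(2024, 11, 2, 15, 30)),   # 90 minutes
    (datetime(2024, 11, 3, 22, 10), datetime(2024, 11, 3, 22, 25)),  # 15 minutes
]

mttr = mean_time_to_recovery(incidents)
print(mttr)  # 0:50:00
```

Anything that shortens the gap between detection and resolution for individual incidents pulls this average down.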

Fortunately, a new set of tools has emerged that uses conversational language to help developers quickly triage and investigate problems the moment they’re detected. The same tools are also useful to security teams for diagnosing and mitigating threats. By leveraging machine learning algorithms, these autonomous tools identify the root cause of incidents and help plan and execute remediation.

Streamlining Processes

The massive volumes of data generated by disparate sources mean that, when an incident is detected in a production environment, it can be difficult to triage and investigate the issue. Efficiently tracking and managing the process so that everyone involved can access the most current information and context is essential, but it can be time-consuming and resource-intensive.

With the help of a generative AI-based copilot, however, it’s possible to streamline the process of investigating alerts and responding to incidents. By applying AI and machine learning (ML) methods to incoming and historical data from a range of sources, such a copilot could detect anomalies and outliers, for instance, and suggest options for addressing them.
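To make the idea of detecting anomalies against historical data concrete, here is a minimal sketch that flags a metric value when it sits far from the mean of the readings just before it. It is a simple rolling z-score, not the method any particular product uses; the function name, window size, threshold, and latency figures are all illustrative assumptions.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value is more than `threshold` standard
    deviations away from the mean of the preceding `window` values."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: treat any change at all as anomalous.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical request latencies in milliseconds, one reading per minute.
latency_ms = [120, 118, 122, 121, 119, 120, 480, 121, 119]

print(find_anomalies(latency_ms))  # [6]
```

Only the 480 ms spike is flagged. The readings that follow it are compared against a window that now contains the spike, so they fall within the widened range and are left alone.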

Furthermore, correlating events to determine whether they’re interrelated could reduce alert noise and MTTR, and prevent multiple teams from duplicating the same efforts. It could also lead to further efficiencies by assigning tasks to staff to implement solutions and by automating remediation workflows across relevant tools and services.
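As a rough illustration of event correlation, the sketch below groups alerts that fire for the same service within a few minutes of each other, so that one team sees one grouped problem rather than several separate pages. Real tools are likely to use richer signals such as service topology or tags; the `Alert` type, the five-minute window, and the sample alerts are assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Alert:
    service: str
    message: str
    fired_at: datetime


def correlate(alerts, window=timedelta(minutes=5)):
    """Group alerts for the same service that fire within `window`
    of the previous alert already in that service's group."""
    groups = []
    latest_group = {}  # service name -> most recent group for that service
    for alert in sorted(alerts, key=lambda a: a.fired_at):
        group = latest_group.get(alert.service)
        if group and alert.fired_at - group[-1].fired_at <= window:
            group.append(alert)
        else:
            group = [alert]
            groups.append(group)
            latest_group[alert.service] = group
    return groups


# Hypothetical alerts from a single morning.
alerts = [
    Alert("checkout", "p99 latency high", datetime(2024, 11, 4, 10, 0)),
    Alert("checkout", "error rate high", datetime(2024, 11, 4, 10, 2)),
    Alert("search", "disk usage high", datetime(2024, 11, 4, 10, 3)),
    Alert("checkout", "pod restarts", datetime(2024, 11, 4, 10, 4)),
    Alert("checkout", "p99 latency high", datetime(2024, 11, 4, 11, 30)),
]

groups = correlate(alerts)
print([len(g) for g in groups])  # [3, 1, 1]
```

Five alerts collapse into three groups: the burst of checkout alerts becomes a single item to investigate, while the unrelated search alert and the later checkout alert stay separate.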

Acting Autonomously

An autonomous agent capable of performing complex operational tasks, such as investigating alerts and coordinating incident responses – without human prompting – would actively work alongside a developer during their investigations, effectively driving them toward resolution much like a human colleague would.

As soon as a specific alert is triggered, the AI draws upon its comprehensive knowledge of a developer’s systems, documented troubleshooting procedures and best practices to identify potential root causes, providing the developer with investigation notes that are ready to review and act upon. Should they choose to escalate the alert into a full incident, the AI will join as an additional responder, surfacing key telemetry data, and continuously monitoring for signs of recovery.

Conversations With a Copilot

Given the pace at which incidents can progress, it can be hard for everyone to stay in the loop. However, integrating a generative AI copilot into a team’s incident response Slack channel will provide them with the details they need to identify problems, determine their scope, and begin root cause analysis. Then, when new responders join the channel, the AI will automatically provide them with a summary of everything that has happened to date.

This ability to converse with a copilot directly via a dedicated Slack channel while debugging an issue can be hugely valuable to developers. Handing off important tasks such as gaining insights from data, finding active issues, and generating code fixes can free up their time to tackle more complex issues. And, by harnessing state-of-the-art LLMs to reason, make decisions, and orchestrate remedial processes and actions, generative AI can effectively emulate a human colleague, allowing developers to operate more effectively across the entire software development lifecycle (SDLC).

Importantly, though, it would remain a copilot: the developer will always have ultimate control. The AI should only recommend remediation actions, and should never make any changes to a user’s systems before it receives confirmation from the developer, thereby allowing them to evaluate the risks of such changes and weigh them against their potential impact.
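The principle of recommending first and acting only on confirmation can be sketched as a simple approval gate. The `Recommendation` type, the `run_with_approval` function, and the rollback example are hypothetical; in practice the `confirm` step might be a button in Slack or a prompt in a terminal rather than a function argument.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Recommendation:
    description: str
    risk: str
    apply: Callable[[], str]  # the change itself; called only after approval


def run_with_approval(rec, confirm):
    """Apply the recommended change only if confirm(rec) returns True."""
    if not confirm(rec):
        return "skipped: " + rec.description
    return rec.apply()


# Hypothetical recommendation produced during an incident.
rec = Recommendation(
    description="roll back checkout-service to the previous release",
    risk="in-flight requests may be dropped briefly",
    apply=lambda: "rolled back checkout-service",
)

declined = run_with_approval(rec, confirm=lambda r: False)
approved = run_with_approval(rec, confirm=lambda r: True)

print(declined)  # skipped: roll back checkout-service to the previous release
print(approved)  # rolled back checkout-service
```

Because the change is wrapped in `apply` and never called before `confirm` returns, the developer sees the description and the stated risk before anything touches their systems.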

Reducing MTTR

By automating as much of the DevSecOps lifecycle as possible, generative AI tools such as these can help developers detect, investigate, and remediate any issues that arise, working as a human teammate would to help reduce MTTR.

Instead of multiple teams duplicating activity and chasing separate events, a single developer or team can focus on resolving the overall problem. The benefits of this speak for themselves. 

A reduction in MTTR means faster troubleshooting, less time spent searching for or testing fixes, and reduced user downtime – all of which mean faster delivery, a higher-quality end product, and – ultimately – better business outcomes.

About the Author

Steve Barrett

VP EMEA, Datadog

Steve is an experienced and passionate leader with over two decades of experience working in the technology sector, specializing in bringing SaaS and DevOps platform management solutions to market to support business leaders and DevOps communities. During his career, he has consistently delivered exceptional individual and team results by building a collaborative culture focused on driving the sustainable growth of companies.
