Welcome to the SRE Agent project! This open-source AI agent helps you debug issues, keep your systems on Kubernetes healthy, and make your DevOps life easier.
Now powered by a command-line interface (CLI), you can interact with the agent directly from your terminal. Plug in your Kubernetes cluster and GitHub repo, and let the agent handle the heavy lifting: diagnosing issues, reporting results, and keeping your team in the loop.
SRE Agent is your AI-powered teammate for monitoring application and infrastructure logs, diagnosing issues, and reporting diagnostics after errors. With the new CLI, it’s easier than ever to connect the agent to your stack and start focusing on building instead of firefighting.
We set out to learn best practices for running AI agents in production, including costs, security, and performance. Our journey is open source; check out our Production Journey Page and Agent Architecture Page for the full story.
We've been writing blogs and sharing our learnings along the way. Check out our blog for insights and updates.
Contributions welcome! Join us and help shape the future of AI-powered SRE.
- 🕵️‍♂️ Root Cause Debugging – Finds the real reason behind app and system errors
- 📜 Kubernetes Logs – Queries your cluster for logs and info
- 🔍 GitHub Search – Digs through your codebase for bugs
- 🚦 CLI Powered – Interact with the agent directly from your terminal, with guided setup and zero manual image building required. Run diagnostics, manage integrations, and get insights without leaving the CLI.
Powered by the Model Context Protocol (MCP) for seamless LLM-to-tool connectivity.
The SRE Agent currently supports:
- Models: e.g. "claude-4-0-sonnet-latest"
- Setup: Requires `ANTHROPIC_API_KEY`
- Python 3.12 or higher
- An app deployed on AWS EKS (Elastic Kubernetes Service)
- Anthropic API key
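Before installing, it can help to confirm the prerequisites from your shell. This is a minimal sketch; the key value below is a placeholder, and note that the CLI will also prompt you for the key during guided setup:

```shell
# Confirm the Python version meets the 3.12+ requirement
python3 --version

# Have your Anthropic API key ready in the environment
# (placeholder value shown; substitute your real key)
export ANTHROPIC_API_KEY="sk-ant-your-key-here"
```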
```shell
pip install sre-agent
sre-agent
```

This is what you’ll see when the agent starts up for the first time.
Start by configuring your AWS credentials so the agent can access the cluster where your app is deployed.
From your AWS portal, click Access keys:
Copy the credentials shown under Option 2 and paste them into the CLI.
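For reference, the credentials under "Option 2" in the AWS access portal are typically short-lived environment-variable exports. They look roughly like this (all values below are placeholders; paste your real ones):

```shell
# Placeholder values; copy the real exports from the AWS portal ("Option 2")
export AWS_ACCESS_KEY_ID="ASIAXXXXXXXXXXXXXXXX"
export AWS_SECRET_ACCESS_KEY="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
export AWS_SESSION_TOKEN="xxxxxxxxxxxxxxxxxxxx"
```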
Next, enter your cluster name. This should be the cluster where your app is deployed and where you want to monitor your deployments.
The agent will then test the connection to the cluster using the credentials you provided.
After that, select the specific services you want to monitor. You can choose by index (for example, [2,6,7]) or select all of them.
Next, configure GitHub access using a Personal Access Token (PAT). This allows the agent to read your repository and inspect the code when diagnosing issues.
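If you want the token on hand before the guided step, a PAT with read access to the relevant repository should be enough for diagnosis. The variable name below is illustrative only; the CLI prompt tells you exactly where to paste the token:

```shell
# Illustrative placeholder; generate a PAT under GitHub Settings ->
# Developer settings -> Personal access tokens, with read access to your repo
export GITHUB_PERSONAL_ACCESS_TOKEN="github_pat_XXXXXXXX"
```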
Follow the guided step in the CLI—it’s straightforward:
Finally, provide your Anthropic API key, which will be used as the model provider powering the agent.
You’re now inside the sre-agent CLI and ready to run diagnostics.
For example, if your cluster has a service named currencyservice, you can run:
```shell
diagnose currencyservice
```

When the diagnosis completes, you should see the result inside the CLI.
To exit the agent, just run the `exit` command.
You can use the config command to set up options such as the cluster name, GitHub settings, and model providers. It also lets you enable additional add-ons, like sending diagnostic results to Slack or activating the Llama Firewall.
📦 Development Workflow
This is a uv workspace with multiple Python services and TypeScript MCP servers:
- `sre_agent/client/`: FastAPI orchestrator (Python)
- `sre_agent/llm/`: LLM service with multi-provider support (Python)
- `sre_agent/firewall/`: Llama Prompt Guard security layer (Python)
- `sre_agent/servers/mcp-server-kubernetes/`: Kubernetes operations (TypeScript)
- `sre_agent/servers/github/`: GitHub API integration (TypeScript)
- `sre_agent/servers/slack/`: Slack notifications (TypeScript)
- `sre_agent/servers/prompt_server/`: Structured prompts (Python)
- `sre_agent/cli/`: The Python CLI that powers the agent
```shell
make project-setup   # Install uv, create venv, install pre-commit hooks
make check           # Run linting, pre-commit hooks, and lock file check
make tests           # Run pytest with coverage
make license-check   # Verify dependency licenses
```

```shell
# Kubernetes MCP server
cd sre_agent/servers/mcp-server-kubernetes
npm run build && npm run test

# GitHub/Slack MCP servers
cd sre_agent/servers/github  # or /slack
npm run build && npm run watch
```

At a high level, there are two main parts you can work on:
- The CLI, which you can think of as the “front end.”
- The agents/MCP servers, which run in the background.
If you want to work on the CLI, you can install and run it locally with:
```shell
source .venv/bin/activate && pip install -e .
```

If you’re working on the MCP servers, you’ll need to rebuild the Docker images for any server you modify.
We provide two Compose files:
- `compose.agent.yaml`: uses images hosted on GHCR
- `compose.dev.yaml`: uses images built locally on your machine
To test local changes, start the sre-agent with the `--dev` flag:

```shell
sre-agent --dev
```

This will start the agent using the `compose.dev.yaml` file.
Find all the docs you need in the `docs` folder.
Big thanks to:
- Suyog Sonwalkar for the Kubernetes MCP server
- Anthropic's Model Context Protocol team for the Slack and GitHub MCP servers
Check out our blog posts for insights and updates.