223 changes: 223 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,223 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

SRE Agent is an AI-powered Site Reliability Engineering assistant that automates debugging, monitors application/infrastructure logs, diagnoses issues, and reports diagnostics. It integrates with Kubernetes clusters, GitHub repositories, and Slack for comprehensive incident response automation.

## Architecture

### Microservices Design
The system uses a microservices architecture with the following components:

- **Orchestrator (Client)**: FastAPI-based MCP client (`sre_agent/client/`) that coordinates all services and handles incoming diagnostic requests
- **LLM Server**: Text generation service (`sre_agent/llm/`) supporting multiple AI providers (Anthropic, OpenAI, Gemini, Ollama)
- **Llama Firewall**: Security layer (`sre_agent/firewall/`) using Meta's Llama Prompt Guard for content validation
- **MCP Servers**:
- Kubernetes MCP (`sre_agent/servers/mcp-server-kubernetes/`) - TypeScript/Node.js K8s operations
- GitHub MCP (`sre_agent/servers/github/`) - TypeScript/Node.js repository operations
- Slack MCP (`sre_agent/servers/slack/`) - TypeScript/Node.js team notifications
- Prompt Server MCP (`sre_agent/servers/prompt_server/`) - Python structured prompts

### Key Technologies
- **Languages**: Python 3.12+ (core services), TypeScript/Node.js (MCP servers)
- **Communication**: Model Context Protocol (MCP) with Server-Sent Events (SSE) transport
- **Infrastructure**: Docker Compose, AWS EKS deployment, GCP GKE deployment
- **AI/ML**: Multiple LLM providers, Hugging Face transformers

### LLM Provider Support
- **Anthropic**: Claude models (API key required)
- **Google Gemini**: Gemini models (API key required)
- **Ollama**: Local LLM inference (no API key, privacy-focused)
- **OpenAI**: Placeholder (not yet implemented)
- **Self-hosted**: Placeholder (not yet implemented)
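
Two of the providers above are placeholders. A hedged sketch of how a new provider client might slot in, mirroring the Provider-to-client registry in `sre_agent/llm/main.py`; the method name and request shape below are assumptions, not the actual `BaseClient` interface:

```python
from dataclasses import dataclass


@dataclass
class GenerationRequest:
    """Illustrative request shape; the real schemas live in sre_agent/shared."""

    prompt: str
    max_tokens: int


class MyProviderClient:
    """Hypothetical client; the real ones implement BaseClient in sre_agent/llm."""

    def generate(self, request: GenerationRequest) -> str:
        # Call the provider's API here and return the generated text.
        raise NotImplementedError("wire up the provider SDK")


# The new client would then be registered in the Provider -> client mapping,
# just as OllamaClient is added in sre_agent/llm/main.py.
```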

## Common Development Commands

### Project Setup
```bash
make project-setup # Install uv, create venv, install pre-commit hooks
```

### Code Quality
```bash
make check # Run linting, pre-commit hooks, and lock file check
make tests # Run pytest with coverage
make license-check # Verify dependency licences
```

### Service Management
```bash
# Local development - AWS
docker compose -f compose.aws.yaml up --build

# Local development - GCP
docker compose -f compose.gcp.yaml up --build

# Production with ECR images
docker compose -f compose.ecr.yaml up

# Production with GAR images (Google)
docker compose -f compose.gar.yaml up

# Test environment
docker compose -f compose.tests.yaml up
```

### Testing
```bash
# All tests
make tests

# Specific test file
uv run python -m pytest tests/unit_tests/test_adapters.py

# Specific test function
uv run python -m pytest tests/unit_tests/test_adapters.py::test_specific_function

# With coverage
uv run python -m pytest --cov --cov-config=pyproject.toml --cov-report=xml

# Security tests only
uv run python -m pytest tests/security_tests/
```

## Configuration

### Environment Variables Required
- `DEV_BEARER_TOKEN`: API authentication for the orchestrator
- `ANTHROPIC_API_KEY`: Claude API access (for Anthropic models)
- `GEMINI_API_KEY`: Google Gemini API access (for Gemini models)
- `OLLAMA_API_URL`: Ollama API endpoint (for local LLM inference, default: http://localhost:11434)
- `GITHUB_PERSONAL_ACCESS_TOKEN`: GitHub integration
- `SLACK_BOT_TOKEN`, `SLACK_TEAM_ID`, `CHANNEL_ID`: Slack notifications
- `AWS_REGION`, `TARGET_EKS_CLUSTER_NAME`: AWS EKS cluster access
- `GCP_PROJECT_ID`, `TARGET_GKE_CLUSTER_NAME`, `GKE_ZONE`: GCP GKE cluster access
- `HF_TOKEN`: Hugging Face model access

### Cloud Platform Setup
- **AWS**: Credentials must be available at `~/.aws/credentials` for EKS cluster access
- **GCP**: Use `gcloud auth login` and `gcloud config set project YOUR_PROJECT_ID` for GKE access

### Ollama Setup (Local LLM)
- **Install**: Visit [ollama.ai](https://ollama.ai) and follow installation instructions
- **Start**: Run `ollama serve` in your terminal
- **Models**: Download models like `ollama pull llama3.1`
- **Benefits**: Privacy-focused, no API costs, offline capable
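
To confirm the local Ollama instance is reachable before pointing the agent at it, a minimal check using only the standard library (Ollama's `/api/tags` endpoint lists locally pulled models):

```python
import json
import urllib.request

# /api/tags lists the models that have been pulled locally.
with urllib.request.urlopen("http://localhost:11434/api/tags", timeout=5) as resp:
    data = json.load(resp)

print([model["name"] for model in data.get("models", [])])
```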

### Credential Setup Script
Use the interactive setup script for easy configuration:
```bash
python setup_credentials.py
# or with platform selection
python setup_credentials.py --platform aws
python setup_credentials.py --platform gcp
```

## Service Architecture Details

### Communication Flow
1. Orchestrator receives `/diagnose` requests on port 8003
2. Requests pass through Llama Firewall for security validation
3. LLM Server processes AI reasoning (using Anthropic, Gemini, or Ollama)
4. MCP servers handle tool operations (K8s, GitHub, Slack)
5. Results are reported back via Slack notifications (sketched below)
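
A runnable, heavily simplified sketch of that flow. Every function below is a stub standing in for a real service, and none of the names come from the actual codebase:

```python
from dataclasses import dataclass


@dataclass
class Finding:
    summary: str


def llama_firewall_check(text: str) -> bool:
    """Stand-in for the Llama Firewall validation call (step 2)."""
    return "ignore previous instructions" not in text.lower()


def llm_reason(text: str) -> str:
    """Stand-in for the LLM Server (step 3: Anthropic, Gemini or Ollama)."""
    return f"Check recent pod restarts and error logs for '{text}'"


def run_mcp_tools(plan: str) -> Finding:
    """Stand-in for the Kubernetes/GitHub MCP tool calls (step 4)."""
    return Finding(summary=f"Executed plan: {plan}")


def post_to_slack(finding: Finding) -> None:
    """Stand-in for the Slack MCP notification (step 5)."""
    print(f"[slack] {finding.summary}")


def diagnose(service_name: str) -> None:
    """Walk a request (step 1: POST /diagnose on port 8003) through the flow."""
    if not llama_firewall_check(service_name):
        raise PermissionError("Blocked by Llama Firewall")
    plan = llm_reason(service_name)
    finding = run_mcp_tools(plan)
    post_to_slack(finding)


if __name__ == "__main__":
    diagnose("cart-service")  # "cart-service" is a placeholder service name
```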

### Health Checks
All services implement health monitoring accessible via `/health` endpoints.
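
A small polling sketch for local development. Only the orchestrator's port (8003) is documented here, so any other entries added to the mapping are assumptions about your own compose setup:

```python
import urllib.error
import urllib.request

# Only the orchestrator's port is documented; extend with your own services.
SERVICES = {"orchestrator": "http://localhost:8003/health"}

for name, url in SERVICES.items():
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            print(f"{name}: HTTP {resp.status}")
    except urllib.error.URLError as exc:
        print(f"{name}: unreachable ({exc.reason})")
```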

## Development Patterns

### MCP Integration
All external tool interactions use the Model Context Protocol standard. When adding new tools:
- Follow existing MCP server patterns in `sre_agent/servers/`
- Implement SSE transport for real-time communication
- Add health check endpoints
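
A minimal Python sketch following those three points, assuming the official `mcp` SDK's `FastMCP` helper; the tool is a stub, and a `/health` route would still need wiring into the server's underlying ASGI app:

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("example-tools")


@mcp.tool()
def get_recent_errors(service: str, limit: int = 20) -> str:
    """Return recent error lines for a service (illustrative stub)."""
    return f"No errors found for {service} (stub, limit={limit})"


if __name__ == "__main__":
    # SSE transport matches the existing MCP servers in this repository.
    mcp.run(transport="sse")
```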

### Security Considerations
- All requests pass through Llama Firewall validation
- Bearer token authentication required for API access
- Input validation at multiple service layers
- No secrets in code - use environment variables

**IMPORTANT: Never commit the .env file!**
- The `.env` file contains sensitive credentials (API keys, tokens, secrets)
- It is included in `.gitignore` and should never be committed to the repository
- Use `python setup_credentials.py` to generate the `.env` file locally
- Each developer/environment needs their own `.env` file with appropriate credentials
- For production deployments, use proper secret management (AWS Secrets Manager, K8s secrets, etc.)
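
In keeping with the no-secrets-in-code rule, a minimal illustration of reading a credential from the environment and failing fast when it is missing (not the repository's actual configuration code):

```python
import os


def require_env(name: str) -> str:
    """Fetch a required credential from the environment, never from source."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set - run setup_credentials.py first")
    return value


anthropic_api_key = require_env("ANTHROPIC_API_KEY")
```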

### Code Style
- **Language**: Use British English spelling throughout (e.g., "specialised", "organised", "recognised")
- **Python**: Uses ruff, black, mypy for formatting and type checking
- **TypeScript**: Standard TypeScript/Node.js conventions
- **Line length**: 88 characters
- **Docstrings**: Google-style docstrings for Python
- **Type checking**: Strict type checking enabled

### British English Spelling Guidelines
The project uses British English spelling. Common differences from American English:
- **-ise/-ize**: Use "-ise" endings (e.g., "organise", "recognise", "specialise")
- **-our/-or**: Use "-our" endings (e.g., "colour", "honour", "behaviour")
- **-re/-er**: Use "-re" endings (e.g., "centre", "metre", "theatre")
- **-ence/-ense**: Use "-ence" endings (e.g., "defence", "licence" as noun)
- **-yse/-yze**: Use "-yse" endings (e.g., "analyse", "paralyse")

**Examples in SRE context:**
- "optimise" (not "optimize")
- "customise" (not "customize")
- "analyse logs" (not "analyze logs")
- "centralised monitoring" (not "centralized monitoring")
- "behaviour analysis" (not "behavior analysis")

## Workspace Structure
This is a uv workspace with members:
- `sre_agent/llm`: LLM service with multi-provider support
- `sre_agent/client`: FastAPI orchestrator service
- `sre_agent/servers/prompt_server`: Python MCP server for structured prompts
- `sre_agent/firewall`: Llama Prompt Guard security layer
- `sre_agent/shared`: Shared utilities and schemas

Each Python service has its own `pyproject.toml`. TypeScript MCP servers use `package.json`:
- `sre_agent/servers/mcp-server-kubernetes/`: Kubernetes operations (Node.js/TypeScript)
- `sre_agent/servers/github/`: GitHub API integration (Node.js/TypeScript)
- `sre_agent/servers/slack/`: Slack notifications (Node.js/TypeScript)

## API Usage

### Primary Endpoint
```bash
curl -X POST http://localhost:8003/diagnose \
  -H "Authorization: Bearer <DEV_BEARER_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{"text": "<service_name>"}'
```
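
The same request from Python, useful when scripting the agent. A standard-library sketch that makes no assumption about the response body's shape:

```python
import json
import os
import urllib.request

payload = json.dumps({"text": "cart-service"}).encode()  # placeholder service name
request = urllib.request.Request(
    "http://localhost:8003/diagnose",
    data=payload,
    headers={
        "Authorization": f"Bearer {os.environ['DEV_BEARER_TOKEN']}",
        "Content-Type": "application/json",
    },
    method="POST",
)
with urllib.request.urlopen(request) as resp:
    print(resp.status, resp.read().decode())
```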

### Health Check
```bash
curl http://localhost:8003/health
```

## Deployment
- **Local**: Docker Compose with local builds (AWS: `compose.aws.yaml`, GCP: `compose.gcp.yaml`)
- **Production AWS**: ECR-based images on AWS EKS (`compose.ecr.yaml`)
- **Production GCP**: GAR-based images on GCP GKE (`compose.gar.yaml`)
- See [EKS Deployment](https://github.com/fuzzylabs/sre-agent-deployment) for cloud deployment examples

## TypeScript MCP Server Development
For TypeScript MCP servers in `sre_agent/servers/`:

### Building and Testing
```bash
# Kubernetes MCP server
cd sre_agent/servers/mcp-server-kubernetes
npm run build # Build TypeScript
npm run test # Run vitest tests
npm run dev # Watch mode

# GitHub/Slack MCP servers
cd sre_agent/servers/github # or /slack
npm run build
npm run watch # Watch mode
```
38 changes: 35 additions & 3 deletions README.md
@@ -33,19 +33,51 @@ We've been writing blogs and sharing our learnings along the way. Check out our
The SRE Agent supports the following LLM providers:

### Anthropic
- **Models**: e.g. "claude-4-0-sonnet-latest"
- **Models**: e.g. "claude-3-5-sonnet-latest"
- **Setup**: Requires `ANTHROPIC_API_KEY`

### Google Gemini
- **Models**: e.g, "gemini-2.5-flash"
- **Models**: e.g. "gemini-2.5-flash"
- **Setup**: Requires `GEMINI_API_KEY`

### Ollama (Local)
- **Models**: e.g. "llama3.1", "mistral", "codellama"
- **Setup**: Install Ollama locally, no API key needed
- **Benefits**: Privacy, no API costs, offline capable

<details>
<summary>πŸ¦™ Ollama Setup Guide</summary>

### Installing Ollama
1. **Install Ollama**: Visit [ollama.ai](https://ollama.ai) and follow installation instructions
2. **Start Ollama**: Run `ollama serve` in your terminal
3. **Pull a model**: Download a model like `ollama pull llama3.1`

### Recommended Models for SRE Tasks
- **llama3.1** (8B): Fast, good general reasoning
- **mistral** (7B): Excellent for technical tasks
- **codellama** (7B): Specialised for code analysis
- **llama3.1:70b**: Most capable but requires more resources

### Configuration
Set these in your `.env` file:
```bash
PROVIDER=ollama
MODEL=llama3.1
OLLAMA_API_URL=http://localhost:11434 # default
```

</details>


## πŸ› οΈ Prerequisites

- [Docker](https://docs.docker.com/get-docker/)
- A `.env` file in your project root ([see below](#getting-started))
- An app deployed on AWS EKS (Elastic Kubernetes Service) or GCP GKE (Google Kubernetes Engine)
- A Kubernetes cluster:
- **Cloud**: AWS EKS, GCP GKE
- **Local**: minikube, Docker Desktop, kind, k3s
- For Ollama: Local installation ([see Ollama Setup Guide](#ollama-setup-guide))

## ⚑ Getting Started

9 changes: 8 additions & 1 deletion setup_credentials.py
@@ -82,13 +82,20 @@ def get_credential_config(platform: str) -> dict[str, dict[str, Any]]:
"prompt": "Enter your Github project root directory: ",
"mask_value": False,
},
"PROVIDER": {"prompt": "Enter your LLM provider name: ", "mask_value": False},
"PROVIDER": {
"prompt": "Enter your LLM provider name (anthropic/gemini/ollama): ",
"mask_value": False,
},
"MODEL": {"prompt": "Enter your LLM model name: ", "mask_value": False},
"GEMINI_API_KEY": {"prompt": "Enter your Gemini API Key: ", "mask_value": True},
"ANTHROPIC_API_KEY": {
"prompt": "Enter your Anthropic API Key: ",
"mask_value": True,
},
"OLLAMA_API_URL": {
"prompt": "Enter your Ollama API URL (default: http://localhost:11434): ",
"mask_value": False,
},
"MAX_TOKENS": {
"prompt": "Controls the maximum number of tokens the LLM can generate in "
"its response e.g. 10000: ",
2 changes: 2 additions & 0 deletions sre_agent/llm/main.py
@@ -13,6 +13,7 @@
BaseClient,
DummyClient,
GeminiClient,
OllamaClient,
OpenAIClient,
SelfHostedClient,
)
@@ -32,6 +33,7 @@
Provider.MOCK: DummyClient(),
Provider.OPENAI: OpenAIClient(),
Provider.GEMINI: GeminiClient(),
Provider.OLLAMA: OllamaClient(),
Provider.SELF_HOSTED: SelfHostedClient(),
}
