6 changes: 6 additions & 0 deletions examples/realtime-agents/.env.example
@@ -0,0 +1,6 @@
CF_ACCOUNT_ID=
CF_API_TOKEN=
DEEPGRAM_API_KEY=
ELEVENLABS_API_KEY=
RTK_MEETING_ID=
RTK_AUTH_TOKEN=
198 changes: 198 additions & 0 deletions examples/realtime-agents/README.md
@@ -0,0 +1,198 @@
# Realtime Voice Assistant Agent

This example demonstrates how to build a complete voice assistant using Cloudflare's Agents SDK with realtime capabilities. The assistant can:

- Listen to audio input via RealtimeKit
- Convert speech to text using Deepgram STT
- Process conversations with intelligent responses
- Convert responses back to speech using ElevenLabs TTS
- Stream audio output back to the client

## Architecture

The voice assistant uses a pipeline architecture:

```
Audio Input → RealtimeKit → Deepgram STT → Agent Logic → ElevenLabs TTS → Audio Output
```

## Setup

1. **Environment Variables**: Configure the following in your `wrangler.toml` (for local development, you can instead copy `.env.example` to `.env`):

```toml
[vars]
CF_ACCOUNT_ID = "your-cloudflare-account-id"
CF_API_TOKEN = "your-cloudflare-api-token"
DEEPGRAM_API_KEY = "your-deepgram-api-key"
ELEVENLABS_API_KEY = "your-elevenlabs-api-key"
RTK_MEETING_ID = "your-realtimekit-meeting-id" # Optional
RTK_AUTH_TOKEN = "your-realtimekit-auth-token" # Optional
```

In production, store the API keys as secrets (`wrangler secret put CF_API_TOKEN`, and so on) rather than committing them to `[vars]`.

2. **API Keys**:
- Get a Deepgram API key from [https://deepgram.com](https://deepgram.com)
- Get an ElevenLabs API key from [https://elevenlabs.io](https://elevenlabs.io)
- Get your Cloudflare Account ID and API token from the Cloudflare dashboard

3. **Deploy**:

```bash
npm run dev # For local development
wrangler deploy # For production deployment
```

## Usage

Once deployed, the agent accepts WebSocket connections for real-time voice interaction.

### Basic Flow:

1. Client connects to the agent's WebSocket endpoint (see the client sketch below)
2. Agent initializes the realtime pipeline
3. Client streams audio → Agent processes → Agent streams audio back
4. Agent handles conversation logic in `onRealtimeTranscript()` method
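
A minimal client connection might look like the sketch below. The endpoint path is an assumption based on `routeAgentRequest`'s default routing (`/agents/<agent-name>/<instance-name>`); verify it against your deployment, and note that the audio itself travels through the RealtimeKit meeting configured in `RealtimeKitTransport`.

```typescript
// Sketch only: the /agents/... path assumes routeAgentRequest's default
// routing, and <your-worker> is a placeholder for your Workers hostname.
const ws = new WebSocket(
  "wss://<your-worker>.workers.dev/agents/realtime-voice-agent/my-session"
);

ws.addEventListener("open", () => {
  console.log("Connected to the voice agent");
});

ws.addEventListener("message", (event) => {
  // Agent events and state updates arrive here; audio flows via RealtimeKit
  console.log("Agent event:", event.data);
});
```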

### Customization:

- Modify the `onRealtimeTranscript()` method to add your own conversational AI logic (see the sketch after this list)
- Integrate with Workers AI, OpenAI, Anthropic, or other language models
- Add knowledge base queries, tool calling, or context management
- Customize voice settings in ElevenLabsTTS configuration
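
As a sketch of the first two points, here is `onRealtimeTranscript()` delegating to a language model via Workers AI. The `AI` binding and model name are illustrative assumptions (you would add `AI: Ai` to `Env` and an `[ai]` binding in `wrangler.toml`); any chat-completion API slots in the same way.

```typescript
// Sketch: hand transcripts to an LLM instead of canned responses.
// Assumes an `AI` binding on Env — adjust for your provider of choice.
async onRealtimeTranscript(
  text: string,
  reply: (text: string | ReadableStream<Uint8Array>) => void
): Promise<void> {
  const result = (await this.env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
    messages: [
      { role: "system", content: "You are a concise, friendly voice assistant." },
      { role: "user", content: text }
    ]
  })) as { response?: string };

  // Text-generation models on Workers AI return { response: string }
  reply(result.response ?? "Sorry, I didn't catch that.");
}
```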

## Key Components

### RealtimeVoiceAgent

- Extends the `Agent` class and runs as a Durable Object
- Implements `onRealtimeTranscript()` for conversation handling
- Wires up the pipeline in `realtimePipelineComponents` and tears it down in `cleanup()`

### Worker Entry Point

- Routes incoming requests to the agent Durable Object via `routeAgentRequest`
- Exposes a `/health` endpoint for smoke testing
- Returns structured JSON errors when request handling fails

### Pipeline Components:

- **RealtimeKitTransport**: Audio input/output via RealtimeKit
- **DeepgramSTT**: Speech-to-text conversion
- **ElevenLabsTTS**: Text-to-speech synthesis

## Pipeline Configuration

The agent wires up its pipeline in the `realtimePipelineComponents` property:

```typescript
realtimePipelineComponents = () => {
  const rtk = new RealtimeKitTransport(
    this.env.RTK_MEETING_ID || "default-meeting",
    this.env.RTK_AUTH_TOKEN || "default-token",
    [
      {
        media_kind: "audio",
        stream_kind: "microphone",
        preset_name: "*"
      }
    ]
  );

  const stt = new DeepgramSTT(this.env.DEEPGRAM_API_KEY);
  const tts = new ElevenLabsTTS(this.env.ELEVENLABS_API_KEY);

  // Pipeline: Audio Input → STT → Agent → TTS → Audio Output
  return [rtk, stt, this, tts, rtk];
};
```

Note that `rtk` appears twice: it is the audio source at the head of the pipeline and the audio sink at its tail.

### Pipeline Flow

1. **Audio Input**: RealtimeKit captures microphone audio
2. **Speech Recognition**: Deepgram converts audio to text
3. **Agent Processing**: Your agent receives transcribed text via `onRealtimeTranscript()`
4. **Response Generation**: Agent generates text response
5. **Speech Synthesis**: ElevenLabs converts response to audio
6. **Audio Output**: RealtimeKit streams audio back to client

### Customizing the Pipeline

You can swap or extend the components returned by `realtimePipelineComponents`:

```typescript
// Different STT provider
const stt = new CustomSTT(this.env.CUSTOM_API_KEY);

// Alternate TTS voices — swap tts2 into the returned array to change voices
const tts1 = new ElevenLabsTTS(this.env.ELEVENLABS_KEY, { voice_id: "voice1" });
const tts2 = new ElevenLabsTTS(this.env.ELEVENLABS_KEY, { voice_id: "voice2" });

// Audio preprocessing
const processor = new AudioProcessor();

return [rtk, processor, stt, this, tts1, rtk];
```

## Implementation Details

The Agent class implements the `RealtimePipelineComponent` interface, allowing it to be used directly in realtime pipelines:

```typescript
class RealtimeVoiceAgent extends Agent<Env> {
  realtimePipelineComponents = () => {
    const rtk = new RealtimeKitTransport(...);
    const stt = new DeepgramSTT(...);
    const tts = new ElevenLabsTTS(...);

    // Use 'this' to include the agent in the pipeline
    return [rtk, stt, this, tts, rtk];
  };

  // This method receives transcribed text
  async onRealtimeTranscript(
    text: string,
    reply: (text: string | ReadableStream<Uint8Array>) => void
  ): Promise<void> {
    // Your conversation logic here (processConversation is a placeholder)
    const response = processConversation(text);
    reply(response);
  }
}
```

**Key Features:**

- ✅ **Direct agent integration** - Use `this` to include your agent in the pipeline
- ✅ **Type safety** - Full TypeScript support for pipeline components
- ✅ **Flexible positioning** - Place the agent anywhere in the processing flow
- ✅ **Clean separation** - Clear distinction between pipeline setup and conversation logic

## Examples

The current implementation includes basic conversational responses like:

- Greetings and farewells
- Time and date queries
- Simple jokes
- Help information

You can extend this by integrating with:

- OpenAI GPT models for advanced conversations
- Knowledge bases for domain-specific responses
- Weather APIs, calendars, or other external services (see the sketch after this list)
- Custom business logic and workflows
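
For instance, the canned weather branch in `onRealtimeTranscript()` could call a live API. This sketch uses the free Open-Meteo endpoint with hard-coded coordinates; both the provider and the location are illustrative choices, not part of this example.

```typescript
// Sketch: replace the canned weather reply with a live lookup.
// Open-Meteo and the fixed London coordinates are illustrative only.
if (lowerText.includes("weather")) {
  const res = await fetch(
    "https://api.open-meteo.com/v1/forecast?latitude=51.5&longitude=-0.12&current_weather=true"
  );
  const data = (await res.json()) as {
    current_weather?: { temperature: number; windspeed: number };
  };
  response = data.current_weather
    ? `It's currently ${data.current_weather.temperature}°C with winds of ${data.current_weather.windspeed} km/h.`
    : "I couldn't reach the weather service just now.";
}
```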

## Development

Run locally:

```bash
npm run dev
```

The agent will be available at the WebSocket endpoint provided by Wrangler.

## Troubleshooting

- Ensure all API keys are properly configured
- Check that your Cloudflare account ID and API token have the required permissions
- Verify the RealtimeKit meeting ID and auth token
- Request the `/health` endpoint to confirm the Worker itself is up
- Monitor logs (`wrangler tail`) for pipeline initialization errors
11 changes: 11 additions & 0 deletions examples/realtime-agents/package.json
@@ -0,0 +1,11 @@
{
"name": "@cloudflare/realtime-agents-example",
"author": "Manish",
"keywords": [],
"private": true,
"scripts": {
"dev": "wrangler dev",
"types": "wrangler types"
},
"type": "module"
}
161 changes: 161 additions & 0 deletions examples/realtime-agents/src/index.ts
@@ -0,0 +1,161 @@
import {
RealtimeKitTransport,
DeepgramSTT,
ElevenLabsTTS
} from "agents/realtime";
import { Agent, routeAgentRequest } from "agents";

// Environment interface for required secrets and configuration
interface Env {
// Cloudflare credentials
CF_ACCOUNT_ID: string;
CF_API_TOKEN: string;

// Third-party API keys
DEEPGRAM_API_KEY: string;
ELEVENLABS_API_KEY: string;

// RealtimeKit meeting configuration
RTK_MEETING_ID?: string;
RTK_AUTH_TOKEN?: string;

// Durable Object binding
REALTIME_VOICE_AGENT: DurableObjectNamespace;
}

export class RealtimeVoiceAgent extends Agent<Env> {
realtimePipelineComponents = () => {
// RealtimeKit transport for audio I/O
const rtk = new RealtimeKitTransport(
this.env.RTK_MEETING_ID || "default-meeting",
this.env.RTK_AUTH_TOKEN || "default-token",
[
{
media_kind: "audio",
stream_kind: "microphone",
preset_name: "*"
}
]
);

// Deepgram for speech-to-text (Audio → Text)
const stt = new DeepgramSTT(this.env.DEEPGRAM_API_KEY);

// ElevenLabs for text-to-speech (Text → Audio)
const tts = new ElevenLabsTTS(this.env.ELEVENLABS_API_KEY);

return [rtk, stt, this, tts, rtk];
};

/**
* Handle incoming transcribed text and generate intelligent responses
* This is where you implement your AI logic, knowledge retrieval, etc.
*/
async onRealtimeTranscript(
text: string,
reply: (text: string | ReadableStream<Uint8Array>) => void
): Promise<void> {
console.log(`Received transcript: ${text}`);

// Simple response logic - you can enhance this with:
// - Integration with language models (OpenAI, Anthropic, etc.)
// - Knowledge base queries
// - Context management
// - Intent recognition
// - Tool calling

let response = "";

// Basic conversational responses
const lowerText = text.toLowerCase().trim();

if (lowerText.includes("hello") || lowerText.includes("hi")) {
response = "Hello! I'm your voice assistant. How can I help you today?";
} else if (lowerText.includes("time")) {
const now = new Date();
response = `The current time is ${now.toLocaleTimeString()}.`;
} else if (lowerText.includes("date")) {
const now = new Date();
response = `Today's date is ${now.toLocaleDateString()}.`;
} else if (lowerText.includes("weather")) {
response =
"I'd love to help with weather information, but I don't have access to weather data right now. You could integrate a weather API for real weather updates!";
} else if (lowerText.includes("joke")) {
const jokes = [
"Why don't scientists trust atoms? Because they make up everything!",
"Why did the scarecrow win an award? He was outstanding in his field!",
"What do you call a fake noodle? An impasta!"
];
response = jokes[Math.floor(Math.random() * jokes.length)];
} else if (
lowerText.includes("help") ||
lowerText.includes("what can you do")
) {
response =
"I can help you with basic conversations, tell you the time and date, share jokes, and more. Try asking me about the weather or saying hello!";
} else if (lowerText.includes("goodbye") || lowerText.includes("bye")) {
response = "Goodbye! It was nice talking with you.";
} else {
// Default response for unrecognized input
response = `You said: "${text}". I'm still learning how to respond to that. Try asking about the time, weather, or say hello!`;
}

// Send the response back through the pipeline
reply(response);
}

/**
* Cleanup resources when the agent is no longer needed
*/
async cleanup(): Promise<void> {
try {
if (this.realtimePipelineRunning) {
await this.stopRealtimePipeline();
console.log("Agent stopped successfully");
}
} catch (error) {
console.error("Error during cleanup:", error);
}
}
}

/**
* Worker fetch handler - routes requests to the appropriate Durable Object
*/
export default {
async fetch(request: Request, env: Env): Promise<Response> {
try {
const url = new URL(request.url);

// Health check endpoint
if (url.pathname === "/health") {
return new Response(
JSON.stringify({
status: "healthy",
timestamp: new Date().toISOString(),
service: "realtime-voice-assistant"
}),
{
headers: { "Content-Type": "application/json" }
}
);
}

// Forward the request to the Durable Object
const response = await routeAgentRequest(request, env);
return response || new Response("Not found", { status: 404 });
} catch (error) {
console.error("Worker fetch error:", error);
return new Response(
JSON.stringify({
error: "Internal server error",
message: error instanceof Error ? error.message : "Unknown error"
}),
{
status: 500,
headers: { "Content-Type": "application/json" }
}
);
}
}
};
3 changes: 3 additions & 0 deletions examples/realtime-agents/tsconfig.json
@@ -0,0 +1,3 @@
{
"extends": "../../tsconfig.base.json"
}