Orchestrator is a real-time intelligent conversation system for building personalized multimodal AI interaction workflows, covering speech recognition (ASR), text conversation (LLM), text-to-speech (TTS), emotion analysis (Classification & Reaction), memory management (Memory), and 3D animation generation (Audio2Face & Speech2Motion). Its modular design supports multiple AI service providers and offers streaming processing with complete conversation management.
Main application scenarios: personalized role-playing, customized virtual companions, education and training, intelligent customer service, office assistants, etc.
- Multimodal Interaction: Voice interaction, text conversation, 3D animation generation
- Real-time Streaming Processing: Streams data end to end for low-latency responses
- Multi-AI Service Provider Support: Integration with mainstream AI services including SenseNova, OpenAI, Anthropic, Gemini, xAI, DeepSeek, ElevenLabs, Volcano Engine, etc.
- Intelligent Memory Management: Multi-level conversation memory, relationship status, and emotional state management
- Emotional Intelligence Analysis: Real-time analysis of character emotional changes, relationship changes, and triggered actions
- Highly Scalable Architecture: Modular design, easy to add new AI services and custom features
- Character Customization: Custom character personalities, voices, emotions, and actions
- Interaction Customization: Flexible configuration of conversation modes, reaction mechanisms, and memory management
- Service Combination: Support for combining multiple AI service providers, flexible selection based on scenario requirements
```
orchestrator/
├── proxy.py # Core orchestrator, manages DAG workflows
├── service/ # Web service layer
│ ├── server.py # FastAPI server, provides WebSocket interface
│ ├── requests.py # Request data models
│ └── responses.py # Response data models
├── conversation/ # Conversation management module
│ ├── conversation_adapter.py # Text conversation adapter base class
│ ├── audio_conversation_adapter.py # Audio conversation adapter base class
│ ├── openai_conversation_client.py # OpenAI text conversation client
│ ├── openai_audio_client.py # OpenAI audio conversation client
│ ├── anthropic_conversation_client.py # Anthropic conversation client
│ ├── gemini_conversation_client.py # Gemini conversation client
│ ├── xai_conversation_client.py # xAI conversation client
│ ├── deepseek_conversation_client.py # DeepSeek conversation client
│ ├── sensechat_conversation_client.py # SenseChat conversation client
│ ├── sensenova_conversation_client.py # SenseNova conversation client
│ └── sensenova_omni_conversation_client.py # SenseNova real-time conversation client
├── generation/ # Generation management module
│ ├── speech_recognition/ # Speech Recognition (ASR)
│ │ ├── asr_adapter.py # ASR adapter base class
│ │ ├── huoshan_asr_client.py # Volcano Engine ASR
│ │ ├── openai_realtime_asr_client.py # OpenAI real-time ASR
│ │ ├── sensetime_asr_client.py # SenseTime ASR
│ │ └── softsugar_asr_client.py # Softsugar ASR
│ ├── text2speech/ # Text-to-Speech (TTS)
│ │ ├── tts_adapter.py # TTS adapter base class
│ │ ├── elevenlabs_tts_client.py # ElevenLabs TTS
│ │ ├── huoshan_tts_client.py # Volcano Engine TTS
│ │ ├── sensenova_tts_client.py # SenseNova TTS
│ │ ├── sensetime_tts_client.py # SenseTime TTS
│ │ └── softsugar_tts_client.py # Softsugar TTS
│ ├── speech2motion/ # Speech-to-Motion
│ │ ├── speech2motion_adapter.py # S2M adapter base class
│ │ └── speech2motion_streaming_client.py # S2M streaming client
│ └── audio2face/ # Audio-to-Face
│ ├── audio2face_adapter.py # A2F adapter base class
│ └── audio2face_streaming_client.py # A2F streaming client
├── memory/ # Memory management module
│ ├── memory_adapter.py # Memory adapter base class
│ ├── memory_manager.py # Memory manager
│ ├── memory_processor.py # Memory processor
│ ├── task_manager.py # Task manager
│ ├── openai_memory_client.py # OpenAI memory client
│ ├── xai_memory_client.py # xAI memory client
│ └── sensenova_omni_memory_client.py # SenseNova real-time memory client
├── classification/ # Classification module
│ ├── classification_adapter.py # Classification adapter base class
│ ├── sensenova_omni_classification_client.py # SenseNova real-time classification client
│ ├── openai_classification_client.py # OpenAI classification client
│ ├── gemini_classification_client.py # Gemini classification client
│ └── xai_classification_client.py # xAI classification client
├── reaction/ # Reaction module
│ ├── reaction_adapter.py # Reaction adapter base class
│ ├── sensenova_omni_reaction_client.py # SenseNova real-time reaction client
│ ├── openai_reaction_client.py # OpenAI reaction client
│ ├── gemini_reaction_client.py # Gemini reaction client
│ └── xai_reaction_client.py # xAI reaction client
├── aggregator/ # Data aggregators
│ ├── conversation_aggregator.py # Conversation aggregator
│ ├── tts_reaction_aggregator.py # TTS reaction aggregator
│ ├── blendshapes_aggregator.py # Facial expression aggregator
│ └── callback_aggregator.py # Callback aggregator
├── io/ # Data storage interfaces
│ ├── config/ # Configuration storage
│ │ ├── database_config_client.py # Database configuration client
│ │ ├── dynamodb_config_client.py # DynamoDB configuration client
│ │ └── mongodb_config_client.py # MongoDB configuration client
│ └── memory/ # Memory storage
│ ├── database_memory_client.py # Database memory client
│ ├── dynamodb_memory_client.py # DynamoDB memory client
│ └── mongodb_memory_client.py # MongoDB memory client
├── data_structures/ # Data structure definitions
└── utils/ # Utility modules
```
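Since service/server.py in the tree above exposes a WebSocket interface, a client interaction might look roughly like the sketch below. The endpoint path and message schema here are assumptions for illustration only; see the API Documentation for the real contract.

```python
import asyncio
import json

import websockets  # pip install websockets


async def main() -> None:
    # Assumed endpoint; check service/server.py for the actual route.
    async with websockets.connect("ws://localhost:8000/ws") as ws:
        # Illustrative request shape only.
        await ws.send(json.dumps({"type": "text_chat", "text": "Hello!"}))
        async for message in ws:  # stream responses until completion
            event = json.loads(message)
            print(event)
            if event.get("done"):
                break


asyncio.run(main())
```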
- Function: The conversation module handles text and audio conversations and supports multiple large language models
- Core Components:
  - ConversationAdapter: Text conversation adapter base class, handles streaming text conversations
  - AudioConversationAdapter: Audio conversation adapter base class, handles real-time voice interactions
- Supported Providers: SenseNova, OpenAI, Anthropic, Gemini, xAI, DeepSeek, etc.
- Features: Streaming output support, long context, multimodal conversations
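Adding a new text conversation provider means subclassing the ConversationAdapter listed above. A minimal sketch, assuming an OpenAI-compatible streaming API; the method name stream_chat and the constructor signature are illustrative, not the project's actual interface (see conversation/conversation_adapter.py):

```python
from typing import AsyncIterator

from openai import AsyncOpenAI  # any OpenAI-compatible endpoint


class MyProviderConversationClient:
    """Would subclass ConversationAdapter in the real project."""

    def __init__(self, api_key: str, model: str = "my-model-v1") -> None:
        self._client = AsyncOpenAI(api_key=api_key)
        self._model = model

    async def stream_chat(self, messages: list[dict]) -> AsyncIterator[str]:
        # Yield response text incrementally, mirroring the streaming
        # behavior the adapters above provide.
        stream = await self._client.chat.completions.create(
            model=self._model,
            messages=messages,
            stream=True,
        )
        async for chunk in stream:
            delta = chunk.choices[0].delta.content
            if delta:
                yield delta
```

Keeping every provider behind the same streaming interface is what lets the orchestrator swap providers per scenario.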
- Function: The text-to-speech module converts text to natural speech with multiple voices and emotional expressions
- Core Components:
  - TextToSpeechAdapter: TTS adapter base class, handles streaming audio generation
- Supported Providers: ElevenLabs, Volcano Engine, SenseNova, SenseTime, Softsugar, etc.
- Features: Multiple voices, multiple emotions, multi-language support, real-time synthesis
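The TTS adapters yield audio incrementally rather than returning one buffer. A minimal sketch of consuming such a stream, assuming 16 kHz mono 16-bit PCM chunks; the chunk format is an assumption, check the concrete client for the actual encoding:

```python
import wave
from typing import AsyncIterator


async def save_stream(chunks: AsyncIterator[bytes], path: str) -> None:
    """Collect streamed PCM chunks and write them out as a WAV file."""
    pcm = bytearray()
    async for chunk in chunks:  # chunks arrive while synthesis is still running
        pcm.extend(chunk)
    with wave.open(path, "wb") as f:
        f.setnchannels(1)      # mono
        f.setsampwidth(2)      # 16-bit samples
        f.setframerate(16000)  # 16 kHz
        f.writeframes(bytes(pcm))
```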
- Function: The speech recognition module performs real-time speech recognition in multiple languages
- Core Components:
  - ASRAdapter: ASR adapter base class, handles streaming speech recognition
- Supported Providers: OpenAI, Volcano Engine, SenseTime, Softsugar, etc.
- Features: Multi-language support, streaming recognition
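Streaming ASR clients consume small audio chunks as they arrive. A sketch of producing such chunks from a WAV file to simulate a live microphone feed; the chunk size and the adapter's ingestion method are assumptions:

```python
import wave
from typing import Iterator


def iter_audio_chunks(path: str, frames_per_chunk: int = 1600) -> Iterator[bytes]:
    """Yield ~100 ms chunks of 16 kHz mono PCM from a WAV file."""
    with wave.open(path, "rb") as f:
        while True:
            chunk = f.readframes(frames_per_chunk)
            if not chunk:
                break
            yield chunk
```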
- Function: The memory module manages multi-level conversation memory, emotional state, and relationship state
- Core Components:
  - MemoryAdapter: Memory adapter base class
  - MemoryManager: Memory manager, handles conversation history and context
  - MemoryProcessor: Memory processor, analyzes and manages memory data
- Features: Multi-level memory storage, emotional state tracking, relationship state management
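As a rough illustration of the state these components track, a hypothetical record type might look like the following; the field names are assumptions, the real schemas live in memory/ and data_structures/:

```python
from dataclasses import dataclass, field


@dataclass
class MemoryRecord:
    short_term: list[str] = field(default_factory=list)  # recent conversation turns
    long_term: list[str] = field(default_factory=list)   # distilled durable facts
    emotional_state: dict[str, float] = field(default_factory=dict)  # e.g. {"joy": 0.7}
    relationship_score: float = 0.0  # evolves across sessions
```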
- Function: The classification and reaction modules perform real-time emotion analysis, user intent classification, and reaction generation
- Core Components:
  - ClassificationAdapter: Classification adapter, analyzes user intent
  - ReactionAdapter: Reaction adapter, analyzes character emotional changes, relationship changes, and triggered actions
- Features: Real-time emotion analysis, intent classification, personalized reaction generation
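A hypothetical shape for a reaction analysis result, tying together the emotional change, relationship change, and triggered action described above; all field and key names are illustrative, not the project's actual data structures:

```python
from dataclasses import dataclass


@dataclass
class ReactionResult:
    emotion: str                  # e.g. "happy"
    emotion_delta: float          # magnitude of the emotional change
    relationship_delta: float     # shift in relationship state
    triggered_action: str | None  # e.g. "wave", or None if no action fired


def parse_reaction(payload: dict) -> ReactionResult:
    # Tolerant parsing of a provider response; keys are illustrative.
    return ReactionResult(
        emotion=payload.get("emotion", "neutral"),
        emotion_delta=float(payload.get("emotion_delta", 0.0)),
        relationship_delta=float(payload.get("relationship_delta", 0.0)),
        triggered_action=payload.get("triggered_action"),
    )
```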
- Function: The animation modules convert speech to motion and audio to facial expressions
- Core Components:
  - Speech2MotionAdapter: Speech-to-motion adapter
  - Audio2FaceAdapter: Audio-to-facial-expression adapter
- Features: Real-time motion generation, facial expression synchronization, 3D animation output
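Facial expression synchronization usually means stamping blendshape weights with the audio time they belong to. A hypothetical frame type as an illustration; the real definitions are in data_structures/:

```python
from dataclasses import dataclass


@dataclass
class BlendshapeFrame:
    audio_time_s: float        # position in the synthesized audio stream
    weights: dict[str, float]  # e.g. {"jawOpen": 0.4, "mouthSmileLeft": 0.2}
```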
- Function: The aggregator module coordinates data flow between modules and keeps their outputs synchronized
- Core Components:
  - ConversationAggregator: Conversation aggregator, coordinates conversation flow
  - TTSReactionAggregator: TTS reaction aggregator, synchronizes voice and reactions
  - BlendshapesAggregator: Facial expression aggregator
- Features: Data flow coordination, real-time synchronization, error handling
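The essence of aggregation is merging several asynchronous streams into one. A simplified, generic sketch of that idea using asyncio; this is not the project's aggregator implementation:

```python
import asyncio
from typing import Any, AsyncIterator


async def merge(*streams: AsyncIterator[Any]) -> AsyncIterator[Any]:
    """Interleave items from several async streams as they become ready."""
    queue: asyncio.Queue = asyncio.Queue()
    done = object()  # sentinel marking one exhausted stream

    async def pump(stream: AsyncIterator[Any]) -> None:
        async for item in stream:
            await queue.put(item)
        await queue.put(done)

    tasks = [asyncio.create_task(pump(s)) for s in streams]
    finished = 0
    while finished < len(streams):
        item = await queue.get()
        if item is done:
            finished += 1
        else:
            yield item
    for t in tasks:
        t.cancel()  # defensive; all pumps have already finished here
```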
- Function: The core orchestrator (proxy.py) manages DAG workflows and coordinates interactions between all modules
- Core Components:
  - Proxy: Main orchestrator, manages complex AI interaction workflows
- Supported Modes: Audio conversation, text conversation, and mixed mode
- Features: DAG workflow management, module coordination, process control
The system uses a Directed Acyclic Graph (DAG) architecture to manage complex AI interaction workflows. Each conversation request creates a DAG instance containing multiple processing nodes and dependencies.
Diagram Legend:
- Solid arrows (→): One-time complete data transmission between nodes in a single generation request
- Dashed arrows (⇢): Streaming data transmission between nodes in a single generation request
Workflow Diagrams:
- Complete Audio Conversation Flow (audio_chat_with_text_llm_v4)
- Express Audio Conversation Flow (audio_chat_with_audio_llm_v4)
- Complete Text Conversation Flow (text_chat_with_text_llm_v4)
- Express Text Conversation Flow (text_chat_with_audio_llm_v4)
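As a toy illustration of the DAG idea, the sketch below runs each node as soon as all of its dependencies have completed. The node names are made up, and the real proxy.py is far richer (it also streams partial results along the dashed edges):

```python
import asyncio

# node -> set of dependencies (must be acyclic); names are illustrative
DAG: dict[str, set[str]] = {
    "asr": set(),
    "llm": {"asr"},
    "tts": {"llm"},
    "reaction": {"llm"},
    "audio2face": {"tts"},
}


async def run_node(name: str) -> str:
    await asyncio.sleep(0.01)  # stand-in for the real module call
    return f"{name}-output"


async def run_dag() -> dict[str, str]:
    results: dict[str, str] = {}
    pending = dict(DAG)
    while pending:
        # every node whose dependencies are all satisfied runs in parallel
        ready = [n for n, deps in pending.items() if deps <= results.keys()]
        outputs = await asyncio.gather(*(run_node(n) for n in ready))
        for name, out in zip(ready, outputs):
            results[name] = out
            del pending[name]
    return results


print(asyncio.run(run_dag()))
```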
For the best experience, we recommend using Docker Compose to start the complete DLP3D stack, which includes the Orchestrator along with all required dependencies (MongoDB, Audio2Face, Speech2Motion, etc.).
Please follow the Quick Start guide on ReadTheDocs to set up and run the entire infrastructure.
Note: The above link will redirect you to the web_backend repository for complete backend setup instructions.
If you need to run the Orchestrator service independently or configure advanced options, please refer to the Docker Configuration Guide for detailed setup instructions, environment variables, and configuration options.
For local development and deployment, please follow the detailed installation guide:
The installation guide provides step-by-step instructions for:
- Setting up Python 3.10+ environment
- Installing Protocol Buffers compiler
- Configuring the development environment
- Installing project dependencies
After completing the environment setup as described in the installation guide, you can start the service locally:
```bash
# Activate the conda environment
conda activate orchestrator

# Start the service
python main.py --config_path configs/local.py
```

Supported conversation (LLM) providers:

| Provider | Adapter Class | Default Model |
|---|---|---|
| OpenAI | OpenAIConversationClient | gpt-4.1-2025-04-14 |
| Anthropic | AnthropicConversationClient | claude-sonnet-4-5-20250929 |
| Gemini | GeminiConversationClient | gemini-2.5-flash-lite |
| DeepSeek | DeepSeekConversationClient | deepseek-chat |
| xAI | XAIConversationClient | grok-3 |
| SenseNova | SenseChatConversationClient | SenseChat-5-1202 (Large Language Model) |
| SenseNova | SenseNovaConversationClient | SenseNova-V6-5-Pro (Multimodal Model) |
| SenseNova | SenseNovaOmniConversationClient | SenseNova-V6-5-Omni (Real-time Interactive Multimodal Model) |
| OpenAI | OpenAIAudioClient | gpt-4o-mini-realtime-preview-2024-12-17 |
Supported speech recognition (ASR) providers:

| Provider | Adapter Class |
|---|---|
| OpenAI | OpenAIRealtimeASRClient |
| Volcano Engine | HuoshanASRClient |
| SenseTime | SensetimeASRClient |
| Softsugar | SoftSugarASRClient |
Supported text-to-speech (TTS) providers:

| Provider | Adapter Class |
|---|---|
| Volcano Engine | HuoshanTTSClient |
| Softsugar | SoftSugarTTSClient |
| SenseNova | SensenovaTTSClient |
| ElevenLabs | ElevenLabsTTSClient |
| SenseTime | SensetimeTTSClient |
Supported memory providers:

| Provider | Adapter Class | Default Model |
|---|---|---|
| OpenAI | OpenAIMemoryClient | gpt-4.1-mini-2025-04-14 |
| xAI | XAIMemoryClient | grok-3 |
| SenseNova | SenseNovaOmniMemoryClient | SenseNova-V6-5-Omni |
Supported classification providers:

| Provider | Adapter Class | Default Model |
|---|---|---|
| OpenAI | OpenAIClassificationClient | gpt-4.1-mini-2025-04-14 |
| xAI | XAIClassificationClient | grok-3 |
| Gemini | GeminiClassificationClient | gemini-2.5-flash-lite |
| SenseNova | SenseNovaOmniClassificationClient | SenseNova-V6-5-Omni |
Supported reaction providers:

| Provider | Adapter Class | Default Model |
|---|---|---|
| OpenAI | OpenAIReactionClient | gpt-4.1-mini-2025-04-14 |
| xAI | XAIReactionClient | grok-3 |
| Gemini | GeminiReactionClient | gemini-2.5-flash-lite |
| SenseNova | SenseNovaOmniReactionClient | SenseNova-V6-5-Omni |
- Full Documentation - Complete project documentation
- API Documentation - Complete API reference for WebSocket and HTTP endpoints
- Development Guide - Guide for adding new AI services, testing, and code quality standards
This project is licensed under the MIT License. See the LICENSE file for details.
The MIT License is a permissive free software license that allows you to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the software with very few restrictions. The only requirement is that the original copyright notice and license text must be included in all copies or substantial portions of the software.