Skip to content

scaledb-io/cloud

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

10 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Production CDC Platform with Intelligent Query Routing

A complete production-ready Change Data Capture (CDC) platform that provides real-time data replication from MySQL to ClickHouse with intelligent query routing through ProxySQL.

πŸ—οΈ Architecture Overview

graph TB
    subgraph "Data Sources"
        APP[Applications]
        API[APIs]
    end
    
    subgraph "OLTP Layer"
        MYSQL[(MySQL 8.0<br/>Primary Database)]
        BINLOG[Binary Logs<br/>ROW Format]
    end
    
    subgraph "CDC Pipeline"
        DEBEZIUM[Debezium Connect<br/>CDC Connector]
        REDPANDA[(Redpanda<br/>High-Performance Messaging)]
    end
    
    subgraph "OLAP Layer"
        CLICKHOUSE[(ClickHouse<br/>Analytics Database)]
        KAFKA_TABLES[Kafka Tables]
        MAT_VIEWS[Materialized Views]
    end
    
    subgraph "Intelligent Routing"
        PROXYSQL[ProxySQL<br/>Query Router]
    end
    
    subgraph "Clients"
        OLTP_CLIENTS[OLTP Clients<br/>Transactions]
        OLAP_CLIENTS[OLAP Clients<br/>Analytics]
    end
    
    APP --> PROXYSQL
    API --> PROXYSQL
    
    PROXYSQL -->|OLTP Queries| MYSQL
    PROXYSQL -->|OLAP Queries| CLICKHOUSE
    
    MYSQL --> BINLOG
    BINLOG --> DEBEZIUM
    DEBEZIUM --> REDPANDA
    REDPANDA --> KAFKA_TABLES
    KAFKA_TABLES --> MAT_VIEWS
    MAT_VIEWS --> CLICKHOUSE
    
    OLTP_CLIENTS --> PROXYSQL
    OLAP_CLIENTS --> PROXYSQL
    
    style MYSQL fill:#e1f5fe
    style CLICKHOUSE fill:#f3e5f5
    style REDPANDA fill:#e8f5e8
    style PROXYSQL fill:#fff3e0
Loading

πŸš€ Quick Start

Prerequisites

  • Docker and Docker Compose
  • 8GB+ RAM available
  • 20GB+ disk space

1. Deploy the Platform

./setup-production-cdc.sh

2. Load Test Data (Optional)

./load-imdb-data.sh  # Loads 194M+ IMDb records

3. Test the Setup

./test-cdc.sh

4. Connect and Query

# All queries through intelligent routing
mysql -h127.0.0.1 -P6033 -uproxyuser -pproxypass123

# OLAP queries automatically go to ClickHouse
SELECT titleType, COUNT(*) FROM imdb.title_basics GROUP BY titleType;

# OLTP queries automatically go to MySQL  
INSERT INTO imdb.title_basics VALUES ('tt9999999', 'movie', 'Test', 'Test', 0, 2024, NULL, 120, 'Drama');

πŸ“Š Performance Results

Test Environment: MacBook Pro M2

  • Dataset: 194M+ IMDb records across 7 tables
  • Largest Table: 94M records (title_principals)
  • Total Processing Time: ~45 minutes end-to-end

Key Metrics

Metric Value
CDC Processing Speed ~200K records/second
Large Transaction Processing 94M records in 12 minutes
ClickHouse Query Performance 21x faster than MySQL on aggregations
OLAP Query Response 0.033s for COUNT with WHERE on 2M+ records
OLTP Query Response 0.080s for point lookups
Resource Usage Max 1141% CPU during complex analytics

πŸ”§ Components

Core Services

  • MySQL 8.0: Primary OLTP database with CDC-enabled binary logging
  • Debezium Connect: Real-time CDC connector for MySQL
  • Redpanda: High-performance Kafka-compatible message broker
  • ClickHouse: Column-oriented analytics database
  • ProxySQL: Intelligent query routing based on query patterns

Networking

  • MySQL: Port 3306
  • ClickHouse: Ports 9000 (native), 8123 (HTTP), 9004 (MySQL protocol)
  • Redpanda: Ports 9092 (external), 29092 (internal)
  • ProxySQL: Port 6033 (application), 6032 (admin)
  • Redpanda Console: Port 8080

🎯 Query Routing Rules

ProxySQL automatically routes queries based on patterns:

OLAP β†’ ClickHouse

  • COUNT(*), SUM(), AVG(), MAX(), MIN()
  • GROUP BY operations
  • Complex aggregations and analytics
  • Time-series queries

OLTP β†’ MySQL

  • INSERT, UPDATE, DELETE
  • Point lookups (WHERE id = ?)
  • Transactional operations
  • SELECT without aggregations

πŸ› Lessons Learned & Production Caveats

1. Decimal Handling in CDC

Issue: Debezium's default decimal handling caused JSON parsing errors in ClickHouse.

Solution: Added "decimal.handling.mode": "double" to all Debezium connectors.

{
  "config": {
    "decimal.handling.mode": "double"
  }
}

2. Large Transaction Processing

Issue: Single 94M record transaction created 42M+ Kafka messages, causing significant processing lag.

Observations:

  • Peak lag reached 42M messages
  • Processing completed in 12 minutes
  • System remained stable throughout
  • CPU usage peaked at 1141% on M2

Production Recommendations:

  • Split large data loads into smaller batches (1M records)
  • Use chunked loading with manual commits
  • Consider CDC pause/resume for initial bulk loads
  • Monitor Kafka topic retention and storage growth

3. ClickHouse Kafka Table Configuration

Issue: Initial setup failed with heredoc syntax errors in Kafka table creation.

Solution: Switched from multiquery heredocs to individual command execution:

# Before (failed)
clickhouse-client --multiquery << 'EOF'
CREATE TABLE kafka_table...
CREATE MATERIALIZED VIEW...
EOF

# After (working)
clickhouse-client --query "CREATE TABLE kafka_table..."
clickhouse-client --query "CREATE MATERIALIZED VIEW..."

4. CDC DELETE Operation Handling

Issue: DELETE operations from MySQL don't replicate to ClickHouse by default.

Current Behavior: Materialized views filter for ('c', 'u', 'r') operations only.

Production Solution: Implement ReplacingMergeTree with CDC operation tracking:

CREATE TABLE table_name (
    -- columns
    cdc_operation String,
    cdc_timestamp DateTime64,
    is_deleted UInt8 DEFAULT 0
) ENGINE = ReplacingMergeTree(cdc_timestamp)
ORDER BY primary_key;

5. Resource and Storage Management

Observations:

  • Kafka topic storage grew to 102GB during large transaction processing
  • Memory usage peaked at 10.7GB for ClickHouse during complex queries
  • Docker volume management is critical for production deployments

6. Database Naming Consistency

Issue: Mixed usage of imdb vs imdb_cdc database names caused routing confusion.

Solution: Standardized on imdb database name across all components for consistent ProxySQL routing.

7. Kafka Partition Scaling

Observation: Large tables (50M+ records) benefit from multiple partitions:

  • title_principals: 5 partitions for 94M records
  • title_akas: 3 partitions for 52M records
  • Optimal partition count: ~10-20M records per partition

πŸ”§ Troubleshooting

Check Service Health

docker compose ps

Monitor CDC Lag

./monitor-cdc-lag.sh

View Kafka Topics

docker exec redpanda-cdc rpk topic list
docker exec redpanda-cdc rpk group describe clickhouse-cdc-group

Check ProxySQL Routing

mysql -h127.0.0.1 -P6032 -uadmin -padmin -e "SELECT * FROM stats_mysql_query_rules;"

ClickHouse Query Performance

docker exec clickhouse-cdc clickhouse-client --password clickhouse123 --query "
SELECT query, elapsed, memory_usage 
FROM system.query_log 
WHERE event_date = today() 
ORDER BY event_time DESC 
LIMIT 10;"

Check Volume Sizes

docker system df -v

Kafka Retention Management

# Check topic storage usage
docker exec redpanda-cdc du -sh /var/lib/redpanda/data/kafka/

# Clean up initial load data (reduces 100GB+ to ~7GB)
./configure-kafka-retention.sh

# Monitor CDC lag during cleanup
./monitor-cdc-lag.sh

πŸš€ Production Considerations

Scaling ClickHouse Horizontally

  • Distributed Tables: Shard data across multiple ClickHouse nodes
  • Replication: Use ClickHouse Keeper for automatic failover
  • Mixed Clusters: Combine sharding and replication for ultimate performance

CDC Pipeline Optimization

  • Parallel Processing: Increase Kafka partitions for high-throughput tables
  • Batch Sizing: Tune Debezium connector batch settings
  • Compression: Enable Kafka message compression for storage efficiency

Monitoring and Observability

  • Metrics: Integrate with Prometheus/Grafana
  • Alerting: Set up alerts for CDC lag, query performance, and resource usage
  • Logging: Centralized logging for all components

🎯 Production Deployment Patterns

Chunked Loading Strategy

# Split large files
split -l 1000000 large_file.tsv chunk_

# Load in batches with progress tracking
for chunk in chunk_*; do
    echo "Loading $chunk..."
    mysql -h127.0.0.1 -P3306 -uroot -ppassword db -e "
    SET autocommit=0;
    LOAD DATA INFILE '/data/$chunk' INTO TABLE large_table;
    COMMIT;"
    
    # Monitor CDC lag before next batch
    ./check-cdc-lag.sh
done

CDC Connector Tuning

{
  "config": {
    "max.batch.size": "2048",
    "max.queue.size": "8192", 
    "poll.interval.ms": "1000",
    "decimal.handling.mode": "double"
  }
}

πŸ“ˆ Next Steps & TODO

Phase 2: Production Enhancements

  1. Horizontal Scaling: Multi-node ClickHouse cluster configuration
  2. DELETE Handling: Implement ReplacingMergeTree pattern for full CDC operations
  3. Security: Add TLS encryption and proper authentication
  4. Backup/Recovery: Automated backup strategies for both databases
  5. Monitoring: Full observability stack integration

Performance Optimization

  1. Chunked Loading: Implement batched data loading to avoid massive transactions
  2. Resource Tuning: Optimize container resource allocation for production workloads
  3. Storage Management: Implement Kafka topic retention policies

πŸ† Success Metrics

This platform successfully demonstrated:

  • βœ… Real-time CDC processing of 194M+ records
  • βœ… Intelligent query routing with 21x performance improvement
  • βœ… Stable handling of massive transactions (94M records)
  • βœ… Production-grade architecture on consumer hardware
  • βœ… Sub-second analytics on complex queries
  • βœ… Sustained processing at 200K+ records/second

Tested on MacBook Pro M2 - Ready for production deployment and scaling! πŸš€

🀝 Contributing

This platform was built and tested with real production workloads, processing 194M+ IMDb records with lessons learned from handling massive datasets, complex queries, and CDC pipeline optimization.

For issues, improvements, or questions about production deployment, please refer to the project repository.

About

ScaleDB offer to support OLTP and OLAP with MySQL and ClickHouse

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Contributors 2

  •  
  •