lele is a standalone, dependency-free inference engine for intelligence, built from scratch in pure Rust.
It rejects the "general-purpose runtime" approach (wrapping C++ libraries such as ONNX Runtime (ORT) or relying on heavy Torch ports) in favor of hand-crafted, domain-specific kernels.
lele is designed to run deep learning models with minimal overhead, primarily speech models such as SenseVoice, Silero VAD, and TTS, but also vision models such as Yolo26.
The table below shows the latest comparison between lele and ONNX Runtime (CPU) on macOS (Apple Silicon), with ORT restricted to a single thread (intra_op_num_threads=1, inter_op_num_threads=1).
For fairness and stability, SenseVoice uses steady-state metrics (warmup + multi-run average).
| Model | ORT | lele | Speedup |
|---|---|---|---|
| Silero VAD | RTF 0.002882 | RTF 0.0022 | 1.31x |
| SenseVoice | Steady Model RTF 0.0294 | Steady Model RTF 0.0256 (Cold RTF 0.0549) | 1.15x |
| Supertonic | RTF 0.1667 | RTF 0.0648 | 2.57x |
| Yolo26 | Avg 704.50ms (RTF 21.1350) | Avg 534.97ms (RTF 16.0490) | 1.32x |
| Yolo26n-seg | Avg 126.51ms (RTF 3.7953) | Avg 64.82ms (RTF 1.9445) | 1.95x |
Note: For speech models we report steady-state RTF (warmup + multi-run average). For Yolo models we report average latency over 10 runs and also include the RTF at 30 fps.
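As a rough illustration of how the figures in the table relate to one another, the sketch below computes a real-time factor, the RTF at 30 fps used for the Yolo rows, a speedup ratio, and a warmup-discarding average. The function names are hypothetical and not part of lele. The definitions assume that RTF is processing time divided by input duration and that the 30 fps variant divides latency by the 1000/30 ms frame budget, which is consistent with the reported numbers.

```rust
/// Real-time factor: processing time divided by the duration of the input.
/// Values below 1.0 mean the model runs faster than real time.
fn rtf(processing_secs: f64, media_secs: f64) -> f64 {
    processing_secs / media_secs
}

/// RTF of a per-frame latency against a target frame rate.
/// At 30 fps the frame budget is 1000 / 30 ms.
fn rtf_at_fps(latency_ms: f64, fps: f64) -> f64 {
    let frame_budget_ms = 1000.0 / fps;
    latency_ms / frame_budget_ms
}

/// Speedup of a candidate over a baseline, given the same
/// lower-is-better metric (RTF or latency) for both.
fn speedup(baseline: f64, candidate: f64) -> f64 {
    baseline / candidate
}

/// Steady-state mean: drop the first `warmup` samples, average the rest.
/// Returns None when no samples remain after the warmup is discarded.
fn steady_state_mean(samples: &[f64], warmup: usize) -> Option<f64> {
    let steady = samples.get(warmup..)?;
    if steady.is_empty() {
        return None;
    }
    Some(steady.iter().sum::<f64>() / steady.len() as f64)
}
```

For instance, `rtf_at_fps(704.50, 30.0)` evaluates to roughly 21.135, the value listed for ORT in the Yolo26 row, and `speedup(0.1667, 0.0648)` to roughly 2.57, the Supertonic speedup.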
- Zero Runtime Dependencies: Generated models are pure Rust.
- AOT Compilation: Converts ONNX models to specialized Rust code for maximum performance.
- SIMD Optimized: Hand-written kernels using Apple Silicon (NEON) and x86_64 (AVX/SSE) intrinsics.
- Memory Efficient: Static buffer allocation and zero-copy weight loading.
- Speech Optimized: Built-in feature extraction for audio (FFT, Mel-spectrogram, LFR, CMVN).
- WebAssembly Ready: Full browser compatibility with WASM SIMD128 optimizations.
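To make the LFR and CMVN steps in the feature list more concrete, here is a minimal scalar sketch of both. It is not lele's implementation: the function names are hypothetical, the LFR variant simply repeats the last frame at the end of the signal, and the CMVN variant computes per-utterance statistics rather than loading precomputed ones. All frames are assumed to have the same dimension.

```rust
/// Low frame rate (LFR) stacking: concatenate `m` consecutive frames and
/// advance by `n` frames, so the output has about 1/n as many frames.
/// Assumption: windows that run past the end repeat the final frame.
fn lfr(frames: &[Vec<f32>], m: usize, n: usize) -> Vec<Vec<f32>> {
    if frames.is_empty() || m == 0 || n == 0 {
        return Vec::new();
    }
    let last = frames.len() - 1;
    let mut out = Vec::new();
    let mut start = 0;
    while start < frames.len() {
        let mut stacked = Vec::with_capacity(m * frames[0].len());
        for k in 0..m {
            let idx = (start + k).min(last);
            stacked.extend_from_slice(&frames[idx]);
        }
        out.push(stacked);
        start += n;
    }
    out
}

/// Cepstral mean and variance normalization (CMVN), in place.
/// Assumption: statistics are computed per feature dimension over the
/// whole utterance; a small constant guards against zero variance.
fn cmvn(frames: &mut [Vec<f32>]) {
    if frames.is_empty() {
        return;
    }
    let dim = frames[0].len();
    let count = frames.len() as f32;
    for d in 0..dim {
        let mean = frames.iter().map(|f| f[d]).sum::<f32>() / count;
        let var = frames.iter().map(|f| (f[d] - mean).powi(2)).sum::<f32>() / count;
        let inv_std = 1.0 / (var + 1e-8).sqrt();
        for f in frames.iter_mut() {
            f[d] = (f[d] - mean) * inv_std;
        }
    }
}
```

With five one-dimensional frames `0..5`, `lfr(&frames, 3, 2)` yields three stacked frames of length 3, the last of which is padded by repeating frame 4.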
lele supports a comprehensive set of ONNX operators:
- Math: Add, Sub, Mul, Div, Pow, Sqrt, Neg, Abs, Exp, Log, Sin, Cos, Erf, Softplus, Clip, Mod
- Neural Network: Conv, ConvTranspose, Gemm, MatMul, MatMulInteger, LSTM, GRU, BatchNormalization, LayerNormalization
- Activation: Relu, Sigmoid, Tanh, Softmax, Gelu, PRelu, Silu
- Tensor: Reshape, Transpose, Concat, Split, Slice, Gather, GatherElements, Pad, Expand, Tile, Where, TopK, Flatten, Squeeze, Unsqueeze
- Reduction: ReduceSum, ReduceMean, ReduceMax, ReduceL2
- Comparison: Equal, Less, Greater, LessOrEqual, GreaterOrEqual
- Signal: STFT (Short-Time Fourier Transform)
- Other: Shape, Size, Cast, ConstantOfShape, DynamicQuantizeLinear, Identity, Range
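As an example of what one of these operators computes, the following is a plain scalar reference for LayerNormalization over the last axis, following the ONNX definition (population variance, then scale and bias). It is a sketch for illustration only: the function name and signature are hypothetical and do not reflect lele's generated code or kernel API.

```rust
/// Reference LayerNormalization over the last axis of a row-major tensor.
/// `x` holds `rows * hidden` values; `scale` and `bias` hold `hidden` values.
fn layer_norm(x: &[f32], hidden: usize, scale: &[f32], bias: &[f32], eps: f32) -> Vec<f32> {
    assert!(hidden > 0 && x.len() % hidden == 0, "x must be rows * hidden");
    assert!(scale.len() == hidden && bias.len() == hidden);
    let mut y = Vec::with_capacity(x.len());
    for row in x.chunks_exact(hidden) {
        // Mean and population variance of the current row.
        let mean = row.iter().sum::<f32>() / hidden as f32;
        let var = row.iter().map(|v| (v - mean) * (v - mean)).sum::<f32>() / hidden as f32;
        let inv_std = 1.0 / (var + eps).sqrt();
        // Normalize, then apply the learned scale and bias.
        for ((v, s), b) in row.iter().zip(scale).zip(bias) {
            y.push((v - mean) * inv_std * s + b);
        }
    }
    y
}
```

A scalar reference like this is mainly useful as a correctness oracle when checking an optimized kernel against expected outputs.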
lele compiles to WebAssembly and runs ML inference directly in the browser with no server required.
| Optimization | Impact |
|---|---|
| WASM SIMD128 | Tiled matmul micro-kernel with f32x4_mul/f32x4_add (4x unroll) |
| Optimized Activations | SIMD paths for tanh/sigmoid/relu/silu using polynomial exp approximation |
| Vectorized Normalization | SIMD softmax and layer_norm with horizontal reduction |
| Release Settings | opt-level=3, lto=true, codegen-units=1, panic="abort" |
| Post-Processing | wasm-opt -O3 for additional 5-15% size/speed gains |
Binary Size Reduction: 2.9M → 1.7M for SenseVoice (about 42% smaller than the dev build once the optimizations above are applied)
Expected Runtime Speedup: 20-100x over unoptimized scalar WASM (10-50x from release mode + 2-4x from SIMD128)
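The idea behind a 4-wide matmul micro-kernel can be sketched in portable Rust, with a `[f32; 4]` accumulator standing in for a SIMD128 register. This is an illustrative assumption about the general technique, not lele's kernel: the function name is hypothetical, and a real WASM build would use the `core::arch::wasm32` intrinsics (such as `f32x4_mul` and `f32x4_add`) instead of the inner lane loop.

```rust
/// C (m x n) = A (m x k) * B (k x n), all row-major.
/// Four output columns are accumulated at a time, mirroring one f32x4 lane group.
fn matmul_4lane(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    let mut c = vec![0.0f32; m * n];
    let n4 = n - n % 4; // columns covered by full 4-wide groups
    for i in 0..m {
        let mut j = 0;
        while j < n4 {
            let mut acc = [0.0f32; 4];
            for p in 0..k {
                let a_ip = a[i * k + p]; // broadcast one A element across lanes
                let b_row = &b[p * n + j..p * n + j + 4];
                for lane in 0..4 {
                    acc[lane] += a_ip * b_row[lane];
                }
            }
            c[i * n + j..i * n + j + 4].copy_from_slice(&acc);
            j += 4;
        }
        // Scalar tail for the remaining n % 4 columns.
        for col in n4..n {
            let mut sum = 0.0f32;
            for p in 0..k {
                sum += a[i * k + p] * b[p * n + col];
            }
            c[i * n + col] = sum;
        }
    }
    c
}
```

Keeping the accumulator in a fixed-size array for the whole `k` loop avoids writing partial sums back to memory, which is the main point of the micro-kernel structure; the compiler may also auto-vectorize the lane loop on targets with SIMD enabled.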
cd examples/web-demo
./build_wasm.sh
python3 -m http.server 8080 -d web
# Open http://localhost:8080

See examples/web-demo/README.md for details.
- SenseVoiceSmall: High-accuracy multi-lingual ASR.
- Silero VAD: Reliable Voice Activity Detection.
- Supertonic 2: Fast and high-quality Text-to-Speech (5 languages).
- Supertonic 3: Improved TTS with 31 languages, better reading stability, and expression tags (<laugh>, <breath>, <sigh>).
- Yolo26: Real-time object detection.
- Rust (Latest stable)
- cargo
To compile an ONNX model into Rust code:
cargo run --release --bin lele_gen -- <model.onnx> <output_path.rs>

To run the examples:

# SenseVoice ASR
./run_sensevoice.sh
# Supertonic 2 TTS (5 languages)
./run_supertonic.sh
# Supertonic 3 TTS (31 languages, expression tags)
./run_supertonic3.sh
# Silero VAD
./run_silero.sh
# Yolo26 Object Detection
./run_yolo26.sh

- Performance optimizations (SIMD, multi-threading, etc.), better than ONNX Runtime.
- Support for more audio models (e.g., Whisper, CosyVoice, etc.)
- GPU acceleration backend (wgpu); Quantization (INT8/FP16)
- Advanced attention mechanisms (FlashAttention, PagedAttention)
- Voice API server (RESTful service), including ASR/TTS/Denoise endpoints.
MIT
