This document describes BitNet-rs quantization formats and device-aware acceleration surfaces.
Claim boundary: feature flags, kernel names, and acceleration surfaces here do not by themselves prove product readiness, speedup, server readiness, fallback behavior, or full residency. Current hardware and model claims must be checked against active model coverage, receipts, status docs, specs, and claim gates.
BitNet-rs contains multiple quantization formats with device-aware acceleration surfaces:
- Native Rust implementation with device selection and explicit fallback reporting
- Device-aware quantization surfaces with feature-gated CUDA kernels and CPU SIMD optimization
- Accuracy: Target ≥99.8% correlation with FP32 reference (defined in test fixtures; formal measurement pending)
- Performance: Hardware-dependent; SIMD-optimised. QK256 path uses scalar kernels (~0.1 tok/s for 2B models).
- 2-bit signed quantization with optimized bit-packing (4 values per byte)
- Strict Mode: Use `BITNET_STRICT_MODE=1` to prevent mock fallbacks and ensure real quantized computation
- Real Computation: Native quantized GEMV kernel eliminates FP32 dequantization staging (Issue #261 - AC3)
- QuantizedLinear Integration: Replaces standard Linear layers in transformer architecture (Issue #261 - AC5)
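The "4 values per byte" packing mentioned above can be illustrated with a minimal sketch. The function names `pack_2bit` and `unpack_2bit` are hypothetical, and the bit order (element `i` of each group in bits `2*i..2*i+2`) is an assumption; the actual BitNet-rs kernels may lay out bits differently.

```rust
/// Pack 2-bit codes (each in 0..=3) into bytes, four codes per byte.
/// Assumption: code `i` of each group of four occupies bits `2*i..2*i+2`.
pub fn pack_2bit(codes: &[u8]) -> Vec<u8> {
    codes
        .chunks(4)
        .map(|chunk| {
            chunk
                .iter()
                .enumerate()
                // Mask to two bits so stray high bits cannot leak into neighbours.
                .fold(0u8, |byte, (i, &c)| byte | ((c & 0b11) << (2 * i)))
        })
        .collect()
}

/// Unpack `len` 2-bit codes from a packed buffer (inverse of `pack_2bit`).
/// Panics if `packed` holds fewer than `len` codes.
pub fn unpack_2bit(packed: &[u8], len: usize) -> Vec<u8> {
    (0..len)
        .map(|i| (packed[i / 4] >> (2 * (i % 4))) & 0b11)
        .collect()
}
```

Packing followed by unpacking with the original length returns the input codes, so a final partial byte is handled by passing the true element count to `unpack_2bit`.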
User-facing runtime and proof paths must not silently use the reduced-feature GGUF minimal loader. The enhanced GGUF loader is the default expectation for real inference claims.
- `BITNET_STRICT_MODE=1` or `BITNET_DISABLE_MINIMAL_LOADER=1` fails fast when the enhanced loader cannot parse or validate the model.
- `BITNET_ALLOW_MINIMAL_LOADER=1` is the explicit compatibility opt-in for the minimal loader. It may initialize missing transformer tensors with compatibility defaults and cannot support correctness or performance claims.
- `bitnet run --strict-loader` sets strict loader mode for CLI proof paths.
- `bitnet run --allow-mock` is a smoke/UX-test escape hatch and enables compatibility fallback only by request.
- JSON output from `bitnet run --json-out` records the loader mode so receipts or adjacent proof artifacts can distinguish `enhanced` from explicitly requested `compatibility_fallback`.
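The loader policy above can be sketched as a small decision function. This is not the BitNet-rs implementation: `LoaderMode`, `resolve_loader_mode`, and `flag` are hypothetical names, a variable is assumed to count as set only when its value is exactly `1`, and strict settings are assumed to win over the compatibility opt-in when both are present.

```rust
/// Loader modes, named after the strings the JSON output is described as recording.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LoaderMode {
    Enhanced,
    CompatibilityFallback,
}

impl LoaderMode {
    pub fn as_str(self) -> &'static str {
        match self {
            LoaderMode::Enhanced => "enhanced",
            LoaderMode::CompatibilityFallback => "compatibility_fallback",
        }
    }
}

/// Assumption: a variable is treated as enabled only when its value is exactly "1".
fn flag(value: Option<&str>) -> bool {
    value == Some("1")
}

/// Decide which loader may be used once the enhanced loader has succeeded or failed.
/// The three `Option<&str>` arguments stand for the values of BITNET_STRICT_MODE,
/// BITNET_DISABLE_MINIMAL_LOADER, and BITNET_ALLOW_MINIMAL_LOADER.
pub fn resolve_loader_mode(
    enhanced_ok: bool,
    strict_mode: Option<&str>,
    disable_minimal: Option<&str>,
    allow_minimal: Option<&str>,
) -> Result<LoaderMode, String> {
    if enhanced_ok {
        return Ok(LoaderMode::Enhanced);
    }
    // Assumption: strict settings take precedence over the opt-in.
    if flag(strict_mode) || flag(disable_minimal) {
        return Err("enhanced loader failed and strict loading is enabled".to_string());
    }
    if flag(allow_minimal) {
        return Ok(LoaderMode::CompatibilityFallback);
    }
    // Never fall back silently: the minimal loader needs an explicit request.
    Err("enhanced loader failed; minimal loader requires explicit opt-in".to_string())
}
```

A caller could feed it values read with `std::env::var(...).ok()` and `as_deref()`. Taking the values as arguments keeps the decision testable without touching the process environment.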
- Table lookup quantization optimized for ARM NEON architecture (4-bit, 2 elements per byte with nibble packing)
- Accuracy: Target ≥99.6% correlation with FP32 reference (defined in test fixtures)
- Performance: Hardware-dependent; optimised for ARM NEON.
- NEON Improvements: ARM NEON kernel throughput and accuracy improvements added in #988
- Device-Aware Selection: Automatic ARM NEON vectorization with scalar fallback
- Memory-efficient lookup tables (16-256 entries, cache-friendly)
- Parallel processing with configurable block sizes
- Real Computation: Direct table lookup matmul without FP32 staging (Issue #261)
- Safe LUT Index Calculation: Uses `bitnet_kernels::tl_lut::lut_index()` with checked arithmetic and overflow protection
- Advanced table lookup quantization optimized for x86 AVX2/AVX-512 (8-bit, 1 element per byte)
- Accuracy: Target ≥99.6% correlation with FP32 reference (defined in test fixtures)
- Performance: Hardware-dependent; optimised for x86 AVX2/AVX-512.
- SIMD Optimization: AVX2 (32-byte) and AVX-512 (64-byte) vectorization
- AVX-512 Kernels: Dedicated AVX-512 TL2 kernels added in #997 for 64-byte wide SIMD lanes
- Enhanced vectorized operations (256-4096 entry tables) for large tensor processing
- CPU feature detection with graceful fallback to scalar implementation
- Real Computation: Direct table lookup matmul without FP32 staging (Issue #261)
- 2-bit Domain: Input quantization stays in the 2-bit domain throughout (fixed in #978)
- Safe LUT Index Calculation: Uses `bitnet_kernels::tl_lut::lut_index()` with checked arithmetic and overflow protection
- GGML I2_S format with 256-element blocks (QK_K = 256 per GGML conventions)
- Block size: 256 elements
- Format: 64 bytes per block (no per-block scales), scales in separate tensor
- Support: ✅ Pure Rust (kernel: `i2s_qk256::gemv_qk256`) - no FFI required
- Status: Working (scalar kernels; ~0.1 tok/s for 2B models)
- Use case: MS BitNet GGUF models using GGML format
- Accuracy: Target ≥99.8% correlation with FP32 reference
- Performance: 2-bit signed quantization: [-2, -1, +1, +2] mapping
- Automatic detection: Loader detects QK256 format from tensor sizes
- Transparent dispatch: Transformer automatically uses QK256 kernel when weights present
- See also: Dual I2_S Flavor Explanation
- GGML-compatible quantization with 82-byte block layout and 4-level [-2,-1,1,2] mapping
- Q4_0, Q5_0, Q8_0, etc. (planned for future releases)
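To make the QK256 layout described above concrete (256-element blocks, 64 packed bytes per block, scales stored separately, and a four-level [-2, -1, +1, +2] mapping), here is a minimal scalar sketch of one block's contribution to a dot product. It is not the `i2s_qk256::gemv_qk256` kernel: the function name, the bit order within a byte, and the assignment of codes 0..3 to the four levels in listed order are all assumptions.

```rust
pub const QK256_BLOCK: usize = 256;
/// 256 elements at 2 bits each = 64 bytes per block.
pub const QK256_PACKED_BYTES: usize = QK256_BLOCK / 4;

/// Assumption: code 0 -> -2, 1 -> -1, 2 -> +1, 3 -> +2.
const CODE_TO_VALUE: [f32; 4] = [-2.0, -1.0, 1.0, 2.0];

/// Dot product of one packed QK256 block with 256 activations.
/// `scale` comes from the separate scale tensor, not from the block itself.
pub fn dot_qk256_block(
    packed: &[u8; QK256_PACKED_BYTES],
    x: &[f32; QK256_BLOCK],
    scale: f32,
) -> f32 {
    let mut acc = 0.0f32;
    for (byte_idx, &byte) in packed.iter().enumerate() {
        for slot in 0..4usize {
            // Assumption: element `slot` of each byte lives in bits 2*slot..2*slot+2.
            let code = (byte >> (2 * slot)) & 0b11;
            acc += CODE_TO_VALUE[code as usize] * x[byte_idx * 4 + slot];
        }
    }
    acc * scale
}
```

A full GEMV would loop this over every block in a weight row and every row in the matrix. Operating on the packed codes directly is what avoids materializing an FP32 copy of the weights.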
The bitnet_kernels::tl_lut module provides safe, bounds-checked index calculation for TL1/TL2 quantization kernels.
Calculate validated index into table lookup buffer with overflow protection.
Signature:

pub fn lut_index(
    block_idx: usize,
    elem_in_block: usize,
    block_bytes: usize,
    elems_per_block: usize,
    lut_len: usize,
) -> Result<usize>

Parameters:
- `block_idx`: Block index in quantized buffer
- `elem_in_block`: Element position within block (0..elems_per_block)
- `block_bytes`: Size of each block in bytes
- `elems_per_block`: Number of elements per quantized block
- `lut_len`: Total length of LUT buffer (for bounds checking)

Returns: Validated LUT index or error if overflow/out-of-bounds

Safety Guarantees:
- Validates `elem_in_block < elems_per_block` (bounds check)
- Uses checked arithmetic to prevent integer overflow
- Validates final index `< lut_len` before returning
- 100% mutation testing coverage (6/6 mutants killed, Issue #462)
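The three guarantees above can be illustrated with a standalone sketch. This is not the crate's source: `lut_index_sketch` and `LutIndexError` are hypothetical names, and the index formula `block_idx * block_bytes + elem_in_block / 8` is taken from the worked example in this document rather than from the implementation.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum LutIndexError {
    /// `elem_in_block` was not smaller than `elems_per_block`.
    ElemOutOfRange,
    /// An intermediate multiplication or addition overflowed `usize`.
    Overflow,
    /// The computed index was not smaller than `lut_len`.
    IndexOutOfBounds,
}

/// Hypothetical stand-in for `bitnet_kernels::tl_lut::lut_index`.
pub fn lut_index_sketch(
    block_idx: usize,
    elem_in_block: usize,
    block_bytes: usize,
    elems_per_block: usize,
    lut_len: usize,
) -> Result<usize, LutIndexError> {
    // 1. Bounds check on the element position.
    if elem_in_block >= elems_per_block {
        return Err(LutIndexError::ElemOutOfRange);
    }
    // 2. Checked arithmetic: no silent wrap-around in release builds.
    let base = block_idx
        .checked_mul(block_bytes)
        .ok_or(LutIndexError::Overflow)?;
    let idx = base
        .checked_add(elem_in_block / 8)
        .ok_or(LutIndexError::Overflow)?;
    // 3. Final bounds check against the LUT length.
    if idx >= lut_len {
        return Err(LutIndexError::IndexOutOfBounds);
    }
    Ok(idx)
}
```

Returning a distinct error variant per failed check is a design choice of this sketch; it makes each guarantee individually observable in tests.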
Example Usage:
```rust
use bitnet_kernels::tl_lut::lut_index;

// Calculate LUT index for block 0, element 0
let idx = lut_index(0, 0, 32, 128, 1024)?;
assert_eq!(idx, 0);

// Calculate LUT index for block 1, element 8
// Formula: 1 * 32 + (8 / 8) = 32 + 1 = 33
let idx = lut_index(1, 8, 32, 128, 1024)?;
assert_eq!(idx, 33);

// Bounds check prevents out-of-range access
let result = lut_index(0, 128, 32, 128, 1024);
assert!(result.is_err()); // elem_in_block >= elems_per_block
```
**Testing Commands:**
```bash
# Run TL LUT helper tests
cargo test -p bitnet-kernels --no-default-features --features cpu tl_lut
# Specific test cases
cargo test -p bitnet-kernels --no-default-features --features cpu test_lut_index_basic
cargo test -p bitnet-kernels --no-default-features --features cpu test_lut_index_overflow_detection
cargo test -p bitnet-kernels --no-default-features --features cpu test_lut_index_boundary_validation
```

See also: Issue #462 for TL LUT helper implementation and mutation testing results.
All quantizers support device-aware operations with:
- Automatic GPU acceleration: CUDA kernels with performance monitoring (alpha)
- Metal acceleration: macOS/iOS GPU via `feature = "metal"` (#992)
- Vulkan compute: Cross-platform GPU via `feature = "vulkan"` (#993)
- Intel oneAPI: Intel CPU/GPU acceleration via `feature = "oneapi"` (#986)
- ROCm support: AMD GPU detection via `rocm_available` field in `DeviceProbe` (#995)
- Transparent CPU fallback: Graceful degradation with maintained accuracy (SIMD-optimised)
- Memory optimization: GPU memory leak detection and efficient allocation
- Feature gating: Proper `#[cfg(feature = "gpu")]` guards for CPU-only builds
- Strict Mode Enforcement: `BITNET_STRICT_MODE=1` prevents mock fallbacks
- FFI Bridge Support: C++ kernel integration for I2S, TL1, and TL2 quantization (requires `--features ffi`)
- Cross-Validation: <5% performance variance from C++ reference implementation
The FFI bridge enables gradual migration from C++ to Rust while maintaining functionality:
- Quantization Types: Full support for I2S, TL1, and TL2 via C++ kernels
- Performance Comparison: Built-in tools to compare FFI vs Rust quantization
- Migration Path: Systematic approach to replace C++ kernels with native Rust
- Safety: Safe Rust wrappers with proper error handling and memory management
- Testing: Comprehensive test suite ensuring FFI/Rust quantization parity
BitNet-rs provides native CUDA mixed precision support for enhanced GPU performance:
- FP32: Full precision (reference implementation)
- FP16: Half-precision floating point (compute capability 6.1+); Tensor Core acceleration requires compute capability 7.0+
- BF16: Brain floating point format for modern architectures (compute capability 8.0+)
- Auto: Automatic precision selection based on device capabilities
- Automatic Detection: Hardware capability detection determines optimal precision
- Device ID Tracking: GPU kernels expose device ID for multi-GPU debugging scenarios (PR #201)
- Capability Querying: Direct access to FP16/BF16 support via `supports_fp16()` and `supports_bf16()` methods (PR #201)
- Graceful Fallback: Automatic CPU fallback when GPU operations fail
- Performance Monitoring: Comprehensive metrics for each precision mode
- Memory Tracking: GPU memory allocation and deallocation monitoring
- Tensor Core Optimization: Leverages WMMA API for maximum performance (CC 7.0+)
- Native CUDA Kernels: Custom PTX kernels optimized for each precision mode
- Matrix Multiplication: Optimized matmul operations with device-specific launch parameters
- Precision Conversion: Efficient FP32↔FP16↔BF16 conversion utilities
- Memory Optimization: Vectorized memory operations and bandwidth optimization
- Error Handling: Comprehensive error propagation with detailed diagnostics
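The capability thresholds listed above (FP16 from compute capability 6.1, BF16 from 8.0) suggest how an `Auto` mode might choose a precision. The sketch below is an illustration only: `Precision`, `cc_supports_fp16`, `cc_supports_bf16`, and `auto_precision` are hypothetical names, and the preference order BF16, then FP16, then FP32 is an assumption rather than documented BitNet-rs behavior.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Precision {
    Fp32,
    Fp16,
    Bf16,
}

/// Compute capability as (major, minor); tuples compare lexicographically.
pub type ComputeCapability = (u32, u32);

/// FP16 threshold as stated in this document: compute capability 6.1+.
pub fn cc_supports_fp16(cc: ComputeCapability) -> bool {
    cc >= (6, 1)
}

/// BF16 threshold as stated in this document: compute capability 8.0+.
pub fn cc_supports_bf16(cc: ComputeCapability) -> bool {
    cc >= (8, 0)
}

/// Assumed preference order: BF16 when available, then FP16, else FP32.
pub fn auto_precision(cc: ComputeCapability) -> Precision {
    if cc_supports_bf16(cc) {
        Precision::Bf16
    } else if cc_supports_fp16(cc) {
        Precision::Fp16
    } else {
        Precision::Fp32
    }
}
```

Comparing `(major, minor)` tuples avoids the common mistake of checking major and minor versions independently, which would wrongly reject a device such as 7.0 against a 6.1 threshold.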
# Test device-aware quantization with strict mode (prevents mock fallbacks)
BITNET_STRICT_MODE=1 cargo test -p bitnet-quantization --no-default-features --features gpu test_dequantize_cpu_and_gpu_paths
# GPU kernel validation with numerical accuracy testing
BITNET_STRICT_MODE=1 cargo test -p bitnet-kernels --no-default-features --features gpu test_gpu_vs_cpu_quantization_accuracy
# Enhanced GPU validation with performance metrics and error handling
BITNET_STRICT_MODE=1 cargo test -p bitnet-kernels --no-default-features --features gpu test_cuda_validation_comprehensive
# Validate quantization accuracy targets (I2S >99.8%, TL1/TL2 >99.6%)
cargo test -p bitnet-quantization --no-default-features --features cpu test_quantization_accuracy_targets

# Test mixed precision with strict mode (no mock GPU fallbacks)
BITNET_STRICT_MODE=1 cargo test -p bitnet-kernels --no-default-features --features gpu test_mixed_precision_kernel_creation
# Test FP16/BF16 matrix multiplication accuracy against FP32 reference
BITNET_STRICT_MODE=1 cargo test -p bitnet-kernels --no-default-features --features gpu test_mixed_precision_matmul_accuracy
# Test precision mode validation and automatic fallback
BITNET_STRICT_MODE=1 cargo test -p bitnet-kernels --no-default-features --features gpu test_precision_mode_validation
# Benchmark mixed precision performance with strict mode (realistic baselines)
BITNET_STRICT_MODE=1 cargo bench -p bitnet-kernels --no-default-features --features gpu --bench mixed_precision_bench
# Test device-aware precision selection and optimization
BITNET_STRICT_MODE=1 cargo test -p bitnet-kernels --no-default-features --features gpu test_precision_detection_optimization

# FFI quantization bridge validation with strict mode
BITNET_STRICT_MODE=1 cargo test -p bitnet-kernels --no-default-features --features ffi test_ffi_quantize_matches_rust
# FFI kernel creation and availability testing
BITNET_STRICT_MODE=1 cargo test -p bitnet-kernels --no-default-features --features ffi test_ffi_kernel_creation
# FFI performance comparison against C++ reference (cross-validation)
BITNET_STRICT_MODE=1 cargo test -p bitnet-kernels --no-default-features --features ffi --release test_performance_comparison_structure
# Cross-validation with C++ reference implementation
BITNET_GGUF="path/to/model.gguf" BITNET_STRICT_MODE=1 cargo run -p xtask -- crossval

# SIMD kernel validation and performance testing
cargo test -p bitnet-quantization --no-default-features --features cpu --test simd_compatibility
cargo bench -p bitnet-quantization --no-default-features --features cpu --bench simd_comparison
# SIMD vs scalar parity testing
cargo test -p bitnet-quantization --no-default-features --features cpu test_i2s_simd_scalar_parity
cargo test -p bitnet-quantization --no-default-features --features cpu test_simd_performance_baseline

BitNet-rs provides comprehensive strict mode controls to eliminate mock inference paths and ensure real quantized computation:
# Enable strict mode for production deployments
BITNET_STRICT_MODE=1 cargo run -p xtask -- infer --model model.gguf --prompt "Test"
# This enables ALL strict mode checks:
# - fail_on_mock: Fails when mock computation detected
# - require_quantization: Requires real I2S/TL1/TL2 kernels
# - validate_performance: Rejects suspicious metrics (>150 tok/s)

# Fail immediately on mock detection (Issue #261 - AC2)
BITNET_STRICT_FAIL_ON_MOCK=1 \
cargo test -p bitnet-inference --no-default-features --features cpu
# Require real quantization kernels (Issue #261 - AC3)
BITNET_STRICT_REQUIRE_QUANTIZATION=1 \
cargo test -p bitnet-quantization --no-default-features --features cpu
# Validate performance metrics (Issue #261 - AC6)
BITNET_STRICT_VALIDATE_PERFORMANCE=1 \
cargo run -p xtask -- benchmark --model model.gguf
# CI enhanced strict mode (Issue #261 - AC6)
CI=1 BITNET_CI_ENHANCED_STRICT=1 BITNET_STRICT_MODE=1 \
cargo test --workspace --no-default-features --features cpu

use bitnet_common::strict_mode::{StrictModeConfig, StrictModeEnforcer};
// Production inference with strict mode
std::env::set_var("BITNET_STRICT_MODE", "1");
let enforcer = StrictModeEnforcer::new_detailed();
// Validate inference path (fails on mock usage)
enforcer.validate_inference_path(&inference_path)?;
// Validate quantization kernel availability
enforcer.validate_kernel_availability(&kernel_scenario)?;
// Validate performance metrics (rejects >150 tok/s as suspicious)
enforcer.validate_performance_metrics(&performance_metrics)?;

Strict mode validates performance metrics against realistic baselines:
| Metric | Threshold | Reasoning |
|---|---|---|
| Throughput | ≤150 tok/s | Values >150 tok/s flag potential mock computation |
| Computation Type | Must be `Real` | Rejects `Mock` computation type |
| Quantization Accuracy | I2S ≥99.8%, TL1/TL2 ≥99.6% | Validates against FP32 reference |
| Device Utilization | GPU >80% | Ensures efficient GPU utilization |
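The first two rows of the table can be expressed as a small validation routine. This is a sketch under assumptions, not `StrictModeEnforcer::validate_performance_metrics`: the types `ComputationType` and `Metrics` and the function `validate_metrics` are hypothetical, and rejecting non-finite throughput is an extra check added here for robustness.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ComputationType {
    Real,
    Mock,
}

pub struct Metrics {
    pub tokens_per_second: f64,
    pub computation_type: ComputationType,
}

/// Throughput above this value is flagged as potential mock computation.
pub const MAX_PLAUSIBLE_TOK_PER_SEC: f64 = 150.0;

pub fn validate_metrics(m: &Metrics) -> Result<(), String> {
    if m.computation_type != ComputationType::Real {
        return Err("computation type must be Real".to_string());
    }
    // Assumption of this sketch: NaN and infinite values are also rejected.
    if !m.tokens_per_second.is_finite() {
        return Err("throughput is not a finite number".to_string());
    }
    if m.tokens_per_second > MAX_PLAUSIBLE_TOK_PER_SEC {
        return Err(format!(
            "suspicious throughput: {:.1} tok/s exceeds {:.0} tok/s",
            m.tokens_per_second, MAX_PLAUSIBLE_TOK_PER_SEC
        ));
    }
    Ok(())
}
```

The threshold is inclusive of exactly 150 tok/s, matching the `≤150 tok/s` wording in the table.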
# .github/workflows/performance-tracking.yml
- name: Run strict mode tests
  env:
    BITNET_STRICT_MODE: "1"
    BITNET_CI_ENHANCED_STRICT: "1"
    BITNET_DETERMINISTIC: "1"
    BITNET_SEED: "42"
  run: |
    cargo test --workspace --features cpu
    cargo run -p xtask -- crossval

BitNet-rs provides comprehensive strict quantization guards to prevent silent FP32 fallback in quantized layers.
This three-tier validation strategy ensures production-grade quantized inference with honest performance claims.
- Purpose: Catch FP32 fallback immediately during development
- Scope: Debug builds only (`#[cfg(debug_assertions)]`)
- Behavior: Panic with detailed error message
- Overhead: Zero in release builds (compiled out)
# Debug builds automatically include assertions
cargo test -p bitnet-inference --no-default-features --features cpu
# If fallback occurs:
# thread 'test' panicked at 'fallback to FP32 in debug mode: layer=blk.0.attn_q, qtype=I2S, reason=kernel_unavailable'

- Purpose: Reject FP32 fallback in production deployments
- Scope: Release builds with `BITNET_STRICT_MODE=1`
- Behavior: Return `Err(BitNetError::StrictMode(...))`
- Overhead: <1% (single boolean check per forward pass)
# Production inference with strict mode
BITNET_STRICT_MODE=1 \
cargo run --release -p bitnet-cli --no-default-features --features cpu -- \
infer \
--model model.gguf \
--prompt "Test" \
--max-tokens 16
# If kernel unavailable: Fails with detailed error
# Otherwise: Succeeds with guaranteed quantized computation

- Purpose: Validate that receipts accurately reflect the computation path
- Scope: Post-inference verification (`xtask verify-receipt`)
- Behavior: Exit code 1 if receipt claims don't match kernel IDs
- Overhead: Zero (offline verification)
# Run benchmark
cargo run -p xtask -- benchmark --model model.gguf --tokens 128
# Verify receipt honesty
cargo run -p xtask -- verify-receipt ci/inference.json
# Checks:
# - compute_path="real" matches actual kernel IDs
# - GPU claims require GPU kernel IDs (gemm_*, i2s_gpu_*)
# - CPU claims require CPU kernel IDs (i2s_gemv, tl1_neon_*, tl2_avx_*)

Primary Strict Mode:
# Enable all strict mode checks
export BITNET_STRICT_MODE=1
# This enables:
# - fail_on_mock: Fails when mock computation detected
# - require_quantization: Requires real I2S/TL1/TL2 kernels
# - enforce_quantized_inference: Rejects FP32 fallback in quantized layers
# - validate_performance: Rejects suspicious metrics (>150 tok/s)

Granular Strict Mode Controls:
# Fail immediately on mock detection (Issue #453 - AC2)
export BITNET_STRICT_FAIL_ON_MOCK=1
# Require real quantization kernels (Issue #453 - AC3)
export BITNET_STRICT_REQUIRE_QUANTIZATION=1
# Validate performance metrics (Issue #453 - AC6)
export BITNET_STRICT_VALIDATE_PERFORMANCE=1
# CI enhanced strict mode (Issue #453 - AC6)
export CI=1
export BITNET_CI_ENHANCED_STRICT=1

Strict mode errors provide actionable context for debugging:
Error: Strict mode: FP32 fallback rejected - qtype=I2S, device=Cuda(0), layer_dims=[2048, 2048], reason=kernel_unavailable
       ^^^^^^^^^^^  ^^^^^^^^^^^^^^^^^^^^^^   ^^^^^^^^^  ^^^^^^^^^^^^^^  ^^^^^^^^^^^^^^^^^^^^^^^  ^^^^^^^^^^^^^^^^^^^^^^^^^
           (1)               (2)                (3)          (4)                  (5)                       (6)

1. `Strict mode`: Indicates strict mode validation failure
2. `FP32 fallback rejected`: System tried to fall back to FP32 but strict mode prevented it
3. `qtype=I2S`: The quantization type that was attempted
4. `device=Cuda(0)`: The device where inference was attempted
5. `layer_dims=[2048, 2048]`: Layer dimensions (in_features × out_features)
6. `reason=kernel_unavailable`: Why fallback was needed
| Reason | Meaning | Solution |
|---|---|---|
| `kernel_unavailable` | Feature not compiled | `cargo build --no-default-features --features cpu` or `--features gpu` |
| `device_mismatch` | Tensor on wrong device | Ensure model loaded on same device as inference |
| `unsupported_dimensions` | Layer size not supported | Check model architecture compatibility |
| `gpu_oom` | GPU out of memory | Reduce batch size or use smaller model |
| `simd_unavailable` | SIMD features not detected | Rebuild with `RUSTFLAGS="-C target-cpu=native"` |

Strict mode extends to receipt validation, ensuring performance claims are backed by evidence:
Quantized Kernel ID Patterns:
- GPU Kernels: `gemm_*`, `wmma_*`, `cuda_*`, `i2s_gpu_*`, `tl1_gpu_*`, `tl2_gpu_*`
- CPU Kernels (I2S): `i2s_gemv`, `i2s_matmul_*`, `quantized_matmul_i2s`
- CPU Kernels (TL1/ARM): `tl1_neon_*`, `tl1_lookup_*`
- CPU Kernels (TL2/x86): `tl2_avx_*`, `tl2_avx512_*`

Fallback Kernel ID Patterns:
- Dequantization: `dequant_*`, `dequant_i2s_to_fp32`
- FP32 Computation: `fp32_matmul`, `fp32_gemm`
- Generic Fallback: `fallback_*`, `scalar_*`
- Mock/Test: `mock_*`, `test_stub`

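The kernel ID patterns above lend themselves to a simple prefix classifier. The sketch below is not the `xtask verify-receipt` implementation: `KernelClass`, `classify`, and `kernels_support_claim` are hypothetical names, a trailing `*` is assumed to mean "any suffix", and the consistency rule (no fallback IDs, plus at least one kernel of the claimed kind) is an assumption about how such a check could work.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KernelClass {
    QuantizedGpu,
    QuantizedCpu,
    Fallback,
    Unknown,
}

const GPU: &[&str] = &["gemm_*", "wmma_*", "cuda_*", "i2s_gpu_*", "tl1_gpu_*", "tl2_gpu_*"];
const CPU: &[&str] = &[
    "i2s_gemv", "i2s_matmul_*", "quantized_matmul_i2s",
    "tl1_neon_*", "tl1_lookup_*", "tl2_avx_*", "tl2_avx512_*",
];
const FALLBACK: &[&str] = &[
    "dequant_*", "fp32_matmul", "fp32_gemm",
    "fallback_*", "scalar_*", "mock_*", "test_stub",
];

/// A trailing `*` matches any suffix; otherwise the ID must match exactly.
fn matches(pattern: &str, id: &str) -> bool {
    match pattern.strip_suffix('*') {
        Some(prefix) => id.starts_with(prefix),
        None => id == pattern,
    }
}

pub fn classify(id: &str) -> KernelClass {
    let hit = |set: &[&str]| set.iter().any(|p| matches(p, id));
    // Fallback patterns are checked first so they can never pass as quantized.
    if hit(FALLBACK) {
        KernelClass::Fallback
    } else if hit(GPU) {
        KernelClass::QuantizedGpu
    } else if hit(CPU) {
        KernelClass::QuantizedCpu
    } else {
        KernelClass::Unknown
    }
}

/// Assumed rule: no fallback kernels, and at least one kernel of the claimed kind.
pub fn kernels_support_claim(claims_gpu: bool, kernels: &[&str]) -> bool {
    let wanted = if claims_gpu { KernelClass::QuantizedGpu } else { KernelClass::QuantizedCpu };
    !kernels.is_empty()
        && kernels.iter().all(|k| classify(k) != KernelClass::Fallback)
        && kernels.iter().any(|k| classify(k) == wanted)
}
```

With this rule, a receipt listing `i2s_gemv` and `quantized_matmul_i2s` supports a CPU claim but not a GPU claim, and any receipt that contains `fp32_matmul` is rejected outright.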
Validation Commands:
# Verify quantized kernels are used
cargo run -p xtask -- verify-receipt --require-quantized-kernels ci/inference.json
# Verify GPU kernels for GPU claims
cargo run -p xtask -- verify-receipt --require-gpu-kernels ci/inference.json
# Validate performance metrics
cargo run -p xtask -- verify-receipt --validate-performance ci/inference.json

use bitnet_common::strict_mode::{StrictModeConfig, StrictModeEnforcer};
use bitnet_common::{Device, QuantizationType, Result};
// Production inference with strict mode
std::env::set_var("BITNET_STRICT_MODE", "1");
let enforcer = StrictModeEnforcer::new_detailed();
// Validate inference path (fails on mock usage)
enforcer.validate_inference_path(&inference_path)?;
// Validate quantization kernel availability
enforcer.validate_kernel_availability(&kernel_scenario)?;
// Validate quantization fallback (Issue #453 - AC3)
enforcer.validate_quantization_fallback(
QuantizationType::I2S,
Device::Cpu,
&[2048, 2048], // layer_dims
"kernel_unavailable"
)?;
// Validate performance metrics (rejects >150 tok/s as suspicious)
enforcer.validate_performance_metrics(&performance_metrics)?;

Integration in Quantized Linear:
// crates/bitnet-inference/src/layers/quantized_linear.rs
async fn forward_i2s(&self, input: &BitNetTensor) -> Result<BitNetTensor> {
    let has_native = bitnet_kernels::is_quantized_kernel_available(
        QuantizationType::I2S,
        self.device,
        (self.in_features, self.out_features)
    );

    // Debug assertions (Tier 1 - Issue #453 - AC1)
    #[cfg(debug_assertions)]
    if !has_native {
        panic!("fallback to FP32 in debug mode: layer={}, qtype=I2S, reason=kernel_unavailable", self.name);
    }

    // Strict mode enforcement (Tier 2 - Issue #453 - AC3)
    if !has_native {
        let strict_mode = StrictModeEnforcer::new();
        if strict_mode.get_config().enforce_quantized_inference {
            return Err(BitNetError::StrictMode(format!(
                "FP32 fallback rejected - qtype=I2S, device={:?}, layer_dims=[{}, {}], reason=kernel_unavailable",
                self.device, self.in_features, self.out_features
            )));
        }
    }

    // Use native quantized matmul (no dequantization)
    if has_native {
        self.quantized_matmul_i2s(&input_2d, provider).await
    } else {
        log::warn!("Using FP32 fallback - should not happen in production");
        self.fallback_i2s_matmul(&input_2d).await
    }
}

Unit Tests with AC Traceability:
# AC1: Debug assertions in QuantizedLinear::forward
cargo test -p bitnet-inference --no-default-features --features cpu \
test_ac1_debug_assert_i2s_fallback -- --nocapture
# AC3: Strict mode rejects FP32 fallback
BITNET_STRICT_MODE=1 \
cargo test -p bitnet-inference --no-default-features --features cpu \
test_ac3_strict_mode_rejects_fallback -- --nocapture
# AC5: 16-token decode in strict mode
BITNET_STRICT_MODE=1 BITNET_DETERMINISTIC=1 BITNET_SEED=42 \
cargo test -p bitnet-inference --no-default-features --features cpu \
test_ac5_16_token_decode_cpu_strict_mode --test strict_quantization_test
# AC6: Receipt validation for quantized computation claims
cargo test -p xtask test_ac6_receipt_quantized_kernels_valid -- --nocapture

Integration Tests:
# CPU strict mode validation
BITNET_STRICT_MODE=1 \
cargo test --no-default-features --features cpu --test strict_quantization_test
# GPU strict mode validation (requires GPU)
BITNET_STRICT_MODE=1 \
cargo test --no-default-features --features gpu --test strict_quantization_test
# Cross-validation with strict mode
BITNET_STRICT_MODE=1 BITNET_DETERMINISTIC=1 BITNET_SEED=42 \
cargo run -p xtask -- crossval

Combine strict mode with deterministic inference for maximum reproducibility:
# Enable strict mode + deterministic inference
export BITNET_STRICT_MODE=1
export BITNET_DETERMINISTIC=1
export BITNET_SEED=42
export RAYON_NUM_THREADS=1
# Run inference
cargo run -p bitnet-cli --no-default-features --features cpu -- \
infer \
--model model.gguf \
--prompt "Test prompt" \
--max-tokens 16 \
--seed 42
# Outputs will be:
# 1. Identical across runs (deterministic)
# 2. Using real quantized kernels (strict mode)
# 3. Verified via receipt (honest computation)

Receipts generated with strict mode include additional validation fields:
{
  "schema_version": "1.0.0",
  "backend": "cpu",
  "compute_path": "real",
  "kernels": [
    "i2s_gemv",
    "quantized_matmul_i2s"
  ],
  "tokens_per_second": 18.5,
  "tokens_generated": 128,
  "environment": {
    "BITNET_STRICT_MODE": "1",
    "BITNET_DETERMINISTIC": "1",
    "BITNET_SEED": "42"
  },
  "timestamp": "2025-10-14T12:34:56.789Z"
}

For more information, see:
- Tutorial: Getting Started with Strict Mode - Learning-oriented introduction
- How-To: Running Strict Mode Validation Workflows - Problem-oriented workflows
- How-To: Verifying Receipt Honesty - Receipt validation guide
- Reference: Environment Variables - Complete strict mode variable documentation
- Reference: Validation Gates - Receipt honesty technical reference
- Explanation: Strict Mode Rationale - Design rationale
- Explanation: Strict Quantization Guards Specification - Complete feature specification
- Development: GPU Development Guide - GPU-specific quantization details
- Development: Build Commands - Build commands for different quantization features
- Architecture: FFI Threading Architecture - FFI bridge details