Quantization Support

This document describes BitNet-rs quantization formats and device-aware acceleration surfaces.

Claim boundary: feature flags, kernel names, and acceleration surfaces here do not by themselves prove product readiness, speedup, server readiness, fallback behavior, or full residency. Current hardware and model claims must be checked against active model coverage, receipts, status docs, specs, and claim gates.

Supported Quantization Formats

BitNet-rs contains multiple quantization formats with device-aware acceleration surfaces:

I2S - Native Rust Implementation (Issue #261)

  • Native Rust implementation with device selection and explicit fallback reporting
  • Device-aware quantization surfaces with feature-gated CUDA kernels and CPU SIMD optimization
  • Accuracy: Target ≥99.8% correlation with FP32 reference (defined in test fixtures; formal measurement pending)
  • Performance: Hardware-dependent; SIMD-optimised. QK256 path uses scalar kernels (~0.1 tok/s for 2B models).
  • 2-bit signed quantization with optimized bit-packing (4 values per byte)
  • Strict Mode: Use BITNET_STRICT_MODE=1 to prevent mock fallbacks and ensure real quantized computation
  • Real Computation: Native quantized GEMV kernel eliminates FP32 dequantization staging (Issue #261 - AC3)
  • QuantizedLinear Integration: Replaces standard Linear layers in transformer architecture (Issue #261 - AC5)

GGUF Loader Fallback Boundary

User-facing runtime and proof paths must not silently use the reduced-feature GGUF minimal loader. The enhanced GGUF loader is the default expectation for real inference claims.

  • BITNET_STRICT_MODE=1 or BITNET_DISABLE_MINIMAL_LOADER=1 fails fast when the enhanced loader cannot parse or validate the model.
  • BITNET_ALLOW_MINIMAL_LOADER=1 is the explicit compatibility opt-in for the minimal loader. It may initialize missing transformer tensors with compatibility defaults and cannot support correctness or performance claims.
  • bitnet run --strict-loader sets strict loader mode for CLI proof paths.
  • bitnet run --allow-mock is a smoke/UX-test escape hatch and enables compatibility fallback only by request.
  • JSON output from bitnet run --json-out records the loader mode so receipts or adjacent proof artifacts can distinguish enhanced from explicitly requested compatibility_fallback.
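The loader rules above can be read as a small decision function. The following is a minimal sketch, not the project's implementation: `LoaderMode`, `LoaderFlags`, `env_flag`, and `resolve_loader` are hypothetical names, and two behaviors are assumptions: the strict/disable flags take precedence over the opt-in, and a failed enhanced load without any opt-in is treated as an error rather than a silent fallback.

```rust
/// Loader outcome labels. `CompatibilityFallback` mirrors the
/// `compatibility_fallback` value named above; the enum is hypothetical.
#[derive(Debug, PartialEq)]
enum LoaderMode {
    Enhanced,
    CompatibilityFallback,
}

/// Flags as they would be read from the environment (hypothetical struct).
struct LoaderFlags {
    strict: bool,          // BITNET_STRICT_MODE=1
    disable_minimal: bool, // BITNET_DISABLE_MINIMAL_LOADER=1
    allow_minimal: bool,   // BITNET_ALLOW_MINIMAL_LOADER=1
}

/// Treat a variable as set only when its value is exactly "1".
fn env_flag(name: &str) -> bool {
    std::env::var(name).map(|v| v == "1").unwrap_or(false)
}

fn flags_from_env() -> LoaderFlags {
    LoaderFlags {
        strict: env_flag("BITNET_STRICT_MODE"),
        disable_minimal: env_flag("BITNET_DISABLE_MINIMAL_LOADER"),
        allow_minimal: env_flag("BITNET_ALLOW_MINIMAL_LOADER"),
    }
}

/// Decide which loader result is acceptable once the enhanced loader has run.
fn resolve_loader(flags: &LoaderFlags, enhanced_ok: bool) -> Result<LoaderMode, String> {
    if enhanced_ok {
        return Ok(LoaderMode::Enhanced);
    }
    // Assumption: strict/disable flags win over the compatibility opt-in.
    if flags.strict || flags.disable_minimal {
        return Err("enhanced loader failed and the minimal loader is disabled".to_string());
    }
    if flags.allow_minimal {
        return Ok(LoaderMode::CompatibilityFallback);
    }
    // Assumption: no explicit opt-in means no silent fallback.
    Err("enhanced loader failed; BITNET_ALLOW_MINIMAL_LOADER=1 is required to fall back".to_string())
}
```

Returning the mode as a value, rather than logging it, keeps it available for whatever writes the JSON output or receipt.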

TL1 - Table Lookup Quantization (ARM Optimized - Issue #261)

  • Table lookup quantization optimized for ARM NEON architecture (4-bit, 2 elements per byte with nibble packing)
  • Accuracy: Target ≥99.6% correlation with FP32 reference (defined in test fixtures)
  • Performance: Hardware-dependent; optimised for ARM NEON.
  • NEON Improvements: ARM NEON kernel throughput and accuracy improvements added in #988
  • Device-Aware Selection: Automatic ARM NEON vectorization with scalar fallback
  • Memory-efficient lookup tables (16-256 entries, cache-friendly)
  • Parallel processing with configurable block sizes
  • Real Computation: Direct table lookup matmul without FP32 staging (Issue #261)
  • Safe LUT Index Calculation: Uses bitnet_kernels::tl_lut::lut_index() with checked arithmetic and overflow protection

TL2 - Advanced Table Lookup (x86 Optimized - Issue #261)

  • Advanced table lookup quantization optimized for x86 AVX2/AVX-512 (8-bit, 1 element per byte)
  • Accuracy: Target ≥99.6% correlation with FP32 reference (defined in test fixtures)
  • Performance: Hardware-dependent; optimised for x86 AVX2/AVX-512.
  • SIMD Optimization: AVX2 (32-byte) and AVX-512 (64-byte) vectorization
  • AVX-512 Kernels: Dedicated AVX-512 TL2 kernels added in #997 for 64-byte wide SIMD lanes
  • Enhanced vectorized operations (256-4096 entry tables) for large tensor processing
  • CPU feature detection with graceful fallback to scalar implementation
  • Real Computation: Direct table lookup matmul without FP32 staging (Issue #261)
  • 2-bit Domain: Input quantization stays in the 2-bit domain throughout (fixed in #978)
  • Safe LUT Index Calculation: Uses bitnet_kernels::tl_lut::lut_index() with checked arithmetic and overflow protection

I2S (QK256/GGML) - Pure Rust

  • GGML I2_S format with 256-element blocks (QK_K = 256 per GGML conventions)
  • Block size: 256 elements
  • Format: 64 bytes per block (no per-block scales), scales in separate tensor
  • Support: ✅ Pure Rust (kernel: i2s_qk256::gemv_qk256) - no FFI required
  • Status: Working (scalar kernels; ~0.1 tok/s for 2B models)
  • Use case: MS BitNet GGUF models using GGML format
  • Accuracy: Target ≥99.8% correlation with FP32 reference
  • Encoding: 2-bit signed quantization with a [-2, -1, +1, +2] mapping
  • Automatic detection: Loader detects QK256 format from tensor sizes
  • Transparent dispatch: Transformer automatically uses QK256 kernel when weights present
  • See also: Dual I2_S Flavor Explanation
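The block arithmetic above (256 elements at 2 bits each, so 64 bytes per block) can be illustrated with a short packing sketch. This is not the `i2s_qk256` kernel code: `pack_2bit`, `unpack_2bit`, and `LEVELS` are hypothetical, and both the code-to-level ordering and the bit order within a byte are assumptions made for illustration.

```rust
/// Hypothetical code-to-level table: codes 0..=3 map to the four levels.
/// The ordering is an assumption, not taken from the GGML format.
const LEVELS: [i8; 4] = [-2, -1, 1, 2];

/// Pack 2-bit codes (each 0..=3) four per byte, lowest bits first (assumed order).
fn pack_2bit(codes: &[u8]) -> Vec<u8> {
    codes
        .chunks(4)
        .map(|chunk| {
            chunk
                .iter()
                .enumerate()
                .fold(0u8, |byte, (i, &code)| byte | ((code & 0b11) << (2 * i)))
        })
        .collect()
}

/// Unpack the first `n` codes and map each one to its level.
fn unpack_2bit(packed: &[u8], n: usize) -> Vec<i8> {
    (0..n)
        .map(|i| {
            let code = (packed[i / 4] >> (2 * (i % 4))) & 0b11;
            LEVELS[code as usize]
        })
        .collect()
}
```

Packing one 256-element block with `pack_2bit` yields a 64-byte buffer, matching the block size listed above; scales are not part of this buffer.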

IQ2_S - GGML-Compatible

  • GGML-compatible quantization with 82-byte block layout and 4-level [-2,-1,1,2] mapping

Standard Formats (Planned)

  • Q4_0, Q5_0, Q8_0, etc. (planned for future releases)

Table Lookup (TL) Helper API

The bitnet_kernels::tl_lut module provides safe, bounds-checked index calculation for TL1/TL2 quantization kernels.

lut_index Function

Calculate validated index into table lookup buffer with overflow protection.

Signature:

pub fn lut_index(
    block_idx: usize,
    elem_in_block: usize,
    block_bytes: usize,
    elems_per_block: usize,
    lut_len: usize,
) -> Result<usize>

Parameters:

  • block_idx: Block index in quantized buffer
  • elem_in_block: Element position within block (0..elems_per_block)
  • block_bytes: Size of each block in bytes
  • elems_per_block: Number of elements per quantized block
  • lut_len: Total length of LUT buffer (for bounds checking)

Returns: Validated LUT index or error if overflow/out-of-bounds

Safety Guarantees:

  • Validates elem_in_block < elems_per_block (bounds check)
  • Uses checked arithmetic to prevent integer overflow
  • Validates final index < lut_len before returning
  • 100% mutation testing coverage (6/6 mutants killed, Issue #462)

Example Usage:

use bitnet_kernels::tl_lut::lut_index;

// Calculate LUT index for block 0, element 0
let idx = lut_index(0, 0, 32, 128, 1024)?;
assert_eq!(idx, 0);

// Calculate LUT index for block 1, element 8
// Formula: 1 * 32 + (8 / 8) = 32 + 1 = 33
let idx = lut_index(1, 8, 32, 128, 1024)?;
assert_eq!(idx, 33);

// Bounds check prevents out-of-range access
let result = lut_index(0, 128, 32, 128, 1024);
assert!(result.is_err()); // elem_in_block >= elems_per_block

Testing Commands:

# Run TL LUT helper tests
cargo test -p bitnet-kernels --no-default-features --features cpu tl_lut

# Specific test cases
cargo test -p bitnet-kernels --no-default-features --features cpu test_lut_index_basic
cargo test -p bitnet-kernels --no-default-features --features cpu test_lut_index_overflow_detection
cargo test -p bitnet-kernels --no-default-features --features cpu test_lut_index_boundary_validation

See also: Issue #462 for TL LUT helper implementation and mutation testing results.
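The three checks listed under Safety Guarantees can be sketched as a standalone function. This is an illustration under stated assumptions, not the code in `bitnet_kernels::tl_lut`: the name `lut_index_sketch` and the `String` error type are hypothetical, and the index formula (`block_idx * block_bytes + elem_in_block / 8`) is inferred from the comment in the usage example rather than from the implementation.

```rust
/// Standalone sketch of a bounds-checked LUT index calculation.
fn lut_index_sketch(
    block_idx: usize,
    elem_in_block: usize,
    block_bytes: usize,
    elems_per_block: usize,
    lut_len: usize,
) -> Result<usize, String> {
    // Check 1: the element must lie inside the block.
    if elem_in_block >= elems_per_block {
        return Err(format!(
            "elem_in_block {elem_in_block} >= elems_per_block {elems_per_block}"
        ));
    }
    // Check 2: checked arithmetic, so overflow becomes an error instead of wrapping.
    let base = block_idx
        .checked_mul(block_bytes)
        .ok_or_else(|| "block offset overflow".to_string())?;
    // Assumed divisor of 8, taken from the `8 / 8` term in the usage example.
    let idx = base
        .checked_add(elem_in_block / 8)
        .ok_or_else(|| "index overflow".to_string())?;
    // Check 3: the final index must fall inside the LUT buffer.
    if idx >= lut_len {
        return Err(format!("index {idx} out of bounds for LUT length {lut_len}"));
    }
    Ok(idx)
}
```

Using `checked_mul` and `checked_add` keeps the behavior identical in debug and release builds, where plain `*` and `+` would panic in one and wrap in the other.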

Device-Aware Operations

All quantizers support device-aware operations with:

  • Automatic GPU acceleration: CUDA kernels with performance monitoring (alpha)
  • Metal acceleration: macOS/iOS GPU via feature = "metal" (#992)
  • Vulkan compute: Cross-platform GPU via feature = "vulkan" (#993)
  • Intel oneAPI: Intel CPU/GPU acceleration via feature = "oneapi" (#986)
  • ROCm support: AMD GPU detection via rocm_available field in DeviceProbe (#995)
  • Transparent CPU fallback: Graceful degradation with maintained accuracy (SIMD-optimised)
  • Memory optimization: GPU memory leak detection and efficient allocation
  • Feature gating: Proper #[cfg(feature = "gpu")] guards for CPU-only builds
  • Strict Mode Enforcement: BITNET_STRICT_MODE=1 prevents mock fallbacks
  • FFI Bridge Support: C++ kernel integration for I2S, TL1, and TL2 quantization (requires --features ffi)
  • Cross-Validation: <5% performance variance from C++ reference implementation
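The feature-gating item above relies on Cargo's conditional compilation. A minimal sketch of the pattern follows; `select_backend` is a hypothetical function used only to show how a `gpu` feature guard and its CPU-only counterpart pair up, and it does not reflect how BitNet-rs actually selects a device.

```rust
/// Compiled only when the crate is built with the `gpu` feature enabled.
#[cfg(feature = "gpu")]
fn select_backend() -> &'static str {
    "gpu"
}

/// CPU-only builds get this definition instead, so no GPU code is compiled in.
#[cfg(not(feature = "gpu"))]
fn select_backend() -> &'static str {
    "cpu"
}
```

Because exactly one of the two definitions exists in any given build, callers do not need their own `cfg` guards.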

FFI Quantization Bridge

The FFI bridge enables gradual migration from C++ to Rust while maintaining functionality:

  • Quantization Types: Full support for I2S, TL1, and TL2 via C++ kernels
  • Performance Comparison: Built-in tools to compare FFI vs Rust quantization
  • Migration Path: Systematic approach to replace C++ kernels with native Rust
  • Safety: Safe Rust wrappers with proper error handling and memory management
  • Testing: Comprehensive test suite ensuring FFI/Rust quantization parity

Mixed Precision GPU Acceleration

BitNet-rs provides native CUDA mixed precision support for enhanced GPU performance:

Supported Precision Modes

  • FP32: Full precision (reference implementation)
  • FP16: Half-precision floating point (compute capability 6.1+); Tensor Core acceleration requires compute capability 7.0+
  • BF16: Brain floating point format for modern architectures (compute capability 8.0+)
  • Auto: Automatic precision selection based on device capabilities

Device-Aware Precision Selection

  • Automatic Detection: Hardware capability detection determines optimal precision
  • Device ID Tracking: GPU kernels expose device ID for multi-GPU debugging scenarios (PR #201)
  • Capability Querying: Direct access to FP16/BF16 support via supports_fp16() and supports_bf16() methods (PR #201)
  • Graceful Fallback: Automatic CPU fallback when GPU operations fail
  • Performance Monitoring: Comprehensive metrics for each precision mode
  • Memory Tracking: GPU memory allocation and deallocation monitoring
  • Tensor Core Optimization: Leverages WMMA API for maximum performance (CC 7.0+)

Mixed Precision Features

  • Native CUDA Kernels: Custom PTX kernels optimized for each precision mode
  • Matrix Multiplication: Optimized matmul operations with device-specific launch parameters
  • Precision Conversion: Efficient FP32↔FP16↔BF16 conversion utilities
  • Memory Optimization: Vectorized memory operations and bandwidth optimization
  • Error Handling: Comprehensive error propagation with detailed diagnostics
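One way to picture the Auto mode is as a threshold check on the device's compute capability. The sketch below uses the FP16 and BF16 thresholds quoted in this section (6.1 and 8.0); the `Precision` enum, the `auto_precision` function, and the preference for BF16 over FP16 when both are available are assumptions, not the library's documented selection policy.

```rust
/// Precision modes named in this section (hypothetical enum).
#[derive(Debug, PartialEq, Clone, Copy)]
enum Precision {
    Fp32,
    Fp16,
    Bf16,
}

/// Pick a precision from a (major, minor) compute capability.
/// Tuples compare lexicographically, so (7, 5) >= (6, 1) holds.
fn auto_precision(compute_capability: (u32, u32)) -> Precision {
    if compute_capability >= (8, 0) {
        // Assumption: BF16 is preferred whenever the device supports it.
        Precision::Bf16
    } else if compute_capability >= (6, 1) {
        Precision::Fp16
    } else {
        Precision::Fp32
    }
}
```

A real selector would more likely consult capability queries such as `supports_fp16()` and `supports_bf16()` than hard-code version numbers.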

Testing Commands

Device-Aware Quantization Testing

# Test device-aware quantization with strict mode (prevents mock fallbacks)
BITNET_STRICT_MODE=1 cargo test -p bitnet-quantization --no-default-features --features gpu test_dequantize_cpu_and_gpu_paths

# GPU kernel validation with numerical accuracy testing
BITNET_STRICT_MODE=1 cargo test -p bitnet-kernels --no-default-features --features gpu test_gpu_vs_cpu_quantization_accuracy

# Enhanced GPU validation with performance metrics and error handling
BITNET_STRICT_MODE=1 cargo test -p bitnet-kernels --no-default-features --features gpu test_cuda_validation_comprehensive

# Validate quantization accuracy targets (I2S >99.8%, TL1/TL2 >99.6%)
cargo test -p bitnet-quantization --no-default-features --features cpu test_quantization_accuracy_targets

Mixed Precision Testing

# Test mixed precision with strict mode (no mock GPU fallbacks)
BITNET_STRICT_MODE=1 cargo test -p bitnet-kernels --no-default-features --features gpu test_mixed_precision_kernel_creation

# Test FP16/BF16 matrix multiplication accuracy against FP32 reference
BITNET_STRICT_MODE=1 cargo test -p bitnet-kernels --no-default-features --features gpu test_mixed_precision_matmul_accuracy

# Test precision mode validation and automatic fallback
BITNET_STRICT_MODE=1 cargo test -p bitnet-kernels --no-default-features --features gpu test_precision_mode_validation

# Benchmark mixed precision performance with strict mode (realistic baselines)
BITNET_STRICT_MODE=1 cargo bench -p bitnet-kernels --no-default-features --features gpu --bench mixed_precision_bench

# Test device-aware precision selection and optimization
BITNET_STRICT_MODE=1 cargo test -p bitnet-kernels --no-default-features --features gpu test_precision_detection_optimization

FFI Quantization Testing

# FFI quantization bridge validation with strict mode
BITNET_STRICT_MODE=1 cargo test -p bitnet-kernels --no-default-features --features ffi test_ffi_quantize_matches_rust

# FFI kernel creation and availability testing
BITNET_STRICT_MODE=1 cargo test -p bitnet-kernels --no-default-features --features ffi test_ffi_kernel_creation

# FFI performance comparison against C++ reference (cross-validation)
BITNET_STRICT_MODE=1 cargo test -p bitnet-kernels --no-default-features --features ffi --release test_performance_comparison_structure

# Cross-validation with C++ reference implementation
BITNET_GGUF="path/to/model.gguf" BITNET_STRICT_MODE=1 cargo run -p xtask -- crossval

SIMD Testing

# SIMD kernel validation and performance testing
cargo test -p bitnet-quantization --no-default-features --features cpu --test simd_compatibility
cargo bench -p bitnet-quantization --no-default-features --features cpu --bench simd_comparison

# SIMD vs scalar parity testing
cargo test -p bitnet-quantization --no-default-features --features cpu test_i2s_simd_scalar_parity
cargo test -p bitnet-quantization --no-default-features --features cpu test_simd_performance_baseline

Strict Mode Enforcement (Issue #261 - AC2, AC6)

BitNet-rs provides comprehensive strict mode controls to eliminate mock inference paths and ensure real quantized computation:

Primary Strict Mode Configuration

# Enable strict mode for production deployments
BITNET_STRICT_MODE=1 cargo run -p xtask -- infer --model model.gguf --prompt "Test"

# This enables ALL strict mode checks:
# - fail_on_mock: Fails when mock computation detected
# - require_quantization: Requires real I2S/TL1/TL2 kernels
# - validate_performance: Rejects suspicious metrics (>150 tok/s)

Granular Strict Mode Controls

# Fail immediately on mock detection (Issue #261 - AC2)
BITNET_STRICT_FAIL_ON_MOCK=1 \
cargo test -p bitnet-inference --no-default-features --features cpu

# Require real quantization kernels (Issue #261 - AC3)
BITNET_STRICT_REQUIRE_QUANTIZATION=1 \
cargo test -p bitnet-quantization --no-default-features --features cpu

# Validate performance metrics (Issue #261 - AC6)
BITNET_STRICT_VALIDATE_PERFORMANCE=1 \
cargo run -p xtask -- benchmark --model model.gguf

# CI enhanced strict mode (Issue #261 - AC6)
CI=1 BITNET_CI_ENHANCED_STRICT=1 BITNET_STRICT_MODE=1 \
cargo test --workspace --no-default-features --features cpu

Strict Mode API Usage

use bitnet_common::strict_mode::{StrictModeConfig, StrictModeEnforcer};

// Production inference with strict mode
std::env::set_var("BITNET_STRICT_MODE", "1");
let enforcer = StrictModeEnforcer::new_detailed();

// Validate inference path (fails on mock usage)
enforcer.validate_inference_path(&inference_path)?;

// Validate quantization kernel availability
enforcer.validate_kernel_availability(&kernel_scenario)?;

// Validate performance metrics (rejects >150 tok/s as suspicious)
enforcer.validate_performance_metrics(&performance_metrics)?;

Performance Validation Thresholds

Strict mode validates performance metrics against realistic baselines:

| Metric | Threshold | Reasoning |
|---|---|---|
| Throughput | ≤150 tok/s | Values >150 tok/s flag potential mock computation |
| Computation Type | Must be Real | Rejects Mock computation type |
| Quantization Accuracy | I2S ≥99.8%, TL1/TL2 ≥99.6% | Validates against FP32 reference |
| Device Utilization | GPU >80% | Ensures efficient GPU utilization |
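The first two thresholds translate directly into a validation routine. The following is a minimal sketch assuming a hypothetical `Metrics` struct and `validate_metrics` function; it is not the signature of `validate_performance_metrics`, and it covers only the throughput and computation-type rows.

```rust
/// Whether the measured run used real kernels (hypothetical enum).
#[derive(Debug, PartialEq)]
enum ComputationType {
    Real,
    Mock,
}

/// Minimal metrics record for illustration (hypothetical struct).
struct Metrics {
    tokens_per_second: f64,
    computation_type: ComputationType,
}

/// Upper bound on plausible throughput, from the threshold list above.
const MAX_PLAUSIBLE_TOK_PER_SEC: f64 = 150.0;

fn validate_metrics(m: &Metrics) -> Result<(), String> {
    if m.computation_type != ComputationType::Real {
        return Err("computation type must be Real".to_string());
    }
    if m.tokens_per_second > MAX_PLAUSIBLE_TOK_PER_SEC {
        return Err(format!(
            "throughput {:.1} tok/s exceeds {:.0} tok/s and is treated as suspicious",
            m.tokens_per_second, MAX_PLAUSIBLE_TOK_PER_SEC
        ));
    }
    Ok(())
}
```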

CI Integration

# .github/workflows/performance-tracking.yml
- name: Run strict mode tests
  env:
    BITNET_STRICT_MODE: "1"
    BITNET_CI_ENHANCED_STRICT: "1"
    BITNET_DETERMINISTIC: "1"
    BITNET_SEED: "42"
  run: |
    cargo test --workspace --no-default-features --features cpu
    cargo run -p xtask -- crossval

Strict Quantization Guards (Issue #453)

BitNet-rs provides comprehensive strict quantization guards to prevent silent FP32 fallback in quantized layers.

This three-tier validation strategy ensures production-grade quantized inference with honest performance claims.

Three-Tier Validation Strategy

Tier 1: Debug Assertions (Development)

  • Purpose: Catch FP32 fallback immediately during development
  • Scope: Debug builds only (#[cfg(debug_assertions)])
  • Behavior: Panic with detailed error message
  • Overhead: Zero in release builds (compiled out)

# Debug builds automatically include assertions
cargo test -p bitnet-inference --no-default-features --features cpu

# If fallback occurs:
# thread 'test' panicked at 'fallback to FP32 in debug mode: layer=blk.0.attn_q, qtype=I2S, reason=kernel_unavailable'

Tier 2: Strict Mode Enforcement (Production)

  • Purpose: Reject FP32 fallback in production deployments
  • Scope: Release builds with BITNET_STRICT_MODE=1
  • Behavior: Return Err(BitNetError::StrictMode(...))
  • Overhead: <1% (single boolean check per forward pass)

# Production inference with strict mode
BITNET_STRICT_MODE=1 \
cargo run --release -p bitnet-cli --no-default-features --features cpu -- \
  infer \
  --model model.gguf \
  --prompt "Test" \
  --max-tokens 16

# If kernel unavailable: Fails with detailed error
# Otherwise: Succeeds with guaranteed quantized computation

Tier 3: Receipt Validation (Verification)

  • Purpose: Validate receipts accurately reflect computation path
  • Scope: Post-inference verification (xtask verify-receipt)
  • Behavior: Exit code 1 if receipt claims don't match kernel IDs
  • Overhead: Zero (offline verification)

# Run benchmark
cargo run -p xtask -- benchmark --model model.gguf --tokens 128

# Verify receipt honesty
cargo run -p xtask -- verify-receipt ci/inference.json

# Checks:
# - compute_path="real" matches actual kernel IDs
# - GPU claims require GPU kernel IDs (gemm_*, i2s_gpu_*)
# - CPU claims require CPU kernel IDs (i2s_gemv, tl1_neon_*, tl2_avx_*)

Strict Mode Configuration

Primary Strict Mode:

# Enable all strict mode checks
export BITNET_STRICT_MODE=1

# This enables:
# - fail_on_mock: Fails when mock computation detected
# - require_quantization: Requires real I2S/TL1/TL2 kernels
# - enforce_quantized_inference: Rejects FP32 fallback in quantized layers
# - validate_performance: Rejects suspicious metrics (>150 tok/s)

Granular Strict Mode Controls:

# Fail immediately on mock detection (Issue #453 - AC2)
export BITNET_STRICT_FAIL_ON_MOCK=1

# Require real quantization kernels (Issue #453 - AC3)
export BITNET_STRICT_REQUIRE_QUANTIZATION=1

# Validate performance metrics (Issue #453 - AC6)
export BITNET_STRICT_VALIDATE_PERFORMANCE=1

# CI enhanced strict mode (Issue #453 - AC6)
export CI=1
export BITNET_CI_ENHANCED_STRICT=1

Strict Mode Error Messages

Strict mode errors provide actionable context for debugging:

Error: Strict mode: FP32 fallback rejected - qtype=I2S, device=Cuda(0), layer_dims=[2048, 2048], reason=kernel_unavailable
       ^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^   ^^^^^^^^^  ^^^^^^^^^^^^^^  ^^^^^^^^^^^^^^^^^^^^^^^  ^^^^^^^^^^^^^^^^^^^^^^^^^
       (1)          (2)                      (3)        (4)             (5)                      (6)
  1. Strict mode: Indicates strict mode validation failure
  2. FP32 fallback rejected: System tried to fall back to FP32 but strict mode prevented it
  3. qtype=I2S: The quantization type that was attempted
  4. device=Cuda(0): The device where inference was attempted
  5. layer_dims=[2048, 2048]: Layer dimensions (in_features × out_features)
  6. reason=kernel_unavailable: Why fallback was needed
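The six fields map onto a single format string. The sketch below reproduces the message layout shown above; the `Device` enum and `fallback_error` helper are hypothetical stand-ins, and assembling the `Strict mode:` prefix inside the helper is an assumption about where that prefix comes from.

```rust
/// Stand-in device type; its Debug output yields `Cuda(0)` or `Cpu`.
#[derive(Debug)]
enum Device {
    Cpu,
    Cuda(usize),
}

/// Build the strict-mode fallback message from its individual fields.
fn fallback_error(qtype: &str, device: &Device, dims: [usize; 2], reason: &str) -> String {
    format!(
        "Strict mode: FP32 fallback rejected - qtype={qtype}, device={device:?}, layer_dims=[{}, {}], reason={reason}",
        dims[0], dims[1]
    )
}
```

Keeping every field in `key=value` form makes the message straightforward to grep or parse in CI logs.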

Common Fallback Reasons and Solutions

| Reason | Meaning | Solution |
|---|---|---|
| kernel_unavailable | Feature not compiled | cargo build --no-default-features --features cpu or --features gpu |
| device_mismatch | Tensor on wrong device | Ensure model loaded on same device as inference |
| unsupported_dimensions | Layer size not supported | Check model architecture compatibility |
| gpu_oom | GPU out of memory | Reduce batch size or use smaller model |
| simd_unavailable | SIMD features not detected | Rebuild with RUSTFLAGS="-C target-cpu=native" |

Receipt Honesty Validation

Strict mode extends to receipt validation, ensuring performance claims are backed by evidence:

Quantized Kernel ID Patterns:

  • GPU Kernels: gemm_*, wmma_*, cuda_*, i2s_gpu_*, tl1_gpu_*, tl2_gpu_*
  • CPU Kernels (I2S): i2s_gemv, i2s_matmul_*, quantized_matmul_i2s
  • CPU Kernels (TL1/ARM): tl1_neon_*, tl1_lookup_*
  • CPU Kernels (TL2/x86): tl2_avx_*, tl2_avx512_*

Fallback Kernel ID Patterns:

  • Dequantization: dequant_*, dequant_i2s_to_fp32
  • FP32 Computation: fp32_matmul, fp32_gemm
  • Generic Fallback: fallback_*, scalar_*
  • Mock/Test: mock_*, test_stub
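The two pattern lists lend themselves to a simple classifier. This is a sketch of the idea rather than the `xtask verify-receipt` logic: `KernelClass`, `matches`, `classify`, and `all_quantized` are hypothetical, a trailing `*` is treated as a prefix wildcard, and rejecting an empty kernel list is an assumption.

```rust
#[derive(Debug, PartialEq)]
enum KernelClass {
    Quantized,
    Fallback,
    Unknown,
}

/// Patterns copied from the lists above; a trailing `*` means "any suffix".
const QUANTIZED: &[&str] = &[
    "gemm_*", "wmma_*", "cuda_*", "i2s_gpu_*", "tl1_gpu_*", "tl2_gpu_*",
    "i2s_gemv", "i2s_matmul_*", "quantized_matmul_i2s",
    "tl1_neon_*", "tl1_lookup_*", "tl2_avx_*", "tl2_avx512_*",
];
const FALLBACK: &[&str] = &[
    "dequant_*", "fp32_matmul", "fp32_gemm", "fallback_*", "scalar_*", "mock_*", "test_stub",
];

/// Match an exact ID, or a prefix when the pattern ends in `*`.
fn matches(pattern: &str, id: &str) -> bool {
    match pattern.strip_suffix('*') {
        Some(prefix) => id.starts_with(prefix),
        None => id == pattern,
    }
}

fn classify(id: &str) -> KernelClass {
    // Fallback patterns are checked first so a fallback ID is never reported as quantized.
    if FALLBACK.iter().any(|p| matches(p, id)) {
        KernelClass::Fallback
    } else if QUANTIZED.iter().any(|p| matches(p, id)) {
        KernelClass::Quantized
    } else {
        KernelClass::Unknown
    }
}

/// True only if the list is non-empty and every kernel ID is a quantized one.
fn all_quantized(kernels: &[&str]) -> bool {
    !kernels.is_empty() && kernels.iter().all(|k| classify(k) == KernelClass::Quantized)
}
```

Treating unrecognized IDs as `Unknown` rather than `Quantized` means a receipt with a new, unlisted kernel name fails the check until the pattern list is updated.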

Validation Commands:

# Verify quantized kernels are used
cargo run -p xtask -- verify-receipt --require-quantized-kernels ci/inference.json

# Verify GPU kernels for GPU claims
cargo run -p xtask -- verify-receipt --require-gpu-kernels ci/inference.json

# Validate performance metrics
cargo run -p xtask -- verify-receipt --validate-performance ci/inference.json

Programmatic Usage

use bitnet_common::strict_mode::{StrictModeConfig, StrictModeEnforcer};
use bitnet_common::{Device, QuantizationType, Result};

// Production inference with strict mode
std::env::set_var("BITNET_STRICT_MODE", "1");
let enforcer = StrictModeEnforcer::new_detailed();

// Validate inference path (fails on mock usage)
enforcer.validate_inference_path(&inference_path)?;

// Validate quantization kernel availability
enforcer.validate_kernel_availability(&kernel_scenario)?;

// Validate quantization fallback (Issue #453 - AC3)
enforcer.validate_quantization_fallback(
    QuantizationType::I2S,
    Device::Cpu,
    &[2048, 2048],  // layer_dims
    "kernel_unavailable"
)?;

// Validate performance metrics (rejects >150 tok/s as suspicious)
enforcer.validate_performance_metrics(&performance_metrics)?;

Integration in Quantized Linear:

// crates/bitnet-inference/src/layers/quantized_linear.rs

async fn forward_i2s(&self, input: &BitNetTensor) -> Result<BitNetTensor> {
    let has_native = bitnet_kernels::is_quantized_kernel_available(
        QuantizationType::I2S,
        self.device,
        (self.in_features, self.out_features)
    );

    // Debug assertions (Tier 1 - Issue #453 - AC1)
    #[cfg(debug_assertions)]
    if !has_native {
        panic!("fallback to FP32 in debug mode: layer={}, qtype=I2S, reason=kernel_unavailable", self.name);
    }

    // Strict mode enforcement (Tier 2 - Issue #453 - AC3)
    if !has_native {
        let strict_mode = StrictModeEnforcer::new();
        if strict_mode.get_config().enforce_quantized_inference {
            return Err(BitNetError::StrictMode(format!(
                "FP32 fallback rejected - qtype=I2S, device={:?}, layer_dims=[{}, {}], reason=kernel_unavailable",
                self.device, self.in_features, self.out_features
            )));
        }
    }

    // Use native quantized matmul (no dequantization).
    // `input_2d` and `provider` are not defined in this excerpt; their setup is omitted here.
    if has_native {
        self.quantized_matmul_i2s(&input_2d, provider).await
    } else {
        log::warn!("Using FP32 fallback - should not happen in production");
        self.fallback_i2s_matmul(&input_2d).await
    }
}

Testing Strict Mode

Unit Tests with AC Traceability:

# AC1: Debug assertions in QuantizedLinear::forward
cargo test -p bitnet-inference --no-default-features --features cpu \
  test_ac1_debug_assert_i2s_fallback -- --nocapture

# AC3: Strict mode rejects FP32 fallback
BITNET_STRICT_MODE=1 \
cargo test -p bitnet-inference --no-default-features --features cpu \
  test_ac3_strict_mode_rejects_fallback -- --nocapture

# AC5: 16-token decode in strict mode
BITNET_STRICT_MODE=1 BITNET_DETERMINISTIC=1 BITNET_SEED=42 \
cargo test -p bitnet-inference --no-default-features --features cpu \
  test_ac5_16_token_decode_cpu_strict_mode --test strict_quantization_test

# AC6: Receipt validation for quantized computation claims
cargo test -p xtask test_ac6_receipt_quantized_kernels_valid -- --nocapture

Integration Tests:

# CPU strict mode validation
BITNET_STRICT_MODE=1 \
cargo test --no-default-features --features cpu --test strict_quantization_test

# GPU strict mode validation (requires GPU)
BITNET_STRICT_MODE=1 \
cargo test --no-default-features --features gpu --test strict_quantization_test

# Cross-validation with strict mode
BITNET_STRICT_MODE=1 BITNET_DETERMINISTIC=1 BITNET_SEED=42 \
cargo run -p xtask -- crossval

Deterministic Inference with Strict Mode

Combine strict mode with deterministic inference for maximum reproducibility:

# Enable strict mode + deterministic inference
export BITNET_STRICT_MODE=1
export BITNET_DETERMINISTIC=1
export BITNET_SEED=42
export RAYON_NUM_THREADS=1

# Run inference
cargo run -p bitnet-cli --no-default-features --features cpu -- \
  infer \
  --model model.gguf \
  --prompt "Test prompt" \
  --max-tokens 16 \
  --seed 42

# Outputs will be:
# 1. Identical across runs (deterministic)
# 2. Using real quantized kernels (strict mode)
# 3. Verified via receipt (honest computation)

Receipt Schema for Strict Mode

Receipts generated with strict mode include additional validation fields:

{
  "schema_version": "1.0.0",
  "backend": "cpu",
  "compute_path": "real",
  "kernels": [
    "i2s_gemv",
    "quantized_matmul_i2s"
  ],
  "tokens_per_second": 18.5,
  "tokens_generated": 128,
  "environment": {
    "BITNET_STRICT_MODE": "1",
    "BITNET_DETERMINISTIC": "1",
    "BITNET_SEED": "42"
  },
  "timestamp": "2025-10-14T12:34:56.789Z"
}

For more information, see: