Quantization Support

This document describes BitNet-rs quantization formats and device-aware acceleration surfaces.

Claim boundary: feature flags, kernel names, and acceleration surfaces here do not by themselves prove product readiness, speedup, server readiness, fallback behavior, or full residency. Current hardware and model claims must be checked against active model coverage, receipts, status docs, specs, and claim gates.

Supported Quantization Formats

BitNet-rs contains multiple quantization formats with device-aware acceleration surfaces:

I2S - Native Rust Implementation (Issue #261)

Native Rust implementation with device selection and explicit fallback reporting
Device-aware quantization surfaces with feature-gated CUDA kernels and CPU SIMD optimization
Accuracy: Target ≥99.8% correlation with FP32 reference (defined in test fixtures; formal measurement pending)
Performance: Hardware-dependent; SIMD-optimised. QK256 path uses scalar kernels (~0.1 tok/s for 2B models).
2-bit signed quantization with optimized bit-packing (4 values per byte)
Strict Mode: Use BITNET_STRICT_MODE=1 to prevent mock fallbacks and ensure real quantized computation
Real Computation: Native quantized GEMV kernel eliminates FP32 dequantization staging (Issue #261 - AC3)
QuantizedLinear Integration: Replaces standard Linear layers in transformer architecture (Issue #261 - AC5)
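The "4 values per byte" packing can be illustrated with a small sketch. It assumes 2-bit codes in the range 0..=3 stored lowest bits first; the function names and the bit order are illustrative assumptions, not the layout BitNet-rs necessarily uses.

```rust
/// Pack 2-bit codes (each in 0..=3) four to a byte, lowest bits first.
/// The bit order is an assumption made for illustration only.
fn pack_2bit(codes: &[u8]) -> Vec<u8> {
    codes
        .chunks(4)
        .map(|chunk| {
            chunk
                .iter()
                .enumerate()
                .fold(0u8, |byte, (i, &c)| byte | ((c & 0b11) << (2 * i)))
        })
        .collect()
}

/// Unpack `len` 2-bit codes from a buffer produced by `pack_2bit`.
fn unpack_2bit(packed: &[u8], len: usize) -> Vec<u8> {
    (0..len)
        .map(|i| (packed[i / 4] >> (2 * (i % 4))) & 0b11)
        .collect()
}
```

Packing the codes `[0, 1, 2, 3]` yields the single byte `0xE4`, and unpacking restores the original codes.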

GGUF Loader Fallback Boundary

User-facing runtime and proof paths must not silently use the reduced-feature GGUF minimal loader. The enhanced GGUF loader is the default expectation for real inference claims.

BITNET_STRICT_MODE=1 or BITNET_DISABLE_MINIMAL_LOADER=1 fails fast when the enhanced loader cannot parse or validate the model.
BITNET_ALLOW_MINIMAL_LOADER=1 is the explicit compatibility opt-in for the minimal loader. It may initialize missing transformer tensors with compatibility defaults and cannot support correctness or performance claims.
bitnet run --strict-loader sets strict loader mode for CLI proof paths.
bitnet run --allow-mock is a smoke/UX-test escape hatch and enables compatibility fallback only by request.
JSON output from bitnet run --json-out records the loader mode so receipts or adjacent proof artifacts can distinguish enhanced from explicitly requested compatibility_fallback.
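The loader policy above can be read as a small decision function. The sketch below is hypothetical: `LoaderFlags`, `LoaderMode`, and `resolve_loader` are illustrative names, and it assumes that strict settings win over the opt-in when both are set and that an unrequested fallback is an error.

```rust
#[derive(Debug, PartialEq)]
enum LoaderMode {
    Enhanced,
    CompatibilityFallback,
}

/// Hypothetical view of the loader switches described above.
struct LoaderFlags {
    strict: bool,          // BITNET_STRICT_MODE=1 or --strict-loader
    disable_minimal: bool, // BITNET_DISABLE_MINIMAL_LOADER=1
    allow_minimal: bool,   // BITNET_ALLOW_MINIMAL_LOADER=1
}

/// Decide which loader may be used once the enhanced loader has been tried.
fn resolve_loader(flags: &LoaderFlags, enhanced_ok: bool) -> Result<LoaderMode, String> {
    if enhanced_ok {
        return Ok(LoaderMode::Enhanced);
    }
    // Assumption: strict settings take precedence over the opt-in.
    if flags.strict || flags.disable_minimal {
        return Err("enhanced loader failed and the minimal loader is forbidden".to_string());
    }
    if flags.allow_minimal {
        return Ok(LoaderMode::CompatibilityFallback);
    }
    // No silent fallback: without an explicit opt-in, fail.
    Err("enhanced loader failed and the minimal loader was not requested".to_string())
}
```

Recording the returned mode in the JSON output is what lets a receipt distinguish the two cases.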

TL1 - Table Lookup Quantization (ARM Optimized - Issue #261)

Table lookup quantization optimized for ARM NEON architecture (4-bit, 2 elements per byte with nibble packing)
Accuracy: Target ≥99.6% correlation with FP32 reference (defined in test fixtures)
Performance: Hardware-dependent; optimised for ARM NEON.
NEON Improvements: ARM NEON kernel throughput and accuracy improvements added in #988
Device-Aware Selection: Automatic ARM NEON vectorization with scalar fallback
Memory-efficient lookup tables (16-256 entries, cache-friendly)
Parallel processing with configurable block sizes
Real Computation: Direct table lookup matmul without FP32 staging (Issue #261)
Safe LUT Index Calculation: Uses bitnet_kernels::tl_lut::lut_index() with checked arithmetic and overflow protection
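As a rough illustration of nibble packing and table indexing, the sketch below decodes two 4-bit codes per byte through a 16-entry table. It shows only the indexing idea: the nibble order and function names are assumptions, and the real TL1 kernel performs the lookup inside the matmul rather than producing FP32 values first.

```rust
/// Split one byte into two 4-bit codes (low nibble first; order is assumed).
fn unpack_nibbles(byte: u8) -> (u8, u8) {
    (byte & 0x0F, byte >> 4)
}

/// Map packed 4-bit codes through a 16-entry lookup table.
fn lookup_tl1_sketch(packed: &[u8], lut: &[f32; 16]) -> Vec<f32> {
    let mut out = Vec::with_capacity(packed.len() * 2);
    for &byte in packed {
        let (lo, hi) = unpack_nibbles(byte);
        out.push(lut[lo as usize]);
        out.push(lut[hi as usize]);
    }
    out
}
```

A 16-entry `f32` table occupies 64 bytes, which is why tables of this size stay resident in L1 cache.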

TL2 - Advanced Table Lookup (x86 Optimized - Issue #261)

Advanced table lookup quantization optimized for x86 AVX2/AVX-512 (8-bit, 1 element per byte)
Accuracy: Target ≥99.6% correlation with FP32 reference (defined in test fixtures)
Performance: Hardware-dependent; optimised for x86 AVX2/AVX-512.
SIMD Optimization: AVX2 (32-byte) and AVX-512 (64-byte) vectorization
AVX-512 Kernels: Dedicated AVX-512 TL2 kernels added in #997 for 64-byte wide SIMD lanes
Enhanced vectorized operations (256-4096 entry tables) for large tensor processing
CPU feature detection with graceful fallback to scalar implementation
Real Computation: Direct table lookup matmul without FP32 staging (Issue #261)
2-bit Domain: Input quantization stays in the 2-bit domain throughout (fixed in #978)
Safe LUT Index Calculation: Uses bitnet_kernels::tl_lut::lut_index() with checked arithmetic and overflow protection
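Runtime CPU feature detection with a scalar fallback can be sketched with the standard library's `is_x86_feature_detected!` macro. The function name and the returned labels are illustrative; how BitNet-rs actually dispatches is not shown here.

```rust
/// Pick a TL2 code path at runtime; the returned labels are illustrative.
fn select_tl2_path() -> &'static str {
    #[cfg(target_arch = "x86_64")]
    {
        // Prefer the widest vector unit the running CPU reports.
        if std::arch::is_x86_feature_detected!("avx512f") {
            return "avx512";
        }
        if std::arch::is_x86_feature_detected!("avx2") {
            return "avx2";
        }
    }
    // Non-x86 targets and older x86 CPUs fall through to scalar code.
    "scalar"
}
```

Because the check happens at runtime, one binary can run on machines with and without AVX-512.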

I2S (QK256/GGML) - Pure Rust

GGML I2_S format with 256-element blocks (QK_K = 256 per GGML conventions)
Block size: 256 elements
Format: 64 bytes per block (no per-block scales), scales in separate tensor
Support: ✅ Pure Rust (kernel: i2s_qk256::gemv_qk256) - no FFI required
Status: Working (scalar kernels; ~0.1 tok/s for 2B models)
Use case: MS BitNet GGUF models using GGML format
Accuracy: Target ≥99.8% correlation with FP32 reference
Encoding: 2-bit signed quantization with [-2, -1, +1, +2] mapping
Automatic detection: Loader detects QK256 format from tensor sizes
Transparent dispatch: Transformer automatically uses QK256 kernel when weights present
See also: Dual I2_S Flavor Explanation
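The block arithmetic above (256 elements at 2 bits each is 64 bytes) and size-based detection can be sketched as follows. The code order 0..=3 to [-2, -1, +1, +2], the lowest-bits-first layout, and all function names are assumptions for illustration.

```rust
const QK256_BLOCK_ELEMS: usize = 256;
const QK256_BLOCK_BYTES: usize = 64; // 256 elements * 2 bits / 8

/// Packed bytes needed for one row of `cols` weights (last block padded).
fn qk256_row_bytes(cols: usize) -> usize {
    cols.div_ceil(QK256_BLOCK_ELEMS) * QK256_BLOCK_BYTES
}

/// Size-based check: does a tensor's byte length match the QK256 layout?
fn looks_like_qk256(rows: usize, cols: usize, byte_len: usize) -> bool {
    rows.checked_mul(qk256_row_bytes(cols)) == Some(byte_len)
}

/// Decode one packed byte into four weights.
/// Assumed mapping: codes 0..=3 -> [-2, -1, +1, +2], lowest bits first.
fn decode_qk256_byte(byte: u8) -> [i8; 4] {
    const LUT: [i8; 4] = [-2, -1, 1, 2];
    let mut out = [0i8; 4];
    for (i, slot) in out.iter_mut().enumerate() {
        *slot = LUT[((byte >> (2 * i)) & 0b11) as usize];
    }
    out
}
```

A row of 2560 weights, for example, spans 10 blocks and therefore 640 packed bytes.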

IQ2_S - GGML-Compatible

GGML-compatible quantization with 82-byte block layout and 4-level [-2,-1,1,2] mapping

Standard Formats (Planned)

Q4_0, Q5_0, Q8_0, etc. (planned for future releases)

Table Lookup (TL) Helper API

The bitnet_kernels::tl_lut module provides safe, bounds-checked index calculation for TL1/TL2 quantization kernels.

`lut_index` Function

Calculate validated index into table lookup buffer with overflow protection.

Signature:

pub fn lut_index(
    block_idx: usize,
    elem_in_block: usize,
    block_bytes: usize,
    elems_per_block: usize,
    lut_len: usize,
) -> Result<usize>

Parameters:
block_idx: Block index in quantized buffer
elem_in_block: Element position within block (0..elems_per_block)
block_bytes: Size of each block in bytes
elems_per_block: Number of elements per quantized block
lut_len: Total length of LUT buffer (for bounds checking)

Returns: Validated LUT index or error if overflow/out-of-bounds

Safety Guarantees:
Validates elem_in_block < elems_per_block (bounds check)
Uses checked arithmetic to prevent integer overflow
Validates final index < lut_len before returning
100% mutation testing coverage (6/6 mutants killed, Issue #462)
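The three checks listed above can be sketched in a few lines. This is not the `bitnet_kernels` source: `lut_index_sketch` is a hypothetical stand-in that returns a `String` error, and its `elem_in_block / 8` term mirrors the formula quoted in this section's usage example, which the real implementation may compute differently.

```rust
/// Illustrative re-creation of the documented checks (not the real function).
fn lut_index_sketch(
    block_idx: usize,
    elem_in_block: usize,
    block_bytes: usize,
    elems_per_block: usize,
    lut_len: usize,
) -> Result<usize, String> {
    // 1. The element position must lie inside the block.
    if elem_in_block >= elems_per_block {
        return Err(format!("element {elem_in_block} outside block of {elems_per_block}"));
    }
    // 2. Checked arithmetic turns overflow into an error instead of wrapping.
    let base = block_idx
        .checked_mul(block_bytes)
        .ok_or_else(|| "block offset overflows usize".to_string())?;
    let idx = base
        .checked_add(elem_in_block / 8) // assumed element-to-byte step
        .ok_or_else(|| "index overflows usize".to_string())?;
    // 3. The final index must fall inside the LUT buffer.
    if idx >= lut_len {
        return Err(format!("index {idx} outside LUT of length {lut_len}"));
    }
    Ok(idx)
}
```

Each early return corresponds to one of the listed guarantees, which makes the function straightforward to cover with mutation tests.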

Example Usage:

use bitnet_kernels::tl_lut::lut_index;

// Calculate LUT index for block 0, element 0
let idx = lut_index(0, 0, 32, 128, 1024)?;
assert_eq!(idx, 0);

// Calculate LUT index for block 1, element 8
// Formula: block_idx * block_bytes + elem_in_block / 8 = 1 * 32 + 8 / 8 = 33
let idx = lut_index(1, 8, 32, 128, 1024)?;
assert_eq!(idx, 33);

// Bounds check prevents out-of-range access
let result = lut_index(0, 128, 32, 128, 1024);
assert!(result.is_err()); // elem_in_block >= elems_per_block

Testing Commands:

# Run TL LUT helper tests
cargo test -p bitnet-kernels --no-default-features --features cpu tl_lut

# Specific test cases
cargo test -p bitnet-kernels --no-default-features --features cpu test_lut_index_basic
cargo test -p bitnet-kernels --no-default-features --features cpu test_lut_index_overflow_detection
cargo test -p bitnet-kernels --no-default-features --features cpu test_lut_index_boundary_validation

See also: Issue #462 for TL LUT helper implementation and mutation testing results.

Device-Aware Operations

All quantizers support device-aware operations with:

Automatic GPU acceleration: CUDA kernels with performance monitoring (alpha)
Metal acceleration: macOS/iOS GPU via feature = "metal" (#992)
Vulkan compute: Cross-platform GPU via feature = "vulkan" (#993)
Intel oneAPI: Intel CPU/GPU acceleration via feature = "oneapi" (#986)
ROCm support: AMD GPU detection via rocm_available field in DeviceProbe (#995)
Transparent CPU fallback: Graceful degradation with maintained accuracy (SIMD-optimised)
Memory optimization: GPU memory leak detection and efficient allocation
Feature gating: Proper #[cfg(feature = "gpu")] guards for CPU-only builds
Strict Mode Enforcement: BITNET_STRICT_MODE=1 prevents mock fallbacks
FFI Bridge Support: C++ kernel integration for I2S, TL1, and TL2 quantization (requires --features ffi)
Cross-Validation: <5% performance variance from C++ reference implementation
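Feature gating combined with reported CPU fallback might look like the sketch below. The `Backend` enum, `select_backend`, and the reason strings are hypothetical; only the `gpu` feature name comes from this document.

```rust
#[derive(Debug, PartialEq)]
enum Backend {
    Cpu,
    Gpu,
}

/// True only when the crate was compiled with `--features gpu`.
fn gpu_compiled() -> bool {
    cfg!(feature = "gpu")
}

/// Choose a backend and report why a fallback happened, if one did.
fn select_backend(gpu_requested: bool, gpu_present: bool) -> (Backend, Option<&'static str>) {
    if !gpu_requested {
        return (Backend::Cpu, None);
    }
    if !gpu_compiled() {
        return (Backend::Cpu, Some("gpu feature not compiled"));
    }
    if !gpu_present {
        return (Backend::Cpu, Some("no gpu detected"));
    }
    (Backend::Gpu, None)
}
```

Returning the reason alongside the backend keeps the fallback visible to callers, so strict mode or a receipt can act on it rather than having it happen silently.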

FFI Quantization Bridge

The FFI bridge enables gradual migration from C++ to Rust while maintaining functionality:

Quantization Types: Full support for I2S, TL1, and TL2 via C++ kernels
Performance Comparison: Built-in tools to compare FFI vs Rust quantization
Migration Path: Systematic approach to replace C++ kernels with native Rust
Safety: Safe Rust wrappers with proper error handling and memory management
Testing: Comprehensive test suite ensuring FFI/Rust quantization parity

Mixed Precision GPU Acceleration

BitNet-rs provides native CUDA mixed precision support for enhanced GPU performance:

Supported Precision Modes

FP32: Full precision (reference implementation)
FP16: Half-precision floating point (compute capability 6.1+); Tensor Core acceleration additionally requires compute capability 7.0+
BF16: Brain floating point format for modern architectures (compute capability 8.0+)
Auto: Automatic precision selection based on device capabilities

Device-Aware Precision Selection

Automatic Detection: Hardware capability detection determines optimal precision
Device ID Tracking: GPU kernels expose device ID for multi-GPU debugging scenarios (PR #201)
Capability Querying: Direct access to FP16/BF16 support via supports_fp16() and supports_bf16() methods (PR #201)
Graceful Fallback: Automatic CPU fallback when GPU operations fail
Performance Monitoring: Comprehensive metrics for each precision mode
Memory Tracking: GPU memory allocation and deallocation monitoring
Tensor Core Optimization: Leverages WMMA API for maximum performance (CC 7.0+)

Mixed Precision Features

Native CUDA Kernels: Custom PTX kernels optimized for each precision mode
Matrix Multiplication: Optimized matmul operations with device-specific launch parameters
Precision Conversion: Efficient FP32↔FP16↔BF16 conversion utilities
Memory Optimization: Vectorized memory operations and bandwidth optimization
Error Handling: Comprehensive error propagation with detailed diagnostics
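The Auto mode can be pictured as a threshold check on compute capability. The sketch below uses the thresholds listed under Supported Precision Modes and assumes BF16 is preferred over FP16 when both are available; `Precision` and `auto_precision` are illustrative names, not the crate's API.

```rust
#[derive(Debug, PartialEq)]
enum Precision {
    Fp32,
    Fp16,
    Bf16,
}

/// Pick a precision from a CUDA compute capability (major, minor).
fn auto_precision(major: u32, minor: u32) -> Precision {
    // Tuples compare lexicographically, so (7, 5) >= (6, 1) holds.
    let cc = (major, minor);
    if cc >= (8, 0) {
        Precision::Bf16 // assumed preference when BF16 is supported
    } else if cc >= (6, 1) {
        Precision::Fp16
    } else {
        Precision::Fp32
    }
}
```

Under these assumptions a compute capability 7.5 device selects FP16 and an 8.6 device selects BF16.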

Testing Commands

Device-Aware Quantization Testing

# Test device-aware quantization with strict mode (prevents mock fallbacks)
BITNET_STRICT_MODE=1 cargo test -p bitnet-quantization --no-default-features --features gpu test_dequantize_cpu_and_gpu_paths

# GPU kernel validation with numerical accuracy testing
BITNET_STRICT_MODE=1 cargo test -p bitnet-kernels --no-default-features --features gpu test_gpu_vs_cpu_quantization_accuracy

# Enhanced GPU validation with performance metrics and error handling
BITNET_STRICT_MODE=1 cargo test -p bitnet-kernels --no-default-features --features gpu test_cuda_validation_comprehensive

# Validate quantization accuracy targets (I2S ≥99.8%, TL1/TL2 ≥99.6%)
cargo test -p bitnet-quantization --no-default-features --features cpu test_quantization_accuracy_targets

Mixed Precision Testing

# Test mixed precision with strict mode (no mock GPU fallbacks)
BITNET_STRICT_MODE=1 cargo test -p bitnet-kernels --no-default-features --features gpu test_mixed_precision_kernel_creation

# Test FP16/BF16 matrix multiplication accuracy against FP32 reference
BITNET_STRICT_MODE=1 cargo test -p bitnet-kernels --no-default-features --features gpu test_mixed_precision_matmul_accuracy

# Test precision mode validation and automatic fallback
BITNET_STRICT_MODE=1 cargo test -p bitnet-kernels --no-default-features --features gpu test_precision_mode_validation

# Benchmark mixed precision performance with strict mode (realistic baselines)
BITNET_STRICT_MODE=1 cargo bench -p bitnet-kernels --no-default-features --features gpu --bench mixed_precision_bench

# Test device-aware precision selection and optimization
BITNET_STRICT_MODE=1 cargo test -p bitnet-kernels --no-default-features --features gpu test_precision_detection_optimization

FFI Quantization Testing

# FFI quantization bridge validation with strict mode
BITNET_STRICT_MODE=1 cargo test -p bitnet-kernels --no-default-features --features ffi test_ffi_quantize_matches_rust

# FFI kernel creation and availability testing
BITNET_STRICT_MODE=1 cargo test -p bitnet-kernels --no-default-features --features ffi test_ffi_kernel_creation

# FFI performance comparison against C++ reference (cross-validation)
BITNET_STRICT_MODE=1 cargo test -p bitnet-kernels --no-default-features --features ffi --release test_performance_comparison_structure

# Cross-validation with C++ reference implementation
BITNET_GGUF="path/to/model.gguf" BITNET_STRICT_MODE=1 cargo run -p xtask -- crossval

SIMD Testing

# SIMD kernel validation and performance testing
cargo test -p bitnet-quantization --no-default-features --features cpu --test simd_compatibility
cargo bench -p bitnet-quantization --no-default-features --features cpu --bench simd_comparison

# SIMD vs scalar parity testing
cargo test -p bitnet-quantization --no-default-features --features cpu test_i2s_simd_scalar_parity
cargo test -p bitnet-quantization --no-default-features --features cpu test_simd_performance_baseline

Strict Mode Enforcement (Issue #261 - AC2, AC6)

BitNet-rs provides comprehensive strict mode controls to eliminate mock inference paths and ensure real quantized computation:

Primary Strict Mode Configuration

# Enable strict mode for production deployments
BITNET_STRICT_MODE=1 cargo run -p xtask -- infer --model model.gguf --prompt "Test"

# This enables ALL strict mode checks:
# - fail_on_mock: Fails when mock computation detected
# - require_quantization: Requires real I2S/TL1/TL2 kernels
# - validate_performance: Rejects suspicious metrics (>150 tok/s)

Granular Strict Mode Controls

# Fail immediately on mock detection (Issue #261 - AC2)
BITNET_STRICT_FAIL_ON_MOCK=1 \
cargo test -p bitnet-inference --no-default-features --features cpu

# Require real quantization kernels (Issue #261 - AC3)
BITNET_STRICT_REQUIRE_QUANTIZATION=1 \
cargo test -p bitnet-quantization --no-default-features --features cpu

# Validate performance metrics (Issue #261 - AC6)
BITNET_STRICT_VALIDATE_PERFORMANCE=1 \
cargo run -p xtask -- benchmark --model model.gguf

# CI enhanced strict mode (Issue #261 - AC6)
CI=1 BITNET_CI_ENHANCED_STRICT=1 BITNET_STRICT_MODE=1 \
cargo test --workspace --no-default-features --features cpu
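How the primary switch relates to the granular ones can be sketched as a small configuration reader. `StrictConfigSketch` and `strict_config_from` are hypothetical names, and treating only the value `1` as "enabled" is an assumption; the environment variable names are the ones listed above.

```rust
#[derive(Debug, Default, PartialEq)]
struct StrictConfigSketch {
    fail_on_mock: bool,
    require_quantization: bool,
    validate_performance: bool,
}

/// Build the config from any key lookup (for example the process environment).
fn strict_config_from<F>(lookup: F) -> StrictConfigSketch
where
    F: Fn(&str) -> Option<String>,
{
    // Assumption: only the exact value "1" enables a switch.
    let on = |key: &str| lookup(key).as_deref() == Some("1");
    // The primary switch turns on every individual check.
    let all = on("BITNET_STRICT_MODE");
    StrictConfigSketch {
        fail_on_mock: all || on("BITNET_STRICT_FAIL_ON_MOCK"),
        require_quantization: all || on("BITNET_STRICT_REQUIRE_QUANTIZATION"),
        validate_performance: all || on("BITNET_STRICT_VALIDATE_PERFORMANCE"),
    }
}
```

In a real process the lookup would typically be `|k| std::env::var(k).ok()`; taking a closure keeps the function testable without mutating the environment.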

Strict Mode API Usage

use bitnet_common::strict_mode::{StrictModeConfig, StrictModeEnforcer};

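// Production inference with strict mode.
// Note: `std::env::set_var` is `unsafe` in the Rust 2024 edition; exporting the
// variable in the launching environment avoids the call entirely.
std::env::set_var("BITNET_STRICT_MODE", "1");
let enforcer = StrictModeEnforcer::new_detailed();

// `inference_path`, `kernel_scenario`, and `performance_metrics` are assumed
// to be values the caller has already constructed.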

// Validate inference path (fails on mock usage)
enforcer.validate_inference_path(&inference_path)?;

// Validate quantization kernel availability
enforcer.validate_kernel_availability(&kernel_scenario)?;

// Validate performance metrics (rejects >150 tok/s as suspicious)
enforcer.validate_performance_metrics(&performance_metrics)?;

Performance Validation Thresholds

Strict mode validates performance metrics against realistic baselines:

Metric	Threshold	Reasoning
Throughput	≤150 tok/s	Values >150 tok/s flag potential mock computation
Computation Type	Must be `Real`	Rejects `Mock` computation type
Quantization Accuracy	I2S ≥99.8%, TL1/TL2 ≥99.6%	Validates against FP32 reference
Device Utilization	GPU >80%	Ensures efficient GPU utilization
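The first two rows of the table translate into a simple check. The sketch below is illustrative only: `ComputationType`, `validate_metrics`, and the error strings are hypothetical, while the 150 tok/s ceiling is the threshold from the table.

```rust
#[derive(Debug, PartialEq)]
enum ComputationType {
    Real,
    Mock,
}

/// Throughput above this value is treated as a sign of mock computation.
const MAX_PLAUSIBLE_TOK_PER_SEC: f64 = 150.0;

/// Reject mock computation and implausible throughput figures.
fn validate_metrics(tok_per_sec: f64, kind: &ComputationType) -> Result<(), String> {
    if *kind != ComputationType::Real {
        return Err("computation type must be Real".to_string());
    }
    if !tok_per_sec.is_finite() || tok_per_sec < 0.0 {
        return Err(format!("invalid throughput: {tok_per_sec}"));
    }
    if tok_per_sec > MAX_PLAUSIBLE_TOK_PER_SEC {
        return Err(format!("suspicious throughput: {tok_per_sec} tok/s"));
    }
    Ok(())
}
```

The ceiling is a heuristic flag rather than a proof: it catches results that are too fast to be real, not slow results that are wrong.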

CI Integration

# .github/workflows/performance-tracking.yml
- name: Run strict mode tests
  env:
    BITNET_STRICT_MODE: "1"
    BITNET_CI_ENHANCED_STRICT: "1"
    BITNET_DETERMINISTIC: "1"
    BITNET_SEED: "42"
  run: |
    cargo test --workspace --no-default-features --features cpu
    cargo run -p xtask -- crossval

Strict Quantization Guards (Issue #453)

BitNet-rs provides strict quantization guards that prevent silent FP32 fallback in quantized layers. The guards follow a three-tier validation strategy so that quantized-inference and performance claims are backed by the computation that actually ran.

Three-Tier Validation Strategy

Tier 1: Debug Assertions (Development)

Purpose: Catch FP32 fallback immediately during development
Scope: Debug builds only (#[cfg(debug_assertions)])
Behavior: Panic with detailed error message
Overhead: Zero in release builds (compiled out)

# Debug builds automatically include assertions
cargo test -p bitnet-inference --no-default-features --features cpu

# If fallback occurs:
# thread 'test' panicked at 'fallback to FP32 in debug mode: layer=blk.0.attn_q, qtype=I2S, reason=kernel_unavailable'

Tier 2: Strict Mode Enforcement (Production)

Purpose: Reject FP32 fallback in production deployments
Scope: Release builds with BITNET_STRICT_MODE=1
Behavior: Return Err(BitNetError::StrictMode(...))
Overhead: <1% (single boolean check per forward pass)

# Production inference with strict mode
BITNET_STRICT_MODE=1 \
cargo run --release -p bitnet-cli --no-default-features --features cpu -- \
  infer \
  --model model.gguf \
  --prompt "Test" \
  --max-tokens 16

# If kernel unavailable: Fails with detailed error
# Otherwise: Succeeds with guaranteed quantized computation

Tier 3: Receipt Validation (Verification)

Purpose: Validate receipts accurately reflect computation path
Scope: Post-inference verification (xtask verify-receipt)
Behavior: Exit code 1 if receipt claims don't match kernel IDs
Overhead: Zero (offline verification)

# Run benchmark
cargo run -p xtask -- benchmark --model model.gguf --tokens 128

# Verify receipt honesty
cargo run -p xtask -- verify-receipt ci/inference.json

# Checks:
# - compute_path="real" matches actual kernel IDs
# - GPU claims require GPU kernel IDs (gemm_*, i2s_gpu_*)
# - CPU claims require CPU kernel IDs (i2s_gemv, tl1_neon_*, tl2_avx_*)
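The backend checks listed above amount to matching a receipt's claim against its kernel IDs. The sketch below uses only the patterns named in those comments; the function names are hypothetical, and requiring "at least one matching kernel" is an assumption about the exact rule `verify-receipt` applies.

```rust
/// GPU kernel ID patterns named in the checks above.
fn is_gpu_kernel(id: &str) -> bool {
    id.starts_with("gemm_") || id.starts_with("i2s_gpu_")
}

/// CPU kernel ID patterns named in the checks above.
fn is_cpu_kernel(id: &str) -> bool {
    id == "i2s_gemv" || id.starts_with("tl1_neon_") || id.starts_with("tl2_avx_")
}

/// A backend claim is accepted only if a kernel of that kind was recorded.
fn backend_claim_is_honest(claims_gpu: bool, kernels: &[&str]) -> bool {
    if claims_gpu {
        kernels.iter().any(|k| is_gpu_kernel(k))
    } else {
        kernels.iter().any(|k| is_cpu_kernel(k))
    }
}
```

A receipt that claims GPU execution but lists only `i2s_gemv` would fail this check, which is the mismatch the exit code 1 is meant to surface.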

Strict Mode Configuration

Primary Strict Mode:

# Enable all strict mode checks
export BITNET_STRICT_MODE=1

# This enables:
# - fail_on_mock: Fails when mock computation detected
# - require_quantization: Requires real I2S/TL1/TL2 kernels
# - enforce_quantized_inference: Rejects FP32 fallback in quantized layers
# - validate_performance: Rejects suspicious metrics (>150 tok/s)

Granular Strict Mode Controls:

# Fail immediately on mock detection (Issue #453 - AC2)
export BITNET_STRICT_FAIL_ON_MOCK=1

# Require real quantization kernels (Issue #453 - AC3)
export BITNET_STRICT_REQUIRE_QUANTIZATION=1

# Validate performance metrics (Issue #453 - AC6)
export BITNET_STRICT_VALIDATE_PERFORMANCE=1

# CI enhanced strict mode (Issue #453 - AC6)
export CI=1
export BITNET_CI_ENHANCED_STRICT=1

Strict Mode Error Messages

Strict mode errors provide actionable context for debugging:

Error: Strict mode: FP32 fallback rejected - qtype=I2S, device=Cuda(0), layer_dims=[2048, 2048], reason=kernel_unavailable
       ^^^^^^^^^^^  ^^^^^^^^^^^^^^^^^^^^^^   ^^^^^^^^^  ^^^^^^^^^^^^^^  ^^^^^^^^^^^^^^^^^^^^^^^  ^^^^^^^^^^^^^^^^^^^^^^^^^
       (1)          (2)                      (3)        (4)             (5)                      (6)

(1) Strict mode: Indicates strict mode validation failure
(2) FP32 fallback rejected: System tried to fall back to FP32 but strict mode prevented it
(3) qtype=I2S: The quantization type that was attempted
(4) device=Cuda(0): The device where inference was attempted
(5) layer_dims=[2048, 2048]: Layer dimensions (in_features × out_features)
(6) reason=kernel_unavailable: Why fallback was needed

Common Fallback Reasons and Solutions

Reason	Meaning	Solution
`kernel_unavailable`	Feature not compiled	`cargo build --no-default-features --features cpu` or `--features gpu`
`device_mismatch`	Tensor on wrong device	Ensure model loaded on same device as inference
`unsupported_dimensions`	Layer size not supported	Check model architecture compatibility
`gpu_oom`	GPU out of memory	Reduce batch size or use smaller model
`simd_unavailable`	SIMD features not detected	Rebuild with `RUSTFLAGS="-C target-cpu=native"`

Receipt Honesty Validation

Strict mode extends to receipt validation, ensuring performance claims are backed by evidence:

Quantized Kernel ID Patterns:

GPU Kernels: gemm_*, wmma_*, cuda_*, i2s_gpu_*, tl1_gpu_*, tl2_gpu_*
CPU Kernels (I2S): i2s_gemv, i2s_matmul_*, quantized_matmul_i2s
CPU Kernels (TL1/ARM): tl1_neon_*, tl1_lookup_*
CPU Kernels (TL2/x86): tl2_avx_*, tl2_avx512_*

Fallback Kernel ID Patterns:

Dequantization: dequant_*, dequant_i2s_to_fp32
FP32 Computation: fp32_matmul, fp32_gemm
Generic Fallback: fallback_*, scalar_*
Mock/Test: mock_*, test_stub
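The two pattern lists can be applied with plain prefix and exact-name matching. The sketch below is a hypothetical classifier built only from the patterns listed above; checking fallback patterns first is an assumption, made so that an ID such as `fp32_gemm` is never mistaken for a quantized kernel.

```rust
#[derive(Debug, PartialEq)]
enum KernelClass {
    Quantized,
    Fallback,
    Unknown,
}

const QUANTIZED_PREFIXES: &[&str] = &[
    "gemm_", "wmma_", "cuda_", "i2s_gpu_", "tl1_gpu_", "tl2_gpu_",
    "i2s_matmul_", "tl1_neon_", "tl1_lookup_", "tl2_avx_", "tl2_avx512_",
];
const QUANTIZED_EXACT: &[&str] = &["i2s_gemv", "quantized_matmul_i2s"];
const FALLBACK_PREFIXES: &[&str] = &["dequant_", "fallback_", "scalar_", "mock_"];
const FALLBACK_EXACT: &[&str] = &["fp32_matmul", "fp32_gemm", "test_stub"];

/// True if `id` equals an exact name or starts with one of the prefixes.
fn matches_any(id: &str, prefixes: &[&str], exact: &[&str]) -> bool {
    exact.iter().any(|e| *e == id) || prefixes.iter().any(|p| id.starts_with(*p))
}

/// Classify a kernel ID; fallback patterns are checked first (assumption).
fn classify_kernel(id: &str) -> KernelClass {
    if matches_any(id, FALLBACK_PREFIXES, FALLBACK_EXACT) {
        KernelClass::Fallback
    } else if matches_any(id, QUANTIZED_PREFIXES, QUANTIZED_EXACT) {
        KernelClass::Quantized
    } else {
        KernelClass::Unknown
    }
}
```

A receipt check could then require that at least one kernel classifies as `Quantized` and none as `Fallback` before accepting `compute_path="real"`.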

Validation Commands:

# Verify quantized kernels are used
cargo run -p xtask -- verify-receipt --require-quantized-kernels ci/inference.json

# Verify GPU kernels for GPU claims
cargo run -p xtask -- verify-receipt --require-gpu-kernels ci/inference.json

# Validate performance metrics
cargo run -p xtask -- verify-receipt --validate-performance ci/inference.json

Programmatic Usage

use bitnet_common::strict_mode::{StrictModeConfig, StrictModeEnforcer};
use bitnet_common::{Device, QuantizationType, Result};

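// Production inference with strict mode.
// The `inference_path`, `kernel_scenario`, and `performance_metrics` values
// used below are assumed to be constructed by the caller.
std::env::set_var("BITNET_STRICT_MODE", "1");
let enforcer = StrictModeEnforcer::new_detailed();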

// Validate inference path (fails on mock usage)
enforcer.validate_inference_path(&inference_path)?;

// Validate quantization kernel availability
enforcer.validate_kernel_availability(&kernel_scenario)?;

// Validate quantization fallback (Issue #453 - AC3)
enforcer.validate_quantization_fallback(
    QuantizationType::I2S,
    Device::Cpu,
    &[2048, 2048],  // layer_dims
    "kernel_unavailable"
)?;

// Validate performance metrics (rejects >150 tok/s as suspicious)
enforcer.validate_performance_metrics(&performance_metrics)?;

Integration in Quantized Linear:

// crates/bitnet-inference/src/layers/quantized_linear.rs

async fn forward_i2s(&self, input: &BitNetTensor) -> Result<BitNetTensor> {
    let has_native = bitnet_kernels::is_quantized_kernel_available(
        QuantizationType::I2S,
        self.device,
        (self.in_features, self.out_features)
    );

    // Debug assertions (Tier 1 - Issue #453 - AC1)
    #[cfg(debug_assertions)]
    if !has_native {
        panic!("fallback to FP32 in debug mode: layer={}, qtype=I2S, reason=kernel_unavailable", self.name);
    }

    // Strict mode enforcement (Tier 2 - Issue #453 - AC3)
    if !has_native {
        let strict_mode = StrictModeEnforcer::new();
        if strict_mode.get_config().enforce_quantized_inference {
            return Err(BitNetError::StrictMode(format!(
                "FP32 fallback rejected - qtype=I2S, device={:?}, layer_dims=[{}, {}], reason=kernel_unavailable",
                self.device, self.in_features, self.out_features
            )));
        }
    }

    // Use native quantized matmul (no dequantization)
    if has_native {
        self.quantized_matmul_i2s(&input_2d, provider).await
    } else {
        log::warn!("Using FP32 fallback - should not happen in production");
        self.fallback_i2s_matmul(&input_2d).await
    }
}

Testing Strict Mode

Unit Tests with AC Traceability:

# AC1: Debug assertions in QuantizedLinear::forward
cargo test -p bitnet-inference --no-default-features --features cpu \
  test_ac1_debug_assert_i2s_fallback -- --nocapture

# AC3: Strict mode rejects FP32 fallback
BITNET_STRICT_MODE=1 \
cargo test -p bitnet-inference --no-default-features --features cpu \
  test_ac3_strict_mode_rejects_fallback -- --nocapture

# AC5: 16-token decode in strict mode
BITNET_STRICT_MODE=1 BITNET_DETERMINISTIC=1 BITNET_SEED=42 \
cargo test -p bitnet-inference --no-default-features --features cpu \
  test_ac5_16_token_decode_cpu_strict_mode --test strict_quantization_test

# AC6: Receipt validation for quantized computation claims
cargo test -p xtask test_ac6_receipt_quantized_kernels_valid -- --nocapture

Integration Tests:

# CPU strict mode validation
BITNET_STRICT_MODE=1 \
cargo test --no-default-features --features cpu --test strict_quantization_test

# GPU strict mode validation (requires GPU)
BITNET_STRICT_MODE=1 \
cargo test --no-default-features --features gpu --test strict_quantization_test

# Cross-validation with strict mode
BITNET_STRICT_MODE=1 BITNET_DETERMINISTIC=1 BITNET_SEED=42 \
cargo run -p xtask -- crossval

Deterministic Inference with Strict Mode

Combine strict mode with deterministic inference for maximum reproducibility:

# Enable strict mode + deterministic inference
export BITNET_STRICT_MODE=1
export BITNET_DETERMINISTIC=1
export BITNET_SEED=42
export RAYON_NUM_THREADS=1

# Run inference
cargo run -p bitnet-cli --no-default-features --features cpu -- \
  infer \
  --model model.gguf \
  --prompt "Test prompt" \
  --max-tokens 16 \
  --seed 42

# Outputs will be:
# 1. Identical across runs (deterministic)
# 2. Using real quantized kernels (strict mode)
# 3. Verified via receipt (honest computation)

Receipt Schema for Strict Mode

Receipts generated with strict mode include additional validation fields:

{
  "schema_version": "1.0.0",
  "backend": "cpu",
  "compute_path": "real",
  "kernels": [
    "i2s_gemv",
    "quantized_matmul_i2s"
  ],
  "tokens_per_second": 18.5,
  "tokens_generated": 128,
  "environment": {
    "BITNET_STRICT_MODE": "1",
    "BITNET_DETERMINISTIC": "1",
    "BITNET_SEED": "42"
  },
  "timestamp": "2025-10-14T12:34:56.789Z"
}

For more information, see:

Tutorial: Getting Started with Strict Mode - Learning-oriented introduction
How-To: Running Strict Mode Validation Workflows - Problem-oriented workflows
How-To: Verifying Receipt Honesty - Receipt validation guide
Reference: Environment Variables - Complete strict mode variable documentation
Reference: Validation Gates - Receipt honesty technical reference
Explanation: Strict Mode Rationale - Design rationale
Explanation: Strict Quantization Guards Specification - Complete feature specification
Development: GPU Development Guide - GPU-specific quantization details
Development: Build Commands - Build commands for different quantization features
Architecture: FFI Threading Architecture - FFI bridge details
