@martinholmer raised a concern about whether the current "fingerprint" test for reproducibility across runs and machines is sufficient. It is not. This issue and proposed solution, prepared in collaboration with Claude, reflect that conversation with @martinholmer.
Goals
Area weight fingerprint tests should:
- Catch legitimate differences — any real change to code, data, targets, or solver parameters that alters area weights should be detected, even if the change is small (e.g., 1% shift in a small area's weight sum)
- No false positives from numerical noise — different hardware (x86 vs ARM, different SIMD), different BLAS implementations (MKL vs OpenBLAS vs Accelerate), or different operation ordering in algebraically equivalent code should not cause test failures
- Scale to thousands of areas — if we add county weighting, we may have 3,000+ areas; the test must not become slow or produce spurious failures due to sheer number of comparisons
- Be fast — the fingerprint check itself should take seconds, not minutes; I/O (reading weight files) will dominate, not computation
Current approach and its weaknesses
The current fingerprint (in `_compute_fingerprint`) works as follows:
- For each area, read the weight file, round each weight to the nearest integer, sum the rounded values → one integer per area
- Sort areas alphabetically, join as `AK:12345|AL:67890|...`, take the first 16 characters of the SHA-256 hash
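A minimal sketch of that scheme, assuming the function takes a mapping from area code to per-record weights (the names and data structure here are hypothetical, not the actual `_compute_fingerprint` signature):

```python
import hashlib

def compute_fingerprint(area_weights):
    """Sketch of the fingerprint described above (hypothetical signature).

    area_weights maps an area code (e.g. "AK") to an iterable of
    per-record weights.
    """
    parts = []
    for area in sorted(area_weights):
        # Round each weight to the nearest integer, then sum.
        total = sum(round(w) for w in area_weights[area])
        parts.append(f"{area}:{total}")
    # Join as "AK:12345|AL:67890|..." and keep the first 16 hex chars.
    digest = hashlib.sha256("|".join(parts).encode()).hexdigest()
    return digest[:16]
```

Note that two different weight vectors with the same rounded sums collide: `compute_fingerprint({"AK": [1.4, 2.6], "AL": [3.0]})` equals `compute_fingerprint({"AK": [2.0, 2.0], "AL": [3.0]})`, which is the distribution-blindness problem described below.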
Problems:
- Insensitive to small areas: Area weights are often in the range 0.1–8.0. Rounding to integers destroys most of the information. A weight shifting from 1.5 to 2.4 (60% change) is invisible if both round to 2.
- Distribution-blind: Two completely different weight vectors with the same integer sum produce the same fingerprint. Redistribution of weight across records goes undetected.
- Rounding boundary false positives: A weight of 2.4999 on one machine and 2.5001 on another (a ~1e-5 relative difference — pure numerical noise) rounds to 2 vs 3, changing the sum and failing the test.
- Hash provides no diagnostics: When the SHA-256 check fails, you know something changed but not what or where. The per-area integer sum comparison helps, but inherits the insensitivity and boundary problems above.
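The insensitivity and boundary problems can be demonstrated directly with Python's built-in `round` (the actual implementation's rounding rule may differ, e.g. numpy's half-to-even):

```python
# ~1e-4 relative difference (pure numerical noise) straddling the
# rounding boundary flips the integer sum -> false positive.
noise_a, noise_b = 2.4999, 2.5001
assert round(noise_a) != round(noise_b)

# A 60% change (real signal) is invisible because both round to 2
# -> missed detection.
real_a, real_b = 1.5, 2.4
assert round(real_a) == round(real_b)
```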
Context: solver numerical properties
The area weight solver uses Clarabel (interior-point QP) with convergence tolerances of 1e-7 on gap and feasibility. Clarabel uses its own QDLDL factorization (not BLAS), so given identical inputs it aims for cross-machine determinism. However, the inputs to Clarabel (the constraint matrix from Tax-Calculator output) depend on numpy/BLAS, so independent `make data` runs on different machines can produce slightly different inputs, leading to solver path divergence.
Expected cross-machine differences in aggregate weight statistics: O(1e-6) to O(1e-4) relative. Expected differences from a real code/data change: typically > 1% relative. There is a clean separation between noise and signal — the test tolerance just needs to sit between them.
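A per-area comparison with a relative tolerance sitting in that gap might look like the following sketch; the 1e-3 tolerance is a hypothetical choice between the ~1e-4 noise floor and the >1% signal level, not a value from the codebase:

```python
import math

# Hypothetical tolerance: above cross-machine noise (O(1e-4) relative),
# below real-change signal (> 1% relative).
RTOL = 1e-3

def sums_match(expected, actual, rtol=RTOL):
    """Compare per-area weight sums with a relative tolerance.

    expected/actual map area codes to weight sums. Any area added,
    removed, or shifted by more than rtol fails the check.
    """
    if expected.keys() != actual.keys():
        return False
    return all(
        math.isclose(actual[area], expected[area], rel_tol=rtol)
        for area in expected
    )
```

Unlike a hash, a failing comparison here can report exactly which areas diverged and by how much, addressing the diagnostics problem above.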