@martinholmer raised a concern about whether the current "fingerprint" test for reproducibility across runs and machines is sufficient. It is not. This issue and proposed solution, prepared in collaboration with Claude, reflect that conversation with @martinholmer.
Goals
Area weight fingerprint tests should:
- Catch legitimate differences — any real change to code, data, targets, or solver parameters that alters area weights should be detected, even if the change is small (e.g., 1% shift in a small area's weight sum)
- No false positives from numerical noise — different hardware (x86 vs ARM, different SIMD), different BLAS implementations (MKL vs OpenBLAS vs Accelerate), or different operation ordering in algebraically equivalent code should not cause test failures
- Scale to thousands of areas — if we add county weighting, we may have 3,000+ areas; the test must not become slow or produce spurious failures due to sheer number of comparisons
- Be fast — the fingerprint check itself should take seconds, not minutes; I/O (reading weight files) will dominate, not computation
Current approach and its weaknesses
The current fingerprint (in `_compute_fingerprint`) works as follows:
- For each area, read the weight file, round each weight to the nearest integer, sum the rounded values → one integer per area
- Sort areas alphabetically, join as `AK:12345|AL:67890|...`, take the first 16 characters of the SHA-256 hash
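A minimal sketch of that scheme, assuming the function takes a mapping from area code to per-record weights (the names and data structure here are hypothetical, not the actual `_compute_fingerprint` signature):

```python
import hashlib

def compute_fingerprint(area_weights):
    """Sketch of the fingerprint described above (hypothetical signature).

    area_weights maps an area code (e.g. "AK") to an iterable of
    per-record weights.
    """
    parts = []
    for area in sorted(area_weights):
        # Round each weight to the nearest integer, then sum.
        total = sum(round(w) for w in area_weights[area])
        parts.append(f"{area}:{total}")
    # Join as "AK:12345|AL:67890|..." and keep the first 16 hex chars.
    digest = hashlib.sha256("|".join(parts).encode()).hexdigest()
    return digest[:16]
```

Note that two different weight vectors with the same rounded sums collide: `compute_fingerprint({"AK": [1.4, 2.6], "AL": [3.0]})` equals `compute_fingerprint({"AK": [2.0, 2.0], "AL": [3.0]})`, which is the distribution-blindness problem described below.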
Problems:
- Insensitive to small areas: Area weights are often in the range 0.1–8.0. Rounding to integers destroys most of the information. A weight shifting from 1.5 to 2.4 (60% change) is invisible if both round to 2.
- Distribution-blind: Two completely different weight vectors with the same integer sum produce the same fingerprint. Redistribution of weight across records goes undetected.
- Rounding boundary false positives: A weight of 2.4999 on one machine and 2.5001 on another (a ~1e-5 relative difference — pure numerical noise) rounds to 2 vs 3, changing the sum and failing the test.
- Hash provides no diagnostics: When the SHA-256 check fails, you know something changed but not what or where. The per-area integer sum comparison helps, but inherits the insensitivity and boundary problems above.
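The insensitivity and boundary problems can be demonstrated directly with Python's built-in `round` (the actual implementation's rounding rule may differ, e.g. numpy's half-to-even):

```python
# ~1e-4 relative difference (pure numerical noise) straddling the
# rounding boundary flips the integer sum -> false positive.
noise_a, noise_b = 2.4999, 2.5001
assert round(noise_a) != round(noise_b)

# A 60% change (real signal) is invisible because both round to 2
# -> missed detection.
real_a, real_b = 1.5, 2.4
assert round(real_a) == round(real_b)
```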
Context: solver numerical properties
The area weight solver uses Clarabel (interior-point QP) with convergence tolerances of 1e-7 on gap and feasibility. Clarabel uses its own QDLDL factorization (not BLAS), so given identical inputs it aims for cross-machine determinism. However, the inputs to Clarabel (the constraint matrix from Tax-Calculator output) depend on numpy/BLAS, so independent `make data` runs on different machines can produce slightly different inputs, leading to solver path divergence.
Expected cross-machine differences in aggregate weight statistics: O(1e-6) to O(1e-4) relative. Expected differences from a real code/data change: typically > 1% relative. There is a clean separation between noise and signal — the test tolerance just needs to sit between them.
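A per-area comparison with a relative tolerance sitting in that gap might look like the following sketch; the 1e-3 tolerance is a hypothetical choice between the ~1e-4 noise floor and the >1% signal level, not a value from the codebase:

```python
import math

# Hypothetical tolerance: above cross-machine noise (O(1e-4) relative),
# below real-change signal (> 1% relative).
RTOL = 1e-3

def sums_match(expected, actual, rtol=RTOL):
    """Compare per-area weight sums with a relative tolerance.

    expected/actual map area codes to weight sums. Any area added,
    removed, or shifted by more than rtol fails the check.
    """
    if expected.keys() != actual.keys():
        return False
    return all(
        math.isclose(actual[area], expected[area], rel_tol=rtol)
        for area in expected
    )
```

Unlike a hash, a failing comparison here can report exactly which areas diverged and by how much, addressing the diagnostics problem above.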