Update pipeline documentation, both public facing and internal by juaristi22 · Pull Request #644 · PolicyEngine/policyengine-us-data

juaristi22 · 2026-03-27T13:37:46Z

Summary

Comprehensive update to pipeline documentation — both public-facing and internal developer reference. FIxes #643 .

Internal developer reference (`docs/internals/`)

Three new notebooks providing thorough explanations of the calibration pipeline for developers:

data_build_internals.ipynb — Stage 1: PUF cloning, geography assignment (including AGI-conditional routing and the no-collision constraint), and source imputation. Corrected pipeline ordering to match implementation (PUF clone → geography → source imputation). Documents that geography is rederived per-run, not persisted.
calibration_package_internals.ipynb — Stage 2: Matrix construction internals including per-state simulation, clone loop, domain constraints (corrected: constraints come from stratum_constraints in policy_data.db, not target_config.yaml), takeup re-randomization (state precomputation + clone-loop draws), county-dependent variables, COO assembly, target config filtering (clarified: applied post-matrix-build, not during construction), hierarchical uprating, and calibration package serialization with initial weight computation.
optimization_and_local_dataset_assembly_internals.ipynb — Stages 3–4: L0 optimization (fixed sparsity demo from 20→200 records so lambda effect is visible), H5 assembly pipeline (expanded from 11→16 steps matching actual implementation), SPM threshold recalculation, takeup consistency invariant, and diagnostics including validation_results.csv.
README.md — Pipeline orchestration reference with run ID format, step dependency graph, Modal volumes, HuggingFace artifact paths, resume logic. Added file reference tables for calibration/ and modal_app/ with per-file descriptions and notes on legacy/standalone status.

Public-facing documentation

docs/methodology.md — Minor updates to reflect current implementation.
docs/data.md — Updated data source descriptions.

Dead code removed

save_geography() and load_geography() from clone_and_assign.py — defined but never called by any pipeline code. Geography is rederived each run via deterministic seeding, making serialization unnecessary.

Test plan

ruff format --check . passes
Notebooks verified against implementation (cross-referenced with actual code in unified_matrix_builder.py, unified_calibration.py, publish_local_area.py, clone_and_assign.py)
Rebased on latest main (includes PR Improve calibration: AGI-conditional geography, expanded targets, pipeline fixes #671 AGI-conditional assignment, Python 3.14, ORG imputations)

🤖 Generated with Claude Code

…ll diagnostics to HF - docs/methodology.md and docs/data.md updated to match current pipeline - pipeline.py now uploads validation diagnostics after H5 builds complete, in addition to the existing calibration diagnostics upload Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Move docs/calibration_internals.ipynb → docs/internals/calibration_package_internals.ipynb - Add docs/internals/data_build_internals.ipynb: Stage 1 coverage — clone creation with real assign_random_geography() on 20 records, source imputation concept demo, PUF cloning toy walkthrough - Add docs/internals/local_dataset_assembly_internals.ipynb: Stages 3–4 — Hard Concrete L0 math, λ preset comparison, weight expansion reference, diagnostics column guide - Add docs/internals/README.md: navigation index + §9 pipeline orchestration (run ID format, Modal volumes, step dependency graph, resume logic, HuggingFace artifact paths, meta.json structure) - Extend calibration_package_internals with Part 4 (matrix assembly per-state, domain constraints) and Part 5 (takeup randomization cross-stage demo) - All notebooks execute with zero errors under --allow-errors; toy inputs complete in <30s - Add changelog fragment Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

baogorek

Hi @juaristi22 this is great work! Please see comments. I know you're documenting a very malleable product here. I didn't get a chance to go through every line in the Jupyter notebooks, and I don't think I'll need to, but I'd like to see them one more time after you rebase and address the comments. Also, please consider whether you do want to include changes to the python files in here that control the modal job and data build.

baogorek · 2026-04-07T20:35:37Z

docs/methodology.md

 # Methodology

-PolicyEngine constructs its representative household dataset through a multi-stage pipeline. Survey data from the CPS is merged with tax detail from the IRS PUF, stratified, and supplemented with variables from ACS, SIPP, and SCF. The resulting dataset is then cloned to geographic variants, simulated through PolicyEngine US with stochastic take-up, and calibrated via L0-regularized optimization against administrative targets at the national, state, and congressional district levels. The pipeline produces 488 geographically representative H5 datasets.
+PolicyEngine constructs its representative household dataset through a five-step pipeline that runs on Modal, preceded by a prerequisite database build. The database build (`make database`) populates a SQLite store of administrative calibration targets. The five Modal steps are: (1) build datasets — assemble the enhanced microdata from CPS, PUF, ACS, SIPP, and SCF; (2) build package — run PolicyEngine on every clone to construct a sparse calibration matrix; (3) fit weights — find household weights via L0-regularized optimization against the administrative targets; (4) build H5 files — write 488 geographically representative datasets; (5) promote — move the staged files to production on HuggingFace.


FYI I just added this PR today to build the database during the pipeline, which was originally your suggestion.

Not a blocker but it's a little easier for us vim users (all one of us!) to navigate around these files if there are multiple lines, so maybe a line length of 100-120 characters. It will also make the mds a bit easier to read in the terminal.

baogorek · 2026-04-07T20:49:33Z

docs/methodology.md

 - **National preset** (λ_L0 = 1e-4): Retains approximately 50,000 records. Used for the national web app dataset where fast computation is prioritized over geographic granularity.

-The optimizer is Adam with a learning rate of 0.15, running for 100–200 epochs. Training runs on GPU (A100 or T4) via Modal for production builds, or on CPU for local development.
+The optimizer is Adam with a learning rate of 0.15. The default epoch count is 100; production builds typically run 1000-1500 epochs to ensure convergence. Training runs on GPU (A100 or T4) via Modal for production builds, or on CPU for local development.


Up to you whether you want to go ahead and include the commands to run local area and national fits:

python -m policyengine_us_data.calibration.unified_calibration \ --package-path policyengine_us_data/storage/calibration/calibration_package.pkl \ --epochs 1000 \ --beta 0.65 \ --lambda-l0 1e-7 \ --lambda-l2 1e-8 \ --log-freq 500 \ --target-config policyengine_us_data/calibration/target_config.yaml \ --device cpu python -m policyengine_us_data.calibration.unified_calibration \ --package-path policyengine_us_data/storage/calibration/calibration_package.pkl \ --epochs 4000 \ --beta 0.65 \ --lambda-l0 1e-4 \ --lambda-l2 1e-12 \ --log-freq 500 \ --target-config policyengine_us_data/calibration/target_config.yaml \ --device cpu \ --output policyengine_us_data/storage/calibration/national/weights.npy

baogorek · 2026-04-07T20:50:26Z

docs/methodology.md

 ### Output

-The pipeline produces 488 H5 datasets: 51 state files (including DC), 435 congressional district files, a national file, and city files for New York City. Each file is a self-contained PolicyEngine dataset that can be loaded directly into `Microsimulation` for policy analysis.
+The pipeline produces 488 local H5 datasets: 51 state files (including DC), 435 congressional district files, a national file, and city files for New York City. Each file is a self-contained PolicyEngine dataset that can be loaded directly into `Microsimulation` for policy analysis.


We should add a second city file just so we can always say "cities" plural!

baogorek · 2026-04-07T21:05:09Z

docs/methodology.md

+
+### Why 430 clones per household
+
+The pipeline clones each of the ~12,000 stratified households 430 times, producing approximately 5.2 million total records entering calibration. We chose 430 so that the population-weighted random block sampling covers every populated census block in the US with at least one clone in expectation. Fewer clones reduce geographic resolution; more clones increase memory and compute cost proportionally.


I honestly can't remember why it's not exactly 12,000. I think it might be 11,999 even though I'm sure I requested 12k exactly. Not sure if that will confuse anyone in the future.

"We chose 430 so that the population-weighted random block sampling covers every populated census block in the US with at least one clone in expectation." -> I don't think this is true, since there are census blocks in the Yosemite wilderness. Also, according to the code, more populated blocks get more donors, and most recently, richer census blocks get more of the ultra rich donor households.

baogorek · 2026-04-07T21:06:57Z

docs/methodology.md

+
+### Why L0 regularization (not L1 or L2)
+
+L1 and L2 regularization shrink weights toward zero or toward uniform but retain all records with nonzero weight. Running PolicyEngine simulations at scale requires iterating over every nonzero-weight record, so retaining millions of records makes per-area simulation slow. L0 regularization drives most weights to *exactly* zero, producing a sparse weight vector where only a few hundred thousand records carry nonzero weight. The optimizer selects those records to collectively match the administrative targets, making per-area simulation fast while preserving calibration accuracy.


Believe it or not (and statisticians were obsessed with this fact for decades), L1 regularization does in fact zero out variables and is the basis for the LASSO. But it also pushes the non-zero variables (i.e., weights) down, which we don't want. Or, I should say, we want a separate knob of control, which is our L2 knob.

baogorek · 2026-04-07T21:09:47Z

modal_app/data_build.py

@@ -468,49 +468,27 @@ def build_datasets(
            for future in as_completed(futures):
                future.result()  # Raises if script failed



@anth-volk and I have been going gangbusters with modal changes. Are you sure you want them in this PR?

baogorek · 2026-04-07T21:11:18Z

policyengine_us_data/datasets/puf/puf.py

-        from policyengine_us_data.datasets.cps import CPS_2021
-
-        cps = Microsimulation(dataset=CPS_2021)
+    cps = Microsimulation(dataset=CPS_2024)


Nobody likes seeing try / catch blocks deleted than me, but I just want to make sure this is being done for the right reason.

baogorek · 2026-04-07T21:20:36Z

docs/internals/calibration_package_internals.ipynb

+    "_state_per_record = np.array([6, 48, 36, 6, 48, 17, 36, 6])\n",
+    "_cd_per_record = np.array([601, 4801, 3601, 602, 4802, 1701, 3602, 603])\n",
+    "\n",
+    "\n",
+    "@dataclass\n",


underscore variables and @DataClass make this a bit scary, but I see later that you're just building the toy matrix. It's a bit hard for me to understand what's happening in the build. If the matrix is 12 by 24, I wonder if you could print it out and hide the code that made it (24 may be too many). Take this as just off-the-cuff feedback; I don't have answers.

baogorek · 2026-04-07T21:22:58Z

docs/internals/data_build_internals.ipynb

+    "When `household_agi` and `cd_agi_targets` are provided, `assign_random_geography()` uses a two-distribution sampling strategy:\n",
+    "\n",
+    "1. **Identify extreme households** — those at or above the `agi_threshold_pctile` (default 90th percentile) of household AGI.\n",
+    "2. **Build AGI-weighted block probabilities** — `_build_agi_block_probs()` multiplies population block probabilities by CD-level AGI targets: `P_agi(block) = P_pop(block) * AGI_target(CD) / Z`. This makes blocks in high-AGI districts more likely for extreme households.\n",


Check the formula, because it just changed recently.

baogorek · 2026-04-07T21:24:06Z

docs/internals/data_build_internals.ipynb

+    "Implementation:\n",
+    "- Clone 0 draws freely.\n",
+    "- Each subsequent clone checks for collisions against all previous clones and resamples the colliding records, up to 50 retries.\n",
+    "- Residual collisions after 50 retries are accepted (very rare with large block distributions).\n",


I think we fixed the collision problem since we now put household_id in the salt. It would have been done about a month ago.

juaristi22 marked this pull request as draft March 27, 2026 16:47

juaristi22 force-pushed the maria/methodology-docs branch from 0ced478 to 6493e3a Compare April 1, 2026 08:17

juaristi22 and others added 4 commits April 1, 2026 19:18

Add changelog fragment

f9e5965

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

review and update internals docs

4479b78

juaristi22 force-pushed the maria/methodology-docs branch from 6493e3a to 4479b78 Compare April 1, 2026 13:49

juaristi22 marked this pull request as ready for review April 1, 2026 13:49

juaristi22 requested review from anth-volk and baogorek April 1, 2026 13:49

fixing data build order from #674

af3d5b8

baogorek requested changes Apr 7, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update pipeline documentation, both public facing and internal#644

Update pipeline documentation, both public facing and internal#644
juaristi22 wants to merge 5 commits intomainfrom
maria/methodology-docs

juaristi22 commented Mar 27, 2026 •

edited

Loading

Uh oh!

baogorek left a comment

Uh oh!

baogorek Apr 7, 2026

Uh oh!

baogorek Apr 7, 2026

Uh oh!

baogorek Apr 7, 2026

Uh oh!

baogorek Apr 7, 2026

Uh oh!

baogorek Apr 7, 2026

Uh oh!

baogorek Apr 7, 2026

Uh oh!

baogorek Apr 7, 2026

Uh oh!

baogorek Apr 7, 2026

Uh oh!

baogorek Apr 7, 2026

Uh oh!

baogorek Apr 7, 2026

Uh oh!

baogorek Apr 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants


		### Why 430 clones per household

		The pipeline clones each of the ~12,000 stratified households 430 times, producing approximately 5.2 million total records entering calibration. We chose 430 so that the population-weighted random block sampling covers every populated census block in the US with at least one clone in expectation. Fewer clones reduce geographic resolution; more clones increase memory and compute cost proportionally.


		### Why L0 regularization (not L1 or L2)

		L1 and L2 regularization shrink weights toward zero or toward uniform but retain all records with nonzero weight. Running PolicyEngine simulations at scale requires iterating over every nonzero-weight record, so retaining millions of records makes per-area simulation slow. L0 regularization drives most weights to exactly zero, producing a sparse weight vector where only a few hundred thousand records carry nonzero weight. The optimizer selects those records to collectively match the administrative targets, making per-area simulation fast while preserving calibration accuracy.

		@@ -468,49 +468,27 @@ def build_datasets(
		for future in as_completed(futures):
		future.result() # Raises if script failed

Conversation

juaristi22 commented Mar 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Internal developer reference (docs/internals/)

Public-facing documentation

Dead code removed

Test plan

Uh oh!

baogorek left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

juaristi22 commented Mar 27, 2026 •

edited

Loading

Internal developer reference (`docs/internals/`)