Merged
1 change: 1 addition & 0 deletions Makefile
Original file line number Diff line number Diff line change
@@ -20,6 +20,7 @@ all: data test

format:
	ruff format .
	mdformat --wrap 100 docs/

test:
	pytest
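The new `mdformat --wrap 100 docs/` step reflows Markdown prose to 100 columns, which is what produces the re-wrapped paragraphs in the doc files changed in this PR. A rough stdlib illustration of the wrapping behavior (`textwrap` is only a stand-in here; unlike mdformat it is not Markdown-aware, and the sample paragraph is just an excerpt used for demonstration):

```python
import textwrap

# A long docs sentence; `mdformat --wrap 100` reflows prose like this to a
# 100-column limit. textwrap.fill collapses whitespace and re-breaks lines,
# but knows nothing about Markdown structure (lists, code fences, etc.).
paragraph = (
    "We present a methodology for creating enhanced microsimulation datasets by combining "
    "the Current Population Survey (CPS) with the IRS Public Use File (PUF), preserving "
    "distributional characteristics while maintaining household composition."
)
wrapped = textwrap.fill(paragraph, width=100)
print(wrapped)
```

Every emitted line is at most 100 characters and the word sequence is unchanged, which is the property the Makefile target enforces across `docs/`.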
@@ -0,0 +1 @@
Add `docs/internals/` developer reference: three notebooks covering all nine pipeline stages (Stage 1 data build, Stage 2 calibration matrix assembly, Stages 3–4 L0 optimization and H5 assembly) plus a README with pipeline orchestration reference, run ID format, Modal volume layout, and HuggingFace artifact paths.
1 change: 1 addition & 0 deletions changelog.d/update-methodology-docs.changed.md
@@ -0,0 +1 @@
Update public-facing methodology and data documentation to reflect the current pipeline implementation; pipeline now uploads validation diagnostics to HuggingFace after H5 builds complete.
8 changes: 7 additions & 1 deletion docs/README.md
@@ -5,10 +5,12 @@ This project uses [MyST Markdown](https://mystmd.org/) for documentation.
## Building Locally

### Requirements

- Python 3.14+ with dev dependencies: `uv pip install -e .[dev] --system`
- Node.js 20+ (required by MyST)

### Commands

```bash
make documentation # Build static HTML files
make documentation-serve # Serve locally on http://localhost:8080
@@ -21,7 +23,8 @@ make documentation-serve # Serve locally on http://localhost:8080
- `_build/html/` - **Static HTML files (use for GitHub Pages deployment)**
- `_build/site/` - Dynamic content for `myst start` development server only

**GitHub Pages must deploy `_build/html/`**, not `_build/site/`. The `_build/site/` directory contains JSON files for MyST's development server and will result in a blank page on GitHub Pages.
**GitHub Pages must deploy `_build/html/`**, not `_build/site/`. The `_build/site/` directory
contains JSON files for MyST's development server and will result in a blank page on GitHub Pages.

## GitHub Pages Deployment

@@ -33,14 +36,17 @@ make documentation-serve # Serve locally on http://localhost:8080
## Troubleshooting

**Blank page after deployment:**

- Check that workflow deploys `folder: docs/_build/html` (not `_build/site`)
- Wait 5-10 minutes for GitHub Pages propagation
- Hard refresh browser (Ctrl+Shift+R / Cmd+Shift+R)

**Build fails in CI:**

- Ensure Node.js setup step exists in workflow (MyST requires Node.js)
- Never add timeouts or `|| true` to build commands - they mask failures

**Missing index.html:**

- MyST auto-generates index.html in `_build/html/`
- Do not create manual index.html in docs/
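The troubleshooting checks above (deploy `_build/html`, make sure `index.html` exists) can be captured in a small pre-deploy helper. The function name and default path are hypothetical, not part of the repo's actual tooling:

```python
from pathlib import Path

def pages_ready(build_dir: str = "docs/_build/html") -> bool:
    """True when the MyST static build has the entry point GitHub Pages needs.

    MyST writes index.html into _build/html/; _build/site/ holds JSON for the
    dev server and would render as a blank page if deployed by mistake.
    """
    return (Path(build_dir) / "index.html").is_file()
```

A CI step could call this after `make documentation` and fail the workflow when it returns `False`, instead of discovering a blank page after deployment.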
23 changes: 14 additions & 9 deletions docs/abstract.md
@@ -1,11 +1,16 @@
# Abstract

We present a methodology for creating enhanced microsimulation datasets by combining the
Current Population Survey (CPS) with the IRS Public Use File (PUF). Our approach uses
quantile regression forests to impute 67 tax variables from the PUF onto CPS records,
preserving distributional characteristics while maintaining household composition and member
relationships. The imputation process alone does not guarantee consistency with official
statistics, necessitating a reweighting step to align the combined dataset with known
population totals and administrative benchmarks. We apply a reweighting algorithm that calibrates the dataset to 2,813 targets from the IRS Statistics of Income, Census population projections, Congressional Budget Office benefit program estimates, Treasury expenditure data, Joint Committee on Taxation tax expenditure estimates, healthcare spending patterns, and other benefit program costs. The reweighting employs dropout-regularized gradient descent optimization to ensure consistency with administrative benchmarks. The dataset maintains the CPS's demographic detail and geographic granularity while
incorporating tax reporting data from administrative sources. We release the enhanced
dataset, source code, and documentation to support policy analysis.
We present a methodology for creating enhanced microsimulation datasets by combining the Current
Population Survey (CPS) with the IRS Public Use File (PUF). Our approach uses quantile regression
forests to impute 67 tax variables from the PUF onto CPS records, preserving distributional
characteristics while maintaining household composition and member relationships. The imputation
process alone does not guarantee consistency with official statistics, necessitating a reweighting
step to align the combined dataset with known population totals and administrative benchmarks. We
apply a reweighting algorithm that calibrates the dataset to 2,813 targets from the IRS Statistics
of Income, Census population projections, Congressional Budget Office benefit program estimates,
Treasury expenditure data, Joint Committee on Taxation tax expenditure estimates, healthcare
spending patterns, and other benefit program costs. The reweighting employs dropout-regularized
gradient descent optimization to ensure consistency with administrative benchmarks. The dataset
maintains the CPS's demographic detail and geographic granularity while incorporating tax reporting
data from administrative sources. We release the enhanced dataset, source code, and documentation to
support policy analysis.
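The dropout-regularized gradient descent reweighting described in the abstract can be sketched as follows. This is a toy illustration, not the production calibration code: the estimate matrix `M`, the targets `t`, the learning rate, and all dimensions are synthetic stand-ins (the real pipeline calibrates CPS weights against 2,813 administrative targets, typically with an autodiff framework rather than hand-written NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy calibration problem: 200 households, 6 targets (all synthetic).
# Row i of M gives each household's contribution to target i, so M @ w
# is the vector of weighted estimates for weight vector w.
n_households, n_targets = 200, 6
M = rng.uniform(0.0, 1.0, (n_targets, n_households))
t = M @ rng.uniform(0.5, 2.0, n_households)  # targets consistent by construction

log_w = np.zeros(n_households)  # optimize log-weights so weights stay positive
lr, dropout = 0.001, 0.1

def loss(log_w):
    return 0.5 * np.mean((M @ np.exp(log_w) - t) ** 2)

initial = loss(log_w)
for _ in range(2000):
    w = np.exp(log_w)
    grad = (M.T @ (M @ w - t)) * w / n_targets   # chain rule through exp
    keep = rng.uniform(size=n_households) > dropout
    log_w -= lr * grad * keep                    # dropout: freeze ~10% of weights each step

final = loss(log_w)
```

Randomly masking a subset of weight updates each step plays the regularizing role the abstract attributes to dropout: no single household's weight can absorb the full adjustment, which discourages extreme weights while the calibration error still falls.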
10 changes: 8 additions & 2 deletions docs/appendix.md
@@ -4,7 +4,8 @@

### A.1 Quantile Regression Forest Implementation

The following code demonstrates the implementation of Quantile Regression Forests for variable imputation:
The following code demonstrates the implementation of Quantile Regression Forests for variable
imputation:

```python
from quantile_forest import RandomForestQuantileRegressor
@@ -49,6 +50,7 @@ for iteration in range(5000):
#### Variables Imputed from IRS Public Use File (67 variables)

**Income Variables:**

- employment_income
- partnership_s_corp_income
- social_security
@@ -75,6 +77,7 @@ for iteration in range(5000):
- salt_refund_income

**Deductions and Adjustments:**

- interest_deduction
- unreimbursed_business_employee_expenses
- pre_tax_contributions
@@ -92,6 +95,7 @@ for iteration in range(5000):
- deductible_mortgage_interest

**Tax Credits:**

- cdcc_relevant_expenses
- foreign_tax_credit
- american_opportunity_credit
@@ -104,6 +108,7 @@ for iteration in range(5000):
- other_credits

**Qualified Business Income Variables:**

- w2_wages_from_qualified_business
- unadjusted_basis_qualified_property
- business_is_sstb
@@ -118,6 +123,7 @@ for iteration in range(5000):
- self_employment_income_would_be_qualified

**Other Tax Variables:**

- traditional_ira_contributions
- qualified_tuition_expenses
- casualty_loss
@@ -137,4 +143,4 @@ for iteration in range(5000):
#### Variables Imputed from American Community Survey (2 variables)

- rent
- real_estate_taxes
- real_estate_taxes
32 changes: 26 additions & 6 deletions docs/background.md
@@ -2,16 +2,36 @@

## The Microsimulation Landscape

Tax and benefit microsimulation models play a role in policy analysis by projecting the distributional and revenue impacts of proposed reforms. Institutions maintaining these models include government agencies like the Congressional Budget Office (CBO), Joint Committee on Taxation (JCT), and Treasury's Office of Tax Analysis (OTA), as well as non-governmental organizations including the Urban-Brookings Tax Policy Center (TPC), Tax Foundation, Penn Wharton Budget Model (PWBM), Institute on Taxation and Economic Policy (ITEP), Yale Budget Lab, and the open-source Policy Simulation Library (PSL). Each model serves specific institutional needs but faces common data challenges.

The core challenges these models face stem from the tradeoff between data comprehensiveness and accessibility. Administrative tax data provides income reporting but lacks the household context that models need to analyze benefit programs and family-level impacts {cite:p}`sabelhaus2020`. Survey data captures household relationships and program participation but suffers from income underreporting that worsens at higher income levels {cite:p}`meyer2021`. The need to protect taxpayer privacy limits data availability because administrators cannot publicly release microdata.
Tax and benefit microsimulation models play a role in policy analysis by projecting the
distributional and revenue impacts of proposed reforms. Institutions maintaining these models
include government agencies like the Congressional Budget Office (CBO), Joint Committee on Taxation
(JCT), and Treasury's Office of Tax Analysis (OTA), as well as non-governmental organizations
including the Urban-Brookings Tax Policy Center (TPC), Tax Foundation, Penn Wharton Budget Model
(PWBM), Institute on Taxation and Economic Policy (ITEP), Yale Budget Lab, and the open-source
Policy Simulation Library (PSL). Each model serves specific institutional needs but faces common
data challenges.

The core challenges these models face stem from the tradeoff between data comprehensiveness and
accessibility. Administrative tax data provides income reporting but lacks the household context
that models need to analyze benefit programs and family-level impacts {cite:p}`sabelhaus2020`.
Survey data captures household relationships and program participation but suffers from income
underreporting that worsens at higher income levels {cite:p}`meyer2021`. The need to protect
taxpayer privacy limits data availability because administrators cannot publicly release microdata.

## Data Enhancement Approaches

Different microsimulation models use various approaches to enhance their underlying data:

Government models (CBO, JCT, Treasury) have access to confidential administrative data but cannot share their enhanced microdata. Non-governmental models work with public data, leading to various enhancement strategies. Some organizations use proprietary extracts of tax returns, while others enhance survey data with various methods.
Government models (CBO, JCT, Treasury) have access to confidential administrative data but cannot
share their enhanced microdata. Non-governmental models work with public data, leading to various
enhancement strategies. Some organizations use proprietary extracts of tax returns, while others
enhance survey data with various methods.

Our enhanced dataset provides an open-source methodology with state identifiers and calibration to state-level targets. This enables analysis of federal-state tax interactions. Researchers can use the dataset with PolicyEngine or other microsimulation models.
Our enhanced dataset provides an open-source methodology with state identifiers and calibration to
state-level targets. This enables analysis of federal-state tax interactions. Researchers can use
the dataset with PolicyEngine or other microsimulation models.

The open-source nature promotes methodological transparency. The modular design allows researchers to substitute alternative imputation or calibration methods while maintaining the overall framework. Regular updates as new CPS and administrative data become available ensure the dataset remains current.
The open-source nature promotes methodological transparency. The modular design allows researchers
to substitute alternative imputation or calibration methods while maintaining the overall framework.
Regular updates as new CPS and administrative data become available ensure the dataset remains
current.