Merged
1 change: 1 addition & 0 deletions Makefile
Original file line number Diff line number Diff line change
@@ -20,6 +20,7 @@ all: data test

format:
	ruff format .
	mdformat --wrap 100 docs/

test:
	pytest
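The new `mdformat --wrap 100 docs/` step reflows Markdown prose to 100 columns, which is what produces the re-wrapped paragraphs in the doc files changed in this PR. A rough stdlib illustration of the wrapping behavior (`textwrap` is only a stand-in here; unlike mdformat it is not Markdown-aware, and the sample paragraph is just an excerpt used for demonstration):

```python
import textwrap

# A long docs sentence; `mdformat --wrap 100` reflows prose like this to a
# 100-column limit. textwrap.fill collapses whitespace and re-breaks lines,
# but knows nothing about Markdown structure (lists, code fences, etc.).
paragraph = (
    "We present a methodology for creating enhanced microsimulation datasets by combining "
    "the Current Population Survey (CPS) with the IRS Public Use File (PUF), preserving "
    "distributional characteristics while maintaining household composition."
)
wrapped = textwrap.fill(paragraph, width=100)
print(wrapped)
```

Every emitted line is at most 100 characters and the word sequence is unchanged, which is the property the Makefile target enforces across `docs/`.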
@@ -0,0 +1 @@
Add `docs/internals/` developer reference: three notebooks covering all nine pipeline stages (Stage 1 data build, Stage 2 calibration matrix assembly, Stages 3–4 L0 optimization and H5 assembly) plus a README with pipeline orchestration reference, run ID format, Modal volume layout, and HuggingFace artifact paths.
1 change: 1 addition & 0 deletions changelog.d/update-methodology-docs.changed.md
@@ -0,0 +1 @@
Update public-facing methodology and data documentation to reflect the current pipeline implementation; pipeline now uploads validation diagnostics to HuggingFace after H5 builds complete.
8 changes: 7 additions & 1 deletion docs/README.md
@@ -5,10 +5,12 @@ This project uses [MyST Markdown](https://mystmd.org/) for documentation.
## Building Locally

### Requirements

- Python 3.14+ with dev dependencies: `uv pip install -e .[dev] --system`
- Node.js 20+ (required by MyST)

### Commands

```bash
make documentation # Build static HTML files
make documentation-serve # Serve locally on http://localhost:8080
@@ -21,7 +23,8 @@ make documentation-serve # Serve locally on http://localhost:8080
- `_build/html/` - **Static HTML files (use for GitHub Pages deployment)**
- `_build/site/` - Dynamic content for `myst start` development server only

**GitHub Pages must deploy `_build/html/`**, not `_build/site/`. The `_build/site/` directory contains JSON files for MyST's development server and will result in a blank page on GitHub Pages.
**GitHub Pages must deploy `_build/html/`**, not `_build/site/`. The `_build/site/` directory
contains JSON files for MyST's development server and will result in a blank page on GitHub Pages.

## GitHub Pages Deployment

@@ -33,14 +36,17 @@ make documentation-serve # Serve locally on http://localhost:8080
## Troubleshooting

**Blank page after deployment:**

- Check that workflow deploys `folder: docs/_build/html` (not `_build/site`)
- Wait 5-10 minutes for GitHub Pages propagation
- Hard refresh browser (Ctrl+Shift+R / Cmd+Shift+R)

**Build fails in CI:**

- Ensure Node.js setup step exists in workflow (MyST requires Node.js)
- Never add timeouts or `|| true` to build commands - they mask failures

**Missing index.html:**

- MyST auto-generates index.html in `_build/html/`
- Do not create manual index.html in docs/
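The troubleshooting checks above (deploy `_build/html`, make sure `index.html` exists) can be captured in a small pre-deploy helper. The function name and default path are hypothetical, not part of the repo's actual tooling:

```python
from pathlib import Path

def pages_ready(build_dir: str = "docs/_build/html") -> bool:
    """True when the MyST static build has the entry point GitHub Pages needs.

    MyST writes index.html into _build/html/; _build/site/ holds JSON for the
    dev server and would render as a blank page if deployed by mistake.
    """
    return (Path(build_dir) / "index.html").is_file()
```

A CI step could call this after `make documentation` and fail the workflow when it returns `False`, instead of discovering a blank page after deployment.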
23 changes: 14 additions & 9 deletions docs/abstract.md
@@ -1,11 +1,16 @@
# Abstract

We present a methodology for creating enhanced microsimulation datasets by combining the
Current Population Survey (CPS) with the IRS Public Use File (PUF). Our approach uses
quantile regression forests to impute 67 tax variables from the PUF onto CPS records,
preserving distributional characteristics while maintaining household composition and member
relationships. The imputation process alone does not guarantee consistency with official
statistics, necessitating a reweighting step to align the combined dataset with known
population totals and administrative benchmarks. We apply a reweighting algorithm that calibrates the dataset to 2,813 targets from the IRS Statistics of Income, Census population projections, Congressional Budget Office benefit program estimates, Treasury expenditure data, Joint Committee on Taxation tax expenditure estimates, healthcare spending patterns, and other benefit program costs. The reweighting employs dropout-regularized gradient descent optimization to ensure consistency with administrative benchmarks. The dataset maintains the CPS's demographic detail and geographic granularity while
incorporating tax reporting data from administrative sources. We release the enhanced
dataset, source code, and documentation to support policy analysis.
We present a methodology for creating enhanced microsimulation datasets by combining the Current
Population Survey (CPS) with the IRS Public Use File (PUF). Our approach uses quantile regression
forests to impute 67 tax variables from the PUF onto CPS records, preserving distributional
characteristics while maintaining household composition and member relationships. The imputation
process alone does not guarantee consistency with official statistics, necessitating a reweighting
step to align the combined dataset with known population totals and administrative benchmarks. We
apply a reweighting algorithm that calibrates the dataset to 2,813 targets from the IRS Statistics
of Income, Census population projections, Congressional Budget Office benefit program estimates,
Treasury expenditure data, Joint Committee on Taxation tax expenditure estimates, healthcare
spending patterns, and other benefit program costs. The reweighting employs dropout-regularized
gradient descent optimization to ensure consistency with administrative benchmarks. The dataset
maintains the CPS's demographic detail and geographic granularity while incorporating tax reporting
data from administrative sources. We release the enhanced dataset, source code, and documentation to
support policy analysis.
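The dropout-regularized gradient descent reweighting described in the abstract can be sketched as follows. This is a toy illustration, not the production calibration code: the estimate matrix `M`, the targets `t`, the learning rate, and all dimensions are synthetic stand-ins (the real pipeline calibrates CPS weights against 2,813 administrative targets, typically with an autodiff framework rather than hand-written NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy calibration problem: 200 households, 6 targets (all synthetic).
# Row i of M gives each household's contribution to target i, so M @ w
# is the vector of weighted estimates for weight vector w.
n_households, n_targets = 200, 6
M = rng.uniform(0.0, 1.0, (n_targets, n_households))
t = M @ rng.uniform(0.5, 2.0, n_households)  # targets consistent by construction

log_w = np.zeros(n_households)  # optimize log-weights so weights stay positive
lr, dropout = 0.001, 0.1

def loss(log_w):
    return 0.5 * np.mean((M @ np.exp(log_w) - t) ** 2)

initial = loss(log_w)
for _ in range(2000):
    w = np.exp(log_w)
    grad = (M.T @ (M @ w - t)) * w / n_targets   # chain rule through exp
    keep = rng.uniform(size=n_households) > dropout
    log_w -= lr * grad * keep                    # dropout: freeze ~10% of weights each step

final = loss(log_w)
```

Randomly masking a subset of weight updates each step plays the regularizing role the abstract attributes to dropout: no single household's weight can absorb the full adjustment, which discourages extreme weights while the calibration error still falls.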
10 changes: 8 additions & 2 deletions docs/appendix.md
@@ -4,7 +4,8 @@

### A.1 Quantile Regression Forest Implementation

The following code demonstrates the implementation of Quantile Regression Forests for variable imputation:
The following code demonstrates the implementation of Quantile Regression Forests for variable
imputation:

```python
from quantile_forest import RandomForestQuantileRegressor
@@ -49,6 +50,7 @@ for iteration in range(5000):
#### Variables Imputed from IRS Public Use File (67 variables)

**Income Variables:**

- employment_income
- partnership_s_corp_income
- social_security
@@ -75,6 +77,7 @@ for iteration in range(5000):
- salt_refund_income

**Deductions and Adjustments:**

- interest_deduction
- unreimbursed_business_employee_expenses
- pre_tax_contributions
@@ -92,6 +95,7 @@ for iteration in range(5000):
- deductible_mortgage_interest

**Tax Credits:**

- cdcc_relevant_expenses
- foreign_tax_credit
- american_opportunity_credit
@@ -104,6 +108,7 @@ for iteration in range(5000):
- other_credits

**Qualified Business Income Variables:**

- w2_wages_from_qualified_business
- unadjusted_basis_qualified_property
- business_is_sstb
@@ -118,6 +123,7 @@ for iteration in range(5000):
- self_employment_income_would_be_qualified

**Other Tax Variables:**

- traditional_ira_contributions
- qualified_tuition_expenses
- casualty_loss
@@ -137,4 +143,4 @@ for iteration in range(5000):
#### Variables Imputed from American Community Survey (2 variables)

- rent
- real_estate_taxes
- real_estate_taxes
32 changes: 26 additions & 6 deletions docs/background.md
@@ -2,16 +2,36 @@

## The Microsimulation Landscape

Tax and benefit microsimulation models play a role in policy analysis by projecting the distributional and revenue impacts of proposed reforms. Institutions maintaining these models include government agencies like the Congressional Budget Office (CBO), Joint Committee on Taxation (JCT), and Treasury's Office of Tax Analysis (OTA), as well as non-governmental organizations including the Urban-Brookings Tax Policy Center (TPC), Tax Foundation, Penn Wharton Budget Model (PWBM), Institute on Taxation and Economic Policy (ITEP), Yale Budget Lab, and the open-source Policy Simulation Library (PSL). Each model serves specific institutional needs but faces common data challenges.

The core challenges these models face stem from the tradeoff between data comprehensiveness and accessibility. Administrative tax data provides income reporting but lacks the household context that models need to analyze benefit programs and family-level impacts {cite:p}`sabelhaus2020`. Survey data captures household relationships and program participation but suffers from income underreporting that worsens at higher income levels {cite:p}`meyer2021`. The need to protect taxpayer privacy limits data availability because administrators cannot publicly release microdata.
Tax and benefit microsimulation models play a role in policy analysis by projecting the
distributional and revenue impacts of proposed reforms. Institutions maintaining these models
include government agencies like the Congressional Budget Office (CBO), Joint Committee on Taxation
(JCT), and Treasury's Office of Tax Analysis (OTA), as well as non-governmental organizations
including the Urban-Brookings Tax Policy Center (TPC), Tax Foundation, Penn Wharton Budget Model
(PWBM), Institute on Taxation and Economic Policy (ITEP), Yale Budget Lab, and the open-source
Policy Simulation Library (PSL). Each model serves specific institutional needs but faces common
data challenges.

The core challenges these models face stem from the tradeoff between data comprehensiveness and
accessibility. Administrative tax data provides income reporting but lacks the household context
that models need to analyze benefit programs and family-level impacts {cite:p}`sabelhaus2020`.
Survey data captures household relationships and program participation but suffers from income
underreporting that worsens at higher income levels {cite:p}`meyer2021`. The need to protect
taxpayer privacy limits data availability because administrators cannot publicly release microdata.

## Data Enhancement Approaches

Different microsimulation models use various approaches to enhance their underlying data:

Government models (CBO, JCT, Treasury) have access to confidential administrative data but cannot share their enhanced microdata. Non-governmental models work with public data, leading to various enhancement strategies. Some organizations use proprietary extracts of tax returns, while others enhance survey data with various methods.
Government models (CBO, JCT, Treasury) have access to confidential administrative data but cannot
share their enhanced microdata. Non-governmental models work with public data, leading to various
enhancement strategies. Some organizations use proprietary extracts of tax returns, while others
enhance survey data with various methods.

Our enhanced dataset provides an open-source methodology with state identifiers and calibration to state-level targets. This enables analysis of federal-state tax interactions. Researchers can use the dataset with PolicyEngine or other microsimulation models.
Our enhanced dataset provides an open-source methodology with state identifiers and calibration to
state-level targets. This enables analysis of federal-state tax interactions. Researchers can use
the dataset with PolicyEngine or other microsimulation models.

The open-source nature promotes methodological transparency. The modular design allows researchers to substitute alternative imputation or calibration methods while maintaining the overall framework. Regular updates as new CPS and administrative data become available ensure the dataset remains current.
The open-source nature promotes methodological transparency. The modular design allows researchers
to substitute alternative imputation or calibration methods while maintaining the overall framework.
Regular updates as new CPS and administrative data become available ensure the dataset remains
current.