Impute gift_aid from SPI for high-income donor rows by MaxGhenis · Pull Request #347 · PolicyEngine/policyengine-uk-data

MaxGhenis · 2026-04-17T11:15:38Z

Summary

The enhanced FRS's SPI-donor half carries gift_aid = 0 for every record — including the ~10k synthetic high-earner rows generated by the SPI income imputation — because the QRF was only trained to predict the six core income components. gift_aid was already listed in SPI_RENAMES (so SPI's GIFTAID column was being prepared as a training input) but it wasn't in IMPUTATIONS, so it never reached the model's output set.

As a result, the entire modelled UK population has gift_aid = 0 everywhere, missing ~£1-1.5bn/yr of Gift Aid tax relief that HMRC actually pays, and any reform touching Gift Aid is currently a no-op against baseline.

Fix

Adds gift_aid to IMPUTATIONS. Because it's trained jointly with the six income components, each SPI-donor row now carries a Gift Aid figure drawn alongside its income from the same SPI respondent — so high-earner donors get larger-than-zero Gift Aid draws in proportion to how much the underlying SPI respondent claimed.

The cache file is renamed to income_v2.pkl so any stale local pickle (which doesn't have gift_aid as an output) is bypassed automatically. CI always trains from scratch so this is a no-op there.

Verification

Retrained the QRF on the committed SPI 2020-21 inputs and ran a 10,000-row synthetic SPI-donor-style sample (random age, gender, region) through .predict:

| Sample | `gift_aid` nonzero share | Mean (nonzero) |
| --- | --- | --- |
| All donor rows | 5.6% (was 0%) | £2,492 |
| `employment_income >= £200k` | 41.4% (was 0%) | £4,543 |

These figures are consistent with HMRC higher-rate Gift Aid relief flows and with the share of UK taxpayers claiming Gift Aid (~12% of taxpayers, higher among higher-rate payers).
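As a rough illustration of how the two statistics reported above (nonzero share and mean among nonzero rows) can be computed, here is a minimal sketch. The helper name `summarise_nonzero` and the toy rows are assumptions made for this example; they are not the PR's verification script or its data.

```python
from statistics import mean


def summarise_nonzero(values):
    """Return (share of values > 0, mean of the values > 0)."""
    nonzero = [v for v in values if v > 0]
    share = len(nonzero) / len(values) if values else float("nan")
    avg = mean(nonzero) if nonzero else float("nan")
    return share, avg


# Toy rows standing in for QRF predictions (illustrative values only).
rows = [
    {"employment_income": 30_000, "gift_aid": 0.0},
    {"employment_income": 250_000, "gift_aid": 4_000.0},
    {"employment_income": 45_000, "gift_aid": 500.0},
    {"employment_income": 300_000, "gift_aid": 0.0},
]

all_share, all_mean = summarise_nonzero([r["gift_aid"] for r in rows])

# Same statistics restricted to rows with employment income of at least £200k.
high_values = [r["gift_aid"] for r in rows if r["employment_income"] >= 200_000]
high_share, high_mean = summarise_nonzero(high_values)
```

Computing the mean over nonzero rows only keeps the two columns independent: the share captures how many rows claim anything at all, and the mean captures how much those claimants give.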

Scope

- Does not touch the FRS-side imputation (still only overwrites dividend_income there). A proper FRS-side Gift Aid imputation would need an income-conditional model to avoid smearing £1-1.5bn of Gift Aid uniformly across all demographics. That's left for a follow-up once policyengine-uk#1621 lands the second-stage QRF pipeline.
- Matches the US pattern of imputing rich covariates jointly rather than copying donor-row values. Related: policyengine-us-data#589.

Test plan

- `uvx ruff format --check` on income.py clean
- Existing test_is_parent_from_frs.py / test_target_registry.py pass after edit
- Retraining succeeds against committed SPI 2020-21 data
- Predicted gift_aid distribution matches expectations (above table)
- CI Test job passes (includes a full build against the SPI raw data)

The enhanced FRS's SPI-donor half carries `gift_aid = 0` for every record, including the ~10k synthetic high-earner rows generated by the SPI income imputation, because the QRF was only trained to predict the six core income components. `gift_aid` was already listed in `SPI_RENAMES` (so SPI's `GIFTAID` column was being prepared as a training input) but it wasn't in `IMPUTATIONS`, so it never reached the model's output set. As a result, the entire modelled population missed the ~£1-1.5bn/yr of Gift Aid tax relief that HMRC actually pays, and any reform touching the Gift Aid regime was a no-op.

Adds `gift_aid` to `IMPUTATIONS`. Because it's trained jointly with the income components, each SPI-donor row now carries a Gift Aid figure drawn alongside its income from the same SPI respondent, giving correlated draws (high earners likelier to make larger Gift Aid claims) rather than a demographic-only smear.

Verified on a 10,000-row synthetic SPI-donor sample after retraining:

- All rows: 5.6% nonzero gift_aid, mean among nonzero £2,492
- Employment income >= £200k: 41.4% nonzero gift_aid, mean £4,543

Renames the cached pickle to `income_v2.pkl` so any existing local pickle (which doesn't have `gift_aid` as an output) is bypassed and retraining happens automatically on next build. CI always trains from scratch, so this is a no-op there.

Does not touch the FRS-side imputation (still only overwrites `dividend_income`). Properly imputing Gift Aid on the FRS side would require an income-conditional model, outside this PR's scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CI surfaced two issues with the initial version:

1. `impute_over_incomes` used `dataset.person[IMPUTATIONS]` to compute a rent/mortgage adjustment factor. With gift_aid in IMPUTATIONS but not on the raw FRS build, the selection raised `KeyError: ['gift_aid'] not in index`. Gift Aid is also an expenditure, not income, so it shouldn't be in an "income total". Split INCOME_COMPONENTS (6 income variables, used for the adjustment factor) from IMPUTATIONS (training outputs, which additionally include gift_aid).

2. The full-FRS half of `impute_income` only overwrites dividend_income, so gift_aid remained unset on those rows. When `stack_datasets` combined the two halves, the full-FRS rows surfaced NaN gift_aid and subsequent `validate()` calls in the dataset-uprating path tripped. Initialise `dataset.person["gift_aid"] = 0.0` at the top of `impute_income` so the full-FRS side has a concrete value from the start.

End-to-end verified locally against FRS 2023-24 + SPI 2020-21:

- SPI-donor rows (weight=0, ~22k): 5.8% have nonzero gift_aid
- Full-FRS rows (~36k): gift_aid = 0 (as intended; no change there)
- No NaN columns; `validate()` passes on both halves and the stack

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
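Both CI failures described in this commit can be reproduced with plain pandas. The sketch below is a toy illustration only: the names `INCOME_COMPONENTS` and `IMPUTATIONS` come from the commit message, but the lists are shortened to two income columns, the frames are made up, and `pd.concat` stands in for `stack_datasets`, whose real implementation is not shown in this PR.

```python
import pandas as pd

# Shortened for brevity: the real INCOME_COMPONENTS holds six income variables.
INCOME_COMPONENTS = ["employment_income", "dividend_income"]
IMPUTATIONS = INCOME_COMPONENTS + ["gift_aid"]  # training outputs

# Toy stand-in for the raw FRS person table, which has no gift_aid column.
frs_person = pd.DataFrame(
    {"employment_income": [30_000.0, 80_000.0], "dividend_income": [0.0, 1_200.0]}
)

# Issue 1: selecting a list containing a missing column raises KeyError.
try:
    frs_person[IMPUTATIONS]
    raised_key_error = False
except KeyError:
    raised_key_error = True

# Selecting only the income columns works and keeps gift_aid out of the total.
income_total = frs_person[INCOME_COMPONENTS].sum(axis=1)

# Toy stand-in for the SPI-donor half, where gift_aid has been imputed.
spi_donor = pd.DataFrame(
    {"employment_income": [250_000.0], "dividend_income": [9_000.0], "gift_aid": [3_000.0]}
)

# Issue 2: stacking frames with mismatched columns fills the gap with NaN.
stacked_before = pd.concat([frs_person, spi_donor], ignore_index=True)
has_nan_before = bool(stacked_before["gift_aid"].isna().any())

# Zero-initialising the column first gives the FRS rows a concrete value.
frs_person["gift_aid"] = 0.0
stacked_after = pd.concat([frs_person, spi_donor], ignore_index=True)
has_nan_after = bool(stacked_after["gift_aid"].isna().any())
```

Keeping a separate list for the income total also documents intent: anything appended to the training outputs later will not silently change the adjustment factor.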

Reviewer feedback: SPI's `GIFTINV` (qualifying investment/property gifts) is a separate relief on the UK side. policyengine-uk has a distinct `charitable_investment_gifts` variable that flows into the income-tax `allowances` aggregate.

The previous revision of this PR only added `gift_aid` (SPI `GIFTAID`) to the QRF outputs, which is inconsistent with the standalone SPI dataset path (`datasets/spi.py:88`) that sums `GIFTAID + GIFTINV` into a single column. The enhanced-FRS path should populate each charitable-giving variable separately so each maps to the right policyengine-uk variable.

Adds `charitable_investment_gifts` to `IMPUTATIONS` alongside `gift_aid`, extends the zero-initialisation to both columns, and bumps the cache file to `income_v3.pkl` so stale pickles from the previous revision retrain automatically.

Verified locally against FRS 2023-24 + SPI 2020-21:

- gift_aid: 1,271 of 21,607 SPI-donor rows nonzero (5.9%)
- charitable_investment_gifts: 13 of 21,607 SPI-donor rows nonzero (0.06%)
- Both validate() cleanly; stacked dataset has no NaN in either column

Both columns stay at 0 on the full-FRS half (as intended; FRS doesn't collect charitable giving).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
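The difference between summing the two SPI columns and keeping them separate can be sketched as follows. The SPI column names `GIFTAID` and `GIFTINV` and the two target variable names come from the commit message; the mapping name `GIVING_RENAMES` and the toy values are hypothetical, and the actual code in `datasets/spi.py` may look different.

```python
import pandas as pd

# Toy SPI extract (illustrative values only).
spi = pd.DataFrame(
    {"GIFTAID": [0.0, 1_000.0, 250.0], "GIFTINV": [0.0, 0.0, 5_000.0]}
)

# Combined approach: one total, so the two reliefs can no longer be told apart.
combined = spi["GIFTAID"] + spi["GIFTINV"]

# Separate approach: each SPI column maps to its own model variable.
# GIVING_RENAMES is a hypothetical name used only for this sketch.
GIVING_RENAMES = {
    "GIFTAID": "gift_aid",
    "GIFTINV": "charitable_investment_gifts",
}
separate = spi.rename(columns=GIVING_RENAMES)[list(GIVING_RENAMES.values())]
```

In the third toy row the combined total is 5,250, which hides that most of it is an investment gift rather than a Gift Aid donation; the separate columns preserve that split.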

Reverts the `income.pkl` → `income_v2.pkl` → `income_v3.pkl` cache rename in favour of stable naming. Bumping the filename on every output-set change is a form of the `_v2`/`_v3` anti-pattern the repo's CLAUDE.md calls out (it leaves orphan pickles behind on local dev machines and adds to the "keep deleting deprecated files" debt).

Instead, the cache loader now checks `cached.model.imputed_variables` against the current `IMPUTATIONS` and retrains if they disagree. A stale local `income.pkl` from the pre-PR state (six outputs) is detected and rebuilt on first use; no manual deletion required.

Verified end-to-end:

- Fresh run: `income.pkl` created with 8 outputs; `charitable_investment_gifts` and `gift_aid` populated as expected.
- Forced stale state: a 6-output model saved under `income.pkl` is detected on load, discarded, and retrained to 8 outputs automatically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
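The stale-cache check described in this commit might look roughly like the sketch below. It is an assumption-laden illustration, not the repo's loader: the function names `load_or_retrain` and `toy_train` are hypothetical, the cached object is a plain dict rather than a model exposing `cached.model.imputed_variables`, the six income names are placeholders, and the comparison ignores column order.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_retrain(path, imputations, train):
    """Return (payload, retrained); reuse the cache only if its outputs match."""
    if path.exists():
        with path.open("rb") as f:
            cached = pickle.load(f)
        # Stale if the cached output set differs from the current one.
        if set(cached.get("imputed_variables", [])) == set(imputations):
            return cached, False
    payload = train(imputations)
    with path.open("wb") as f:
        pickle.dump(payload, f)
    return payload, True


def toy_train(imputations):
    # Placeholder for model training; records only the output names.
    return {"imputed_variables": list(imputations)}


# Placeholder names: the six real income components are not listed in this PR.
income_components = [f"income_component_{i}" for i in range(1, 7)]
imputations = income_components + ["gift_aid", "charitable_investment_gifts"]

with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "income.pkl"
    # Simulate a stale six-output pickle left over from before the change.
    with cache.open("wb") as f:
        pickle.dump(toy_train(income_components), f)
    first, retrained_first = load_or_retrain(cache, imputations, toy_train)
    second, retrained_second = load_or_retrain(cache, imputations, toy_train)
```

The first call finds the six-output pickle, discards it, and retrains; the second call reuses the freshly written eight-output cache. Because the file name never changes, no orphaned pickles are left behind.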

MaxGhenis and others added 2 commits April 17, 2026 07:14

MaxGhenis marked this pull request as ready for review April 17, 2026 12:09

MaxGhenis and others added 2 commits April 17, 2026 08:38

MaxGhenis merged commit f6e8454 into main Apr 17, 2026
4 of 5 checks passed

MaxGhenis deleted the impute-gift-aid-from-spi branch April 17, 2026 15:08

MaxGhenis merged 4 commits into main from impute-gift-aid-from-spi
