Impute gift_aid from SPI for high-income donor rows #347

Merged
MaxGhenis merged 4 commits into main from impute-gift-aid-from-spi on Apr 17, 2026

@MaxGhenis

Summary

The enhanced FRS's SPI-donor half carries gift_aid = 0 for every record — including the ~10k synthetic high-earner rows generated by the SPI income imputation — because the QRF was only trained to predict the six core income components. gift_aid was already listed in SPI_RENAMES (so SPI's GIFTAID column was being prepared as a training input) but it wasn't in IMPUTATIONS, so it never reached the model's output set.

As a result, the entire modelled UK population has gift_aid = 0 everywhere, missing ~£1-1.5bn/yr of Gift Aid tax relief that HMRC actually pays, and any reform touching Gift Aid is currently a no-op against baseline.

Fix

Adds gift_aid to IMPUTATIONS. Because it's trained jointly with the six income components, each SPI-donor row now carries a Gift Aid figure drawn alongside its income from the same SPI respondent, so high-earner donor rows get nonzero Gift Aid draws that scale with how much the underlying SPI respondent claimed.

The cache file is renamed to income_v2.pkl so any stale local pickle (which doesn't have gift_aid as an output) is bypassed automatically. CI always trains from scratch so this is a no-op there.

Verification

Retrained the QRF on the committed SPI 2020-21 inputs and ran a 10,000-row synthetic SPI-donor-style sample (random age, gender, region) through .predict:

| Sample | gift_aid nonzero share | Mean (nonzero) |
| --- | --- | --- |
| All donor rows | 5.6% (was 0%) | £2,492 |
| employment_income >= £200k | 41.4% (was 0%) | £4,543 |

These shares are consistent with HMRC higher-rate Gift Aid relief flows and with the share of UK taxpayers claiming Gift Aid (~12% of taxpayers, higher among higher-rate payers).
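
For reference, the two statistics reported in the verification table (nonzero share and mean among nonzero values) can be computed along these lines. This is a minimal sketch: the helper name `summarise_nonzero` and the sample values are hypothetical and are not the PR's code or data.

```python
from statistics import mean


def summarise_nonzero(values):
    """Return (share of values that are nonzero, mean of the nonzero values)."""
    nonzero = [v for v in values if v != 0]
    share = len(nonzero) / len(values) if values else 0.0
    avg = mean(nonzero) if nonzero else 0.0
    return share, avg


# Hypothetical predictions standing in for the model's gift_aid output.
predicted_gift_aid = [0.0, 0.0, 1200.0, 0.0, 3000.0, 0.0, 0.0, 0.0]

share, avg = summarise_nonzero(predicted_gift_aid)
print(f"nonzero share: {share:.1%}, mean (nonzero): £{avg:,.0f}")
```

Reporting the mean over nonzero rows only keeps the large mass of zero-claim rows from diluting the typical claim size.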

Scope

  • Does not touch the FRS-side imputation (still only overwrites dividend_income there). A proper FRS-side Gift Aid imputation would need an income-conditional model to avoid smearing £1-1.5bn of Gift Aid uniformly across all demographics. That's left for a follow-up once policyengine-uk#1621 lands the second-stage QRF pipeline.
  • Matches the US pattern of imputing rich covariates jointly rather than copying donor-row values. Related: policyengine-us-data#589.

Test plan

  • uvx ruff format --check on income.py clean
  • Existing test_is_parent_from_frs.py / test_target_registry.py pass after edit
  • Retraining succeeds against committed SPI 2020-21 data
  • Predicted gift_aid distribution matches expectations (above table)
  • CI Test job passes (includes a full build against the SPI raw data)

MaxGhenis and others added 2 commits April 17, 2026 07:14
The enhanced FRS's SPI-donor half carries `gift_aid = 0` for every
record — including the ~10k synthetic high-earner rows generated by
the SPI income imputation — because the QRF was only trained to
predict the six core income components. `gift_aid` was already listed
in `SPI_RENAMES` (so SPI's `GIFTAID` column was being prepared as a
training input) but it wasn't in `IMPUTATIONS`, so it never reached
the model's output set. As a result, the entire modelled population
missed the ~£1-1.5bn/yr of Gift Aid tax relief that HMRC actually
pays, and any reform touching the Gift Aid regime was a no-op.

Adds `gift_aid` to `IMPUTATIONS`. Because it's trained jointly with
the income components, each SPI-donor row now carries a Gift Aid
figure drawn alongside its income from the same SPI respondent,
giving correlated draws (high earners likelier to make larger Gift
Aid claims) rather than a demographic-only smear.

Verified on a 10,000-row synthetic SPI-donor sample after retraining:

- All rows: 5.6% nonzero gift_aid, mean among nonzero £2,492
- Employment income >= £200k: 41.4% nonzero gift_aid, mean £4,543

Renames the cached pickle to `income_v2.pkl` so any existing local
pickle (which doesn't have `gift_aid` as an output) is bypassed and
retraining happens automatically on next build. CI always trains
from scratch, so this is a no-op there.

Does not touch the FRS-side imputation (still only overwrites
`dividend_income`). Properly imputing Gift Aid on the FRS side
would require an income-conditional model, outside this PR's scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CI surfaced two issues with the initial version:

1. `impute_over_incomes` used `dataset.person[IMPUTATIONS]` to compute a
   rent/mortgage adjustment factor. With gift_aid in IMPUTATIONS but not
   on the raw FRS build, the selection raised `KeyError: ['gift_aid']
   not in index`. Gift Aid is also an expenditure, not income, so it
   shouldn't be in an "income total". Split INCOME_COMPONENTS (6 income
   variables, used for the adjustment factor) from IMPUTATIONS (training
   outputs, which additionally include gift_aid).

2. The full-FRS half of `impute_income` only overwrites dividend_income,
   so gift_aid remained unset on those rows. When `stack_datasets`
   combined the two halves, the full-FRS rows surfaced NaN gift_aid and
   subsequent `validate()` calls in the dataset-uprating path tripped.
   Initialise `dataset.person["gift_aid"] = 0.0` at the top of
   `impute_income` so the full-FRS side has a concrete value from the
   start.
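
The two fixes above can be illustrated with a small pandas sketch. Everything here is an assumption made for illustration: the two-column `INCOME_COMPONENTS` list stands in for the six real components (which this PR does not enumerate), `init_gift_aid` is a hypothetical helper, and `pd.concat` stands in for `stack_datasets`, whose real implementation may differ.

```python
import pandas as pd

# Stand-in for the six income components; the real list is not shown in this PR.
INCOME_COMPONENTS = ["employment_income", "dividend_income"]
# Training outputs additionally include gift_aid.
IMPUTATIONS = INCOME_COMPONENTS + ["gift_aid"]


def income_total(person: pd.DataFrame) -> pd.Series:
    """Sum income columns only; gift_aid is an expenditure and is excluded."""
    return person[INCOME_COMPONENTS].sum(axis=1)


def init_gift_aid(person: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with gift_aid set to 0.0 so every row has a concrete value."""
    person = person.copy()
    person["gift_aid"] = 0.0
    return person


def selecting_imputations_fails(person: pd.DataFrame) -> bool:
    """True if selecting all IMPUTATIONS columns raises KeyError."""
    try:
        person[IMPUTATIONS]
    except KeyError:
        return True
    return False


# Raw FRS-style frame: no gift_aid column yet.
frs = pd.DataFrame(
    {"employment_income": [30_000.0, 55_000.0], "dividend_income": [0.0, 1_200.0]}
)
# Donor-style frame: gift_aid already imputed.
donor = frs.assign(gift_aid=[0.0, 400.0])

# Stacking without initialisation leaves NaN on the rows that lacked the column.
stacked_without_init = pd.concat([frs, donor], ignore_index=True)
stacked_with_init = pd.concat([init_gift_aid(frs), donor], ignore_index=True)

print(selecting_imputations_fails(frs))
print(stacked_without_init["gift_aid"].isna().sum())
print(stacked_with_init["gift_aid"].isna().sum())
```

Keeping the income list separate from the output list means adding a non-income output later does not silently change the rent/mortgage adjustment factor.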

End-to-end verified locally against FRS 2023-24 + SPI 2020-21:

- SPI-donor rows (weight=0, ~22k): 5.8% have nonzero gift_aid
- Full-FRS rows (~36k): gift_aid = 0 (as intended — no change there)
- No NaN columns; `validate()` passes on both halves and the stack

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@MaxGhenis MaxGhenis marked this pull request as ready for review April 17, 2026 12:09
MaxGhenis and others added 2 commits April 17, 2026 08:38
Reviewer feedback: SPI's `GIFTINV` (qualifying investment/property
gifts) is a separate relief on the UK side — policyengine-uk has a
distinct `charitable_investment_gifts` variable that flows into the
income-tax `allowances` aggregate. The previous revision of this PR
only added `gift_aid` (SPI `GIFTAID`) to the QRF outputs, which is
inconsistent with the standalone SPI dataset path
(`datasets/spi.py:88`) that sums `GIFTAID + GIFTINV` into a single
column. The enhanced-FRS path should populate each charitable-giving
variable separately so each maps to the right policyengine-uk variable.

Adds `charitable_investment_gifts` to `IMPUTATIONS` alongside
`gift_aid`, extends the zero-initialisation to both columns, and bumps
the cache file to `income_v3.pkl` so stale pickles from the previous
revision retrain automatically.
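
The difference between the two paths can be sketched as follows. This is illustrative only: the dictionary is a hypothetical two-entry excerpt in the spirit of `SPI_RENAMES` (the real mapping's contents and direction are assumed), and `split_gifts` / `combined_gifts` are made-up helper names, not functions from the repo.

```python
# Hypothetical excerpt of an SPI-column -> model-variable mapping.
SPI_GIFT_RENAMES = {
    "GIFTAID": "gift_aid",
    "GIFTINV": "charitable_investment_gifts",
}


def split_gifts(spi_row: dict) -> dict:
    """Enhanced-FRS style: keep each relief in its own variable."""
    return {
        target: spi_row.get(source, 0.0)
        for source, target in SPI_GIFT_RENAMES.items()
    }


def combined_gifts(spi_row: dict) -> float:
    """Standalone-SPI style: sum both reliefs into a single figure."""
    return spi_row.get("GIFTAID", 0.0) + spi_row.get("GIFTINV", 0.0)


# Hypothetical SPI record.
row = {"GIFTAID": 500.0, "GIFTINV": 2_000.0}
print(split_gifts(row))
print(combined_gifts(row))
```

Keeping the columns separate preserves the information needed to route each amount to its own policyengine-uk variable; the summed form cannot be split again afterwards.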

Verified locally against FRS 2023-24 + SPI 2020-21:

- gift_aid: 1,271 of 21,607 SPI-donor rows nonzero (5.9%)
- charitable_investment_gifts: 13 of 21,607 SPI-donor rows nonzero (0.06%)
- Both validate() cleanly; stacked dataset has no NaN in either column

Both columns stay at 0 on the full-FRS half (as intended — FRS doesn't
collect charitable giving).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Reverts the `income.pkl` → `income_v2.pkl` → `income_v3.pkl` cache
rename in favour of stable naming. Bumping the filename on every
output-set change is a form of the `_v2`/`_v3` anti-pattern the repo's
CLAUDE.md calls out (it leaves orphan pickles behind on local dev
machines and adds to the "keep deleting deprecated files" debt).

Instead, the cache loader now checks `cached.model.imputed_variables`
against the current `IMPUTATIONS` and retrains if they disagree. A
stale local `income.pkl` from the pre-PR state (six outputs) is
detected and rebuilt on first use; no manual deletion required.
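
A minimal sketch of that load-time check, under stated assumptions: `load_or_retrain` and `train_stub` are hypothetical names, the cached object is mocked with `SimpleNamespace` so that `cached.model.imputed_variables` resolves, and the comparison is an exact ordered-list match, which may be stricter than the real loader.

```python
import pickle
import tempfile
from pathlib import Path
from types import SimpleNamespace


def train_stub(outputs):
    """Stand-in for real QRF training: records only the output variable names."""
    return SimpleNamespace(model=SimpleNamespace(imputed_variables=list(outputs)))


def load_or_retrain(path: Path, outputs):
    """Return (imputer, retrained). Reuse the cache only if its outputs match."""
    if path.exists():
        with path.open("rb") as f:
            cached = pickle.load(f)
        if list(cached.model.imputed_variables) == list(outputs):
            return cached, False
    # Missing or stale cache: retrain and overwrite the same file name.
    fresh = train_stub(outputs)
    with path.open("wb") as f:
        pickle.dump(fresh, f)
    return fresh, True


# Placeholder names for the six income components (not listed in this PR).
six_outputs = [f"income_component_{i}" for i in range(1, 7)]
eight_outputs = six_outputs + ["gift_aid", "charitable_investment_gifts"]

with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "income.pkl"
    # Simulate a stale pre-PR pickle with six outputs.
    with cache.open("wb") as f:
        pickle.dump(train_stub(six_outputs), f)
    imputer, retrained = load_or_retrain(cache, eight_outputs)
    print(retrained, len(imputer.model.imputed_variables))
    # A second load finds a matching cache and skips retraining.
    _, retrained_again = load_or_retrain(cache, eight_outputs)
    print(retrained_again)
```

Because the stale file is overwritten in place, no versioned copies accumulate on local machines.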

Verified end-to-end:

- Fresh run: `income.pkl` created with 8 outputs; `charitable_investment_gifts` and `gift_aid` populated as expected.
- Forced stale state: a 6-output model saved under `income.pkl` is detected on load, discarded, and retrained to 8 outputs automatically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@MaxGhenis MaxGhenis merged commit f6e8454 into main Apr 17, 2026
4 of 5 checks passed
@MaxGhenis MaxGhenis deleted the impute-gift-aid-from-spi branch April 17, 2026 15:08