Impute gift_aid from SPI for high-income donor rows #347
Merged
The enhanced FRS's SPI-donor half carries `gift_aid = 0` for every record, including the ~10k synthetic high-earner rows generated by the SPI income imputation, because the QRF was only trained to predict the six core income components. `gift_aid` was already listed in `SPI_RENAMES` (so SPI's `GIFTAID` column was being prepared as a training input) but it wasn't in `IMPUTATIONS`, so it never reached the model's output set. As a result, the entire modelled population missed the ~£1-1.5bn/yr of Gift Aid tax relief that HMRC actually pays, and any reform touching the Gift Aid regime was a no-op.

Adds `gift_aid` to `IMPUTATIONS`. Because it's trained jointly with the income components, each SPI-donor row now carries a Gift Aid figure drawn alongside its income from the same SPI respondent, giving correlated draws (high earners likelier to make larger Gift Aid claims) rather than a demographic-only smear.

Verified on a 10,000-row synthetic SPI-donor sample after retraining:
- All rows: 5.6% nonzero gift_aid, mean among nonzero £2,492
- Employment income >= £200k: 41.4% nonzero gift_aid, mean £4,543

Renames the cached pickle to `income_v2.pkl` so any existing local pickle (which doesn't have `gift_aid` as an output) is bypassed and retraining happens automatically on next build. CI always trains from scratch, so this is a no-op there.

Does not touch the FRS-side imputation (still only overwrites `dividend_income`). Properly imputing Gift Aid on the FRS side would require an income-conditional model, outside this PR's scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
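To illustrate why drawing `gift_aid` alongside income from the same respondent matters, here is a minimal sketch on made-up data. It is not the QRF: it uses plain row resampling as a stand-in, and the donor population, the 40%/3% claim rates and the 2% giving ratio are all assumptions chosen only to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for SPI respondents (not real SPI data).
n_donors = 10_000
income = rng.lognormal(mean=np.log(30_000), sigma=1.0, size=n_donors)
high = income >= 200_000
# Assumed for illustration: high earners claim Gift Aid far more often.
claims = rng.random(n_donors) < np.where(high, 0.40, 0.03)
gift_aid = np.where(claims, 0.02 * income, 0.0)

n_draws = 10_000
# Joint draw: income and gift_aid come from the same respondent.
idx = rng.integers(0, n_donors, size=n_draws)
joint_income, joint_gift = income[idx], gift_aid[idx]

# Independent draw: gift_aid is taken from an unrelated respondent.
other = rng.integers(0, n_donors, size=n_draws)
indep_gift = gift_aid[other]


def nonzero_share(gift, mask):
    """Share of rows in `mask` with a positive gift_aid value."""
    return float((gift[mask] > 0).mean())


top = joint_income >= 200_000
joint_share = nonzero_share(joint_gift, top)
indep_share = nonzero_share(indep_gift, top)
print(f"joint: {joint_share:.1%}, independent: {indep_share:.1%}")
```

Under these assumptions the joint draw keeps the high-earner claim rate, while the independent draw pulls it down towards the population-wide rate, which is the "smear" the commit message refers to.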
CI surfaced two issues with the initial version:

1. `impute_over_incomes` used `dataset.person[IMPUTATIONS]` to compute a rent/mortgage adjustment factor. With gift_aid in IMPUTATIONS but not on the raw FRS build, the selection raised `KeyError: ['gift_aid'] not in index`. Gift Aid is also an expenditure, not income, so it shouldn't be in an "income total". Split INCOME_COMPONENTS (6 income variables, used for the adjustment factor) from IMPUTATIONS (training outputs, which additionally include gift_aid).

2. The full-FRS half of `impute_income` only overwrites dividend_income, so gift_aid remained unset on those rows. When `stack_datasets` combined the two halves, the full-FRS rows surfaced NaN gift_aid and subsequent `validate()` calls in the dataset-uprating path tripped. Initialise `dataset.person["gift_aid"] = 0.0` at the top of `impute_income` so the full-FRS side has a concrete value from the start.

End-to-end verified locally against FRS 2023-24 + SPI 2020-21:
- SPI-donor rows (weight=0, ~22k): 5.8% have nonzero gift_aid
- Full-FRS rows (~36k): gift_aid = 0 (as intended, no change there)
- No NaN columns; `validate()` passes on both halves and the stack

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
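Both failure modes can be reproduced on a toy table. The sketch below uses plain pandas, not the repo's `dataset.person` or `stack_datasets`; the two-element `INCOME_COMPONENTS` list is abbreviated for illustration (the commit describes six), and `pd.concat` stands in for the stacking step.

```python
import pandas as pd

# Abbreviated for illustration: the commit describes six income components.
INCOME_COMPONENTS = ["employment_income", "dividend_income"]
IMPUTATIONS = INCOME_COMPONENTS + ["gift_aid"]

# Raw FRS-style person table: no gift_aid column yet.
frs = pd.DataFrame(
    {"employment_income": [30_000.0, 250_000.0], "dividend_income": [0.0, 12_000.0]}
)

# Issue 1: selecting with the training-output list fails on the raw table.
try:
    frs[IMPUTATIONS]
    selection_failed = False
except KeyError:
    selection_failed = True

# The income total only needs the income components.
income_total = frs[INCOME_COMPONENTS].sum(axis=1)

# SPI-donor-style half that already carries an imputed gift_aid value.
spi = pd.DataFrame(
    {"employment_income": [400_000.0], "dividend_income": [50_000.0], "gift_aid": [5_000.0]}
)

# Issue 2: stacking before initialisation leaves NaN on the FRS rows.
nan_before = int(pd.concat([frs, spi], ignore_index=True)["gift_aid"].isna().sum())

# Zero-initialise the column, then stack again.
frs["gift_aid"] = 0.0
stacked = pd.concat([frs, spi], ignore_index=True)
nan_after = int(stacked["gift_aid"].isna().sum())
```

Keeping the two lists separate means the adjustment factor never depends on which non-income outputs the model happens to predict.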
Reviewer feedback: SPI's `GIFTINV` (qualifying investment/property gifts) is a separate relief on the UK side. policyengine-uk has a distinct `charitable_investment_gifts` variable that flows into the income-tax `allowances` aggregate. The previous revision of this PR only added `gift_aid` (SPI `GIFTAID`) to the QRF outputs, which is inconsistent with the standalone SPI dataset path (`datasets/spi.py:88`) that sums `GIFTAID + GIFTINV` into a single column. The enhanced-FRS path should populate each charitable-giving variable separately so each maps to the right policyengine-uk variable.

Adds `charitable_investment_gifts` to `IMPUTATIONS` alongside `gift_aid`, extends the zero-initialisation to both columns, and bumps the cache file to `income_v3.pkl` so stale pickles from the previous revision retrain automatically.

Verified locally against FRS 2023-24 + SPI 2020-21:
- gift_aid: 1,271 of 21,607 SPI-donor rows nonzero (5.9%)
- charitable_investment_gifts: 13 of 21,607 SPI-donor rows nonzero (0.06%)
- Both pass `validate()` cleanly; stacked dataset has no NaN in either column

Both columns stay at 0 on the full-FRS half (as intended, since FRS doesn't collect charitable giving).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
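The difference between the two paths is easy to see on a toy frame. In this sketch the `CHARITY_RENAMES` mapping is hypothetical (the commit only confirms that `GIFTAID` maps to `gift_aid` in `SPI_RENAMES`), and the three rows are made-up values.

```python
import pandas as pd

# Made-up SPI-style rows: cash Gift Aid donations and investment/property gifts.
spi = pd.DataFrame(
    {"GIFTAID": [0.0, 1_200.0, 300.0], "GIFTINV": [0.0, 0.0, 10_000.0]}
)

# Hypothetical mapping for illustration: one target variable per SPI column.
CHARITY_RENAMES = {
    "GIFTAID": "gift_aid",
    "GIFTINV": "charitable_investment_gifts",
}

# Separate columns: each relief stays attributable to its own variable.
separate = spi.rename(columns=CHARITY_RENAMES)

# Summed column: the third row's 10,000 investment gift becomes
# indistinguishable from a Gift Aid donation.
combined = spi["GIFTAID"] + spi["GIFTINV"]
```

With separate columns, a reform that changes only one of the two reliefs affects only the rows that actually hold that kind of gift.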
Reverts the `income.pkl` → `income_v2.pkl` → `income_v3.pkl` cache rename in favour of stable naming. Bumping the filename on every output-set change is a form of the `_v2`/`_v3` anti-pattern the repo's CLAUDE.md calls out (it leaves orphan pickles behind on local dev machines and adds to the "keep deleting deprecated files" debt).

Instead, the cache loader now checks `cached.model.imputed_variables` against the current `IMPUTATIONS` and retrains if they disagree. A stale local `income.pkl` from the pre-PR state (six outputs) is detected and rebuilt on first use; no manual deletion required.

Verified end-to-end:
- Fresh run: `income.pkl` created with 8 outputs; `charitable_investment_gifts` and `gift_aid` populated as expected.
- Forced stale state: a 6-output model saved under `income.pkl` is detected on load, discarded, and retrained to 8 outputs automatically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
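A staleness check of this kind might look roughly like the sketch below. It is not the repo's loader: `load_or_retrain`, `fake_train` and the `income_component_N` names are hypothetical, the model is a `SimpleNamespace` stand-in that only mimics the `.model.imputed_variables` attribute mentioned above, and comparing the outputs as sets (ignoring order) is an assumption.

```python
import pickle
import tempfile
from pathlib import Path
from types import SimpleNamespace


def load_or_retrain(path, imputations, train):
    """Return (model_wrapper, retrained) using a stable cache filename."""
    path = Path(path)
    if path.exists():
        with path.open("rb") as f:
            cached = pickle.load(f)
        outputs = getattr(getattr(cached, "model", None), "imputed_variables", None)
        # Reuse the cache only if its outputs match the current list.
        if outputs is not None and set(outputs) == set(imputations):
            return cached, False
    # Missing or stale cache: train again and overwrite in place.
    fresh = train(imputations)
    with path.open("wb") as f:
        pickle.dump(fresh, f)
    return fresh, True


def fake_train(imputations):
    # Stand-in for real training; records only which outputs it was given.
    return SimpleNamespace(model=SimpleNamespace(imputed_variables=list(imputations)))


# Hypothetical placeholder names for the six income outputs.
old_outputs = [f"income_component_{i}" for i in range(1, 7)]
new_outputs = old_outputs + ["gift_aid", "charitable_investment_gifts"]

with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "income.pkl"
    # Force a stale state: a six-output model saved under the stable name.
    with cache.open("wb") as f:
        pickle.dump(fake_train(old_outputs), f)
    model, retrained_first = load_or_retrain(cache, new_outputs, fake_train)
    _, retrained_second = load_or_retrain(cache, new_outputs, fake_train)
```

The first call finds the six-output pickle, discards it and retrains; the second call reuses the rebuilt file, so the filename never has to change.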
Summary
The enhanced FRS's SPI-donor half carries `gift_aid = 0` for every record, including the ~10k synthetic high-earner rows generated by the SPI income imputation, because the QRF was only trained to predict the six core income components. `gift_aid` was already listed in `SPI_RENAMES` (so SPI's `GIFTAID` column was being prepared as a training input) but it wasn't in `IMPUTATIONS`, so it never reached the model's output set.

As a result, the entire modelled UK population has `gift_aid = 0` everywhere, missing ~£1-1.5bn/yr of Gift Aid tax relief that HMRC actually pays, and any reform touching Gift Aid is currently a no-op against baseline.

Fix
Adds `gift_aid` to `IMPUTATIONS`. Because it's trained jointly with the six income components, each SPI-donor row now carries a Gift Aid figure drawn alongside its income from the same SPI respondent, so high-earner donors get larger-than-zero Gift Aid draws in proportion to how much the underlying SPI respondent claimed.

The cache file is renamed to `income_v2.pkl` so any stale local pickle (which doesn't have `gift_aid` as an output) is bypassed automatically. CI always trains from scratch so this is a no-op there.

Verification
Retrained the QRF on the committed SPI 2020-21 inputs and ran a 10,000-row synthetic SPI-donor-style sample (random age, gender, region) through `.predict`:

- All rows: 5.6% `gift_aid` nonzero share, mean among nonzero £2,492
- `employment_income >= £200k`: 41.4% `gift_aid` nonzero share, mean among nonzero £4,543

Consistent with HMRC higher-rate Gift Aid relief flows and with the share of UK taxpayers claiming Gift Aid (~12% of taxpayers, higher among higher-rate payers).
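A summary like the one above can be computed from a prediction frame with a small helper. This is a sketch only: `gift_aid_summary` is a hypothetical name, the four demo rows are made up, and the only assumption about the model output is that it has `employment_income` and `gift_aid` columns.

```python
import pandas as pd


def gift_aid_summary(df, threshold=200_000):
    """Nonzero share and mean-among-nonzero of gift_aid, overall and above an income threshold."""
    subsets = {
        "all": df,
        f"employment_income >= {threshold}": df[df["employment_income"] >= threshold],
    }
    out = {}
    for label, sub in subsets.items():
        nonzero = sub.loc[sub["gift_aid"] > 0, "gift_aid"]
        out[label] = {
            "nonzero_share": len(nonzero) / len(sub) if len(sub) else float("nan"),
            "mean_nonzero": nonzero.mean() if len(nonzero) else float("nan"),
        }
    # One row per subset, one column per statistic.
    return pd.DataFrame(out).T


# Made-up rows standing in for model predictions.
demo = pd.DataFrame(
    {
        "employment_income": [20_000.0, 50_000.0, 250_000.0, 300_000.0],
        "gift_aid": [0.0, 100.0, 0.0, 4_000.0],
    }
)
summary = gift_aid_summary(demo)
print(summary)
```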
Scope
Does not touch the FRS-side imputation (still only overwrites `dividend_income` there). A proper FRS-side Gift Aid imputation would need an income-conditional model to avoid smearing £1-1.5bn of Gift Aid uniformly across all demographics. That's left for a follow-up once policyengine-uk#1621 lands the second-stage QRF pipeline.

Test plan
- `uvx ruff format --check` on `income.py` clean
- `test_is_parent_from_frs.py` / `test_target_registry.py` pass after edit
- `Test` job passes (includes a full build against the SPI raw data)