From f0d67efde361e42f56855a02f095d30bc9126c4c Mon Sep 17 00:00:00 2001
From: Max Ghenis
Date: Fri, 17 Apr 2026 18:48:35 -0400
Subject: [PATCH] Point CONTRIBUTING.md at the shared PolicyEngine guide

---
 .github/CONTRIBUTING.md                       | 47 +++++++++++++++++--
 .../docs-towncrier-contributing.changed.md    |  1 +
 2 files changed, 44 insertions(+), 4 deletions(-)
 create mode 100644 changelog.d/docs-towncrier-contributing.changed.md

diff --git a/.github/CONTRIBUTING.md b/.github/CONTRIBUTING.md
index 4604959a7..ffc952586 100644
--- a/.github/CONTRIBUTING.md
+++ b/.github/CONTRIBUTING.md
@@ -1,7 +1,46 @@
-## Updating data
+# Contributing to policyengine-uk-data
 
-If your changes present a non-bugfix change to one or more datasets which are cloud-hosted (FRS and EFRS), then please change both the filename and URL (in both the class definition file and in `storage/upload_completed_datasets.py`). This enables us to store historical versions of datasets separately and reproducibly.
+See the [shared PolicyEngine contribution guide](https://github.com/PolicyEngine/.github/blob/main/CONTRIBUTING.md) for cross-repo conventions (towncrier changelog fragments, `uv run`, PR description format, anti-patterns). This file covers policyengine-uk-data specifics.
 
-## Updating the versioning
+## Commands
 
-Please add to `changelog.yaml` and then run `make changelog` before committing the results ONCE in this PR.
+```bash
+make install   # install deps (uv)
+make format    # format (required)
+make download  # download raw FRS + SPI inputs from HF (needs HUGGING_FACE_TOKEN)
+make data      # full dataset build (impute, calibrate, upload)
+make test      # test suite
+uv run pytest policyengine_uk_data/tests/path/to/test.py -v
+```
+
+Python 3.13+. Default branch: `main`. Raw FRS / SPI microdata live on HuggingFace; set `HUGGING_FACE_TOKEN` before running anything that touches the dataset build.
+
+## What lives here
+
+This repo builds the `.h5` files that feed `policyengine-uk`:
+
+- `datasets/frs.py` — raw FRS → PolicyEngine variable mapping
+- `datasets/imputations/` — QRF / other imputations layered on top (income, wealth, consumption, etc.)
+- `datasets/local_areas/` — constituency and local-authority calibration
+- `targets/` — calibration target sources (OBR, DWP, HMRC, ONS, SLC, etc.)
+- `utils/calibrate.py` — the reweighting optimiser
+- `storage/` — raw inputs, intermediate artefacts, published outputs
+
+## Data-protection rules — no exceptions
+
+The enhanced FRS dataset is licensed under strict UK Data Service terms. Violating them risks losing access, which would end PolicyEngine UK.
+
+- **Never upload data to any public location.** The HuggingFace repo `policyengine/policyengine-uk-data-private` is private and authenticated.
+- **Never modify `upload_completed_datasets.py` or `utils/data_upload.py`** to change upload destinations without explicit confirmation from the data controller (currently Nikhil Woodruff).
+- **Never print, log, or output individual-level records.** Aggregates (sums, means, counts, weighted totals) are fine; individual rows are not.
+- **If you see a private/public repo split, assume it is intentional** — ask why before changing it.
+
+## Updating datasets
+
+If your change is a non-bugfix update to a cloud-hosted dataset (FRS, enhanced FRS), bump both the filename and URL in the class definition and in `storage/upload_completed_datasets.py`. That lets us store historical dataset versions separately and reproducibly.
+
+## Repo-specific anti-patterns
+
+- **Don't** hardcode dataset years in variable transforms; use `dataset.time_period` and the uprating pipeline.
+- **Don't** commit large binary artefacts — use HuggingFace storage.
+- **Don't** skip `make test` when touching the imputation or calibration pipeline; full CI rebuilds the dataset and takes ~25 minutes.
diff --git a/changelog.d/docs-towncrier-contributing.changed.md b/changelog.d/docs-towncrier-contributing.changed.md
new file mode 100644
index 000000000..91870f4d9
--- /dev/null
+++ b/changelog.d/docs-towncrier-contributing.changed.md
@@ -0,0 +1 @@
+Point CONTRIBUTING.md at the shared PolicyEngine contribution guide (https://github.com/PolicyEngine/.github) and trim the per-repo file to commands, repo-specific conventions, and anti-patterns. Removes the stale `changelog_entry.yaml` / `make changelog` instructions.
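
Note on the Commands section of the new CONTRIBUTING.md: it asks contributors to set `HUGGING_FACE_TOKEN` before running anything that touches the dataset build. A pre-flight check along the following lines could make that failure mode explicit. This is only a sketch: the function name `require_hf_token` is hypothetical and not part of this repo or its Makefile, and whether such a guard belongs in the Makefile or in a contributor's shell profile is left open.

```bash
# Hypothetical helper (not part of the repo): fail early when the
# HUGGING_FACE_TOKEN environment variable named in CONTRIBUTING.md is
# unset or empty.
require_hf_token() {
  if [ -z "${HUGGING_FACE_TOKEN:-}" ]; then
    # Report on stderr and return non-zero so callers can short-circuit.
    echo "HUGGING_FACE_TOKEN is not set; export it before running the dataset build." >&2
    return 1
  fi
  return 0
}
```

With the function loaded in the current shell, a contributor might chain it as `require_hf_token && make download`, so the download is only attempted when the token is present.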