Skip to content

Add HF destination guard constants and xfail AST detection test#356

Merged
MaxGhenis merged 2 commits intomainfrom
fix/hf-destinations-guard
Apr 17, 2026
Merged

Add HF destination guard constants and xfail AST detection test#356
MaxGhenis merged 2 commits intomainfrom
fix/hf-destinations-guard

Conversation

@MaxGhenis
Copy link
Copy Markdown
Contributor

Summary

This PR addresses bug-hunt finding S1 (HF destination naming inconsistency) — but only by adding detection scaffolding, not by changing any upload destinations.

Context. The bug hunt flagged that storage/upload_private_prerequisites.py:32 uploads UKDS-licensed FRS/LCFS/WAS/ETB/SPI zips with repo="policyengine/policyengine-uk-data" — i.e. the PUBLIC repo — while the file is literally named upload_*private*_prerequisites. On top of that, utils/data_upload.py::upload_data_files defaults hf_repo_name to the PUBLIC repo while its sibling upload_files_to_hf defaults to the PRIVATE repo. The inconsistency is a footgun rather than an active leak today (the HF repo policyengine/policyengine-uk-data returns HTTP 401), but it will eventually bite.

What this PR does:

  1. Adds policyengine_uk_data/utils/hf_destinations.py exposing three constants:
    • PRIVATE_REPO = "policyengine/policyengine-uk-data-private"
    • PUBLIC_REPO = "policyengine/policyengine-uk-data"
    • ALLOWED_REPOS = frozenset({PRIVATE_REPO, PUBLIC_REPO})
      with detailed docstrings on which artefacts belong where, and pointers to CLAUDE.md rule 1.
  2. Adds policyengine_uk_data/tests/test_hf_destinations.py. The main test walks every .py file in the package with ast, finds every upload(...), upload_file(...), upload_files_to_hf(...) and upload_data_files(...) call site, and flags any that passes a string literal (or a non-constant expression) as repo= / hf_repo_name= / repo_id= instead of routing through PRIVATE_REPO / PUBLIC_REPO.

The scanner currently reports these violations:

  • utils/huggingface.py:26upload_file(..., repo_id=<expr>) (variable passthrough).
  • utils/data_upload.py:147upload_files_to_hf(..., hf_repo_name=<expr>) (variable passthrough).
  • storage/upload_private_prerequisites.py:31upload(..., repo='policyengine/policyengine-uk-data', ...)the primary footgun.
  • storage/upload_completed_datasets.py:19upload_data_files(..., hf_repo_name='policyengine/policyengine-uk-data-private', ...) — correct destination, but still a raw literal.

What this PR does NOT do. Per the data-controller policy in CLAUDE.md rule 1 and repeated in MEMORY.md:

NEVER change HF upload destinations in policyengine-uk-data — private/public split is intentional (UK Data Service licence)

no repo= or hf_repo_name= argument value is changed. The detection test is marked @pytest.mark.xfail(reason=..., strict=False) with the message:

Known naming inconsistency; existing destinations intentionally preserved per repo policy. Resolve by renaming the HF repo or migrating scripts — NOT by changing code in place without owner approval.

Ask of maintainers

Please decide how to resolve the naming inconsistency:

  • Option A: rename the HF repos so policyengine-uk-data-private becomes the public one (or vice versa), keeping the code unchanged.
  • Option B: migrate each script through the hf_destinations constants with data-controller review, then remove the xfail decorator.
  • Option C: keep the current state and document that upload_private_prerequisites.py is intentional for the public mirror.

Until that decision lands, this PR leaves the existing destinations untouched and gives the team a single place (the xfail test output) to see exactly which call sites need attention.

Test plan

  • uv run pytest policyengine_uk_data/tests/test_hf_destinations.py -v — constants test passes; detection test xfails as expected with the four known violations listed above.

Finding S1 from the bug hunt.

@MaxGhenis
Copy link
Copy Markdown
Contributor Author

Review — APPROVE (self-PR; leaving as comment)

Safety posture verified. This PR only ADDS three files:

  • changelog.d/hf-destinations-guard.added.md
  • policyengine_uk_data/tests/test_hf_destinations.py
  • policyengine_uk_data/utils/hf_destinations.py

Zero existing files touched, so no repo= or hf_repo_name= value can have changed. The AST walker correctly targets the four upload entry points (upload, upload_file, upload_files_to_hf, upload_data_files) and the three kwargs (repo, hf_repo_name, repo_id); _is_allowed_reference accepts both the hf_destinations.PRIVATE_REPO attribute access and bare PRIVATE_REPO name import. The xfail reason string is firm and direct: "existing destinations intentionally preserved per repo policy ... NOT by changing code in place without owner approval." Constants live in a dedicated module with docstrings pointing to UKDS licence context.

CI: all green.

@MaxGhenis MaxGhenis merged commit fb3b987 into main Apr 17, 2026
3 checks passed
@MaxGhenis MaxGhenis deleted the fix/hf-destinations-guard branch April 17, 2026 16:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant