Add HF destination guard constants and xfail AST detection test#356
Merged
Add HF destination guard constants and xfail AST detection test#356
Conversation
Contributor
Author
Review — APPROVE (self-PR; leaving as comment)Safety posture verified. This PR only ADDS three files:
Zero existing files touched, so no CI: all green. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR addresses bug-hunt finding S1 (HF destination naming inconsistency) — but only by adding detection scaffolding, not by changing any upload destinations.
Context. The bug hunt flagged that
storage/upload_private_prerequisites.py:32uploads UKDS-licensed FRS/LCFS/WAS/ETB/SPI zips withrepo="policyengine/policyengine-uk-data"— i.e. the PUBLIC repo — while the file is literally namedupload_*private*_prerequisites. On top of that,utils/data_upload.py::upload_data_filesdefaultshf_repo_nameto the PUBLIC repo while its siblingupload_files_to_hfdefaults to the PRIVATE repo. The inconsistency is a footgun rather than an active leak today (the HF repopolicyengine/policyengine-uk-datareturns HTTP 401), but it will eventually bite.What this PR does:
policyengine_uk_data/utils/hf_destinations.pyexposing three constants:PRIVATE_REPO = "policyengine/policyengine-uk-data-private"PUBLIC_REPO = "policyengine/policyengine-uk-data"ALLOWED_REPOS = frozenset({PRIVATE_REPO, PUBLIC_REPO})with detailed docstrings on which artefacts belong where, and pointers to CLAUDE.md rule 1.
policyengine_uk_data/tests/test_hf_destinations.py. The main test walks every.pyfile in the package withast, finds everyupload(...),upload_file(...),upload_files_to_hf(...)andupload_data_files(...)call site, and flags any that passes a string literal (or a non-constant expression) asrepo=/hf_repo_name=/repo_id=instead of routing throughPRIVATE_REPO/PUBLIC_REPO.The scanner currently reports these violations:
utils/huggingface.py:26—upload_file(..., repo_id=<expr>)(variable passthrough).utils/data_upload.py:147—upload_files_to_hf(..., hf_repo_name=<expr>)(variable passthrough).storage/upload_private_prerequisites.py:31—upload(..., repo='policyengine/policyengine-uk-data', ...)— the primary footgun.storage/upload_completed_datasets.py:19—upload_data_files(..., hf_repo_name='policyengine/policyengine-uk-data-private', ...)— correct destination, but still a raw literal.What this PR does NOT do. Per the data-controller policy in
CLAUDE.mdrule 1 and repeated inMEMORY.md:no
repo=orhf_repo_name=argument value is changed. The detection test is marked@pytest.mark.xfail(reason=..., strict=False)with the message:Ask of maintainers
Please decide how to resolve the naming inconsistency:
policyengine-uk-data-privatebecomes the public one (or vice versa), keeping the code unchanged.hf_destinationsconstants with data-controller review, then remove thexfaildecorator.upload_private_prerequisites.pyis intentional for the public mirror.Until that decision lands, this PR leaves the existing destinations untouched and gives the team a single place (the
xfailtest output) to see exactly which call sites need attention.Test plan
uv run pytest policyengine_uk_data/tests/test_hf_destinations.py -v— constants test passes; detection test xfails as expected with the four known violations listed above.Finding S1 from the bug hunt.