Archive pipeline runs and intermediate artifacts to HuggingFace #641
Open
Summary
Pipeline run metadata and diagnostics are currently stored on a Modal volume (pipeline-artifacts) under /pipeline/runs/{run_id}/. If the volume is deleted, all run history is lost. Diagnostics are also uploaded to the main data repo (policyengine/policyengine-us-data) under calibration/runs/{run_id}/diagnostics/, but only final diagnostics — not intermediate build artifacts.
Goal
Publish full run records (metadata + diagnostics + intermediate artifacts) to a dedicated HF model repo (PolicyEngine/policyengine-us-data-pipeline). Keep all existing uploads to the main data repo unchanged.
What gets archived
Run metadata & diagnostics (mirrored on every write)
- {run_id}/meta.json
- {run_id}/diagnostics/unified_diagnostics.csv
- {run_id}/diagnostics/calibration_log.csv
- {run_id}/diagnostics/unified_run_config.json
- {run_id}/diagnostics/national_* variants
- {run_id}/diagnostics/validation_results.csv
- {run_id}/diagnostics/national_validation.txt
Intermediate build artifacts (Step 1 — not shipped elsewhere)
- acs_2022.h5, irs_puf_2015.h5, puf_2024.h5
- extended_cps_2024.h5, stratified_extended_cps_2024.h5
- build_log.txt, calibration_log_legacy.csv, uprating_factors.csv
Package metadata (Step 2)
calibration_package_meta.json
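The layout above could be planned with a small mapping step before any upload. The following is a minimal sketch, not the project's code: `plan_repo_paths` and the `artifacts` prefix are hypothetical names, and the issue only specifies the `{run_id}/meta.json` and `{run_id}/diagnostics/` destinations, so where intermediate artifacts and package metadata land is an assumption here.

```python
from pathlib import PurePosixPath

# Diagnostic file names listed in the issue; national_* variants are
# matched by prefix below.
DIAGNOSTIC_NAMES = {
    "unified_diagnostics.csv",
    "calibration_log.csv",
    "unified_run_config.json",
    "validation_results.csv",
}


def plan_repo_paths(run_id, local_files, artifacts_prefix="artifacts"):
    """Map local file paths to destination paths inside the archive repo.

    ASSUMPTION: anything that is not meta.json or a known diagnostic goes
    under {run_id}/{artifacts_prefix}/; the issue does not state this.
    """
    plan = {}
    for local in local_files:
        base = PurePosixPath(local).name
        if base == "meta.json":
            dest = PurePosixPath(run_id) / base
        elif base in DIAGNOSTIC_NAMES or base.startswith("national_"):
            dest = PurePosixPath(run_id) / "diagnostics" / base
        else:
            dest = PurePosixPath(run_id) / artifacts_prefix / base
        plan[local] = str(dest)
    return plan
```

Keeping the mapping separate from the upload call makes the layout easy to check without touching the network.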
Implementation
- New _batched_hf_upload() shared helper in data_upload.py (deduplicates staging + pipeline upload logic)
- New upload_to_pipeline_repo() thin wrapper for the pipeline archival repo
- _mirror_to_pipeline_repo() in pipeline.py: non-fatal subprocess wrapper with timeout, env-var data passing
- _archive_artifacts(): unified artifact archival helper
- write_run_meta() gains mirror param (False in error handlers to prevent hangs)
- All archival is non-fatal: pipeline never fails due to archival issues
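The non-fatal mirroring idea (subprocess, timeout, data passed through an environment variable) can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual `_mirror_to_pipeline_repo()`: the function name `mirror_nonfatal`, the `MIRROR_PAYLOAD` variable, and the stand-in child script are all hypothetical, and the child here only parses its payload instead of uploading anything.

```python
import json
import os
import subprocess
import sys

# Stand-in child script (hypothetical). It reads its input from an
# environment variable rather than argv; a real child would upload
# payload["files"] to the archive repo at this point.
_CHILD = (
    "import json, os\n"
    "payload = json.loads(os.environ['MIRROR_PAYLOAD'])\n"
    "print(len(payload['files']))\n"
)


def mirror_nonfatal(run_id, files, timeout_s=60.0, child_code=_CHILD):
    """Run the mirror step in a subprocess.

    Returns True on success and False on timeout, launch failure, or a
    non-zero exit code. These failures are logged to stderr, never raised.
    """
    env = dict(os.environ)
    env["MIRROR_PAYLOAD"] = json.dumps({"run_id": run_id, "files": list(files)})
    try:
        result = subprocess.run(
            [sys.executable, "-c", child_code],
            env=env,
            timeout=timeout_s,  # child is killed if it exceeds this
            capture_output=True,
            text=True,
        )
    except (subprocess.TimeoutExpired, OSError) as exc:
        print(f"archival skipped for {run_id}: {exc!r}", file=sys.stderr)
        return False
    if result.returncode != 0:
        print(f"archival failed for {run_id}: exit {result.returncode}",
              file=sys.stderr)
        return False
    return True
```

A caller would ignore the return value except for logging, so a slow or failing upload cannot fail the run. An error handler could skip the call entirely, which is the role the mirror=False flag plays in the list above.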
Prerequisites
- Create PolicyEngine/policyengine-us-data-pipeline model repo on HuggingFace (one-time manual step)