Archive pipeline runs and intermediate artifacts to HuggingFace #641

@anth-volk

Summary

Pipeline run metadata and diagnostics are currently stored on a Modal volume (pipeline-artifacts) under /pipeline/runs/{run_id}/. If the volume is deleted, all run history is lost. Diagnostics are also uploaded to the main data repo (policyengine/policyengine-us-data) under calibration/runs/{run_id}/diagnostics/, but that upload covers only the final diagnostics, not intermediate build artifacts.

Goal

Publish full run records (metadata + diagnostics + intermediate artifacts) to a dedicated HuggingFace model repo (PolicyEngine/policyengine-us-data-pipeline). Keep all existing uploads to the main data repo unchanged.

What gets archived

Run metadata & diagnostics (mirrored on every write)

  • {run_id}/meta.json
  • {run_id}/diagnostics/unified_diagnostics.csv
  • {run_id}/diagnostics/calibration_log.csv
  • {run_id}/diagnostics/unified_run_config.json
  • {run_id}/diagnostics/national_* variants
  • {run_id}/diagnostics/validation_results.csv
  • {run_id}/diagnostics/national_validation.txt
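
As a rough illustration of the layout above, the sketch below maps every file under a local run directory to a repo path of the form {run_id}/<relative path>. It assumes the local run directory mirrors the archive layout one-to-one; the function name plan_run_uploads and its parameters are hypothetical and are not part of the pipeline code.

```python
from pathlib import Path


def plan_run_uploads(runs_root, run_id):
    """Return (local_path, path_in_repo) pairs for one run.

    Hypothetical helper: assumes files under <runs_root>/<run_id>/ should
    land in the archive repo under <run_id>/ with the same relative path.
    """
    run_dir = Path(runs_root) / run_id
    pairs = []
    for local in sorted(run_dir.rglob("*")):
        if local.is_file():
            # as_posix() keeps forward slashes regardless of the host OS.
            rel = local.relative_to(run_dir).as_posix()
            pairs.append((local, f"{run_id}/{rel}"))
    return pairs
```

Called as plan_run_uploads("/pipeline/runs", run_id), this would yield entries such as {run_id}/meta.json and {run_id}/diagnostics/calibration_log.csv, which an upload helper could then push in a single batch.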

Intermediate build artifacts (Step 1 — not shipped elsewhere)

  • acs_2022.h5, irs_puf_2015.h5, puf_2024.h5
  • extended_cps_2024.h5, stratified_extended_cps_2024.h5
  • build_log.txt, calibration_log_legacy.csv, uprating_factors.csv

Package metadata (Step 2)

  • calibration_package_meta.json

Implementation

  • New _batched_hf_upload() shared helper in data_upload.py (deduplicates staging + pipeline upload logic)
  • New upload_to_pipeline_repo() thin wrapper for the pipeline archival repo
  • _mirror_to_pipeline_repo() in pipeline.py: non-fatal subprocess wrapper with a timeout; data is passed via environment variables
  • _archive_artifacts(): unified artifact archival helper
  • write_run_meta() gains a mirror parameter (set to False in error handlers to prevent hangs)
  • All archival is non-fatal: the pipeline never fails due to archival issues
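
The non-fatal mirroring idea in the list above can be sketched roughly as follows. This is a minimal illustration, not the actual _mirror_to_pipeline_repo() implementation: the function name mirror_non_fatal, the MIRROR_PAYLOAD variable, and the child_code parameter are all assumptions made for the example. The child process receives its data as JSON in an environment variable, and any timeout, launch failure, or non-zero exit is logged and swallowed.

```python
import json
import os
import subprocess
import sys


def mirror_non_fatal(payload, child_code, timeout_s=60.0):
    """Run child_code in a subprocess; return True on success, never raise.

    Hypothetical sketch: the payload is passed through the MIRROR_PAYLOAD
    environment variable (an assumed name) rather than via argv or stdin.
    """
    env = dict(os.environ)
    env["MIRROR_PAYLOAD"] = json.dumps(payload)
    try:
        result = subprocess.run(
            [sys.executable, "-c", child_code],
            env=env,
            timeout=timeout_s,  # the child is killed if it runs longer
            capture_output=True,
            text=True,
        )
    except (subprocess.TimeoutExpired, OSError) as exc:
        # A hung or unlaunchable upload must not take the pipeline down.
        print(f"WARNING: mirror skipped: {exc}")
        return False
    if result.returncode != 0:
        print(f"WARNING: mirror exited with {result.returncode}: "
              f"{result.stderr.strip()}")
        return False
    return True
```

A caller would ignore the return value or log it. An error handler could skip the call entirely, which corresponds to passing mirror=False to write_run_meta().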

Prerequisites

  • Create PolicyEngine/policyengine-us-data-pipeline model repo on HuggingFace (one-time manual step)
