Skip to content

Improve versioning, add SQL query/result caching, optimize main index query#142

Merged
fedorov merged 11 commits into
mainfrom
improve-versioning
May 8, 2026
Merged

Improve versioning, add SQL query/result caching, optimize main index query#142
fedorov merged 11 commits into
mainfrom
improve-versioning

Conversation

@fedorov

@fedorov fedorov commented May 8, 2026

Copy link
Copy Markdown
Member

No description provided.

fedorov and others added 9 commits May 8, 2026 09:08
…c_index

Joins auxiliary_metadata to expose when each series was first added to IDC
and when it was last revised, enabling version-aware filtering of the index.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces the direct 57M-row instance-level join on SeriesInstanceUID with a
CTE that groups auxiliary_metadata to one row per series first, avoiding the
many-to-many join explosion before GROUP BY.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… CTE

MIN on series_init_idc_version and MAX on series_revised_idc_version are
semantically correct and robust against unexpected intra-series variation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Cache key is the SHA256 of each SQL file, which encodes both query logic
and the BQ dataset version. On cache hit, all three artifacts (.parquet,
_schema.json, .sql) are restored from gs://idc-index-data-cache without
hitting BigQuery. Cache failures fall back to BQ transparently.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add type: ignore[attr-defined] for google.cloud.storage import
  (no stubs available for mypy)
- Add gcs_cache_bucket to both guard conditions so mypy narrows
  the type from str | None to str at the call sites

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
google.cloud.storage causes persistent mypy attr-defined and
import-untyped errors that resist per-line suppression due to ruff
auto-fixing the import form. Use # mypy: ignore-errors at the file
level as a pragmatic workaround until a proper fix is implemented.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Indexes idc-dev-etl.idc_v24_pub.version_metadata, exposing idc_version
and version_timestamp while excluding the version_hash column.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@fedorov fedorov force-pushed the improve-versioning branch from 738bfc8 to dd206f4 Compare May 8, 2026 15:40
@fedorov fedorov changed the title Improve versioning - add series-level column with added/updated version info Improve versioning, add SQL query/result caching, optimize main index query May 8, 2026
fedorov and others added 2 commits May 8, 2026 11:57
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@fedorov fedorov merged commit b031e96 into main May 8, 2026
12 checks passed
@fedorov fedorov deleted the improve-versioning branch May 8, 2026 16:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant