-
Notifications
You must be signed in to change notification settings - Fork 74
Initial snapshot summary metrics support #529
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Initial snapshot summary metrics support #529
Conversation
src/metadata/iceberg_snapshot.cpp
Outdated
| } | ||
| } | ||
|
|
||
| static const std::unordered_map<SnapshotMetricType, string> kSnapshotMetricKeys = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| try { | ||
| value = std::stoll(it->second); | ||
| } catch (...) { | ||
| // Skip invalid metrics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Somewhat matches https://github.com/apache/iceberg/blob/06ebf07e671c3ee2b0d0add1c4dc849ab9cbf630/core/src/main/java/org/apache/iceberg/SnapshotProducer.java#L818-L820, except we discard invalid metrics at earlier stage
| if (it != snapshot_summary.end()) { | ||
| int64_t value; | ||
| try { | ||
| value = std::stoll(it->second); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| static const IcebergSnapshot::metrics_map_t kMetricsWithEmptyTotals {{SnapshotMetricType::TOTAL_DATA_FILES, 0}, | ||
| {SnapshotMetricType::TOTAL_RECORDS, 0}}; | ||
|
|
||
| static IcebergSnapshot::metrics_map_t GetSnapshotMetrics(const IcebergManifest &manifest, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because DuckDB fast appends a manifest for new snapshot, this implementation is simpler than the Java side
|
|
||
| static IcebergSnapshot::metrics_map_t GetSnapshotMetrics(const IcebergManifest &manifest, | ||
| const IcebergSnapshot::metrics_map_t &previous_metrics) { | ||
| IcebergSnapshot::metrics_map_t metrics {{SnapshotMetricType::ADDED_DATA_FILES, manifest.added_files_count}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| {SnapshotMetricType::DELETED_DATA_FILES, manifest.deleted_files_count}, | ||
| {SnapshotMetricType::DELETED_RECORDS, manifest.deleted_rows_count}}; | ||
|
|
||
| auto previous_total_files = previous_metrics.find(SnapshotMetricType::TOTAL_DATA_FILES); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See https://github.com/apache/iceberg/blob/06ebf07e671c3ee2b0d0add1c4dc849ab9cbf630/core/src/main/java/org/apache/iceberg/SnapshotProducer.java#L792 for this combining process with previous snapshot summary (again simpler for us since we can just use the manifest)
| auto manifest_list_path = table_info.BaseFilePath() + "/metadata/snap-" + std::to_string(snapshot_id) + "-" + | ||
| manifest_list_uuid + ".avro"; | ||
|
|
||
| //! Construct the manifest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rearranging - wanted to construct manifest fully first so it can be then used to construct the snapshot (the snapshot now depends on it for metrics). Just thought this read nicer now
| manifest, table_info.table_metadata.GetSnapshotById(new_snapshot.parent_snapshot_id)->metrics); | ||
| } | ||
| } else { | ||
| // If there was no previous snapshot, default the metrics to start totals at 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| yyjson_mut_obj_add_strcpy(doc, snapshot_json, "manifest-list", snapshot.manifest_list.c_str()); | ||
| auto summary_json = yyjson_mut_obj_add_obj(doc, snapshot_json, "summary"); | ||
| yyjson_mut_obj_add_strcpy(doc, summary_json, "operation", snapshot.summary.operation.c_str()); | ||
| for (auto &prop : snapshot.summary.additional_properties) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is similar to what's done to serialize set_properties_update
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keep in mind that the key (prop.first.c_str()) is not copied, and thus has to stay alive until the document gets stringified. I think that contract holds, but I figured I'd mention it just in case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, yeah looking at usages I also think this holds. This contract was also relied on by the set_properties_update serialization before, as in both cases, the keys' lifetimes are tied to the the CommitTableRequest &table parameter (I believe).
…ltr/duckdb-iceberg into sm/snapshot-summary-metrics
…ltr/duckdb-iceberg into sm/snapshot-summary-metrics
Tmonster
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Just one small question right now.
Can you also rebase to v1.4-andium? You may have to do some refactoring since #556 just got merged
| } | ||
| } | ||
|
|
||
| static const std::map<SnapshotMetricType, string> kSnapshotMetricKeys = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: why the k?
Addresses #530.
Iceberg writers tend to include metrics as part of snapshot summaries. This PR begins snapshot metrics support by the DuckDB writer.
Concretely, this PR supports
for writes.
With this PR, DuckDB-written snapshots have: e.g.