Skip to content

Conversation

@smaheshwar-pltr
Copy link

@smaheshwar-pltr smaheshwar-pltr commented Oct 11, 2025

Addresses #530.

Iceberg writers tend to include metrics as part of snapshot summaries. This PR begins snapshot metrics support by the DuckDB writer.

Concretely, this PR supports

  • added-data-files
  • added-records
  • deleted-data-files
  • deleted-records
  • total-data-files
  • total-records

for writes.

With this PR, DuckDB-written snapshots have: e.g.

"summary" : {
      "operation" : "append",
      "added-data-files" : "1",
      "deleted-records" : "0",
      "deleted-data-files" : "0",
      "total-data-files" : "1",
      "added-records" : "2",
      "total-records" : "2"
    },

}
}

static const std::unordered_map<SnapshotMetricType, string> kSnapshotMetricKeys = {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try {
value = std::stoll(it->second);
} catch (...) {
// Skip invalid metrics
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (it != snapshot_summary.end()) {
int64_t value;
try {
value = std::stoll(it->second);
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

static const IcebergSnapshot::metrics_map_t kMetricsWithEmptyTotals {{SnapshotMetricType::TOTAL_DATA_FILES, 0},
{SnapshotMetricType::TOTAL_RECORDS, 0}};

static IcebergSnapshot::metrics_map_t GetSnapshotMetrics(const IcebergManifest &manifest,
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because DuckDB fast appends a manifest for new snapshot, this implementation is simpler than the Java side


static IcebergSnapshot::metrics_map_t GetSnapshotMetrics(const IcebergManifest &manifest,
const IcebergSnapshot::metrics_map_t &previous_metrics) {
IcebergSnapshot::metrics_map_t metrics {{SnapshotMetricType::ADDED_DATA_FILES, manifest.added_files_count},
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{SnapshotMetricType::DELETED_DATA_FILES, manifest.deleted_files_count},
{SnapshotMetricType::DELETED_RECORDS, manifest.deleted_rows_count}};

auto previous_total_files = previous_metrics.find(SnapshotMetricType::TOTAL_DATA_FILES);
Copy link
Author

@smaheshwar-pltr smaheshwar-pltr Oct 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See https://github.com/apache/iceberg/blob/06ebf07e671c3ee2b0d0add1c4dc849ab9cbf630/core/src/main/java/org/apache/iceberg/SnapshotProducer.java#L792 for this combining process with previous snapshot summary (again simpler for us since we can just use the manifest)

auto manifest_list_path = table_info.BaseFilePath() + "/metadata/snap-" + std::to_string(snapshot_id) + "-" +
manifest_list_uuid + ".avro";

//! Construct the manifest
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rearranging - wanted to construct manifest fully first so it can be then used to construct the snapshot (the snapshot now depends on it for metrics). Just thought this read nicer now

manifest, table_info.table_metadata.GetSnapshotById(new_snapshot.parent_snapshot_id)->metrics);
}
} else {
// If there was no previous snapshot, default the metrics to start totals at 0
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yyjson_mut_obj_add_strcpy(doc, snapshot_json, "manifest-list", snapshot.manifest_list.c_str());
auto summary_json = yyjson_mut_obj_add_obj(doc, snapshot_json, "summary");
yyjson_mut_obj_add_strcpy(doc, summary_json, "operation", snapshot.summary.operation.c_str());
for (auto &prop : snapshot.summary.additional_properties) {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is similar to what's done to serialize set_properties_update

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep in mind that the key (prop.first.c_str()) is not copied, and thus has to stay alive until the document gets stringified. I think that contract holds, but I figured I'd mention it just in case

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, yeah looking at usages I also think this holds. This contract was also relied on by the set_properties_update serialization before, as in both cases, the keys' lifetimes are tied to the the CommitTableRequest &table parameter (I believe).

@smaheshwar-pltr smaheshwar-pltr marked this pull request as ready for review October 12, 2025 14:29
@smaheshwar-pltr smaheshwar-pltr changed the title Basic snapshot summary metrics Initial snapshot summary metrics support Oct 12, 2025
Copy link
Collaborator

@Tmonster Tmonster left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Just one small question right now.
Can you also rebase to v1.4-andium? You may have to do some refactoring since #556 just got merged

}
}

static const std::map<SnapshotMetricType, string> kSnapshotMetricKeys = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: why the k?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants