Skip to content

Conversation

@PingLiuPing
Copy link
Collaborator

@PingLiuPing PingLiuPing commented Nov 7, 2025

Introduces infrastructure for evaluating Iceberg partition transforms in Velox.
Added TransformExprBuilder which converts Iceberg partition specifications into Velox expressions
Added TransformEvaluator which evaluates multiple transform expressions in a single pass using compiled ExprSet. It automatically register Iceberg function in first call.
Added IcebergPartitionPath which converts partition keys to their string representation for directory paths following Iceberg specification (handles dates, timestamps, binary data encoding)
Added test cases to cover the transform evaluation result.
Update iceberg CMake file to link Iceberg functions library.

@netlify
Copy link

netlify bot commented Nov 7, 2025

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 8047d9b
🔍 Latest deploy log https://app.netlify.com/projects/meta-velox/deploys/690fa7794b77fa000894725d

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 7, 2025
@PingLiuPing PingLiuPing requested review from mbasmanova and removed request for majetideepak November 7, 2025 15:20
@PingLiuPing
Copy link
Collaborator Author

@mbasmanova This PR continues from #15035 (review).
In next PR, I plan to include IcebergPartitionIdGenerator and do some refactor on Hive PartitionIdGenerator, I roughly count the lines which should be also around 700 lines. After that, all the preliminary work should be complete, and I’ll submit another PR to integrate them with IcebergDataSink.

@mbasmanova
Copy link
Contributor

Ci is red


/// Converts a partition value to its string representation for use in
/// partition directory path. The format follows the Iceberg specification
/// for partition path encoding.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a link to the spec? Would be nice to add here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comment.
I cannot find an explicit document about this. I will revise the comment.
I followed the Iceberg java code to implement this.

explicit IcebergPartitionPath(TransformType transformType)
: transformType_(transformType) {}

~IcebergPartitionPath() override = default;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not needed

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.


using HivePartitionUtil::toPartitionString;

std::string toPartitionString(int32_t value, const TypePtr& type)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps, document the format of the result

/// Converts a partition value to its string representation for use in
/// partition directory path. The format follows the Iceberg specification
/// for partition path encoding.
class IcebergPartitionPath : public HivePartitionUtil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to extract this class into a separate PR and add a test?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for reviewing.

I will separate this to a new PR.

/// type, and optional parameter (e.g., bucket count, truncate width).
/// @param inputFieldName Name of the source column in the input RowVector.
/// @return Typed expression representing the transform.
static core::TypedExprPtr toExpression(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to expose this API?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I moved it to anonymous namespace.


namespace {

constexpr char const* kIcebergFunctionPrefix{"iceberg_"};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move this next to its only use

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

class TransformTest : public test::IcebergTestBase {
protected:
template <typename T>
VectorPtr createFlatVector(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use VectorMaker or helper methods from VectorTestBase?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Changed to VectorMaker


// Create expected vector and verify results.
auto expectedVector = createFlatVector<OUT>(expectedValues);
ASSERT_EQ(resultVector[0]->size(), expectedValues.size());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use assertEqualVectors or similar from VectorTestBase.h

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

@PingLiuPing PingLiuPing force-pushed the lp_iceberg_transform_eval branch from ff2e9e0 to ae18463 Compare November 7, 2025 19:47
case TransformType::kTruncate:
return type;
}
VELOX_UNREACHABLE("Unknown transform type");
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mbasmanova Have to add this to fix the building break on Linux. Macos is fine without this change.

Comment on lines 32 to 33
const std::vector<std::optional<IN>>& inputValues,
const std::vector<std::optional<OUT>>& expectedValues,
Copy link
Contributor

@mbasmanova mbasmanova Nov 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we replace these arguments with RowVectorPtr and use makeFlatVector and similar in the callers?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Just updated.

@PingLiuPing PingLiuPing force-pushed the lp_iceberg_transform_eval branch from ae18463 to d007247 Compare November 7, 2025 22:15
protected:
void testTransform(
const IcebergPartitionSpec::Field& field,
const RowVectorPtr& inputRowVector,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: inputRowVector -> input, expectedRowVector -> expected

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

auto transformExprs = TransformExprBuilder::toExpressions(
spec,
std::vector<column_index_t>{0},
asRowType(inputRowVector->type()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

input->rowType()

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

const RowVectorPtr& inputRowVector,
const RowVectorPtr& expectedRowVector) {
// Build and evaluate transform expressions.
auto spec = std::make_shared<const IcebergPartitionSpec>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we test single field specs only? Why not test multi-field specs?

Copy link
Collaborator Author

@PingLiuPing PingLiuPing Nov 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the comment.
I added those test one by one when I implement the transformation logic one by one, and type by type.
I will refactor it to support test multiple transforms.

{makeFlatVector<StringView>(
{StringView("\x01\x02\x03", 3), StringView("", 0)},
VARBINARY())}),
makeRowVector({makeFlatVector<StringView>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drop StringView or use std::string

makeFlatVector<std::string>({"foo", "bar"})

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

@PingLiuPing PingLiuPing force-pushed the lp_iceberg_transform_eval branch from d007247 to 8047d9b Compare November 8, 2025 20:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants