Skip to content

Conversation

@liamzwbao
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Implement array shredding into List/LargeList/ListView/LargeListView to close the gaps in shred_variant. Part of the changes lay the groundwork for adding variant_get support for list types in a follow-up.

Are these changes tested?

Yes

Are there any user-facing changes?

New shredding types supported

@github-actions github-actions bot added the parquet-variant parquet-variant* crates label Nov 12, 2025
@liamzwbao liamzwbao marked this pull request as ready for review November 12, 2025 23:53
@liamzwbao
Copy link
Contributor Author

Hi @scovich, you might be interested

Comment on lines +286 to +291
enum ArrayVariantToArrowRowBuilder<'a> {
List(VariantToListArrowRowBuilder<'a, i32>),
LargeList(VariantToListArrowRowBuilder<'a, i64>),
ListView(VariantToListViewArrowRowBuilder<'a, i32>),
LargeListView(VariantToListViewArrowRowBuilder<'a, i64>),
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might be able to move this into variant_to_arrow, and variant_get could get the support out of the box. Given the size of this PR, I would do that as a followup.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great!

@scovich
Copy link
Contributor

scovich commented Nov 17, 2025

Hi @scovich, you might be interested

Definitely interested, but a bit overbooked last week. Hoping I can get to it this week.

# Conflicts:
#	parquet-variant-compute/src/shred_variant.rs
match value {
Variant::List(list) => {
self.value_builder.append_null();
self.typed_value_builder.append_value(list)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Double checking -- if I try to shred as List<i32> and I encounter a variant array [..., "hi", ...], the bad entry will either become NULL or cause an error, depending on cast options?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems the hi will be located in typed_value.value

Copy link
Contributor Author

@liamzwbao liamzwbao Nov 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems the hi will be located in typed_value.value

that's right. For example, an variant array [1, 2, 3, 1, "two", Variant::Null] would become

typed_value: [1, 2, 3, 1, null, null]
value: [null, null, null, null, "two", Variant::Null]

Can refer to test_array_shredding_as_list as an example

Comment on lines +313 to +316
List(VariantToListArrowRowBuilder<'a, i32>),
LargeList(VariantToListArrowRowBuilder<'a, i64>),
ListView(VariantToListViewArrowRowBuilder<'a, i32>),
LargeListView(VariantToListViewArrowRowBuilder<'a, i64>),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we introduce a ListLikeArrayBuilder trait (**) that encapsulates the (minimal) differences between these four types, so that ArrayVariantToArrowRowBuilder becomes a generic struct instead of an enum?

(**) c.f. StringLikeArrayBuilder that serves the same purpose for strings

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A quick analysis suggests the trait needs:

  • An associated type: type Offset: OffsetSizeTrait
  • A constructor: fn try_new(...) -> Result<Self>
  • Helper functions to support append_null and append_value (nulls, offsets, etc)
  • A finisher: fn finish(self) -> Result<ArrayRef>

Two trait implementations (one for lists and one for list views), both generic over Offset

And from there, the outer builder should be able to implement its own logic just once instead of four times.

Double check tho -- the above is a very rough sketch. The goal is to minimize boilerplate and duplication, using a careful selection of trait methods that capture the essential differences between lists and list views.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to refactor the code using ListLikeArrayBuilder, but it doesn’t help much (draft PR). Plus, it would also force the builder into a boxed dyn trait, adding overhead versus the enum approach.

I think the current impl actaully follow what we did in varaint_to_arrow, the code structure looks like below

Variant (VariantToShreddedVariantRowBuilder)
├─ Primitive (VariantToShreddedPrimitiveVariantRowBuilder)
│  ├─ Int/Float group (PrimitiveVariantToArrowRowBuilder::VariantToPrimitiveArrowRowBuilder):
│  │    Null, Boolean, Int8/16/32/64, UInt8/16/32/64,
│  │    Float16/32/64, Decimal32/64/128/256
│  ├─ Time-like (PrimitiveVariantToArrowRowBuilder::VariantToTimestampArrowRowBuilder):
│  │    Date32, Time64(us), Timestamp(us/ns) [tz or ntz]
│  └─ String/Binary (PrimitiveVariantToArrowRowBuilder::VariantToStringArrowRowBuilder/VariantToBinaryArrowRowBuilder):
│       Utf8, Utf8View, Binary, BinaryView
├─ Array (VariantToShreddedArrayVariantRowBuilder)
│  ├─ List          (ArrayVariantToArrowRowBuilder::VariantToListArrowRowBuilder)
│  ├─ LargeList     (ArrayVariantToArrowRowBuilder::VariantToListArrowRowBuilder)
│  ├─ ListView      (ArrayVariantToArrowRowBuilder::VariantToListViewArrowRowBuilder)
│  └─ LargeListView (ArrayVariantToArrowRowBuilder::VariantToListViewArrowRowBuilder)
│      └─ each element uses make_variant_to_shredded_variant_arrow_row_builder recursively
└─ Object (VariantToShreddedObjectVariantRowBuilder)

StringLikeArrayBuilder and BinaryLikeArrayBuilder mainly simplify VariantToStringArrowRowBuilder and VariantToBinaryArrowRowBuilder but still need dispatch from PrimitiveVariantToArrowRowBuilder. In theory, ListLikeArrayBuilder could simplify nulls, offsets, and sizes via GenericList(View)Builder, but we’d need a lot of adapters, because element_builder is Box<VariantToShreddedVariantRowBuilder<'a>> and doesn’t implement ArrayBuilder directly to make the trait work in our case.

match value {
Variant::List(list) => {
self.value_builder.append_null();
self.typed_value_builder.append_value(list)?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems the hi will be located in typed_value.value

Ok(())
}

fn append_value(&mut self, value: Variant<'_, '_>) -> Result<bool> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to change the parameter value to something else, currently, there will be two value in line 286, and this may lead some confusion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, let me think about it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename value as variant

}

#[test]
fn test_array_shredding_as_list() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice tests!
Not sure if we need to extract some common logic for these tests. They share some of the same logic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet-variant parquet-variant* crates

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Variant] Support array shredding into List/LargeList/ListView/LargeListView

4 participants