-
Notifications
You must be signed in to change notification settings - Fork 1.1k
[Variant] Support array shredding into List/LargeList/ListView/LargeListView
#8831
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Hi @scovich, you might be interested |
| enum ArrayVariantToArrowRowBuilder<'a> { | ||
| List(VariantToListArrowRowBuilder<'a, i32>), | ||
| LargeList(VariantToListArrowRowBuilder<'a, i64>), | ||
| ListView(VariantToListViewArrowRowBuilder<'a, i32>), | ||
| LargeListView(VariantToListViewArrowRowBuilder<'a, i64>), | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might be able to move this into variant_to_arrow, and variant_get could get the support out of the box. Given the size of this PR, I would do that as a followup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great!
Definitely interested, but a bit overbooked last week. Hoping I can get to it this week. |
# Conflicts: # parquet-variant-compute/src/shred_variant.rs
| match value { | ||
| Variant::List(list) => { | ||
| self.value_builder.append_null(); | ||
| self.typed_value_builder.append_value(list)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Double checking -- if I try to shred as List<i32> and I encounter a variant array [..., "hi", ...], the bad entry will either become NULL or cause an error, depending on cast options?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems the hi will be located in typed_value.value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems the
hiwill be located intyped_value.value
that's right. For example, an variant array [1, 2, 3, 1, "two", Variant::Null] would become
typed_value: [1, 2, 3, 1, null, null]
value: [null, null, null, null, "two", Variant::Null]Can refer to test_array_shredding_as_list as an example
| List(VariantToListArrowRowBuilder<'a, i32>), | ||
| LargeList(VariantToListArrowRowBuilder<'a, i64>), | ||
| ListView(VariantToListViewArrowRowBuilder<'a, i32>), | ||
| LargeListView(VariantToListViewArrowRowBuilder<'a, i64>), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we introduce a ListLikeArrayBuilder trait (**) that encapsulates the (minimal) differences between these four types, so that ArrayVariantToArrowRowBuilder becomes a generic struct instead of an enum?
(**) c.f. StringLikeArrayBuilder that serves the same purpose for strings
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A quick analysis suggests the trait needs:
- An associated type:
type Offset: OffsetSizeTrait - A constructor:
fn try_new(...) -> Result<Self> - Helper functions to support
append_nullandappend_value(nulls, offsets, etc) - A finisher:
fn finish(self) -> Result<ArrayRef>
Two trait implementations (one for lists and one for list views), both generic over Offset
And from there, the outer builder should be able to implement its own logic just once instead of four times.
Double check tho -- the above is a very rough sketch. The goal is to minimize boilerplate and duplication, using a careful selection of trait methods that capture the essential differences between lists and list views.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to refactor the code using ListLikeArrayBuilder, but it doesn’t help much (draft PR). Plus, it would also force the builder into a boxed dyn trait, adding overhead versus the enum approach.
I think the current impl actaully follow what we did in varaint_to_arrow, the code structure looks like below
Variant (VariantToShreddedVariantRowBuilder)
├─ Primitive (VariantToShreddedPrimitiveVariantRowBuilder)
│ ├─ Int/Float group (PrimitiveVariantToArrowRowBuilder::VariantToPrimitiveArrowRowBuilder):
│ │ Null, Boolean, Int8/16/32/64, UInt8/16/32/64,
│ │ Float16/32/64, Decimal32/64/128/256
│ ├─ Time-like (PrimitiveVariantToArrowRowBuilder::VariantToTimestampArrowRowBuilder):
│ │ Date32, Time64(us), Timestamp(us/ns) [tz or ntz]
│ └─ String/Binary (PrimitiveVariantToArrowRowBuilder::VariantToStringArrowRowBuilder/VariantToBinaryArrowRowBuilder):
│ Utf8, Utf8View, Binary, BinaryView
├─ Array (VariantToShreddedArrayVariantRowBuilder)
│ ├─ List (ArrayVariantToArrowRowBuilder::VariantToListArrowRowBuilder)
│ ├─ LargeList (ArrayVariantToArrowRowBuilder::VariantToListArrowRowBuilder)
│ ├─ ListView (ArrayVariantToArrowRowBuilder::VariantToListViewArrowRowBuilder)
│ └─ LargeListView (ArrayVariantToArrowRowBuilder::VariantToListViewArrowRowBuilder)
│ └─ each element uses make_variant_to_shredded_variant_arrow_row_builder recursively
└─ Object (VariantToShreddedObjectVariantRowBuilder)
StringLikeArrayBuilder and BinaryLikeArrayBuilder mainly simplify VariantToStringArrowRowBuilder and VariantToBinaryArrowRowBuilder but still need dispatch from PrimitiveVariantToArrowRowBuilder. In theory, ListLikeArrayBuilder could simplify nulls, offsets, and sizes via GenericList(View)Builder, but we’d need a lot of adapters, because element_builder is Box<VariantToShreddedVariantRowBuilder<'a>> and doesn’t implement ArrayBuilder directly to make the trait work in our case.
| match value { | ||
| Variant::List(list) => { | ||
| self.value_builder.append_null(); | ||
| self.typed_value_builder.append_value(list)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems the hi will be located in typed_value.value
| Ok(()) | ||
| } | ||
|
|
||
| fn append_value(&mut self, value: Variant<'_, '_>) -> Result<bool> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to change the parameter value to something else, currently, there will be two value in line 286, and this may lead some confusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, let me think about it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename value as variant
| } | ||
|
|
||
| #[test] | ||
| fn test_array_shredding_as_list() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice tests!
Not sure if we need to extract some common logic for these tests. They share some of the same logic.
Which issue does this PR close?
List/LargeList/ListView/LargeListView#8830.Rationale for this change
What changes are included in this PR?
Implement array shredding into
List/LargeList/ListView/LargeListViewto close the gaps inshred_variant. Part of the changes lay the groundwork for addingvariant_getsupport for list types in a follow-up.Are these changes tested?
Yes
Are there any user-facing changes?
New shredding types supported