[fix] Fix oversized BYTE_ARRAY parquet data pages#613
Conversation
| value_offset += batch_num_spaced_values; | ||
| }; | ||
|
|
||
| auto WriteChunk = [&](int64_t offset, int64_t batch_size, bool check_page) { |
There was a problem hiding this comment.
When int32_t overflows, the row group size has already far exceeded the default 128 MB limit, and the modified WriteChunk still cannot solve this problem.
In addition, the Parquet writer expects the written batch size to be 40 MB, as referenced here:
bolt/bolt/dwio/parquet/writer/Writer.h
Line 200 in 95bf0f9
Can we solve this issue by modifying the logic in splitWriteRecordBatch? This would avoid both excessively large page sizes and excessively large row group sizes, while making the change more controlled.
There was a problem hiding this comment.
The problem fixed here is mainly about the encoded data page size. writeBatchBytes limits the input batch before Parquet encoding/compression, while this issue happens after plain BYTE_ARRAY values are appended into the encoder. Therefore, even with a 40MB write batch limit, a single encoded data page can still become oversized.
The int32 guard here is only a safety check for invalid Parquet PageHeader sizes. The main fix is to split plain BYTE_ARRAY writes before appending too much encoded data into the page encoder. We can improve splitWriteRecordBatch separately for row group/input batch control, but I think the page-level split is still needed.
There was a problem hiding this comment.
Could you check whether this has any performance impact and provide benchmark results?
There was a problem hiding this comment.
Sure, I can check the performance impact for normal cases.
For the overflow case, benchmark results may not be very meaningful since the original behavior could already write corrupted data when the page size exceeds int32_t. I will mainly verify that regular BYTE_ARRAY writes are not noticeably affected.
There was a problem hiding this comment.
Yes, we need to verify the performance when the page size is normal and the WriteChunk method does not hit the fast path but enters the while loop.
There was a problem hiding this comment.
The current change keeps a fast path for the common flat/non-null case when the estimated encoded size is within the page limit.
I will add benchmark results for the normal page-size case where WriteChunk cannot take this fast path and falls back to the while-loop logic, e.g. with def levels / nulls.
There was a problem hiding this comment.
I added benchmarks for the normal page-size cases, including plain-encoded VARCHAR and nested ARRAY.
The benchmark results did reveal unexpected overhead in the common case, so I moved the fast-path check earlier: we now first use Arrow offsets to conservatively estimate whether the whole batch can fit in the current page. If it can, we write the batch directly and avoid entering the per-level while loop. The precise per-level splitting loop is only used when the batch is close to the page-size limit.
This keeps the oversized-page fix from affecting unrelated/common write paths while still preserving the correctness guard for pages that may exceed the Parquet PageHeader int32 size limit.
There was a problem hiding this comment.
I still see some residual performance risk in the cases where the fast path is not applicable, especially nullable VARCHAR/VARBINARY and nested ARRAY. In those paths, WriteChunk enters the per-level loop and then WriteSubChunk scans levels again via MaybeCalculateValidityBits / WriteLevelsSpaced, so the no-split normal case may pay an extra full scan.
Current Bolt Parquet writer has no performance advantage over the Java writer for string types. I think checking whether data_pagesize_ is exceeded at every level is too strict and may introduce performance issues.
We can refer to Arrow’s implementation: https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_writer.cc. Arrow checks whether int32_t overflows, but it does not check the page size level by level.
6cacc43 to
faab616
Compare
faab616 to
182e5fe
Compare
What problem does this PR solve?
This PR fixes oversized Parquet data pages for
BYTE_ARRAYcolumns, includingVARCHARandVARBINARY.Previously, plain
BYTE_ARRAYwrites appended a full write batch into the encoder before checking the data page size. If one batch contained enough encoded bytes, the writer could produce a data page much larger than the configured page size. In extreme cases, page sizes could exceed the ParquetPageHeaderint32 fields.This PR:
BYTE_ARRAYwrites by encoded byte size before appending to the encoder.VARCHAR,VARBINARY, and dictionary fallback.Issue Number: close #612
Type of Change
Description
This PR fixes oversized BYTE_ARRAY Parquet data pages by splitting plain-encoded BYTE_ARRAY writes before the page size can exceed the Parquet
PageHeaderint32 limit.Changes include:
PageHeaderint32 fields.Performance Impact
[] No Impact: This change does not affect the critical path (e.g., build system, doc, error handling).
Positive Impact: I have run benchmarks.
Click to view Benchmark Results
Negative Impact: Explained below (e.g., trade-off for correctness).
Release Note
Please describe the changes in this PR
Release Note:
Checklist (For Author)
Breaking Changes
No
Yes (Description: ...)
Click to view Breaking Changes