[fix] Fix oversized BYTE_ARRAY parquet data pages by Weixin-Xu · Pull Request #613 · bytedance/bolt

Weixin-Xu · 2026-06-01T09:09:59Z

What problem does this PR solve?

This PR fixes oversized Parquet data pages for BYTE_ARRAY columns, including VARCHAR and VARBINARY.

Previously, plain BYTE_ARRAY writes appended a full write batch into the encoder before checking the data page size. If one batch contained enough encoded bytes, the writer could produce a data page much larger than the configured page size. In extreme cases, page sizes could exceed the Parquet PageHeader int32 fields.

This PR:

Splits plain BYTE_ARRAY writes by encoded byte size before appending to the encoder.
Uses Arrow binary offsets as a fast path for flat batches, avoiding per-value scanning unless the batch may exceed the page limit.
Respects record boundaries when page boundaries must align with records.
Adds int32 guards for data and dictionary page header sizes.
Adds writer tests for VARCHAR, VARBINARY, and dictionary fallback.

Issue Number: close #612

Type of Change

🐛 Bug fix (non-breaking change which fixes an issue)
✨ New feature (non-breaking change which adds functionality)
🚀 Performance improvement (optimization)
⚠️ Breaking change (fix or feature that would cause existing functionality to change)
🔨 Refactoring (no logic changes)
🔧 Build/CI or Infrastructure changes
📝 Documentation only

Description

This PR fixes oversized BYTE_ARRAY Parquet data pages by splitting plain-encoded BYTE_ARRAY writes before the page size can exceed the Parquet PageHeader int32 limit.

Changes include:

Validate dictionary and data page sizes before writing them into Parquet PageHeader int32 fields.
Split plain-encoded BYTE_ARRAY chunks by encoded byte size for VARCHAR / VARBINARY.
Preserve dictionary encoding behavior and apply BYTE_ARRAY page splitting after dictionary fallback to plain encoding.
Use Arrow offsets to cheaply estimate whether the current batch can fit in the current page.
Fall back to precise per-level splitting only when the batch is close to the data page byte limit.
Add tests for VARCHAR, VARBINARY, and dictionary fallback cases.
Fix the Parquet writer benchmark harness to honor folly benchmark iterations and avoid optimized-away results.

Performance Impact

[] No Impact: This change does not affect the critical path (e.g., build system, doc, error handling).

Positive Impact: I have run benchmarks.

Click to view Benchmark Results

Benchmark target:
bolt_dwio_parquet_writer_benchmark

Build:
cmake --build --preset conan-release --target bolt_dwio_parquet_writer_benchmark

Baseline:
implementation before the BYTE_ARRAY page-splitting refactor/optimization

Current:
this PR

Note:
The benchmark harness was fixed to honor folly benchmark iterations.
Without this, some fast cases reported 0.00fs / Infinity and were not reliable.

Targeted BYTE_ARRAY non-dictionary path:
--bm_regex='VarcharPlain.*no_dict|VarcharListPlain.*no_dict'
--bm_max_secs=1
--bm_min_usec=100000

VarcharPlain 4k null0          15.90ms -> 13.55ms   -14.8%
VarcharPlain 32k null0        116.61ms -> 95.79ms   -17.8%
VarcharPlain 256k null0         1.04s  -> 870.75ms  -15.9%

VarcharPlain 4k null20         18.66ms -> 15.82ms   -15.2%
VarcharPlain 32k null20       141.76ms -> 120.16ms  -15.2%
VarcharPlain 256k null20        1.20s  -> 1.03s     -14.1%

VarcharListPlain 4k null0     114.51ms -> 101.46ms  -11.4%
VarcharListPlain 32k null0    906.03ms -> 854.19ms  -5.7%
VarcharListPlain 256k null0     7.82s  -> 7.12s     -9.0%

VarcharListPlain 4k null20    105.21ms -> 94.35ms   -10.3%
VarcharListPlain 32k null20   839.07ms -> 821.86ms  -2.1%
VarcharListPlain 256k null20    6.99s  -> 6.30s     -9.8%

Additional existing writer benchmark coverage:
--bm_regex='Varchar_batch|BigInt_batch|Double_batch|ShortDecimalType_batch|LongDecimalType_batch|Map_batch|List_batch'
--bm_max_secs=1
--bm_min_usec=100000

Covered existing benchmark types:
- Varchar dictionary encoding
- BigInt
- Double
- ShortDecimalType
- LongDecimalType
- Map
- List

Result:
No clear performance regression was observed in non-BYTE_ARRAY-targeted paths.
Some large complex-type cases showed several-percent run-to-run variance, so suspicious cases were rerun individually.

List-only rerun:
List_batch_256k_dict    3972.93ms -> 3932.61ms   -1.0%
List_batch_1M_dict        17.13s  -> 16.69s      -2.5%

The page-header size check fast path is kept lightweight:
- normal path is inline
- overflow branch is guarded by ARROW_PREDICT_FALSE
- throw helper is marked [[noreturn]]

Negative Impact: Explained below (e.g., trade-off for correctness).

Release Note

Please describe the changes in this PR

Release Note:

Release Note:
- Fixed oversized BYTE_ARRAY Parquet data pages by splitting plain-encoded VARCHAR / VARBINARY pages before they exceed Parquet PageHeader int32 size limits.
- Optimized the common BYTE_ARRAY write path to avoid per-level splitting checks unless the batch is close to the page size limit.

Checklist (For Author)

I have added/updated unit tests (ctest).
I have verified the code with local build (Release/Debug).
I have run clang-format / linters.
(Optional) I have run Sanitizers (ASAN/TSAN) locally for complex C++ changes.
No need to test or manual test.

Breaking Changes

No

Yes (Description: ...)

Click to view Breaking Changes

Breaking Changes:
- Description of the breaking change.
- Possible solutions or workarounds.
- Any other relevant information.

luozenglin · 2026-06-01T10:56:02Z

    value_offset += batch_num_spaced_values;
  };

+  auto WriteChunk = [&](int64_t offset, int64_t batch_size, bool check_page) {


When int32_t overflows, the row group size has already far exceeded the default 128 MB limit, and the modified WriteChunk still cannot solve this problem.

In addition, the Parquet writer expects the written batch size to be 40 MB, as referenced here:

bolt/bolt/dwio/parquet/writer/Writer.h

Line 200 in 95bf0f9

uint64_t writeBatchBytes = 40 * 1024 * 1024; // 40M

Can we solve this issue by modifying the logic in splitWriteRecordBatch? This would avoid both excessively large page sizes and excessively large row group sizes, while making the change more controlled.

The problem fixed here is mainly about the encoded data page size. writeBatchBytes limits the input batch before Parquet encoding/compression, while this issue happens after plain BYTE_ARRAY values are appended into the encoder. Therefore, even with a 40MB write batch limit, a single encoded data page can still become oversized.
The int32 guard here is only a safety check for invalid Parquet PageHeader sizes. The main fix is to split plain BYTE_ARRAY writes before appending too much encoded data into the page encoder. We can improve splitWriteRecordBatch separately for row group/input batch control, but I think the page-level split is still needed.

Could you check whether this has any performance impact and provide benchmark results?

Sure, I can check the performance impact for normal cases.

For the overflow case, benchmark results may not be very meaningful since the original behavior could already write corrupted data when the page size exceeds int32_t. I will mainly verify that regular BYTE_ARRAY writes are not noticeably affected.

Yes, we need to verify the performance when the page size is normal and the WriteChunk method does not hit the fast path but enters the while loop.

The current change keeps a fast path for the common flat/non-null case when the estimated encoded size is within the page limit.

I will add benchmark results for the normal page-size case where WriteChunk cannot take this fast path and falls back to the while-loop logic, e.g. with def levels / nulls.

I added benchmarks for the normal page-size cases, including plain-encoded VARCHAR and nested ARRAY.

The benchmark results did reveal unexpected overhead in the common case, so I moved the fast-path check earlier: we now first use Arrow offsets to conservatively estimate whether the whole batch can fit in the current page. If it can, we write the batch directly and avoid entering the per-level while loop. The precise per-level splitting loop is only used when the batch is close to the page-size limit.

This keeps the oversized-page fix from affecting unrelated/common write paths while still preserving the correctness guard for pages that may exceed the Parquet PageHeader int32 size limit.

I still see some residual performance risk in the cases where the fast path is not applicable, especially nullable VARCHAR/VARBINARY and nested ARRAY. In those paths, WriteChunk enters the per-level loop and then WriteSubChunk scans levels again via MaybeCalculateValidityBits / WriteLevelsSpaced, so the no-split normal case may pay an extra full scan.

Current Bolt Parquet writer has no performance advantage over the Java writer for string types. I think checking whether data_pagesize_ is exceeded at every level is too strict and may introduce performance issues.

We can refer to Arrow’s implementation: https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_writer.cc. Arrow checks whether int32_t overflows, but it does not check the page size level by level.

Weixin-Xu changed the title ~~Fix oversized BYTE_ARRAY parquet data pages~~ [fix] Fix oversized BYTE_ARRAY parquet data pages Jun 1, 2026

Weixin-Xu requested review from fzhedu, liushengxuan and luozenglin June 1, 2026 09:12

luozenglin reviewed Jun 1, 2026

View reviewed changes

Weixin-Xu force-pushed the fix_parquet_writer_oversized_bytes branch from 6cacc43 to faab616 Compare June 2, 2026 02:24

Fix oversized BYTE_ARRAY parquet data pages

182e5fe

Weixin-Xu force-pushed the fix_parquet_writer_oversized_bytes branch from faab616 to 182e5fe Compare June 2, 2026 02:41

Weixin-Xu added 2 commits June 2, 2026 16:18

Optimize parquet writer page size checks

a9d1bee

format code

46f7b08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[fix] Fix oversized BYTE_ARRAY parquet data pages#613

[fix] Fix oversized BYTE_ARRAY parquet data pages#613
Weixin-Xu wants to merge 3 commits into
bytedance:mainfrom
Weixin-Xu:fix_parquet_writer_oversized_bytes

Weixin-Xu commented Jun 1, 2026 •

edited

Loading

Uh oh!

luozenglin Jun 1, 2026

Uh oh!

Weixin-Xu Jun 1, 2026 •

edited

Loading

Uh oh!

luozenglin Jun 2, 2026

Uh oh!

Weixin-Xu Jun 2, 2026

Uh oh!

luozenglin Jun 2, 2026

Uh oh!

Weixin-Xu Jun 2, 2026

Uh oh!

Weixin-Xu Jun 2, 2026

Uh oh!

luozenglin Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Weixin-Xu commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What problem does this PR solve?

Type of Change

Description

Performance Impact

Release Note

Checklist (For Author)

Breaking Changes

Uh oh!

luozenglin Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

Weixin-Xu Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

luozenglin Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

Weixin-Xu Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

luozenglin Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

Weixin-Xu Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

Weixin-Xu Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

luozenglin Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Weixin-Xu commented Jun 1, 2026 •

edited

Loading

Weixin-Xu Jun 1, 2026 •

edited

Loading