Skip to content

[fix] Fix oversized BYTE_ARRAY parquet data pages#613

Open
Weixin-Xu wants to merge 3 commits into
bytedance:mainfrom
Weixin-Xu:fix_parquet_writer_oversized_bytes
Open

[fix] Fix oversized BYTE_ARRAY parquet data pages#613
Weixin-Xu wants to merge 3 commits into
bytedance:mainfrom
Weixin-Xu:fix_parquet_writer_oversized_bytes

Conversation

@Weixin-Xu

@Weixin-Xu Weixin-Xu commented Jun 1, 2026

Copy link
Copy Markdown
Collaborator

What problem does this PR solve?

This PR fixes oversized Parquet data pages for BYTE_ARRAY columns, including VARCHAR and VARBINARY.

Previously, plain BYTE_ARRAY writes appended a full write batch into the encoder before checking the data page size. If one batch contained enough encoded bytes, the writer could produce a data page much larger than the configured page size. In extreme cases, page sizes could exceed the Parquet PageHeader int32 fields.

This PR:

  • Splits plain BYTE_ARRAY writes by encoded byte size before appending to the encoder.
  • Uses Arrow binary offsets as a fast path for flat batches, avoiding per-value scanning unless the batch may exceed the page limit.
  • Respects record boundaries when page boundaries must align with records.
  • Adds int32 guards for data and dictionary page header sizes.
  • Adds writer tests for VARCHAR, VARBINARY, and dictionary fallback.

Issue Number: close #612

Type of Change

  • 🐛 Bug fix (non-breaking change which fixes an issue)
  • ✨ New feature (non-breaking change which adds functionality)
  • 🚀 Performance improvement (optimization)
  • ⚠️ Breaking change (fix or feature that would cause existing functionality to change)
  • 🔨 Refactoring (no logic changes)
  • 🔧 Build/CI or Infrastructure changes
  • 📝 Documentation only

Description

This PR fixes oversized BYTE_ARRAY Parquet data pages by splitting plain-encoded BYTE_ARRAY writes before the page size can exceed the Parquet PageHeader int32 limit.

Changes include:

  • Validate dictionary and data page sizes before writing them into Parquet PageHeader int32 fields.
  • Split plain-encoded BYTE_ARRAY chunks by encoded byte size for VARCHAR / VARBINARY.
  • Preserve dictionary encoding behavior and apply BYTE_ARRAY page splitting after dictionary fallback to plain encoding.
  • Use Arrow offsets to cheaply estimate whether the current batch can fit in the current page.
  • Fall back to precise per-level splitting only when the batch is close to the data page byte limit.
  • Add tests for VARCHAR, VARBINARY, and dictionary fallback cases.
  • Fix the Parquet writer benchmark harness to honor folly benchmark iterations and avoid optimized-away results.

Performance Impact

  • [] No Impact: This change does not affect the critical path (e.g., build system, doc, error handling).

  • Positive Impact: I have run benchmarks.

    Click to view Benchmark Results
    Benchmark target:
    bolt_dwio_parquet_writer_benchmark
    
    Build:
    cmake --build --preset conan-release --target bolt_dwio_parquet_writer_benchmark
    
    Baseline:
    implementation before the BYTE_ARRAY page-splitting refactor/optimization
    
    Current:
    this PR
    
    Note:
    The benchmark harness was fixed to honor folly benchmark iterations.
    Without this, some fast cases reported 0.00fs / Infinity and were not reliable.
    
    Targeted BYTE_ARRAY non-dictionary path:
    --bm_regex='VarcharPlain.*no_dict|VarcharListPlain.*no_dict'
    --bm_max_secs=1
    --bm_min_usec=100000
    
    VarcharPlain 4k null0          15.90ms -> 13.55ms   -14.8%
    VarcharPlain 32k null0        116.61ms -> 95.79ms   -17.8%
    VarcharPlain 256k null0         1.04s  -> 870.75ms  -15.9%
    
    VarcharPlain 4k null20         18.66ms -> 15.82ms   -15.2%
    VarcharPlain 32k null20       141.76ms -> 120.16ms  -15.2%
    VarcharPlain 256k null20        1.20s  -> 1.03s     -14.1%
    
    VarcharListPlain 4k null0     114.51ms -> 101.46ms  -11.4%
    VarcharListPlain 32k null0    906.03ms -> 854.19ms  -5.7%
    VarcharListPlain 256k null0     7.82s  -> 7.12s     -9.0%
    
    VarcharListPlain 4k null20    105.21ms -> 94.35ms   -10.3%
    VarcharListPlain 32k null20   839.07ms -> 821.86ms  -2.1%
    VarcharListPlain 256k null20    6.99s  -> 6.30s     -9.8%
    
    Additional existing writer benchmark coverage:
    --bm_regex='Varchar_batch|BigInt_batch|Double_batch|ShortDecimalType_batch|LongDecimalType_batch|Map_batch|List_batch'
    --bm_max_secs=1
    --bm_min_usec=100000
    
    Covered existing benchmark types:
    - Varchar dictionary encoding
    - BigInt
    - Double
    - ShortDecimalType
    - LongDecimalType
    - Map
    - List
    
    Result:
    No clear performance regression was observed in non-BYTE_ARRAY-targeted paths.
    Some large complex-type cases showed several-percent run-to-run variance, so suspicious cases were rerun individually.
    
    List-only rerun:
    List_batch_256k_dict    3972.93ms -> 3932.61ms   -1.0%
    List_batch_1M_dict        17.13s  -> 16.69s      -2.5%
    
    The page-header size check fast path is kept lightweight:
    - normal path is inline
    - overflow branch is guarded by ARROW_PREDICT_FALSE
    - throw helper is marked [[noreturn]]
    
  • Negative Impact: Explained below (e.g., trade-off for correctness).

Release Note

Please describe the changes in this PR

Release Note:

Release Note:
- Fixed oversized BYTE_ARRAY Parquet data pages by splitting plain-encoded VARCHAR / VARBINARY pages before they exceed Parquet PageHeader int32 size limits.
- Optimized the common BYTE_ARRAY write path to avoid per-level splitting checks unless the batch is close to the page size limit.

Checklist (For Author)

  • I have added/updated unit tests (ctest).
  • I have verified the code with local build (Release/Debug).
  • I have run clang-format / linters.
  • (Optional) I have run Sanitizers (ASAN/TSAN) locally for complex C++ changes.
  • No need to test or manual test.

Breaking Changes

  • No

  • Yes (Description: ...)

    Click to view Breaking Changes
    Breaking Changes:
    - Description of the breaking change.
    - Possible solutions or workarounds.
    - Any other relevant information.
    

@Weixin-Xu Weixin-Xu changed the title Fix oversized BYTE_ARRAY parquet data pages [fix] Fix oversized BYTE_ARRAY parquet data pages Jun 1, 2026
value_offset += batch_num_spaced_values;
};

auto WriteChunk = [&](int64_t offset, int64_t batch_size, bool check_page) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When int32_t overflows, the row group size has already far exceeded the default 128 MB limit, and the modified WriteChunk still cannot solve this problem.

In addition, the Parquet writer expects the written batch size to be 40 MB, as referenced here:

uint64_t writeBatchBytes = 40 * 1024 * 1024; // 40M

Can we solve this issue by modifying the logic in splitWriteRecordBatch? This would avoid both excessively large page sizes and excessively large row group sizes, while making the change more controlled.

@Weixin-Xu Weixin-Xu Jun 1, 2026

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem fixed here is mainly about the encoded data page size. writeBatchBytes limits the input batch before Parquet encoding/compression, while this issue happens after plain BYTE_ARRAY values are appended into the encoder. Therefore, even with a 40MB write batch limit, a single encoded data page can still become oversized.
The int32 guard here is only a safety check for invalid Parquet PageHeader sizes. The main fix is to split plain BYTE_ARRAY writes before appending too much encoded data into the page encoder. We can improve splitWriteRecordBatch separately for row group/input batch control, but I think the page-level split is still needed.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you check whether this has any performance impact and provide benchmark results?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I can check the performance impact for normal cases.

For the overflow case, benchmark results may not be very meaningful since the original behavior could already write corrupted data when the page size exceeds int32_t. I will mainly verify that regular BYTE_ARRAY writes are not noticeably affected.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we need to verify the performance when the page size is normal and the WriteChunk method does not hit the fast path but enters the while loop.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current change keeps a fast path for the common flat/non-null case when the estimated encoded size is within the page limit.

I will add benchmark results for the normal page-size case where WriteChunk cannot take this fast path and falls back to the while-loop logic, e.g. with def levels / nulls.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added benchmarks for the normal page-size cases, including plain-encoded VARCHAR and nested ARRAY.

The benchmark results did reveal unexpected overhead in the common case, so I moved the fast-path check earlier: we now first use Arrow offsets to conservatively estimate whether the whole batch can fit in the current page. If it can, we write the batch directly and avoid entering the per-level while loop. The precise per-level splitting loop is only used when the batch is close to the page-size limit.

This keeps the oversized-page fix from affecting unrelated/common write paths while still preserving the correctness guard for pages that may exceed the Parquet PageHeader int32 size limit.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still see some residual performance risk in the cases where the fast path is not applicable, especially nullable VARCHAR/VARBINARY and nested ARRAY. In those paths, WriteChunk enters the per-level loop and then WriteSubChunk scans levels again via MaybeCalculateValidityBits / WriteLevelsSpaced, so the no-split normal case may pay an extra full scan.

Current Bolt Parquet writer has no performance advantage over the Java writer for string types. I think checking whether data_pagesize_ is exceeded at every level is too strict and may introduce performance issues.

We can refer to Arrow’s implementation: https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_writer.cc. Arrow checks whether int32_t overflows, but it does not check the page size level by level.

@Weixin-Xu Weixin-Xu force-pushed the fix_parquet_writer_oversized_bytes branch from 6cacc43 to faab616 Compare June 2, 2026 02:24
@Weixin-Xu Weixin-Xu force-pushed the fix_parquet_writer_oversized_bytes branch from faab616 to 182e5fe Compare June 2, 2026 02:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Parquet writer can produce oversized BYTE_ARRAY data pages for large VARCHAR/VARBINARY batches

2 participants