Skip to content

Conversation

@joe-ucp
Copy link

@joe-ucp joe-ucp commented Nov 19, 2025

What issue does this PR close?

  • Part of: #8806 — Bitwise API consolidation.
    This PR extracts the nullif bitmap/layout improvements from the larger bitwise PR for independent review.

Rationale

This PR addresses three key goals:

  1. Align nullif logic:
    Refactor nullif to use the same core buffer-level bitwise primitives employed elsewhere.
  2. Document bitmap layout:
    Encode the ArrayData bitmap layout contract in documentation and implementation, clarifying:
    • Bit numbering and endianness
    • Mapping of offsets and lengths to validity bits
    • Explicit meaning of NULLIF validity (V(i) & !C(i))
  3. Robust handling of offsets/nulls:
    Fix and guard against bugs involving offsets and null counts, particularly for sliced arrays and nested types.

Summary of Changes

  • Documentation:
    • Add docs/arraydata_bitmap_layout_contract.md describing ArrayData bitmap invariants and nullif validity.
  • Buffer helpers:
    • Extend arrow-buffer/src/buffer/immutable.rs with Buffer::bitwise_binary and Buffer::bitwise_unary.
    • Add focused tests verifying these helpers against legacy bitmap ops, including edge cases (offsets/tail bits).
  • nullif kernel refactor:
    • Update arrow-select/src/nullif.rs to:
      • Compute NULLIF validity via a dedicated function using the new buffer helpers.
      • Always create result ArrayData with offset = 0, with buffers aligned to logical index 0.
      • Correctly handle offsets and null counts for sliced arrays, booleans, strings, and structs.
      • Keep changes localized to nullif; no changes made to other kernels.

Testing

  • cargo test -p arrow-select --lib nullif -- --nocapture
  • cargo test -p arrow-select --lib
  • cargo test -p arrow-buffer --lib
  • cargo test -p arrow-arith --lib

All tests pass.


User-Facing Changes

  • No public API-breaking changes.
  • Normal nullif semantics are unchanged, but handling of edge cases with offsets/null bitmaps is now more robust and formally documented.

- Add docs/arraydata_bitmap_layout_contract.md to document ArrayData bitmap invariants
- Add Buffer::bitwise_binary / Buffer::bitwise_unary helpers and tests in immutable.rs
- Refactor arrow-select/src/nullif.rs to:
  - Compute validity via compute_nullif_validity using buffer bitwise helpers
  - Build result ArrayData with offset = 0 per the layout contract
  - Fix offset/null-count handling for sliced arrays and nested types
@github-actions github-actions bot added the arrow Changes to the arrow crate label Nov 19, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry @joe-ucp -- I don't understand what problem this PR is solving

Can you please remind me what the rationale for these changes are?

@@ -0,0 +1,76 @@
# ArrayData + Bitmap Layout Contract
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to add docs , let's do it in doc comments to keep it close to the code rather than in a separate document that is more likely to drift out of sync

* result_valid(i) = V(i) & !C(i)
* result_value(i) = left_value(i) // when result_valid(i) == true
*
* This contract is the law. All nullif implementations must follow it.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't understand the contract is the law comment. We should be enforcing things with tests not external readme files

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intent here wasn’t to replace tests, but to write down the invariants that the tests are asserting — especially around how offsets map to validity bits and how V(i) & !C(i) is interpreted for sliced arrays. This was primarly used by me to keep the "logic" in my own head. 😂

I’m happy to move that description out of the standalone markdown and into doc comments on the helpers in nullif.rs, and keep the tests as the way we actually enforce the behavior. That should keep the spec close to the code while still making the rules explicit.

If you’d prefer to avoid the extra MD file entirely, I can also drop it and keep the contract only in code + tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general I think we should strive to follow the patterns that already exist in this codebase (e.g. inline comments rather than markdown files) when possible

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Improving the overall documentation about how arrays work and how to think about validity maps is of course always welcome -- perhaps it would make a good separate PR

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Makes sense. In the latest update I’ve:

  1. Moved the bitmap/validity contract text out of the standalone markdown file and into doc comments next to the nullif helpers, so the explanation now lives right with the code

  2. Dropped the extra .md file from this PR.

I agree a higher-level doc about how arrays/validity maps work would be useful as a separate follow-up PR once this one settles.

@joe-ucp
Copy link
Author

joe-ucp commented Nov 19, 2025

sorry @joe-ucp -- I don't understand what problem this PR is solving

Can you please remind me what the rationale for these changes are?

This one is mostly about alignment + safety, not fixing a known broken case in main.

Concretely, this PR:

  • Rewrites nullif to use the same core Buffer::bitwise_* APIs we just added for other bitwise logic, instead of its own ad-hoc bitmap math.
  • Makes the offset / validity bitmap behavior explicit and pushes that into tests (sliced arrays, nested types, null count consistency).
  • Ensures nullif always builds results with offset = 0 and correctly aligned buffers, which is the pattern used elsewhere and makes future refactors much safer.

The existing tests still pass, and the new tests mainly serve to lock in those layout assumptions so nullif doesn’t become a special case the next time we touch bitwise code.

@alamb
Copy link
Contributor

alamb commented Nov 23, 2025

This one is mostly about alignment + safety, not fixing a known broken case in main.

I am sorry @joe-ucp I am not likely to have much time to review this PR any time soon.

Given the limited amount of review capacity we have in this crate I suspect general code refactors that don't add features or fix issues from non maintainers are likely to be pretty low on the review list.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants