IN LIST: reinterpret small-width types for bitmap filters #23013

Draft
geoffreyclaude wants to merge 6 commits into apache:main from geoffreyclaude:perf/in_list_reinterpret_bitmaps
geoffreyclaude (Contributor) commented on Jun 18, 2026

Which issue does this PR close?

Rationale for this change

#23011 and #23012 add bitmap lookups for unsigned 1-byte and 2-byte integers. This PR lets other same-width primitive types reuse those same bitmaps without copying or converting the values.

The key idea is that some types have different meanings but the same physical shape in memory. For example:

  • UInt8 stores one byte.
  • Int8 also stores one byte.
  • UInt16 stores two bytes.
  • Int16 also stores two bytes.

The bitmap only cares about the exact bits, so an Int8 value can be viewed as its one-byte bit pattern and checked with the UInt8 bitmap. For example, the Int8 value -1 has the bit pattern 0xFF, the same bits as the UInt8 value 255. No new array is allocated, and the underlying Arrow value buffer is shared.

That is what “zero-copy reinterpretation” means here: keep the same bytes, but use a lookup filter whose storage type matches the byte width.
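As a rough illustration of the idea, here is a minimal std-only sketch. It does not use Arrow or the PR's actual helper; `ByteBitmap`, `view_i8_as_u8`, and `in_list_i8` are hypothetical names chosen for this example, and plain slices stand in for Arrow value buffers.

```rust
/// 256-bit membership bitmap: one bit per possible byte value (32 bytes).
/// Hypothetical type for illustration; not the PR's actual filter.
#[derive(Default)]
pub struct ByteBitmap {
    words: [u64; 4],
}

impl ByteBitmap {
    /// Set one bit for every needle byte.
    pub fn from_needles(needles: &[u8]) -> Self {
        let mut bitmap = Self::default();
        for &n in needles {
            bitmap.words[(n >> 6) as usize] |= 1u64 << (n & 63);
        }
        bitmap
    }

    /// Membership test: select the word, shift, and mask a single bit.
    pub fn contains(&self, value: u8) -> bool {
        (self.words[(value >> 6) as usize] >> (value & 63)) & 1 == 1
    }
}

/// View a slice of i8 as u8 without copying (same pointer, same length).
pub fn view_i8_as_u8(values: &[i8]) -> &[u8] {
    // SAFETY: i8 and u8 have the same size (1) and alignment (1), and every
    // bit pattern is a valid value of both types.
    unsafe { std::slice::from_raw_parts(values.as_ptr().cast::<u8>(), values.len()) }
}

/// Evaluate `value IN (needles)` for signed bytes using the unsigned bitmap.
/// Haystack and needles are reinterpreted the same way, so equal Int8
/// values always map to equal bit patterns.
pub fn in_list_i8(haystack: &[i8], needles: &[i8]) -> Vec<bool> {
    let bitmap = ByteBitmap::from_needles(view_i8_as_u8(needles));
    view_i8_as_u8(haystack)
        .iter()
        .map(|&byte| bitmap.contains(byte))
        .collect()
}
```

The sign of the original value never matters here: the lookup is correct as long as both sides of the comparison go through the same bit-preserving view.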

What changes are included in this PR?

  • Adds a helper that reinterprets a primitive Arrow array as another primitive type with the same width.
  • Makes the helper slice-aware, so sliced Arrow arrays still start at the correct logical offset.
  • Wraps bitmap filters so signed 1-byte and 2-byte primitive arrays can reuse the unsigned bitmap storage.
  • Validates source and needle widths before using the reinterpreted path.
  • Adds focused coverage for signed boundary values, bit patterns, and sliced arrays.
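The width validation and slice handling listed above can be sketched roughly as follows. This is a std-only approximation under stated assumptions, not the PR's helper: `PlainBits` and `reinterpret` are hypothetical names, and a sub-slice of a Rust slice stands in for a sliced Arrow array (in Arrow the logical offset lives on the array, so a real helper has to carry it over explicitly).

```rust
use std::mem::{align_of, size_of};

/// Marker for types in which every bit pattern is a valid value.
/// Hypothetical trait for this sketch; it is not part of DataFusion or Arrow.
///
/// # Safety
/// Implementors must have no padding and no invalid bit patterns.
pub unsafe trait PlainBits: Copy {}
unsafe impl PlainBits for i8 {}
unsafe impl PlainBits for u8 {}
unsafe impl PlainBits for i16 {}
unsafe impl PlainBits for u16 {}

/// Reinterpret `values` as another same-width type without copying.
/// Returns `None` when the widths (or alignments) differ, so callers can
/// fall back to a non-reinterpreted path.
pub fn reinterpret<S: PlainBits, D: PlainBits>(values: &[S]) -> Option<&[D]> {
    if size_of::<S>() != size_of::<D>() || align_of::<S>() != align_of::<D>() {
        return None;
    }
    // SAFETY: size and alignment match, and `PlainBits` guarantees every bit
    // pattern of `S` is also a valid `D`. The pointer already points at the
    // first logical element of the (possibly sliced) input.
    Some(unsafe { std::slice::from_raw_parts(values.as_ptr().cast::<D>(), values.len()) })
}
```

Because the view is built from the slice's own pointer and length, a sliced input such as `&data[1..4]` starts at its logical offset rather than at the beginning of the backing buffer.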

Are these changes tested?

Yes.

  • cargo fmt --all --check
  • cargo test -p datafusion-physical-expr reinterpreted_bitmap_handles_signed_boundaries_and_slices --lib
  • cargo test -p datafusion-physical-expr test_in_list_from_array_type_combinations --lib
  • cargo test -p datafusion-physical-expr in_list_int_types --lib
  • cargo clippy -p datafusion-physical-expr --all-targets --all-features -- -D warnings

Are there any user-facing changes?

No. This is an internal performance optimization only.

Local benchmark snapshot

Benchmark command:

cargo bench -p datafusion-physical-expr --profile release-nonlto --bench in_list_strategy -- --save-baseline <name>

Method: compare adjacent saved baselines using raw Criterion sample minima (min(time / iters)). Lower is better; changes within +/-5% are treated as noise.
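For readers who want to reproduce the arithmetic, the comparison method can be sketched as below. This is an illustrative reading of the description above, not the script used to produce the table; `sample_minimum`, `compare`, and `Verdict` are hypothetical names, and the sample layout (total time, iteration count) is an assumption.

```rust
/// Outcome of comparing two baselines with a ±5% noise band.
#[derive(Debug, PartialEq)]
pub enum Verdict {
    Faster,
    Slower,
    Noise,
}

/// Smallest per-iteration time across samples, i.e. min(time / iters).
/// Each sample is assumed to be (total_time, iteration_count).
pub fn sample_minimum(samples: &[(f64, u64)]) -> Option<f64> {
    samples
        .iter()
        .map(|&(time, iters)| time / iters as f64)
        .fold(None, |best, t| Some(best.map_or(t, |b: f64| b.min(t))))
}

/// Returns (percent change, speedup factor, verdict) for before -> after.
/// A negative percent change means the "after" baseline is faster.
pub fn compare(before: f64, after: f64) -> (f64, f64, Verdict) {
    let percent_change = (after / before - 1.0) * 100.0;
    let speedup = before / after;
    let verdict = if percent_change < -5.0 {
        Verdict::Faster
    } else if percent_change > 5.0 {
        Verdict::Slower
    } else {
        Verdict::Noise
    };
    (percent_change, speedup, verdict)
}
```

For instance, going from 19.15 us to 4.00 us gives a change of about -79.1% and a speedup of about 4.79x, which is classified as faster.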

Compared baselines: #23012 -> #23013

Relevant scope: signed 16-bit reinterpretation rows.

Summary: 6 relevant rows, 6 faster, 0 slower, 0 within +/-5%.

| Benchmark | Before | After | Change |
| --- | --- | --- | --- |
| narrow_integer/i16/list=256/match=0% | 19.15 us | 4.00 us | -79.1% (4.79x faster) |
| narrow_integer/i16/list=256/match=50% | 31.32 us | 4.00 us | -87.2% (7.82x faster) |
| narrow_integer/i16/list=4/match=0% | 16.79 us | 4.01 us | -76.1% (4.18x faster) |
| narrow_integer/i16/list=4/match=50% | 34.80 us | 4.01 us | -88.5% (8.69x faster) |
| narrow_integer/i16/list=64/match=0% | 19.21 us | 4.11 us | -78.6% (4.68x faster) |
| narrow_integer/i16/list=64/match=50% | 34.72 us | 4.01 us | -88.5% (8.66x faster) |

Replaces HashSet<u8> with a 32-byte stack-allocated bitmap. Provides O(1) membership testing via bit-shifting, significantly reducing memory overhead and improving cache locality. Triggers for UInt8 arrays.
geoffreyclaude force-pushed the perf/in_list_reinterpret_bitmaps branch from 9cbd006 to cc752d0 on June 18, 2026 08:14
Implements an 8 KB heap-allocated bitmap for UInt16. Maintains O(1) performance while handling the larger value space. Triggers for UInt16 arrays.
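The size follows from the value space: 65,536 possible UInt16 values at one bit each is 8,192 bytes. A minimal sketch of such a bitmap, assuming a boxed array of 1,024 `u64` words (the name `U16Bitmap` and this exact layout are hypothetical and may differ from the commit):

```rust
/// Membership bitmap covering every u16 value: 65,536 bits = 8 KB on the heap.
/// Hypothetical type for illustration; the commit's layout may differ.
pub struct U16Bitmap {
    words: Box<[u64; 1024]>,
}

impl U16Bitmap {
    /// Set one bit for every needle value.
    pub fn from_needles(needles: &[u16]) -> Self {
        let mut words = Box::new([0u64; 1024]);
        for &n in needles {
            // Upper 10 bits pick the word, lower 6 bits pick the bit.
            words[(n >> 6) as usize] |= 1u64 << (n & 63);
        }
        Self { words }
    }

    /// Constant-time membership test: one index, one shift, one mask.
    pub fn contains(&self, value: u16) -> bool {
        (self.words[(value >> 6) as usize] >> (value & 63)) & 1 == 1
    }

    /// Size of the bitmap storage in bytes.
    pub fn size_in_bytes(&self) -> usize {
        std::mem::size_of_val(&*self.words)
    }
}
```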
Introduces zero-copy buffer reinterpretation to allow signed integers and other 1 or 2-byte primitive types (e.g. Float16) to use the high-performance bitmap filters. Triggers for all types with 1-byte or 2-byte width.
geoffreyclaude force-pushed the perf/in_list_reinterpret_bitmaps branch from cc752d0 to 9925e82 on June 18, 2026 08:52
geoffreyclaude changed the title from "Implement Zero-Copy Reinterpretation and enable Int8/Int16 Bitmaps" to "IN LIST: reinterpret small-width types for bitmap filters" on Jun 18, 2026
Labels

physical-expr Changes to the physical-expr crates