IN LIST: add direct-probe hash filter for large primitive lists #23015

Draft
geoffreyclaude wants to merge 8 commits into apache:main from geoffreyclaude:perf/in_list_direct_probe_filter
geoffreyclaude (Contributor) commented Jun 18, 2026

Which issue does this PR close?

Rationale for this change

#23014 handles tiny primitive IN lists by comparing each input value against every constant. That stops being a good tradeoff as the list grows, because the work per input value scales with the list length.

For larger primitive lists, this PR uses a purpose-built lookup table. The mental model is:

  1. Precompute a table from the constants in x IN (...).
  2. For each input value, compute a cheap table slot from the value.
  3. Check that slot, and probe forward to the next slot if there was a collision.

This is still a hash-table style lookup, but it is simpler than the generic fallback because primitive values are fixed-width and can be stored directly. There is no need for the generic Arrow comparator path for each candidate.
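The three steps above can be sketched roughly as follows. This is a minimal illustration for `i32` only, not the code in this PR: the `DirectProbe` name, the `Option<i32>` slot layout, and the multiplicative hash constant are all assumptions made for the example. The 4x capacity mirrors the 25% load factor mentioned in the commit notes on this page.

```rust
/// Hypothetical open-addressing table for the constants of `x IN (...)`.
/// Values are stored directly in the slots; `None` marks an empty slot.
pub struct DirectProbe {
    slots: Vec<Option<i32>>,
    mask: usize,
}

impl DirectProbe {
    /// Step 1: precompute the table from the IN-list constants.
    pub fn new(values: &[i32]) -> Self {
        // Power-of-two capacity of at least 4x the list length, so the
        // load factor stays at or below 25% and an empty slot always exists.
        let cap = (values.len().max(1) * 4).next_power_of_two();
        let mask = cap - 1;
        let mut slots = vec![None; cap];
        for &v in values {
            let mut i = Self::slot(v, mask);
            loop {
                match slots[i] {
                    Some(existing) if existing == v => break, // duplicate constant
                    Some(_) => i = (i + 1) & mask,            // collision: next slot
                    None => {
                        slots[i] = Some(v);
                        break;
                    }
                }
            }
        }
        Self { slots, mask }
    }

    /// Step 2: compute a cheap slot index from the value.
    /// The multiplier is an illustrative choice, not taken from the PR.
    #[inline]
    fn slot(v: i32, mask: usize) -> usize {
        let h = (v as u32 as u64).wrapping_mul(0x9E37_79B9_7F4A_7C15);
        ((h >> 32) as usize) & mask
    }

    /// Step 3: check the slot and move forward on a collision.
    #[inline]
    pub fn contains(&self, v: i32) -> bool {
        let mut i = Self::slot(v, self.mask);
        loop {
            match self.slots[i] {
                Some(existing) if existing == v => return true,
                Some(_) => i = (i + 1) & self.mask,
                None => return false, // empty slot ends the probe sequence
            }
        }
    }
}
```

Because the table is never more than a quarter full, the probe loop always reaches an empty slot, which is what lets a miss return without scanning the whole table.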

The earlier bitmap and branchless filters remain in place for the cases where they are cheaper.

What changes are included in this PR?

  • Adds DirectProbeFilter, a compact open-addressing lookup table with linear probing.
  • Routes larger primitive IN lists to direct probing after the branchless thresholds.
  • Supports zero-copy same-width reinterpretation for compatible primitive types.
  • Avoids extra temporary value copies when building the table.
  • Keeps slice and null handling on the raw-buffer fast path.

Are these changes tested?

Yes.

  • cargo fmt --all --check
  • cargo test -p datafusion-physical-expr direct_probe --lib
  • cargo clippy -p datafusion-physical-expr --all-targets --all-features -- -D warnings

Are there any user-facing changes?

No. This is an internal performance optimization only.

Local benchmark snapshot

Benchmark command:

cargo bench -p datafusion-physical-expr --profile release-nonlto --bench in_list_strategy -- --save-baseline <name>

Method: compare adjacent saved baselines using raw Criterion sample minima (min(time / iters)). Lower is better; changes within +/-5% are treated as noise.
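For readers who want to reproduce the comparison, the method can be sketched as below. It assumes each Criterion sample is available as a `(total_time, iters)` pair; the function and type names are hypothetical and not part of Criterion or this PR.

```rust
/// Smallest per-iteration time across samples: min(time / iters).
/// Samples with zero iterations are skipped.
pub fn sample_min(samples: &[(f64, u64)]) -> Option<f64> {
    let mut best: Option<f64> = None;
    for &(time, iters) in samples {
        if iters == 0 {
            continue;
        }
        let per_iter = time / iters as f64;
        best = Some(match best {
            Some(m) => m.min(per_iter),
            None => per_iter,
        });
    }
    best
}

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Verdict {
    Faster,
    Slower,
    Noise,
}

/// Relative change in percent; negative means the "after" run is faster.
pub fn relative_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

/// Speedup factor as reported in the "Change" column (before / after).
pub fn speedup(before: f64, after: f64) -> f64 {
    before / after
}

/// Changes within +/-5% are treated as noise.
pub fn classify(before: f64, after: f64) -> Verdict {
    let change = relative_change(before, after);
    if change < -5.0 {
        Verdict::Faster
    } else if change > 5.0 {
        Verdict::Slower
    } else {
        Verdict::Noise
    }
}
```

For example, a row going from 18.83 us to 7.91 us gives a relative change of about -58.0% and a speedup of about 2.38x, so it is classified as faster.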

Compared baselines: #23014 -> #23015

Relevant scope: large primitive-list rows.

Summary: 13 relevant rows, 13 faster, 0 slower, 0 within +/-5%.

| Benchmark | Before | After | Change |
|---|---|---|---|
| f32/large_list/list=64/match=0% | 18.83 us | 7.91 us | -58.0% (2.38x faster) |
| f32/large_list/list=64/match=50% | 33.00 us | 10.27 us | -68.9% (3.21x faster) |
| nulls/primitive/i32/large_list/list=64/match=50%/nulls=20% | 25.79 us | 11.26 us | -56.3% (2.29x faster) |
| primitive/i32/large_list/list=256/match=0% | 17.80 us | 7.99 us | -55.1% (2.23x faster) |
| primitive/i32/large_list/list=256/match=50% | 27.31 us | 10.25 us | -62.5% (2.67x faster) |
| primitive/i32/large_list/list=64/match=0% | 18.15 us | 8.05 us | -55.7% (2.26x faster) |
| primitive/i32/large_list/list=64/match=50% | 27.85 us | 10.25 us | -63.2% (2.72x faster) |
| primitive/i64/large_list/list=128/match=0% | 18.93 us | 8.00 us | -57.7% (2.37x faster) |
| primitive/i64/large_list/list=128/match=50% | 24.24 us | 10.08 us | -58.4% (2.41x faster) |
| primitive/i64/large_list/list=32/match=0% | 19.82 us | 8.51 us | -57.1% (2.33x faster) |
| primitive/i64/large_list/list=32/match=50% | 26.01 us | 11.31 us | -56.5% (2.30x faster) |
| timestamp_ns/large_list/list=32/match=0% | 19.38 us | 8.49 us | -56.2% (2.28x faster) |
| timestamp_ns/large_list/list=32/match=50% | 45.82 us | 10.03 us | -78.1% (4.57x faster) |

Replaces HashSet<u8> with a 32-byte stack-allocated bitmap. Provides O(1) membership testing via bit-shifting, significantly reducing memory overhead and improving cache locality. Triggers for UInt8 arrays.
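A 32-byte bitmap for `u8` membership could look roughly like this. The `BitmapU8` name and layout (four `u64` words) are assumptions for illustration; the commit may organize the bits differently.

```rust
/// Hypothetical 256-bit membership bitmap: 4 words x 8 bytes = 32 bytes,
/// small enough to live on the stack.
pub struct BitmapU8 {
    bits: [u64; 4],
}

impl BitmapU8 {
    /// Set one bit per IN-list constant.
    pub fn new(values: &[u8]) -> Self {
        let mut bits = [0u64; 4];
        for &v in values {
            // The high 2 bits pick the word, the low 6 bits pick the bit.
            bits[(v >> 6) as usize] |= 1u64 << (v & 63);
        }
        Self { bits }
    }

    /// O(1) membership test using only shifts and masks.
    #[inline]
    pub fn contains(&self, v: u8) -> bool {
        ((self.bits[(v >> 6) as usize] >> (v & 63)) & 1) == 1
    }
}
```

Every possible `u8` value maps to exactly one bit, so there are no collisions and no probing.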
geoffreyclaude force-pushed the perf/in_list_direct_probe_filter branch from c2625de to 12ca843 on June 18, 2026 08:22
Implements an 8 KB heap-allocated bitmap for UInt16. Maintains O(1) performance while handling the larger value space. Triggers for UInt16 arrays.
Introduces zero-copy buffer reinterpretation so that signed integers and other 1-byte or 2-byte primitive types (e.g. Float16) can use the high-performance bitmap filters. Triggers for all types with 1-byte or 2-byte width.
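As a rough illustration of the reinterpretation idea, the sketch below views an `i16` slice as `u16` without copying and feeds it to an 8 KB bitmap. It uses plain slices rather than Arrow buffers, and every name in it (`reinterpret_i16_as_u16`, `BitmapU16`, `in_list_i16`) is hypothetical.

```rust
/// View an `i16` slice as `u16` without copying.
pub fn reinterpret_i16_as_u16(values: &[i16]) -> &[u16] {
    // SAFETY: i16 and u16 have the same size and alignment, and every
    // bit pattern is a valid value of both types.
    unsafe { std::slice::from_raw_parts(values.as_ptr().cast::<u16>(), values.len()) }
}

/// Hypothetical 65,536-bit bitmap: 1024 words x 8 bytes = 8 KB on the heap.
pub struct BitmapU16 {
    bits: Box<[u64]>,
}

impl BitmapU16 {
    pub fn new(values: &[u16]) -> Self {
        let mut bits = vec![0u64; 1024].into_boxed_slice();
        for &v in values {
            bits[(v >> 6) as usize] |= 1u64 << (v & 63);
        }
        Self { bits }
    }

    #[inline]
    pub fn contains(&self, v: u16) -> bool {
        ((self.bits[(v >> 6) as usize] >> (v & 63)) & 1) == 1
    }
}

/// Membership for signed input: both sides are reinterpreted as `u16`,
/// so equal bit patterns still compare equal.
pub fn in_list_i16(input: &[i16], list: &[i16]) -> Vec<bool> {
    let filter = BitmapU16::new(reinterpret_i16_as_u16(list));
    reinterpret_i16_as_u16(input)
        .iter()
        .map(|&v| filter.contains(v))
        .collect()
}
```

This works because set membership depends only on bit-pattern equality, not on how the bits are interpreted numerically.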
Adds a const-generic unrolled comparison chain that avoids CPU branching. Outperforms hash lookups for very small lists. Triggers for primitives when list size <= 32 (4-byte), 16 (8-byte), or 4 (16-byte).
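A const-generic comparison chain might be sketched like this for `i32`. The `BranchlessFilter` name is an assumption, and whether the compiler actually unrolls the loop and emits branch-free code depends on the optimizer, so treat this as an illustration of the shape rather than a guarantee.

```rust
/// Hypothetical filter holding exactly `N` constants, with `N` known at
/// compile time so the comparison loop can be fully unrolled.
pub struct BranchlessFilter<const N: usize> {
    values: [i32; N],
}

impl<const N: usize> BranchlessFilter<N> {
    pub fn new(values: [i32; N]) -> Self {
        Self { values }
    }

    /// Compare against every constant and OR the results together.
    /// `|=` on `bool` does not short-circuit, so there is no early exit
    /// that would require a data-dependent branch.
    #[inline]
    pub fn contains(&self, v: i32) -> bool {
        let mut hit = false;
        for c in &self.values {
            hit |= *c == v;
        }
        hit
    }
}
```

The cost is always `N` comparisons per value, which is why this approach only pays off below the list-size thresholds given above.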
Implements a fast hash table using open addressing with linear probing and a 25% load factor. Replaces the legacy HashSet for primitives, reducing indirection. Triggers for primitives when list size exceeds branchless thresholds.
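Taken together, the triggers described in these commit notes suggest a routing rule along the following lines. This is only a sketch of the stated thresholds: the `Strategy` enum and `choose_strategy` function are hypothetical, and the `Generic` fallback for other widths is an assumption rather than something stated here.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Strategy {
    /// 1-byte and 2-byte types: bitmap filters.
    Bitmap,
    /// Small lists: const-generic comparison chain.
    Branchless,
    /// Larger lists: open addressing with linear probing.
    DirectProbe,
    /// Assumed fallback for widths not covered by the notes.
    Generic,
}

/// Pick a strategy from the primitive width (in bytes) and the list length.
pub fn choose_strategy(width_bytes: usize, list_len: usize) -> Strategy {
    match width_bytes {
        1 | 2 => Strategy::Bitmap,
        4 if list_len <= 32 => Strategy::Branchless,
        8 if list_len <= 16 => Strategy::Branchless,
        16 if list_len <= 4 => Strategy::Branchless,
        4 | 8 | 16 => Strategy::DirectProbe,
        _ => Strategy::Generic,
    }
}
```

Under this rule a 64-element `i32` list, like the ones in the benchmark rows, lands on direct probing because it exceeds the 4-byte threshold of 32.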
geoffreyclaude force-pushed the perf/in_list_direct_probe_filter branch from 12ca843 to 0111ce5 on June 18, 2026 09:12
geoffreyclaude changed the title from "Implement Direct Probe (Hash) Filter for large primitive lists" to "IN LIST: add direct-probe hash filter for large primitive lists" on Jun 18, 2026
Labels

physical-expr Changes to the physical-expr crates
