IN LIST: add string-view filters for Utf8View and BinaryView#23016
Draft
geoffreyclaude wants to merge 9 commits into
This was referenced Jun 18, 2026
Replaces HashSet<u8> with a 32-byte stack-allocated bitmap. Provides O(1) membership testing via bit-shifting, significantly reducing memory overhead and improving cache locality. Triggers for UInt8 arrays.
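As a rough illustration of the idea in this commit, a 256-bit bitmap can be stored as four `u64` words and probed with a shift and a mask. The name `U8Bitmap` and its methods are hypothetical and are not taken from the PR; this is a minimal sketch, not the actual implementation.

```rust
/// Membership filter for `u8` values backed by a 256-bit (32-byte) bitmap.
/// `U8Bitmap` is a hypothetical name used only for this sketch.
#[derive(Clone, Copy, Default)]
pub struct U8Bitmap {
    words: [u64; 4],
}

impl U8Bitmap {
    /// Sets one bit per value in the IN list.
    pub fn from_values(values: &[u8]) -> Self {
        let mut bitmap = Self::default();
        for &v in values {
            // The high 2 bits select the word, the low 6 bits select the bit.
            bitmap.words[(v >> 6) as usize] |= 1u64 << (v & 63);
        }
        bitmap
    }

    /// O(1) membership test: one load, one shift, one mask.
    #[inline]
    pub fn contains(&self, v: u8) -> bool {
        (self.words[(v >> 6) as usize] >> (v & 63)) & 1 == 1
    }
}
```

Because the struct is `Copy` and exactly 32 bytes, it can live on the stack and fits in a single cache line on most current CPUs.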
Implements an 8 KB heap-allocated bitmap for UInt16. Maintains O(1) performance while handling the larger value space. Triggers for UInt16 arrays.
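The same scheme extends to 16-bit values: 65,536 bits need 1,024 `u64` words, which is 8 KB and is therefore better placed on the heap. The sketch below assumes a hypothetical `U16Bitmap` type; the names are illustrative and not taken from the PR.

```rust
/// Membership filter for `u16` values backed by a 65,536-bit (8 KB) bitmap.
/// `U16Bitmap` is a hypothetical name used only for this sketch.
pub struct U16Bitmap {
    words: Box<[u64]>,
}

impl U16Bitmap {
    /// Number of 64-bit words needed to cover every `u16` value (1024).
    const WORDS: usize = (u16::MAX as usize + 1) / 64;

    /// Allocates the bitmap on the heap and sets one bit per list value.
    pub fn from_values(values: &[u16]) -> Self {
        let mut words = vec![0u64; Self::WORDS].into_boxed_slice();
        for &v in values {
            words[(v >> 6) as usize] |= 1u64 << (v & 63);
        }
        Self { words }
    }

    /// O(1) membership test, identical in shape to the 8-bit case.
    #[inline]
    pub fn contains(&self, v: u16) -> bool {
        (self.words[(v >> 6) as usize] >> (v & 63)) & 1 == 1
    }

    /// Size of the heap allocation in bytes.
    pub fn heap_bytes(&self) -> usize {
        self.words.len() * std::mem::size_of::<u64>()
    }
}
```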
Introduces zero-copy buffer reinterpretation to allow signed integers and other 1 or 2-byte primitive types (e.g. Float16) to use the high-performance bitmap filters. Triggers for all types with 1-byte or 2-byte width.
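One way to picture the reinterpretation is on plain slices: a signed slice can be viewed as an unsigned slice of the same width without copying, because equality on the bit patterns is equivalent to equality on the original values. The helper names below are hypothetical, and the PR operates on Arrow buffers rather than raw slices, so treat this as a sketch of the principle only.

```rust
/// Views a slice of `i8` as `u8` without copying.
/// Hypothetical helper for illustration; not an API from the PR.
pub fn i8_as_u8(values: &[i8]) -> &[u8] {
    // SAFETY: `i8` and `u8` have the same size and alignment, and every
    // bit pattern is a valid `u8`. The lifetime is tied to the input slice.
    unsafe { std::slice::from_raw_parts(values.as_ptr().cast::<u8>(), values.len()) }
}

/// Views a slice of `i16` as `u16` without copying.
pub fn i16_as_u16(values: &[i16]) -> &[u16] {
    // SAFETY: `i16` and `u16` have the same size and alignment, and every
    // bit pattern is a valid `u16`.
    unsafe { std::slice::from_raw_parts(values.as_ptr().cast::<u16>(), values.len()) }
}
```

Since the mapping is a bijection on bit patterns, a bitmap built from the reinterpreted list values answers membership for the reinterpreted column values exactly as it would for the original signed values.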
Adds a const-generic unrolled comparison chain that avoids CPU branching. Outperforms hash lookups for very small lists. Triggers for primitives when list size <= 32 (4-byte), 16 (8-byte), or 4 (16-byte).
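A minimal sketch of such a comparison chain is shown below. `BranchlessFilter` is a hypothetical name, and whether the generated machine code is truly branch-free depends on the compiler and optimization level; the sketch only shows the shape that makes unrolling possible.

```rust
/// Small-list membership filter whose length is known at compile time.
/// `BranchlessFilter` is a hypothetical name used only for this sketch.
pub struct BranchlessFilter<T, const N: usize> {
    values: [T; N],
}

impl<T: Copy + PartialEq, const N: usize> BranchlessFilter<T, N> {
    pub fn new(values: [T; N]) -> Self {
        Self { values }
    }

    /// Compares the needle against every list value and ORs the results.
    #[inline]
    pub fn contains(&self, needle: T) -> bool {
        let mut found = false;
        // `N` is a compile-time constant, so the loop can be fully unrolled.
        // Using `|=` instead of `||` avoids short-circuit evaluation, so no
        // comparison result decides whether the next comparison runs.
        for i in 0..N {
            found |= self.values[i] == needle;
        }
        found
    }
}
```

Every probe does all `N` comparisons, which is why this shape only pays off for short lists such as the thresholds quoted above.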
Implements a fast hash table using open addressing with linear probing and a 25% load factor. Replaces the legacy HashSet for primitives, reducing indirection. Triggers for primitives when list size exceeds branchless thresholds.
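The following sketch illustrates open addressing with linear probing for `u64` keys, sized so the load factor never exceeds 25%. The type name `ProbeSet`, the multiplicative hash, and the use of `Option<u64>` slots are assumptions made for this example; the PR's table may differ in all three.

```rust
/// Open-addressing hash set with linear probing and a load factor of at most 25%.
/// `ProbeSet` is a hypothetical name used only for this sketch.
pub struct ProbeSet {
    slots: Vec<Option<u64>>,
    mask: usize,
}

impl ProbeSet {
    pub fn from_values(values: &[u64]) -> Self {
        // A power-of-two capacity of at least 4x the list size keeps the
        // load factor at or below 25% and lets us wrap with a bit mask.
        let capacity = (values.len().max(1) * 4).next_power_of_two();
        let mut set = Self { slots: vec![None; capacity], mask: capacity - 1 };
        for &v in values {
            set.insert(v);
        }
        set
    }

    /// Multiplicative (Fibonacci) hash; an assumption made for this sketch.
    #[inline]
    fn hash(v: u64) -> usize {
        (v.wrapping_mul(0x9E37_79B9_7F4A_7C15) >> 32) as usize
    }

    fn insert(&mut self, v: u64) {
        let mut i = Self::hash(v) & self.mask;
        loop {
            match self.slots[i] {
                None => {
                    self.slots[i] = Some(v);
                    return;
                }
                Some(existing) if existing == v => return, // duplicate
                Some(_) => i = (i + 1) & self.mask,        // probe next slot
            }
        }
    }

    /// Walks consecutive slots until the value or an empty slot is found.
    /// An empty slot always exists because the table is at most 25% full.
    #[inline]
    pub fn contains(&self, v: u64) -> bool {
        let mut i = Self::hash(v) & self.mask;
        loop {
            match self.slots[i] {
                None => return false,
                Some(existing) if existing == v => return true,
                Some(_) => i = (i + 1) & self.mask,
            }
        }
    }
}
```

All slots sit in one contiguous allocation, so a probe touches adjacent memory instead of following pointers, and the low load factor keeps probe sequences short.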
Introduces a two-stage filter for ByteView types. Stage 1 uses a fast DirectProbeFilter on masked views (len + prefix) for quick rejection; Stage 2 performs full verification only for potential long-string matches. Triggers for Utf8View and BinaryView.
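The two stages can be sketched on plain byte slices as follows. This example swaps in a standard `HashMap` for the `DirectProbeFilter` named above and builds the masked key directly from the bytes rather than from an Arrow view, so `TwoStageFilter` and `masked_key` are hypothetical names and the code is an illustration under those assumptions, not the PR's implementation.

```rust
use std::collections::HashMap;

/// Stage-1 key: the length plus the whole value when it is at most 12 bytes,
/// or the length plus the first 4 bytes when it is longer.
fn masked_key(bytes: &[u8]) -> u128 {
    let mut raw = [0u8; 16];
    raw[0..4].copy_from_slice(&(bytes.len() as u32).to_le_bytes());
    let kept = if bytes.len() <= 12 { bytes.len() } else { 4 };
    raw[4..4 + kept].copy_from_slice(&bytes[..kept]);
    u128::from_le_bytes(raw)
}

/// Hypothetical two-stage filter used only for this sketch.
pub struct TwoStageFilter {
    /// Masked key -> long list values sharing that key (empty for short values).
    candidates: HashMap<u128, Vec<Vec<u8>>>,
}

impl TwoStageFilter {
    pub fn new(list: &[&[u8]]) -> Self {
        let mut candidates: HashMap<u128, Vec<Vec<u8>>> = HashMap::new();
        for &value in list {
            let entry = candidates.entry(masked_key(value)).or_default();
            if value.len() > 12 {
                entry.push(value.to_vec());
            }
        }
        Self { candidates }
    }

    pub fn contains(&self, value: &[u8]) -> bool {
        match self.candidates.get(&masked_key(value)) {
            // Stage 1: no list value has this length and prefix.
            None => false,
            // Short values: the key already encodes every byte.
            Some(_) if value.len() <= 12 => true,
            // Stage 2: verify long candidates with a full byte comparison.
            Some(long_values) => long_values.iter().any(|v| v.as_slice() == value),
        }
    }
}
```

Only values that survive stage 1 and are longer than 12 bytes ever pay for a full comparison, which is the case the benchmark rows with `long_24b` and `shared_prefix` exercise.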
Which issue does this PR close?
IN performance with specialized implementations #19390.
Rationale for this change
String and binary values are different from integers because comparing the full bytes can be more expensive. A string may be long, and many rows may not match at all.
Arrow's Utf8View and BinaryView layouts store useful summary information directly in each view value: the length and a prefix of the bytes. That gives us a cheap first check. If the length and prefix do not match, the answer is definitely no, and we avoid comparing the full value. If they do match, the value is only a candidate. Long values are then verified with an exact byte comparison before returning true.
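To make the layout concrete, the sketch below encodes a value as a 16-byte view following the Arrow view layout: a 4-byte length, then either up to 12 inline bytes, or a 4-byte prefix followed by a buffer index and an offset. The function names are hypothetical and the view is modeled as a little-endian `u128`; this is an illustration, not code from the PR.

```rust
/// Encodes `bytes` as a 16-byte view, returned as a little-endian `u128`.
/// `buffer_index` and `offset` say where a long value's bytes live and are
/// ignored for values of 12 bytes or fewer. Hypothetical helper.
pub fn encode_view(bytes: &[u8], buffer_index: u32, offset: u32) -> u128 {
    let mut raw = [0u8; 16];
    raw[0..4].copy_from_slice(&(bytes.len() as u32).to_le_bytes());
    if bytes.len() <= 12 {
        // Short value: all bytes are stored inline, zero padded.
        raw[4..4 + bytes.len()].copy_from_slice(bytes);
    } else {
        // Long value: only the first 4 bytes are stored in the view.
        raw[4..8].copy_from_slice(&bytes[..4]);
        raw[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        raw[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    u128::from_le_bytes(raw)
}

/// Low 64 bits of a view: the length and the first four bytes.
pub fn len_and_prefix(view: u128) -> u64 {
    view as u64
}

/// Cheap first check: `false` proves the values differ, `true` means the
/// value is only a candidate when it is longer than 12 bytes.
pub fn may_be_equal(a: u128, b: u128) -> bool {
    len_and_prefix(a) == len_and_prefix(b)
}
```

Two equal short values always encode to the same `u128`, whereas two equal long values can encode differently because their buffer index and offset may differ. That is why long values are compared on the low 64 bits first and then verified byte by byte.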
For short Utf8View strings, the whole value fits inline in the view itself, so the primitive fast paths can be reused directly.
What changes are included in this PR?
- Utf8View branchless/hash filters by viewing inline string views as 16-byte values.
- ByteViewMaskedFilter for mixed-length Utf8View and BinaryView arrays.
Are these changes tested?
Yes.
- cargo fmt --all --check
- cargo test -p datafusion-physical-expr reinterpreted_ --lib
- cargo test -p datafusion-physical-expr utf8view_hash_filter_handles_short_slices --lib
- cargo test -p datafusion-physical-expr byte_view_masked_filter_verifies_long_string_matches --lib
- cargo test -p datafusion-physical-expr in_list_string_types --lib
- cargo test -p datafusion-physical-expr in_list_binary_types --lib
- cargo clippy -p datafusion-physical-expr --all-targets --all-features -- -D warnings
Are there any user-facing changes?
No. This is an internal performance optimization only.
Local benchmark snapshot
Benchmark command:
Method: compare adjacent saved baselines using raw Criterion sample minima (min(time / iters)). Lower is better; changes within +/-5% are treated as noise.
Compared baselines: #23015 -> #23016
Relevant scope: Utf8View and nullable Utf8View rows.
Summary: 34 relevant rows, 32 faster, 2 slower, 0 within +/-5%.
Largest relevant deltas:
- nulls/utf8view/short_8b/list=16/match=50%/nulls=20%
- nulls/utf8view/short_8b/list=16/match=50%/nulls=50%
- nulls/utf8view/short_8b/list=16/match=50%/nulls=20%/NOT_IN
- utf8view/short_8b/list=256/match=50%
- utf8view/short_8b/list=16/match=50%
- utf8view/len_12b/list=16/match=50%
- utf8view/len_12b/list=64/match=50%
- utf8view/short_8b/list=64/match=50%
- utf8view/short_8b/list=4/match=50%
- utf8view/mixed_len/list=16/match=50%
- utf8view/shared_prefix/pfx=12/list=32/match=0%
- utf8view/shared_prefix/pfx=16/list=64/match=0%
- utf8view/mixed_len/list=64/match=0%
- utf8view/mixed_len/list=16/match=0%
- utf8view/long_24b/list=4/match=0%
Full relevant table (34 rows)
- nulls/utf8view/long_24b/list=16/match=50%/nulls=20%
- nulls/utf8view/short_8b/list=16/match=50%/nulls=20%
- nulls/utf8view/short_8b/list=16/match=50%/nulls=20%/NOT_IN
- nulls/utf8view/short_8b/list=16/match=50%/nulls=50%
- utf8view/len_12b/list=16/match=0%
- utf8view/len_12b/list=16/match=50%
- utf8view/len_12b/list=64/match=0%
- utf8view/len_12b/list=64/match=50%
- utf8view/long_24b/list=16/match=0%
- utf8view/long_24b/list=16/match=50%
- utf8view/long_24b/list=256/match=0%
- utf8view/long_24b/list=256/match=50%
- utf8view/long_24b/list=4/match=0%
- utf8view/long_24b/list=4/match=50%
- utf8view/long_24b/list=64/match=0%
- utf8view/long_24b/list=64/match=50%
- utf8view/mixed_len/list=16/match=0%
- utf8view/mixed_len/list=16/match=50%
- utf8view/mixed_len/list=64/match=0%
- utf8view/mixed_len/list=64/match=50%
- utf8view/shared_prefix/pfx=12/list=32/match=0%
- utf8view/shared_prefix/pfx=12/list=32/match=50%
- utf8view/shared_prefix/pfx=16/list=64/match=0%
- utf8view/shared_prefix/pfx=16/list=64/match=50%
- utf8view/shared_prefix/pfx=8/list=16/match=0%
- utf8view/shared_prefix/pfx=8/list=16/match=50%
- utf8view/short_8b/list=16/match=0%
- utf8view/short_8b/list=16/match=50%
- utf8view/short_8b/list=256/match=0%
- utf8view/short_8b/list=256/match=50%
- utf8view/short_8b/list=4/match=0%
- utf8view/short_8b/list=4/match=50%
- utf8view/short_8b/list=64/match=0%
- utf8view/short_8b/list=64/match=50%