Optimize IN performance with specialized implementations #19390
|
run benchmark in_list |
|
🤖 |
|
🤖: Benchmark completed Details
|
|
run benchmarks |
|
🤖 |
|
run benchmark tpch tpchds |
|
🤖 Hi @Dandandan, thanks for the request (#19390 (comment)).
Please choose one or more of these with |
|
🤖: Benchmark completed Details
|
|
run benchmark tpch tpcds |
|
🤖 |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
🤖: Benchmark completed Details
|
@Dandandan how do I think once this optim is done, there could be a lot to reuse for broadcast joins... |
For plain (non-dynamic) filters, I think based on a threshold (<= 3) it either gets planned as a chain of OR expressions or using |
7ba1c85 to 276a37f
|
run benchmark in_list |
276a37f to d18b346
|
🤖 |
|
🤖: Benchmark completed Details
|
2fc00e5 to 3db393a
|
run benchmark in_list |
|
🤖 |
|
🤖: Benchmark completed Details
|
That is my concern as well :) It's hard for me to judge what is complex because I've never seen it before / don't have a CS degree vs. is complex for anyone. I'm prepared to basically trust your judgment on it.
How about this test for each commit / optimization: if your team came and said "hey we are hitting an
One thing we could do is merge as stacked PRs so that commits don't get squashed and it's easier to revert individual pieces. Ultimately, if there were issues, because this is self-contained I think the initial answer would be to revert the problematic optimization. |
|
🤖 Criterion benchmark completed (GKE) |
@adriangb Good idea, stacked PRs is probably the safest bet here. We don't want all commits to get squashed into a single one (is that the merge policy on |
I believe so. We only have 1 merge button and afaik it squashes.
Yep agreed. I think if benchmarks look good, all tests pass and there are no public API changes we can move forward with them. |
|
Those are some pretty sweet performance results. I will try and find some time to review this more carefully |
9ea2d75 to 8cc3e0d
Replaces HashSet<u8> with a 32-byte stack-allocated bitmap. Provides O(1) membership testing via bit-shifting, significantly reducing memory overhead and improving cache locality. Triggers for UInt8 arrays.
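To illustrate the bitmap idea described above, here is a minimal sketch of a 256-bit set over `u8` values. The type name `U8Bitmap` and its methods are hypothetical and not taken from the PR; the sketch works on plain slices rather than Arrow arrays.

```rust
/// Hypothetical 256-bit set: four u64 words, 32 bytes, no heap allocation.
#[derive(Clone, Copy, Default)]
pub struct U8Bitmap {
    words: [u64; 4],
}

impl U8Bitmap {
    /// Builds the bitmap from the values of the IN list.
    pub fn from_values(values: &[u8]) -> Self {
        let mut bitmap = Self::default();
        for &v in values {
            // The upper 2 bits select the word, the lower 6 bits select the bit.
            bitmap.words[(v >> 6) as usize] |= 1u64 << (v & 63);
        }
        bitmap
    }

    /// O(1) membership test: one shift and one mask, no hashing.
    #[inline]
    pub fn contains(&self, v: u8) -> bool {
        (self.words[(v >> 6) as usize] >> (v & 63)) & 1 == 1
    }
}
```

For example, `U8Bitmap::from_values(&[0, 7, 255]).contains(7)` evaluates to `true`, and the whole structure fits in a single cache line.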
Implements an 8 KB heap-allocated bitmap for UInt16. Maintains O(1) performance while handling the larger value space. Triggers for UInt16 arrays.
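The same technique scales to 16-bit values: 65,536 bits occupy 8,192 bytes. The sketch below is an assumption about how such a bitmap could look; `U16Bitmap` and `byte_len` are hypothetical names, not the PR's API.

```rust
/// Hypothetical 65,536-bit set stored on the heap (1024 u64 words = 8 KB).
pub struct U16Bitmap {
    words: Box<[u64]>,
}

impl U16Bitmap {
    /// Builds the bitmap from the values of the IN list.
    pub fn from_values(values: &[u16]) -> Self {
        let mut words = vec![0u64; 1024].into_boxed_slice();
        for &v in values {
            // The upper 10 bits select the word, the lower 6 bits select the bit.
            words[(v >> 6) as usize] |= 1u64 << (v & 63);
        }
        Self { words }
    }

    /// O(1) membership test, identical in shape to the 8-bit case.
    #[inline]
    pub fn contains(&self, v: u16) -> bool {
        (self.words[(v >> 6) as usize] >> (v & 63)) & 1 == 1
    }

    /// Size of the bitmap payload in bytes.
    pub fn byte_len(&self) -> usize {
        self.words.len() * std::mem::size_of::<u64>()
    }
}
```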
Introduces zero-copy buffer reinterpretation to allow signed integers and other 1 or 2-byte primitive types (e.g. Float16) to use the high-performance bitmap filters. Triggers for all types with 1-byte or 2-byte width.
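A rough sketch of what zero-copy reinterpretation means for signed integers, using plain slices instead of Arrow buffers. The helper names `i8_as_u8` and `i16_as_u16` are hypothetical. Because membership only depends on bit-pattern equality, the IN-list values and the probed values just need to be reinterpreted the same way.

```rust
/// Views a slice of i8 as u8 without copying (hypothetical helper).
pub fn i8_as_u8(values: &[i8]) -> &[u8] {
    // SAFETY: i8 and u8 have the same size and alignment, and every bit
    // pattern is valid for both types. The lifetime is tied to `values`.
    unsafe { std::slice::from_raw_parts(values.as_ptr().cast::<u8>(), values.len()) }
}

/// Views a slice of i16 as u16 without copying (hypothetical helper).
pub fn i16_as_u16(values: &[i16]) -> &[u16] {
    // SAFETY: same reasoning as above for the 2-byte case.
    unsafe { std::slice::from_raw_parts(values.as_ptr().cast::<u16>(), values.len()) }
}
```

With this in place, an `i8` column can be probed against a bitmap built from the reinterpreted list values: `-1i8` and `255u8` share the bit pattern `0xFF`, so they map to the same bit.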
Adds a const-generic unrolled comparison chain that avoids CPU branching. Outperforms hash lookups for very small lists. Triggers for primitives when list size <= 32 (4-byte), 16 (8-byte), or 4 (16-byte).
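The following is a sketch of a const-generic comparison chain, under the assumption that the list length is known at compile time. `BranchlessFilter` is a hypothetical name here. Whether the generated machine code is truly branch-free depends on the optimizer; the sketch only sets up the conditions (fixed trip count, no short-circuiting) that make this likely.

```rust
/// Hypothetical filter holding exactly N list values.
pub struct BranchlessFilter<T, const N: usize> {
    values: [T; N],
}

impl<T: Copy + PartialEq, const N: usize> BranchlessFilter<T, N> {
    pub fn new(values: [T; N]) -> Self {
        Self { values }
    }

    /// Compares the needle against every list value.
    #[inline]
    pub fn contains(&self, needle: T) -> bool {
        let mut found = false;
        // N is a compile-time constant, so the loop can be fully unrolled.
        // Using `|=` instead of `||` avoids short-circuit evaluation, so no
        // early exit is required inside the chain.
        for i in 0..N {
            found |= self.values[i] == needle;
        }
        found
    }
}
```

For instance, `BranchlessFilter::new([1i32, 5, 9]).contains(5)` evaluates to `true`. The work per probe is constant for a given `N`, which is why this approach only pays off for small lists.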
Implements a fast hash table using open addressing with linear probing and a 25% load factor. Replaces the legacy HashSet for primitives, reducing indirection. Triggers for primitives when list size exceeds branchless thresholds.
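As a sketch of open addressing with linear probing at a 25% load factor: the table is sized to at least four slots per value, and a collision simply moves to the next slot. The name `ProbeSet`, the `u64` key type, and the multiplicative hash constant are assumptions made for illustration, not details taken from the PR.

```rust
/// Hypothetical open-addressing set for u64 keys.
pub struct ProbeSet {
    slots: Vec<Option<u64>>,
    mask: usize,
}

impl ProbeSet {
    pub fn from_values(values: &[u64]) -> Self {
        // Power-of-two capacity of at least 4x the list size keeps the load
        // factor at or below 25%, so probe sequences stay short.
        let capacity = (values.len().max(1) * 4).next_power_of_two();
        let mut set = Self { slots: vec![None; capacity], mask: capacity - 1 };
        for &v in values {
            set.insert(v);
        }
        set
    }

    /// Multiplicative (Fibonacci) hashing; an assumption for this sketch.
    fn hash(v: u64) -> usize {
        (v.wrapping_mul(0x9E37_79B9_7F4A_7C15) >> 32) as usize
    }

    fn insert(&mut self, v: u64) {
        let mut i = Self::hash(v) & self.mask;
        loop {
            match self.slots[i] {
                None => {
                    self.slots[i] = Some(v);
                    return;
                }
                Some(existing) if existing == v => return,
                // Linear probing: try the next slot, wrapping around.
                Some(_) => i = (i + 1) & self.mask,
            }
        }
    }

    pub fn contains(&self, v: u64) -> bool {
        let mut i = Self::hash(v) & self.mask;
        loop {
            match self.slots[i] {
                // An empty slot ends the probe sequence: the value is absent.
                None => return false,
                Some(existing) if existing == v => return true,
                Some(_) => i = (i + 1) & self.mask,
            }
        }
    }

    pub fn capacity(&self) -> usize {
        self.slots.len()
    }
}
```

Because at least 75% of the slots are always empty, both loops are guaranteed to terminate, and the slots live in one contiguous allocation with no per-entry pointer chasing.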
Introduces a two-stage filter for ByteView types. Stage 1 uses a fast DirectProbeFilter on masked views (len + prefix) for quick rejection; Stage 2 performs full verification only for potential long-string matches. Triggers for Utf8View and BinaryView.
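The two-stage idea can be sketched without Arrow as follows. The 16-byte view layout assumed here (4-byte length, then either up to 12 inline bytes or a 4-byte prefix with the remaining bytes zeroed) mirrors the masked "len + prefix" form the description mentions. A standard `HashSet` stands in for the PR's `DirectProbeFilter`, and `masked_view` and `TwoStageViewFilter` are hypothetical names.

```rust
use std::collections::HashSet;

/// Builds a masked 16-byte view: length plus inline data for short values,
/// length plus 4-byte prefix for long values (the rest is zeroed).
fn masked_view(bytes: &[u8]) -> u128 {
    let mut raw = [0u8; 16];
    raw[..4].copy_from_slice(&(bytes.len() as u32).to_le_bytes());
    let keep = if bytes.len() <= 12 { bytes.len() } else { 4 };
    raw[4..4 + keep].copy_from_slice(&bytes[..keep]);
    u128::from_le_bytes(raw)
}

/// Hypothetical two-stage filter over string values.
pub struct TwoStageViewFilter {
    masked: HashSet<u128>,
    long_values: HashSet<Vec<u8>>,
}

impl TwoStageViewFilter {
    pub fn new(list: &[&str]) -> Self {
        let masked = list.iter().map(|s| masked_view(s.as_bytes())).collect();
        let long_values = list
            .iter()
            .filter(|s| s.len() > 12)
            .map(|s| s.as_bytes().to_vec())
            .collect();
        Self { masked, long_values }
    }

    pub fn contains(&self, candidate: &str) -> bool {
        let bytes = candidate.as_bytes();
        // Stage 1: reject quickly on length + prefix.
        if !self.masked.contains(&masked_view(bytes)) {
            return false;
        }
        // Stage 2: short values are fully encoded in the view, so a hit is
        // final; long values need a full comparison.
        bytes.len() <= 12 || self.long_values.contains(bytes)
    }
}
```

Most non-matching rows differ in length or in their first four bytes, so they never reach stage 2.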
Port of the two-stage View optimization to standard Utf8 and LargeUtf8 types. Encodes strings as i128 (len + prefix) for fast O(1) pre-filtering before falling back to full string comparison. Triggers for Utf8 and LargeUtf8.
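For offset-based string arrays, the same pre-filter can be applied row by row over the offsets and data buffers. The sketch below assumes a 12-byte prefix inside the `i128` key and uses a `HashSet` plus a linear scan for verification; those choices, and the names `encode` and `in_list_utf8`, are illustrative assumptions rather than the PR's implementation.

```rust
use std::collections::HashSet;

/// Packs the length and up to 12 prefix bytes into an i128 key.
fn encode(bytes: &[u8]) -> i128 {
    let mut raw = [0u8; 16];
    raw[..4].copy_from_slice(&(bytes.len() as u32).to_le_bytes());
    let keep = bytes.len().min(12);
    raw[4..4 + keep].copy_from_slice(&bytes[..keep]);
    i128::from_le_bytes(raw)
}

/// Evaluates `value IN (list)` for every row of an offsets + data layout.
pub fn in_list_utf8(offsets: &[i32], data: &[u8], list: &[&str]) -> Vec<bool> {
    let keys: HashSet<i128> = list.iter().map(|s| encode(s.as_bytes())).collect();
    offsets
        .windows(2)
        .map(|w| {
            let value = &data[w[0] as usize..w[1] as usize];
            // Pre-filter on the encoded key; values of at most 12 bytes are
            // fully captured by the key, longer ones need a full comparison.
            keys.contains(&encode(value))
                && (value.len() <= 12 || list.iter().any(|s| s.as_bytes() == value))
        })
        .collect()
}
```

With rows `"cat"`, `"dog"`, and `"this-is-a-long-string!"` and the list `["dog", "this-is-a-long-string?"]`, the third row passes the key check (same length, same 12-byte prefix) but fails the full comparison.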
FixedSizeBinary(N) arrays share the same contiguous buffer layout as primitive arrays, so for power-of-2 widths (1, 2, 4, 8, 16) we can zero-copy reinterpret them and use the optimized primitive filters (bitmap, branchless, hash) instead of falling through to the NestedTypeFilter fallback.
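As an illustration of the reinterpretation for width 4, the sketch below views a contiguous byte buffer as `u32` values when length and alignment allow it. The function name `fixed_size_binary4_as_u32` is hypothetical, and the explicit alignment check is an assumption of this sketch. Values are read in native byte order, which is sufficient for membership tests as long as the list values are reinterpreted the same way.

```rust
/// Views a FixedSizeBinary(4)-style buffer as u32 values without copying.
/// Returns None if the buffer is misaligned or not a multiple of 4 bytes.
pub fn fixed_size_binary4_as_u32(buffer: &[u8]) -> Option<&[u32]> {
    let aligned = buffer.as_ptr() as usize % std::mem::align_of::<u32>() == 0;
    if !aligned || buffer.len() % 4 != 0 {
        return None;
    }
    // SAFETY: the pointer is aligned for u32, the length covers exactly
    // `buffer.len() / 4` elements, and every bit pattern is a valid u32.
    Some(unsafe { std::slice::from_raw_parts(buffer.as_ptr().cast::<u32>(), buffer.len() / 4) })
}
```

Once the buffer is viewed as a primitive slice, any of the primitive filters (bitmap, comparison chain, hash table) can be applied to it directly.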
8cc3e0d to 1448c71
|
Closing this aggregate PR now that the work has been split into the stacked review series:
#19241 remains the umbrella issue for the overall |
Status
This aggregate PR has been superseded by the split stacked series for #19241.
The review path is now:
Each PR now owns one focused step in the optimization stack, with its own explanation and CI signal. Closing this aggregate PR avoids duplicate review on the same work.
#19241 remains the umbrella issue for the overall IN LIST performance work.