feat: support Boolean in approx_distinct#22707
Conversation
PR apache#21453 introduced the ApproxDistinctBitmapWrapper pattern for small ints and incidentally dropped a "TODO support for boolean (trivial case)" comment without wiring up Boolean. This adds a BooleanDistinctCountAccumulator (a [bool; 2] flag pair) plugged into the existing wrapper so approx_distinct returns an exact 0/1/2 for Boolean inputs.
|
Thank for working on this @JeelRajodiya Can you update the PR body using the template. |
…ntAccumulator Use BooleanArray::has_true/has_false instead of iterating, and bail early when both flags are already set. Per review feedback from @Jefffrey on apache#22707.
…mall_int helpers - Replace `seen: [bool; 2]` with named `has_seen_false`/`has_seen_true` fields and add `seen_both()` helper for clarity. - Rename `get_small_int_*` helpers to `get_fixed_domain_*` since they also dispatch the Boolean arm, which is not a small int.
|
@Jefffrey PR is ready for review! |
|
@JeelRajodiya , Thank you for the PR. Let me take a look and post review this week |
|
Also, there are other types that approx distinct does not support. Carrier mapping:
Or we can also create seprate accumulators for them. Whichever is preffered by datafusion team. |
coderfender
left a comment
There was a problem hiding this comment.
Left a couple of comments but this looks good to me
|
|
||
| #[cold] | ||
| fn get_small_int_approx_accumulator( | ||
| fn get_fixed_domain_approx_accumulator( |
There was a problem hiding this comment.
Curious why the name get_fixed_domain_state_field is chosen here ?
There was a problem hiding this comment.
because the keeping the name small_int won't be relevant anymore and the types we included under these functions share a small, finite domain (i.e finite values) hence this name was chosen.
…inctCountAccumulator - Pull the per-flag update block into an `observe(&BooleanArray)` helper shared by `update_batch` and `merge_batch`. - Sum the two flags as u8 first, then cast once to i64 in `count()`. Per review feedback from @coderfender on apache#22707.
Which issue does this PR close?
No issue is open for Boolean specifically. Related: #1109 (closed by #21453) introduced the bitmap pattern this PR extends. The pre-#21453 code carried a
// TODO support for boolean (trivial case)comment that was silently dropped without implementation.Rationale for this change
Today
approx_distinct(bool_col)errors with "Support for 'approx_distinct' for data type Boolean is not implemented". Boolean has at most 2 distinct non-null values, so HLL is overkill — a tiny pair of flags gives an exact answer at a fraction of the memory cost, matching the small-int bitmap strategy already in the codebase.What changes are included in this PR?
Adds
BooleanDistinctCountAccumulator(two named flags:has_seen_false/has_seen_true) infunctions-aggregate-commonand wiresDataType::Booleanthrough the existingApproxDistinctBitmapWrapperinapprox_distinct.rsalongside the small-int arms. State serializes asList<Boolean>.Are these changes tested?
Yes — SLT regression coverage in
aggregate.sltfor all-true, all-false, mixed, all-null, andGROUP BYcases.Are there any user-facing changes?
Yes —
approx_distinct(<boolean>)now works instead of erroring. No API or behavioral changes for existing types.