fix: Improve consistency of per-column stats on FilterExec output#22718
fix: Improve consistency of per-column stats on FilterExec output#22718neilconway wants to merge 1 commit into
FilterExec output#22718Conversation
asolimando
left a comment
There was a problem hiding this comment.
LGTM, thanks @neilconway for the PR, I left a few comments but nothing major/blocking
| /// | ||
| /// Equality predicates (`col = literal`) set NDV to `Exact(1)`, or | ||
| /// `Exact(0)` when the predicate is contradictory (e.g. `a = 1 AND a = 2`). | ||
| /// (either default or estimated) to input statistics. |
There was a problem hiding this comment.
The function is quite interesting and IMO the docstring could summarize a little more what it does, especially as this is public
| /// input, rows where that column is NULL cannot survive. | ||
| /// | ||
| /// This analysis is conservative; for example, OR clauses are not considered | ||
| /// null-rejecting; neither are indirect operands like `a + 1 < 10`. |
There was a problem hiding this comment.
Nit: this is incomplete and it's fine, but we might add support at least for IS NOT NULL as it should be both easy and what people would intuitively expect. wdyt?
| ) -> Precision<usize> { | ||
| match filtered_num_rows { | ||
| Precision::Exact(0) | Precision::Inexact(0) => filtered_num_rows, | ||
| _ => Precision::Exact(1), |
There was a problem hiding this comment.
Nit: what about returning Inexact(1) when matching Absent?
| /// filtered row estimate, since a column cannot have more nulls or distinct | ||
| /// values than it has rows. Known counts are demoted to inexact because the | ||
| /// filtered row count is itself an estimate. | ||
| fn cap_at_rows( |
There was a problem hiding this comment.
Not a must-have, but you might be interested in checking what is implemented in Apache Calcite for the same case: https://github.com/apache/calcite/blob/main/core/src/main/java/org/apache/calcite/rel/metadata/RelMdUtil.java#L317
This is a more refined estimator than the proposed one, but we can postpone to a follow-up PR (this was on my radar already, I will get to it at some point).
Which issue does this PR close?
Rationale for this change
#21081 capped the NDV at the row count when computing statistics for several operators. This PR extends that work and ensures that per-column statistics for filter operators are consistent with the estimated output row count. In particular:
We also extend the analysis to consider null-rejecting predicates; for example, the clause
a = 10as a top-level conjunct implies that the null-count of the surviving rows is exactly 0.What changes are included in this PR?
Are these changes tested?
Yes; new tests added.
Are there any user-facing changes?
No.