fix: Improve consistency of per-column stats on `FilterExec` output by neilconway · Pull Request #22718 · apache/datafusion

neilconway · 2026-06-02T14:42:28Z

Which issue does this PR close?

Closes #22716 (Filter stats should ensure column-level stats are consistent with filter)

Rationale for this change

#21081 capped the NDV at the row count when computing statistics for several operators. This PR extends that work and ensures that per-column statistics for filter operators are consistent with the estimated output row count. In particular:

Null count is also capped at the row count
Byte size is scaled down by the estimated selectivity

We also extend the analysis to consider null-rejecting predicates. For example, the clause `a = 10` as a top-level conjunct implies that the null count of `a` among the surviving rows is exactly 0.
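As a rough illustration of the two adjustments listed above, here is a minimal sketch. It does not use DataFusion's real types: `Precision` below is a simplified stand-in, and `cap_count_at_rows` and `scale_byte_size` are hypothetical names and signatures chosen for this example, not the functions in this PR.

```rust
/// Simplified stand-in for a statistics value with known precision.
/// This is NOT DataFusion's type; it only mirrors the idea for illustration.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Precision {
    Exact(usize),
    Inexact(usize),
    Absent,
}

impl Precision {
    /// Returns the underlying value, if any, ignoring exactness.
    pub fn value(self) -> Option<usize> {
        match self {
            Precision::Exact(v) | Precision::Inexact(v) => Some(v),
            Precision::Absent => None,
        }
    }
}

/// Hypothetical helper: cap a per-column count (null count or NDV) at the
/// filtered row estimate. Known counts are demoted to `Inexact`, since the
/// filtered row count is itself an estimate.
pub fn cap_count_at_rows(count: Precision, filtered_rows: Precision) -> Precision {
    match (count.value(), filtered_rows.value()) {
        (Some(c), Some(rows)) => Precision::Inexact(c.min(rows)),
        // No row estimate to cap against: keep the value, but as an estimate.
        (Some(c), None) => Precision::Inexact(c),
        (None, _) => Precision::Absent,
    }
}

/// Hypothetical helper: scale a column's byte size by the estimated
/// selectivity, clamped to the range [0, 1].
pub fn scale_byte_size(byte_size: Precision, selectivity: f64) -> Precision {
    match byte_size.value() {
        Some(bytes) => {
            let scaled = (bytes as f64 * selectivity.clamp(0.0, 1.0)).ceil();
            Precision::Inexact(scaled as usize)
        }
        None => Precision::Absent,
    }
}
```

With these definitions, a column reporting 100 nulls on an input filtered down to an estimated 10 rows would report at most 10 nulls, marked inexact.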

What changes are included in this PR?

Ensure per-column statistics (null count, byte size) are consistent with filtered row count
Check for null-rejecting predicates to estimate a more accurate null count of 0
Update SLT expected plans
Add unit tests for new behavior
Various refactoring and comment improvements

Are these changes tested?

Yes; new tests added.

Are there any user-facing changes?

No.

neilconway · 2026-06-02T14:43:18Z

cc @asolimando @gene-bordegaray @xudong963

asolimando

LGTM, thanks @neilconway for the PR. I left a few comments, but nothing major or blocking.

asolimando · 2026-06-02T16:09:49Z

-    ///
-    /// Equality predicates (`col = literal`) set NDV to `Exact(1)`, or
-    /// `Exact(0)` when the predicate is contradictory (e.g. `a = 1 AND a = 2`).
+    /// (either default or estimated) to input statistics.


The function is quite interesting, and IMO the docstring could summarize what it does in a bit more detail, especially since it is public.

asolimando · 2026-06-02T16:13:32Z

+/// input, rows where that column is NULL cannot survive.
+///
+/// This analysis is conservative; for example, OR clauses are not considered
+/// null-rejecting; neither are indirect operands like `a + 1 < 10`.


Nit: this is incomplete, which is fine, but we could at least add support for `IS NOT NULL`, since it should be easy and is what people would intuitively expect. WDYT?
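To make the discussion concrete, here is a minimal sketch of a conservative null-rejection analysis over a toy expression type. Everything in it is an assumption for illustration: `Expr`, the constructor helpers, and `null_rejected_columns` are hypothetical and are not DataFusion's expression API or the code in this PR. The `IsNotNull` arm shows the extension suggested in this comment; whether the PR adopts it is not established here.

```rust
use std::collections::BTreeSet;

/// Toy expression tree (hypothetical; not DataFusion's `PhysicalExpr`).
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Column(String),
    Literal(i64),
    Add(Box<Expr>, Box<Expr>),
    /// Any null-intolerant comparison such as `=`, `<`, or `>`.
    Compare(Box<Expr>, Box<Expr>),
    IsNotNull(Box<Expr>),
    And(Box<Expr>, Box<Expr>),
    Or(Box<Expr>, Box<Expr>),
}

// Small constructors to keep examples readable.
pub fn col(name: &str) -> Expr { Expr::Column(name.to_string()) }
pub fn lit(v: i64) -> Expr { Expr::Literal(v) }
pub fn add(l: Expr, r: Expr) -> Expr { Expr::Add(Box::new(l), Box::new(r)) }
pub fn cmp(l: Expr, r: Expr) -> Expr { Expr::Compare(Box::new(l), Box::new(r)) }
pub fn is_not_null(e: Expr) -> Expr { Expr::IsNotNull(Box::new(e)) }
pub fn and(l: Expr, r: Expr) -> Expr { Expr::And(Box::new(l), Box::new(r)) }
pub fn or(l: Expr, r: Expr) -> Expr { Expr::Or(Box::new(l), Box::new(r)) }

/// Returns the columns that cannot be NULL in any row that passes `predicate`.
pub fn null_rejected_columns(predicate: &Expr) -> BTreeSet<String> {
    let mut out = BTreeSet::new();
    collect(predicate, &mut out);
    out
}

fn collect(expr: &Expr, out: &mut BTreeSet<String>) {
    match expr {
        // Every top-level conjunct must hold, so walk both sides.
        Expr::And(l, r) => {
            collect(l, out);
            collect(r, out);
        }
        // A comparison yields NULL when a direct column operand is NULL,
        // so the row is filtered out. Indirect operands such as `a + 1`
        // are deliberately skipped to stay conservative.
        Expr::Compare(l, r) => {
            for side in [l, r] {
                if let Expr::Column(name) = &**side {
                    out.insert(name.clone());
                }
            }
        }
        // Suggested extension: `col IS NOT NULL` rejects NULLs directly.
        Expr::IsNotNull(inner) => {
            if let Expr::Column(name) = &**inner {
                out.insert(name.clone());
            }
        }
        // OR clauses and everything else are not treated as null-rejecting.
        _ => {}
    }
}
```

Under this sketch, `a = 10 AND b IS NOT NULL` rejects NULLs in both `a` and `b`, while `a = 10 OR b = 1` and `a + 1 < 10` reject none.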

asolimando · 2026-06-02T16:16:58Z

+) -> Precision<usize> {
+    match filtered_num_rows {
+        Precision::Exact(0) | Precision::Inexact(0) => filtered_num_rows,
+        _ => Precision::Exact(1),


Nit: what about returning `Inexact(1)` when matching `Absent`?
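A sketch of what that variant could look like, using a simplified stand-in for `Precision` and a hypothetical function name (`equality_ndv`); the real function's name and full signature are not shown in the quoted diff. The reasoning is that with no row estimate at all, the output may be empty, so claiming exactly one distinct value is stronger than what is known.

```rust
/// Simplified stand-in for a statistics value with known precision.
/// This is NOT DataFusion's type; it only mirrors the idea for illustration.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Precision {
    Exact(usize),
    Inexact(usize),
    Absent,
}

/// Hypothetical helper: NDV of a column constrained by `col = literal`,
/// given the estimated number of rows that survive the filter.
pub fn equality_ndv(filtered_num_rows: Precision) -> Precision {
    match filtered_num_rows {
        // No surviving rows means no distinct values; keep the precision.
        Precision::Exact(0) | Precision::Inexact(0) => filtered_num_rows,
        // Unknown row count: the output could be empty, so hedge.
        Precision::Absent => Precision::Inexact(1),
        // At least one row is expected, and all rows share one value.
        _ => Precision::Exact(1),
    }
}
```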

asolimando · 2026-06-02T16:32:07Z

+/// filtered row estimate, since a column cannot have more nulls or distinct
+/// values than it has rows. Known counts are demoted to inexact because the
+/// filtered row count is itself an estimate.
+fn cap_at_rows(


Not a must-have, but you might be interested in checking what is implemented in Apache Calcite for the same case: https://github.com/apache/calcite/blob/main/core/src/main/java/org/apache/calcite/rel/metadata/RelMdUtil.java#L317

This is a more refined estimator than the proposed one, but we can postpone it to a follow-up PR (this was already on my radar; I will get to it at some point).
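For context, one well-known estimator of this kind is the expected number of distinct values seen when drawing `n` rows uniformly from a domain of `d` distinct values: `d * (1 - (1 - 1/d)^n)`. The sketch below implements that formula as an assumption about the family of estimator meant; it is not a transcription of the linked Calcite code, and `expected_distinct` is a hypothetical name.

```rust
/// Hypothetical helper: expected number of distinct values observed when
/// `num_selected` rows are drawn uniformly (with replacement) from a column
/// with `domain_size` distinct values.
pub fn expected_distinct(domain_size: f64, num_selected: f64) -> f64 {
    if domain_size <= 0.0 || num_selected <= 0.0 {
        return 0.0;
    }
    // Probability that a given value is missed by all draws is
    // (1 - 1/d)^n, so the expected number of values hit is d times
    // the complement of that probability.
    let estimate = domain_size * (1.0 - (1.0 - 1.0 / domain_size).powf(num_selected));
    // Never report more distinct values than the domain or the rows allow.
    estimate.min(domain_size).min(num_selected)
}
```

Compared with a plain `min(ndv, rows)` cap, this gives a lower estimate when the filtered row count is close to the NDV, because some rows are expected to repeat values.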

.

497edc6

github-actions bot added the sqllogictest (SQL Logic Tests (.slt)) and physical-plan (Changes to the physical-plan crate) labels on Jun 2, 2026

asolimando approved these changes Jun 2, 2026

neilconway wants to merge 1 commit into apache:main from neilconway:neilc/fix-stats-filter-caps-null-byte-size
