Skip to content

fix: Improve consistency of per-column stats on FilterExec output#22718

Open
neilconway wants to merge 1 commit into
apache:mainfrom
neilconway:neilc/fix-stats-filter-caps-null-byte-size
Open

fix: Improve consistency of per-column stats on FilterExec output#22718
neilconway wants to merge 1 commit into
apache:mainfrom
neilconway:neilc/fix-stats-filter-caps-null-byte-size

Conversation

@neilconway
Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Rationale for this change

#21081 capped the NDV at the row count when computing statistics for several operators. This PR extends that work and ensures that per-column statistics for filter operators are consistent with the estimated output row count. In particular:

  • Null count is also capped at the row count
  • Byte size is scaled down by the estimated selectivity

We also extend the analysis to consider null-rejecting predicates; for example, the clause a = 10 as a top-level conjunct implies that the null-count of the surviving rows is exactly 0.

What changes are included in this PR?

  • Ensure per-column statistics (null count, byte size) are consistent with filtered row count
  • Check for null-rejecting predicates to estimate a more accurate null count of 0
  • Update SLT expected plans
  • Add unit tests for new behavior
  • Various refactoring and comment improvements

Are these changes tested?

Yes; new tests added.

Are there any user-facing changes?

No.

@github-actions github-actions Bot added sqllogictest SQL Logic Tests (.slt) physical-plan Changes to the physical-plan crate labels Jun 2, 2026
@neilconway
Copy link
Copy Markdown
Contributor Author

cc @asolimando @gene-bordegaray @xudong963

Copy link
Copy Markdown
Member

@asolimando asolimando left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @neilconway for the PR, I left a few comments but nothing major/blocking

///
/// Equality predicates (`col = literal`) set NDV to `Exact(1)`, or
/// `Exact(0)` when the predicate is contradictory (e.g. `a = 1 AND a = 2`).
/// (either default or estimated) to input statistics.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function is quite interesting and IMO the docstring could summarize a little more what it does, especially as this is public

/// input, rows where that column is NULL cannot survive.
///
/// This analysis is conservative; for example, OR clauses are not considered
/// null-rejecting; neither are indirect operands like `a + 1 < 10`.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this is incomplete and it's fine, but we might add support at least for IS NOT NULL as it should be both easy and what people would intuitively expect. wdyt?

) -> Precision<usize> {
match filtered_num_rows {
Precision::Exact(0) | Precision::Inexact(0) => filtered_num_rows,
_ => Precision::Exact(1),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: what about returning Inexact(1) when matching Absent?

/// filtered row estimate, since a column cannot have more nulls or distinct
/// values than it has rows. Known counts are demoted to inexact because the
/// filtered row count is itself an estimate.
fn cap_at_rows(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a must-have, but you might be interested in checking what is implemented in Apache Calcite for the same case: https://github.com/apache/calcite/blob/main/core/src/main/java/org/apache/calcite/rel/metadata/RelMdUtil.java#L317

This is a more refined estimator than the proposed one, but we can postpone to a follow-up PR (this was on my radar already, I will get to it at some point).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

physical-plan Changes to the physical-plan crate sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Filter stats should ensure column-level stats are consistent with filter

2 participants