Skip to content

Fix DataFusion drops grouped MIN/MAX rows with NULL sort keys under ORDER BY + LIMIT#22364

Open
xiedeyantu wants to merge 1 commit into
apache:mainfrom
xiedeyantu:topk
Open

Fix DataFusion drops grouped MIN/MAX rows with NULL sort keys under ORDER BY + LIMIT#22364
xiedeyantu wants to merge 1 commit into
apache:mainfrom
xiedeyantu:topk

Conversation

@xiedeyantu
Copy link
Copy Markdown
Member

Which issue does this PR close?

Rationale for this change

TopKAggregation could rewrite grouped MIN/MAX queries with ORDER BY ... LIMIT
into the grouped TopK execution path even when the aggregate sort key was NULL.

The grouped TopK stream previously assumed aggregate sort values reaching the heap
were non-null and skipped rows whose aggregate value was NULL. This was incorrect
for queries where a group is valid but its final MIN/MAX result is NULL
(for example, when all values in the group are NULL). In such cases, the row
should still be preserved and ordered according to NULLS FIRST / NULLS LAST
semantics rather than being dropped.

This change fixes that correctness issue in the execution path instead of disabling
the optimization for nullable aggregates.

What changes are included in this PR?

  • update primitive TopK heap handling to retain nullable aggregate sort values
  • emit primitive heap results with the correct Arrow null buffer
  • remove the row-level NULL skip in GroupedTopKAggregateStream
  • add a regression test covering grouped MIN/MAX queries with NULL aggregate
    results under ORDER BY ... LIMIT

Are these changes tested?

Yes.

Added a regression test in datafusion/core/tests/physical_optimizer/aggregate_statistics.rs
that verifies grouped MIN/MAX rows with NULL aggregate results are preserved when
the TopK aggregation optimization is enabled.

Focused validation run:

  • cargo test -p datafusion-physical-plan aggregates::topk::heap --lib
  • cargo test -p datafusion --test core_integration null_min_max_topk_preserves_group_rows

Are there any user-facing changes?

Yes.

This fixes an incorrect query result for grouped MIN/MAX queries using
ORDER BY ... LIMIT when the aggregate sort key is NULL.

@github-actions github-actions Bot added core Core DataFusion crate physical-plan Changes to the physical-plan crate sqllogictest SQL Logic Tests (.slt) labels May 19, 2026
@xiedeyantu
Copy link
Copy Markdown
Member Author

@Jefffrey Could you help me take a look this PR?

@xiedeyantu
Copy link
Copy Markdown
Member Author

@alamb If you have time, please take a look and see if this fix is ​​correct. The latest main branch I tested still has this problem.

Copy link
Copy Markdown
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm not as familiar with this part of the codebase, but one thing i noticed was the changed SLT tests seem to differ from the results without this optimization enabled?

for example see

query TI
select trace_id, MIN(timestamp) from traces group by trace_id order by MIN(timestamp) asc limit 4;
----
b -2
a -1
NULL 0
c 2

this seems a bit off if the behaviour changes depending on if optimization is enabled or not?

@github-actions github-actions Bot removed the sqllogictest SQL Logic Tests (.slt) label Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

DataFusion drops grouped MIN/MAX rows with NULL sort keys under ORDER BY + LIMIT

2 participants