Skip to content

fix: use spill writer's schema instead of the first batch schema for spill files#21293

Merged
alamb merged 1 commit intoapache:mainfrom
splitgraph:union-nullable-spill-fix
Apr 3, 2026
Merged

fix: use spill writer's schema instead of the first batch schema for spill files#21293
alamb merged 1 commit intoapache:mainfrom
splitgraph:union-nullable-spill-fix

Conversation

@gruuya
Copy link
Copy Markdown
Contributor

@gruuya gruuya commented Apr 1, 2026

Which issue does this PR close?

Rationale for this change

InProgressSpillFile will use the first batch schema when it has no writer, which could have different nullability from the subsequent batches (e.g. due to union-ing a literal projection and a table column projection). This can then lead to a panic in sort_batch.

InProgressSpillFile already has access to the canonical schema that the spill file should have (self.spill_writer.schema()).

What changes are included in this PR?

Single line fix: instead of using the first batch schema for the spill file schema, use the spill_writer's schema instead.

The rest of the changes are two new tests.

Are these changes tested?

Yes, two new tests are added.

Are there any user-facing changes?

@gruuya gruuya force-pushed the union-nullable-spill-fix branch from debff98 to fd58942 Compare April 1, 2026 08:58
@github-actions github-actions bot added core Core DataFusion crate physical-plan Changes to the physical-plan crate labels Apr 1, 2026
@gruuya gruuya force-pushed the union-nullable-spill-fix branch 3 times, most recently from d0bcb4e to 2f08c97 Compare April 1, 2026 11:19
Copy link
Copy Markdown
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @gruuya -- this makes sense to me

}
}

#[cfg(test)]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if the unit tests are needed give the end to end tests?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was actually of the opposite opinion—the e2e test is slightly unwieldy to me, and convoluted in a way, since it tries to reproduce the bug using non-standard primitives and hit the right spill conditions in an unnatural way (because otherwise it's hard to repro). For instance union/repartition exec used directly instead of sql/dataframe API, special task ctx, sort_batch called directly on batches instead of wrapping with SortExec.

Also InProgressSpillFile spill file unit tests were missing anyway.

Either way, I'm fine with dropping either one (or none) of those, let me know what you think.

Copy link
Copy Markdown
Contributor

@alamb alamb Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this it is fine to keep both

@gruuya gruuya force-pushed the union-nullable-spill-fix branch from 2f08c97 to 8fbd95e Compare April 3, 2026 07:53
@@ -0,0 +1,162 @@
// Licensed to the Apache Software Foundation (ASF) under one
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we please not add a new test binary? Instead add it to the existing core_integration test?

Each new test binary takes like 100MB of disk space on the runner

Copy link
Copy Markdown
Contributor Author

@gruuya gruuya Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh for sure, give me a minute.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, moved to memory_limit dir as there's another spill-themed test there already.

@gruuya gruuya force-pushed the union-nullable-spill-fix branch from 8fbd95e to 4480b84 Compare April 3, 2026 20:57
@gruuya gruuya force-pushed the union-nullable-spill-fix branch from 4480b84 to e133dd3 Compare April 3, 2026 21:08
@alamb alamb added this pull request to the merge queue Apr 3, 2026
@alamb
Copy link
Copy Markdown
Contributor

alamb commented Apr 3, 2026

Thanks again @gruuya

Merged via the queue into apache:main with commit 78cfc15 Apr 3, 2026
35 checks passed
alamb pushed a commit to alamb/datafusion that referenced this pull request Apr 6, 2026
…spill files (apache#21293)

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes apache#123` indicates that this PR will close issue apache#123.
-->

- Closes apache#21292.

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

`InProgressSpillFile` will use the first batch schema when it has no
writer, which could have different nullability from the subsequent
batches (e.g. due to union-ing a literal projection and a table column
projection). This can then lead to a panic in `sort_batch`.

`InProgressSpillFile` already has access to the canonical schema that
the spill file should have (`self.spill_writer.schema()`).

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

Single line fix: instead of using the first batch schema for the spill
file schema, use the spill_writer's schema instead.

The rest of the changes are two new tests.

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

Yes, two new tests are added.

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
alamb added a commit that referenced this pull request Apr 7, 2026
… schema for spill files (#21293) (#21403)

- Part of #21078
- Closes #21293 on branch-52

This PR:
- Backports #21293 from @gruuya
to the branch-52 line

Co-authored-by: Marko Grujic <markoog@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Streaming panic when spilling batches with mixed nullability

2 participants