fix: use spill writer's schema instead of the first batch schema for spill files#21293
Conversation
debff98 to
fd58942
Compare
d0bcb4e to
2f08c97
Compare
| } | ||
| } | ||
|
|
||
| #[cfg(test)] |
There was a problem hiding this comment.
I wonder if the unit tests are needed give the end to end tests?
There was a problem hiding this comment.
I was actually of the opposite opinion—the e2e test is slightly unwieldy to me, and convoluted in a way, since it tries to reproduce the bug using non-standard primitives and hit the right spill conditions in an unnatural way (because otherwise it's hard to repro). For instance union/repartition exec used directly instead of sql/dataframe API, special task ctx, sort_batch called directly on batches instead of wrapping with SortExec.
Also InProgressSpillFile spill file unit tests were missing anyway.
Either way, I'm fine with dropping either one (or none) of those, let me know what you think.
There was a problem hiding this comment.
I think this it is fine to keep both
2f08c97 to
8fbd95e
Compare
| @@ -0,0 +1,162 @@ | |||
| // Licensed to the Apache Software Foundation (ASF) under one | |||
There was a problem hiding this comment.
Could we please not add a new test binary? Instead add it to the existing core_integration test?
Each new test binary takes like 100MB of disk space on the runner
There was a problem hiding this comment.
Oh for sure, give me a minute.
There was a problem hiding this comment.
Done, moved to memory_limit dir as there's another spill-themed test there already.
8fbd95e to
4480b84
Compare
4480b84 to
e133dd3
Compare
|
Thanks again @gruuya |
…spill files (apache#21293) <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes apache#123` indicates that this PR will close issue apache#123. --> - Closes apache#21292. <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> `InProgressSpillFile` will use the first batch schema when it has no writer, which could have different nullability from the subsequent batches (e.g. due to union-ing a literal projection and a table column projection). This can then lead to a panic in `sort_batch`. `InProgressSpillFile` already has access to the canonical schema that the spill file should have (`self.spill_writer.schema()`). <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> Single line fix: instead of using the first batch schema for the spill file schema, use the spill_writer's schema instead. The rest of the changes are two new tests. <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> Yes, two new tests are added. <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> <!-- If there are any breaking changes to public APIs, please add the `api change` label. -->
Which issue does this PR close?
Rationale for this change
InProgressSpillFilewill use the first batch schema when it has no writer, which could have different nullability from the subsequent batches (e.g. due to union-ing a literal projection and a table column projection). This can then lead to a panic insort_batch.InProgressSpillFilealready has access to the canonical schema that the spill file should have (self.spill_writer.schema()).What changes are included in this PR?
Single line fix: instead of using the first batch schema for the spill file schema, use the spill_writer's schema instead.
The rest of the changes are two new tests.
Are these changes tested?
Yes, two new tests are added.
Are there any user-facing changes?