perf[gpu]: export arrow device validity on the gpu#8440
Open
0ax1 wants to merge 10 commits into
Open
Conversation
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
…t-pass Container exports (struct/list/fixed-size-list/list-view/dict) reach export_arrow_validity_buffer without going through execute_cuda, so a non-canonical Validity::Array (e.g. dict-encoded, or produced by take/scan) made the export bail. Canonicalize the validity on the GPU inside export_arrow_validity_buffer instead, which covers every export path uniformly. This makes the executor's execute_canonical_validity_cuda post-pass redundant. Removing it also restores the invariant that execute_cuda leaves validity host-executable, fixing the unwrap_host panic on the non-contiguous list-view rebuild path, where rebuild_primitive_list_view_child runs execute_no_nulls on the CPU context. Update the null_count expectations that the UNKNOWN_NULL_COUNT switch missed, and add a regression test for non-canonical container validity. Signed-off-by: Alexander Droste <alexander.droste@protonmail.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The helper had a null_count parameter that was always UNKNOWN_NULL_COUNT and a dead `null_count == 0` branch. Inline its move-to-device-and-align logic into the only caller. Signed-off-by: Alexander Droste <alexander.droste@protonmail.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
repack_arrow_validity_buffer recomputed the bit-to-byte length formula inline. Reuse the helper and derive the word count from the byte count. Signed-off-by: Alexander Droste <alexander.droste@protonmail.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Merging this PR will improve performance by 16.48%
|
| Mode | Benchmark | BASE |
HEAD |
Efficiency | |
|---|---|---|---|---|---|
| ⚡ | Simulation | varbinview_large |
131.2 µs | 112.7 µs | +16.48% |
Tip
Curious why this is faster? Comment @codspeedbot explain why this is faster on this PR, or directly use the CodSpeed MCP with your agent.
Comparing ad/cuda-validity-export (b2c38b7) with develop (67a2b22)
Footnotes
-
10 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports. ↩
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Move canonicalization of the validity buffer from the CPU to the GPU for arrow device array. As part of that this change adds a null count kernel, as the count is required by cuDF. cuDF does not support consuming
-1(unknown true count) for passed in arrow device arrays.