perf[gpu]: export arrow device validity on the gpu by 0ax1 · Pull Request #8440 · vortex-data/vortex

0ax1 · 2026-06-16T09:52:06Z

Move canonicalization of the validity buffer from the CPU to the GPU for Arrow device arrays. As part of this, the change adds a null count kernel, because cuDF requires the count: it does not accept -1 (unknown null count) for Arrow device arrays passed to it.

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
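For readers unfamiliar with the null count the description refers to, here is a minimal CPU-side sketch in Rust of what such a count computes, assuming the usual Arrow validity layout (one bit per element, least-significant bit first, set bit means valid). The function name `null_count` and its signature are illustrative assumptions, not the kernel added in this PR, which runs on the GPU.

```rust
/// Hypothetical CPU reference: count nulls in an Arrow-style validity bitmap.
/// A set bit marks a valid element; bits are numbered LSB-first within each byte.
/// `len` is the logical element count, so padding bits past `len` are ignored.
fn null_count(validity: &[u8], len: usize) -> usize {
    assert!(validity.len() * 8 >= len, "bitmap too short for len");

    // Population count over all fully used bytes.
    let full_bytes = len / 8;
    let mut valid: usize = validity[..full_bytes]
        .iter()
        .map(|b| b.count_ones() as usize)
        .sum();

    // Mask off padding bits in the last, partially used byte.
    let tail_bits = len % 8;
    if tail_bits != 0 {
        let mask = (1u8 << tail_bits) - 1;
        valid += (validity[full_bytes] & mask).count_ones() as usize;
    }

    len - valid
}
```

A GPU version would typically split the bitmap across threads and reduce the per-thread population counts, but the masking of trailing padding bits stays the same.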

…t-pass

Container exports (struct/list/fixed-size-list/list-view/dict) reach export_arrow_validity_buffer without going through execute_cuda, so a non-canonical Validity::Array (e.g. dict-encoded, or produced by take/scan) made the export bail.

Canonicalize the validity on the GPU inside export_arrow_validity_buffer instead, which covers every export path uniformly. This makes the executor's execute_canonical_validity_cuda post-pass redundant.

Removing it also restores the invariant that execute_cuda leaves validity host-executable, fixing the unwrap_host panic on the non-contiguous list-view rebuild path, where rebuild_primitive_list_view_child runs execute_no_nulls on the CPU context.

Update the null_count expectations that the UNKNOWN_NULL_COUNT switch missed, and add a regression test for non-canonical container validity.

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
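The idea of canonicalizing inside the export function, so that every caller is covered, can be illustrated with a deliberately simplified CPU model in Rust. Everything here is an assumption for illustration: the `Validity` enum, its `Dict` variant, and `export_validity` are hypothetical stand-ins and do not mirror the actual Vortex types or the GPU code path.

```rust
/// Hypothetical, simplified stand-in for a validity representation.
enum Validity {
    NonNullable,
    AllValid,
    AllInvalid,
    /// Already canonical: a packed bitmap, LSB-first, set bit means valid.
    Bitmap(Vec<u8>),
    /// Non-canonical: dictionary-encoded booleans (codes index into values).
    Dict { values: Vec<bool>, codes: Vec<u32> },
}

/// Pack the first `len` booleans into an LSB-first bitmap.
fn pack_bits(bits: impl Iterator<Item = bool>, len: usize) -> Vec<u8> {
    let mut out = vec![0u8; (len + 7) / 8];
    for (i, bit) in bits.take(len).enumerate() {
        if bit {
            out[i / 8] |= 1 << (i % 8);
        }
    }
    out
}

/// Canonicalize at the export boundary so that no caller has to do it first.
/// Returns `None` when no validity buffer is needed (no nulls).
fn export_validity(v: &Validity, len: usize) -> Option<Vec<u8>> {
    match v {
        Validity::NonNullable | Validity::AllValid => None,
        Validity::AllInvalid => Some(vec![0u8; (len + 7) / 8]),
        Validity::Bitmap(bytes) => Some(bytes.clone()),
        // The non-canonical case is decoded here instead of making the export fail.
        Validity::Dict { values, codes } => {
            Some(pack_bits(codes.iter().map(|&c| values[c as usize]), len))
        }
    }
}
```

Handling the non-canonical case in one place is what lets container exports, which reach the export function by a different route, share the same behavior.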

The helper had a null_count parameter that was always UNKNOWN_NULL_COUNT and a dead `null_count == 0` branch. Inline its move-to-device-and-align logic into the only caller.

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

repack_arrow_validity_buffer recomputed the bit-to-byte length formula inline. Reuse the helper and derive the word count from the byte count.

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
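The refactor described above amounts to keeping the rounding formula in one helper and building on it. A small Rust sketch, assuming 32-bit words; the names `bitmap_byte_len`, `bitmap_word_len`, and `WORD_BYTES` are hypothetical, and the word size used by the actual code is not stated in the commit message.

```rust
/// Assumed word size in bytes (illustrative; not confirmed by the commit).
const WORD_BYTES: usize = 4;

/// Number of bytes needed to hold `bits` validity bits (ceiling division by 8).
fn bitmap_byte_len(bits: usize) -> usize {
    (bits + 7) / 8
}

/// Number of words needed, derived from the byte count rather than
/// re-deriving the rounding from the bit count inline.
fn bitmap_word_len(bits: usize) -> usize {
    (bitmap_byte_len(bits) + WORD_BYTES - 1) / WORD_BYTES
}
```

Deriving the word count from the byte helper means a change to the rounding rule only has to be made in one function.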

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>

codspeed-hq · 2026-06-16T09:58:43Z

Merging this PR will improve performance by 16.48%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 1 improved benchmark
✅ 1544 untouched benchmarks
⏩ 10 skipped benchmarks¹

Performance Changes

	Mode	Benchmark	`BASE`	`HEAD`	Efficiency
⚡	Simulation	`varbinview_large`	131.2 µs	112.7 µs	+16.48%

Comparing ad/cuda-validity-export (b2c38b7) with develop (67a2b22)

¹ 10 benchmarks were skipped, so the baseline results were used instead.

0ax1 and others added 6 commits June 16, 2026 09:16

feat: CUDA Arrow validity export

5f427fb

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>

chore(cuda): remove kernel function cache

a6fdc7d

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>

cleanup

b1e7317

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>

0ax1 added the changelog/performance (A performance improvement) label Jun 16, 2026

0ax1 added 2 commits June 16, 2026 10:38

bitcount kernel

52fdf57

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>

format

c878f9c

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>

0ax1 requested review from onursatici and robert3005 June 16, 2026 10:46

0ax1 marked this pull request as ready for review June 16, 2026 10:46

0ax1 requested a review from a team June 16, 2026 10:46