Document why the Arrow exporter keeps Decimal128 as the default decimal width#8197
Document why the Arrow exporter keeps Decimal128 as the default decimal width#8197joseph-isaacs wants to merge 6 commits into
Conversation
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
|
this used to break some benchmarks in some annoying ways |
Polar Signals Profiling ResultsLatest Run
Previous Runs (2)
Powered by Polar Signals Cloud |
Cherry-pick of #8197 (Joe Isaacs). Enables narrow Arrow decimal types in the shared Vortex-to-Arrow mapping now that DataFusion 51+ and arrow-rs 56+ support them. Signed-off-by: Alexander Droste <alex@spiraldb.com> Co-Authored-By: Joe Isaacs <joe.isaacs@live.co.uk> Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
🚨🚨🚨❌❌❌ SQL BENCHMARK FAILED ❌❌❌🚨🚨🚨Benchmark |
Benchmarks: PolarSignals ProfilingVortex (geomean): 1.017x ➖ How to read Verdict and Engines
datafusion / vortex-file-compressed (1.017x ➖, 0↑ 2↓)
File Size Changes (1 files changed, -0.0% overall, 0↑ 1↓)
Totals:
|
Benchmarks: TPC-H SF=1 on NVMEVerdict: No clear signal (environment too noisy confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (1.006x ➖, 1↑ 3↓)
datafusion / vortex-compact (1.023x ➖, 1↑ 2↓)
datafusion / parquet (1.030x ➖, 0↑ 4↓)
datafusion / arrow (0.991x ➖, 5↑ 3↓)
duckdb / vortex-file-compressed (1.013x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.019x ➖, 0↑ 0↓)
duckdb / parquet (1.008x ➖, 1↑ 0↓)
duckdb / duckdb (1.014x ➖, 0↑ 0↓)
File Size Changes (18 files changed, +5.5% overall, 9↑ 9↓)
Totals:
Full attributed analysis
|
Benchmarks: FineWeb NVMeVerdict: No clear signal (low confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (0.908x ➖, 3↑ 1↓)
datafusion / vortex-compact (1.015x ➖, 0↑ 0↓)
datafusion / parquet (1.023x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.896x ✅, 3↑ 0↓)
duckdb / vortex-compact (1.016x ➖, 0↑ 1↓)
duckdb / parquet (1.005x ➖, 0↑ 0↓)
File Size Changes (2 files changed, +13.7% overall, 1↑ 1↓)
Totals:
Full attributed analysis
|
Benchmarks: FineWeb S3Verdict: No clear signal (environment too noisy confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (1.080x ➖, 1↑ 1↓)
datafusion / vortex-compact (1.042x ➖, 0↑ 0↓)
datafusion / parquet (1.077x ➖, 0↑ 2↓)
duckdb / vortex-file-compressed (1.157x ➖, 0↑ 1↓)
duckdb / vortex-compact (1.017x ➖, 0↑ 0↓)
duckdb / parquet (1.052x ➖, 0↑ 0↓)
Full attributed analysis
|
Benchmarks: Statistical and Population GeneticsVerdict: No clear signal (low confidence) How to read Verdict and Engines
duckdb / vortex-file-compressed (1.025x ➖, 0↑ 1↓)
duckdb / vortex-compact (1.063x ➖, 0↑ 0↓)
duckdb / parquet (1.009x ➖, 0↑ 0↓)
File Size Changes (2 files changed, +0.2% overall, 1↑ 1↓)
Totals:
Full attributed analysis
|
Benchmarks: TPC-H SF=10 on NVMEVerdict: No clear signal (low confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (0.876x ✅, 17↑ 1↓)
datafusion / vortex-compact (0.908x ➖, 13↑ 1↓)
datafusion / parquet (0.971x ➖, 6↑ 2↓)
datafusion / arrow (0.927x ➖, 8↑ 5↓)
duckdb / vortex-file-compressed (0.964x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.973x ➖, 0↑ 0↓)
duckdb / parquet (0.993x ➖, 0↑ 0↓)
duckdb / duckdb (1.001x ➖, 0↑ 0↓)
File Size Changes (48 files changed, +5.5% overall, 27↑ 21↓)
Totals:
Full attributed analysis
|
Benchmarks: Clickbench on NVMEVerdict: No clear signal (environment too noisy confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (0.807x ✅, 26↑ 1↓)
datafusion / parquet (0.782x ✅, 27↑ 0↓)
duckdb / vortex-file-compressed (0.931x ➖, 9↑ 1↓)
duckdb / parquet (0.986x ➖, 0↑ 0↓)
duckdb / duckdb (0.970x ➖, 4↑ 2↓)
File Size Changes (201 files changed, +16.5% overall, 100↑ 101↓)
Totals:
Full attributed analysis
|
Benchmarks: TPC-H SF=1 on S3Verdict: No clear signal (environment too noisy confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (0.987x ➖, 1↑ 1↓)
datafusion / vortex-compact (0.954x ➖, 1↑ 1↓)
datafusion / parquet (0.990x ➖, 3↑ 2↓)
duckdb / vortex-file-compressed (1.022x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.964x ➖, 0↑ 0↓)
duckdb / parquet (1.007x ➖, 0↑ 0↓)
Full attributed analysis
|
Benchmarks: Appian on NVMEVerdict: No clear signal (environment too noisy confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (1.001x ➖, 2↑ 1↓)
datafusion / parquet (1.009x ➖, 2↑ 1↓)
duckdb / vortex-file-compressed (0.992x ➖, 0↑ 0↓)
duckdb / parquet (0.993x ➖, 0↑ 0↓)
duckdb / duckdb (0.989x ➖, 0↑ 0↓)
File Size Changes (19 files changed, +1.8% overall, 3↑ 16↓)
Totals:
Full attributed analysis
|
Cherry-pick of #8197 (Joe Isaacs). Enables narrow Arrow decimal types in the shared Vortex-to-Arrow mapping now that DataFusion 51+ and arrow-rs 56+ support them. Signed-off-by: Alexander Droste <alex@spiraldb.com> Co-Authored-By: Joe Isaacs <joe.isaacs@live.co.uk> Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Benchmarks: CompressionVortex (geomean): 0.994x ➖ How to read Verdict and Engines
unknown / unknown (0.984x ➖, 2↑ 0↓)
|
Benchmarks: Random AccessVortex (geomean): 1.009x ➖ How to read Verdict and Engines
unknown / unknown (1.010x ➖, 4↑ 6↓)
|
Benchmarks: TPC-H SF=10 on S3Verdict: No clear signal (environment too noisy confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (0.954x ➖, 0↑ 0↓)
datafusion / vortex-compact (0.897x ➖, 1↑ 1↓)
datafusion / parquet (0.964x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.899x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.904x ➖, 0↑ 0↓)
duckdb / parquet (0.894x ➖, 0↑ 0↓)
Full attributed analysis
|
|
@AdamGS do you remember where this broke previously? |
|
TPC-DS |
|
I remembered Sum and and Avg, but I'll open a DF issue. |
|
another fix for now will be to have schemas for tpc-ds/tpc-h, if we declare them wide enough we'll avoid the overflow. |
Merging this PR will not alter performance
|
Rebased onto the DataFusion 54 upgrade branch (adamg/df-54) to evaluate narrow decimal export against DataFusion 54. Maps decimals to the smallest Arrow width that fits the precision: Decimal32 for <=9, Decimal64 for <=18, Decimal128 for <=38, else Decimal256. The Arrow executor already handles all four widths. Note: DataFusion 54's decimal SUM/AVG accumulators are unchanged from 53 (branch-54 keeps SUM(Decimal32) -> Decimal32(min(9, p+10), s) in i32), so aggregating inferred narrow decimals is still expected to overflow (e.g. TPC-DS Q1). This branch is for evaluating that behavior on 54. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
6174c77 to
6603fcf
Compare
|
I remember now that arrow-java doesn't support them yet so for jni we would need a way to force only i128/i256 |
|
needs to gate these too if < DF 0.54.0 |
|
DF issue: apache/datafusion#22713 |
Summary
Originally this PR enabled emitting narrower Arrow decimals (
Decimal32/Decimal64) by default fromto_data_type_naive. E2E verification againstdatafusion-benchwith real generated data showed this is not safe on the versions Vortex ships, so the behavior change has been reverted; the net diff is now an explanatory comment documenting why the narrower widths stay gated.What the verification found
Decimal32/Decimal64types (added in arrow-rs ~v56 and DataFusion 51.0.0), and Vortex's Arrow executor already produces all four widths on request. So the types exist.SUM/AVGwithin the same physical width family —Decimal32 → i32,Decimal64 → i64— instead of widening the accumulator (datafusion-functions-aggregatesum.rs/average.rs). ADecimal32SUM/AVGis capped at precision 9 /i32, so it overflows once the running total exceeds ~1e9, regardless of row count.avg(sum(sr_return_amt)),sr_return_amt=decimal(7,2), inferred schema →Decimal32) fails withArithmetic Overflow in AvgAccumulator. Tracked upstream in apache/datafusion#17489.Decimal128(15,2)schemas, so its plan never sees the narrow type.SparkToArrowSchemahard-codesbitWidth = 128; the JNITestMinimalwritesDecimal128(9,2)and asserts an exact round-trip), so any narrowing breaks the round-trip — confirmed by theTestMinimal.testFullScanfailure.There is no narrowing tier that's safe across all current consumers:
Decimal32breaks DataFusion aggregation + JNI;Decimal64would break the Sparkdecimal(10,2)round-trip too. The original guard comment was correct.Change
Revert to the
Decimal128default and replace the terse// commented out until DataFusion improves...note with a comment recording the concrete reasons (accumulator overflow + 128-bit binding assumption, with the upstream issue link). The Arrow executor still emitsDecimal32/Decimal64when a consumer explicitly requests that target type — only the inferred default staysDecimal128.Follow-up option
If we want the storage/bandwidth win for narrow decimals without breaking these consumers, the path is to make narrowing opt-in (default stays
Decimal128, enabled via a session/config flag). Happy to do that as a separate change.Checks
cargo test -p vortex-array --lib dtype::arrow✅datafusion-bench tpch --formats parquet,vortex --queries 1,6(SF=1, real data) ✅ — row counts validateddevelop(comment-only diff), restoring the green baseline for the TPC-DS bench and JNI jobs.https://claude.ai/code/session_01RctLpues7aLnsxJ86XH9pC