fix: reject string/binary read as numeric in native_datafusion scan #4091
andygrove merged 2 commits into apache:main
Conversation
The native_datafusion Spark physical expression adapter fell through to DataFusion's cast for Utf8/Binary -> numeric type changes (because `SparkCastOptions.is_adapting_schema` delegates to DataFusion's cast), which silently parses the bytes (returning nulls or, on some paths, reinterpreting raw bytes), whereas Spark's vectorized reader and the native_iceberg_compat scan throw `SchemaColumnConvertNotSupportedException`. This PR adds a guard in `replace_with_spark_cast` that rejects the cast when the source type is `Utf8`/`LargeUtf8`/`Binary`/`LargeBinary` and the target type is any integer or floating-point type, mirroring `TypeUtil.checkParquetType` on the JVM side. Closes apache#4088.
Same question as the other PR, can we catch this earlier?
Thanks @andygrove! I double-checked Spark: the guard matches Spark's vectorized reader. One question: the check rejects specific numeric target types, but Spark's vectorized reader rejects BINARY -> anything except String, Binary, and binary-encoded Decimal. Could conversions like BINARY -> Boolean or BINARY -> Date also reach this code path? If so, would an allow-list approach (allow only the three valid targets, reject the rest) be more defensive here?
Per review feedback, switch from a deny-list of numeric target types to an allow-list of the only targets Spark's vectorized reader permits for a BINARY column: StringType, BinaryType, and a binary-encoded decimal. This also rejects mismatches like BINARY -> Boolean, BINARY -> Date, and BINARY -> Timestamp, which the previous guard silently handed to DataFusion's cast. Test now covers numeric, boolean, date, and timestamp targets.
As discussed on #4090: I looked at this and the physical Parquet schema is not available at
Yes, they reach this path and were silently delegated to DataFusion's cast (parses date/boolean strings, returns nulls on failure). Switched to an allow-list in the latest commit: when the source is `Utf8`/`LargeUtf8`/`Binary`/`LargeBinary`, only string, binary, and binary-encoded decimal targets are permitted, and everything else is rejected.
Merged. Thanks @mbutrovich |
Which issue does this PR close?
Closes #4088.
Rationale for this change
When the `native_datafusion` scan reads a Parquet `BINARY (UTF8)` column under a numeric read schema, the existing schema adapter creates a Spark `Cast` with `is_adapting_schema = true`. In that mode `Cast` delegates to DataFusion's cast, which parses the bytes (returning null on non-numeric strings, or in some paths reinterpreting the raw bytes). Spark's vectorized reader rejects this kind of mismatch with `SchemaColumnConvertNotSupportedException` on every supported version, and `native_iceberg_compat` already does the same via `TypeUtil.checkParquetType`. The native scan should match.

What changes are included in this PR?
`native/core/src/parquet/schema_adapter.rs`: in `replace_with_spark_cast`, add a guard before the existing branches that returns `DataFusionError::Plan` when the source type is `Utf8`, `LargeUtf8`, `Binary`, or `LargeBinary` and the target type is any integer (`Int8`/`Int16`/`Int32`/`Int64`/`UInt*`) or floating-point type (`Float32`/`Float64`). The rule mirrors `TypeUtil.checkParquetType`'s `BINARY` case (lines 208-221), which only allows reading BINARY as `StringType`, `BinaryType`, or a binary-encoded decimal. The check is intentionally narrow: it only fires for string/binary -> numeric mismatches and leaves every other type path unchanged.
How are these changes tested?
Added a focused test to `ParquetReadSuite`: `native_datafusion rejects string read as numeric`. It writes string data, reads it under `c int`, forces `spark.comet.scan.impl=native_datafusion` and `spark.sql.sources.useV1SourceList=parquet`, and asserts that `collect()` raises `SparkException`. Verified against `ParquetReadV1Suite` (44 succeeded, no regressions; 1 pre-existing test ignored).

The behavior is also covered by the per-impl matrix added in #4087 (`string read as int: native_datafusion`), whose assertion will need flipping from "succeeds with garbage" to "throws" once that PR merges.