Skip to content

fix: reject string/binary read as numeric in native_datafusion scan#4091

Merged
andygrove merged 2 commits intoapache:mainfrom
andygrove:fix-issue-4088-string-as-int
Apr 27, 2026
Merged

fix: reject string/binary read as numeric in native_datafusion scan#4091
andygrove merged 2 commits intoapache:mainfrom
andygrove:fix-issue-4088-string-as-int

Conversation

@andygrove
Copy link
Copy Markdown
Member

Which issue does this PR close?

Closes #4088.

Rationale for this change

When the native_datafusion scan reads a Parquet BINARY (UTF8) column under a numeric read schema, the existing schema adapter creates a Spark Cast with is_adapting_schema=true. In that mode Cast delegates to DataFusion's cast, which parses the bytes (returning null on non-numeric strings, or in some paths reinterpreting the raw bytes). Spark's vectorized reader rejects this kind of mismatch with SchemaColumnConvertNotSupportedException on every supported version, and native_iceberg_compat already does the same via TypeUtil.checkParquetType. The native scan should match.

What changes are included in this PR?

native/core/src/parquet/schema_adapter.rs: in replace_with_spark_cast, add a guard before the existing branches that returns DataFusionError::Plan when the source type is Utf8, LargeUtf8, Binary, or LargeBinary and the target type is any integer (Int8/Int16/Int32/Int64/UInt*) or floating-point type (Float32/Float64). The rule mirrors TypeUtil.checkParquetType's BINARY case (lines 208-221), which only allows reading BINARY as StringType, BinaryType, or a binary-encoded decimal.

The check is intentionally narrow: it only fires for string/binary -> numeric mismatches and leaves every other type path unchanged.

How are these changes tested?

Added a focused test to ParquetReadSuite: native_datafusion rejects string read as numeric. It writes string data, reads it under c int, forces spark.comet.scan.impl=native_datafusion and spark.sql.sources.useV1SourceList=parquet, and asserts that collect() raises SparkException. Verified against ParquetReadV1Suite (44 succeeded, no regressions; 1 pre-existing test ignored).

The behavior is also covered by the per-impl matrix added in #4087 (string read as int: native_datafusion), whose assertion will need flipping from "succeeds with garbage" to "throws" once that PR merges.

The native_datafusion Spark physical expression adapter fell through to
DataFusion's cast for Utf8/Binary -> numeric type changes (because
SparkCastOptions.is_adapting_schema delegates to DataFusion's cast),
which silently parses the bytes (returning nulls or, on some paths,
reinterpreting raw bytes) where Spark's vectorized reader and the
native_iceberg_compat scan throw SchemaColumnConvertNotSupportedException.

Add a guard in replace_with_spark_cast that rejects when the source
type is Utf8/LargeUtf8/Binary/LargeBinary and the target type is any
integer or floating-point type, mirroring TypeUtil.checkParquetType
on the JVM side.

Closes apache#4088.
@andygrove andygrove added correctness bug Something isn't working labels Apr 25, 2026
@mbutrovich
Copy link
Copy Markdown
Contributor

Same question as the other PR, can we catch this earlier?

@mbutrovich
Copy link
Copy Markdown
Contributor

Thanks @andygrove! I double-checked Spark: the guard matches Spark's ParquetVectorUpdaterFactory.getUpdater() for BINARY columns (lines 175-180) and looks correct for the numeric cases.

One question: the check rejects specific numeric target types, but Spark's vectorized reader rejects BINARY->anything except String, Binary, and binary-encoded Decimal. Could conversions like BINARY->Boolean or BINARY->Date also reach this code path? If so, would an allow-list approach (allow only the three valid targets, reject the rest) be more defensive here?

Per review feedback, switch from a deny-list of numeric target types to an
allow-list of the only targets Spark's vectorized reader permits for a
BINARY column: StringType, BinaryType, and a binary-encoded decimal. This
also rejects mismatches like BINARY -> Boolean, BINARY -> Date, and
BINARY -> Timestamp, which the previous guard silently handed to
DataFusion's cast.

Test now covers numeric, boolean, date, and timestamp targets.
@andygrove
Copy link
Copy Markdown
Member Author

Same question as the other PR, can we catch this earlier?

As discussed on #4090: I looked at this and the physical Parquet schema is not available at CometScanRule time, so we cannot detect the mismatch before serialization. Catching it in the schema adapter (where we do have the physical and target types) is the earliest practical point.

@andygrove
Copy link
Copy Markdown
Member Author

Could conversions like BINARY->Boolean or BINARY->Date also reach this code path? If so, would an allow-list approach (allow only the three valid targets, reject the rest) be more defensive here?

Yes, they reach this path and were silently delegated to DataFusion's cast (parses date/boolean strings, returns nulls on failure). Switched to an allow-list in the latest commit: when the source is Utf8/LargeUtf8/Binary/LargeBinary, the only permitted targets are the same string/binary types and Decimal128/Decimal256 (the binary-encoded decimal case). The test now also covers boolean, date, and timestamp targets in addition to the numeric cases.

Copy link
Copy Markdown
Contributor

@mbutrovich mbutrovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @andygrove!

@andygrove andygrove merged commit c05bd16 into apache:main Apr 27, 2026
181 of 187 checks passed
@andygrove andygrove deleted the fix-issue-4088-string-as-int branch April 27, 2026 21:14
@andygrove
Copy link
Copy Markdown
Member Author

Merged. Thanks @mbutrovich

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working correctness

Projects

None yet

Development

Successfully merging this pull request may close these issues.

native_datafusion: STRING column read as INT silently returns garbage values

2 participants