fix: reject string/binary read as numeric in native_datafusion scan #4091
andygrove merged 2 commits into apache:main
Conversation
The native_datafusion Spark physical expression adapter fell through to DataFusion's cast for Utf8/Binary -> numeric type changes (because `SparkCastOptions.is_adapting_schema` delegates to DataFusion's cast), which silently parses the bytes (returning nulls or, on some paths, reinterpreting raw bytes), whereas Spark's vectorized reader and the native_iceberg_compat scan throw `SchemaColumnConvertNotSupportedException`. This PR adds a guard in `replace_with_spark_cast` that rejects the cast when the source type is `Utf8`/`LargeUtf8`/`Binary`/`LargeBinary` and the target type is any integer or floating-point type, mirroring `TypeUtil.checkParquetType` on the JVM side. Closes apache#4088.
Same question as the other PR, can we catch this earlier?
Thanks @andygrove! I double-checked Spark: the guard matches Spark's vectorized reader. One question: the check rejects specific numeric target types, but Spark's vectorized reader rejects BINARY -> anything except String, Binary, and binary-encoded Decimal. Could conversions like BINARY -> Boolean or BINARY -> Date also reach this code path? If so, would an allow-list approach (allow only the three valid targets, reject the rest) be more defensive here?
Per review feedback, switch from a deny-list of numeric target types to an allow-list of the only targets Spark's vectorized reader permits for a BINARY column: StringType, BinaryType, and a binary-encoded decimal. This also rejects mismatches like BINARY -> Boolean, BINARY -> Date, and BINARY -> Timestamp, which the previous guard silently handed to DataFusion's cast. Test now covers numeric, boolean, date, and timestamp targets.
As discussed on #4090: I looked at this and the physical Parquet schema is not available at
Yes, they reach this path and were silently delegated to DataFusion's cast (parses date/boolean strings, returns nulls on failure). Switched to an allow-list in the latest commit: when the source is `Utf8`/`LargeUtf8`/`Binary`/`LargeBinary`, only string, binary, and binary-encoded decimal targets are permitted, and everything else is rejected.
Merged. Thanks @mbutrovich |
Which issue does this PR close?
Closes #4088.
Rationale for this change
When the `native_datafusion` scan reads a Parquet `BINARY (UTF8)` column under a numeric read schema, the existing schema adapter creates a Spark `Cast` with `is_adapting_schema = true`. In that mode `Cast` delegates to DataFusion's cast, which parses the bytes (returning null on non-numeric strings, or in some paths reinterpreting the raw bytes). Spark's vectorized reader rejects this kind of mismatch with `SchemaColumnConvertNotSupportedException` on every supported version, and `native_iceberg_compat` already does the same via `TypeUtil.checkParquetType`. The native scan should match.

What changes are included in this PR?
`native/core/src/parquet/schema_adapter.rs`: in `replace_with_spark_cast`, add a guard before the existing branches that returns `DataFusionError::Plan` when the source type is `Utf8`, `LargeUtf8`, `Binary`, or `LargeBinary` and the target type is any integer (`Int8`/`Int16`/`Int32`/`Int64`/`UInt*`) or floating-point type (`Float32`/`Float64`). The rule mirrors `TypeUtil.checkParquetType`'s `BINARY` case (lines 208-221), which only allows reading BINARY as `StringType`, `BinaryType`, or a binary-encoded decimal. The check is intentionally narrow: it only fires for string/binary -> numeric mismatches and leaves every other type path unchanged.
How are these changes tested?
Added a focused test to `ParquetReadSuite`: `native_datafusion rejects string read as numeric`. It writes string data, reads it under `c int`, forces `spark.comet.scan.impl=native_datafusion` and `spark.sql.sources.useV1SourceList=parquet`, and asserts that `collect()` raises `SparkException`. Verified against `ParquetReadV1Suite` (44 succeeded, no regressions; 1 pre-existing test ignored).

The behavior is also covered by the per-impl matrix added in #4087 (`string read as int: native_datafusion`), whose assertion will need flipping from "succeeds with garbage" to "throws" once that PR merges.