fix: gate bit_length/octet_length on BinaryType and downgrade translate#4594
Open
andygrove wants to merge 2 commits into
Open
fix: gate bit_length/octet_length on BinaryType and downgrade translate#4594andygrove wants to merge 2 commits into
andygrove wants to merge 2 commits into
Conversation
Surfaced by the string-expressions audit (apache#4461). * `bit_length` / `octet_length`: report `Compatible` while DataFusion's native impls reject `BinaryType` at execution time, so calls on binary columns surface as a native error rather than falling back to Spark. Add `CometBitLength` / `CometOctetLength` serdes that gate `BinaryType` as `Unsupported`, mirroring the existing `CometLength` shape. Closes apache#4464. * `translate`: report `Compatible` while DataFusion's `translate` iterates over Unicode graphemes (Spark uses code points) and substitutes U+0000 instead of treating it as a deletion sentinel. Add `CometStringTranslate` that returns `Incompatible(...)` so the divergent native path only runs when the user opts in via `spark.comet.expression.StringTranslate.allowIncompatible=true`. Closes apache#4463. Tests: extend `bit_length.sql` / `octet_length.sql` with `expect_fallback` cases on binary input; convert `string_translate.sql` to assert the default fallback path; add `string_translate_enabled.sql` exercising the opt-in native path on ASCII inputs where Spark and DataFusion agree. The existing `CometStringExpressionSuite` translate assertion is wrapped in `withSQLConf(StringTranslate.allowIncompatible=true)`.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
Closes #4464
Closes #4463
Rationale for this change
Two correctness issues surfaced by the string-expressions audit (#4461). In both cases, the serde reports
Compatiblewhile the underlying native path silently diverges from Spark — EXPLAIN, the auto-generated compatibility doc, and the dispatcher all seeCompatibleso the operator runs natively rather than falling back.[Bug] bit_length and octet_length error natively for BinaryType input instead of falling back #4464:
bit_length/octet_lengthreportCompatible(None)forBinaryType, but DataFusion'sBitLengthFunc/OctetLengthFuncuseSignature::coercible(... logical_string() ...)and rejectBinaryat execution time. The result:bit_length(<binary>)andoctet_length(<binary>)plan successfully under Comet, then surface as a native execution error rather than falling back cleanly. The siblinglengthalready guardsBinaryTypeviaCometLength.[Bug] translate uses graphemes vs Spark code points and ignores U+0000 deletion #4463:
translateis wired asCometScalarFunction("translate")and reportsCompatible, but DataFusion'stranslate(1) iterates over Unicode graphemes while Spark uses code points (so combining marks / ZWJ sequences disagree), and (2) substitutes U+0000 instead of treating it as a deletion sentinel like Spark'sStringTranslate.buildDict.What changes are included in this PR?
CometBitLength/CometOctetLength: new serdes that gateBinaryTypeasUnsupported(Some(...))(mirroring the existingCometLengthshape). The string path remainsCompatible.CometStringTranslate: new serde that returnsIncompatible(Some(...))so the divergent native path only runs when the user opts in viaspark.comet.expression.StringTranslate.allowIncompatible=true. The notes call out both divergences (graphemes vs code points, U+0000 deletion).How are these changes tested?
bit_length.sql/octet_length.sql: extended withexpect_fallback(... on BinaryType is not supported)cases that confirm binary input falls back cleanly and Spark and Comet agree on the answer.string_translate.sql: converted toexpect_fallback(is not fully compatible with Spark)to assert the default-path fallback behaviour.string_translate_enabled.sql: new fixture that setsspark.comet.expression.StringTranslate.allowIncompatible=trueand exercises the native path on ASCII inputs where Spark and DataFusion agree.CometStringExpressionSuite"length, reverse, instr, replace, translate" wrapped inwithSQLConf(...allowIncompatible=true)so the existing translate assertion still runs natively.Verified passing on Spark 3.5:
CometSqlFileTestSuite expressions/string/(45 tests succeeded),CometStringExpressionSuite(33 tests succeeded),spotless:check.