fix: gate bit_length/octet_length on BinaryType and downgrade translate by andygrove · Pull Request #4594 · apache/datafusion-comet

andygrove · 2026-06-04T19:57:04Z

Which issue does this PR close?

Closes #4464
Closes #4463

Rationale for this change

Two correctness issues surfaced by the string-expressions audit (#4461). In both cases the serde reports Compatible while the underlying native path silently diverges from Spark. EXPLAIN, the auto-generated compatibility doc, and the dispatcher all see Compatible, so the operator runs natively rather than falling back.

#4464 ([Bug] bit_length and octet_length error natively for BinaryType input instead of falling back): bit_length / octet_length report Compatible(None) for BinaryType, but DataFusion's BitLengthFunc / OctetLengthFunc use Signature::coercible(... logical_string() ...) and reject Binary at execution time. As a result, bit_length(<binary>) and octet_length(<binary>) plan successfully under Comet, then surface as a native execution error rather than falling back cleanly. The sibling length already guards BinaryType via CometLength.
#4463 ([Bug] translate uses graphemes vs Spark code points and ignores U+0000 deletion): translate is wired as CometScalarFunction("translate") and reports Compatible, but DataFusion's translate (1) iterates over Unicode graphemes while Spark uses code points (so combining marks / ZWJ sequences disagree), and (2) substitutes U+0000 instead of treating it as a deletion sentinel like Spark's StringTranslate.buildDict.
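
To make the translate divergence concrete, here is a minimal Python sketch of the Spark-side semantics described above: per-code-point matching, with U+0000 acting as a deletion sentinel. It is an illustration, not Spark's or DataFusion's code. The names `build_dict` and `translate_code_points` are hypothetical, and the first-occurrence-wins rule for duplicate match characters is an assumption about `StringTranslate.buildDict`.

```python
# Illustrative model only; function names are hypothetical.
DELETE = "\u0000"  # sentinel meaning "drop this character"


def build_dict(matching: str, replace: str) -> dict:
    """Map each code point in `matching` to its replacement.

    Code points with no counterpart in `replace` map to the deletion
    sentinel. Assumption: the first occurrence of a duplicate wins.
    """
    mapping = {}
    for i, ch in enumerate(matching):  # Python str iterates by code point
        rep = replace[i] if i < len(replace) else DELETE
        mapping.setdefault(ch, rep)
    return mapping


def translate_code_points(s: str, matching: str, replace: str) -> str:
    mapping = build_dict(matching, replace)
    out = []
    for ch in s:
        rep = mapping.get(ch)
        if rep is None:
            out.append(ch)       # not in the dictionary: keep as-is
        elif rep != DELETE:
            out.append(rep)      # ordinary substitution
        # rep == DELETE: the character is removed from the output
    return "".join(out)


# 'b' maps to U+0000, so it is deleted rather than replaced by a NUL.
print(repr(translate_code_points("abc", "abc", "x\u0000z")))

# "e" followed by a combining acute accent is two code points, so the
# base letter alone is matched and the accent survives.
print(repr(translate_code_points("e\u0301", "e", "a")))
```

An implementation that segments by grapheme would presumably see `e` + U+0301 as a single unit that does not match `"e"`, and one that substitutes U+0000 literally would emit a NUL where this sketch deletes. Those are the two disagreements the issue describes.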

What changes are included in this PR?

CometBitLength / CometOctetLength: new serdes that gate BinaryType as Unsupported(Some(...)) (mirroring the existing CometLength shape). The string path remains Compatible.
CometStringTranslate: new serde that returns Incompatible(Some(...)) so the divergent native path only runs when the user opts in via spark.comet.expression.StringTranslate.allowIncompatible=true. The notes call out both divergences (graphemes vs code points, U+0000 deletion).
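
The gating logic can be modelled roughly as follows. This is a Python sketch of the idea, not Comet's Scala serde API: the class names mirror the support levels named in this PR, while `bit_length_support`, `translate_support`, and `runs_natively` are hypothetical helpers, and the note strings are abbreviated.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Compatible:
    notes: Optional[str] = None


@dataclass(frozen=True)
class Incompatible:
    notes: Optional[str] = None


@dataclass(frozen=True)
class Unsupported:
    notes: Optional[str] = None


def bit_length_support(input_type: str):
    """Hypothetical gate: binary input is unsupported, strings stay compatible."""
    if input_type == "BinaryType":
        return Unsupported("bit_length on BinaryType is not supported")
    return Compatible()


def translate_support():
    """Hypothetical gate: translate is always flagged as incompatible."""
    return Incompatible("graphemes vs code points; U+0000 deletion")


def runs_natively(level, allow_incompatible: bool = False) -> bool:
    """Decide between the native path and falling back to Spark."""
    if isinstance(level, Compatible):
        return True
    if isinstance(level, Incompatible):
        return allow_incompatible  # only when the user opts in
    return False                   # Unsupported always falls back


print(runs_natively(bit_length_support("StringType")))   # native path
print(runs_natively(bit_length_support("BinaryType")))   # falls back
print(runs_natively(translate_support()))                # falls back by default
print(runs_natively(translate_support(), allow_incompatible=True))
```

The distinction is that Unsupported can never be overridden, whereas Incompatible defers the decision to the user through spark.comet.expression.StringTranslate.allowIncompatible.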

How are these changes tested?

bit_length.sql / octet_length.sql: extended with expect_fallback(... on BinaryType is not supported) cases that confirm binary input falls back cleanly and Spark and Comet agree on the answer.
string_translate.sql: converted to expect_fallback(is not fully compatible with Spark) to assert the default-path fallback behaviour.
string_translate_enabled.sql: new fixture that sets spark.comet.expression.StringTranslate.allowIncompatible=true and exercises the native path on ASCII inputs where Spark and DataFusion agree.
CometStringExpressionSuite "length, reverse, instr, replace, translate" wrapped in withSQLConf(...allowIncompatible=true) so the existing translate assertion still runs natively.

Verified passing on Spark 3.5: CometSqlFileTestSuite expressions/string/ (45 tests succeeded), CometStringExpressionSuite (33 tests succeeded), spotless:check.

Surfaced by the string-expressions audit (apache#4461).

* `bit_length` / `octet_length`: report `Compatible` while DataFusion's native impls reject `BinaryType` at execution time, so calls on binary columns surface as a native error rather than falling back to Spark. Add `CometBitLength` / `CometOctetLength` serdes that gate `BinaryType` as `Unsupported`, mirroring the existing `CometLength` shape. Closes apache#4464.
* `translate`: report `Compatible` while DataFusion's `translate` iterates over Unicode graphemes (Spark uses code points) and substitutes U+0000 instead of treating it as a deletion sentinel. Add `CometStringTranslate` that returns `Incompatible(...)` so the divergent native path only runs when the user opts in via `spark.comet.expression.StringTranslate.allowIncompatible=true`. Closes apache#4463.

Tests: extend `bit_length.sql` / `octet_length.sql` with `expect_fallback` cases on binary input; convert `string_translate.sql` to assert the default fallback path; add `string_translate_enabled.sql` exercising the opt-in native path on ASCII inputs where Spark and DataFusion agree. The existing `CometStringExpressionSuite` translate assertion is wrapped in `withSQLConf(StringTranslate.allowIncompatible=true)`.

andygrove added 2 commits June 4, 2026 13:56

docs: mark translate as falls-back-by-default in expressions.md

f2257a1

andygrove wants to merge 2 commits into apache:main from andygrove:fix/string-audit-followups
