Skip to content

feat: support StringSplitSQL for split_part#4592

Open
michaelmitchell-bit wants to merge 2 commits into
apache:mainfrom
michaelmitchell-bit:support-stringsplitsql-split-part-4561
Open

feat: support StringSplitSQL for split_part#4592
michaelmitchell-bit wants to merge 2 commits into
apache:mainfrom
michaelmitchell-bit:support-stringsplitsql-split-part-4561

Conversation

@michaelmitchell-bit
Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Closes #4561.

Rationale for this change

split_part lowers to element_at(StringSplitSQL(str, delimiter), partNum). Comet already supports element_at, but the inner StringSplitSQL expression was unsupported, so split_part fell back to Spark.

What changes are included in this PR?

Adds native support for Spark 4.x StringSplitSQL via a literal-delimiter split_sql function. Unlike split / StringSplit, this splits on the delimiter literally rather than as a regex and keeps trailing empty parts.

This also fixes ListExtract default-value typing so defaults are cast to the array element type rather than the list type. That is needed for split_part, which uses an empty-string default through element_at.

Updates expression support docs and adds SQL-file coverage for split_part, including literal delimiters, dynamic delimiters, nulls, out-of-range defaults, negative positions, and invalid zero index behavior.

How are these changes tested?

Ran:

  • cargo fmt -p datafusion-comet-spark-expr
  • cargo test -p datafusion-comet-spark-expr string_funcs::split
  • cargo test -p datafusion-comet-spark-expr list_extract
  • cargo build
  • ./mvnw test -Dtest=none -Dsuites="org.apache.comet.CometSqlFileTestSuite split_part" -Dscalastyle.skip=true
  • ./mvnw compile -Pspark-4.0 -DskipTests -Dscalastyle.skip=true
  • ./mvnw compile -Pspark-4.2 -DskipTests -Dscalastyle.skip=true
  • git diff --check

@andygrove
Copy link
Copy Markdown
Member

Thanks @michaelmitchell-bit. This looks good overall.

StringSplitSQL is collation-aware in Spark and Comet does not support different collations yet, so it would be good to add a fallback for that case. See concat and reverse for examples.

@michaelmitchell-bit
Copy link
Copy Markdown
Contributor Author

Thanks @michaelmitchell-bit. This looks good overall.

StringSplitSQL is collation-aware in Spark and Comet does not support different collations yet, so it would be good to add a fallback for that case. See concat and reverse for examples.

awesome, ty. added a fallback for non-utf8_binary collations and covered it in the split_part SQL file test.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support StringSplitSQL to enable split_part

2 participants