[opt](fe) Rewrite regexp functions in expression rules#64064
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
/review |
There was a problem hiding this comment.
I found two semantic regressions in the new regexp rewrite rule. The goal is clear and the change is focused, with unit and regression coverage, but the current rewrite does not preserve results for valid inline regex flag cases, so it should not be enabled as-is.
Critical checkpoints:
- Goal/test: The PR adds cheaper rewrites for regexp_replace and regexp_extract and includes FE unit plus regression tests, but tests miss inline flag cases that break equivalence.
- Scope/focus: The implementation is small and localized to Nereids expression rewrite.
- Concurrency/lifecycle/config/storage/transaction: Not applicable; no shared state, lifecycle management, config, persistence, or transaction changes.
- Parallel paths: Only Nereids expression rewrite is changed; the affected FE/BE function semantics were checked against regexp_replace/regexp_replace_one and regexp_extract implementations.
- Conditional checks: The shape checks for anchors and trailing .*$ are not sufficient because inline regex flags can change anchor/dot semantics.
- Test coverage/results: Coverage exists for common cases but misses the counterexamples below.
- Observability/performance: No additional observability needed for this optimizer rewrite; performance intent is valid once semantic guards are fixed.
- User focus: No additional user-provided review focus was specified.
| if (pattern == null || pattern.isEmpty()) { | ||
| return regexpReplace; | ||
| } | ||
| if (!startsWithUnescapedCaret(pattern) && !endsWithUnescapedDollar(pattern)) { |
There was a problem hiding this comment.
This treats a trailing $ as proving that regexp_replace can match at most once, but inline regex flags can make $ match per line. For example, regexp_replace('a\na', '(?m)a$', 'x') is valid RE2 syntax; the original regexp_replace uses RE2::GlobalReplace and replaces both line-ending matches (x\nx), while the rewritten regexp_replace_one uses RE2::Replace and only replaces the first (x\na). Please skip this rewrite when the pattern can enable multiline mode, or otherwise prove the anchor is still single-match under inline flags.
| private static String trimExtractPattern(String pattern) { | ||
| String trimmed = pattern; | ||
| if (endsWithUnescapedDotStarDollar(trimmed)) { | ||
| trimmed = trimmed.substring(0, trimmed.length() - 3); |
There was a problem hiding this comment.
Trimming a trailing .*$ is not semantics-preserving when inline flags disable dot-newline matching. Doris sets RE2 dot_nl by default, but RE2 inline flags can override it; for example regexp_extract(concat('a', char(10), 'b'), '(?-s)^(a).*$', 1) does not match because . cannot consume the newline, whereas the rewritten pattern (?-s)^(a) matches and returns a. Please guard against inline dot-mode changes, or avoid this rewrite for patterns containing inline flag groups that can affect ./$.
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Add a Nereids expression rewrite for regexp functions. The rule rewrites anchored regexp_replace calls without alternation to regexp_replace_one, and trims a trailing unescaped .* end anchor from eligible regexp_extract patterns. This reduces unnecessary global replacement work and regexp suffix matching while preserving conservative semantics.
### Release note
None
### Check List (For Author)
- Test:
- FE build: sh build.sh --fe
- FE unit test: sh run-fe-ut.sh --run org.apache.doris.nereids.rules.expression.rules.RegexpFunctionRewriteTest
- Regression test: sh run-regression-test.sh --run -d nereids_rules_p0/expression -s test_regexp_function_rewrite -c "jdbc:mysql://127.0.0.1:9408/?useLocalSessionState=true&allowLoadLocalInfile=true&zeroDateTimeBehavior=round" -parallel 1
- Behavior changed: No
- Does this need documentation: No
6693874 to
ba951eb
Compare
|
/review |
There was a problem hiding this comment.
Summary:
I found one blocking correctness issue in the new regexp_extract suffix-trimming rewrite. The existing inline threads already cover the RE2 inline-flag cases, so I did not duplicate those; this comment is for a separate execution path through the extended-regex Boost fallback.
Critical checkpoint conclusions:
- Goal/test: The goal is to optimize anchored regexp_replace and suffix-anchored regexp_extract without changing results. Tests cover the main RE2 cases and the already-known inline-flag cases, but do not cover enable_extended_regex/Boost fallback behavior.
- Scope/focus: The change is small and focused, but the regexp_extract proof is incomplete for all BE regex engines.
- Concurrency/lifecycle: No new shared mutable state, locks, threads, or lifecycle-sensitive objects are introduced.
- Config compatibility: No new config is added, but behavior depends on the existing enable_extended_regex session variable; the rewrite currently ignores that runtime-dependent engine choice.
- Parallel paths: regexp_extract has both RE2 and Boost fallback execution paths in BE; the rewrite is only safe for the RE2 dot_nl path. regexp_extract_or_null is not modified.
- Tests/results: Unit and regression tests exist, but they miss the extended-regex fallback negative case described inline. Regression output files are not involved because this suite compares enabled/disabled results directly.
- Observability/transactions/persistence: Not applicable; this is a deterministic expression rewrite with no transaction or persistence changes.
- Performance: The intended optimization is reasonable once the semantic guard is complete.
- User focus: No additional user-provided focus points were listed.
| String pattern = getStringLiteral(regexpExtract.child(1)); | ||
| if (pattern == null || pattern.isEmpty() || !isPositiveGroupIndex(regexpExtract.child(2)) | ||
| || !hasCapturingGroup(pattern) || hasUnescapedAlternation(pattern) | ||
| || hasInlineRegexpFlag(pattern, 's')) { |
There was a problem hiding this comment.
This guard still lets the suffix trim run for patterns that force BE's regexp_extract into the extended-regex Boost fallback when enable_extended_regex=true. That path is distinct from the RE2 inline-flag cases already discussed: RegexpExtractEngine::compile first tries RE2 with dot_nl=true, but unsupported constructs like look-around fall back to boost::regex::normal, where . does not match newlines by default. For example, with extended regex enabled, regexp_extract(concat('fooa', char(10), 'tail'), '(?<=foo)(a).*$', 1) should not match because the trailing .*$ cannot consume through the newline in Boost; after this rewrite the pattern becomes (?<=foo)(a) and returns a. Please either skip this optimization for patterns that may require the Boost fallback, or otherwise prove the runtime engine keeps the same dot/newline semantics before dropping .*$.
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Add a Nereids expression rewrite for regexp functions. The rule rewrites anchored regexp_replace calls without alternation to regexp_replace_one, and trims a trailing unescaped .* end anchor from eligible regexp_extract patterns. This reduces unnecessary global replacement work and regexp suffix matching while preserving conservative semantics.
Release note
None
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)