[opt](fe) Rewrite regexp functions in expression rules by Mryange · Pull Request #64064 · apache/doris

Mryange · 2026-06-03T08:17:02Z

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Add a Nereids expression rewrite for regexp functions. The rule rewrites anchored regexp_replace calls without alternation to regexp_replace_one, and trims a trailing unescaped .* end anchor from eligible regexp_extract patterns. This reduces unnecessary global replacement work and regexp suffix matching while preserving conservative semantics.

Release note

None

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

Test
- Regression test
- Unit Test
- Manual test (add detailed scripts or steps below)
- No need to test or manual test. Explain why:
  - This is a refactor/code format and no logic has been changed.
  - Previous test can cover this change.
  - No code files have been changed.
  - Other reason
Behavior changed:
- No.
- Yes.
Does this need documentation?
- No.
- Yes.

Check List (For Reviewer who merge this PR)

Confirm the release note
Confirm test cases
Confirm document
Add branch pick label

hello-stephen · 2026-06-03T08:17:20Z

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

What problem was fixed (it's best to include specific error reporting information). How it was fixed.
Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
What features were added. Why was this function added?
Which code was refactored and why was this part of the code refactored?
Which functions were optimized and what is the difference before and after the optimization?

Mryange · 2026-06-03T08:22:46Z

/review

github-actions

I found two semantic regressions in the new regexp rewrite rule. The goal is clear and the change is focused, with unit and regression coverage, but the current rewrite does not preserve results for valid inline regex flag cases, so it should not be enabled as-is.

Critical checkpoints:

Goal/test: The PR adds cheaper rewrites for regexp_replace and regexp_extract and includes FE unit plus regression tests, but tests miss inline flag cases that break equivalence.
Scope/focus: The implementation is small and localized to Nereids expression rewrite.
Concurrency/lifecycle/config/storage/transaction: Not applicable; no shared state, lifecycle management, config, persistence, or transaction changes.
Parallel paths: Only Nereids expression rewrite is changed; the affected FE/BE function semantics were checked against regexp_replace/regexp_replace_one and regexp_extract implementations.
Conditional checks: The shape checks for anchors and trailing .*$ are not sufficient because inline regex flags can change anchor/dot semantics.
Test coverage/results: Coverage exists for common cases but misses the counterexamples below.
Observability/performance: No additional observability needed for this optimizer rewrite; performance intent is valid once semantic guards are fixed.
User focus: No additional user-provided review focus was specified.

github-actions · 2026-06-03T08:42:45Z

+        if (pattern == null || pattern.isEmpty()) {
+            return regexpReplace;
+        }
+        if (!startsWithUnescapedCaret(pattern) && !endsWithUnescapedDollar(pattern)) {


This treats a trailing $ as proving that regexp_replace can match at most once, but inline regex flags can make $ match per line. For example, regexp_replace('a\na', '(?m)a$', 'x') is valid RE2 syntax; the original regexp_replace uses RE2::GlobalReplace and replaces both line-ending matches (x\nx), while the rewritten regexp_replace_one uses RE2::Replace and only replaces the first (x\na). Please skip this rewrite when the pattern can enable multiline mode, or otherwise prove the anchor is still single-match under inline flags.

github-actions · 2026-06-03T08:42:45Z

+    private static String trimExtractPattern(String pattern) {
+        String trimmed = pattern;
+        if (endsWithUnescapedDotStarDollar(trimmed)) {
+            trimmed = trimmed.substring(0, trimmed.length() - 3);


Trimming a trailing .*$ is not semantics-preserving when inline flags disable dot-newline matching. Doris sets RE2 dot_nl by default, but RE2 inline flags can override it; for example regexp_extract(concat('a', char(10), 'b'), '(?-s)^(a).*$', 1) does not match because . cannot consume the newline, whereas the rewritten pattern (?-s)^(a) matches and returns a. Please guard against inline dot-mode changes, or avoid this rewrite for patterns containing inline flag groups that can affect ./$.

### What problem does this PR solve? Issue Number: close #xxx Related PR: #xxx Problem Summary: Add a Nereids expression rewrite for regexp functions. The rule rewrites anchored regexp_replace calls without alternation to regexp_replace_one, and trims a trailing unescaped .* end anchor from eligible regexp_extract patterns. This reduces unnecessary global replacement work and regexp suffix matching while preserving conservative semantics. ### Release note None ### Check List (For Author) - Test: - FE build: sh build.sh --fe - FE unit test: sh run-fe-ut.sh --run org.apache.doris.nereids.rules.expression.rules.RegexpFunctionRewriteTest - Regression test: sh run-regression-test.sh --run -d nereids_rules_p0/expression -s test_regexp_function_rewrite -c "jdbc:mysql://127.0.0.1:9408/?useLocalSessionState=true&allowLoadLocalInfile=true&zeroDateTimeBehavior=round" -parallel 1 - Behavior changed: No - Does this need documentation: No

Mryange · 2026-06-03T13:12:43Z

/review

github-actions

Summary:
I found one blocking correctness issue in the new regexp_extract suffix-trimming rewrite. The existing inline threads already cover the RE2 inline-flag cases, so I did not duplicate those; this comment is for a separate execution path through the extended-regex Boost fallback.

Critical checkpoint conclusions:

Goal/test: The goal is to optimize anchored regexp_replace and suffix-anchored regexp_extract without changing results. Tests cover the main RE2 cases and the already-known inline-flag cases, but do not cover enable_extended_regex/Boost fallback behavior.
Scope/focus: The change is small and focused, but the regexp_extract proof is incomplete for all BE regex engines.
Concurrency/lifecycle: No new shared mutable state, locks, threads, or lifecycle-sensitive objects are introduced.
Config compatibility: No new config is added, but behavior depends on the existing enable_extended_regex session variable; the rewrite currently ignores that runtime-dependent engine choice.
Parallel paths: regexp_extract has both RE2 and Boost fallback execution paths in BE; the rewrite is only safe for the RE2 dot_nl path. regexp_extract_or_null is not modified.
Tests/results: Unit and regression tests exist, but they miss the extended-regex fallback negative case described inline. Regression output files are not involved because this suite compares enabled/disabled results directly.
Observability/transactions/persistence: Not applicable; this is a deterministic expression rewrite with no transaction or persistence changes.
Performance: The intended optimization is reasonable once the semantic guard is complete.
User focus: No additional user-provided focus points were listed.

github-actions · 2026-06-03T18:59:18Z

+        String pattern = getStringLiteral(regexpExtract.child(1));
+        if (pattern == null || pattern.isEmpty() || !isPositiveGroupIndex(regexpExtract.child(2))
+                || !hasCapturingGroup(pattern) || hasUnescapedAlternation(pattern)
+                || hasInlineRegexpFlag(pattern, 's')) {


This guard still lets the suffix trim run for patterns that force BE's regexp_extract into the extended-regex Boost fallback when enable_extended_regex=true. That path is distinct from the RE2 inline-flag cases already discussed: RegexpExtractEngine::compile first tries RE2 with dot_nl=true, but unsupported constructs like look-around fall back to boost::regex::normal, where . does not match newlines by default. For example, with extended regex enabled, regexp_extract(concat('fooa', char(10), 'tail'), '(?<=foo)(a).*$', 1) should not match because the trailing .*$ cannot consume through the newline in Boost; after this rewrite the pattern becomes (?<=foo)(a) and returns a. Please either skip this optimization for patterns that may require the Boost fallback, or otherwise prove the runtime engine keeps the same dot/newline semantics before dropping .*$.

github-actions Bot requested changes Jun 3, 2026

View reviewed changes

Mryange force-pushed the opt-regexp-function-rewrite branch from 6693874 to ba951eb Compare June 3, 2026 13:11

github-actions Bot requested changes Jun 3, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[opt](fe) Rewrite regexp functions in expression rules#64064

[opt](fe) Rewrite regexp functions in expression rules#64064
Mryange wants to merge 1 commit into
apache:masterfrom
Mryange:opt-regexp-function-rewrite

Mryange commented Jun 3, 2026

Uh oh!

hello-stephen commented Jun 3, 2026

Uh oh!

Mryange commented Jun 3, 2026

Uh oh!

github-actions Bot left a comment

Uh oh!

github-actions Bot Jun 3, 2026

Uh oh!

github-actions Bot Jun 3, 2026

Uh oh!

Mryange commented Jun 3, 2026

Uh oh!

github-actions Bot left a comment

Uh oh!

github-actions Bot Jun 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Mryange commented Jun 3, 2026

What problem does this PR solve?

Release note

What problem does this PR solve?

Release note

Check List (For Author)

Check List (For Reviewer who merge this PR)

Uh oh!

hello-stephen commented Jun 3, 2026

Uh oh!

Mryange commented Jun 3, 2026

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

Mryange commented Jun 3, 2026

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants