
[fix](be) Fix SIGSEGV in bvar::take_sample caused by AgentCombiner/TLS Agent lifetime race under high EPS #64040

Open
vchag wants to merge 2 commits into apache:master from vchag:bug/BE-SIGSEGV-bvar-take-sample-high-EPS

Conversation

vchag commented Jun 3, 2026

What problem does this PR solve?

Issue Number: close #63193

Related PR: apache/brpc#2949

Problem Summary:

Under high throughput, a race condition in brpc's bvar subsystem causes a SIGSEGV during take_sample. The crash occurs when a thread's TLS Agent is destroyed after the AgentCombiner it points to (owned by a Reducer, IntRecorder, or Percentile) has already been freed: the Agent's destructor calls combiner->commit_and_erase(this) through a dangling raw pointer.

The fix (a backport of apache/brpc#2949) replaces the raw back-pointer from Agent to AgentCombiner with a weak_ptr and makes the owning classes hold the combiner via shared_ptr. The Agent destructor now calls combiner.lock(); if the combiner has already been destroyed, lock() returns null and the destructor safely does nothing, which eliminates the use-after-free.
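
For reviewers unfamiliar with the pattern, the lifetime change can be illustrated with a minimal, self-contained sketch. This is not the brpc code: `Combiner`, `Agent`, `total()`, and the two helper functions are hypothetical stand-ins, and `commit_and_erase` takes a plain `int` here purely for illustration.

```cpp
#include <memory>
#include <mutex>

// Hypothetical stand-in for bvar's AgentCombiner (illustrative only).
class Combiner {
public:
    // Folds an agent's local value into the shared total.
    void commit_and_erase(int value) {
        std::lock_guard<std::mutex> guard(mu_);
        total_ += value;
    }
    int total() const {
        std::lock_guard<std::mutex> guard(mu_);
        return total_;
    }
private:
    mutable std::mutex mu_;
    int total_ = 0;
};

// Hypothetical stand-in for the per-thread (TLS) Agent.
class Agent {
public:
    explicit Agent(const std::shared_ptr<Combiner>& combiner)
        : combiner_(combiner) {}

    ~Agent() {
        // lock() yields an owning shared_ptr only if the combiner is still alive;
        // otherwise it is empty and there is nothing to commit.
        if (std::shared_ptr<Combiner> c = combiner_.lock()) {
            c->commit_and_erase(value_);
        }
    }

    void add(int v) { value_ += v; }

private:
    std::weak_ptr<Combiner> combiner_;  // non-owning back-reference
    int value_ = 0;
};

// The owner releases the combiner first; the agent's destructor must not touch it.
bool agent_outlives_combiner() {
    auto combiner = std::make_shared<Combiner>();
    auto agent = std::make_unique<Agent>(combiner);
    agent->add(5);
    std::weak_ptr<Combiner> watch = combiner;
    combiner.reset();  // combiner destroyed here (last owner released)
    agent.reset();     // ~Agent sees an expired weak_ptr and skips the commit
    return watch.expired();
}

// Normal order: the agent dies while the combiner is alive, so its value is committed.
int commit_when_combiner_alive() {
    auto combiner = std::make_shared<Combiner>();
    {
        Agent agent(combiner);
        agent.add(7);
    }
    return combiner->total();
}
```

In this sketch, a successful `lock()` also keeps the combiner alive for the duration of the commit, so the combiner cannot be destroyed midway through the destructor's call.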

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merges this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?
