vip, config: avoid multiple VIP owners exist at the same time #1118

Open
djshow832 wants to merge 12 commits into pingcap:main from djshow832:multi_vip

@djshow832 djshow832 commented Apr 2, 2026

What problem does this PR solve?

Issue Number: close #1117

Problem Summary:
When the client and the cluster are in different network segments, the client may fail to connect to the VIP after a VIP switchover:

  1. The new owner sends a GARP (gratuitous ARP).
  2. The switch sends a who-has ARP request to confirm the VIP.
  3. The old owner has not deleted the VIP yet, so both owners respond to the switch.
  4. The switch may learn the old owner's MAC address.
  5. The switch caches the wrong owner for a long time.
  6. The client connects to the old owner and fails.
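
The failure sequence above can be illustrated with a toy model. This is only a sketch: the `arpCache` type, the `resolve` helper, the address, and the MAC strings are hypothetical, and it assumes the switch keeps whichever reply it processed last, which real devices may or may not do.

```go
package main

import "fmt"

// arpCache is a toy model of a switch's ARP table (IP -> MAC).
// Assumption: a later reply simply overwrites an earlier entry.
type arpCache map[string]string

// learn records a MAC for an IP, replacing any previous entry.
func (c arpCache) learn(ip, mac string) { c[ip] = mac }

// resolve models the switch probing the VIP with who-has and
// processing the replies in arrival order. It returns the MAC
// that ends up cached.
func resolve(c arpCache, vip string, replies []string) string {
	for _, mac := range replies {
		c.learn(vip, mac)
	}
	return c[vip]
}

func main() {
	const vip = "10.0.0.100" // hypothetical VIP

	// Both owners still hold the VIP: the old owner's reply arrives last.
	cache := arpCache{}
	cache.learn(vip, "mac-new") // step 1: GARP from the new owner
	fmt.Println(resolve(cache, vip, []string{"mac-new", "mac-old"})) // mac-old

	// The old owner deleted the VIP first: only the new owner replies.
	cache = arpCache{}
	cache.learn(vip, "mac-new")
	fmt.Println(resolve(cache, vip, []string{"mac-new"})) // mac-new
}
```

In this model the stale entry appears only when two hosts answer for the same address, which is why removing the VIP from the old owner before the handover closes the window.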

What is changed and how it works:

  • The old owner deletes the VIP and then resigns ownership.
  • The new owner keeps refreshing GARP for some time after takeover.

Note:

  • TiProxy doesn't guarantee that there is only one owner at all times, for example when TiProxy goes down unexpectedly or cannot connect to PD. The new owner refreshes GARP to take over the VIP on a best-effort basis.
  • This change makes failover longer because there is a no-VIP window during the switchover.
  • VMAC support may be needed in the future if the switch doesn't send a who-has request when the new owner sends a GARP.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Notable changes

  • Has configuration change
  • Has HTTP API interfaces change
  • Has tiproxyctl change
  • Other user behavior changes

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

- Fix the issue that VIP failover may leave upstream devices forwarding to the old owner, by deleting the VIP before resigning ownership, removing intentional owner overlap, and refreshing GARP after takeover

@ti-chi-bot ti-chi-bot bot requested a review from bb7133 April 2, 2026 13:17

ti-chi-bot bot commented Apr 2, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign yangkeao for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot requested a review from xhebox April 2, 2026 13:17
@ti-chi-bot ti-chi-bot bot added the size/XL label Apr 2, 2026

codecov-commenter commented Apr 2, 2026

Codecov Report

❌ Patch coverage is 67.76316% with 49 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@30d5b3f). Learn more about missing BASE report.

Files with missing lines        Patch %   Lines
pkg/manager/vip/network.go      8.57%     32 Missing ⚠️
pkg/manager/elect/election.go   55.00%    9 Missing ⚠️
pkg/manager/vip/manager.go      93.82%    5 Missing ⚠️
lib/config/proxy.go             81.25%    2 Missing and 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1118   +/-   ##
=======================================
  Coverage        ?   67.19%           
=======================================
  Files           ?      144           
  Lines           ?    15176           
  Branches        ?        0           
=======================================
  Hits            ?    10197           
  Misses          ?     4288           
  Partials        ?      691           
Flag Coverage Δ
unit 67.19% <67.76%> (?)


@djshow832 (Collaborator, Author) commented

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4bfefbfbf2

Comment on lines +118 to +119
vm.stopARPRefresh()
vm.delVIP()

P1: Remove VIP before waiting for refresh worker

OnRetired waits for stopARPRefresh() before deleting the VIP. If the refresh goroutine is inside SendARP() (e.g. slow arping, non-zero burst interval, or command stall), retirement blocks while the VIP remains configured. The new owner can already acquire and bind the VIP once etcd ownership advances, so this ordering reintroduces a simultaneous-owner window that can repopulate upstream ARP caches with the old MAC.

@djshow832 (Collaborator, Author) replied

If we deleted the VIP only in onRetired(), it would always cause the multi-VIP problem. However, the VIP is also deleted before closing the election.
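
The ordering question in this thread can be made concrete with a small simulation. It is only a sketch: the `manager` type, `record`, `startARPRefresh`, and `run` are hypothetical stand-ins, and only the names `stopARPRefresh` and `delVIP` come from the reviewed lines. It does not model the additional VIP deletion before closing the election that the reply above mentions.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// manager is a hypothetical stand-in for a VIP manager that runs a
// background ARP refresh worker.
type manager struct {
	mu     sync.Mutex
	events []string
	stop   chan struct{}
	wg     sync.WaitGroup
}

func (m *manager) record(e string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.events = append(m.events, e)
}

// startARPRefresh launches a worker whose send takes sendTime,
// standing in for a slow ARP send.
func (m *manager) startARPRefresh(sendTime time.Duration) {
	m.stop = make(chan struct{})
	m.wg.Add(1)
	go func() {
		defer m.wg.Done()
		for {
			time.Sleep(sendTime)
			m.record("arp sent")
			select {
			case <-m.stop:
				return
			default:
			}
		}
	}()
}

// stopARPRefresh signals the worker and blocks until it has exited.
func (m *manager) stopARPRefresh() {
	close(m.stop)
	m.wg.Wait()
	m.record("refresh stopped")
}

func (m *manager) delVIP() { m.record("vip deleted") }

// run retires the manager in one of the two orders and returns the
// recorded event sequence.
func run(deleteFirst bool) []string {
	m := &manager{}
	m.startARPRefresh(20 * time.Millisecond)
	if deleteFirst {
		m.delVIP()
		m.stopARPRefresh()
	} else {
		m.stopARPRefresh() // blocks for an in-flight send
		m.delVIP()
	}
	return m.events
}

func main() {
	fmt.Println("stop first:  ", run(false))
	fmt.Println("delete first:", run(true))
}
```

With the stop-first order, the VIP deletion is always recorded after the in-flight send completes; with the delete-first order, the VIP is gone before the caller starts waiting for the worker.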

@ti-chi-bot ti-chi-bot bot added size/XXL and removed size/XL labels Apr 3, 2026

Successfully merging this pull request may close these issues.

Cross-segment VIP switch may cause connection failure

2 participants