persist: prevent duplicate rollup re-insertion in add_rollup#36740
Open
DAlperin wants to merge 1 commit into
Open
persist: prevent duplicate rollup re-insertion in add_rollup#36740DAlperin wants to merge 1 commit into
add_rollup#36740DAlperin wants to merge 1 commit into
Conversation
`StateCollections::add_rollup` previously only checked `self.rollups.get(&rollup_seqno)`. If that returned `None`, it inserted unconditionally. The `apply_unbatched_idempotent_cmd` retry path for `add_rollup` re-runs the captured `(rollup_seqno, rollup)` closure argument against the latest state on each retry. If a concurrent GC removed the rollup's map entry between attempts (perfectly legal — the indeterminate CaS may have already committed it, and GC may have moved on past `seqno_since`), the retry observes `None` and re-inserts the same `PartialRollupKey` under a new map entry. That produces a second `Insert` event for the same rollup key in the diff stream, and (later, when both entries get GC'd) a second `Delete` event for the same key. A subsequent GC pass walks both deletes in a single `find_removable_blobs` invocation and trips the hard `assert!(rollups_to_delete.insert(...))` at `gc.rs:569`, SIGABRTing the persist worker. Fix it at the source: `add_rollup`'s `None` arm now refuses to insert when the rollup's key is already referenced anywhere in `state.rollups`, returning `Continue(false)` so the caller treats it as a duplicate write (i.e., deletes its now-orphaned blob, like it does today for seqno collisions). Adds a `state.rs` unit test, `add_rollup_rejects_duplicate_key`, that covers fresh insert, same-seqno idempotent retry, and the new duplicate-key-different-seqno path. Fixes PER-16
mtabebe
approved these changes
May 27, 2026
| // GC, producing duplicate Insert events in the diff stream | ||
| // and (later) duplicate Delete events that trip gc.rs:569. | ||
| if self.rollups.values().any(|r| r.key == rollup.key) { | ||
| return Continue(false); |
| /// duplicate Delete events that trip the `assert!` in `gc.rs`'s | ||
| /// `find_removable_blobs`. | ||
| #[mz_ore::test] | ||
| fn add_rollup_rejects_duplicate_key() { |
Contributor
There was a problem hiding this comment.
This is a really nice test
| let applied = match self.rollups.get(&rollup_seqno) { | ||
| Some(x) => x.key == rollup.key, | ||
| None => { | ||
| // PER-16: refuse to insert a rollup whose key is already |
Contributor
There was a problem hiding this comment.
Not related to the change but thanks for the detailed bug in linear, reading through it helped me understand the invariants and how they were being violated
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes the
gc.rs:569panic tracked in PER-16 by tighteningStateCollections::add_rolluprather than softening the GC assertion.add_rolluppreviously only checkedself.rollups.get(&rollup_seqno). Theapply_unbatched_idempotent_cmdretry path captures(rollup_seqno, rollup)and replays the closure on each retry. If the original CaS attempt actually succeeded but returned an indeterminate error, and a concurrent GC then removed the just-committed rollup map entry before the retry runs, the retry observesNoneand re-inserts the samePartialRollupKeyunder a new state seqno. That produces a secondInsertfor the same key in the diff stream and — once both entries are later GC'd — a secondDeletefor the same key. A subsequentfind_removable_blobswalks both deletes in a single pass and tripsassert!(rollups_to_delete.insert(...)), SIGABRTing the persist worker.Fix it at the source: the
Nonearm now refuses the insert when the rollup's key is already referenced anywhere instate.rollups, returningContinue(false)so callers treat it as a duplicate write (and delete the orphaned blob, as they already do for same-seqno collisions). This makes the GC-side invariant ("at most one Delete perPartialRollupKeyper pass") naturally hold, leaving the existing hard assert in place as a meaningful tripwire for any future regression.Tests
state.rsunit testadd_rollup_rejects_duplicate_keycovering fresh insert, same-seqno idempotent retry, and the new duplicate-key-different-seqno rejection path.gc_rollups/regression_gc_behinddatadriven coverage continues to exercise the happy path.Test plan
cargo check -p mz-persist-client --testscargo clippy --all-targets -p mz-persist-client -- -D warningsbin/fmtcargo test -p mz-persist-client --lib add_rollup_rejects_duplicate_keycargo test -p mz-persist-client --lib datadriven(coversgc/gc_rollups/regression_gc_behind)Fixes PER-16
🤖 Generated with Claude Code