Skip to content

persist: prevent duplicate rollup re-insertion in add_rollup#36740

Open
DAlperin wants to merge 1 commit into
MaterializeInc:mainfrom
DAlperin:dov/per-16-duplicate-rollup-insert
Open

persist: prevent duplicate rollup re-insertion in add_rollup#36740
DAlperin wants to merge 1 commit into
MaterializeInc:mainfrom
DAlperin:dov/per-16-duplicate-rollup-insert

Conversation

@DAlperin
Copy link
Copy Markdown
Member

Summary

Fixes the gc.rs:569 panic tracked in PER-16 by tightening StateCollections::add_rollup rather than softening the GC assertion.

add_rollup previously only checked self.rollups.get(&rollup_seqno). The apply_unbatched_idempotent_cmd retry path captures (rollup_seqno, rollup) and replays the closure on each retry. If the original CaS attempt actually succeeded but returned an indeterminate error, and a concurrent GC then removed the just-committed rollup map entry before the retry runs, the retry observes None and re-inserts the same PartialRollupKey under a new state seqno. That produces a second Insert for the same key in the diff stream and — once both entries are later GC'd — a second Delete for the same key. A subsequent find_removable_blobs walks both deletes in a single pass and trips assert!(rollups_to_delete.insert(...)), SIGABRTing the persist worker.

Fix it at the source: the None arm now refuses the insert when the rollup's key is already referenced anywhere in state.rollups, returning Continue(false) so callers treat it as a duplicate write (and delete the orphaned blob, as they already do for same-seqno collisions). This makes the GC-side invariant ("at most one Delete per PartialRollupKey per pass") naturally hold, leaving the existing hard assert in place as a meaningful tripwire for any future regression.

Tests

  • New state.rs unit test add_rollup_rejects_duplicate_key covering fresh insert, same-seqno idempotent retry, and the new duplicate-key-different-seqno rejection path.
  • Existing gc_rollups / regression_gc_behind datadriven coverage continues to exercise the happy path.

Test plan

  • cargo check -p mz-persist-client --tests
  • cargo clippy --all-targets -p mz-persist-client -- -D warnings
  • bin/fmt
  • cargo test -p mz-persist-client --lib add_rollup_rejects_duplicate_key
  • cargo test -p mz-persist-client --lib datadriven (covers gc / gc_rollups / regression_gc_behind)

Fixes PER-16

🤖 Generated with Claude Code

`StateCollections::add_rollup` previously only checked
`self.rollups.get(&rollup_seqno)`. If that returned `None`, it inserted
unconditionally. The `apply_unbatched_idempotent_cmd` retry path for
`add_rollup` re-runs the captured `(rollup_seqno, rollup)` closure
argument against the latest state on each retry. If a concurrent GC
removed the rollup's map entry between attempts (perfectly legal — the
indeterminate CaS may have already committed it, and GC may have moved
on past `seqno_since`), the retry observes `None` and re-inserts the
same `PartialRollupKey` under a new map entry.

That produces a second `Insert` event for the same rollup key in the
diff stream, and (later, when both entries get GC'd) a second `Delete`
event for the same key. A subsequent GC pass walks both deletes in a
single `find_removable_blobs` invocation and trips the hard
`assert!(rollups_to_delete.insert(...))` at `gc.rs:569`, SIGABRTing the
persist worker.

Fix it at the source: `add_rollup`'s `None` arm now refuses to insert
when the rollup's key is already referenced anywhere in `state.rollups`,
returning `Continue(false)` so the caller treats it as a duplicate
write (i.e., deletes its now-orphaned blob, like it does today for
seqno collisions).

Adds a `state.rs` unit test, `add_rollup_rejects_duplicate_key`, that
covers fresh insert, same-seqno idempotent retry, and the new
duplicate-key-different-seqno path.

Fixes PER-16
@DAlperin DAlperin requested a review from a team as a code owner May 26, 2026 21:27
@DAlperin DAlperin requested a review from antiguru May 26, 2026 21:50
// GC, producing duplicate Insert events in the diff stream
// and (later) duplicate Delete events that trip gc.rs:569.
if self.rollups.values().any(|r| r.key == rollup.key) {
return Continue(false);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

/// duplicate Delete events that trip the `assert!` in `gc.rs`'s
/// `find_removable_blobs`.
#[mz_ore::test]
fn add_rollup_rejects_duplicate_key() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a really nice test

let applied = match self.rollups.get(&rollup_seqno) {
Some(x) => x.key == rollup.key,
None => {
// PER-16: refuse to insert a rollup whose key is already
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not related to the change but thanks for the detailed bug in linear, reading through it helped me understand the invariants and how they were being violated

Copy link
Copy Markdown
Member

@antiguru antiguru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants