fix(client): recover gc-orphaned forest nodes on 404 via verified CID race #25
Merged
The online forest walk aborted when a walkable-v8 forest node or manifest page returned 404 NoSuchKey from a reachable master -- the case where a server-side `ipfs repo gc` destroyed the gateway storage-key->CID index entry while the block still exists in IPFS by CID. Offline mode already recovered these via the verified gateway race; online did not, because `get_object_with_offline_fallback_known_cid` engaged the race only on `is_master_unreachable_error`, never on a 404. Add a forest-scoped `get_forest_object_known_cid` (private inner + `recover_on_not_found` bool) that also races the manifest-supplied CID on a not-found, and route the two forest callers (`S3BlobBackend::get_with_cid_hint`, `load_manifest_pages`) through it. The generic method keeps its strict propagate-404 invariant (and its security test). Fetched bytes are content-verified against the CID (fetch_verified) + AEAD-decrypted + storage-key-recomputed, so recovery can only return the exact block the manifest points at. A WARN surfaces each recovery so gateway-index rot is visible. Native-only; the recovered node is not re-uploaded on read. Verified end-to-end against the live `videos` bucket: the two orphaned objects (one page, one node) that abort the wasm walk now recover and the bucket lists fully (33 files). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
What
Fixes #24: the online forest walk aborted when a walkable-v8 forest node or manifest page returned 404 NoSuchKey from a reachable master. This is the case where a server-side `ipfs repo gc` destroyed the gateway storage-key->CID index entry while the block still exists in IPFS by CID. Offline mode already recovered these via the verified gateway race; online didn't, because `get_object_with_offline_fallback_known_cid` engaged the race only on `is_master_unreachable_error`, never on a 404.
Change
- `FulaClient::get_forest_object_known_cid` = the generic cid-hint fetch + recover-on-not-found. Implemented as a shared private inner (`..._inner(..., recover_on_not_found: bool)`); the generic public method delegates `false` (its strict propagate-404 invariant and `test_cid_hint_master_4xx_propagates_without_fallback` are unchanged), the new method delegates `true`.
- The two forest callers, `S3BlobBackend::get_with_cid_hint` (HAMT nodes) and `EncryptedClient::load_manifest_pages` (manifest pages), now use the new method. No other caller changes.
- A `warn!` surfaces each recovery so ongoing gateway-index rot is visible (silent recovery would hide the very signal that surfaced this bug).
Why it's safe
Recovery races the gateway pool for the manifest-supplied CID; `fetch_verified` content-verifies bytes against that CID, then the node store AEAD-decrypts and recomputes the storage-key + page-id/seq. So recovery can only ever return the exact block the freshly-decrypted, authoritative manifest points at: the CID is the capability. On gateway-race failure the original 404 propagates. Concept + diff reviewed by independent advisors (Gemini, Cursor, Copilot [source-grounded]; Codex description-only).
Verification
- `s3_backend_get_with_cid_hint_recovers_orphaned_node_on_master_404` (new: fails before the fix, passes after). 4/4 in `walkable_v8_offline_walk.rs`, 208/208 lib tests pass (incl. the propagate-404 security test).
- End-to-end against the live `videos` bucket (master up, default public gateways = the FxFiles config): the two orphaned objects that abort the wasm walk, page `Qm94e8de…` and node `__fula_forest_v7_nodes/a096c036…`, recover via the gateway race and the bucket lists fully (33 files). Harness not committed (uses real credentials).
Scope / rollout
- Native-only (`cfg(not(target_arch = "wasm32"))`). The wasm `get_with_cid_hint` degrades to plain `get()` (no gateway pool on web), so pinning-webui is NOT fixed by this and needs separate work (it also lists via the HEAD-per-object path, not the forest walk).
- Requires `gateway_fallback_enabled = true` (FxFiles already does).
- The `page_ref.cid` wiring is pre-existing and covered by the E2E + the prior `recover_walk` evidence (855 files reconstructed via manifest CID hints), not a new unit test; the recovery method itself is unit-tested via the node caller.
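
For reviewers who want the shape of the change without opening the diff, here is a minimal, self-contained sketch of the pattern described under Change and Why it's safe: one shared inner function, a `recover_on_not_found` flag, and a race that only accepts bytes matching the expected CID. Everything in it is an assumption made for illustration: `FetchError`, `toy_cid`, `race_verified`, `get_generic` and `get_forest` are hypothetical names with simplified signatures, the master response and gateway pool are passed in as plain values, and `DefaultHasher` stands in for the cryptographic multihash a real CID uses. It is not the code in this PR.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical error type; the real client's error enum is not shown here.
#[derive(Debug, PartialEq)]
enum FetchError {
    NotFound,          // master reachable, answered 404 NoSuchKey
    MasterUnreachable, // master could not be contacted
}

/// Stand-in for a CID: a 64-bit digest of the block bytes.
/// Real CIDs use a cryptographic multihash; DefaultHasher is NOT one.
fn toy_cid(bytes: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    bytes.hash(&mut h);
    h.finish()
}

/// Stand-in for the verified gateway race: accept the first candidate whose
/// bytes hash to the expected CID and ignore every other response.
fn race_verified(candidates: &[Vec<u8>], cid: u64) -> Option<Vec<u8>> {
    candidates.iter().find(|b| toy_cid(b.as_slice()) == cid).cloned()
}

/// Shared inner: one code path, one flag deciding whether a 404 may recover.
fn get_known_cid_inner(
    master: Result<Vec<u8>, FetchError>,
    gateways: &[Vec<u8>],
    cid: u64,
    recover_on_not_found: bool,
) -> Result<Vec<u8>, FetchError> {
    match master {
        Ok(bytes) => Ok(bytes),
        // Offline-style fallback: race whenever the master is unreachable.
        Err(FetchError::MasterUnreachable) => {
            race_verified(gateways, cid).ok_or(FetchError::MasterUnreachable)
        }
        // Forest-scoped recovery: race on a 404 only when the caller opted in.
        Err(FetchError::NotFound) if recover_on_not_found => {
            match race_verified(gateways, cid) {
                Some(bytes) => {
                    eprintln!("WARN: recovered cid {cid:x} via gateway race after master 404");
                    Ok(bytes)
                }
                None => Err(FetchError::NotFound), // race failed: original 404 propagates
            }
        }
        // Generic path: a 404 is propagated untouched.
        Err(e) => Err(e),
    }
}

/// Generic fetch: delegates `false`, so the propagate-404 invariant holds.
fn get_generic(m: Result<Vec<u8>, FetchError>, g: &[Vec<u8>], cid: u64) -> Result<Vec<u8>, FetchError> {
    get_known_cid_inner(m, g, cid, false)
}

/// Forest-scoped fetch: delegates `true`, so a 404 may recover by CID.
fn get_forest(m: Result<Vec<u8>, FetchError>, g: &[Vec<u8>], cid: u64) -> Result<Vec<u8>, FetchError> {
    get_known_cid_inner(m, g, cid, true)
}

fn main() {
    let block = b"forest node".to_vec();
    let cid = toy_cid(&block);
    let gateways = vec![b"tampered".to_vec(), block.clone()];

    // Generic caller: the 404 propagates even though a gateway holds the block.
    assert_eq!(get_generic(Err(FetchError::NotFound), &gateways, cid), Err(FetchError::NotFound));
    // Forest caller: the 404 recovers, and only the matching bytes are accepted.
    assert_eq!(get_forest(Err(FetchError::NotFound), &gateways, cid), Ok(block.clone()));
    // No gateway has matching bytes: the original 404 comes back.
    assert_eq!(get_forest(Err(FetchError::NotFound), &[b"tampered".to_vec()], cid), Err(FetchError::NotFound));
}
```

Keeping the flag on a private inner function means the generic wrapper cannot be switched into recovery mode by a caller, which is one plausible reason to prefer this over a public boolean parameter.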