fix(client): recover gc-orphaned forest nodes on 404 via verified CID race #25

Merged
ehsan6sha merged 2 commits into main from fix/walkable-v8-recover-orphaned-node-on-404
Jun 4, 2026

@ehsan6sha

What

Fixes #24 — the online forest walk aborted when a walkable-v8 forest node or manifest page returned 404 NoSuchKey from a reachable master: the case where a server-side ipfs repo gc destroyed the gateway storage-key->CID index entry while the block still exists in IPFS by CID. Offline mode already recovered these via the verified gateway race; online didn't, because get_object_with_offline_fallback_known_cid engaged the race only on is_master_unreachable_error, never on a 404.

Change

  • New forest-scoped FulaClient::get_forest_object_known_cid: the generic CID-hint fetch plus recover-on-not-found. It is implemented as a shared private inner (..._inner(..., recover_on_not_found: bool)). The generic public method delegates with false, so its strict propagate-404 invariant and test_cid_hint_master_4xx_propagates_without_fallback are unchanged; the new method delegates with true.
  • The two forest-infrastructure callers — S3BlobBackend::get_with_cid_hint (HAMT nodes) and EncryptedClient::load_manifest_pages (manifest pages) — now use the new method. No other caller changes.
  • A warn! surfaces each recovery so ongoing gateway-index rot is visible (silent recovery would hide the very signal that surfaced this bug).
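
For readers without the diff open, the shared-inner pattern described above can be sketched roughly as follows. This is a minimal, std-only illustration and not the code in this PR: `SketchClient`, `FetchError`, `get_generic`, `get_forest` and the `HashMap`-backed master and gateway pool are all hypothetical stand-ins, and content verification and the `warn!` are omitted.

```rust
use std::collections::HashMap;

/// Hypothetical error type for this sketch; the client's real error enum is not shown here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FetchError {
    /// Master was reachable and answered 404 NoSuchKey.
    NotFound,
    /// Master could not be reached at all.
    MasterUnreachable,
}

/// Toy stand-in for the client: the master maps storage keys to bytes,
/// the gateway pool maps CIDs to bytes.
pub struct SketchClient {
    pub master_up: bool,
    pub gateway_fallback_enabled: bool,
    pub master: HashMap<String, Vec<u8>>,
    pub gateways: HashMap<String, Vec<u8>>,
}

impl SketchClient {
    fn get_from_master(&self, key: &str) -> Result<Vec<u8>, FetchError> {
        if !self.master_up {
            return Err(FetchError::MasterUnreachable);
        }
        self.master.get(key).cloned().ok_or(FetchError::NotFound)
    }

    /// Shared private inner: the flag decides whether a 404 may engage the race.
    fn get_known_cid_inner(
        &self,
        key: &str,
        cid: &str,
        recover_on_not_found: bool,
    ) -> Result<Vec<u8>, FetchError> {
        let err = match self.get_from_master(key) {
            Ok(bytes) => return Ok(bytes),
            Err(err) => err,
        };
        let may_race = match err {
            FetchError::MasterUnreachable => true,
            FetchError::NotFound => recover_on_not_found,
        };
        if !(self.gateway_fallback_enabled && may_race) {
            return Err(err);
        }
        // If no gateway has the block, the original master error propagates.
        self.gateways.get(cid).cloned().ok_or(err)
    }

    /// Generic path (hypothetical name): a 404 from a reachable master always propagates.
    pub fn get_generic(&self, key: &str, cid: &str) -> Result<Vec<u8>, FetchError> {
        self.get_known_cid_inner(key, cid, false)
    }

    /// Forest-scoped path (hypothetical name): a 404 also races the CID.
    pub fn get_forest(&self, key: &str, cid: &str) -> Result<Vec<u8>, FetchError> {
        self.get_known_cid_inner(key, cid, true)
    }
}
```

Keeping the flag on a private inner means the public generic method cannot be switched into recovery mode by a caller, which is what lets the existing propagate-404 test stay untouched.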

Why it's safe

Recovery races the gateway pool for the manifest-supplied CID; fetch_verified content-verifies bytes against that CID, then the node store AEAD-decrypts and recomputes the storage-key + page-id/seq. So recovery can only ever return the exact block the freshly-decrypted, authoritative manifest points at — the CID is the capability. On gateway-race failure the original 404 propagates. Concept + diff reviewed by independent advisors (Gemini, Cursor, Copilot [source-grounded]; Codex description-only).
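
The "CID is the capability" argument rests on rejecting any gateway response whose bytes do not hash to the expected digest. A rough sketch of that check is below, under loud assumptions: `toy_digest` is FNV-1a 64-bit, a non-cryptographic stand-in for the multihash carried in a real CID, `first_verified` is a hypothetical name, and the race is modeled as a slice of responses in arrival order. The real `fetch_verified` is not reproduced here.

```rust
/// FNV-1a 64-bit, used ONLY as a dependency-free stand-in for the
/// cryptographic multihash inside a real CID. It is not collision-resistant.
pub fn toy_digest(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

/// Returns the first response (in arrival order) whose digest matches the
/// expected one; mismatching responses are skipped, never returned.
pub fn first_verified(responses: &[Vec<u8>], expected_digest: u64) -> Option<Vec<u8>> {
    responses
        .iter()
        .find(|body| toy_digest(body.as_slice()) == expected_digest)
        .cloned()
}
```

In this model a gateway that answers first with wrong bytes cannot win the race; it is simply passed over, and if no response verifies the caller gets `None` and can propagate the original 404.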

Verification

  • Unit: s3_backend_get_with_cid_hint_recovers_orphaned_node_on_master_404 (new — fails before the fix, passes after). 4/4 in walkable_v8_offline_walk.rs, 208/208 lib tests pass (incl. the propagate-404 security test).
  • E2E (live videos bucket, master up, default public gateways, i.e. the FxFiles config): the two orphaned objects that abort the wasm walk (page Qm94e8de… and node __fula_forest_v7_nodes/a096c036…) recover via the gateway race, and the bucket lists fully (33 files). Harness not committed because it uses real credentials.

Scope / rollout

  • Native only (cfg(not(target_arch = "wasm32"))). The wasm get_with_cid_hint degrades to plain get() (no gateway pool on web), so pinning-webui is NOT fixed by this and needs separate work (it also lists via the HEAD-per-object path, not the forest walk).
  • Inert unless the shipped app config has gateway_fallback_enabled = true (FxFiles already does).
  • The page-caller's page_ref.cid wiring is pre-existing and covered by the E2E + the prior recover_walk evidence (855 files reconstructed via manifest CID hints), not a new unit test; the recovery method itself is unit-tested via the node caller.
  • Recovered nodes are not re-uploaded on the read path (deliberate); the 404 persists and is re-raced per read until the next forest write re-pins the node.
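
The native-only scope comes down to conditional compilation on the target architecture. A minimal sketch of that split follows; `FetchPath` and `fetch_path_for_target` are hypothetical names for illustration, and only the `cfg` predicate itself is taken from the description above.

```rust
/// Which fetch strategy a build uses (hypothetical enum for this sketch).
#[derive(Debug, PartialEq, Eq)]
pub enum FetchPath {
    /// Native builds: CID-hint fetch that can fall back to the gateway race.
    CidHintWithGatewayRace,
    /// wasm32 builds: plain get(), since there is no gateway pool on web.
    PlainGet,
}

// Compiled only for non-wasm targets.
#[cfg(not(target_arch = "wasm32"))]
pub fn fetch_path_for_target() -> FetchPath {
    FetchPath::CidHintWithGatewayRace
}

// Compiled only for wasm32; exactly one of the two definitions exists per build.
#[cfg(target_arch = "wasm32")]
pub fn fetch_path_for_target() -> FetchPath {
    FetchPath::PlainGet
}
```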

The online forest walk aborted when a walkable-v8 forest node or manifest
page returned 404 NoSuchKey from a reachable master -- the case where a
server-side `ipfs repo gc` destroyed the gateway storage-key->CID index
entry while the block still exists in IPFS by CID. Offline mode already
recovered these via the verified gateway race; online did not, because
`get_object_with_offline_fallback_known_cid` engaged the race only on
`is_master_unreachable_error`, never on a 404.

Add a forest-scoped `get_forest_object_known_cid` (private inner +
`recover_on_not_found` bool) that also races the manifest-supplied CID on a
not-found, and route the two forest callers (`S3BlobBackend::get_with_cid_hint`,
`load_manifest_pages`) through it. The generic method keeps its strict
propagate-404 invariant (and its security test). Fetched bytes are
content-verified against the CID (fetch_verified) + AEAD-decrypted +
storage-key-recomputed, so recovery can only return the exact block the
manifest points at. A WARN surfaces each recovery so gateway-index rot is
visible. Native-only; the recovered node is not re-uploaded on read.

Verified end-to-end against the live `videos` bucket: the two orphaned
objects (one page, one node) that abort the wasm walk now recover and the
bucket lists fully (33 files).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@ehsan6sha ehsan6sha merged commit 80c6398 into main Jun 4, 2026
15 of 16 checks passed
@ehsan6sha ehsan6sha deleted the fix/walkable-v8-recover-orphaned-node-on-404 branch June 4, 2026 15:30
Development

Successfully merging this pull request may close these issues.

fula-client: online forest walk aborts on gc-orphaned nodes (recover via manifest CID-hint on 404)