fix(client): recover gc-orphaned forest nodes on 404 via verified CID race by ehsan6sha · Pull Request #25 · functionland/fula-api

ehsan6sha · 2026-06-04T14:32:31Z

What

Fixes #24 — the online forest walk aborted when a walkable-v8 forest node or manifest page returned 404 NoSuchKey from a reachable master: the case where a server-side ipfs repo gc destroyed the gateway storage-key->CID index entry while the block still exists in IPFS by CID. Offline mode already recovered these via the verified gateway race; online didn't, because get_object_with_offline_fallback_known_cid engaged the race only on is_master_unreachable_error, never on a 404.

Change

New forest-scoped FulaClient::get_forest_object_known_cid: the generic cid-hint fetch plus recover-on-not-found. Both methods share a private inner (..._inner(..., recover_on_not_found: bool)). The generic public method passes false, so its strict propagate-404 invariant and test_cid_hint_master_4xx_propagates_without_fallback are unchanged; the new method passes true.
The two forest-infrastructure callers — S3BlobBackend::get_with_cid_hint (HAMT nodes) and EncryptedClient::load_manifest_pages (manifest pages) — now use the new method. No other caller changes.
A warn! surfaces each recovery so ongoing gateway-index rot is visible (silent recovery would hide the very signal that surfaced this bug).
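
The delegation described above can be sketched roughly as follows. This is a simplified illustration, not the code in the diff: `ToyClient`, `FetchError`, the in-memory maps, and all signatures are assumptions made for the example, and the gateway "race" is reduced to a map lookup. Only the two public method names and the `recover_on_not_found` flag come from the description.

```rust
use std::collections::HashMap;

/// Simplified error type (hypothetical; the real client has its own errors).
#[derive(Debug, Clone, PartialEq)]
pub enum FetchError {
    /// Master was reachable and answered 404 NoSuchKey.
    NotFound,
    /// Master could not be reached at all.
    MasterUnreachable,
}

/// Toy stand-in for the client: `master` maps storage key -> bytes,
/// `gateways` maps CID -> bytes.
pub struct ToyClient {
    pub master: HashMap<String, Vec<u8>>,
    pub master_up: bool,
    pub gateways: HashMap<String, Vec<u8>>,
}

impl ToyClient {
    fn get_from_master(&self, key: &str) -> Result<Vec<u8>, FetchError> {
        if !self.master_up {
            return Err(FetchError::MasterUnreachable);
        }
        self.master.get(key).cloned().ok_or(FetchError::NotFound)
    }

    /// Shared inner: the flag decides whether a 404 may trigger recovery.
    fn get_known_cid_inner(
        &self,
        key: &str,
        cid: &str,
        recover_on_not_found: bool,
    ) -> Result<Vec<u8>, FetchError> {
        let err = match self.get_from_master(key) {
            Ok(bytes) => return Ok(bytes),
            Err(err) => err,
        };
        let eligible = match err {
            FetchError::MasterUnreachable => true,
            FetchError::NotFound => recover_on_not_found,
        };
        if eligible {
            if let Some(bytes) = self.gateways.get(cid) {
                if err == FetchError::NotFound {
                    // Stand-in for the warn! that keeps index rot visible.
                    eprintln!("warn: recovered orphaned object {} via cid {}", key, cid);
                }
                return Ok(bytes.clone());
            }
        }
        // Recovery not allowed or not successful: the original error propagates.
        Err(err)
    }

    /// Generic method: a 404 from a reachable master always propagates.
    pub fn get_object_with_offline_fallback_known_cid(
        &self,
        key: &str,
        cid: &str,
    ) -> Result<Vec<u8>, FetchError> {
        self.get_known_cid_inner(key, cid, false)
    }

    /// Forest-scoped method: a 404 also triggers the CID-based recovery.
    pub fn get_forest_object_known_cid(
        &self,
        key: &str,
        cid: &str,
    ) -> Result<Vec<u8>, FetchError> {
        self.get_known_cid_inner(key, cid, true)
    }
}
```

Keeping the flag on a private inner means no existing caller of the generic method can opt into 404 recovery by accident; only code that calls the forest-scoped entry point gets it.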

Why it's safe

Recovery races the gateway pool for the manifest-supplied CID; fetch_verified content-verifies bytes against that CID, then the node store AEAD-decrypts and recomputes the storage-key + page-id/seq. So recovery can only ever return the exact block the freshly-decrypted, authoritative manifest points at — the CID is the capability. On gateway-race failure the original 404 propagates. Concept + diff reviewed by independent advisors (Gemini, Cursor, Copilot [source-grounded]; Codex description-only).
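
A minimal sketch of the "CID is the capability" argument, under loud assumptions: `toy_cid` is a 64-bit FNV-1a digest used only so the example has no dependencies (a real CID carries a multihash, typically SHA-256), the gateway responses are checked sequentially rather than raced, and `fetch_verified_sketch` and `recover_or_propagate` are hypothetical names, not the project's `fetch_verified`. The AEAD-decrypt and storage-key recomputation steps are not modelled.

```rust
/// Stand-in content address: hex of a 64-bit FNV-1a digest.
/// NOT a real IPFS CID; it only illustrates "address derived from bytes".
pub fn toy_cid(bytes: &[u8]) -> String {
    let mut h: u64 = 0xcbf29ce484222325;
    for b in bytes {
        h ^= *b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    format!("{:016x}", h)
}

/// Accept the first gateway response whose bytes hash to the expected
/// address; missing (None) or mismatching responses are skipped.
pub fn fetch_verified_sketch(
    responses: &[Option<Vec<u8>>],
    expected_cid: &str,
) -> Option<Vec<u8>> {
    responses
        .iter()
        .flatten()
        .find(|bytes| toy_cid(bytes) == expected_cid)
        .cloned()
}

/// If no gateway yields verified bytes, hand back the original error
/// (for example the master's 404) unchanged.
pub fn recover_or_propagate<E>(
    responses: &[Option<Vec<u8>>],
    expected_cid: &str,
    original_err: E,
) -> Result<Vec<u8>, E> {
    fetch_verified_sketch(responses, expected_cid).ok_or(original_err)
}
```

Because acceptance depends only on the bytes matching the manifest-supplied address, a gateway that returns different content cannot substitute another block; at worst the caller sees the original 404.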

Verification

Unit: s3_backend_get_with_cid_hint_recovers_orphaned_node_on_master_404 (new — fails before the fix, passes after). 4/4 in walkable_v8_offline_walk.rs, 208/208 lib tests pass (incl. the propagate-404 security test).
E2E (live videos bucket, master up, default public gateways = the FxFiles config): the two orphaned objects that abort the wasm walk — page Qm94e8de… and node __fula_forest_v7_nodes/a096c036… — recover via the gateway race and the bucket lists fully (33 files). Harness not committed (uses real credentials).

Scope / rollout

Native only (cfg(not(target_arch = "wasm32"))). The wasm get_with_cid_hint degrades to plain get() (no gateway pool on web), so pinning-webui is NOT fixed by this and needs separate work (it also lists via the HEAD-per-object path, not the forest walk).
Inert unless the shipped app config has gateway_fallback_enabled = true (FxFiles already does).
The page-caller's page_ref.cid wiring is pre-existing and covered by the E2E + the prior recover_walk evidence (855 files reconstructed via manifest CID hints), not a new unit test; the recovery method itself is unit-tested via the node caller.
Recovered nodes are not re-uploaded on the read path (deliberate); the 404 persists and is re-raced per read until the next forest write re-pins the node.
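
The two rollout gates above (native-only build, config flag) can be pictured with the following sketch. `ToyConfig` and `recovery_active` are hypothetical names for illustration; only the `cfg(not(target_arch = "wasm32"))` condition and the `gateway_fallback_enabled` field name come from the description.

```rust
/// Hypothetical slice of the app config; only the field name is from the PR.
pub struct ToyConfig {
    pub gateway_fallback_enabled: bool,
}

/// Native builds: recovery is possible, but only when the flag is set.
#[cfg(not(target_arch = "wasm32"))]
pub fn recovery_active(cfg: &ToyConfig) -> bool {
    cfg.gateway_fallback_enabled
}

/// wasm32 builds: no gateway pool, so recovery is never active.
#[cfg(target_arch = "wasm32")]
pub fn recovery_active(_cfg: &ToyConfig) -> bool {
    false
}
```

On a native target, `recovery_active(&ToyConfig { gateway_fallback_enabled: false })` is false, which matches the "inert unless enabled" note.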

The online forest walk aborted when a walkable-v8 forest node or manifest page returned 404 NoSuchKey from a reachable master -- the case where a server-side `ipfs repo gc` destroyed the gateway storage-key->CID index entry while the block still exists in IPFS by CID. Offline mode already recovered these via the verified gateway race; online did not, because `get_object_with_offline_fallback_known_cid` engaged the race only on `is_master_unreachable_error`, never on a 404.

Add a forest-scoped `get_forest_object_known_cid` (private inner + `recover_on_not_found` bool) that also races the manifest-supplied CID on a not-found, and route the two forest callers (`S3BlobBackend::get_with_cid_hint`, `load_manifest_pages`) through it. The generic method keeps its strict propagate-404 invariant (and its security test).

Fetched bytes are content-verified against the CID (fetch_verified) + AEAD-decrypted + storage-key-recomputed, so recovery can only return the exact block the manifest points at. A WARN surfaces each recovery so gateway-index rot is visible. Native-only; the recovered node is not re-uploaded on read.

Verified end-to-end against the live `videos` bucket: the two orphaned objects (one page, one node) that abort the wasm walk now recover and the bucket lists fully (33 files).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

ehsan6sha mentioned this pull request Jun 4, 2026

fula-client: online forest walk aborts on gc-orphaned nodes (recover via manifest CID-hint on 404) #24

Closed


a71f508

ehsan6sha merged commit 80c6398 into main Jun 4, 2026
15 of 16 checks passed

ehsan6sha deleted the fix/walkable-v8-recover-orphaned-node-on-404 branch June 4, 2026 15:30


ehsan6sha merged 2 commits into main from fix/walkable-v8-recover-orphaned-node-on-404
