Skip to content

feat(rivetkit): sqlite vfs v2#4673

Draft
NathanFlurry wants to merge 1 commit into04-15-chore_rename_sandbox_-_kitchen_sinkfrom
feat/sqlite-vfs-v2
Draft

feat(rivetkit): sqlite vfs v2#4673
NathanFlurry wants to merge 1 commit into04-15-chore_rename_sandbox_-_kitchen_sinkfrom
feat/sqlite-vfs-v2

Conversation

@NathanFlurry
Copy link
Copy Markdown
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@railway-app
Copy link
Copy Markdown

railway-app Bot commented Apr 16, 2026

🚅 Deployed to the rivet-pr-4673 environment in rivet-frontend

Service Status Web Updated (UTC)
kitchen-sink ❌ Build Failed (View Logs) Web Apr 24, 2026 at 8:12 am
frontend-inspector ❌ Build Failed (View Logs) Web Apr 24, 2026 at 7:17 am
frontend-cloud ❌ Build Failed (View Logs) Web Apr 24, 2026 at 7:16 am
website ❌ Build Failed (View Logs) Web Apr 24, 2026 at 7:16 am
mcp-hub ✅ Success (View Logs) Web Apr 16, 2026 at 4:08 pm
ladle ❌ Build Failed (View Logs) Web Apr 16, 2026 at 9:18 am

Copy link
Copy Markdown
Member Author

NathanFlurry commented Apr 16, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 16, 2026

Preview packages published to npm

Install with:

npm install rivetkit@pr-4673

All packages published as 0.0.0-pr.4673.33edc3b with tag pr-4673.

Engine binary is shipped via @rivetkit/engine-cli on linux-x64-musl, linux-arm64-musl, darwin-x64, and darwin-arm64. Windows users should use the release installer or set RIVET_ENGINE_BINARY.

Docker images:

docker pull rivetdev/engine:slim-33edc3b
docker pull rivetdev/engine:full-33edc3b
Individual packages
npm install rivetkit@pr-4673
npm install @rivetkit/react@pr-4673
npm install @rivetkit/rivetkit-native@pr-4673
npm install @rivetkit/workflow-engine@pr-4673

@claude
Copy link
Copy Markdown

claude Bot commented Apr 17, 2026

SQLite VFS v2 — Code Review

Note: This is a draft PR — reviewing as an updated implementation check.

Overview

This PR introduces SQLite VFS v2: a new custom Virtual File System layer that stores SQLite data in Rivet's distributed key-value store (UniversalDB). It replaces the v1 journal approach with a structured LTX (Log-Structured Transaction) format, adding:

  • A new sqlite-storage Rust package (~7,500 lines) with delta storage, shard compaction, page index caching, takeover/fencing, and quota tracking
  • A new envoy protocol version (v2.bare) with 5 new SQLite message types
  • Rust and TypeScript envoy client/server integration for all 5 commands
  • New sqlite_runtime.rs in pegboard-envoy that ties the engine to actor lifecycle

The architecture is well thought out. The generation-fencing mechanism, the two-path commit (fast-path vs. staged), and the PIDX cache are clean. The test coverage for concurrency, compaction, and latency is solid for a draft. A few specific issues worth addressing before merge:


Bugs / Correctness

1. SqliteStorageQuotaExceeded is not a typed error (commit.rs:278, commit.rs:524)

Both the fast-path and staged-commit quota checks use a raw bail! with an inline string rather than a structured SqliteStorageError variant. The error.rs enum already covers CommitTooLarge, FenceMismatch, etc. Callers in the envoy handler cannot pattern-match on quota errors without string parsing. Add a QuotaExceeded { used: u64, max: u64 } variant to the enum.

2. Dead variable _shard_txids in compaction (compaction/shard.rs:136)

let _shard_txids = shard_rows.iter().map(|row| row.txid).collect::<BTreeSet<_>>(); is collected but never read. If it was intended for a future validation step, document the intent; otherwise remove it.

3. Orphan count conflates chunks and txids (takeover.rs:210)

orphan_count = mutations.len() counts individual mutation operations, including one deletion per delta chunk. A single orphaned txid with a multi-chunk staged delta counts as multiple orphans in the metric. Consider counting distinct orphaned txids instead for a more accurate metric.


Design Concerns

4. Global static SQLITE_ENGINE silently captures first UDB instance (sqlite_runtime.rs:11)

OnceCell::get_or_try_init captures the db from the first StandaloneCtx that calls shared_engine. Subsequent calls with a different context silently use the first context's DB. This is fine in production (single UDB per process), but will silently use the wrong DB in any test or integration scenario that creates multiple engine instances. Document the single-instance contract explicitly, or assert that the DB matches on subsequent calls.

5. encode_db_head_with_usage convergence loop has no termination guard (quota.rs:36)

The loop converges in practice (serde_bare encoding grows monotonically to a fixed point), but has no explicit iteration limit. If a serialization anomaly produced a non-monotone size it would loop forever. A simple assertion or bail after N iterations would make the invariant explicit.

6. Sequential source fetches in preload_pages (takeover.rs:244)

Preload sources are fetched one at a time in a for-loop. For actors preloading pages from multiple different shards or deltas, all fetches could be issued concurrently. This is on the hot path for actor startup latency.

7. Linear page scan inside decoded LTX blob (takeover.rs:283)

After decoding a shard or delta blob, finding a specific page uses pages.iter().find(|page| page.pgno == pgno), which is O(n) per lookup. If a shard contains 64 pages and 32 of them are requested, that totals O(n^2/2) comparisons. Converting to a HashMap<u32, Vec<u8>> after decode would make each lookup O(1).


Code Quality

8. Duplicated PIDX decode constants and functions (takeover.rs:22, compaction/shard.rs:19)

PIDX_PGNO_BYTES, PIDX_TXID_BYTES, decode_pidx_pgno, and decode_pidx_txid are defined identically in both modules. These should be consolidated into a shared location such as keys.rs.

9. Unbounded compaction channel (engine.rs:41, compaction/mod.rs)

Under high commit rates, actors continuously enqueue compaction signals. The CompactionCoordinator deduplicates concurrent workers per actor, but the channel itself is unbounded. A bounded channel with a non-blocking try_send and drop-on-full would provide backpressure and is appropriate here since compaction signals are idempotent.


Test Issues

10. unsafe { std::env::set_var(...) } in multi-threaded test (tests/latency.rs:49)

Calling std::env::set_var in a multi-threaded context is unsound (data race under concurrent reads). Rust 1.81 marked it unsafe for exactly this reason. Even wrapped in unsafe {}, this is a real data race if other tests read env vars concurrently. Consider passing the latency value directly to the DB driver, or initializing it via a process-level std::sync::LazyLock before threads spawn.

11. tempdir()?.keep() leaks test directories (tests/concurrency.rs:18, tests/latency.rs:18)

tempdir()?.keep() prevents cleanup on drop. After each test run the temp directory is never removed, which accumulates in CI. Consider using tempfile::tempdir() without keep() so the directory is auto-deleted when the DB handle is closed at the end of the test.


Minor Notes

  • Quota check ordering (commit.rs:278): The quota check fires after writes are staged in the transaction (rolled back on failure, so correct). Checking before staging writes would make the flow clearer and avoid unnecessary serialization work.
  • DEFAULT_REAP_INTERVAL: 100ms (compaction/mod.rs:20): Reaping finished workers every 100ms slowly accumulates finished entries between ticks. Reaping immediately inside spawn_worker_if_needed when the handle is already finished would keep the map tighter.
  • Protocol version sync: The new v2.bare schema correctly adds a versioned file rather than modifying the existing one. Verify that PROTOCOL_MK2_VERSION in engine/packages/runner-protocol/src/lib.rs and PROTOCOL_VERSION in rivetkit-typescript/packages/engine-runner/src/mod.ts are bumped together per CLAUDE.md convention.

Overall, the design is solid and the core correctness guarantees (generation fencing, PIDX consistency, staged commit atomicity) look well-reasoned. The issues above are mostly polish and performance; the quota error typing (#1) and the unsafe env var in tests (#10) are the most important to address before merge.

@NathanFlurry NathanFlurry mentioned this pull request Apr 18, 2026
11 tasks
@NathanFlurry NathanFlurry force-pushed the 04-15-chore_rename_sandbox_-_kitchen_sink branch from 83bef30 to 1bc8cd6 Compare April 24, 2026 07:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant