Skip to content

refactor(depot): remove sqlite cold storage#5180

Open
NathanFlurry wants to merge 1 commit into
mainfrom
stack/refactor-depot-remove-sqlite-cold-storage-oxsukmwr
Open

refactor(depot): remove sqlite cold storage#5180
NathanFlurry wants to merge 1 commit into
mainfrom
stack/refactor-depot-remove-sqlite-cold-storage-oxsukmwr

Conversation

@NathanFlurry
Copy link
Copy Markdown
Member

Deletes OSS S3 cold-tier code and rewires Depot compaction/read/reclaim to remain FDB/hot-only.

Removes workflow_cold_storage config/schema and AWS S3 dependencies so old cold-storage config fails as an unknown field. Retains the unstable SQLite commit-size-cap opt-out in the same config/schema artifact.

Deletes OSS S3 cold-tier code and rewires Depot compaction/read/reclaim to remain FDB/hot-only.

Removes workflow_cold_storage config/schema and AWS S3 dependencies so old cold-storage config fails as an unknown field. Retains the unstable SQLite commit-size-cap opt-out in the same config/schema artifact.
@NathanFlurry
Copy link
Copy Markdown
Member Author

Stack for rivet-dev/rivet

Check out this stack: jj-stack get https://github.com/rivet-dev/rivet/pull/5180
Pull/update this stack: jj-stack sync
Publish local edits: jj-stack submit
Merge when ready: jj-stack merge

@railway-app
Copy link
Copy Markdown

railway-app Bot commented Jun 5, 2026

🚅 Deployed to the rivet-pr-5180 environment in rivet-frontend

Service Status Web Updated (UTC)
frontend-inspector ❌ Build Failed (View Logs) Web Jun 5, 2026 at 11:12 pm
mcp-hub ✅ Success (View Logs) Web Jun 5, 2026 at 11:12 pm
kitchen-sink ❌ Build Failed (View Logs) Web Jun 5, 2026 at 11:10 pm
frontend-cloud ❌ Build Failed (View Logs) Web Jun 5, 2026 at 11:09 pm
website ❌ Build Failed (View Logs) Web Jun 5, 2026 at 11:09 pm
ladle ❌ Build Failed (View Logs) Web Jun 5, 2026 at 11:09 pm

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Jun 5, 2026

Code Review: refactor(depot): remove sqlite cold storage

Overview

This PR strips the OSS S3 cold-tier from the depot package entirely, removing around 18,000 lines of code across the cold compacter workflow, cold read path, shard-cache fill workers, cold manifest types, and cold-tier S3/filesystem implementations. It rewires compaction and read to be FDB/hot-only, and promotes the previously TODO-commented commit size cap to an enforceable constraint with a config-gated opt-out.

This is a well-scoped cleanup PR. The removal is thorough and the new CLAUDE.md hard constraints clearly document what must not be reintroduced.


Correctness

CompactionRoot retains dead fields (cold_watermark_txid, cold_watermark_versionstamp)

CompactionRoot in conveyer/types/compaction.rs still carries cold_watermark_txid and cold_watermark_versionstamp. After this PR neither field is written by any compaction path (the write sites in db_cold_compacter were deleted), but they are still deserialized from FDB for existing rows. That is correct for backwards compatibility. It is worth confirming that nothing now acts on these watermarks: compaction/shared.rs::has_actionable_lag now only reads hot_lag, and burst_mode.rs now reads hot_watermark_txid rather than cold_watermark_txid. Both look correct.

_fallback_compaction_watermark_txid is an unused parameter

In conveyer/commit/dirty.rs::admit_deltas_available, the parameter is renamed to _fallback_compaction_watermark_txid to silence the warning, but the value is never consulted anywhere in the function. The call site in conveyer/commit/apply.rs still passes burst_signal.compaction_watermark_txid. Consider removing the parameter from the signature entirely and updating the call site to match, rather than leaving a dead _-prefixed argument. This is the kind of vestigial signature that confuses future readers.

Burst mode active is hardcoded to false

burst_mode.rs::signal_from_txids now returns active: false unconditionally, and adjusted_hot_quota_cap ignores the signal. The burst-mode path is dead. This is intentional since cold storage drove burst mode, but BurstSignal.active and BurstSignal.lag_txids now carry no meaning. If burst mode is permanently removed rather than just paused, consider simplifying or removing the struct and its callers.

SQLITE_SHARD_CACHE_RESIDENT_BYTES gauge has no write path

metrics.rs removes all the shard-cache fill and eviction metrics including the counter that decremented SQLITE_SHARD_CACHE_RESIDENT_BYTES, but the gauge itself is kept. Since the decrement site is gone, this gauge will never change after initialization (always 0). Either remove it or restore the write path.


Doctor / Migration

Cold FDB key detection removed from doctor.rs

The three cold-compact metadata key checks and two cold shard/retired-cold-object prefix scans are removed from detect_unsupported and load_storage_facts. For existing deployments that previously ran with cold storage enabled, these rows may still exist in FDB. The CLAUDE.md constraint says unsupported cold config must fail clearly, but that applies to the config shape, not to existing data. Consider whether a migration or a one-time doctor scan is needed to clean up orphaned cold FDB rows, or document explicitly that they are inert and will age out naturally.


Config / Breaking Change

workflow_cold_storage config becomes an unknown field error

The PR description calls this out: #[serde(deny_unknown_fields)] on Sqlite means old configs with workflow_cold_storage will now fail to parse at startup. The CLAUDE.md constraint reinforces this intent. Good. Just confirm this is communicated to operators before deploying.

New unstable_disable_commit_size_cap field is well-guarded

The field is Option<bool> with a default serde attribute and a non-panicking accessor. The naming makes the escape-hatch nature clear. Wiring through CommitOptions.disable_size_cap into validate_dirty_pages looks correct.


Test Coverage

conveyer_read.rs (2119 lines) deleted without clear replacement

The cold read path tests are rightfully deleted. However, conveyer_read.rs also covered hot-path read scenarios (FDB delta hits, shard fallback, zero-fill) that have nothing to do with cold storage. Worth confirming that equivalent hot-path read coverage exists in the retained test files (workflow_compaction_skeletons.rs, workflow_compaction_payloads.rs, conveyer_commit.rs). If those scenarios are not covered, they are now untested.

Inline workflows_compaction.rs (1250 lines) deleted

This file tested cold job planning and manager effects that no longer exist, so deletion is appropriate. The retained skeleton/payload tests should cover hot and reclaim paths adequately.

Burst-mode test is updated correctly

burst_signal_is_inactive_without_cold_storage now asserts active == false and lag_txids == 0. Assertions match the new behavior. Good.


Minor

  • The metric rename from sqlite_shard_cache_fill_skipped_queue_full_total to sqlite_conveyer_pidx_fdb_load_total is a semantic change. Ensure any dashboards or alerts referencing the old metric name are updated.
  • Commented-out registry.register_workflow calls in lib.rs::registry() remain for hot, manager, and reclaimer. These are presumably intentional (workflows exist but registration is deferred). A short inline comment explaining why they are commented out would help future readers distinguish "TODO: wire up" from "intentionally disabled".

Summary

Clean, thorough removal. Main items worth a follow-up:

  1. Dead _fallback_compaction_watermark_txid parameter - remove the parameter and its call-site argument.
  2. SQLITE_SHARD_CACHE_RESIDENT_BYTES with no write path - remove or restore the gauge.
  3. Orphaned cold FDB rows in existing deployments - confirm inert, add a migration note, or add a doctor scan.
  4. Hot-path read coverage - verify the non-cold scenarios from the deleted conveyer_read.rs are covered by retained tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant