LLM Benchmark Improvements + More Evals#4740
bradleyshep wants to merge 43 commits into master
Description of Changes
LLM benchmark infrastructure improvements and new benchmark tasks.
Runner & scoring:

- `generation_duration_ms` now times only the successful attempt, not retries plus sleep delays (see the sketch after this list)
- `--dry-run` flag to run benchmarks without saving results
- `:online` suffix handling in `oa_compat.rs`
- `ReducerCallBothScorer` for calling reducers on both the golden and LLM databases
- `max_tokens` set on the OpenRouter and Meta clients to prevent silent truncation
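A minimal sketch of scoping the timer to the successful attempt, so `generation_duration_ms` excludes failed attempts and backoff sleeps. The retry loop, `generate_once`, and the constants are illustrative assumptions rather than the actual runner code.

```rust
use std::time::{Duration, Instant};

const MAX_ATTEMPTS: u32 = 3;
const BACKOFF: Duration = Duration::from_secs(2);

/// Stand-in for a single model call; the real runner talks to an LLM client.
fn generate_once(prompt: &str) -> Result<String, String> {
    Ok(format!("completion for: {prompt}"))
}

/// Returns the completion plus the duration of the successful attempt only.
/// Failed attempts and the sleeps between retries are never counted.
fn generate_with_retries(prompt: &str) -> Result<(String, Duration), String> {
    let mut last_err = String::from("no attempts made");
    for attempt in 1..=MAX_ATTEMPTS {
        let started = Instant::now(); // timer starts fresh for each attempt
        match generate_once(prompt) {
            Ok(completion) => return Ok((completion, started.elapsed())),
            Err(err) => {
                last_err = err;
                if attempt < MAX_ATTEMPTS {
                    std::thread::sleep(BACKOFF); // backoff happens outside the timed span
                }
            }
        }
    }
    Err(last_err)
}

fn main() {
    if let Ok((_completion, duration)) = generate_with_retries("t_001") {
        println!("generation_duration_ms = {}", duration.as_millis());
    }
}
```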
Model routing:

- `ModelRoute` with display name, vendor, API model, and OpenRouter model ID (rough shape sketched after this list)
- `--models vendor:model` accepted without static registration
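A rough sketch of a dynamically parsed route. The struct fields mirror the four pieces of information listed above, but the field names, `parse_model_arg`, and the derived display name are assumptions for illustration.

```rust
/// Illustrative shape of a model route; field names are assumptions
/// based on the description above, not the actual definition.
#[derive(Debug, Clone)]
struct ModelRoute {
    display_name: String,
    vendor: String,
    api_model: String,
    openrouter_model_id: String,
}

/// Turn a `--models vendor:model` argument into a route on the fly,
/// so new models do not need to be pre-registered in a static table.
fn parse_model_arg(arg: &str) -> Option<ModelRoute> {
    let (vendor, model) = arg.split_once(':')?;
    Some(ModelRoute {
        display_name: format!("{vendor}/{model}"),
        vendor: vendor.to_string(),
        api_model: model.to_string(),
        openrouter_model_id: format!("{vendor}/{model}"),
    })
}

fn main() {
    let route = parse_model_arg("openai:gpt-5-mini").expect("expected vendor:model");
    println!("{route:?}");
}
```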
Context modes:

- `guidelines`, `cursor_rules`, `search`, and `no_context` modes with an `is_empty_context_mode()` helper (sketched after this list)
- Renamed `none`/`no_guidelines` → `no_context`
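A small sketch of the modes and the helper, assuming `ContextMode` is an enum and that `is_empty_context_mode()` simply reports whether a mode contributes any context; the exact types in the tool may differ.

```rust
/// The four context modes named above; the exact enum shape is an assumption.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ContextMode {
    Guidelines,
    CursorRules,
    Search,
    NoContext, // replaces the old `none` / `no_guidelines` names
}

impl ContextMode {
    /// True when the mode supplies no extra context to the model, letting the
    /// runner skip building a context payload for that combination.
    fn is_empty_context_mode(self) -> bool {
        matches!(self, ContextMode::NoContext)
    }
}

fn main() {
    assert!(ContextMode::NoContext.is_empty_context_mode());
    assert!(!ContextMode::Guidelines.is_empty_context_mode());
}
```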
CI workflows:

- `llm-benchmark-periodic.yml` for scheduled nightly runs with per-language failure tracking
- `OPENROUTER_API_KEY`, `LLM_BENCHMARK_UPLOAD_URL`, and `LLM_BENCHMARK_API_KEY` added as GitHub secrets
- `llm-benchmark-validate-goldens.yml` for validating that golden answers still compile
Results & summary:

- `cmd_status` to show incomplete benchmark combinations with rerun commands
- `cmd_analyze` for LLM-powered failure analysis
- `normalize_details_file` extracted from `write_summary_from_details_file`
- Run timestamps (`started_at`/`finished_at`) and token usage recorded in the summary (rough record shape sketched after this list)
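A rough sketch of a summary record carrying the new timing and token-usage fields; the struct name, field types, and timestamp format are assumptions for illustration only.

```rust
/// Hypothetical per-run summary record; only `started_at`, `finished_at`,
/// and the token-usage fields are taken from the description above.
#[derive(Debug)]
struct RunSummary {
    started_at: String, // e.g. an RFC 3339 timestamp
    finished_at: String,
    prompt_tokens: u64,
    completion_tokens: u64,
    generation_duration_ms: u64,
}

fn main() {
    let summary = RunSummary {
        started_at: "2025-01-01T00:00:00Z".to_string(),
        finished_at: "2025-01-01T00:00:42Z".to_string(),
        prompt_tokens: 1_200,
        completion_tokens: 350,
        generation_duration_ms: 8_400,
    };
    println!("{summary:?}");
}
```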
New benchmark tasks:

API and ABI breaking changes
None. Internal tooling only.
Expected complexity level and risk
2 — Changes are scoped to the LLM benchmark CLI tool (`xtask-llm-benchmark`) and CI workflows. No impact on SpacetimeDB core.

Testing
- `cargo check -p xtask-llm-benchmark` — zero errors, zero warnings
- `llm_benchmark run --lang typescript --modes no_context --tasks t_001 --models openai:gpt-5-mini --dry-run` — ran end-to-end, confirmed no results were saved to disk