
LLM Benchmark Improvements + More Evals#4740

Open
bradleyshep wants to merge 43 commits into master from bradley/llm-benchmarks-improvements
Conversation

@bradleyshep
Contributor

Description of Changes

LLM benchmark infrastructure improvements and new benchmark tasks.

Runner & scoring:

  • Add retry logic with backoff for LLM API calls (rate limits, 502/503/504, timeouts)
  • Fix generation_duration_ms to time only the successful attempt, excluding failed retries and backoff sleeps
  • Add --dry-run flag to run benchmarks without saving results
  • Add OpenRouter client as unified fallback when direct vendor keys aren't set
  • Add web search mode via OpenRouter :online suffix
  • Extract shared OpenAI-compatible response types into oa_compat.rs
  • Add ReducerCallBothScorer for calling reducers on both golden and LLM databases
  • Set max_tokens on OpenRouter and Meta clients to prevent silent truncation
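The retry behavior described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the helper names (`is_retryable`, `backoff_delay`) and the exact backoff schedule are assumptions; only the retryable conditions (rate limits, 502/503/504, timeouts) come from the description.

```rust
use std::time::Duration;

/// Decide whether a failed LLM API call is worth retrying.
/// (Illustrative only; the PR's real error handling lives in xtask-llm-benchmark.)
fn is_retryable(status: Option<u16>, timed_out: bool) -> bool {
    // Rate limits (429), gateway errors (502/503/504), and timeouts are transient.
    timed_out || matches!(status, Some(429) | Some(502) | Some(503) | Some(504))
}

/// Exponential backoff delay for a 0-based attempt index, capped at 30s.
fn backoff_delay(attempt: u32) -> Duration {
    let secs = 2u64.saturating_pow(attempt).min(30);
    Duration::from_secs(secs)
}
```

Note that to keep `generation_duration_ms` honest, the clock should start after the last backoff sleep, around the final successful attempt only, rather than around the whole retry loop.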

Model routing:

  • Add ModelRoute with display name, vendor, API model, and OpenRouter model ID
  • Support ad-hoc model IDs via --models vendor:model without static registration
  • Add model name normalization (OpenRouter IDs, case variants → canonical display names)
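A rough sketch of the routing shape described above, assuming field names based on the bullet points (the real `ModelRoute` and any parsing helper may differ):

```rust
/// A model route: display name, vendor, API model, and OpenRouter model ID.
/// (Field names assumed from the PR description.)
#[derive(Debug, PartialEq)]
struct ModelRoute {
    display_name: String,
    vendor: String,
    api_model: String,
    openrouter_id: Option<String>,
}

/// Parse an ad-hoc `vendor:model` spec (as passed to --models) without
/// requiring static registration. Hypothetical helper, not the PR's exact code.
fn parse_adhoc_model(spec: &str) -> Option<ModelRoute> {
    let (vendor, model) = spec.split_once(':')?;
    if vendor.is_empty() || model.is_empty() {
        return None;
    }
    Some(ModelRoute {
        display_name: model.to_string(),
        vendor: vendor.to_string(),
        api_model: model.to_string(),
        // OpenRouter IDs conventionally use a vendor/model path.
        openrouter_id: Some(format!("{vendor}/{model}")),
    })
}
```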

Context modes:

  • Add guidelines, cursor_rules, search, no_context modes with is_empty_context_mode() helper
  • Add mode-specific prompt preambles
  • Consolidate mode alias normalization (none/no_guidelines → no_context)
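The alias normalization and the `is_empty_context_mode()` helper might look roughly like this (mode names come from the PR description; the exact alias set and signatures are assumptions):

```rust
/// Map context-mode aliases to their canonical names.
/// "none" and "no_guidelines" both normalize to "no_context".
fn normalize_mode(mode: &str) -> String {
    let m = mode.to_ascii_lowercase();
    match m.as_str() {
        "none" | "no_guidelines" => "no_context".to_string(),
        other => other.to_string(),
    }
}

/// True for modes that supply no context to the model.
fn is_empty_context_mode(mode: &str) -> bool {
    normalize_mode(mode) == "no_context"
}
```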

CI workflows:

  • Add llm-benchmark-periodic.yml for scheduled nightly runs with per-language failure tracking
    • Note: The periodic workflow requires OPENROUTER_API_KEY, LLM_BENCHMARK_UPLOAD_URL, and LLM_BENCHMARK_API_KEY as GitHub secrets.
  • Add llm-benchmark-validate-goldens.yml for validating golden answers still compile

Results & summary:

  • Add cmd_status to show incomplete benchmark combinations with rerun commands
  • Add cmd_analyze for LLM-powered failure analysis
  • Split normalize_details_file from write_summary_from_details_file
  • Derive task categories from filesystem for summary generation
  • Add timestamp tracking (started_at/finished_at) and token usage
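The timestamp and token-usage bookkeeping could be represented by a record like the one below. This is a sketch only; the field names beyond `started_at`/`finished_at` and the derived helpers are assumptions for illustration.

```rust
/// Per-run bookkeeping: wall-clock timestamps plus token usage.
/// (started_at/finished_at are named in the PR; other fields are assumed.)
struct RunRecord {
    started_at: u64,  // unix epoch millis
    finished_at: u64, // unix epoch millis
    prompt_tokens: u32,
    completion_tokens: u32,
}

impl RunRecord {
    /// Wall-clock duration of the run in milliseconds.
    fn duration_ms(&self) -> u64 {
        self.finished_at.saturating_sub(self.started_at)
    }

    /// Total tokens billed for the run.
    fn total_tokens(&self) -> u32 {
        self.prompt_tokens + self.completion_tokens
    }
}
```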

New benchmark tasks:

  • 30 new tasks across auth, data_modeling, queries, basics, and schema categories
  • Updated/fixed existing task prompts and golden answers

API and ABI breaking changes

None. Internal tooling only.

Expected complexity level and risk

2 — Changes are scoped to the LLM benchmark CLI tool (xtask-llm-benchmark) and CI workflows. No impact on SpacetimeDB core.

Testing

  • cargo check -p xtask-llm-benchmark — zero errors, zero warnings
  • Dry run: llm_benchmark run --lang typescript --modes no_context --tasks t_001 --models openai:gpt-5-mini --dry-run — ran end-to-end, confirmed no results saved to disk
  • Verify periodic workflow runs successfully on next scheduled trigger

@bradleyshep bradleyshep changed the title LLM Benchmark Improvements LLM Benchmark Improvements + More Evals Apr 1, 2026
