feat: machine-readable eval report for LLM optimization agents #112
Open
StephaneWamba wants to merge 4 commits into CodSpeedHQ:master from
Conversation
The result files written by each run were previously only used for CI uploads. This uses them locally too: on the second run, pytest-codspeed finds the most recent prior `.codspeed/results_*.json` and prints a short regression/improvement summary to the terminal. The comparison is skipped when `--codspeed-profile-folder` is set or in non-walltime modes. Implements the TODO left in `plugin.py`. Tests live in `test_comparison.py` (unit) and `test_comparison_integration.py` (pytester end-to-end).
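The "most recent prior results file" lookup can be sketched as below. This is a hedged illustration, not the plugin's actual implementation: the helper name `find_previous_results` and the mtime-based ordering are assumptions based on the description above.

```python
# Hypothetical sketch of the local-comparison lookup: pick the newest
# results_*.json in the .codspeed folder, or None on a first run.
from __future__ import annotations

import json
from pathlib import Path


def find_previous_results(results_dir: Path) -> dict | None:
    """Return the parsed contents of the most recent results_*.json, if any."""
    candidates = sorted(
        results_dir.glob("results_*.json"),
        key=lambda p: p.stat().st_mtime,  # newest file last
    )
    if not candidates:
        return None  # first run: nothing to compare against
    return json.loads(candidates[-1].read_text())
```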
Adds a `--codspeed-capture-output` flag. When set, each walltime run hashes the return value of the benchmarked function (pickle + sha256, with a repr fallback) and stores it in the result JSON alongside `mean_ns`. On the second run, the local comparison checks whether the hash changed. If it did, the report flags the benchmark with "! output changed" and counts correctness warnings in the footer. This closes the gap in optimization loops where a suggestion improves perf but silently alters the function's output. The score formula is exposed in `eval_harness.py`: score = perf_gain if output correct, 0 if broken, nan if capture was not enabled.
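The pickle + sha256 scheme with a repr fallback can be sketched as follows. The function name `hash_output` is illustrative; the plugin's internal helper may differ.

```python
# Sketch of the output-hashing scheme described above: try pickle for a
# structural hash, fall back to repr for unpicklable values.
import hashlib
import pickle


def hash_output(value: object) -> str:
    """Stable sha256 hex digest of a benchmark's return value."""
    try:
        payload = pickle.dumps(value)
    except Exception:
        # Unpicklable objects (lambdas, open files, ...) fall back to repr
        payload = repr(value).encode()
    return hashlib.sha256(payload).hexdigest()
```

Equal return values hash equal, so comparing the stored digest across runs detects a silently changed output without storing the output itself.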
- Move `_make_result`/`_bench` helpers to `conftest.py` -- single source of truth
- Add `result.assert_outcomes(passed=1)` to all 9 integration tests so a broken feature cannot hide behind a passing stdout check
- Fix `test_no_comparison_with_profile_folder`: `CODSPEED_PROFILE_FOLDER` is an env var, not a CLI flag; use `monkeypatch.setenv` instead
- Ruff TC003/I001 fixes (`Path` in `TYPE_CHECKING` block, import sort)
- `EvalReport.aggregate_score`: conservative min-score across all benchmarks; 0.0 if any correctness broken, nan if any unknown
- `EvalReport.is_acceptable`: single bool for binary accept/reject
- `EvalReport.to_dict()`: JSON-serializable dict (nan -> null)
- `--codspeed-eval-report=PATH`: write the eval report as JSON after each comparison run, so automated agents need no Python API knowledge
- `examples/optimize_loop.py`: runnable demo of the full baseline -> patch -> rerun -> read verdict loop
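The `EvalReport` semantics listed above can be sketched as a pair of dataclasses. This is a minimal model of the described behavior, assuming field names from the PR text; the real class layout may differ.

```python
# Hedged model of EvalReport: min-score aggregation, correctness veto,
# nan for unknown, and nan -> null in the JSON-serializable dict.
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass
class BenchmarkEval:
    name: str
    perf_gain: float
    output_changed: bool | None  # None when --codspeed-capture-output was off

    @property
    def score(self) -> float:
        if self.output_changed is None:
            return math.nan  # correctness unknown
        return 0.0 if self.output_changed else self.perf_gain


@dataclass
class EvalReport:
    benchmarks: list[BenchmarkEval]

    @property
    def aggregate_score(self) -> float:
        if any(b.output_changed for b in self.benchmarks):
            return 0.0  # any correctness break vetoes the whole suggestion
        scores = [b.score for b in self.benchmarks]
        if any(math.isnan(s) for s in scores):
            return math.nan  # any unknown poisons the aggregate
        return min(scores)  # conservative: worst benchmark wins

    @property
    def is_acceptable(self) -> bool:
        s = self.aggregate_score
        return not math.isnan(s) and s > 0.0

    def to_dict(self) -> dict:
        s = self.aggregate_score
        return {
            "aggregate_score": None if math.isnan(s) else s,
            "is_acceptable": self.is_acceptable,
            "benchmarks": [
                {
                    "name": b.name,
                    "perf_gain": b.perf_gain,
                    "output_changed": b.output_changed,
                    "score": None if math.isnan(b.score) else b.score,
                }
                for b in self.benchmarks
            ],
        }
```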
This builds on #111 (output hash / correctness check) to close the loop for automated optimization agents.
The problem with #111 alone: the agent gets human-readable terminal output, but still has to parse unstructured text and know the Python API to make a binary accept/reject decision.
What this adds:
`--codspeed-eval-report=eval.json` -- write a machine-readable JSON file after each comparison run:

```json
{
  "aggregate_score": 0.33,
  "is_acceptable": true,
  "benchmarks": [
    {
      "name": "tests/test_sort.py::test_sort",
      "perf_gain": 0.33,
      "output_changed": false,
      "score": 0.33
    }
  ]
}
```

The agent reads this file and makes a decision. No Python knowledge needed; it works from any language or shell script.
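A consumer is plain JSON reading, no pytest-codspeed import required. A hypothetical agent-side `verdict` helper (the function name and exit-code convention are illustrative) might look like:

```python
# Read the eval report written by --codspeed-eval-report and turn it into
# a per-benchmark summary plus an exit-code style accept/reject.
import json


def verdict(report_path: str) -> int:
    """Return 0 to accept the patch, 1 to reject it."""
    with open(report_path) as f:
        report = json.load(f)
    for bench in report["benchmarks"]:
        flag = "! output changed" if bench["output_changed"] else "ok"
        print(f"{bench['name']}: score={bench['score']} ({flag})")
    return 0 if report["is_acceptable"] else 1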
Scoring logic (also in #111):
- `aggregate_score` is the minimum score across all benchmarks -- conservative by design: one correctness failure vetoes the whole suggestion
- `score = perf_gain` when output is correct, `0.0` when broken, `null` when unknown (no `--codspeed-capture-output`)
- `is_acceptable = aggregate_score > 0.0 and not nan`

Demo:
`examples/optimize_loop.py` shows the full loop: baseline run -> apply patch -> rerun with `--codspeed-eval-report` -> read JSON -> print verdict. Runs standalone with `python examples/optimize_loop.py`.

Tests added:
- `test_eval_report_written_after_second_run` -- file is created, has the right keys
- `test_eval_report_acceptable_when_output_stable` -- `output_changed: false` for unchanged output
- `test_eval_report_not_acceptable_when_output_breaks` -- `aggregate_score: 0.0`, `is_acceptable: false`
- `test_eval_report_not_written_on_first_run` -- no baseline means no file
- `aggregate_score`, `is_acceptable`, `to_dict` including nan -> null serialization

Depends on #111.
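The baseline -> patch -> rerun -> read verdict loop described above can be sketched as follows. This is a hedged outline of what `examples/optimize_loop.py` does, not its actual source: the `run_bench`, `read_verdict`, and `optimize_loop` names are assumptions, while the `--codspeed`, `--codspeed-capture-output`, and `--codspeed-eval-report` flags come from this PR and #111.

```python
# Outline of an agent's optimization loop: two pytest runs, then a
# single JSON read for the accept/reject decision.
import json
import subprocess
import sys
from pathlib import Path


def run_bench(extra_args: list) -> None:
    """Invoke pytest-codspeed; a failing run still leaves result files."""
    subprocess.run(
        [sys.executable, "-m", "pytest", "--codspeed", *extra_args],
        check=False,
    )


def read_verdict(report_path: str = "eval.json") -> bool:
    """Binary accept/reject from the machine-readable eval report."""
    report = json.loads(Path(report_path).read_text())
    return report["is_acceptable"]


def optimize_loop() -> None:
    run_bench(["--codspeed-capture-output"])  # 1. baseline run
    # 2. (the agent applies its candidate patch here)
    run_bench([
        "--codspeed-capture-output",          # hash outputs for correctness
        "--codspeed-eval-report=eval.json",
    ])                                        # 3. rerun: comparison + report
    print("ACCEPT" if read_verdict() else "REJECT")  # 4. read verdict
```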