Add perf run metadata, variance/confidence reporting, and historical regression checks #35
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3902714609
Pull request overview
This PR enhances ExcelBench’s performance benchmarking outputs by recording richer run context (machine/environment metadata), surfacing variance/confidence cues in human- and machine-readable reports, and adding a lightweight historical regression indicator based on prior runs.
Changes:
- Add `PerfRunEnvironment` and include captured CPU/core/memory info in `PerfMetadata` and `results.json`.
- Update markdown rendering to show run environment and include a confidence note plus `p50/p95` display.
- Extend CSV output with variance/confidence and regression-status columns, backed by history-based comparisons, and document a canonical comparison procedure.
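For reviewers who want a concrete picture of the run-environment capture, here is a minimal sketch. The field names `cpu_model`, `core_count`, and `memory_total_mb` come from the PR description; the function name `collect_run_environment_sketch` and the way each value is obtained (standard library only, POSIX `sysconf` for memory) are assumptions for illustration, not the code in `runner.py`.

```python
import os
import platform
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class PerfRunEnvironment:
    cpu_model: str
    core_count: int
    memory_total_mb: Optional[int]  # None when the platform does not expose it


def collect_run_environment_sketch() -> PerfRunEnvironment:
    """Hypothetical collector; the PR's _collect_run_environment() may differ."""
    # platform.processor() can be an empty string on some systems,
    # so fall back to the machine architecture.
    cpu_model = platform.processor() or platform.machine() or "unknown"
    core_count = os.cpu_count() or 1

    memory_total_mb: Optional[int] = None
    try:
        # POSIX only: total physical memory = page size * number of pages.
        page_size = os.sysconf("SC_PAGE_SIZE")
        page_count = os.sysconf("SC_PHYS_PAGES")
        if page_size > 0 and page_count > 0:
            memory_total_mb = (page_size * page_count) // (1024 * 1024)
    except (AttributeError, ValueError, OSError):
        # os.sysconf is missing on Windows; unknown names raise ValueError.
        pass

    return PerfRunEnvironment(cpu_model, core_count, memory_total_mb)


if __name__ == "__main__":
    # asdict() yields a plain dict that can be embedded in a JSON document.
    print(asdict(collect_run_environment_sketch()))
```

Keeping `memory_total_mb` optional means the metadata stays serializable on platforms where the value cannot be read, at the cost of consumers having to handle a missing value.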
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| `src/excelbench/perf/runner.py` | Adds run-environment collection and serializes it into perf metadata JSON. |
| `src/excelbench/perf/renderer.py` | Updates markdown/CSV rendering, adds CV + regression-status helpers, appends perf history. |
| `tests/test_perf_cli.py` | Updates CLI/render assertions to expect new metadata and markdown cues. |
| `tests/test_perf_data_shape.py` | Adds coverage that CSV output includes new regression/confidence fields. |
| `tests/test_perf_workloads.py` | Adds coverage that runtime perf metadata includes run environment. |
| `docs/trackers/performance-benchmark-runs.md` | Documents the canonical procedure for comparing perf runs and interpreting new columns. |
Motivation
Description
- Add a `PerfRunEnvironment` dataclass and collect the runtime environment with `_collect_run_environment()` to record `cpu_model`, `core_count`, and `memory_total_mb`, then include it on `PerfMetadata` and in the serialized `results.json` via `perf_results_to_json_dict()` (changes in `src/excelbench/perf/runner.py`).
- Update `src/excelbench/perf/renderer.py` to display the run environment, add a short `Confidence note`, and render `p50/p95` in summary tables via `_fmt_p50_p95_ms()` for variance visibility.
- Extend the CSV output with new columns (`read_cv`, `write_cv`, `confidence_note`, `regression_status`) and add the helper functions `_cv()` and `_regression_status()`. The latter compares the current `read_p50` to a historical baseline drawn from `results/perf/history.jsonl` (median of recent samples, last 5 with a minimum of 3) and reports `stable`, `regressed`, `improved`, or a lightweight status code when history is missing or insufficient.
- Append perf history to `results/perf/history.jsonl` (`p50_wall_ms` per run) and use that file for non-flaky median-based comparisons instead of per-run statistical tests.
- Update tests in `tests/test_perf_cli.py` (assert `run_environment` in metadata and markdown cues), `tests/test_perf_data_shape.py` (CSV regression/confidence output shape), and `tests/test_perf_workloads.py` (metadata capture at runtime).
- Update `docs/trackers/performance-benchmark-runs.md` with a canonical run-comparison procedure describing stable knobs, metadata checks, CSV confidence columns, and history-based regression gates.

Testing
- Ran `python -m ruff check src/excelbench/perf/runner.py src/excelbench/perf/renderer.py tests/test_perf_cli.py tests/test_perf_data_shape.py tests/test_perf_workloads.py` and fixed reported issues; the linter check passed.
- Ran `python -m pytest -o addopts='' tests/test_perf_cli.py tests/test_perf_data_shape.py tests/test_perf_workloads.py`, but test collection failed in this environment due to a missing external dependency (`openpyxl`), so the new/updated tests could not be fully executed here.
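The coefficient-of-variation and history-based regression check from the Description can be sketched roughly as follows. The window of 5 samples and the minimum of 3 come from the description. Everything else is an assumption made for illustration: the 10% threshold, the `no_history` and `insufficient_history` status codes, the `workload` record key, and the function names. The actual `_cv()` and `_regression_status()` helpers in `renderer.py` may be written differently.

```python
import json
import statistics
from pathlib import Path
from typing import List, Optional, Sequence

WINDOW = 5        # from the PR description: use the last 5 samples
MIN_SAMPLES = 3   # from the PR description: require at least 3 samples
THRESHOLD = 0.10  # assumption: +/-10% band around the baseline counts as stable


def cv(samples: Sequence[float]) -> Optional[float]:
    """Coefficient of variation (sample stdev / mean), or None if undefined."""
    if len(samples) < 2:
        return None
    mean = statistics.fmean(samples)
    if mean == 0:
        return None
    return statistics.stdev(samples) / mean


def load_p50_history(path: Path, workload: str) -> List[float]:
    """Read p50_wall_ms values for one workload from a JSON Lines file.

    Assumed record shape: {"workload": "...", "p50_wall_ms": 12.3}.
    """
    if not path.exists():
        return []
    values: List[float] = []
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            if record.get("workload") == workload and "p50_wall_ms" in record:
                values.append(float(record["p50_wall_ms"]))
    return values


def regression_status(current_p50: float, history: Sequence[float]) -> str:
    """Compare the current p50 with the median of the most recent samples."""
    if not history:
        return "no_history"            # hypothetical status code
    recent = list(history)[-WINDOW:]
    if len(recent) < MIN_SAMPLES:
        return "insufficient_history"  # hypothetical status code
    baseline = statistics.median(recent)
    if baseline <= 0:
        return "insufficient_history"
    ratio = current_p50 / baseline
    if ratio > 1 + THRESHOLD:
        return "regressed"
    if ratio < 1 - THRESHOLD:
        return "improved"
    return "stable"
```

Using a median over a short window makes the baseline insensitive to a single noisy run, which fits the stated goal of non-flaky comparisons without per-run statistical tests.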