Add perf run metadata, variance/confidence reporting, and historical regression checks by wolfiesch · Pull Request #35 · SynthGL/ExcelBench

wolfiesch · 2026-05-08T08:30:24Z

Motivation

Capture richer run context so benchmark outputs can be compared meaningfully across machines and runs, including CPU model, core count, total memory and Python version.
Surface variance and confidence guidance in human- and machine-readable outputs to help spot noisy measurements.
Add a lightweight, non-flaky baseline regression check using historical run data so regressions can be triaged without brittle statistical tests.
Document the canonical run-comparison procedure for maintainers and CI operators.

Description

Add a PerfRunEnvironment dataclass and collect the runtime environment with _collect_run_environment(), recording cpu_model, core_count, and memory_total_mb. Include it on PerfMetadata and in the serialized results.json via perf_results_to_json_dict() (changes in src/excelbench/perf/runner.py).
Extend markdown rendering in src/excelbench/perf/renderer.py to display run environment, add a short Confidence note, and render p50/p95 in summary tables via _fmt_p50_p95_ms() for variance visibility.
Extend CSV output with variance/confidence fields (read_cv, write_cv, confidence_note, regression_status) and add helper functions _cv() and _regression_status(). The latter compares the current read_p50 to a historical baseline drawn from results/perf/history.jsonl (the median of the last 5 samples, requiring at least 3) and reports stable, regressed, or improved, or a lightweight status code when history is missing or insufficient.
Keep history append behavior (writing p50_wall_ms per run) and use that file for non-flaky median-based comparisons instead of per-run statistical tests.
Add tests to validate the new metadata and renderer shapes: updates to tests/test_perf_cli.py (assert run_environment in metadata and markdown cues), tests/test_perf_data_shape.py (CSV regression/confidence output shape), and tests/test_perf_workloads.py (metadata capture at runtime).
Update docs/trackers/performance-benchmark-runs.md with a canonical run-comparison procedure describing stable knobs, metadata checks, CSV confidence columns, and history-based regression gates.
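
The median-based baseline check and the coefficient of variation described above can be sketched as follows. This is a minimal illustration, not the code in renderer.py: the function names cv and regression_status, the 10% tolerance band, and the insufficient_history label are assumptions made for the example. Only the window of 5 samples, the minimum of 3, and the stable/regressed/improved statuses come from the description.

```python
from statistics import mean, median, pstdev
from typing import Optional, Sequence

WINDOW = 5        # "last 5" samples, per the PR description
MIN_SAMPLES = 3   # "minimum of 3", per the PR description
TOLERANCE = 0.10  # assumption: a +/-10% band around the baseline counts as stable


def cv(samples: Sequence[float]) -> Optional[float]:
    """Coefficient of variation (population stddev / mean), or None if undefined."""
    if len(samples) < 2:
        return None
    m = mean(samples)
    if m == 0:
        return None
    return pstdev(samples) / m


def regression_status(current_p50: float, history_p50: Sequence[float]) -> str:
    """Compare the current p50 to the median of the most recent history samples."""
    recent = list(history_p50)[-WINDOW:]
    if len(recent) < MIN_SAMPLES:
        return "insufficient_history"  # hypothetical label for missing/short history
    baseline = median(recent)
    if baseline <= 0:
        return "insufficient_history"
    ratio = current_p50 / baseline
    if ratio > 1 + TOLERANCE:
        return "regressed"
    if ratio < 1 - TOLERANCE:
        return "improved"
    return "stable"
```

Using a median over a short window means a single noisy historical run cannot move the baseline much, which is the non-flaky property the description asks for.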

Testing

Ran the linter with python -m ruff check src/excelbench/perf/runner.py src/excelbench/perf/renderer.py tests/test_perf_cli.py tests/test_perf_data_shape.py tests/test_perf_workloads.py and fixed reported issues; the linter check passed.
Attempted to run the unit tests with python -m pytest -o addopts='' tests/test_perf_cli.py tests/test_perf_data_shape.py tests/test_perf_workloads.py, but test collection failed in this environment due to a missing external dependency (openpyxl), so the new/updated tests could not be fully executed here.

Codex Task

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3902714609

Copilot

Pull request overview

This PR enhances ExcelBench’s performance benchmarking outputs by recording richer run context (machine/environment metadata), surfacing variance/confidence cues in human- and machine-readable reports, and adding a lightweight historical regression indicator based on prior runs.

Changes:

Add PerfRunEnvironment and include captured CPU/core/memory info in PerfMetadata and results.json.
Update markdown rendering to show run environment and include a confidence note plus p50/p95 display.
Extend CSV output with variance/confidence and regression-status columns, backed by history-based comparisons, and document a canonical comparison procedure.
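
The first change in the list above, capturing CPU, core, and memory information, might look roughly like the sketch below. It uses only the standard library and assumes a best-effort approach where any field that cannot be determined is left as None. The class name RunEnvironment and the helper names are hypothetical stand-ins; the PR's actual PerfRunEnvironment and _collect_run_environment() may be implemented differently.

```python
import os
import platform
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class RunEnvironment:  # hypothetical stand-in for PerfRunEnvironment
    cpu_model: str
    core_count: Optional[int]
    memory_total_mb: Optional[int]


def _total_memory_mb() -> Optional[int]:
    """Best-effort total RAM via POSIX sysconf; None where it is unavailable."""
    try:
        page_size = os.sysconf("SC_PAGE_SIZE")
        pages = os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        # os.sysconf is missing on Windows, and names vary across platforms.
        return None
    if page_size <= 0 or pages <= 0:
        return None
    return (page_size * pages) // (1024 * 1024)


def collect_run_environment() -> RunEnvironment:
    """Gather machine context so results from different hosts can be told apart."""
    return RunEnvironment(
        cpu_model=platform.processor() or platform.machine() or "unknown",
        core_count=os.cpu_count(),
        memory_total_mb=_total_memory_mb(),
    )


env = collect_run_environment()
# asdict() yields a plain dict that can be embedded in a JSON metadata payload.
metadata = {"run_environment": asdict(env)}
```

Keeping every field optional or defaulted avoids failing a benchmark run just because one platform does not expose a value.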

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 9 comments.

File	Description
`src/excelbench/perf/runner.py`	Adds run-environment collection and serializes it into perf metadata JSON.
`src/excelbench/perf/renderer.py`	Updates markdown/CSV rendering, adds CV + regression-status helpers, appends perf history.
`tests/test_perf_cli.py`	Updates CLI/render assertions to expect new metadata and markdown cues.
`tests/test_perf_data_shape.py`	Adds coverage that CSV output includes new regression/confidence fields.
`tests/test_perf_workloads.py`	Adds coverage that runtime perf metadata includes run environment.
`docs/trackers/performance-benchmark-runs.md`	Documents the canonical procedure for comparing perf runs and interpreting new columns.
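
The history handling summarized for renderer.py in the table above can be illustrated with a small JSONL append-and-read sketch. Only the p50_wall_ms field name comes from the PR description. The workload key, the function names, and the record shape are assumptions for illustration, and the real history.jsonl schema may differ.

```python
import json
from pathlib import Path
from typing import List


def append_history(path: Path, workload: str, p50_wall_ms: float) -> None:
    """Append one run's p50 as a single JSON line (hypothetical record shape)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    record = {"workload": workload, "p50_wall_ms": p50_wall_ms}
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def recent_p50(path: Path, workload: str, limit: int = 5) -> List[float]:
    """Return up to `limit` most recent p50 values for a workload, oldest first."""
    if not path.exists():
        return []
    values: List[float] = []
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip a corrupt line rather than fail the whole report
            if not isinstance(record, dict):
                continue
            value = record.get("p50_wall_ms")
            if record.get("workload") == workload and isinstance(value, (int, float)):
                values.append(float(value))
    return values[-limit:]
```

Passing the history path in explicitly, rather than hard-coding it, lets the caller resolve it relative to the chosen output directory.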

Add perf metadata, variance/confidence output, and baseline regressio…

3902714

…n checks

Copilot AI review requested due to automatic review settings May 8, 2026 08:30

wolfiesch added the codex label May 8, 2026 — with ChatGPT Codex Connector

Copilot started reviewing on behalf of wolfiesch May 8, 2026 08:31

chatgpt-codex-connector Bot reviewed May 8, 2026

Comment thread src/excelbench/perf/renderer.py Outdated

Copilot AI reviewed May 8, 2026

wolfiesch added 4 commits May 8, 2026 01:37

fix(perf): satisfy strict metadata typing

ff65b1b

fix(perf): read regression history from output path

5aff01c

fix(perf): clarify tail ratio regression output

ed9c082

fix(perf): render csv header from columns

dd8ad61

wolfiesch merged commit 0785aeb into master May 8, 2026
5 checks passed

wolfiesch deleted the codex/2026-05-08-08-30-21-add-performance-metadata-capture branch May 8, 2026 20:18

wolfiesch merged 5 commits into master from codex/2026-05-08-08-30-21-add-performance-metadata-capture