LCORE-2802: Add OKP RAG quality regression benchmarks and baseline comparison by alessandralanz · Pull Request #265 · lightspeed-core/lightspeed-evaluation

alessandralanz · 2026-06-26T21:22:32Z

Description

Adds an evaluation framework for OKP RAG quality regression testing against the lightspeed-stack. Includes:

- 97 OKP RAG benchmark conversations covering single-turn knowledge, multi-turn retention, edge cases, and negative/off-topic queries
- Baseline comparison script (compare_against_baseline) with --check-only mode for CI gating
- System config for PR-gate evaluation using gpt-4o-mini as the judge LLM
- Version-controlled baseline snapshot (102 evaluations)
- A/B/C comparison script for comparing multiple evaluation runs

Type of change

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

Assisted-by: Claude (Claude Code CLI)
Generated by: N/A

Related Tickets & Documents

Related Issue #
Closes # LCORE-2802

Checklist before requesting a review

I have performed a self-review of my code.
PR has passed all pre-merge test jobs.
If it is a core feature, I have added thorough tests.

Testing

Please provide detailed steps to perform tests related to this code change.
How were the fix/results from this change verified? Please provide relevant screenshots or results.
- make pre-commit passes all checks (pylint, pyright, ruff, black, etc.)
- GitLab CI pipeline runs 97 conversations producing 506 evaluations against a live lightspeed-stack instance with OKP enabled
- Baseline comparison script validates current results against stored baseline

Summary by CodeRabbit

New Features
- Added regression comparison tooling to check a current evaluation run against a baseline and highlight score or pass-rate changes.
- Added a new evaluation gate configuration for LCORE regression runs, including output, logging, and metric settings.
- Added a stored baseline summary for regression checks.
Tests
- Added coverage for summary loading, regression classification, threshold handling, and markdown report output.

coderabbitai · 2026-06-26T21:22:44Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0ecb76a1-7d1a-4779-b824-73ece4f99dc1

📥 Commits

Reviewing files that changed from the base of the PR and between e7ab2ae and db2d156.

📒 Files selected for processing (1)

baselines/lcore_regression/current_baseline_summary.json

💤 Files with no reviewable changes (1)

baselines/lcore_regression/current_baseline_summary.json

Walkthrough

Adds a new LCORE regression PR-gate evaluation configuration, a CLI script comparing evaluation run summaries against a baseline to detect metric regressions with critical/warning thresholds, supporting test fixtures and tests, and a baseline JSON snapshot.

Changes

Regression gate workflow

Layer / File(s)	Summary
Gate config and package doc `config/lcore_regression/system-config-pr-gate.yaml`, `script/regression/__init__.py`	New PR-gate YAML config defines execution limits, judge/embedding LLM settings, API connection, RAGAS context metrics with thresholds, file-based storage/CSV schema, visualization defaults, and logging; a package docstring is added for the regression scripts module.
Baseline loading and deltas `script/regression/compare_against_baseline.py`, `tests/script/conftest.py`, `tests/script/test_compare_against_baseline.py`	Implements CLI arg parsing, `find_and_load_summary` to locate/load a single *_summary.json, and `compute_metric_deltas` to classify PASS/WARN/FAIL by critical/non-critical thresholds; adds `make_summary`/`write_summary` test helpers, updates fixture metric identifiers to RAGAS context metrics, and adds tests for loading and delta classification.
Baseline reporting and entrypoint `script/regression/compare_against_baseline.py`, `tests/script/test_compare_against_baseline.py`, `baselines/lcore_regression/current_baseline_summary.json`	Implements `generate_markdown_summary` for report formatting, `main()` control flow with check-only mode, terminal table output, optional markdown file output, and critical-regression exit code; adds markdown output tests and a baseline JSON snapshot fixture.
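
For readers who have not opened the diff, the PASS/WARN/FAIL classification summarized above could look roughly like the sketch below. It is not the code from `compare_against_baseline.py`: the metric identifier, both threshold values, and the `MetricDelta` and `classify` names are assumptions made for illustration. Only the `CRITICAL_METRICS` name and the three statuses come from this PR.

```python
from dataclasses import dataclass

# Hypothetical metric identifier and thresholds, for illustration only.
CRITICAL_METRICS = {"ragas:context_recall"}
CRITICAL_DROP = 0.05      # assumed tolerated score drop for critical metrics
NON_CRITICAL_DROP = 0.10  # assumed tolerated score drop for other metrics


@dataclass
class MetricDelta:
    """One row of the comparison: scores, their difference, and a status."""

    metric: str
    baseline: float
    current: float
    delta: float
    status: str  # "PASS", "WARN", or "FAIL"


def classify(metric: str, baseline: float, current: float) -> MetricDelta:
    """Classify a score change against the threshold for its metric class."""
    delta = current - baseline
    is_critical = metric in CRITICAL_METRICS
    limit = CRITICAL_DROP if is_critical else NON_CRITICAL_DROP
    if delta >= -limit:
        status = "PASS"  # within tolerance, or an improvement
    elif is_critical:
        status = "FAIL"  # critical metric dropped past its threshold
    else:
        status = "WARN"  # non-critical metric dropped past its threshold
    return MetricDelta(metric, baseline, current, delta, status)
```

Under these assumed thresholds, a critical metric falling from 0.80 to 0.70 is a FAIL, while the same drop on a non-critical metric still passes.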

Estimated code review effort: 3 (Moderate) | ~25 minutes

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI as compare_against_baseline.main
    participant Loader as find_and_load_summary
    participant Delta as compute_metric_deltas
    participant Report as generate_markdown_summary

    User->>CLI: run with --baseline, --current args
    CLI->>Loader: load baseline summary JSON
    Loader-->>CLI: baseline_data
    CLI->>Loader: load current summary JSON
    Loader-->>CLI: current_data
    CLI->>Delta: compute_metric_deltas(baseline_data, current_data)
    Delta-->>CLI: metric deltas with PASS/WARN/FAIL status
    alt --check-only
        CLI-->>User: print "regression" or "ok"
    else full output
        CLI-->>User: print terminal table
        opt --output provided
            CLI->>Report: generate_markdown_summary(deltas)
            Report-->>CLI: markdown report
            CLI-->>User: write markdown file
        end
    end
    CLI-->>User: exit code (non-zero if critical FAIL and --fail-on-critical-regression)
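
The control flow in the diagram can be sketched with `argparse` as below. The flag names are the ones shown in the diagram; the helper names `build_parser` and `exit_code`, and the choice of exit status 1, are assumptions rather than the script's actual code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the flags named in the sequence diagram."""
    parser = argparse.ArgumentParser(description="Compare a run against a baseline")
    parser.add_argument("--baseline", required=True)
    parser.add_argument("--current", required=True)
    parser.add_argument("--check-only", action="store_true")
    parser.add_argument("--output")
    parser.add_argument("--fail-on-critical-regression", action="store_true")
    return parser


def exit_code(args: argparse.Namespace, statuses: list[str]) -> int:
    """Print the check-only token if requested and return the process exit code."""
    has_critical = "FAIL" in statuses
    if args.check_only:
        # Check-only mode emits a single token so CI can parse stdout directly.
        print("regression" if has_critical else "ok")
    # Assumed convention: non-zero only when the gate flag is set.
    return 1 if has_critical and args.fail_on_critical_regression else 0
```

A CI job would then call something like `sys.exit(exit_code(args, statuses))` after computing the per-metric statuses.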

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title matches the main change: adding OKP RAG regression evaluation support and baseline comparison tooling.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@script/regression/compare_abc_runs.py`:
- Around line 89-110: The verdict logic in compare_abc_runs currently ignores
total cumulative critical regressions unless pr_deltas is None, so a split A→B
and B→C failure can still pass. Update the gate in the main decision block to
consider total_has_critical alongside the existing pr_has_critical and
okp_has_critical checks, using the same verdict flow in the compare_abc_runs
function so any critical total regression returns a failing result when
intended.
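
One way to read this finding is sketched below. The three flag names come from the comment; the function name, the verdict strings, and the decision to treat an OKP-only regression as a warning are assumptions, not the behavior of `compare_abc_runs.py`.

```python
def verdict(pr_has_critical: bool, okp_has_critical: bool, total_has_critical: bool) -> str:
    """Return a gate verdict that also honors the cumulative A-to-C comparison."""
    if pr_has_critical:
        return "FAIL_PR"      # regression attributed to PR code changes
    if total_has_critical:
        # Covers the split case: neither pairwise step is critical on its own,
        # but the cumulative change is.
        return "FAIL_TOTAL"
    if okp_has_critical:
        return "WARN_OKP"     # assumed: OKP data changes alone do not block the PR
    return "PASS"
```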

In `@script/regression/compare_against_baseline.py`:
- Around line 141-157: The status calculation in compare_against_baseline should
not default to PASS when a baseline metric is present but the current run is
missing it. Update the logic around score_delta/status in the comparison flow to
treat “present in baseline, missing in current” as a degraded outcome, and apply
the same handling for pass_rate_delta where relevant. Use the existing
compare_against_baseline metric handling and CRITICAL_METRICS thresholding so
missing current values are reported as WARN or FAIL instead of PASS.
- Around line 232-252: The compare script’s check-only mode is emitting the
baseline/current summary prints before the args.check_only branch, so stdout
contains more than the required single token. In compare_against_baseline.py,
update the control flow around compute_metric_deltas and the check-only handling
so the summary prints are skipped when args.check_only is set, leaving only the
final ok/regression output from the check_only path.
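
A minimal sketch of the missing-metric handling requested above, assuming a score of `None` means the metric is absent from that run. The function name, the metric identifier, and the default threshold are hypothetical; only `CRITICAL_METRICS` is named in the comment.

```python
from typing import Optional

CRITICAL_METRICS = {"ragas:context_recall"}  # hypothetical identifier


def status_for(
    metric: str,
    baseline: Optional[float],
    current: Optional[float],
    max_drop: float = 0.05,  # assumed threshold
) -> str:
    """Classify a metric, treating 'in baseline, missing in current' as degraded."""
    degraded = "FAIL" if metric in CRITICAL_METRICS else "WARN"
    if baseline is None:
        return "PASS"  # assumption: a new metric has nothing to regress against
    if current is None:
        return degraded  # never default to PASS when the current value is gone
    return "PASS" if current - baseline >= -max_drop else degraded
```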

In `@tests/script/test_compare_against_baseline.py`:
- Around line 48-51: The missing-directory test in
test_raises_on_missing_directory is nondeterministic because it hardcodes a /tmp
path that may exist on some machines. Update the test to use pytest’s tmp_path
fixture and construct a guaranteed-missing subpath (for example via tmp_path
with a non-created child) before calling find_and_load_summary, so the
FileNotFoundError assertion always exercises the intended path.
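
The deterministic test suggested above could look like the sketch below. The `find_and_load_summary` defined here is a stand-in written for this example, with an assumed signature and assumed behavior, so the block runs on its own; the real function lives in `script/regression/compare_against_baseline.py`.

```python
import json
from pathlib import Path


def find_and_load_summary(directory: Path) -> dict:
    """Stand-in loader: expects exactly one *_summary.json in the directory."""
    if not directory.is_dir():
        raise FileNotFoundError(f"No such directory: {directory}")
    matches = sorted(directory.glob("*_summary.json"))
    if len(matches) != 1:
        raise FileNotFoundError(f"Expected one summary file, found {len(matches)}")
    return json.loads(matches[0].read_text())


def test_raises_on_missing_directory(tmp_path: Path) -> None:
    """tmp_path is unique per test, so an uncreated child is guaranteed missing."""
    missing = tmp_path / "does_not_exist"
    # pytest.raises(FileNotFoundError) is the usual spelling; plain try/except
    # keeps this sketch free of third-party imports.
    try:
        find_and_load_summary(missing)
    except FileNotFoundError:
        return
    raise AssertionError("expected FileNotFoundError")
```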

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e9f84c03-82ee-4691-8e2f-cb77a7f724fc

📥 Commits

Reviewing files that changed from the base of the PR and between 6ff47a6 and 5d59325.

📒 Files selected for processing (9)

baselines/lcore_regression/current_baseline_summary.json
config/lcore_regression/system-config-pr-gate.yaml
eval_data/lcore_regression/okp_rag_quality.yaml
script/regression/__init__.py
script/regression/compare_abc_runs.py
script/regression/compare_against_baseline.py
tests/script/conftest.py
tests/script/test_compare_abc_runs.py
tests/script/test_compare_against_baseline.py


xmican10 · 2026-07-01T07:07:50Z

Thanks!! I came across your PR, and I'm not sure if it's ready yet, but I'm wondering a few things:

What is the source of these QnAs? Is it our private GitLab repo?
Since OKP is a paid Red Hat capability, I'm wondering if keeping the QnAs and expected responses in a public repo is a risk. Even if the answers seem generic, they could reveal internal knowledge base structure, coverage, and quality characteristics. Should these live in a private repo instead?
The data files are huge, can't we move it away from the ls-eval tooling? Or create even more separation and move it all to a separate repo?

cc: @asamal4 @Anxhela21

alessandralanz · 2026-07-02T15:40:34Z

@xmican10 Thank you so much for the feedback! These are all great points, and I've addressed them just now with my latest push:

What is the source of these QnAs? Is it our private GitLab repo?

The queries and expected responses are from datasets provided by the OpenShift Installer and RHEL product groups (they are the only ones currently rated as gold) that can be found in Lightspeed Core's Evaluation Data GitLab repository. The evaluation data has been removed from this public repo and is now cloned at pipeline runtime from a forked version of the internal Evaluation Data GitLab repo.

Since OKP is a paid Red Hat capability, I'm wondering if keeping the QnAs and expected responses in a public repo is a risk. Even if the answers seem generic, they could reveal internal knowledge base structure, coverage, and quality characteristics. Should these live in a private repo instead?

I agree that keeping the full Q&A datasets in the public repo is a risk, so I have removed okp_rag_quality.yaml and have updated the pipeline so that it pulls the evaluation data from the internal source at runtime instead.

The data files are huge, can't we move it away from the ls-eval tooling? Or create even more separation and move it all to a separate repo?

The baseline summary has been stripped to 3KB and now only contains aggregate metric scores (pass rates, means, confidence intervals) with no per-conversation IDs, topic names, or individual scores. The comparison scripts only need these aggregate values. The large evaluation data files have been moved to the internal GitLab repo and are no longer part of this PR.
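
For illustration only, stripping a summary down to aggregates might look like the sketch below. The `by_metric` key and the three field names are hypothetical; the actual schema of `current_baseline_summary.json` is not shown in this thread.

```python
# Hypothetical field names; the real summary schema may differ.
AGGREGATE_KEYS = ("pass_rate", "mean", "confidence_interval")


def strip_to_aggregates(summary: dict) -> dict:
    """Keep only per-metric aggregate fields, dropping per-conversation detail."""
    return {
        metric: {key: stats[key] for key in AGGREGATE_KEYS if key in stats}
        for metric, stats in summary.get("by_metric", {}).items()
    }
```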

xmican10 · 2026-07-03T08:20:03Z

Thanks @alessandralanz for actively addressing my concerns!

Thinking a bit further into the future, keeping the regression tests within ls-eval itself might expand the scope of the tooling without adding direct value for the standard ls-eval user.
Would it be possible to move these regression tests out of ls-eval into a separate repository dedicated entirely to this type of testing?

Wdyt @asamal4 and @Anxhela21? I'd love to get your thoughts on this from an architectural perspective.

alessandralanz added 5 commits June 26, 2026 16:24

Add OKP RAG quality benchmarks and system config for LCORE regression…

ff73802

…. Checked against a live OKP image with 85% pass rate

Add baseline comparison script for regression gating

7e29a9a

Revised benchmarks to adhere to Evaluation Data Collection Standard +…

74768c9

… Guide's dataset sizing and distribution recommendations

Add version-controlled baseline for regression gating

5e1f772

Add three run A/B/C comparison script and regression tests. Attribute…

5d59325

…s quality regression to PR code changes vs OKP data changes using pairwise A/B/C comparisons and shared test helpers have been moved to conftest.py

coderabbitai Bot reviewed Jun 26, 2026

Comment thread script/regression/compare_abc_runs.py Outdated

Comment thread script/regression/compare_against_baseline.py

Comment thread script/regression/compare_against_baseline.py Outdated

Comment thread tests/script/test_compare_against_baseline.py Outdated

alessandralanz added 7 commits June 29, 2026 13:18

LCORE-2802: Update baseline from 102 to 506 evaluations

c03b462

Remove context-requiring metrics from turns where RAG returns no cont…

261f0b0

…exts

Update baseline to 491 evaluations after removing inapplicable contex…

f67c2d8

…t metrics

fix coderabbit review findings: missing-metric handling, check-only s…

7d4c4f2

…tdout, verdict logic, test determinism

switch to context-only RAGAS metrics, replace response-quality metric…

6f44c8b

…s with 4 context/retrieval metrics, and remove judge_panel config since response-quality judging no longer needed

Increase max_threads to 4 and disable skip_on_failure

e7ab2ae

update baseline bc we updated config

6ee6bb9

alessandralanz marked this pull request as draft July 1, 2026 13:10

alessandralanz added 3 commits July 1, 2026 10:28

remove OKP eval data from public repo and strip baseline to aggregates

c82599a

Merge branch 'lightspeed-core:main' into lcore-regression

b382b29

Remove A/B/C comparison script since weekly/PR gate jobs isolate vari…

db2d156

…ables independently

alessandralanz marked this pull request as ready for review July 2, 2026 15:42

alessandralanz wants to merge 15 commits into lightspeed-core:main from alessandralanz:lcore-regression
