LCORE-2802: Add OKP RAG quality regression benchmarks and baseline comparison#265

Open
alessandralanz wants to merge 15 commits into lightspeed-core:main from alessandralanz:lcore-regression

Conversation

@alessandralanz alessandralanz commented Jun 26, 2026

Description

Adds an evaluation framework for OKP RAG quality regression testing against the lightspeed-stack. Includes:

  • 97 OKP RAG benchmark conversations covering single-turn knowledge, multi-turn retention, edge cases, and
    negative/off-topic queries
  • Baseline comparison script (compare_against_baseline) with --check-only mode for CI gating
  • System config for PR-gate evaluation using gpt-4o-mini as the judge LLM
  • Version-controlled baseline snapshot (102 evaluations)
  • A/B/C comparison script for comparing multiple evaluation runs

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Unit tests improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Assisted-by: Claude (Claude Code CLI)
  • Generated by: N/A

Related Tickets & Documents

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.

  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

  • make pre-commit passes all checks (pylint, pyright, ruff, black, etc.)

    • GitLab CI pipeline runs 97 conversations producing 506 evaluations against a live lightspeed-stack instance with OKP enabled
    • Baseline comparison script validates current results against stored baseline

Summary by CodeRabbit

  • New Features

    • Added regression comparison tooling to check a current evaluation run against a baseline and highlight score or pass-rate changes.
    • Added a new evaluation gate configuration for LCORE regression runs, including output, logging, and metric settings.
    • Added a stored baseline summary for regression checks.
  • Tests

    • Added coverage for summary loading, regression classification, threshold handling, and markdown report output.

…. Checked against a live OKP image with 85% pass rate
… Guide's dataset sizing and distribution recommendations
…s quality regression to PR code changes vs OKP data changes using pairwise A/B/C comparisons and shared test helpers have been moved to conftest.py

coderabbitai Bot commented Jun 26, 2026

Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0ecb76a1-7d1a-4779-b824-73ece4f99dc1

📥 Commits

Reviewing files that changed from the base of the PR and between e7ab2ae and db2d156.

📒 Files selected for processing (1)
  • baselines/lcore_regression/current_baseline_summary.json
💤 Files with no reviewable changes (1)
  • baselines/lcore_regression/current_baseline_summary.json

Walkthrough

Adds a new LCORE regression PR-gate evaluation configuration, a CLI script comparing evaluation run summaries against a baseline to detect metric regressions with critical/warning thresholds, supporting test fixtures and tests, and a baseline JSON snapshot.

Changes

Regression gate workflow

| Layer / File(s) | Summary |
| --- | --- |
| Gate config and package doc: config/lcore_regression/system-config-pr-gate.yaml, script/regression/__init__.py | New PR-gate YAML config defines execution limits, judge/embedding LLM settings, API connection, RAGAS context metrics with thresholds, file-based storage/CSV schema, visualization defaults, and logging; a package docstring is added for the regression scripts module. |
| Baseline loading and deltas: script/regression/compare_against_baseline.py, tests/script/conftest.py, tests/script/test_compare_against_baseline.py | Implements CLI arg parsing, find_and_load_summary to locate/load a single *_summary.json, and compute_metric_deltas to classify PASS/WARN/FAIL by critical/non-critical thresholds; adds make_summary/write_summary test helpers, updates fixture metric identifiers to RAGAS context metrics, and adds tests for loading and delta classification. |
| Baseline reporting and entrypoint: script/regression/compare_against_baseline.py, tests/script/test_compare_against_baseline.py, baselines/lcore_regression/current_baseline_summary.json | Implements generate_markdown_summary for report formatting, main() control flow with check-only mode, terminal table output, optional markdown file output, and critical-regression exit code; adds markdown output tests and a baseline JSON snapshot fixture. |
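
The PASS/WARN/FAIL classification described in the table could look roughly like the sketch below. This is not the PR's code: the threshold values, the metric name in CRITICAL_METRICS, and the classify helper are assumptions made for illustration. Only the idea (a drop beyond a threshold fails for critical metrics and warns for the rest) comes from the walkthrough.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the PR's actual values are not shown in this thread.
CRITICAL_THRESHOLD = 0.05
WARNING_THRESHOLD = 0.02
# Hypothetical set of metrics treated as critical.
CRITICAL_METRICS = {"ragas:context_recall"}


@dataclass
class MetricDelta:
    metric: str
    baseline: float
    current: float
    delta: float
    status: str


def classify(metric: str, baseline: float, current: float) -> MetricDelta:
    """Classify a score change as PASS, WARN, or FAIL."""
    delta = current - baseline
    drop = -delta  # positive when the current score is worse than baseline
    if drop >= CRITICAL_THRESHOLD:
        # Large drops fail only for critical metrics; others are warnings.
        status = "FAIL" if metric in CRITICAL_METRICS else "WARN"
    elif drop >= WARNING_THRESHOLD:
        status = "WARN"
    else:
        status = "PASS"
    return MetricDelta(metric, baseline, current, delta, status)
```

Under these assumed thresholds, a critical metric falling from 0.90 to 0.80 would be a FAIL, while the same drop on a non-critical metric would only be a WARN.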

Estimated code review effort: 3 (Moderate) | ~25 minutes

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI as compare_against_baseline.main
    participant Loader as find_and_load_summary
    participant Delta as compute_metric_deltas
    participant Report as generate_markdown_summary

    User->>CLI: run with --baseline, --current args
    CLI->>Loader: load baseline summary JSON
    Loader-->>CLI: baseline_data
    CLI->>Loader: load current summary JSON
    Loader-->>CLI: current_data
    CLI->>Delta: compute_metric_deltas(baseline_data, current_data)
    Delta-->>CLI: metric deltas with PASS/WARN/FAIL status
    alt --check-only
        CLI-->>User: print "regression" or "ok"
    else full output
        CLI-->>User: print terminal table
        opt --output provided
            CLI->>Report: generate_markdown_summary(deltas)
            Report-->>CLI: markdown report
            CLI-->>User: write markdown file
        end
    end
    CLI-->>User: exit code (non-zero if critical FAIL and --fail-on-critical-regression)
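
The control flow in the diagram can be approximated with the sketch below. The flag names are the ones shown in the diagram; build_parser and decide are hypothetical helpers, and the real main() in compare_against_baseline.py may be structured differently.

```python
from __future__ import annotations

import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser with the flags named in the sequence diagram."""
    parser = argparse.ArgumentParser(description="Compare a run against a baseline")
    parser.add_argument("--baseline", required=True)
    parser.add_argument("--current", required=True)
    parser.add_argument("--output", default=None)
    parser.add_argument("--check-only", action="store_true")
    parser.add_argument("--fail-on-critical-regression", action="store_true")
    return parser


def decide(statuses: list[str], check_only: bool, fail_on_critical: bool) -> tuple[str, int]:
    """Return (stdout text, exit code) for a list of per-metric statuses."""
    has_critical = "FAIL" in statuses
    if check_only:
        # Check-only mode emits a single token, suitable for CI gating.
        stdout = "regression" if has_critical else "ok"
    else:
        # Stand-in for the terminal table: one status per line.
        stdout = "\n".join(statuses)
    # Non-zero exit only when a critical FAIL exists and the flag is set.
    exit_code = 1 if (has_critical and fail_on_critical) else 0
    return stdout, exit_code
```

Keeping the decision in a pure function like decide makes the check-only output easy to test without capturing stdout.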
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title matches the main change: adding OKP RAG regression evaluation support and baseline comparison tooling. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@script/regression/compare_abc_runs.py`:
- Around line 89-110: The verdict logic in compare_abc_runs currently ignores
total cumulative critical regressions unless pr_deltas is None, so a split A→B
and B→C failure can still pass. Update the gate in the main decision block to
consider total_has_critical alongside the existing pr_has_critical and
okp_has_critical checks, using the same verdict flow in the compare_abc_runs
function so any critical total regression returns a failing result when
intended.

In `@script/regression/compare_against_baseline.py`:
- Around line 141-157: The status calculation in compare_against_baseline should
not default to PASS when a baseline metric is present but the current run is
missing it. Update the logic around score_delta/status in the comparison flow to
treat “present in baseline, missing in current” as a degraded outcome, and apply
the same handling for pass_rate_delta where relevant. Use the existing
compare_against_baseline metric handling and CRITICAL_METRICS thresholding so
missing current values are reported as WARN or FAIL instead of PASS.
- Around line 232-252: The compare script’s check-only mode is emitting the
baseline/current summary prints before the args.check_only branch, so stdout
contains more than the required single token. In compare_against_baseline.py,
update the control flow around compute_metric_deltas and the check-only handling
so the summary prints are skipped when args.check_only is set, leaving only the
final ok/regression output from the check_only path.

In `@tests/script/test_compare_against_baseline.py`:
- Around line 48-51: The missing-directory test in
test_raises_on_missing_directory is nondeterministic because it hardcodes a /tmp
path that may exist on some machines. Update the test to use pytest’s tmp_path
fixture and construct a guaranteed-missing subpath (for example via tmp_path
with a non-created child) before calling find_and_load_summary, so the
FileNotFoundError assertion always exercises the intended path.
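
As an illustration of the second finding above (a metric present in the baseline but missing from the current run should not default to PASS), a minimal sketch follows. The helper name status_for_missing and the metric name in CRITICAL_METRICS are hypothetical; the review only asks that such cases surface as WARN or FAIL.

```python
from typing import Optional

# Hypothetical critical metric name, for illustration only.
CRITICAL_METRICS = {"ragas:context_recall"}


def status_for_missing(
    metric: str, baseline: Optional[float], current: Optional[float]
) -> Optional[str]:
    """Return a degraded status when the current run lacks a baseline metric.

    Returns None when this rule does not apply, so the caller can fall
    through to the normal delta-based classification.
    """
    if baseline is not None and current is None:
        return "FAIL" if metric in CRITICAL_METRICS else "WARN"
    return None
```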

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e9f84c03-82ee-4691-8e2f-cb77a7f724fc

📥 Commits

Reviewing files that changed from the base of the PR and between 6ff47a6 and 5d59325.

📒 Files selected for processing (9)
  • baselines/lcore_regression/current_baseline_summary.json
  • config/lcore_regression/system-config-pr-gate.yaml
  • eval_data/lcore_regression/okp_rag_quality.yaml
  • script/regression/__init__.py
  • script/regression/compare_abc_runs.py
  • script/regression/compare_against_baseline.py
  • tests/script/conftest.py
  • tests/script/test_compare_abc_runs.py
  • tests/script/test_compare_against_baseline.py

Comment thread script/regression/compare_abc_runs.py Outdated
Comment thread script/regression/compare_against_baseline.py
Comment thread script/regression/compare_against_baseline.py Outdated
Comment thread tests/script/test_compare_against_baseline.py Outdated

xmican10 commented Jul 1, 2026

Collaborator

Thanks!! I came across your PR, and I'm not sure if it's ready yet, but I'm wondering a few things:

  • What is the source of these QnAs? Is it our private GitLab repo?
  • Since OKP is a paid Red Hat capability, I'm wondering if keeping the QnAs and expected responses in a public repo is a risk. Even if the answers seem generic, they could reveal internal knowledge base structure, coverage, and quality characteristics. Should these live in a private repo instead?
  • The data files are huge, can't we move it away from the ls-eval tooling? Or create even more separation and move it all to a separate repo?

cc: @asamal4 @Anxhela21

@alessandralanz alessandralanz marked this pull request as draft July 1, 2026 13:10
@alessandralanz

Author

@xmican10 Thank you so much for the feedback! These are all great points, and I've addressed them just now with my latest push:

  • What is the source of these QnAs? Is it our private GitLab repo?

The queries and expected responses come from datasets provided by the OpenShift Installer and RHEL product groups (currently the only ones rated as gold), which can be found in Lightspeed Core's Evaluation Data GitLab repository. The evaluation data has been removed from this public repo and is now cloned at pipeline runtime from a fork of the internal Evaluation Data GitLab repo.

  • Since OKP is a paid Red Hat capability, I'm wondering if keeping the QnAs and expected responses in a public repo is a risk. Even if the answers seem generic, they could reveal internal knowledge base structure, coverage, and quality characteristics. Should these live in a private repo instead?

I agree that keeping the full Q&A datasets in the public repo is a risk, so I have removed okp_rag_quality.yaml and have updated the pipeline so that it pulls the evaluation data from the internal source at runtime instead.

  • The data files are huge, can't we move it away from the ls-eval tooling? Or create even more separation and move it all to a separate repo?

The baseline summary has been stripped to 3KB and now only contains aggregate metric scores (pass rates, means, confidence intervals) with no per-conversation IDs, topic names, or individual scores. The comparison scripts only need these aggregate values. The large evaluation data files have been moved to the internal GitLab repo and are no longer part of this PR.
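
One way to reduce a full summary to aggregate values only is sketched below. The key names ("by_metric", "pass_rate", "mean", "confidence_interval", "per_conversation") are assumptions made for illustration; the actual schema of current_baseline_summary.json is not shown in this thread.

```python
from __future__ import annotations

from typing import Any

# Hypothetical names of the aggregate fields to keep.
AGGREGATE_KEYS = ("pass_rate", "mean", "confidence_interval")


def strip_to_aggregates(summary: dict[str, Any]) -> dict[str, Any]:
    """Keep only aggregate fields per metric, dropping per-conversation detail."""
    return {
        metric: {key: stats[key] for key in AGGREGATE_KEYS if key in stats}
        for metric, stats in summary.get("by_metric", {}).items()
    }
```

Using an allow-list of keys, rather than deleting known sensitive fields, means that any new per-conversation field added later is dropped by default.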

@alessandralanz alessandralanz marked this pull request as ready for review July 2, 2026 15:42

xmican10 commented Jul 3, 2026

Collaborator

Thanks @alessandralanz for actively addressing my concerns!

Thinking a bit further into the future, keeping the regression tests within ls-eval itself might expand the scope of the tooling without adding direct value for the standard ls-eval user.
Would it be possible to move these regression tests out of ls-eval into a separate repository dedicated entirely to this type of testing?

Wdyt @asamal4 and @Anxhela21? I'd love to get your thoughts on this from an architectural perspective.
