LCORE-2802: Add OKP RAG quality regression benchmarks and baseline comparison#265
LCORE-2802: Add OKP RAG quality regression benchmarks and baseline comparison#265alessandralanz wants to merge 15 commits into
Conversation
…. Checked against a live OKP image with 85% pass rate
… Guide's dataset sizing and distribution recommendations
…s quality regression to PR code changes vs OKP data changes using pairwise A/B/C comparisons and shared test helpers have been moved to conftest.py
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
💤 Files with no reviewable changes (1)
WalkthroughAdds a new LCORE regression PR-gate evaluation configuration, a CLI script comparing evaluation run summaries against a baseline to detect metric regressions with critical/warning thresholds, supporting test fixtures and tests, and a baseline JSON snapshot. ChangesRegression gate workflow
Estimated code review effort: 3 (Moderate) | ~25 minutes Sequence Diagram(s)sequenceDiagram
participant User
participant CLI as compare_against_baseline.main
participant Loader as find_and_load_summary
participant Delta as compute_metric_deltas
participant Report as generate_markdown_summary
User->>CLI: run with --baseline, --current args
CLI->>Loader: load baseline summary JSON
Loader-->>CLI: baseline_data
CLI->>Loader: load current summary JSON
Loader-->>CLI: current_data
CLI->>Delta: compute_metric_deltas(baseline_data, current_data)
Delta-->>CLI: metric deltas with PASS/WARN/FAIL status
alt --check-only
CLI-->>User: print "regression" or "ok"
else full output
CLI-->>User: print terminal table
opt --output provided
CLI->>Report: generate_markdown_summary(deltas)
Report-->>CLI: markdown report
CLI-->>User: write markdown file
end
end
CLI-->>User: exit code (non-zero if critical FAIL and --fail-on-critical-regression)
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@script/regression/compare_abc_runs.py`:
- Around line 89-110: The verdict logic in compare_abc_runs currently ignores
total cumulative critical regressions unless pr_deltas is None, so a split A→B
and B→C failure can still pass. Update the gate in the main decision block to
consider total_has_critical alongside the existing pr_has_critical and
okp_has_critical checks, using the same verdict flow in the compare_abc_runs
function so any critical total regression returns a failing result when
intended.
In `@script/regression/compare_against_baseline.py`:
- Around line 141-157: The status calculation in compare_against_baseline should
not default to PASS when a baseline metric is present but the current run is
missing it. Update the logic around score_delta/status in the comparison flow to
treat “present in baseline, missing in current” as a degraded outcome, and apply
the same handling for pass_rate_delta where relevant. Use the existing
compare_against_baseline metric handling and CRITICAL_METRICS thresholding so
missing current values are reported as WARN or FAIL instead of PASS.
- Around line 232-252: The compare script’s check-only mode is emitting the
baseline/current summary prints before the args.check_only branch, so stdout
contains more than the required single token. In compare_against_baseline.py,
update the control flow around compute_metric_deltas and the check-only handling
so the summary prints are skipped when args.check_only is set, leaving only the
final ok/regression output from the check_only path.
In `@tests/script/test_compare_against_baseline.py`:
- Around line 48-51: The missing-directory test in
test_raises_on_missing_directory is nondeterministic because it hardcodes a /tmp
path that may exist on some machines. Update the test to use pytest’s tmp_path
fixture and construct a guaranteed-missing subpath (for example via tmp_path
with a non-created child) before calling find_and_load_summary, so the
FileNotFoundError assertion always exercises the intended path.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e9f84c03-82ee-4691-8e2f-cb77a7f724fc
📒 Files selected for processing (9)
baselines/lcore_regression/current_baseline_summary.jsonconfig/lcore_regression/system-config-pr-gate.yamleval_data/lcore_regression/okp_rag_quality.yamlscript/regression/__init__.pyscript/regression/compare_abc_runs.pyscript/regression/compare_against_baseline.pytests/script/conftest.pytests/script/test_compare_abc_runs.pytests/script/test_compare_against_baseline.py
…tdout, verdict logic, test determinism
…s with 4 context/retrieval metrics, and remove judge_panel config since response-quality judging no longer needed
|
Thanks!! I came across your PR, and I'm not sure if it's ready yet, but I'm wondering a few things:
cc: @asamal4 @Anxhela21 |
|
@xmican10 Thank you so much for the feedback! These are all great points, and I've addressed them just now with my latest push:
The queries and expected responses are from datasets provided by the OpenShift Installer and RHEL product groups (they are the only ones currently rated as gold) that can be found in Lightspeed Core's Evaluation Data GitLab repository. The evaluation data has been removed fro this public repo and is now cloned at pipeline runtime from a forked version of the internal Evaluation Data GitLab repo.
I agree that keeping the full Q&A datasets in the public repo is a risk, so I have removed
The baseline summary has been stripped to 3KB and now only contains aggregate metric scores (pass rates, means, confidence intervals) with no per-conversation IDs, topic names, or individual scores. The comparison scripts only need these aggregate values. The large evaluation data files have been moved to the internal GitLab repo and are no longer part of this PR. |
|
Thanks @alessandralanz for actively addressing my concerns! Thinking a bit further into the future, keeping the regression tests within ls-eval itself might expand the scope of the tooling without adding direct value for the standard ls-eval user. Wdyt @asamal4 and @Anxhela21? I'd love to get your thoughts on this from an architectural perspective. |
Description
Adds an evaluation framework for OKP RAG quality regression testing against the lightspeed-stack. Includes:
negative/off-topic queries
compare_against_baseline) with--check-onlymode for CI gatinggpt-4o-minias the judge LLMType of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
Related Tickets & Documents
Checklist before requesting a review
Testing
Please provide detailed steps to perform tests related to this code change.
How were the fix/results from this change verified? Please provide relevant screenshots or results.
make pre-commitpasses all checks (pylint, pyright, ruff, black, etc.)lightspeed-stackinstance with OKP enabledSummary by CodeRabbit
New Features
Tests