Adding support for multiple tags#278
Conversation
|
Warning Review limit reached
Next review available in: 40 minutes Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available. How can I continue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews. How do review limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window. Please refer docs for additional details. Review details⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (13)
WalkthroughThe tag field on EvaluationData and EvaluationResult changes from a single string to a list of strings, with new field validators normalizing legacy string input. Downstream consumers—statistics grouping, SQL/Langfuse storage, scope filtering, error result creation, and pipeline processing—are updated to handle multiple tags, alongside corresponding test updates. ChangesMulti-tag support
Estimated code review effort: 3 (Moderate) | ~25 minutes Sequence Diagram(s)sequenceDiagram
participant ConversationProcessor
participant EvaluationErrorHandler
participant EvaluationResult
ConversationProcessor->>EvaluationErrorHandler: create_error_result(..., tag=conv_data.tag)
EvaluationErrorHandler->>EvaluationResult: _create_result(..., tag)
EvaluationResult->>EvaluationResult: normalize tag to list[str]
EvaluationResult-->>ConversationProcessor: EvaluationResult with tag list
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (3)
tests/unit/core/storage/test_sql_storage.py (1)
269-313: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAdd assertion for tag JSON serialization.
test_all_fields_storedsetstag=["integration"]but never asserts onrow.tag, so the newjson.dumps(result.tag)behavior insql_storage.pyisn't actually verified here (unlikeexpected_response, which has its owntest_list_serialization). As per coding guidelines, "Add comprehensive pytest coverage for new features, including mocked LLM calls where applicable."✅ Proposed assertion addition
assert row is not None assert row.conversation_group_id == "conv_full" assert row.score == 0.92 assert row.evaluation_latency == 2.5 assert row.tool_calls == '[{"name": "search"}]' + assert row.tag == '["integration"]'🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/core/storage/test_sql_storage.py` around lines 269 - 313, The SQL storage test covers full-field persistence but does not verify JSON serialization for the tag field, so the new json.dumps(result.tag) behavior in SQLStorageBackend.save_result is untested. Update test_all_fields_stored in test_sql_storage.py to assert the persisted row.tag value matches the serialized tag list, using the existing result setup and row fetched from evaluation_results.Source: Coding guidelines
tests/unit/core/output/test_statistics_detailed.py (1)
317-373: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winMissing coverage for a single result with multiple tags.
All updated cases use single-element tag lists. None exercises the new multi-bucket behavior itself, i.e. a result whose
tagcontains more than one value (e.g.,tag=["production", "staging"]) contributing to multipleby_taggroups simultaneously. Consider adding a case for that scenario to directly validate thefor t in r.tagiteration added incompute_tag_stats.✅ Example additional test case
def test_compute_detailed_stats_by_multiple_tags_single_result(self) -> None: """Test a single result with multiple tags contributes to each tag group.""" results = [ EvaluationResult( conversation_group_id="conv1", tag=["production", "staging"], turn_id="turn1", metric_identifier="metric1", result="PASS", score=0.9, threshold=0.7, reason="Good", ), ] stats = compute_detailed_stats(results).model_dump() assert "production" in stats["by_tag"] assert "staging" in stats["by_tag"] assert stats["by_tag"]["production"]["passed"] == 1 assert stats["by_tag"]["staging"]["passed"] == 1🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/core/output/test_statistics_detailed.py` around lines 317 - 373, The detailed stats tests still miss the new multi-tag bucket behavior. Add a focused test alongside test_compute_detailed_stats_by_tag that uses a single EvaluationResult with tag containing multiple values (for example, both production and staging) and assert compute_detailed_stats()/compute_tag_stats places that one result into each corresponding by_tag entry. Verify both tag groups show the expected passed/failed counts so the for t in r.tag logic is directly covered.src/lightspeed_evaluation/core/models/data.py (1)
589-595: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAvoid exposing validator helpers as public APIs without Google-style docs.
Either rename these validator methods to
_normalize_tagor expand the docstrings to Google style. As per coding guidelines,src/lightspeed_evaluation/**/*.py: all public APIs must use Google-style docstrings.Also applies to: 725-731
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lightspeed_evaluation/core/models/data.py` around lines 589 - 595, The validator helper methods normalize_tag and the similar validator in the referenced block are being exposed as public APIs without Google-style docs. Rename these helpers to private names like _normalize_tag, or update their docstrings to Google-style format so they comply with the project’s public API documentation rules while keeping the field_validator behavior unchanged.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lightspeed_evaluation/core/models/data.py`:
- Around line 523-527: Preserve non-empty tag validation in the tag
normalization path: `Field(min_length=1)` on `tag` only enforces list size, so
empty strings can still slip through after `EvaluationData.normalize_tag` and
`EvaluationResult.normalize_tag` convert inputs to `list[str]`. Update those
normalization methods to reject or filter out empty/blank items, and add an
item-level string constraint on `tag` so each element must be non-empty.
In `@src/lightspeed_evaluation/core/output/statistics.py`:
- Around line 222-226: The tag grouping in the statistics aggregation is
counting duplicate tags from EvaluationResult.tag more than once. Update the
grouping logic in the statistics code that builds grouped from results so it
iterates over unique tags per result (for example by deduplicating r.tag before
appending), or normalize tags earlier in the EvaluationResult flow to prevent
double counting in tag stats.
In `@tests/unit/core/models/test_data.py`:
- Around line 457-464: The single-string normalization tests are not exercising
the actual compatibility path because
`test_single_string_tag_normalized_to_list` passes a list instead of a string.
Update the `EvaluationData` and `EvaluationResult` coverage in these tests to
pass `tag="basic"` so the `mode="before"` string-to-list normalization is
verified, and keep the assertions checking the normalized list output.
---
Nitpick comments:
In `@src/lightspeed_evaluation/core/models/data.py`:
- Around line 589-595: The validator helper methods normalize_tag and the
similar validator in the referenced block are being exposed as public APIs
without Google-style docs. Rename these helpers to private names like
_normalize_tag, or update their docstrings to Google-style format so they comply
with the project’s public API documentation rules while keeping the
field_validator behavior unchanged.
In `@tests/unit/core/output/test_statistics_detailed.py`:
- Around line 317-373: The detailed stats tests still miss the new multi-tag
bucket behavior. Add a focused test alongside test_compute_detailed_stats_by_tag
that uses a single EvaluationResult with tag containing multiple values (for
example, both production and staging) and assert
compute_detailed_stats()/compute_tag_stats places that one result into each
corresponding by_tag entry. Verify both tag groups show the expected
passed/failed counts so the for t in r.tag logic is directly covered.
In `@tests/unit/core/storage/test_sql_storage.py`:
- Around line 269-313: The SQL storage test covers full-field persistence but
does not verify JSON serialization for the tag field, so the new
json.dumps(result.tag) behavior in SQLStorageBackend.save_result is untested.
Update test_all_fields_stored in test_sql_storage.py to assert the persisted
row.tag value matches the serialized tag list, using the existing result setup
and row fetched from evaluation_results.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c4a50ffd-c5dc-4b52-b6f4-ae1a51dde01a
📒 Files selected for processing (12)
src/lightspeed_evaluation/core/models/data.pysrc/lightspeed_evaluation/core/output/statistics.pysrc/lightspeed_evaluation/core/storage/langfuse_storage.pysrc/lightspeed_evaluation/core/storage/sql_storage.pysrc/lightspeed_evaluation/core/system/validator.pysrc/lightspeed_evaluation/pipeline/evaluation/errors.pysrc/lightspeed_evaluation/pipeline/evaluation/processor.pytests/unit/core/models/test_data.pytests/unit/core/output/test_statistics_detailed.pytests/unit/core/storage/test_sql_storage.pytests/unit/core/system/test_validator.pytests/unit/pipeline/evaluation/conftest.py
3c147f2 to
8a9c46a
Compare
8a9c46a to
2df3380
Compare
Description
Support multiple tags, backward compatibility must be ensured.
Type of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
Related Tickets & Documents
Checklist before requesting a review
Testing
Summary by CodeRabbit
New Features
Bug Fixes
Tests