-
Notifications
You must be signed in to change notification settings - Fork 62
Raae 1200: Add index level stopwords support in configuration in index schema #436
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Raae 1200: Add index level stopwords support in configuration in index schema #436
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds index-level stopwords configuration support to RedisVL, enabling users to control which words are filtered during indexing rather than only at query time. This addresses the limitation where phrases like "Bank of America" couldn't be properly searched due to stopwords being filtered at index time.
Key Changes:
- Added
stopwordsfield toIndexInfoclass supporting three modes: None (default), [] (STOPWORDS 0), or custom list - Updated index creation logic to pass stopwords configuration to Redis FT.CREATE command
- Added parsing of stopwords from FT.INFO output to support
from_existing()reconstruction - Implemented warning system for counterproductive query-time stopwords with STOPWORDS 0
Reviewed Changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
| redisvl/schema/schema.py | Added stopwords field to IndexInfo class with documentation |
| redisvl/schema/fields.py | Refactored field modifier ordering with _normalize_field_modifiers helper |
| redisvl/redis/connection.py | Added stopwords parsing in convert_index_info_to_schema() |
| redisvl/redis/utils.py | Updated cluster_create_index functions to accept stopwords parameter |
| redisvl/index/index.py | Updated create() methods to extract and pass stopwords; added validation warning |
| redisvl/query/query.py | Enhanced TextQuery.stopwords docstring with interaction notes |
| tests/unit/test_stopwords_schema.py | 8 unit tests for schema stopwords configuration |
| tests/unit/test_convert_index_info.py | 3 tests for FT.INFO stopwords parsing |
| tests/unit/test_field_modifier_ordering.py | Comprehensive tests for field modifier ordering fix |
| tests/integration/test_stopwords_integration.py | 11 integration tests for stopwords functionality |
| tests/integration/test_field_modifier_ordering_integration.py | Integration tests against live Redis |
| docs/user_guide/11_advanced_queries.ipynb | Added 8 cells demonstrating stopwords usage |
| docs/api/schema.rst | Added documentation for index-level stopwords configuration |
| docs/api/query.rst | Added note about stopwords interaction |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
justin-cechmanek
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking really good. Just a few questions, and suggest we include AggregateHybridQuery to the stopwords modifications.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (1)
docs/user_guide/11_advanced_queries.ipynb:1
- The example uses
SearchIndex.from_dict()to create an index, but based on the codebase patterns,from_dict()is typically a method onIndexSchema, notSearchIndex. The correct pattern should likely be:schema = IndexSchema.from_dict(stopwords_schema); company_index = SearchIndex(schema=schema, redis_url=...). This could confuse users trying to follow the example.
{
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| "nbconvert_exporter": "python", | ||
| "pygments_lexer": "ipython3", | ||
| "version": "3.13.0" | ||
| "pygments_lexer": "ipython3" |
Copilot
AI
Nov 21, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The version field was removed from the notebook metadata's language_info section. This is generally acceptable as notebook executors can infer the version, but be aware this might cause version ambiguity in some environments. Consider documenting the required Python version elsewhere if it's critical.
| "pygments_lexer": "ipython3" | |
| "pygments_lexer": "ipython3", | |
| "version": "3.10.12" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
tests/unit/test_field_modifier_ordering.py:12
- Import of 'pytest' is not used.
import pytest
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
rbs333
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is also solid however I wouldn't use "Bank of America" as the primary example because it could be a bit of a faux pas. It can be any other phrase of that kind but if I were investigating a repo like this and saw "Bank of America" as the marquee I would deduct open source cred if that makes sense.
|
31e1237 to
3b0c8fb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
5c64a4d to
7cb8a16
Compare
Add support for configuring stopwords at index creation time via IndexInfo.stopwords field. - Add stopwords field to IndexInfo class (None/[]/custom list) - Update SearchIndex.create() and AsyncSearchIndex.create() to pass stopwords - Update convert_index_info_to_schema() to parse stopwords from FT.INFO - Update cluster_create_index() functions to accept stopwords parameter - Add warning when using query-time stopwords with index-level STOPWORDS 0 - Add comprehensive documentation in 11_advanced_queries.ipynb - Create stopwords_interaction_guide.md explaining best practices
- Add note to TextQuery docstring about index-level vs query-time stopwords - Add stopwords section to schema.rst with configuration examples - Add note to query.rst about stopwords interaction - Link stopwords_interaction_guide.md from 11_advanced_queries.ipynb These updates improve discoverability and help users understand the interaction between index-level and query-time stopwords configuration.
- Clear notebook outputs and Python version metadata to reduce git noise - Remove unused 'info' variable in test_create_index_with_default_stopwords - Add explanatory comments to all exception handlers in cleanup blocks These changes improve code quality and maintainability without affecting functionality.
Remove broad 'except Exception: pass' blocks in test cleanup that were hiding real failures. Instead: - test_stopwords_integration.py: Created pytest fixtures for each index type (stopwords_disabled_index, custom_stopwords_index, default_stopwords_index) that handle cleanup automatically - test_field_modifier_ordering_integration.py: Removed try/finally/except blocks and replaced with direct cleanup calls that will fail if there's a real problem (connection issues, permissions, etc.) This matches the pattern used in most other integration tests and ensures that cleanup failures are visible rather than silently ignored.
Three improvements:
1. Rename test class for clarity
- TestMultipleCommandsScenarioIntegration → TestFieldModifierIntegration
- test_mlp_commands_index_creation → test_index_creation_with_multiple_modifiers
- Use general naming instead of MLP-specific terminology for open-source
2. Move imports to file level
- Moved FilterQuery import from inside test function to top of file
- Follows standard Python convention
3. Extend stopwords validation to AggregateHybridQuery
- Added AggregateHybridQuery to stopwords warning in _validate_query()
- Both TextQuery and AggregateHybridQuery now warn when using query-time
stopwords with index-level STOPWORDS 0
- Warning message now includes the specific query type for clarity
Add documentation about index-level vs query-time stopwords interaction for AggregateHybridQuery: 1. Updated AggregateHybridQuery.__init__() docstring - Added note about query-time vs index-level stopwords - References stopwords_interaction_guide.md for details 2. Updated docs/api/query.rst - Added note for HybridQuery/AggregateHybridQuery section - Matches the note already present for TextQuery - Links to Stopwords Interaction Guide This completes the documentation updates for stopwords support across all query types that use stopwords (TextQuery, AggregateHybridQuery).
…mment
Two improvements:
1. Remove all references to docs/stopwords_interaction_guide.md
- Removed from docs/api/query.rst (HybridQuery and TextQuery notes)
- Removed from docs/api/schema.rst (replaced with reference to user guide)
- Removed from redisvl/query/query.py (TextQuery docstring)
- Removed from redisvl/query/aggregate.py (AggregateHybridQuery docstring)
- Removed from redisvl/index/index.py (warning message)
- Removed from docs/user_guide/11_advanced_queries.ipynb
- The notebook already has comprehensive stopwords documentation
2. Fix inaccurate comment about stopwords=None
- Updated comment in redisvl/index/index.py to clarify that ANY falsy
value (None, False, '', 0, [], etc.) results in an empty set, not just None
- This matches the actual implementation in _set_stopwords() which uses
'if not stopwords:' check
- Updated warning message to mention 'stopwords=None (or any falsy value)'
GitHub Copilot correctly identified that markdown cells should not have execution metadata. This metadata is an artifact from running the notebook and can cause issues with notebook validation or rendering in some tools. Cleaned 9 markdown cells by removing their 'execution' metadata blocks. Only code cells should have execution metadata.
Changed all references from 'Bank of America' to 'Bank of Berlin' for consistency throughout the stopwords documentation in the notebook. Updated 6 occurrences across: - Example data (company_name field) - Search query and comment - Print statement - Markdown explanation text This addresses the GitHub Copilot feedback about inconsistent entity names.
Updated the explanation text to use 'Bank Berlin' instead of 'Bank America' for consistency. The phrase now correctly shows that 'Bank of Berlin' would be indexed as 'Bank Berlin' (without 'of') when using default stopwords.
7cb8a16 to
8f8a86c
Compare
bsbodden
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nicely done @nkanu17 !
Add Index-Level STOPWORDS Configuration Support
Summary
This PR adds support for configuring stopwords at index creation time through the
IndexInfo.stopwordsfield, enabling users to control which words are filtered during indexing rather than only at query time.Related to issue: #432
Motivation
Previously, RedisVL had no way to configure index-level
STOPWORDSfor FT.CREATE, even though redis-py'screate_index()API supports it. Users could only control stopwords at query time viaTextQuery.stopwords, which is insufficient for use cases like:Changes
Core Implementation
redisvl/schema/schema.py: Addedstopwords: Optional[List[str]]field toIndexInfoclassNone(default): Use Redis default stopwords (~300 words)[](empty list): Disable stopwords completely (STOPWORDS 0)redisvl/index/index.py: UpdatedSearchIndex.create()andAsyncSearchIndex.create()to extract and pass stopwords from schema tocreate_index()redisvl/redis/connection.py: Updatedconvert_index_info_to_schema()to parse stopwords from FT.INFO output, enablingfrom_existing()to preserve stopwords configurationredisvl/redis/utils.py: Updatedcluster_create_index()andasync_cluster_create_index()to accept stopwords parameterWarning System
Added validation in
BaseSearchIndex._validate_query()to warn users when using query-time stopwords with index-level STOPWORDS 0, as this is counterproductive (indexes all words but filters them at query time).Tests
Documentation
docs/user_guide/11_advanced_queries.ipynb: Added 8 cells (22-29) with examples and explanationsdocs/stopwords_interaction_guide.md: Comprehensive guide explaining index-level vs query-time stopwords interactionIndexInfoclass with YAML and dict examplesUsage Examples
Disable stopwords (STOPWORDS 0)