-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Fix docFreq in score calculation after rewrite of boolean query consisting of blended query and boosted term query #12354
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Test checks if after rewrite we still have original term query with null termStates and not one generated from blended fuzzy query that actually has termStates with wrong docFreq. It fails for tests.seed=1.
When there is a boolean query consisting of a fuzzy query and a boosted term query during rewrite: 1. fuzzy query is replaced by BlendedTermQuery with a series of term queries with a matching edit distance 2. BlendedTermQuery is replaced by a series of boosted term queries with a non-null termStates that have one common docFreq value that is false for some terms (see BlendedTermQuery::adjustFrequencies). 3. Because TermQuery::equals implementation was not taking into account termStates, both the generated term query with non-null termStates and original boosted termQuery were merged together. Resulting TermQuery termStates depended on hash code that is based on Solr startup time (can be changed using tests.seed property). Because of that similarities that use docFreq can return wrong score. This commit changes equals and hashCode implementation in TermQuery so one generated from fuzzy query and original one are not merged together anymore. Fixes testDocFreqAfterTermAndFuzzyRewrite (added in previous commit). Fixes apache#10309
stefanvodita
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for addressing this issue. I left a few comments.
| return false; | ||
| } | ||
| if (perReaderTermState != null && otherTermQuery.perReaderTermState != null) { | ||
| return perReaderTermState.docFreq() == otherTermQuery.perReaderTermState.docFreq(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think of implementing equals for TermStates and delegating this part to it? I see docFreq can throw an exception and we could handle that case too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. I didn't do it in the first place because TermStates has more fields so it felt wrong to implement equals/hashCode based only on one field, but at the same time I wasn't sure what is the meaning of other fields and I was worried that comparing them too would affect performance. Let me know if new code is okay for you or if I should make some adjustments
| @Override | ||
| public boolean equals(Object other) { | ||
| return sameClassAs(other) && term.equals(((TermQuery) other).term); | ||
| if (!sameClassAs(other)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lucene code generally prefers to avoid ! for negations, so this would be sameClassAs(other) == false.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Negation was removed
| return classHash() ^ term.hashCode(); | ||
| int hash = classHash() ^ term.hashCode(); | ||
| if (perReaderTermState != null) { | ||
| hash ^= Integer.hashCode(perReaderTermState.docFreq()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to the question about equals, what if we implemented TermStates.hashCode?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, but I'm not sure if calculating hash code from just one field is okay
|
Thank you @rafalh! Query scores depending on |
|
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution! |
|
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution! |
…core # Conflicts: # lucene/core/src/test/org/apache/lucene/search/TestTermQuery.java
|
Sorry for long delay @stefanvodita . I totally lost track of this PR |
5eb746a to
22144e1
Compare
|
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution! |
Description
When there is a boolean query consisting of a fuzzy query and a boosted term query during rewrite:
BlendedTermQuerywith a series of term queries with a matching edit distanceBlendedTermQueryis replaced by a series of boosted term queries with a non-nulltermStatesthat have one commondocFreqvalue that is false for some terms (seeBlendedTermQuery::adjustFrequencies).TermQuery::equalsimplementation was not taking into accounttermStates, both the generated term query with non-nulltermStatesand original boosted term query were merged together. ResultingTermQuerytermStatesdepended on hash code that is based on Solr startup time (can be changed usingtests.seedproperty). Because of that similarities that usedocFreqcan return wrong score.This PR changes
equalsandhashCodeimplementation inTermQueryso one generated from fuzzy query and original one are not merged together anymore. Also added a test making sure it works as intended. Test was failing fortests.seed=1.Fixes #10309