Fix docFreq in score calculation after rewrite of boolean query consisting of blended query and boosted term query #12354

rafalh · 2023-06-07T10:51:16Z

Description

When a boolean query consisting of a fuzzy query and a boosted term query is rewritten:

1. The fuzzy query is replaced by a BlendedTermQuery with a series of term queries with a matching edit distance.
2. The BlendedTermQuery is replaced by a series of boosted term queries with non-null termStates that share one common docFreq value, which is incorrect for some terms (see BlendedTermQuery::adjustFrequencies).
3. Because the TermQuery::equals implementation did not take termStates into account, the generated term query with non-null termStates and the original boosted term query were merged together. The termStates of the resulting TermQuery depended on a hash code that is based on Solr startup time (it can be changed using the tests.seed property). As a result, similarities that use docFreq can return a wrong score.

This PR changes the equals and hashCode implementations in TermQuery so that the query generated from the fuzzy query and the original one are no longer merged together. It also adds a test to make sure this works as intended; the test was failing for tests.seed=1.

Fixes #10309

The test checks that after rewrite we still have the original term query with null termStates, and not the one generated from the blended fuzzy query, whose termStates carry the wrong docFreq. It fails for tests.seed=1.
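The merge described above can be illustrated without Lucene. The following is a minimal sketch, assuming a hypothetical `SketchTermQuery` class (not Lucene's TermQuery) whose nullable `cachedDocFreq` stands in for termStates, and a hypothetical `mergeDuplicates` helper that stands in for the deduplication of equal clauses during rewrite. The `compareStates` flag exists only so that one class can show both the old and the states-aware equals; it is an illustration device, not part of the PR.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Hypothetical stand-in for a term query; this is not Lucene's TermQuery. */
final class SketchTermQuery {
  final String term;
  final Integer cachedDocFreq; // null models a query without termStates
  final boolean compareStates; // false: old equals, true: states-aware equals

  SketchTermQuery(String term, Integer cachedDocFreq, boolean compareStates) {
    this.term = term;
    this.cachedDocFreq = cachedDocFreq;
    this.compareStates = compareStates;
  }

  @Override
  public boolean equals(Object other) {
    if (other instanceof SketchTermQuery == false) {
      return false;
    }
    SketchTermQuery that = (SketchTermQuery) other;
    if (term.equals(that.term) == false) {
      return false;
    }
    // The old behaviour stops here; the states-aware variant also compares docFreq.
    return compareStates == false || Objects.equals(cachedDocFreq, that.cachedDocFreq);
  }

  @Override
  public int hashCode() {
    int hash = term.hashCode();
    return compareStates ? 31 * hash + Objects.hashCode(cachedDocFreq) : hash;
  }
}

final class MergeSketch {
  /** Counts equal clauses; the first instance seen stays as the map key. */
  static Map<SketchTermQuery, Integer> mergeDuplicates(List<SketchTermQuery> clauses) {
    Map<SketchTermQuery, Integer> merged = new LinkedHashMap<>();
    for (SketchTermQuery clause : clauses) {
      merged.merge(clause, 1, Integer::sum);
    }
    return merged;
  }
}
```

With `compareStates` set to false, a generated query carrying a docFreq and an original query without one collapse into a single entry, and the surviving instance is simply whichever was encountered first. With `compareStates` set to true, both entries survive.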

This commit changes the equals and hashCode implementations in TermQuery so that the query generated from the fuzzy query and the original one are no longer merged together. Fixes testDocFreqAfterTermAndFuzzyRewrite (added in previous commit). Fixes apache#10309

stefanvodita

Thank you for addressing this issue. I left a few comments.

stefanvodita · 2023-07-29T10:31:10Z

lucene/core/src/java/org/apache/lucene/search/TermQuery.java

+      return false;
+    }
+    if (perReaderTermState != null && otherTermQuery.perReaderTermState != null) {
+      return perReaderTermState.docFreq() == otherTermQuery.perReaderTermState.docFreq();


What do you think of implementing equals for TermStates and delegating this part to it? I see docFreq can throw an exception and we could handle that case too.

Done. I didn't do it in the first place because TermStates has more fields, so it felt wrong to implement equals/hashCode based on only one field. At the same time, I wasn't sure what the other fields mean, and I was worried that comparing them too would affect performance. Let me know if the new code is okay for you or if I should make some adjustments.
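For illustration, the delegation discussed here could look roughly like the sketch below. It assumes a hypothetical `SketchTermStates` class with a single statistic and a `hasStats` flag that makes `docFreq()` throw, plus a hypothetical `DelegatingTermQuery`; the real TermStates has more fields and its own conditions for throwing, so this is not the code in the PR.

```java
import java.util.Objects;

/** Hypothetical simplification; the real TermStates has more fields. */
final class SketchTermStates {
  private final boolean hasStats;
  private final int docFreq;

  SketchTermStates(boolean hasStats, int docFreq) {
    this.hasStats = hasStats;
    this.docFreq = docFreq;
  }

  /** Throws when statistics were not collected (assumed behaviour for this sketch). */
  int docFreq() {
    if (hasStats == false) {
      throw new IllegalStateException("term stats were not collected");
    }
    return docFreq;
  }

  @Override
  public boolean equals(Object other) {
    if (other instanceof SketchTermStates == false) {
      return false;
    }
    SketchTermStates that = (SketchTermStates) other;
    if (hasStats != that.hasStats) {
      return false;
    }
    // Read the field directly so equals never triggers the exception in docFreq().
    return hasStats == false || docFreq == that.docFreq;
  }

  @Override
  public int hashCode() {
    return hasStats ? Integer.hashCode(docFreq) : 0;
  }
}

/** Hypothetical query that delegates the state comparison to SketchTermStates. */
final class DelegatingTermQuery {
  final String term;
  final SketchTermStates states; // may be null, as for a user-supplied query

  DelegatingTermQuery(String term, SketchTermStates states) {
    this.term = term;
    this.states = states;
  }

  @Override
  public boolean equals(Object other) {
    if (other instanceof DelegatingTermQuery == false) {
      return false;
    }
    DelegatingTermQuery that = (DelegatingTermQuery) other;
    // Objects.equals handles the null/non-null combinations in one place.
    return term.equals(that.term) && Objects.equals(states, that.states);
  }

  @Override
  public int hashCode() {
    return term.hashCode() ^ Objects.hashCode(states);
  }
}
```

Keeping the null handling in `Objects.equals` and the exception handling inside the states class means the query's own equals stays a single expression.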

stefanvodita · 2023-07-29T10:32:12Z

lucene/core/src/java/org/apache/lucene/search/TermQuery.java

  @Override
  public boolean equals(Object other) {
-    return sameClassAs(other) && term.equals(((TermQuery) other).term);
+    if (!sameClassAs(other)) {


Lucene code generally prefers to avoid ! for negations, so this would be sameClassAs(other) == false.

Negation was removed

stefanvodita · 2023-07-29T10:34:03Z

lucene/core/src/java/org/apache/lucene/search/TermQuery.java

-    return classHash() ^ term.hashCode();
+    int hash = classHash() ^ term.hashCode();
+    if (perReaderTermState != null) {
+      hash ^= Integer.hashCode(perReaderTermState.docFreq());


Similar to the question about equals, what if we implemented TermStates.hashCode?

Done, but I'm not sure whether calculating the hash code from just one field is okay.
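On the question of hashing a single field: the general Java contract only requires that equal objects have equal hash codes, so a hashCode that uses a subset of the fields compared in equals is valid, at the cost of more collisions. A minimal sketch of a check for this, using a hypothetical `TwoFieldKey` type and a hypothetical `equalImpliesSameHash` helper (neither is part of Lucene or of this PR):

```java
import java.util.List;

final class HashContractSketch {
  /** Returns true if every pair of equal samples also has equal hash codes. */
  static boolean equalImpliesSameHash(List<?> samples) {
    for (Object a : samples) {
      for (Object b : samples) {
        if (a.equals(b) && a.hashCode() != b.hashCode()) {
          return false;
        }
      }
    }
    return true;
  }
}

/** Hypothetical value type: equals compares two fields, hashCode uses only one. */
final class TwoFieldKey {
  final int docFreq;
  final long totalTermFreq;

  TwoFieldKey(int docFreq, long totalTermFreq) {
    this.docFreq = docFreq;
    this.totalTermFreq = totalTermFreq;
  }

  @Override
  public boolean equals(Object other) {
    if (other instanceof TwoFieldKey == false) {
      return false;
    }
    TwoFieldKey that = (TwoFieldKey) other;
    return docFreq == that.docFreq && totalTermFreq == that.totalTermFreq;
  }

  @Override
  public int hashCode() {
    // Only one field is hashed; keys that differ in totalTermFreq alone collide.
    return Integer.hashCode(docFreq);
  }
}
```

The reverse direction is what needs care: if equals compared fewer fields than hashCode used, equal objects could end up with different hashes, which would break hash-based collections.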

mikemccand · 2023-11-02T14:28:53Z

Thank you @rafalh! Query scores depending on HashMap iteration order is really awful. And thank you @stefanvodita for reviewing. @rafalh do you want to fold in the feedback maybe? Thanks!

github-actions · 2024-01-08T12:24:02Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

github-actions · 2025-10-17T00:27:24Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

rafalh · 2025-11-17T14:07:37Z

Sorry for the long delay, @stefanvodita. I totally lost track of this PR.

github-actions · 2025-12-02T00:28:42Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

Rafal Harabien added 2 commits June 7, 2023 10:41

Add test for term query + fuzzy query rewrite

a0b99b0

stefanvodita reviewed Jul 29, 2023

github-actions bot added the Stale label Jan 8, 2024

github-actions bot removed the Stale label Oct 1, 2025

github-actions bot added the Stale label Oct 17, 2025

Merge remote-tracking branch 'origin/main' into blended-query-wrong-s…

9567ad6

…core # Conflicts: # lucene/core/src/test/org/apache/lucene/search/TestTermQuery.java

github-actions bot added module:core/index module:core/search labels Nov 17, 2025

Fix review comments

22144e1

rafalh force-pushed the blended-query-wrong-score branch from 5eb746a to 22144e1 Compare November 17, 2025 14:08

rafalh added 2 commits November 17, 2025 15:47

Fix build

1300c95

Add CHANGES entry

ff9e5b0

github-actions bot added this to the 11.0.0 milestone Nov 17, 2025

github-actions bot removed the Stale label Nov 18, 2025

github-actions bot added the Stale label Dec 2, 2025


3 participants