Skip to content

Conversation

@rafalh
Copy link

@rafalh rafalh commented Jun 7, 2023

Description

When there is a boolean query consisting of a fuzzy query and a boosted term query during rewrite:

  1. fuzzy query is replaced by BlendedTermQuery with a series of term queries with a matching edit distance
  2. BlendedTermQuery is replaced by a series of boosted term queries with a non-null termStates that have one common docFreq value that is false for some terms (see BlendedTermQuery::adjustFrequencies).
  3. Because TermQuery::equals implementation was not taking into account termStates, both the generated term query with non-null termStates and original boosted term query were merged together. Resulting TermQuery termStates depended on hash code that is based on Solr startup time (can be changed using tests.seed property). Because of that similarities that use docFreq can return wrong score.

This PR changes equals and hashCode implementation in TermQuery so one generated from fuzzy query and original one are not merged together anymore. Also added a test making sure it works as intended. Test was failing for tests.seed=1.

Fixes #10309

Rafal Harabien added 2 commits June 7, 2023 10:41
Test checks if after rewrite we still have original term query with null termStates and not one generated from blended fuzzy query that actually has termStates with wrong docFreq.
It fails for tests.seed=1.
When there is a boolean query consisting of a fuzzy query and a boosted term query during rewrite:
1. fuzzy query is replaced by BlendedTermQuery with a series of term queries with a matching edit distance
2. BlendedTermQuery is replaced by a series of boosted term queries with a non-null termStates that have one common docFreq value that is false for some terms (see BlendedTermQuery::adjustFrequencies).
3. Because TermQuery::equals implementation was not taking into account termStates, both the generated term query with non-null termStates and original boosted termQuery were merged together. Resulting TermQuery termStates depended on hash code that is based on Solr startup time (can be changed using tests.seed property). Because of that similarities that use docFreq can return wrong score.

This commit changes equals and hashCode implementation in TermQuery so one generated from fuzzy query and original one are not merged together anymore.

Fixes testDocFreqAfterTermAndFuzzyRewrite (added in previous commit).
Fixes apache#10309
Copy link
Contributor

@stefanvodita stefanvodita left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for addressing this issue. I left a few comments.

return false;
}
if (perReaderTermState != null && otherTermQuery.perReaderTermState != null) {
return perReaderTermState.docFreq() == otherTermQuery.perReaderTermState.docFreq();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of implementing equals for TermStates and delegating this part to it? I see docFreq can throw an exception and we could handle that case too.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I didn't do it in the first place because TermStates has more fields so it felt wrong to implement equals/hashCode based only on one field, but at the same time I wasn't sure what is the meaning of other fields and I was worried that comparing them too would affect performance. Let me know if new code is okay for you or if I should make some adjustments

@Override
public boolean equals(Object other) {
return sameClassAs(other) && term.equals(((TermQuery) other).term);
if (!sameClassAs(other)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lucene code generally prefers to avoid ! for negations, so this would be sameClassAs(other) == false.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Negation was removed

return classHash() ^ term.hashCode();
int hash = classHash() ^ term.hashCode();
if (perReaderTermState != null) {
hash ^= Integer.hashCode(perReaderTermState.docFreq());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the question about equals, what if we implemented TermStates.hashCode?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, but I'm not sure if calculating hash code from just one field is okay

@mikemccand
Copy link
Member

Thank you @rafalh! Query scores depending on HashMap iteration order is really awful. And thank you @stefanvodita for reviewing. @rafalh do you want to fold in the feedback maybe? Thanks!

@github-actions
Copy link
Contributor

github-actions bot commented Jan 8, 2024

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Jan 8, 2024
@github-actions github-actions bot removed the Stale label Oct 1, 2025
@github-actions
Copy link
Contributor

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Oct 17, 2025
…core

# Conflicts:
#	lucene/core/src/test/org/apache/lucene/search/TestTermQuery.java
@rafalh
Copy link
Author

rafalh commented Nov 17, 2025

Sorry for long delay @stefanvodita . I totally lost track of this PR

@rafalh rafalh force-pushed the blended-query-wrong-score branch from 5eb746a to 22144e1 Compare November 17, 2025 14:08
@github-actions github-actions bot added this to the 11.0.0 milestone Nov 17, 2025
@github-actions github-actions bot removed the Stale label Nov 18, 2025
@github-actions
Copy link
Contributor

github-actions bot commented Dec 2, 2025

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Dec 2, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Blended queries with boolean rewrite can result in inconsistent scores [LUCENE-9269]

3 participants