I believe the goal of the hybrid search benchmark is to compute the top-10 hits for the lexical search and the semantic search separately, and then combine those hits by summing their scores.
This is not exactly what the Elasticsearch _search call does, since it puts the vector query in a SHOULD clause of the bool query. As a result, vector scores are computed first and then summed with the lexical scores, and only then are the top-10 hits selected based on the summed scores, which is harder on dynamic pruning. Switching to a linear retriever should hopefully fix this and make the comparison with Vespa a bit fairer.
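
To make the difference concrete, here is a rough sketch of the two request shapes as Python dicts. It is not the benchmark's actual code: the field names `text` and `embedding`, the helper names, and the `num_candidates` value are all assumptions for illustration. As far as I know, the linear retriever applies no score normalization unless a `normalizer` is set, so equal weights amount to a plain sum.

```python
# Sketch only: "text", "embedding", num_candidates=100 and both helper
# names are hypothetical, not taken from the benchmark.

def bool_should_body(query_text, query_vector, k=10):
    """Current shape: the kNN query is a SHOULD clause of a bool query,
    so top-k is selected on the summed lexical + vector score."""
    return {
        "size": k,
        "query": {
            "bool": {
                "should": [
                    {"match": {"text": query_text}},
                    {
                        "knn": {
                            "field": "embedding",
                            "query_vector": query_vector,
                            "k": k,
                            "num_candidates": 100,
                        }
                    },
                ]
            }
        },
    }


def linear_retriever_body(query_text, query_vector, k=10):
    """Proposed shape: each child retriever produces its own top-k
    (rank_window_size), then the weighted scores are summed."""
    return {
        "size": k,
        "retriever": {
            "linear": {
                "rank_window_size": k,
                "retrievers": [
                    {
                        "retriever": {
                            "standard": {"query": {"match": {"text": query_text}}}
                        },
                        "weight": 1.0,
                    },
                    {
                        "retriever": {
                            "knn": {
                                "field": "embedding",
                                "query_vector": query_vector,
                                "k": k,
                                "num_candidates": 100,
                            }
                        },
                        "weight": 1.0,
                    },
                ],
            }
        },
    }
```

With the second shape, the lexical side only has to produce its own top-k, so it does not depend on vector scores when pruning.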