Cannot reproduce the LongBench results from the paper. #1

@nietzhuang

Thank you for your great work on KV cache merging. D2O's dynamic token merging method has inspired me a lot.

I am currently running your source code.
As stated in the paper, I chose N:M=3:1, alpha=0.3, and beta=0.7 under a 20% (rho=0.2) KV cache compression ratio.
I have checked that the Python dependencies and environment match what your source code requires, and I am using an NVIDIA RTX PRO 6000 GPU, which runs LongBench properly.

However, the results I get are very different from those in the paper, and they may not show D2O's strength.
If there is something I misunderstood or omitted, could you help me resolve this issue?
Thank you very much.

Results from the paper:

| Dataset | H2O | D2O |
| --- | --- | --- |
| NarrativeQA | 13.27 | 14.43 |
| Qasper | 11.05 | 12.66 |
| MF-en | 17.72 | 19.93 |
| HotpotQA | 10.38 | 11.92 |
| 2WikiMQA | 11.23 | 12.79 |
| Musique | 6.38 | 9.88 |
| GovReport | 21.29 | 24.36 |
| QMSum | 21.33 | 23.42 |
| MultiNews | 3.38 | 3.95 |
| TREC | 66.63 | 69.72 |
| TriviaQA | 89.19 | 90.99 |
| SAMSum | 41.12 | 42.36 |
| Pcount | 5.52 | 6.61 |
| Pre | 11.11 | 14.67 |
| Lcc | 71.86 | 72.43 |
| RB-P | 58.29 | 60 |

Results from the source code:

| Dataset | H2O | D2O |
| --- | --- | --- |
| NarrativeQA | 12.43 | 12.64 |
| Qasper | 12.55 | 11.92 |
| MF-en | 19.95 | 19.87 |
| HotpotQA | 10.92 | 10.72 |
| 2WikiMQA | 12.2 | 11.95 |
| Musique | 6.65 | 6.75 |
| GovReport | 22.97 | 21.13 |
| QMSum | 23.44 | 23.13 |
| MultiNews | 3.56 | 1.93 |
| TREC | 69 | 69.67 |
| TriviaQA | 90.57 | 90.63 |
| SAMSum | 41.96 | 42.05 |
| Pcount | 5.18 | 5.9 |
| Pre | 11.58 | 13.86 |
| Lcc | 69.26 | 71.57 |
| RB-P | 55.67 | 58.43 |
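
To make the gap easier to see, here is a small sketch that computes the D2O minus H2O margin per task for both sets of numbers. The scores are copied from the two tables above. The names `paper`, `repro`, and `margins` are my own and are not part of the D2O code base.

```python
# Scores copied from the two tables in this issue, as (H2O, D2O) per LongBench task.
paper = {
    "NarrativeQA": (13.27, 14.43), "Qasper": (11.05, 12.66),
    "MF-en": (17.72, 19.93), "HotpotQA": (10.38, 11.92),
    "2WikiMQA": (11.23, 12.79), "Musique": (6.38, 9.88),
    "GovReport": (21.29, 24.36), "QMSum": (21.33, 23.42),
    "MultiNews": (3.38, 3.95), "TREC": (66.63, 69.72),
    "TriviaQA": (89.19, 90.99), "SAMSum": (41.12, 42.36),
    "Pcount": (5.52, 6.61), "Pre": (11.11, 14.67),
    "Lcc": (71.86, 72.43), "RB-P": (58.29, 60.0),
}
repro = {
    "NarrativeQA": (12.43, 12.64), "Qasper": (12.55, 11.92),
    "MF-en": (19.95, 19.87), "HotpotQA": (10.92, 10.72),
    "2WikiMQA": (12.2, 11.95), "Musique": (6.65, 6.75),
    "GovReport": (22.97, 21.13), "QMSum": (23.44, 23.13),
    "MultiNews": (3.56, 1.93), "TREC": (69.0, 69.67),
    "TriviaQA": (90.57, 90.63), "SAMSum": (41.96, 42.05),
    "Pcount": (5.18, 5.9), "Pre": (11.58, 13.86),
    "Lcc": (69.26, 71.57), "RB-P": (55.67, 58.43),
}


def margins(table):
    """Return the D2O minus H2O score for each task, rounded to 2 decimals."""
    return {task: round(d2o - h2o, 2) for task, (h2o, d2o) in table.items()}


paper_margin = margins(paper)
repro_margin = margins(repro)

# Print the margins side by side, one task per line.
for task in paper:
    print(f"{task:<12} paper {paper_margin[task]:+6.2f}   repro {repro_margin[task]:+6.2f}")

# Tasks where D2O scores below H2O in my reproduction.
regressions = [task for task, m in repro_margin.items() if m < 0]
print("D2O below H2O in reproduction:", regressions)
```

With these numbers, D2O is ahead of H2O on every task in the paper, while in my run it falls below H2O on 7 of the 16 tasks, with the largest drops on GovReport and MultiNews.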
