
Commit 68b0365

Longbench group fix (#3359)

* make groups
* add longbench_e groups
* standardize scoring
* fix readme
* preserve original scoring method
* renaming scoring for better readability
* fix supergroup
* change alias to LongBench-E
* standardize
* fix typo
* increment version

---------

Co-authored-by: jannalulu <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>
1 parent 0563daa commit 68b0365

Note: large commits have some content hidden by default, so not every changed file appears below.
52 files changed (+484 / −114 lines)

lm_eval/evaluator_utils.py

Lines changed: 16 additions & 0 deletions

```diff
@@ -509,6 +509,22 @@ def consolidate_group_results(
         group_metadata = group_config.get("metadata", None)
         if group_metadata is not None:
             versions[group_or_task] = group_metadata.get("version", None)
+
+    # Clean up duplicate score rows for subtasks that also report other metrics.
+    for task in task_list:
+        task_metrics = [
+            key
+            for key in results[task].keys()
+            if "," in key and not key.startswith("score_stderr")
+        ]
+        score_metrics = [
+            key for key in task_metrics if key.startswith("score,")
+        ]
+        if score_metrics and len(task_metrics) > len(score_metrics):
+            for score_metric in score_metrics:
+                results[task].pop(score_metric, None)
+                stderr_key = score_metric.replace("score,", "score_stderr,")
+                results[task].pop(stderr_key, None)
     # print(results)
     return results, versions, show_group_table, task_aggregation_list
```
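To make the new cleanup concrete, here is a minimal standalone sketch of the same logic run against a hypothetical results dict (the task name and values are made up; the keys follow the harness's `metric,filter` naming convention):

```python
# Hypothetical per-task results, keyed "metric,filter" as in the harness output.
results = {
    "longbench_2wikimqa": {
        "alias": "longbench_2wikimqa",
        "score,none": 0.31,
        "score_stderr,none": 0.01,
        "qa_f1_score,none": 0.31,
        "qa_f1_score_stderr,none": 0.01,
    }
}

for task_results in results.values():
    # Metric rows contain a comma; "score_stderr" rows are paired up below.
    task_metrics = [
        key
        for key in task_results
        if "," in key and not key.startswith("score_stderr")
    ]
    score_metrics = [key for key in task_metrics if key.startswith("score,")]
    # Drop the generic "score" rows only when the task also reports a more
    # specific metric (here, qa_f1_score).
    if score_metrics and len(task_metrics) > len(score_metrics):
        for score_metric in score_metrics:
            task_results.pop(score_metric, None)
            task_results.pop(score_metric.replace("score,", "score_stderr,"), None)

print(sorted(results["longbench_2wikimqa"]))
# ['alias', 'qa_f1_score,none', 'qa_f1_score_stderr,none']
```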

Lines changed: 7 additions & 3 deletions

```diff
@@ -1,21 +1,25 @@
 tag:
-  - longbench
+  - longbench_multi_tasks
+  - longbench_tasks
 task: longbench_2wikimqa
 dataset_path: Xnhyacinth/LongBench
 test_split: test
 dataset_name: 2wikimqa
 doc_to_text: "Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n{{context}}\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\n\nQuestion: {{question}}\nAnswer:"
 doc_to_target: '{{answers}}'
-process_results: !function metrics.get_qa_f1_score
+process_results: !function metrics.get_qa_f1_with_score
 generation_kwargs:
   max_gen_toks: 32
   temperature: 1
   do_sample: False
   until: []
 metric_list:
+  - metric: "score"
+    aggregation: mean
+    higher_is_better: True
   - metric: "qa_f1_score"
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 4.0
+  version: 5.0
```
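The config above swaps `metrics.get_qa_f1_score` for `metrics.get_qa_f1_with_score`, which, per the commit messages, preserves the original F1 scoring while also emitting a uniform `score` metric for the new groups to aggregate. The actual helper in `lm_eval/tasks/longbench/metrics.py` is not part of this excerpt; the sketch below is an assumed shape, with `_token_f1` as a simplified stand-in for the real QA F1 computation:

```python
from collections import Counter


def _token_f1(prediction: str, ground_truth: str) -> float:
    """Simplified token-level F1 between a prediction and one reference."""
    pred_tokens = prediction.split()
    gt_tokens = ground_truth.split()
    num_same = sum((Counter(pred_tokens) & Counter(gt_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)


def get_qa_f1_with_score(doc: dict, results: list) -> dict:
    """Assumed shape of the renamed process_results hook: report the
    task-specific metric plus a shared "score" key for group aggregation."""
    prediction = results[0]
    f1 = max(_token_f1(prediction, gt) for gt in doc["answers"])
    return {"qa_f1_score": f1, "score": f1}
```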
Lines changed: 7 additions & 3 deletions

```diff
@@ -1,21 +1,25 @@
 tag:
-  - longbench_e
+  - longbench_multi_tasks_e
+  - longbench_tasks_e
 task: longbench_2wikimqa_e
 dataset_path: Xnhyacinth/LongBench
 test_split: test
 dataset_name: 2wikimqa_e
 doc_to_text: "Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n{{context}}\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\n\nQuestion: {{question}}\nAnswer:"
 doc_to_target: '{{answers}}'
-process_results: !function metrics.get_qa_f1_score
+process_results: !function metrics.get_qa_f1_with_score
 generation_kwargs:
   max_gen_toks: 32
   temperature: 1
   do_sample: False
   until: []
 metric_list:
+  - metric: "score"
+    aggregation: mean
+    higher_is_better: True
   - metric: "qa_f1_score"
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 4.0
+  version: 5.0
```

lm_eval/tasks/longbench/README.md

Lines changed: 16 additions & 11 deletions

````diff
@@ -26,23 +26,28 @@ Homepage: `https://github.com/THUDM/LongBench`
     pages = "3119--3137",
 }
 ```
-### Notes
-
-#### Tasks without Chat Template (with add_bos_token=True but model dependent)
-
-The original implementation suggest not to use `chat_template` for these tasks (for instruct models):
-- longbench_lcc
-- longbench_repobench-p
-- longbench_samsum
-- longbench_trec
-- longbench_triviaqa
+> [!NOTE]
+> The original implementation suggest not to use `chat_template` for these tasks for instruct models (with add_bos_token=True but model dependent):
+> - longbench_fewshot
+> - longbench_trec
+> - longbench_triviaqa
+> - longbench_samsum
+> - longbench_lsht
+> - longbench_code
+> - longbench_lcc
+> - longbench_repobench-p


 ### Groups, Tags, and Tasks

 #### Groups

-[//]: # (* `group_name`: `Short description`)
+* `longbench_single`: Single-Document QA tasks requiring comprehension of individual documents
+* `longbench_multi`: Multi-Document QA tasks requiring information synthesis across multiple documents
+* `longbench_summarization`: Summarization tasks for long documents and conversations
+* `longbench_fewshot`: Few-shot learning tasks with in-context examples
+* `longbench_synthetic`: Synthetic tasks including passage retrieval and counting
+* `longbench_code`: Code completion tasks for long code contexts

 #### Tags
````
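With the groups documented, a quick way to exercise the new configs is the harness's Python entry point; the model and `limit` below are placeholders for a smoke test, not a recommended setup:

```python
# Illustrative smoke test of the new LongBench groups.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",  # placeholder model
    tasks=["longbench_code", "longbench_fewshot"],
    limit=2,  # a couple of documents per subtask, just to exercise the configs
)
print(results["results"])
```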

lm_eval/tasks/longbench/_generate_config.py

Lines changed: 1 addition & 1 deletion

```diff
@@ -211,7 +211,7 @@ def parse_args():
         "generation_kwargs": generation_kwargs,
         "has_newline": has_newline,  # Add the flag to the template context
         "metric_list": metric_list,
-        "metadata": {"version": "4.0"},
+        "metadata": {"version": "5.0"},
     }

     # Render template
```
Lines changed: 13 additions & 0 deletions (new file)

```yaml
group: longbench
task:
  - longbench_code
  - longbench_fewshot
  - longbench_multi
  - longbench_single
  - longbench_summarization
  - longbench_synthetic
aggregate_metric_list:
  - metric: score
    weight_by_size: False
metadata:
  version: 0.0
```
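In these group configs, `weight_by_size: False` makes the group score a plain mean of its subtasks' scores rather than a mean weighted by each subtask's sample count. A quick illustration with made-up numbers:

```python
# Made-up subtask scores and sizes to contrast the two aggregation modes.
subtasks = [
    {"score": 0.40, "size": 200},   # e.g. a small code subtask
    {"score": 0.25, "size": 4750},  # e.g. a large QA subtask
]

unweighted = sum(t["score"] for t in subtasks) / len(subtasks)
weighted = sum(t["score"] * t["size"] for t in subtasks) / sum(
    t["size"] for t in subtasks
)
print(f"weight_by_size: False -> {unweighted:.4f}")  # 0.3250
print(f"weight_by_size: True  -> {weighted:.4f}")    # 0.2561
```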
Lines changed: 10 additions & 0 deletions (new file)

```yaml
group: longbench_code
group_alias: "Code Completion"
task:
  - longbench_lcc
  - longbench_repobench-p
aggregate_metric_list:
  - metric: score
    weight_by_size: False
metadata:
  version: 0.0
```
Lines changed: 10 additions & 0 deletions (new file)

```yaml
group: longbench_code_e
group_alias: "Code Completion (LongBench-E)"
task:
  - longbench_lcc_e
  - longbench_repobench-p_e
aggregate_metric_list:
  - metric: score
    weight_by_size: False
metadata:
  version: 0.0
```
Lines changed: 11 additions & 0 deletions (new file)

```yaml
group: longbench_e
task:
  - longbench_code_e
  - longbench_fewshot_e
  - longbench_multi_e
  - longbench_single_e
aggregate_metric_list:
  - metric: score
    weight_by_size: False
metadata:
  version: 0.0
```
Lines changed: 12 additions & 0 deletions (new file)

```yaml
group: longbench_fewshot
group_alias: "Few-shot Learning"
task:
  - longbench_trec
  - longbench_triviaqa
  - longbench_samsum
  - longbench_lsht
aggregate_metric_list:
  - metric: score
    weight_by_size: False
metadata:
  version: 0.0
```
