Dataset not updated after issue #15: released CSV still appears to be generated by old code #17

Hello, and thank you for releasing this dataset and the accompanying paper. Both are very helpful for our research.

I noticed that in issue #15, a duplication problem in paraphrased_question was already identified, and updated data generation code was provided in the repository. However, it seems that the released dataset itself has not been updated accordingly.

Specifically, in unified_kg_cron_questions_all.csv, the column paraphrased_question still contains a large number of duplicated values.
I deduplicated the dataset on this field and found that the issue is particularly severe for complex questions: after deduplication, the number of usable complex questions drops from over 3,000 to just over 30.
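
A check along the following lines should reproduce these counts. It is a minimal sketch using pandas, not the exact script I ran. The file name and the paraphrased_question column come from the released CSV; the question_level column used to separate simple from complex questions is a hypothetical placeholder and needs to be replaced with whatever column actually encodes question type.

```python
import pandas as pd


def duplication_summary(df, text_col="paraphrased_question", group_col=None):
    """Count rows and distinct values of text_col, optionally per group.

    group_col is optional; "question_level" in the usage below is an
    assumed column name, not one confirmed from the released CSV.
    """
    if group_col is None:
        rows = len(df)
        unique = df[text_col].nunique(dropna=False)
        return pd.DataFrame(
            {"rows": [rows], "unique": [unique], "duplicates": [rows - unique]}
        )

    grouped = df.groupby(group_col)[text_col]
    summary = pd.DataFrame(
        {"rows": grouped.size(), "unique": grouped.nunique(dropna=False)}
    )
    # Rows that would be dropped by deduplicating on text_col within each group.
    summary["duplicates"] = summary["rows"] - summary["unique"]
    return summary


# Usage against the released file (column name for question type is assumed):
# df = pd.read_csv("unified_kg_cron_questions_all.csv")
# print(duplication_summary(df))
# print(duplication_summary(df, group_col="question_level"))
```

The "unique" column is the number of rows that survive deduplication on paraphrased_question, so a large gap between "rows" and "unique" for the complex group would match what I am seeing.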

This indicates that the released CSV was likely generated using an older version of the data generation code, rather than the updated one introduced after issue #15.

Are there any plans to release a regenerated version of the dataset built with the corrected data generation code introduced after issue #15?
