Description
Hello, thank you for releasing this dataset and the accompanying paper — it is very helpful for our research.
I noticed that in issue #15, a duplication problem in `paraphrased_question` was already identified, and updated data generation code was provided in the repository. However, it seems that the released dataset itself has not been updated accordingly.
Specifically, in `unified_kg_cron_questions_all.csv`, the column `paraphrased_question` still contains a large number of duplicated values.
I deduplicated the dataset on this field and found that the issue is particularly severe for complex questions: after deduplication, the number of usable complex questions drops from over 3,000 to just over 30.
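For anyone who wants to reproduce this kind of check, here is a minimal sketch using only the Python standard library. It counts rows and distinct `paraphrased_question` values per question type. The `question_type` column name and the inline sample rows are assumptions made for illustration; the real CSV may label the simple/complex distinction differently.

```python
import csv
import io
from collections import Counter


def count_before_after_dedup(rows, text_col="paraphrased_question",
                             group_col="question_type"):
    """Return {group: (total_rows, distinct_text_values)}.

    `group_col` is a hypothetical column name; adjust it to whatever
    column in the CSV distinguishes simple from complex questions.
    """
    totals = Counter()
    distinct = {}
    for row in rows:
        group = row[group_col]
        totals[group] += 1
        # Strip surrounding whitespace so trivially different copies match.
        distinct.setdefault(group, set()).add(row[text_col].strip())
    return {g: (totals[g], len(distinct[g])) for g in totals}


# Made-up rows standing in for the real file, just to show the shape.
sample = io.StringIO(
    "question_type,paraphrased_question\n"
    "complex,Who held the office first?\n"
    "complex,Who held the office first?\n"
    "complex,Who held the office last?\n"
    "simple,When did it start?\n"
)
stats = count_before_after_dedup(csv.DictReader(sample))
for group, (total, unique) in stats.items():
    print(f"{group}: {total} rows, {unique} distinct paraphrased questions")
```

To run it against the released file, replace `sample` with `open("unified_kg_cron_questions_all.csv", newline="", encoding="utf-8")` and pass the resulting `csv.DictReader` to the same function.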
The scale of the duplication suggests that the released CSV was generated with an older version of the data generation code, rather than the updated one introduced after issue #15.
Are there any plans to release an updated version of the dataset, regenerated with the corrected data generation code introduced after issue #15?