Dataset not updated after issue #15: released CSV still appears to be generated by old code #17

@ULun666

Hello, thank you for releasing this dataset and the accompanying paper; they are very helpful for our research.

I noticed that in issue #15, a duplication problem in paraphrased_question was already identified, and updated data generation code was provided in the repository. However, it seems that the released dataset itself has not been updated accordingly.

Specifically, in unified_kg_cron_questions_all.csv, the column paraphrased_question still contains a large number of duplicated values.
I deduplicated the dataset on this field and found that the issue is particularly severe for complex questions: after deduplication, the number of usable complex questions drops from over 3,000 to roughly 30.
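A duplicate check of this kind can be sketched as follows. This is a minimal sketch using only the Python standard library, not the exact script used for the numbers above. The `question_type` column name and the helper names `dedup_counts` and `dedup_counts_from_csv` are assumptions for illustration; only `paraphrased_question` and the CSV filename come from the released dataset.

```python
import csv
from collections import Counter, defaultdict


def dedup_counts(rows, text_col="paraphrased_question", type_col="question_type"):
    """Return {question_type: (total_rows, unique_texts)} for an iterable of dict rows.

    `type_col` is a hypothetical column name; rows without it are grouped
    under "unknown".
    """
    totals = Counter()
    uniques = defaultdict(set)
    for row in rows:
        qtype = row.get(type_col) or "unknown"
        # Treat missing text as an empty string and ignore surrounding whitespace.
        text = (row.get(text_col) or "").strip()
        totals[qtype] += 1
        uniques[qtype].add(text)
    return {qtype: (totals[qtype], len(uniques[qtype])) for qtype in totals}


def dedup_counts_from_csv(path, **kwargs):
    """Read a CSV file with a header row and apply dedup_counts to it."""
    with open(path, newline="", encoding="utf-8") as f:
        return dedup_counts(csv.DictReader(f), **kwargs)


if __name__ == "__main__":
    counts = dedup_counts_from_csv("unified_kg_cron_questions_all.csv")
    for qtype, (total, unique) in sorted(counts.items()):
        print(f"{qtype}: {total} rows, {unique} unique paraphrased_question values")
```

If the question type is stored under a different column, pass it via `type_col`. A large gap between the total and unique counts for a given type is the symptom described here.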

This indicates that the released CSV was likely generated using an older version of the data generation code, rather than the updated one introduced after issue #15.

I would like to ask whether there are any plans to release an updated or regenerated version of the dataset using the corrected data generation code introduced after issue #15.
