Bug: get_c4 broken with datasets>=2.14 (allenai--c4 config removed) #87

@UriKialy

Title: get_c4 broken with datasets>=2.14 (allenai--c4 config removed)

Body:


Bug

get_c4() in lib/data.py fails with recent versions of the datasets library (>=2.14):

ValueError: BuilderConfig 'allenai--c4' not found. Available: ['en', 'en.noblocklist', ...]

The config name allenai--c4 with explicit data_files no longer works. The fix is to use the standard 'en' config with streaming:

Before (broken)

traindata = load_dataset('allenai/c4', 'allenai--c4',
                         data_files={'train': 'en/c4-train.00000-of-01024.json.gz'},
                         split='train')

After (fixed)

traindata = list(load_dataset('allenai/c4', 'en', split='train', streaming=True).take(10000))

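The same change applies to the validation split. In addition, valdata[:1100]['text'] needs to become [d['text'] for d in valdata], since wrapping the streamed split in list(...) gives a plain list of dicts, not a Dataset object that supports column indexing.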
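For reference, here is a minimal sketch of how both splits could be loaded this way. The helper names first_texts and load_c4_texts are hypothetical (they are not in lib/data.py), and the sample counts are just the ones mentioned above; the rest of get_c4() (tokenization, sampling) is left out.

```python
from itertools import islice


def first_texts(stream, n):
    """Return the 'text' field of the first n examples from an iterable of dicts."""
    return [example["text"] for example in islice(stream, n)]


def load_c4_texts(n_train=10000, n_val=1100):
    """Stream the 'en' config of allenai/c4 and keep only the first few examples."""
    # Imported lazily so first_texts() can be used without `datasets` installed.
    from datasets import load_dataset

    train = load_dataset("allenai/c4", "en", split="train", streaming=True)
    val = load_dataset("allenai/c4", "en", split="validation", streaming=True)
    return first_texts(train, n_train), first_texts(val, n_val)
```

Because streaming datasets are plain iterables, islice stops after n examples and nothing beyond the first shard(s) should need to be downloaded.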

Additional: position_embeddings in newer transformers

Layer forward calls in lib/prune.py fail with transformers>=4.45, because LlamaDecoderLayer.forward() now expects position_embeddings to be passed explicitly:

TypeError: cannot unpack non-iterable NoneType object

Fix: pre-compute the RoPE (cos, sin) embeddings once and pass them to each layer call as position_embeddings.
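A rough sketch of what that could look like, not the final patch. run_layer is a hypothetical helper, and it assumes the rotary module can be called as rotary_emb(hidden_states, position_ids) and returns a (cos, sin) pair (in recent transformers versions I believe this module is available as model.model.rotary_emb, but please double-check against the installed version).

```python
def run_layer(layer, hidden_states, attention_mask, position_ids, rotary_emb):
    """Call one decoder layer, supplying RoPE embeddings explicitly.

    `rotary_emb` is assumed to be a callable returning a (cos, sin) pair.
    """
    # Compute the (cos, sin) pair that the layer would otherwise receive as None.
    position_embeddings = rotary_emb(hidden_states, position_ids)
    out = layer(
        hidden_states,
        attention_mask=attention_mask,
        position_ids=position_ids,
        position_embeddings=position_embeddings,
    )
    # Some versions return a tuple (hidden_states, ...), others a bare tensor.
    return out[0] if isinstance(out, tuple) else out
```

In the pruning loop the existing layer(inps[j].unsqueeze(0), ...) style calls would go through this helper instead. Since position_ids are the same for every calibration sample of equal length, the rotary_emb call could also be hoisted out of the loop and its result reused.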

Environment

  • datasets==3.x
  • transformers==4.50+
  • Python 3.11

Happy to open a PR with the fixes if helpful.
