Title: get_c4 broken with datasets>=2.14 — allenai--c4 config removed
Body:
Bug
lib/data.py get_c4() fails with recent versions of the datasets library (>=2.14):
ValueError: BuilderConfig 'allenai--c4' not found. Available: ['en', 'en.noblocklist', ...]
The config name allenai--c4 with explicit data_files no longer works. The fix is to use the standard 'en' config with streaming:
Before (broken)
traindata = load_dataset('allenai/c4', 'allenai--c4',
    data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train')
After (fixed)
traindata = list(load_dataset('allenai/c4', 'en', split='train', streaming=True).take(10000))
The same change applies to the validation split. Also, valdata[:1100]['text'] needs to become [d['text'] for d in valdata], since a streaming dataset yields plain dicts rather than a sliceable Dataset object.
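A small helper makes that change explicit; this is a sketch (take_texts is an illustrative name, not a function in lib/data.py), with the real streaming call shown as assumed usage since it needs network access:

```python
from itertools import islice

def take_texts(stream, n):
    """Collect the 'text' field from the first n records of a
    streaming (iterable) dataset, which yields plain dicts."""
    return [record['text'] for record in islice(stream, n)]

# Assumed usage against the real dataset (requires network access):
# from datasets import load_dataset
# valdata = load_dataset('allenai/c4', 'en', split='validation', streaming=True)
# texts = take_texts(valdata, 1100)
```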
Additional: position_embeddings in newer transformers
lib/prune.py layer forward calls fail with transformers>=4.45 because LlamaDecoderLayer.forward() now requires position_embeddings as an explicit argument:
TypeError: cannot unpack non-iterable NoneType object
Fix: pre-compute RoPE embeddings and pass them to each layer call.
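A minimal sketch of that fix, assuming a Llama model loaded with transformers>=4.45 (where the shared rotary-embedding module lives at model.model.rotary_emb and returns a (cos, sin) tuple); the function and variable names here are illustrative, not taken from lib/prune.py:

```python
import torch

def forward_layers(model, inps, attention_mask=None):
    """Run hidden states through each decoder layer, passing pre-computed
    RoPE embeddings explicitly as required by newer transformers."""
    seqlen = inps.shape[1]
    position_ids = torch.arange(seqlen, device=inps.device).unsqueeze(0)
    # One shared rotary-embedding module computes (cos, sin) for all layers
    position_embeddings = model.model.rotary_emb(inps, position_ids)
    hidden = inps
    for layer in model.model.layers:
        hidden = layer(
            hidden,
            attention_mask=attention_mask,
            position_ids=position_ids,
            position_embeddings=position_embeddings,
        )[0]  # each layer returns a tuple; first element is the hidden state
    return hidden
```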
Environment
- datasets==3.x
- transformers==4.50+
- Python 3.11
Happy to open a PR with the fixes if helpful.