Bug: get_c4 broken with datasets>=2.14 (allenai--c4 config removed) #87

**Bug**

`get_c4()` in `lib/data.py` fails with recent versions of the `datasets` library (>=2.14):

```
ValueError: BuilderConfig 'allenai--c4' not found. Available: ['en', 'en.noblocklist', ...]
```

The config name `allenai--c4` with explicit `data_files` no longer works. The fix is to use the standard `'en'` config with streaming:

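```python
from datasets import load_dataset

# Before (broken): the 'allenai--c4' config no longer exists
traindata = load_dataset('allenai/c4', 'allenai--c4',
    data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train')

# Fix: stream the standard 'en' config and materialize the first 10000 examples
traindata = list(load_dataset('allenai/c4', 'en', split='train', streaming=True).take(10000))
```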

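The same change applies to the validation split. In addition, `valdata[:1100]['text']` needs to become `[d['text'] for d in valdata]`, because calling `list(...)` on a streaming dataset yields a plain list of dicts, not a `Dataset` object, so column-style indexing no longer works.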
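As a rough sketch of how both splits could share one code path: the helper names `take_texts` and `load_c4_texts` below are hypothetical and not part of `lib/data.py`. The sketch assumes that iterating a streaming dataset yields one dict per example with a `'text'` key, which is what the snippet above relies on.

```python
from itertools import islice
from typing import Iterable, List, Mapping


def take_texts(records: Iterable[Mapping[str, str]], n: int) -> List[str]:
    """Return the 'text' field of the first n records of any iterable of dicts."""
    return [rec["text"] for rec in islice(records, n)]


def load_c4_texts(split: str, n: int) -> List[str]:
    """Stream the C4 'en' config and keep the first n texts (needs network access)."""
    # Imported lazily so take_texts stays usable without the datasets package.
    from datasets import load_dataset

    stream = load_dataset("allenai/c4", "en", split=split, streaming=True)
    return take_texts(stream, n)
```

With this, `load_c4_texts("train", 10000)` and `load_c4_texts("validation", 1100)` would stand in for the two `load_dataset` calls. Note that this takes the first examples of the stream, which may not be the same sample the original single-shard `data_files` call produced.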

### Additional: `position_embeddings` in newer `transformers`

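The layer forward calls in `lib/prune.py` fail with `transformers>=4.45` because `LlamaDecoderLayer.forward()` now requires `position_embeddings` as an explicit argument:

```
TypeError: cannot unpack non-iterable NoneType object
```

The error is raised when the attention module tries to unpack `position_embeddings` into `(cos, sin)` and receives `None`.

**Fix**: pre-compute the RoPE embeddings once and pass them to each layer call as `position_embeddings`.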
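A minimal sketch of that fix follows. The helper name `build_layer_kwargs` is hypothetical. It assumes the model exposes a rotary embedding module (for Llama in recent `transformers`, `model.model.rotary_emb`) that is called as `rotary_emb(hidden_states, position_ids)` and returns a `(cos, sin)` tuple; check this against the installed version before relying on it.

```python
from typing import Any, Callable, Dict, Optional


def build_layer_kwargs(
    rotary_emb: Callable[[Any, Any], Any],
    hidden_states: Any,
    position_ids: Any,
    attention_mask: Optional[Any] = None,
) -> Dict[str, Any]:
    """Compute the RoPE (cos, sin) pair once and bundle the per-layer kwargs."""
    # Assumption: rotary_emb(hidden_states, position_ids) returns (cos, sin).
    position_embeddings = rotary_emb(hidden_states, position_ids)
    return {
        "attention_mask": attention_mask,
        "position_ids": position_ids,
        "position_embeddings": position_embeddings,
    }


# Hypothetical usage inside a pruning loop (variable names are assumptions):
#   kwargs = build_layer_kwargs(model.model.rotary_emb, inps, position_ids, attention_mask)
#   outs[j] = layer(inps[j].unsqueeze(0), **kwargs)[0]
```

Computing the embeddings once outside the loop avoids repeating the same work for every layer and calibration sample, since they depend only on `position_ids`.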

### Environment
- `datasets==3.x`
- `transformers==4.50+`
- Python 3.11

Happy to open a PR with the fixes if helpful.
