
feat: safetensors hyperparameter extraction with GGUF parity #70

Open

afogel wants to merge 2 commits into GenAI-Security-Project:v0.2 from afogel:safetensors_extract_hyperparameters

Conversation

@afogel (Contributor) commented Mar 9, 2026

Summary

  • Extracts hyperparameters from safetensors repos by combining config.json (parsed with llama.cpp's find_hparam key fallback chains), tokenizer_config.json, and safetensors tensor headers via huggingface_hub.get_safetensors_metadata()
  • Deduplicates config.json parsing into a canonical config_parsing.py module shared by both safetensors_metadata.py and extractor.py
  • Safetensors takes precedence over GGUF in default_extractors() since safetensors is the original source format that GGUF is derived from
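
The fallback-chain idea above can be sketched as follows. This is a hypothetical illustration of the config_parsing.py approach, not the PR's actual code; the hparam names and fallback chains shown are examples in the llama.cpp style, and the text_config merge mirrors the VLM handling described below:

```python
# Hypothetical sketch: llama.cpp-style key fallback chains over config.json.
# The chains below are illustrative, not the module's actual HPARAM_KEYS.
HPARAM_KEYS = {
    "n_embd": ["hidden_size", "n_embd", "d_model"],
    "n_layer": ["num_hidden_layers", "n_layer", "num_layers"],
    "n_head": ["num_attention_heads", "n_head"],
    "n_ctx": ["max_position_embeddings", "n_ctx", "n_positions"],
}

def find_hparam(config: dict, keys: list, optional: bool = True):
    """Return the first key in the chain present in config (llama.cpp style)."""
    for key in keys:
        if key in config:
            return config[key]
    if optional:
        return None
    raise KeyError(f"none of {keys} found in config")

def parse_config(config: dict) -> dict:
    # VLM configs nest the language model under "text_config"; merge it up
    # so the same fallback chains work for multimodal repos.
    merged = {**config, **config.get("text_config", {})}
    return {name: find_hparam(merged, chain) for name, chain in HPARAM_KEYS.items()}
```

Because the chains only read keys, the same function applies to any HF repo with a config.json, regardless of weight format.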

Changes

| File | Purpose |
| --- | --- |
| src/models/config_parsing.py | Canonical HPARAM_KEYS + parse_config() — llama.cpp key fallback chains, VLM text_config merge |
| src/models/safetensors_metadata.py | SafetensorsModelInfo, map_to_metadata(), fetch_safetensors_metadata() — config + tokenizer + tensor headers |
| src/models/model_file_extractors.py | SafetensorsFileExtractor + reorder default_extractors() (safetensors first) |
| src/models/extractor.py | Wire _build_hyperparameters_from_config() into _try_config_extraction for the hyperparameter field |
| tests/test_safetensors_metadata.py | 40 tests: config parsing, metadata mapping, tensor extraction, HF Hub integration, extractor wiring, precedence, fixture end-to-end |
| tests/test_model_file_extraction.py | Safetensors integration tests + fix pre-existing monkeypatch targets |
| tests/test_hyperparameter_wiring.py | Hyperparameter field flows through EnhancedExtractor pipeline |
| tests/fixtures/__init__.py | build_safetensors_fixture() — real safetensors binary for end-to-end tests |
| pyproject.toml | Add safetensors>=0.4.0 runtime dep, dev deps to [dependency-groups] |
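
The extractor reordering can be illustrated with a minimal first-match registry. This is a hedged sketch, not the repo's actual API: the class names echo the PR but the can_handle() protocol is an assumption made for illustration:

```python
# Hypothetical sketch of first-match extractor dispatch: the first extractor
# whose can_handle() returns True wins, so safetensors must come before GGUF.
class SafetensorsFileExtractor:
    def can_handle(self, filename: str) -> bool:
        return filename.endswith(".safetensors")

class GGUFFileExtractor:
    def can_handle(self, filename: str) -> bool:
        return filename.endswith(".gguf")

def default_extractors():
    # Safetensors first: it is the original source format GGUF derives from.
    return [SafetensorsFileExtractor(), GGUFFileExtractor()]

def pick_extractor(filename: str):
    for extractor in default_extractors():
        if extractor.can_handle(filename):
            return extractor
    return None
```

With this ordering, a repo shipping both formats is extracted from its safetensors files by default.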

Design decisions

  • config.json is format-agnostic: The same parse_config() works for safetensors, GGUF, pytorch — any HF repo with a config.json. Extracted to a shared module to avoid duplication.
  • Header-only reading: Uses huggingface_hub.get_safetensors_metadata() (100KB range request) — never downloads full model weights. config.json and tokenizer_config.json are small JSON files fetched via hf_hub_download().
  • GGUF parity: Safetensors extractor now produces equivalent metadata fields: model_type, typeOfModel, vocab_size, context_length, tokenizer_class, hyperparameter dict (including rope_dimension_count), plus safetensors_total_parameters.
  • No changes to downstream: Output uses the same hyperparameter dict format that extractor.py and service.py already handle generically.
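
The header-only reading and the fixture builder both rest on the safetensors file layout: an 8-byte little-endian header length, a JSON header mapping tensor names to dtype/shape/byte offsets, then the raw data. A self-contained sketch (assumed names, not the PR's build_safetensors_fixture()) that builds such a blob and derives a total parameter count from the header alone, without touching the weight bytes:

```python
import json
import math
import struct

def make_safetensors_blob(shapes: dict) -> bytes:
    """Build a minimal valid safetensors blob with zero-filled F32 tensors."""
    header, offset = {}, 0
    for name, shape in shapes.items():
        nbytes = 4 * math.prod(shape)  # F32 = 4 bytes per element
        header[name] = {"dtype": "F32", "shape": shape,
                        "data_offsets": [offset, offset + nbytes]}
        offset += nbytes
    header_json = json.dumps(header).encode("utf-8")
    # Layout: u64-LE header length, JSON header, then tensor data.
    return struct.pack("<Q", len(header_json)) + header_json + b"\x00" * offset

def total_parameters(blob: bytes) -> int:
    """Sum tensor element counts by reading only the JSON header."""
    (header_len,) = struct.unpack("<Q", blob[:8])
    header = json.loads(blob[8:8 + header_len])
    return sum(math.prod(t["shape"])
               for name, t in header.items() if name != "__metadata__")
```

In the real flow the header comes from huggingface_hub.get_safetensors_metadata() via an HTTP range request, which is why full weights are never downloaded.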

Test plan

  • uv run pytest tests/test_safetensors_metadata.py -v — 40 tests
  • uv run pytest tests/test_model_file_extraction.py -v — 17 tests (including new safetensors + precedence tests)
  • uv run pytest tests/test_hyperparameter_wiring.py -v — 8 tests
  • uv run pytest tests/test_gguf_metadata.py -v — 18 tests (no regressions)
  • All 83 tests pass

eaglei15 and others added 2 commits March 1, 2026 12:20
Extract hyperparameters from safetensors repos by combining config.json
(using llama.cpp's find_hparam key fallback chains), tokenizer_config.json,
and safetensors tensor headers. Safetensors takes precedence over GGUF as
the original source format.

- Add config_parsing.py as canonical home for HPARAM_KEYS and parse_config()
- Add safetensors_metadata.py with SafetensorsModelInfo, map_to_metadata(),
  fetch_safetensors_metadata() (config.json + tokenizer + tensor headers)
- Add SafetensorsFileExtractor to model_file_extractors.py
- Wire hyperparameter extraction into EnhancedExtractor via _try_config_extraction
- Add safetensors>=0.4.0 as runtime dependency
- 83 tests covering config parsing, metadata mapping, tensor extraction,
  HF Hub integration, extractor wiring, precedence, and fixture end-to-end
