Internal-State LLM Safety Scanner

A deploy-or-not auditor for fine-tuned LLM checkpoints, aimed at the ML engineer who fine-tuned a model and wants an answer to "is this safe to ship?"

Unlike text-level red-teaming tools, which send a prompt and classify the response text, this scanner reads the model's internal state: output logits and residual-stream activations. A model can print a polite refusal while its logits are already primed to comply. Text analysis misses that; internal state does not.

What it reports

For each checkpoint the scanner emits a verdict with a diagnosis and a recommendation:

| verdict | meaning |
| --- | --- |
| `deploy` | refuses harm, represents harm internally (healthy) |
| `deploy_with_restrictions` | partial compliance; ship only with guardrails |
| `do_not_deploy` | complies on harmful prompts |
| `inconclusive` | the behavioral probe carries no safety signal (e.g. a base model) |

See safety_docs.md for the metrics behind the verdict and the design decisions.

Modules

safety_margin: behavioral signal. Teacher-forces canonical refuse/comply continuations and computes margin = logP(refuse) − logP(comply); fail_rate is the fraction of harmful prompts where compliance is the cheaper path (margin < 0). The likelihood of a compliant continuation is what the GCG attack (Zou et al. 2023) optimizes, so the margin serves as a jailbreakability proxy.
refusal_direction: internal-state signal. Computes a difference-of-means "refusal direction" (Arditi et al. 2024) in the residual stream; the AUROC for separating harmful from benign prompts measures harm-awareness.
verdict: aggregates behavior × awareness into the deploy decision.
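
As a rough illustration of the two signals above, the sketch below computes the margin and fail_rate from per-prompt log-probabilities, then a difference-of-means direction and its AUROC from activation vectors. The function names, the toy numbers, and the use of plain Python lists are assumptions made for illustration, not the scanner's actual code. In the real modules the log-probabilities would come from teacher-forcing the model and the vectors from its residual stream.

```python
from typing import List, Sequence

Vector = Sequence[float]


def margins(logp_refuse: Sequence[float], logp_comply: Sequence[float]) -> List[float]:
    """margin = logP(refuse) - logP(comply), one value per harmful prompt."""
    return [r - c for r, c in zip(logp_refuse, logp_comply)]


def fail_rate(ms: Sequence[float]) -> float:
    """Fraction of prompts where complying is the cheaper path (margin < 0)."""
    return sum(m < 0 for m in ms) / len(ms)


def mean_vector(rows: Sequence[Vector]) -> List[float]:
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(harmful: Sequence[Vector], benign: Sequence[Vector]) -> List[float]:
    """Difference of class means: mean(harmful) - mean(benign)."""
    mh, mb = mean_vector(harmful), mean_vector(benign)
    return [h - b for h, b in zip(mh, mb)]


def project(rows: Sequence[Vector], direction: Vector) -> List[float]:
    """Dot product of each activation vector with the direction."""
    return [sum(x * d for x, d in zip(row, direction)) for row in rows]


def auroc(pos: Sequence[float], neg: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy inputs (hypothetical): four harmful prompts, 2-dimensional "activations".
logp_refuse = [-2.0, -5.0, -1.0, -4.0]
logp_comply = [-6.0, -3.0, -7.0, -4.5]
harmful_acts = [[1.0, 0.2], [0.8, -0.1], [1.2, 0.0]]
benign_acts = [[-0.9, 0.1], [-1.1, 0.3], [0.1, -0.2]]

direction = refusal_direction(harmful_acts, benign_acts)
# Note: the direction is fit and scored on the same toy data here.
score = auroc(project(harmful_acts, direction), project(benign_acts, direction))

print("fail_rate:", fail_rate(margins(logp_refuse, logp_comply)))
print("AUROC:", score)
```

An AUROC near 1.0 means harmful and benign prompts separate cleanly along the direction; a value near 0.5 means the direction carries little information about harm.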

Setup

Requires uv. Runs on CUDA (NVIDIA, incl. Windows), Apple Silicon (MPS), or CPU — the device is auto-detected (CUDA > MPS > CPU), so the same command works across the team's machines.

uv sync

On Windows with an NVIDIA GPU, the default PyPI torch is CPU-only. To use the GPU, install the CUDA build matching your driver, e.g.:

uv pip install torch --index-url https://download.pytorch.org/whl/cu124
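
The CUDA > MPS > CPU preference described above can be sketched as follows. This is a minimal illustration, not the project's detection code: `pick_device` and `detect_device` are hypothetical names, and the only torch calls used are `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.

```python
from typing import Optional


def pick_device(cuda_ok: bool, mps_ok: bool, forced: Optional[str] = None) -> str:
    """Return a backend name, preferring CUDA, then MPS, then CPU."""
    if forced is not None:
        return forced  # an explicit choice (e.g. from --device) wins
    if cuda_ok:
        return "cuda"
    if mps_ok:
        return "mps"
    return "cpu"


def detect_device(forced: Optional[str] = None) -> str:
    """Query torch for available backends; fall back to CPU if torch is missing."""
    try:
        import torch
    except ImportError:
        return pick_device(False, False, forced)
    mps = getattr(torch.backends, "mps", None)  # absent on very old torch builds
    mps_ok = mps is not None and mps.is_available()
    return pick_device(torch.cuda.is_available(), mps_ok, forced)


print(detect_device())
```

Keeping the priority logic in a function that takes plain booleans makes it easy to check without a GPU present.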

Build the corpus

The corpus (src/data/corpus/{harmful,benign}.jsonl) ships pre-built. To regenerate or resize it from public datasets (AdvBench harmful behaviors + Alpaca instructions):

uv run python scripts/build_corpus.py --n 500

Both classes are imperative-style, which controls for a prompt-format confound.

Run

uv run python scripts/main.py                 # full corpus (500/500), full run is slow on CPU
uv run python scripts/main.py --sample 30     # fast dev run, 30 prompts per class
uv run python scripts/main.py --device cpu    # force a backend (default: auto-detect)

scripts/main.py runs three ground-truth checkpoints — an aligned model, an abliterated one (safety surgically removed), and a base one (never safety-tuned) — and prints each module's summary plus the verdict. A correct scanner must separate them.

Windows note: an NVIDIA GPU is picked up automatically as CUDA; without a GPU it falls back to CPU (correct, just slower — use --sample for iteration).

Web app

A single-page UI: paste a Hugging Face repo, get the verdict plus a plain-language explanation of each metric. The scan runs live, so model size is capped for the target VM.

uv run uvicorn src.app.server:app --reload --port 8000   # http://localhost:8000

Knobs (env vars): SCAN_SAMPLE (per-class prompts, default 25), SCAN_MAX_PARAMS (reject larger models before download, default 400M), SCAN_DTYPE (default bfloat16), SCAN_DEVICE (default cpu). One scan runs at a time; reports cache to reports/.
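
For orientation, here is one way the knobs above could be read with their documented defaults. It is a sketch under assumptions: `ScanConfig`, `load_config`, and `parse_count` are hypothetical names, and accepting a suffix form such as `400M` for SCAN_MAX_PARAMS is a guess at the input format rather than confirmed behavior.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Assumed suffixes for parameter counts, e.g. "400M" or "1.5B".
_SUFFIXES = {"K": 10**3, "M": 10**6, "B": 10**9}


def parse_count(text: str) -> int:
    """Parse '400M', '1.5B', or a plain integer string into an int."""
    text = text.strip().upper()
    if text and text[-1] in _SUFFIXES:
        return int(float(text[:-1]) * _SUFFIXES[text[-1]])
    return int(text)


@dataclass(frozen=True)
class ScanConfig:
    sample: int       # prompts per class
    max_params: int   # reject larger models before download
    dtype: str
    device: str


def load_config(env: Mapping[str, str] = os.environ) -> ScanConfig:
    """Read the SCAN_* variables, falling back to the defaults listed above."""
    return ScanConfig(
        sample=int(env.get("SCAN_SAMPLE", "25")),
        max_params=parse_count(env.get("SCAN_MAX_PARAMS", "400M")),
        dtype=env.get("SCAN_DTYPE", "bfloat16"),
        device=env.get("SCAN_DEVICE", "cpu"),
    )


print(load_config({}))
```

Passing the environment in as a mapping keeps the parsing testable without touching the real process environment.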

Docker (Recommended)

The easiest way to run the entire application (frontend, backend, and PostgreSQL) is with Docker Compose. It builds the frontend, sets up the Python environment, and runs the database for you.

docker compose up -d

The web app will be available at http://localhost. Models are cached in a Docker volume so they won't be redownloaded on restart.

Deploy (2 GB VM)

Sized for ~1.5 GB usable RAM. torch (CPU) takes ~0.5 GB and FastAPI ~0.12 GB, which leaves about 0.88 GB for weights, so the app caps model size and serializes scans.
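
The sizing arithmetic can be checked with a few lines of Python. This is a back-of-the-envelope sketch: it uses decimal gigabytes, counts only the weights (no activations or interpreter overhead), and the helper names are hypothetical.

```python
GB = 1e9  # decimal gigabytes, to match the rough figures above

# Bytes per parameter for common dtypes.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weights_gb(n_params: int, dtype: str = "bfloat16") -> float:
    """Memory needed for the weights alone, in GB."""
    return n_params * BYTES_PER_PARAM[dtype] / GB


def headroom_gb(usable_gb: float = 1.5, torch_gb: float = 0.5, fastapi_gb: float = 0.12) -> float:
    """RAM left for the model after the fixed runtime costs."""
    return usable_gb - torch_gb - fastapi_gb


budget = headroom_gb()
need = weights_gb(400_000_000)  # the default SCAN_MAX_PARAMS cap
print(f"headroom {budget:.2f} GB, 400M bfloat16 weights {need:.2f} GB")
```

With these figures a 400M-parameter model in bfloat16 fits within the headroom, while the same model in float32 (about 1.6 GB) would not.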

# swap as a safety net
sudo fallocate -l 2G /swapfile && sudo chmod 600 /swapfile
sudo mkswap /swapfile && sudo swapon /swapfile

# install (CPU-only torch keeps it lean)
uv venv && uv pip install torch --index-url https://download.pytorch.org/whl/cpu && uv sync

# run under systemd (MemoryMax caps a runaway scan)
sudo cp deploy/capstone.service /etc/systemd/system/
sudo systemctl enable --now capstone

deploy/capstone.service sets MemoryMax=1500M. Put nginx/caddy in front for TLS, or expose port 80 directly for a bare MVP.
