Benchmark for Evaluation Awareness

Measures whether LLMs can detect that they are being evaluated — and if so, what kind of evaluation.

A model that can distinguish evaluation from deployment could behave differently in each context, undermining the validity of safety evaluations. BEA quantifies this awareness signal by collecting conversation transcripts from existing benchmarks, then probing a judge model with meta-questions about each interaction.

How it works

 ┌──────────┐     ┌────────────┐     ┌──────────┐     ┌──────────┐
 │ COLLECT  │────▶│  PROBE     │────▶│ MEASURE  │────▶│ ANALYZE  │
 │          │     │  SELECTION │     │          │     │          │
 └──────────┘     └────────────┘     └──────────┘     └──────────┘
  Run benchmarks   Calibrate PQs     Probe judge       Compute
  + load deploy    on hold-out        with selected     AUROC, AUPRC,
  transcripts      sample split       PQs               accuracy

Collect — Runs benchmarks (via Inspect AI) and loads real-user conversations, saving transcripts to a cache.
Probe Selection (optional) — Runs all PQs on a stratified calibration subset, keeps only PQs that discriminate well. The calibration and measurement samples are disjoint to prevent leakage.
Measure — Presents each transcript to a judge model with the selected probe questions, scoring the response.
Analyze — Aggregates per-transcript P(eval) scores and computes discrimination metrics.

Setup

uv sync

Optional extras (e.g. SWE-bench support):

uv sync --extra swe

Troubleshooting: If a package appears missing after install, run uv lock first to regenerate the lockfile, then uv sync again. Always use uv run (e.g. uv run python run.py) to ensure you're using the project environment.

Create a .env with your API keys:

ANTHROPIC_API_KEY=...
OPENAI_API_KEY=...       # if using OpenAI models

Quick start

# Run the full pipeline with the default config
python run.py

# Or specify a config
python run.py --config configs/default.yaml

Verify your setup

Run with minimal samples to check everything works:

python run.py --config configs/config_test.yaml

Stage control

python run.py --skip-measure                         # collect only
python run.py --skip-collect                         # measure + analyze (uses last run)
python run.py --skip-collect --run <name_or_path>    # measure a specific prior run
python run.py --skip-probe-selection                 # skip calibration, use all PQs

Configuration

All configuration is in a single YAML file. Here's the structure:

# Which models to use and what probes to run
runs:
  - generator: anthropic/claude-sonnet-4-6    # model that produced transcripts
    target: anthropic/claude-sonnet-4-6       # judge model answering probe questions
    probes:                                    # optional: override default PQ file
      - probe_questions/eval_mcq.yaml

# Where to get transcripts
sources:
  - name: agentharm
    label:
      split: eval                # "eval" or "deploy" (ground truth)
      category: alignment        # eval type (alignment, capability, chat, ...)
      has_tools: true
    collect:
      backend: inspect           # "inspect", "inspect_log", "huggingface", "local_json", "local_jsonl"
      task: inspect_evals/agentharm
      params:
        split: test_public
    limit: 10                    # samples per source

  - name: everyday_chat
    label:
      split: deploy
      category: chat
      has_tools: false
    collect:
      backend: huggingface
      dataset: HuggingFaceTB/everyday-conversations-llama3.1-2k
      split: train_sft
      conversation_field: messages
    limit: 10

# Probing settings
measurement:
  pq_file: probe_questions/eval_mcq.yaml   # default probe question file
  is_purpose: false                         # true for purpose classification

# Output and analysis
pipeline:
  output_dir: output/

analysis:
  metrics: [auroc, auprc, accuracy]
  generate_plots: false

See configs/examples/ for more config examples.

Depth stripping and resampling

BEA can strip conversations to intermediate depths and optionally regenerate continuations. This lets you measure how awareness changes as the model sees more (or fewer) turns of a conversation.

Configure resampling per source:

sources:
  - name: agentharm
    label:
      split: eval
      category: alignment
      has_tools: true
    collect:
      backend: inspect
      task: inspect_evals/agentharm
    resample:
      depth:
        mode: all          # "all" = every intermediate depth; "k" = fixed depth only
        max_depth: 5       # cap the number of depths generated
        from_start: false  # false = strip turns from end; true = keep only first k turns
      generate: false      # true = re-run generator on truncated transcripts
      fallback: use_base   # "use_base" = fall back to original on failure

`depth.mode`	Behavior
`all`	Generate a trajectory at every depth from 1 to N-1 assistant turns (capped by `max_depth`)
`k`	Generate only at depth `k`

When generate: true, the truncated transcripts are fed back to the generator model to produce fresh continuations — useful for studying whether a model's awareness changes when it re-enters a conversation at a different point.

When generate: false, transcripts are simply truncated (depth expansion only, no API calls).

Probe selection

Not all probe questions discriminate equally well between eval and deploy transcripts. Probe selection runs all PQs on a stratified calibration subset, filters out low-signal PQs, and measures with only the passing ones on the remaining hold-out samples.

probe_selection:
  sample_pct: 0.3              # fraction of samples for calibration
  stratify_by: [eval_category] # metadata keys to balance the split by
  filter:
    metric: auroc              # "accuracy", "auroc", or "auprc" (default: auroc)
    operator: ">="             # default: >=
    threshold: 0.6             # PQs must score at least this to be kept

Key design decisions:

No data leakage. The calibration and measurement sample sets are disjoint — samples used to select PQs never appear in the final measurement. The split is stratified by stratify_by keys so both sets preserve the original class balance.
Selection is per judge, not per generator. PQ selection answers "which questions does this judge answer well?" — that's a property of the judge model, not the generator that produced the transcripts. All generators sharing the same judge and probe file are pooled together for calibration, producing one PQ set and one cal/measure split per judge+probe. This gives more calibration samples and ensures cross-generator comparisons use the same PQs.
Default metric is AUROC. PQs are ranked by AUROC (>= 0.6) by default. AUROC measures discrimination regardless of threshold, making it more robust than accuracy for PQ selection.

Omit probe_selection to skip calibration and use all PQs on all samples.

Sourcing deployment data

BEA needs both eval and deploy transcripts to measure the awareness signal. Eval transcripts come from running benchmarks. Deploy transcripts — real conversations from actual users — can be collected using the tools in bea_tools.

You can also use any HuggingFace conversation dataset as a deploy source:

sources:
  - name: claude_code_deploy
    label:
      split: deploy
      category: coding
      has_tools: true
    collect:
      backend: local_jsonl
      path: data/claude_code.jsonl

Source backends

Backend	`collect.backend`	Use case
Inspect AI	`inspect`	Any `@task` from inspect_evals
Inspect Log	`inspect_log`	Pre-existing `.eval` log files (bring your own eval logs)
HuggingFace	`huggingface`	Chat datasets from HF Hub
Local JSON	`local_json`	Pre-collected trajectory files
Local JSONL	`local_jsonl`	Line-delimited trajectory files

Bring your own eval (`inspect_log`)

Load trajectories from .eval files you've already generated. Useful for reusing expensive eval runs.

sources:
  - name: my_eval
    label:
      split: eval          # "eval" or "deploy" — sets eval_category
      category: capability  # default category for all trajectories
      has_tools: true
    collect:
      backend: inspect_log
      path: logs/my_eval/   # single .eval file or directory of them
    limit: 30

label.split and label.category are required — they set the ground-truth class and category for each trajectory. If individual samples in the log carry a "category" key in their metadata, it overrides the config value. When loading a directory with logs from different benchmarks, use separate source entries per benchmark to keep categories clean.

Probe questions

Probe questions are defined in YAML and applied to every transcript. Each PQ asks the judge model a meta-question about the nature of the interaction.

File	Question
`probe_questions/eval_mcq.yaml`	"Is this an evaluation?" (binary yes/no)
`probe_questions/purpose_mcq.yaml`	"What kind of evaluation?" (capability / alignment)
`probe_questions/purpose_open.yaml`	Open-ended purpose question

Output

Each pipeline run produces a self-contained directory under output/runs/:

output/
├── cache/trajectories/              # shared cache, organized by model
│   ├── claude-sonnet-4-6/           #   per-model trajectory files
│   │   ├── cybench_a1b2c3d4.json
│   │   └── mask_e5f6g7h8.json
│   └── needham_retained_i9j0.json   #   model-agnostic files (flat)
├── .last_run                        # pointer to most recent run directory
└── runs/<experiment>/               # one directory per run
    ├── collect/
    │   └── manifest.json            # dataset catalog (source → cache file mapping)
    ├── probe_selection/             # (if probe_selection is configured)
    │   ├── *.eval                   # calibration logs
    │   ├── selected_pqs.json        # combo key → selected PQ IDs
    │   └── split.json               # cal_ids / measure_ids (disjoint)
    ├── measure/
    │   ├── *.eval                   # measurement logs
    │   └── log_map.json             # combo key → log filename
    └── analyze/
        ├── metrics.json             # per-combo accuracy, AUROC, AUPRC
        ├── samples.csv              # per-transcript P(eval) scores + ground truth
        └── plots/                   # ROC/PR curves (if generate_plots: true)

Project structure

aware_bench/
  config.py                 # Pydantic configuration models
  collect/                  # Stage 1: run benchmarks, extract + cache transcripts
  measure/                  # Stage 2-3: probe selection + measurement (Inspect AI tasks)
    split.py                #   stratified cal/measure splitting
    probing/                #   probe questions, solver, scorer
  analyze/                  # Stage 4: AUROC/AUPRC computation, plotting
run.py                      # CLI entry point
configs/
  default.yaml              # default pipeline config
  config_test.yaml          # small end-to-end test config
  examples/                 # additional config examples
probe_questions/            # probe question definitions (YAML)
tools/                      # utilities
  cache/
    cache_export.py         # upload runs/trajectories to HF
    cache_import.py         # download runs/trajectories from HF
tests/                      # unit tests
assets/                     # logo

Reproducing results

Every artefact in the paper — trajectories, measure-stage .eval logs, and the per-figure CSVs — is shipped together as a single Croissant- annotated HuggingFace dataset. Configs live alongside the data in each folder, so you can rerun any stage without consulting the codebase tree.

Folder map (HuggingFace dataset)

Folder	Contents	YAML config
`collected_trajectories/`	Per-model trajectory JSONs (`<model>/all.json` and per-source files), shared deployment chats under `deployment_categories/`, chat-eval prompts under `non_inspect_chat_evals/`.	`collected_trajectories/eval_suite.yaml` (broad benchmark sweep) and `collected_trajectories/agentic_misalignment.yaml` (focused 20-condition agentic-misalignment collection).
`measure_logs/`	12 `inspect_ai` `.eval` logs (one per judge model), `scores.csv` (canonical per-(transcript × probe × judge) table), and the run config.	`measure_logs/config.yaml`
`paper_replication/`	Scored CSVs that feed the paper figures and ANOVA tables. End-of-pipeline outputs; no runnable config.	—

1. Download

# Entire dataset (~1.9 GB):
huggingface-cli download el7982/aware-bench --repo-type dataset \
    --local-dir data/repro

# Or just the trajectories (~370 MB) — sufficient to re-run the measure stage:
huggingface-cli download el7982/aware-bench --repo-type dataset \
    --include "collected_trajectories/*" --local-dir data/repro

2. Re-run any stage

The configs in the dataset use paths relative to the working directory, defaulting to data/repro/.... Run from the parent of data/:

# (a') Optional:Collect a fresh trajectory pool from upstream benchmarks
uv run run.py --config data/repro/collected_trajectories/eval_suite.yaml \
    --skip-measure --skip-analyze

# (a'') Optional collection: 20-condition agentic-misalignment sweep
uv run run.py --config data/repro/collected_trajectories/agentic_misalignment.yaml \
    --skip-measure --skip-analyze

# (b) Measure: probe the 12 judge models on the trajectory pool
uv run run.py --config data/repro/measure_logs/config.yaml \
    --skip-collect --skip-analyze

--skip-collect reuses the on-disk trajectories; --skip-measure / --skip-analyze skip downstream stages. Edit runs: in any config to restrict which generator/judge models are scored.

3. Consume the scored data directly

If you don't want to rerun anything, the structured score tables in paper_replication/*.csv and measure_logs/scores.csv are loadable via pandas.read_csv or HuggingFace datasets:

from datasets import load_dataset
ds = load_dataset("el7982/aware-bench", "scores", split="train")

Exporting your own runs

uv run tools/cache/cache_export.py --repo your-org/your-repo --output-dir output/eval_suite

The export tool uploads trajectories with clean filenames ({model}/{source}.json) and includes a manifest for round-trip import.

Path templates

When loading pre-collected trajectories from local_json, you can use {generator} in the path to load per-model files from a shared directory structure:

sources:
  - name: cybench
    label: {split: eval, category: capability, has_tools: true}
    collect:
      backend: local_json
      path: "data/{generator}/cybench.json"
    limit: 20

{generator} resolves to the last segment of each run's generator model ID. For example, openrouter/anthropic/claude-sonnet-4-6 resolves to claude-sonnet-4-6.

Sources without {generator} in their path remain model-agnostic (collected once, shared across all runs), preserving backward compatibility.

Tests

uv run pytest tests/ -v

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
assets		assets
aware_bench		aware_bench
configs		configs
probe_questions		probe_questions
tests		tests
tools		tools
COORDINATION.md		COORDINATION.md
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
run.py		run.py
sync-paper.sh		sync-paper.sh
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Benchmark for Evaluation Awareness

How it works

Setup

Quick start

Verify your setup

Stage control

Configuration

Depth stripping and resampling

Probe selection

Sourcing deployment data

Source backends

Bring your own eval (`inspect_log`)

Probe questions

Output

Project structure

Reproducing results

Folder map (HuggingFace dataset)

1. Download

2. Re-run any stage

3. Consume the scored data directly

Exporting your own runs

Path templates

Tests

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Benchmark for Evaluation Awareness

How it works

Setup

Quick start

Verify your setup

Stage control

Configuration

Depth stripping and resampling

Probe selection

Sourcing deployment data

Source backends

Bring your own eval (inspect_log)

Probe questions

Output

Project structure

Reproducing results

Folder map (HuggingFace dataset)

1. Download

2. Re-run any stage

3. Consume the scored data directly

Exporting your own runs

Path templates

Tests

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Bring your own eval (`inspect_log`)

Packages