Skip to content

evalawarebench-temp/evalawarebench

Repository files navigation

BEA logo

Benchmark for Evaluation Awareness

Measures whether LLMs can detect that they are being evaluated — and if so, what kind of evaluation.


A model that can distinguish evaluation from deployment could behave differently in each context, undermining the validity of safety evaluations. BEA quantifies this awareness signal by collecting conversation transcripts from existing benchmarks, then probing a judge model with meta-questions about each interaction.

How it works

 ┌──────────┐     ┌────────────┐     ┌──────────┐     ┌──────────┐
 │ COLLECT  │────▶│  PROBE     │────▶│ MEASURE  │────▶│ ANALYZE  │
 │          │     │  SELECTION │     │          │     │          │
 └──────────┘     └────────────┘     └──────────┘     └──────────┘
  Run benchmarks   Calibrate PQs     Probe judge       Compute
  + load deploy    on hold-out        with selected     AUROC, AUPRC,
  transcripts      sample split       PQs               accuracy
  1. Collect — Runs benchmarks (via Inspect AI) and loads real-user conversations, saving transcripts to a cache.
  2. Probe Selection (optional) — Runs all PQs on a stratified calibration subset, keeps only PQs that discriminate well. The calibration and measurement samples are disjoint to prevent leakage.
  3. Measure — Presents each transcript to a judge model with the selected probe questions, scoring the response.
  4. Analyze — Aggregates per-transcript P(eval) scores and computes discrimination metrics.

Setup

uv sync

Optional extras (e.g. SWE-bench support):

uv sync --extra swe

Troubleshooting: If a package appears missing after install, run uv lock first to regenerate the lockfile, then uv sync again. Always use uv run (e.g. uv run python run.py) to ensure you're using the project environment.

Create a .env with your API keys:

ANTHROPIC_API_KEY=...
OPENAI_API_KEY=...       # if using OpenAI models

Quick start

# Run the full pipeline with the default config
python run.py

# Or specify a config
python run.py --config configs/default.yaml

Verify your setup

Run with minimal samples to check everything works:

python run.py --config configs/config_test.yaml

Stage control

python run.py --skip-measure                         # collect only
python run.py --skip-collect                         # measure + analyze (uses last run)
python run.py --skip-collect --run <name_or_path>    # measure a specific prior run
python run.py --skip-probe-selection                 # skip calibration, use all PQs

Configuration

All configuration is in a single YAML file. Here's the structure:

# Which models to use and what probes to run
runs:
  - generator: anthropic/claude-sonnet-4-6    # model that produced transcripts
    target: anthropic/claude-sonnet-4-6       # judge model answering probe questions
    probes:                                    # optional: override default PQ file
      - probe_questions/eval_mcq.yaml

# Where to get transcripts
sources:
  - name: agentharm
    label:
      split: eval                # "eval" or "deploy" (ground truth)
      category: alignment        # eval type (alignment, capability, chat, ...)
      has_tools: true
    collect:
      backend: inspect           # "inspect", "inspect_log", "huggingface", "local_json", "local_jsonl"
      task: inspect_evals/agentharm
      params:
        split: test_public
    limit: 10                    # samples per source

  - name: everyday_chat
    label:
      split: deploy
      category: chat
      has_tools: false
    collect:
      backend: huggingface
      dataset: HuggingFaceTB/everyday-conversations-llama3.1-2k
      split: train_sft
      conversation_field: messages
    limit: 10

# Probing settings
measurement:
  pq_file: probe_questions/eval_mcq.yaml   # default probe question file
  is_purpose: false                         # true for purpose classification

# Output and analysis
pipeline:
  output_dir: output/

analysis:
  metrics: [auroc, auprc, accuracy]
  generate_plots: false

See configs/examples/ for more config examples.

Depth stripping and resampling

BEA can strip conversations to intermediate depths and optionally regenerate continuations. This lets you measure how awareness changes as the model sees more (or fewer) turns of a conversation.

Configure resampling per source:

sources:
  - name: agentharm
    label:
      split: eval
      category: alignment
      has_tools: true
    collect:
      backend: inspect
      task: inspect_evals/agentharm
    resample:
      depth:
        mode: all          # "all" = every intermediate depth; "k" = fixed depth only
        max_depth: 5       # cap the number of depths generated
        from_start: false  # false = strip turns from end; true = keep only first k turns
      generate: false      # true = re-run generator on truncated transcripts
      fallback: use_base   # "use_base" = fall back to original on failure
depth.mode Behavior
all Generate a trajectory at every depth from 1 to N-1 assistant turns (capped by max_depth)
k Generate only at depth k

When generate: true, the truncated transcripts are fed back to the generator model to produce fresh continuations — useful for studying whether a model's awareness changes when it re-enters a conversation at a different point.

When generate: false, transcripts are simply truncated (depth expansion only, no API calls).

Probe selection

Not all probe questions discriminate equally well between eval and deploy transcripts. Probe selection runs all PQs on a stratified calibration subset, filters out low-signal PQs, and measures with only the passing ones on the remaining hold-out samples.

probe_selection:
  sample_pct: 0.3              # fraction of samples for calibration
  stratify_by: [eval_category] # metadata keys to balance the split by
  filter:
    metric: auroc              # "accuracy", "auroc", or "auprc" (default: auroc)
    operator: ">="             # default: >=
    threshold: 0.6             # PQs must score at least this to be kept

Key design decisions:

  • No data leakage. The calibration and measurement sample sets are disjoint — samples used to select PQs never appear in the final measurement. The split is stratified by stratify_by keys so both sets preserve the original class balance.
  • Selection is per judge, not per generator. PQ selection answers "which questions does this judge answer well?" — that's a property of the judge model, not the generator that produced the transcripts. All generators sharing the same judge and probe file are pooled together for calibration, producing one PQ set and one cal/measure split per judge+probe. This gives more calibration samples and ensures cross-generator comparisons use the same PQs.
  • Default metric is AUROC. PQs are ranked by AUROC (>= 0.6) by default. AUROC measures discrimination regardless of threshold, making it more robust than accuracy for PQ selection.

Omit probe_selection to skip calibration and use all PQs on all samples.

Sourcing deployment data

BEA needs both eval and deploy transcripts to measure the awareness signal. Eval transcripts come from running benchmarks. Deploy transcripts — real conversations from actual users — can be collected using the tools in bea_tools.

You can also use any HuggingFace conversation dataset as a deploy source:

sources:
  - name: claude_code_deploy
    label:
      split: deploy
      category: coding
      has_tools: true
    collect:
      backend: local_jsonl
      path: data/claude_code.jsonl

Source backends

Backend collect.backend Use case
Inspect AI inspect Any @task from inspect_evals
Inspect Log inspect_log Pre-existing .eval log files (bring your own eval logs)
HuggingFace huggingface Chat datasets from HF Hub
Local JSON local_json Pre-collected trajectory files
Local JSONL local_jsonl Line-delimited trajectory files

Bring your own eval (inspect_log)

Load trajectories from .eval files you've already generated. Useful for reusing expensive eval runs.

sources:
  - name: my_eval
    label:
      split: eval          # "eval" or "deploy" — sets eval_category
      category: capability  # default category for all trajectories
      has_tools: true
    collect:
      backend: inspect_log
      path: logs/my_eval/   # single .eval file or directory of them
    limit: 30

label.split and label.category are required — they set the ground-truth class and category for each trajectory. If individual samples in the log carry a "category" key in their metadata, it overrides the config value. When loading a directory with logs from different benchmarks, use separate source entries per benchmark to keep categories clean.

Probe questions

Probe questions are defined in YAML and applied to every transcript. Each PQ asks the judge model a meta-question about the nature of the interaction.

File Question
probe_questions/eval_mcq.yaml "Is this an evaluation?" (binary yes/no)
probe_questions/purpose_mcq.yaml "What kind of evaluation?" (capability / alignment)
probe_questions/purpose_open.yaml Open-ended purpose question

Output

Each pipeline run produces a self-contained directory under output/runs/:

output/
├── cache/trajectories/              # shared cache, organized by model
│   ├── claude-sonnet-4-6/           #   per-model trajectory files
│   │   ├── cybench_a1b2c3d4.json
│   │   └── mask_e5f6g7h8.json
│   └── needham_retained_i9j0.json   #   model-agnostic files (flat)
├── .last_run                        # pointer to most recent run directory
└── runs/<experiment>/               # one directory per run
    ├── collect/
    │   └── manifest.json            # dataset catalog (source → cache file mapping)
    ├── probe_selection/             # (if probe_selection is configured)
    │   ├── *.eval                   # calibration logs
    │   ├── selected_pqs.json        # combo key → selected PQ IDs
    │   └── split.json               # cal_ids / measure_ids (disjoint)
    ├── measure/
    │   ├── *.eval                   # measurement logs
    │   └── log_map.json             # combo key → log filename
    └── analyze/
        ├── metrics.json             # per-combo accuracy, AUROC, AUPRC
        ├── samples.csv              # per-transcript P(eval) scores + ground truth
        └── plots/                   # ROC/PR curves (if generate_plots: true)

Project structure

aware_bench/
  config.py                 # Pydantic configuration models
  collect/                  # Stage 1: run benchmarks, extract + cache transcripts
  measure/                  # Stage 2-3: probe selection + measurement (Inspect AI tasks)
    split.py                #   stratified cal/measure splitting
    probing/                #   probe questions, solver, scorer
  analyze/                  # Stage 4: AUROC/AUPRC computation, plotting
run.py                      # CLI entry point
configs/
  default.yaml              # default pipeline config
  config_test.yaml          # small end-to-end test config
  examples/                 # additional config examples
probe_questions/            # probe question definitions (YAML)
tools/                      # utilities
  cache/
    cache_export.py         # upload runs/trajectories to HF
    cache_import.py         # download runs/trajectories from HF
tests/                      # unit tests
assets/                     # logo

Reproducing results

Every artefact in the paper — trajectories, measure-stage .eval logs, and the per-figure CSVs — is shipped together as a single Croissant- annotated HuggingFace dataset. Configs live alongside the data in each folder, so you can rerun any stage without consulting the codebase tree.

Folder map (HuggingFace dataset)

Folder Contents YAML config
collected_trajectories/ Per-model trajectory JSONs (<model>/all.json and per-source files), shared deployment chats under deployment_categories/, chat-eval prompts under non_inspect_chat_evals/. collected_trajectories/eval_suite.yaml (broad benchmark sweep) and collected_trajectories/agentic_misalignment.yaml (focused 20-condition agentic-misalignment collection).
measure_logs/ 12 inspect_ai .eval logs (one per judge model), scores.csv (canonical per-(transcript × probe × judge) table), and the run config. measure_logs/config.yaml
paper_replication/ Scored CSVs that feed the paper figures and ANOVA tables. End-of-pipeline outputs; no runnable config.

1. Download

# Entire dataset (~1.9 GB):
huggingface-cli download el7982/aware-bench --repo-type dataset \
    --local-dir data/repro

# Or just the trajectories (~370 MB) — sufficient to re-run the measure stage:
huggingface-cli download el7982/aware-bench --repo-type dataset \
    --include "collected_trajectories/*" --local-dir data/repro

2. Re-run any stage

The configs in the dataset use paths relative to the working directory, defaulting to data/repro/.... Run from the parent of data/:

# (a') Optional:Collect a fresh trajectory pool from upstream benchmarks
uv run run.py --config data/repro/collected_trajectories/eval_suite.yaml \
    --skip-measure --skip-analyze

# (a'') Optional collection: 20-condition agentic-misalignment sweep
uv run run.py --config data/repro/collected_trajectories/agentic_misalignment.yaml \
    --skip-measure --skip-analyze

# (b) Measure: probe the 12 judge models on the trajectory pool
uv run run.py --config data/repro/measure_logs/config.yaml \
    --skip-collect --skip-analyze

--skip-collect reuses the on-disk trajectories; --skip-measure / --skip-analyze skip downstream stages. Edit runs: in any config to restrict which generator/judge models are scored.

3. Consume the scored data directly

If you don't want to rerun anything, the structured score tables in paper_replication/*.csv and measure_logs/scores.csv are loadable via pandas.read_csv or HuggingFace datasets:

from datasets import load_dataset
ds = load_dataset("el7982/aware-bench", "scores", split="train")

Exporting your own runs

uv run tools/cache/cache_export.py --repo your-org/your-repo --output-dir output/eval_suite

The export tool uploads trajectories with clean filenames ({model}/{source}.json) and includes a manifest for round-trip import.

Path templates

When loading pre-collected trajectories from local_json, you can use {generator} in the path to load per-model files from a shared directory structure:

sources:
  - name: cybench
    label: {split: eval, category: capability, has_tools: true}
    collect:
      backend: local_json
      path: "data/{generator}/cybench.json"
    limit: 20

{generator} resolves to the last segment of each run's generator model ID. For example, openrouter/anthropic/claude-sonnet-4-6 resolves to claude-sonnet-4-6.

Sources without {generator} in their path remain model-agnostic (collected once, shared across all runs), preserving backward compatibility.

Tests

uv run pytest tests/ -v

License

MIT

About

evalawarebench

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors