Code Hivemind

Frontier LLMs Generate Code With Half the Effective Diversity of Independent Human Authors

A three-kernel Vendi-Score evaluation framework that disentangles lexical, structural, and semantic diversity in LLM-generated code, paired with a three-pillar security analysis (vulnerability rate, cross-model CWE-pattern homogeneity, slopsquatting).

Extends Artificial Hivemind (Jiang et al., NeurIPS 2025 Best Paper) to code generation.

Research Question

Do frontier LLMs converge on the same code? Across 8,799 samples from 11 LLMs spanning 6 families on 20 open-ended Python tasks, the cross-model ensemble shows 2.3--3.5x less lexical/semantic diversity than 50 independent human authors per task -- while being structurally indistinguishable at the AST level.

Prerequisites

Python 3.10+
(Optional) CUDA-capable GPU for neural embedding kernels

Installation

# Clone the repository
git clone <repo-url>
cd code_hive

# Install core dependencies
pip install -r requirements.txt

# Optional: neural embedding kernels (CodeT5+, UniXcoder, CodeBERT)
pip install torch transformers

# Optional: APTED tree-edit distance (AST kernel, falls back to node-bag Jaccard without it)
pip install apted

# Optional: Meta CodeShield / Insecure Code Detector (security pillar primary backend)
pip install codeshield

# Optional: mixed-effects statistical model (proof_homogeneity_v2.py)
pip install statsmodels

Environment Setup

API keys are required for data collection. Create a .env file in the project root or export them directly:

export OPENAI_API_KEY=...
export ANTHROPIC_API_KEY=...
export GOOGLE_API_KEY=...
export TOGETHER_API_KEY=...

The .env file is loaded automatically via python-dotenv and is excluded from version control by .gitignore.
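
Before starting a collection run, it can help to confirm that all four keys are visible to the process. The following is a minimal standard-library sketch; the helper name `missing_api_keys` is hypothetical and not part of the repository, and it only inspects the environment (it does not read `.env` itself).

```python
import os

# Key names taken from the export lines above.
REQUIRED_KEYS = (
    "OPENAI_API_KEY",
    "ANTHROPIC_API_KEY",
    "GOOGLE_API_KEY",
    "TOGETHER_API_KEY",
)


def missing_api_keys(env=None):
    """Return the required key names that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_api_keys()
    if missing:
        print("Missing API keys:", ", ".join(missing))
    else:
        print("All four API keys are set.")
```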

Quick Start (Demo)

No API keys needed — generates synthetic responses from 5 models and runs the legacy diversity metrics on them:

python run.py --demo

This writes mock JSONL to results/raw_responses/OD-01.jsonl and prints pairwise naming similarity, library-choice entropy, design-pattern detection, and embedding cosine similarity.
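
For intuition about the library-choice entropy figure, here is a small sketch of Shannon entropy over a list of chosen libraries. It illustrates the general idea only; how diversity_metrics.py actually computes the metric (log base, normalization) is not stated here, so treat the function below as an assumption.

```python
import math
from collections import Counter


def shannon_entropy(choices):
    """Shannon entropy (in bits) of the empirical distribution over `choices`."""
    counts = Counter(choices)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # p * log2(1/p) summed over observed categories
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical example: four samples picking an HTTP library.
print(shannon_entropy(["requests", "requests", "httpx", "urllib"]))  # 1.5
```

Entropy is 0 when every sample picks the same library and grows as choices spread out more evenly.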

Running the Full Pipeline

The pipeline has six steps that run in sequence (run.py is used twice, for collection and for the legacy analysis). Steps 1 and 2 produce the data that the later steps consume. The exact execution order is below.

Step 1 — Collect LLM Responses (`run.py --collect`)

Calls 11 models across 4 API providers (OpenAI, Anthropic, Google, Together) and writes per-prompt JSONL files to results/raw_responses/.

python run.py --collect --samples 20 --temps 0.0 0.7

Flag	Default	Description
`--samples N`	50	Samples per model per temperature
`--temps T [T ...]`	0.0 1.0	Temperature values to sweep
`--prompts ID [ID ...]`	all 60	Subset of prompt IDs (e.g. `AL-01 OD-03 SE-12`)
`--models NAME [NAME ...]`	all 11	Subset of model display names

Requires: API keys in .env (see Environment Setup). Produces: results/raw_responses/<PROMPT_ID>.jsonl — one line per (model, temperature, sample).
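
A quick way to sanity-check a collection run is to count records per (model, temperature) in one of the JSONL files. The sketch below assumes each line is a JSON object with `model` and `temperature` keys; those key names are guesses, not confirmed by the schema, so pass different names if the files use others.

```python
import json
from collections import Counter
from pathlib import Path


def count_samples(jsonl_path, model_key="model", temp_key="temperature"):
    """Count records per (model, temperature) in one per-prompt JSONL file.

    The default key names are assumptions; adjust them to the actual schema.
    """
    counts = Counter()
    with Path(jsonl_path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:  # tolerate blank lines
                continue
            record = json.loads(line)
            counts[(record.get(model_key), record.get(temp_key))] += 1
    return counts
```

With `--samples 20 --temps 0.0 0.7`, each (model, temperature) pair would be expected to show 20 records per prompt file if every call succeeded.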

Step 2 — Fetch Human Baseline (`fetch_human_baseline.py`)

Downloads human-written Python solutions from the APPS dataset (Hendrycks et al., 2021) and writes them in the same JSONL schema so the diversity proof can compare LLM vs. human pools.

python fetch_human_baseline.py \
    --prompts AL-01 AL-03 AL-05 \
    --out-dir results/human_baseline \
    --max-solutions-per-prompt 30

Flag	Default	Description
`--prompts ID [ID ...]`	AL-01 AL-03 AL-05	Prompt IDs to fetch baselines for
`--out-dir PATH`	`results/human_baseline`	Output directory
`--max-solutions-per-prompt N`	30	Cap human solutions per prompt
`--max-problems-per-prompt N`	5	Max APPS problems matched per prompt

Requires: pip install datasets (already in requirements.txt). Produces: results/human_baseline/<PROMPT_ID>.jsonl.

Step 3 — Diversity Proof v1 (`proof_homogeneity.py`)

The paper's core three-pillar analysis at a single temperature. Computes Vendi Scores under three kernels across four pools (shuffled-null, human, inter-LLM, intra-LLM), runs a paired ordering test (LLM vs. human), and evaluates functional convergence on algorithmic prompts.

python proof_homogeneity.py \
    --prompts AL-01 AL-03 AL-05 \
    --temperature 0.7 \
    --human-dir results/human_baseline \
    --embedder codet5p \
    --device cpu

Flag	Default	Description
`--prompts ID [ID ...]`	AL-01 AL-03 AL-05	Prompt IDs to analyze
`--temperature T`	1.0	Single temperature to analyze
`--raw-dir PATH`	`results/raw_responses`	Directory with LLM JSONL files
`--human-dir PATH`	(none)	Directory with human JSONL files
`--out-dir PATH`	auto-generated	Output directory
`--embedder NAME`	unixcoder	Embedding model: `codet5p`, `unixcoder`, `codebert`, or `none`
`--device`	cpu	`cpu` or `cuda`
`--n-boot N`	1000	Bootstrap resamples for CIs
`--skip-pillar3`	false	Skip functional-convergence tests
`--dry-run`	false	Print configuration and exit

Requires: pip install torch transformers (for neural kernels), pip install apted (for AST kernel; falls back to node-bag Jaccard without it). Produces (in --out-dir):

File	Contents
`summary.json`	Full numeric results
`summary.md`	Human-readable report
`fig_pillar1_vendi.png`	Grouped bars: pools x kernels x prompts
`fig_pillar2_ordering.png`	Paired scatter: LLM vs. human Vendi
`fig_pillar3_traces.png`	Stacked trace-cluster bars per prompt
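
The node-bag Jaccard fallback used when apted is not installed can be pictured roughly as follows. This is a sketch under assumptions: it uses a multiset Jaccard over AST node type names, and the function names are hypothetical. Whether proof_homogeneity.py uses sets or multisets, or filters node types, is not specified here.

```python
import ast
from collections import Counter


def node_bag(source):
    """Multiset of AST node type names for a Python source string."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


def node_bag_jaccard(src_a, src_b):
    """Multiset Jaccard similarity between the node bags of two programs."""
    bag_a, bag_b = node_bag(src_a), node_bag(src_b)
    intersection = sum((bag_a & bag_b).values())  # element-wise minimum
    union = sum((bag_a | bag_b).values())         # element-wise maximum
    return intersection / union if union else 1.0
```

Because only node types are counted, two programs that differ solely in identifier names score 1.0. That is the intended behavior for a structural kernel, but it also means the fallback ignores tree shape, which the APTED tree-edit distance does capture.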

Step 4 — Diversity Proof v2 (`proof_homogeneity_v2.py`)

Wraps v1 and adds temperature sweep, leave-one-family-out ablation, and a mixed-effects regression. This is the script that produces the paper's main diversity figures.

python proof_homogeneity_v2.py \
    --prompts PB-01 PB-02 PB-03 \
    --temperatures 0.0 0.7 \
    --human-dir results/human_baseline \
    --embedder codebert \
    --device auto

Flag	Default	Description
`--prompts ID [ID ...]`	PB-01 PB-02 PB-03	Prompt IDs
`--temperatures T [T ...]`	1.0	Temperatures to sweep
`--embedder NAME`	codebert	`codet5p`, `unixcoder`, `codebert`, or `none`
`--device`	auto	`auto`, `cpu`, or `cuda`
`--skip-pillar3`	false	Skip functional convergence
`--skip-family-ablation`	false	Skip leave-one-family-out
`--skip-mixed-effects`	false	Skip regression model
`--dry-run`	false	Print config and exit

Requires: all v1 deps + pip install statsmodels pandas (for mixed-effects model). Produces (in results/proof_v2/):

File	Contents
`summary.json`	Master results across all temperatures
`summary.md`	Human-readable v2 report
`fig_temperature_sweep.png`	Vendi (LLM/human) vs. temperature
`fig_family_ablation.png`	Inter-LLM Vendi with each family removed
`fig_mixed_effects.png`	Fixed-effect contrasts vs. human pool
`t<TEMP>/`	Per-temperature v1 outputs and figures
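
The leave-one-family-out ablation can be sketched as a loop that recomputes a diversity score with each family held out in turn. The function and argument names below are hypothetical, and `score_fn` stands in for whatever Vendi-based score the script computes.

```python
def leave_one_family_out(samples, family_of, score_fn):
    """Recompute `score_fn` on the pool with each model family removed in turn.

    samples   : list of (model_name, code) pairs
    family_of : dict mapping model_name -> family name
    score_fn  : callable taking a list of code strings, returning a float
    """
    families = sorted({family_of[model] for model, _ in samples})
    results = {}
    for held_out in families:
        kept = [code for model, code in samples if family_of[model] != held_out]
        results[held_out] = score_fn(kept)
    return results
```

If the score stays roughly constant no matter which family is removed, the homogeneity finding is not driven by any single provider.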

Step 5 — Security Analysis (`security_pillar.py`)

Three security pillars: vulnerability rate (S1), cross-model CWE-pattern homogeneity (S2), and slopsquatting (S3).

python security_pillar.py \
    --prompts SE-01 SE-02 SE-03 SE-04 SE-05 \
    --temperatures 0.0 0.7 \
    --bandit \
    --pypi-live-check \
    --human-dir results/human_baseline \
    --out-dir results/proof_security

Flag	Default	Description
`--prompts ID [ID ...]`	all available	Prompt IDs to scan
`--temperatures T [T ...]`	auto-detect	Temperatures to include
`--bandit`	false	Also run Bandit SAST and report consensus
`--pypi-live-check`	false	Live PyPI/npm existence check for S3
`--human-dir PATH`	(none)	Human baseline for comparison
`--out-dir PATH`	auto-generated	Output directory
`--from-cache PATH`	(none)	Reuse cached scan results
`--dry-run`	false	Print config and exit

Requires: pip install codeshield (primary backend; falls back to regex scanner), pip install bandit (if --bandit flag used). Produces (in --out-dir):

File	Contents
`summary.json`	Full security results
`summary.md`	Human-readable report
`fig_pillar_s1_vuln_rates.png`	Per-model + per-CWE + severity-weighted
`fig_pillar_s2_homogeneity.png`	Vendi/N over CWE-pattern signatures
`fig_pillar_s3_slopsquatting.png`	Deterministic-vs-stochastic rates
`cwe_signatures.csv`	Per-(prompt, model, sample) CWE signatures
`slop_findings.jsonl`	Per-sample hallucinated package names

Step 6 — Legacy Analysis (`run.py --analyze`)

Runs the older pipeline.py analysis (intra/inter-model diversity, family clustering, temperature effects) using the 4-layer diversity_metrics.py framework. Not used for paper figures but retained for exploratory analysis.

python run.py --analyze

Produces: outputs in results/metrics/ and results/figures/.

Execution Order at a Glance

┌──────────────────────────────────────────────────────────────┐
│  1. run.py --collect          → results/raw_responses/       │
│  2. fetch_human_baseline.py   → results/human_baseline/      │
│  3. proof_homogeneity.py      → results/proof/v1/            │
│  4. proof_homogeneity_v2.py   → results/proof_v2/            │
│  5. security_pillar.py        → results/proof_security/      │
│  6. run.py --analyze          → results/metrics/ + figures/  │
└──────────────────────────────────────────────────────────────┘
Steps 1-2 produce the data. Steps 3-5 produce the paper's analysis.
Step 6 is optional (legacy exploratory pipeline).
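
To run everything end to end, the example commands from the steps above can be chained from a small driver. This is a hypothetical convenience script, not a file in the repository; the arguments are copied from the examples in this README and should be adjusted to your prompts and temperatures.

```python
import shlex
import subprocess
import sys

# Example invocations from Steps 1-6, in execution order.
PIPELINE = [
    "run.py --collect --samples 20 --temps 0.0 0.7",
    "fetch_human_baseline.py --prompts AL-01 AL-03 AL-05 "
    "--out-dir results/human_baseline --max-solutions-per-prompt 30",
    "proof_homogeneity.py --prompts AL-01 AL-03 AL-05 --temperature 0.7 "
    "--human-dir results/human_baseline --embedder codet5p --device cpu",
    "proof_homogeneity_v2.py --prompts PB-01 PB-02 PB-03 --temperatures 0.0 0.7 "
    "--human-dir results/human_baseline --embedder codebert --device auto",
    "security_pillar.py --prompts SE-01 SE-02 SE-03 SE-04 SE-05 "
    "--temperatures 0.0 0.7 --bandit --pypi-live-check "
    "--human-dir results/human_baseline --out-dir results/proof_security",
    "run.py --analyze",
]


def run_pipeline(steps=PIPELINE, execute=False):
    """Build each command; run it only when `execute` is True.

    With check=True, a failing step raises and later steps are skipped.
    """
    commands = [[sys.executable, *shlex.split(step)] for step in steps]
    for cmd in commands:
        print(" ".join(cmd))
        if execute:
            subprocess.run(cmd, check=True)
    return commands
```

Calling `run_pipeline()` only prints the commands; `run_pipeline(execute=True)` runs them and requires the API keys and optional dependencies described earlier.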

Viewing Results

All outputs are written to subdirectories under results/:

Directory	What's There	How to View
`results/raw_responses/`	One JSONL per prompt, one line per (model, temp, sample)	Any text editor or `python -m json.tool`
`results/human_baseline/`	Human JSONL files from APPS	Same as above
`results/proof/v1/`	v1 diversity proof: `summary.json`, `summary.md`, PNG figures	Open `summary.md` for a narrative; open PNGs directly
`results/proof_v2/`	v2 diversity proof: temperature sweep, ablation, regression	Same — `summary.md` is the entry point
`results/proof_security/`	Security pillars: `summary.md`, CSVs, PNGs	`summary.md` + `cwe_signatures.csv` for details
`results/metrics/`	Legacy pipeline outputs	JSON files
`results/figures/`	Legacy pipeline plots	PNG files

The summary.md files are self-contained human-readable reports with all key numbers, statistical tests, and references to the figures.

Project Structure

code_hive/
├── config.py                  # Model registry (11 models, 6 families), sampling params
├── run.py                     # CLI entry point (collect / analyze / demo)
├── prompt_suite.py            # 60 prompts: 30 diversity + 30 security-eliciting (SE-01..30)
├── multi_model_sampler.py     # Async multi-provider API collector (OpenAI, Anthropic, Google, Together)
├── requirements.txt
│
├── proof_homogeneity.py       # Three-pillar diversity proof (Vendi Score, 3 kernels, 4 pools)
├── proof_homogeneity_v2.py    # V2: temperature sweep, family ablation, mixed-effects model
├── security_pillar.py         # Security pillars S1 (vuln rate), S2 (CWE homogeneity), S3 (slopsquatting)
├── security_analysis.py       # Regex-based CWE scanner (fallback when CodeShield unavailable)
│
├── fetch_human_baseline.py    # APPS dataset keyword-matcher for human reference solutions
├── pypi_check.py              # Live PyPI/npm existence checker (Pillar S3)
├── cwe_to_cvss.json           # CWE -> CVSS base score severity table
│
├── diversity_metrics.py       # Legacy 4-layer metrics (surface, structural, semantic, behavioral)
├── pipeline.py                # Legacy intra/inter-model analysis pipeline
│
├── data_prep/                 # Dataset curation (Pool A/B from HuggingFace)
├── local_datasets/            # Prompt manifests (JSON)
├── docs/                      # Extended documentation
└── results/                   # Generated outputs (gitignored)
    ├── raw_responses/         #   Per-prompt JSONL sample files
    ├── human_baseline/        #   Human reference solutions from APPS
    ├── proof_v2/              #   Diversity analysis outputs + figures
    └── proof_security/        #   Security analysis outputs + figures

Diversity Kernels (Paper Methodology)

The paper's three-kernel Vendi-Score framework measures diversity along independent axes:

Kernel	Similarity Function	Axis Measured
Token n-gram Jaccard	Jaccard over 3-gram token sets	Lexical: identifiers, imports, idioms
AST tree-edit (APTED)	1 - normalized tree-edit distance	Structural: control flow, nesting, decomposition
CodeT5+ embedding cosine	Cosine of mean-pooled CodeT5+ vectors	Semantic: algorithmic intent and strategy

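Each kernel produces a Vendi Score (the effective number of distinct items in a pool), which is then divided by the pool size N to give a per-sample effective uniqueness. For a similarity kernel with ones on the diagonal, this ratio lies in [1/N, 1]: 1/N when all samples are identical and 1 when all are fully dissimilar.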
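
As a concrete illustration, the sketch below computes a Vendi Score under a token 3-gram Jaccard kernel. It follows the standard definition (the exponential of the Shannon entropy of the eigenvalues of K/n) and requires NumPy. The regex tokenizer and the function names are assumptions; the tokenization in proof_homogeneity.py may differ.

```python
import re

import numpy as np


def token_trigrams(code):
    """Set of 3-grams over a simple regex tokenization (identifiers, numbers, symbols)."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", code)
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}


def jaccard_kernel(codes):
    """Symmetric Jaccard similarity matrix with ones on the diagonal."""
    grams = [token_trigrams(c) for c in codes]
    n = len(grams)
    K = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            union = grams[i] | grams[j]
            sim = len(grams[i] & grams[j]) / len(union) if union else 1.0
            K[i, j] = K[j, i] = sim
    return K


def vendi_score(K):
    """Exponential of the Shannon entropy of the eigenvalues of K / n."""
    n = K.shape[0]
    eigvals = np.linalg.eigvalsh(K / n)
    eigvals = eigvals[eigvals > 1e-12]  # drop zeros and numerical noise
    return float(np.exp(-np.sum(eigvals * np.log(eigvals))))
```

The normalized value is `vendi_score(jaccard_kernel(codes)) / len(codes)`. A pool of N identical programs gives 1/N, and a pool whose programs share no token 3-grams gives 1.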

Security Pillars

Pillar	Measurement	Key Finding
S1	Per-model vulnerability rate, CVSS-B weighted	42.9% overall, mean CVSS-B 6.88
S2	Cross-model CWE-pattern homogeneity (novel)	100% exact-pattern match on SQL/XML tasks
S3	Slopsquatting / package hallucination	Near-zero on security prompts
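
One simple way to read the S2 "exact-pattern match" number is as the fraction of model pairs that produce an identical CWE signature on a prompt. The sketch below implements that reading; it is an assumption about the metric, not the definition used in security_pillar.py (which reports Vendi/N over CWE-pattern signatures), and the function name is hypothetical.

```python
from itertools import combinations


def exact_match_rate(signatures):
    """Fraction of model pairs whose CWE signatures are identical.

    signatures: dict mapping model name -> iterable of CWE IDs for one prompt.
    """
    sets = [frozenset(sig) for sig in signatures.values()]
    pairs = list(combinations(sets, 2))
    if not pairs:  # zero or one model: nothing to compare
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)
```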

Prompt Categories

Category	Count	Description
OPEN_DESIGN	5	"Build X" with many valid designs
ALGORITHM	5	Multiple valid algorithmic approaches
REFACTOR	5	Improve given code
NAMING	5	Tasks where naming is unconstrained
CREATIVE_TOOL	5	Creative coding tasks
SYSTEM_DESIGN	5	High-level design decompositions
SECURITY_ELICITING	30	Neutrally-phrased prompts in security-prone domains (SE-01..30)

Key Analyses

Effective-sample-size collapse: Vendi/N under three kernels across four pools (null, human, inter-LLM, intra-LLM)
Paired ordering test: Bootstrap-paired delta showing LLM diversity < human diversity
Family ablation: Leave-one-family-out robustness check across all six model families
Temperature sweep: Diversity at T=0.0 vs T=0.7 (gap narrows ~15% but does not close)
Cross-model vulnerability convergence: Same exploitable (CWE, sink) patterns across providers
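
The paired ordering test can be pictured as a bootstrap over per-prompt differences between the LLM pool's score and the human pool's score. The sketch below is a generic percentile bootstrap written with the standard library only; the function name and the choice of a 95% percentile interval are assumptions rather than the script's confirmed procedure.

```python
import random


def paired_bootstrap_delta(llm_scores, human_scores, n_boot=1000, seed=0):
    """Bootstrap the mean of per-prompt (LLM - human) differences.

    Returns (observed mean delta, 95% CI lower bound, 95% CI upper bound).
    """
    if len(llm_scores) != len(human_scores):
        raise ValueError("scores must be paired per prompt")
    deltas = [l - h for l, h in zip(llm_scores, human_scores)]
    n = len(deltas)
    rng = random.Random(seed)
    # Resample prompts with replacement and record the mean delta each time.
    means = sorted(sum(rng.choices(deltas, k=n)) / n for _ in range(n_boot))
    lower = means[int(0.025 * n_boot)]
    upper = means[min(n_boot - 1, int(0.975 * n_boot))]
    return sum(deltas) / n, lower, upper
```

An upper bound below zero indicates that LLM diversity is lower than human diversity across prompts at the chosen confidence level.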
