Frontier LLMs Generate Code With Half the Effective Diversity of Independent Human Authors
A three-kernel Vendi-Score evaluation framework that disentangles lexical, structural, and semantic diversity in LLM-generated code, paired with a three-pillar security analysis (vulnerability rate, cross-model CWE-pattern homogeneity, slopsquatting).
Extends Artificial Hivemind (Jiang et al., NeurIPS 2025 Best Paper) to code generation.
Do frontier LLMs converge on the same code? Across 8,799 samples from 11 LLMs spanning 6 families on 20 open-ended Python tasks, the cross-model ensemble shows 2.3x to 3.5x less lexical and semantic diversity than 50 independent human authors per task, while remaining structurally indistinguishable at the AST level.
- Python 3.10+
- (Optional) CUDA-capable GPU for neural embedding kernels
# Clone the repository
git clone <repo-url>
cd code_hive
# Install core dependencies
pip install -r requirements.txt
# Optional: neural embedding kernels (CodeT5+, UniXcoder, CodeBERT)
pip install torch transformers
# Optional: APTED tree-edit distance (AST kernel, falls back to node-bag Jaccard without it)
pip install apted
# Optional: Meta CodeShield / Insecure Code Detector (security pillar primary backend)
pip install codeshield
# Optional: mixed-effects statistical model (proof_homogeneity_v2.py)
pip install statsmodels

API keys are required for data collection. Create a .env file in the project root or export them directly:
export OPENAI_API_KEY=...
export ANTHROPIC_API_KEY=...
export GOOGLE_API_KEY=...
export TOGETHER_API_KEY=...

The .env file is loaded automatically via python-dotenv and is excluded from version control by .gitignore.
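Before starting a collection run, it can help to confirm that all four keys are visible to the process. The following is a minimal standard-library sketch; the helper name `missing_api_keys` is hypothetical and not part of the repository, and it checks only the process environment (it does not parse .env itself).

```python
import os

# Key names taken from the export lines above.
REQUIRED_KEYS = (
    "OPENAI_API_KEY",
    "ANTHROPIC_API_KEY",
    "GOOGLE_API_KEY",
    "TOGETHER_API_KEY",
)


def missing_api_keys(env=None):
    """Return the required key names that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_api_keys()
    if missing:
        print("Missing API keys:", ", ".join(missing))
    else:
        print("All API keys are set.")
```

Passing a plain dictionary instead of `os.environ` makes the check easy to exercise without touching real credentials.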
The demo needs no API keys. It generates synthetic responses from 5 models and runs the legacy diversity metrics on them:

python run.py --demo

This writes mock JSONL to results/raw_responses/OD-01.jsonl and prints pairwise naming similarity, library-choice entropy, design-pattern detection, and embedding cosine similarity.
The pipeline has six steps that run in sequence. Each step produces outputs consumed by later steps. Below is the exact execution order.
The first step, run.py --collect, calls 11 models across 4 API providers (OpenAI, Anthropic, Google, Together) and writes per-prompt JSONL files to results/raw_responses/.
python run.py --collect --samples 20 --temps 0.0 0.7

| Flag | Default | Description |
|---|---|---|
| --samples N | 50 | Samples per model per temperature |
| --temps T [T ...] | 0.0 1.0 | Temperature values to sweep |
| --prompts ID [ID ...] | all 60 | Subset of prompt IDs (e.g. AL-01 OD-03 SE-12) |
| --models NAME [NAME ...] | all 11 | Subset of model display names |
Requires: API keys in .env (see Environment Setup).
Produces: results/raw_responses/<PROMPT_ID>.jsonl — one line per (model, temperature, sample).
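Because each file holds one JSON object per line, the raw responses can be inspected with a few lines of Python. The sketch below counts samples per (model, temperature). The field names `model` and `temperature` are assumptions about the schema, so adjust them to match the actual records.

```python
import json
from collections import Counter


def count_samples(lines, model_key="model", temp_key="temperature"):
    """Count records per (model, temperature) from an iterable of JSONL lines.

    The default key names are assumed, not taken from the repository.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        counts[(record.get(model_key), record.get(temp_key))] += 1
    return counts


# In-memory example; with real data, pass an open file handle instead, e.g.
# open("results/raw_responses/OD-01.jsonl", encoding="utf-8").
example = [
    '{"model": "model-a", "temperature": 0.0, "code": "print(1)"}',
    '{"model": "model-a", "temperature": 0.7, "code": "print(2)"}',
    '{"model": "model-a", "temperature": 0.7, "code": "print(3)"}',
]
for key, n in sorted(count_samples(example).items()):
    print(key, n)
```

A count that differs from the value of --samples for some (model, temperature) pair is a quick signal that a collection run was interrupted.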
fetch_human_baseline.py downloads human-written Python solutions from the APPS dataset (Hendrycks et al., 2021) and writes them in the same JSONL schema, so the diversity proof can compare LLM and human pools.
python fetch_human_baseline.py \
--prompts AL-01 AL-03 AL-05 \
--out-dir results/human_baseline \
--max-solutions-per-prompt 30

| Flag | Default | Description |
|---|---|---|
| --prompts ID [ID ...] | AL-01 AL-03 AL-05 | Prompt IDs to fetch baselines for |
| --out-dir PATH | results/human_baseline | Output directory |
| --max-solutions-per-prompt N | 30 | Cap human solutions per prompt |
| --max-problems-per-prompt N | 5 | Max APPS problems matched per prompt |
Requires: pip install datasets (already in requirements.txt).
Produces: results/human_baseline/<PROMPT_ID>.jsonl.
proof_homogeneity.py runs the paper's core three-pillar analysis at a single temperature. It computes Vendi Scores under three kernels across four pools (shuffled-null, human, inter-LLM, intra-LLM), runs a paired ordering test (LLM vs. human), and evaluates functional convergence on algorithmic prompts.
python proof_homogeneity.py \
--prompts AL-01 AL-03 AL-05 \
--temperature 0.7 \
--human-dir results/human_baseline \
--embedder codet5p \
--device cpu

| Flag | Default | Description |
|---|---|---|
| --prompts ID [ID ...] | AL-01 AL-03 AL-05 | Prompt IDs to analyze |
| --temperature T | 1.0 | Single temperature to analyze |
| --raw-dir PATH | results/raw_responses | Directory with LLM JSONL files |
| --human-dir PATH | (none) | Directory with human JSONL files |
| --out-dir PATH | auto-generated | Output directory |
| --embedder NAME | unixcoder | Embedding model: codet5p, unixcoder, codebert, or none |
| --device | cpu | cpu or cuda |
| --n-boot N | 1000 | Bootstrap resamples for CIs |
| --skip-pillar3 | false | Skip functional-convergence tests |
| --dry-run | false | Print configuration and exit |
Requires: pip install torch transformers (for neural kernels), pip install apted (for AST kernel; falls back to node-bag Jaccard without it).
Produces (in --out-dir):
| File | Contents |
|---|---|
| summary.json | Full numeric results |
| summary.md | Human-readable report |
| fig_pillar1_vendi.png | Grouped bars: pools x kernels x prompts |
| fig_pillar2_ordering.png | Paired scatter: LLM vs. human Vendi |
| fig_pillar3_traces.png | Stacked trace-cluster bars per prompt |
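When apted is not installed, the AST kernel falls back to node-bag Jaccard. The sketch below shows one way such a fallback could work using only the standard `ast` module. Treating the bag as a multiset of node type names is an assumption; the repository's implementation may use a plain set or different node features, and the function names here are hypothetical.

```python
import ast
from collections import Counter


def node_bag(source):
    """Multiset of AST node type names for a Python source string."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


def node_bag_jaccard(source_a, source_b):
    """Multiset Jaccard: sum of min counts over sum of max counts."""
    bag_a, bag_b = node_bag(source_a), node_bag(source_b)
    intersection = sum((bag_a & bag_b).values())  # element-wise minimum
    union = sum((bag_a | bag_b).values())         # element-wise maximum
    return intersection / union if union else 1.0


loop_version = "total = 0\nfor x in items:\n    total += x\n"
builtin_version = "total = sum(items)\n"
print(node_bag_jaccard(loop_version, loop_version))     # identical sources
print(node_bag_jaccard(loop_version, builtin_version))  # structurally different
```

Unlike tree-edit distance, this measure ignores how nodes are nested, so two programs with the same node counts but different control-flow shapes score as identical.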
proof_homogeneity_v2.py wraps v1 (proof_homogeneity.py) and adds a temperature sweep, a leave-one-family-out ablation, and a mixed-effects regression. This is the script that produces the paper's main diversity figures.
python proof_homogeneity_v2.py \
--prompts PB-01 PB-02 PB-03 \
--temperatures 0.0 0.7 \
--human-dir results/human_baseline \
--embedder codebert \
--device auto

| Flag | Default | Description |
|---|---|---|
| --prompts ID [ID ...] | PB-01 PB-02 PB-03 | Prompt IDs |
| --temperatures T [T ...] | 1.0 | Temperatures to sweep |
| --embedder NAME | codebert | codet5p, unixcoder, codebert, or none |
| --device | auto | auto, cpu, or cuda |
| --skip-pillar3 | false | Skip functional convergence |
| --skip-family-ablation | false | Skip leave-one-family-out |
| --skip-mixed-effects | false | Skip regression model |
| --dry-run | false | Print config and exit |
Requires: all v1 deps + pip install statsmodels pandas (for mixed-effects model).
Produces (in results/proof_v2/):
| File | Contents |
|---|---|
| summary.json | Master results across all temperatures |
| summary.md | Human-readable v2 report |
| fig_temperature_sweep.png | Vendi (LLM/human) vs. temperature |
| fig_family_ablation.png | Inter-LLM Vendi with each family removed |
| fig_mixed_effects.png | Fixed-effect contrasts vs. human pool |
| t<TEMP>/ | Per-temperature v1 outputs and figures |
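The leave-one-family-out ablation recomputes the inter-LLM diversity with each model family removed in turn. As a rough illustration of that loop, the sketch below uses a toy metric in place of the Vendi Score. The function names, the (family, item) data layout, and the example pool are all hypothetical and not drawn from the script.

```python
def leave_one_family_out(samples, metric):
    """Recompute `metric` with each family's samples removed in turn.

    `samples` is a list of (family, item) pairs; `metric` maps a list of items
    to a number. Both are hypothetical stand-ins for the script's internals.
    """
    families = sorted({family for family, _ in samples})
    results = {}
    for held_out in families:
        remaining = [item for family, item in samples if family != held_out]
        results[held_out] = metric(remaining)
    return results


def distinct_fraction(items):
    """Toy diversity metric: share of distinct items in the pool."""
    return len(set(items)) / len(items) if items else 0.0


pool = [
    ("family_a", "sorted(xs)"),
    ("family_a", "sorted(xs)"),
    ("family_b", "sorted(xs)"),
    ("family_c", "xs.sort()"),
]
for family, score in leave_one_family_out(pool, distinct_fraction).items():
    print(f"without {family}: {score:.2f}")
```

If the score changes sharply when one family is held out, that family is driving the pooled result; a stable score across all hold-outs suggests the finding does not depend on any single provider.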
security_pillar.py evaluates three security pillars: vulnerability rate (S1), cross-model CWE-pattern homogeneity (S2), and slopsquatting (S3).
python security_pillar.py \
--prompts SE-01 SE-02 SE-03 SE-04 SE-05 \
--temperatures 0.0 0.7 \
--bandit \
--pypi-live-check \
--human-dir results/human_baseline \
--out-dir results/proof_security

| Flag | Default | Description |
|---|---|---|
| --prompts ID [ID ...] | all available | Prompt IDs to scan |
| --temperatures T [T ...] | auto-detect | Temperatures to include |
| --bandit | false | Also run Bandit SAST and report consensus |
| --pypi-live-check | false | Live PyPI/npm existence check for S3 |
| --human-dir PATH | (none) | Human baseline for comparison |
| --out-dir PATH | auto-generated | Output directory |
| --from-cache PATH | (none) | Reuse cached scan results |
| --dry-run | false | Print config and exit |
Requires: pip install codeshield (primary backend; falls back to regex scanner), pip install bandit (if --bandit flag used).
Produces (in --out-dir):
| File | Contents |
|---|---|
| summary.json | Full security results |
| summary.md | Human-readable report |
| fig_pillar_s1_vuln_rates.png | Per-model + per-CWE + severity-weighted |
| fig_pillar_s2_homogeneity.png | Vendi/N over CWE-pattern signatures |
| fig_pillar_s3_slopsquatting.png | Deterministic-vs-stochastic rates |
| cwe_signatures.csv | Per-(prompt, model, sample) CWE signatures |
| slop_findings.jsonl | Per-sample hallucinated package names |
run.py --analyze runs the older pipeline.py analysis (intra/inter-model diversity, family clustering, temperature effects) using the 4-layer diversity_metrics.py framework. It is not used for paper figures but is retained for exploratory analysis.

python run.py --analyze

Produces: outputs in results/metrics/ and results/figures/.
┌──────────────────────────────────────────────────────────────┐
│ 1. run.py --collect → results/raw_responses/ │
│ 2. fetch_human_baseline.py → results/human_baseline/ │
│ 3. proof_homogeneity.py → results/proof/v1/ │
│ 4. proof_homogeneity_v2.py → results/proof_v2/ │
│ 5. security_pillar.py → results/proof_security/ │
│ 6. run.py --analyze → results/metrics/ + figures/ │
└──────────────────────────────────────────────────────────────┘
Steps 1-2 produce the data. Steps 3-5 produce the paper's analysis.
Step 6 is optional (legacy exploratory pipeline).
All outputs are written to subdirectories under results/:
| Directory | What's There | How to View |
|---|---|---|
| results/raw_responses/ | One JSONL per prompt, one line per (model, temp, sample) | Any text editor or python -m json.tool --json-lines |
| results/human_baseline/ | Human JSONL files from APPS | Same as above |
| results/proof/v1/ | v1 diversity proof: summary.json, summary.md, PNG figures | Open summary.md for a narrative; open PNGs directly |
| results/proof_v2/ | v2 diversity proof: temperature sweep, ablation, regression | Same; summary.md is the entry point |
| results/proof_security/ | Security pillars: summary.md, CSVs, PNGs | summary.md + cwe_signatures.csv for details |
| results/metrics/ | Legacy pipeline outputs | JSON files |
| results/figures/ | Legacy pipeline plots | PNG files |
The summary.md files are self-contained human-readable reports with all key numbers, statistical tests, and references to the figures.
code_hive/
├── config.py # Model registry (11 models, 6 families), sampling params
├── run.py # CLI entry point (collect / analyze / demo)
├── prompt_suite.py # 60 prompts: 30 diversity + 30 security-eliciting (SE-01..30)
├── multi_model_sampler.py # Async multi-provider API collector (OpenAI, Anthropic, Google, Together)
├── requirements.txt
│
├── proof_homogeneity.py # Three-pillar diversity proof (Vendi Score, 3 kernels, 4 pools)
├── proof_homogeneity_v2.py # V2: temperature sweep, family ablation, mixed-effects model
├── security_pillar.py # Security pillars S1 (vuln rate), S2 (CWE homogeneity), S3 (slopsquatting)
├── security_analysis.py # Regex-based CWE scanner (fallback when CodeShield unavailable)
│
├── fetch_human_baseline.py # APPS dataset keyword-matcher for human reference solutions
├── pypi_check.py # Live PyPI/npm existence checker (Pillar S3)
├── cwe_to_cvss.json # CWE -> CVSS base score severity table
│
├── diversity_metrics.py # Legacy 4-layer metrics (surface, structural, semantic, behavioral)
├── pipeline.py # Legacy intra/inter-model analysis pipeline
│
├── data_prep/ # Dataset curation (Pool A/B from HuggingFace)
├── local_datasets/ # Prompt manifests (JSON)
├── docs/ # Extended documentation
└── results/ # Generated outputs (gitignored)
├── raw_responses/ # Per-prompt JSONL sample files
├── human_baseline/ # Human reference solutions from APPS
├── proof_v2/ # Diversity analysis outputs + figures
└── proof_security/ # Security analysis outputs + figures
The paper's three-kernel Vendi-Score framework measures diversity along independent axes:
| Kernel | Similarity Function | Axis Measured |
|---|---|---|
| Token n-gram Jaccard | Jaccard over 3-gram token sets | Lexical: identifiers, imports, idioms |
| AST tree-edit (APTED) | 1 - normalized tree-edit distance | Structural: control flow, nesting, decomposition |
| CodeT5+ embedding cosine | Cosine of mean-pooled CodeT5+ vectors | Semantic: algorithmic intent and strategy |
Each kernel produces a Vendi Score (effective number of distinct items) normalized by pool size to give per-sample effective uniqueness in [0, 1].
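To make the Vendi/N quantity concrete, the sketch below computes a Vendi Score as the exponential of the Shannon entropy of the eigenvalues of K/n, using a token 3-gram Jaccard kernel. It is dependency-free for illustration: the regex tokenizer is an assumption about how tokens are formed, and the small Jacobi eigenvalue routine stands in for a numerical library call that a production implementation would more likely use.

```python
import math
import re


def token_trigrams(code):
    """Set of 3-grams over a simple regex tokenization (an assumed tokenizer)."""
    tokens = re.findall(r"\w+|[^\w\s]", code)
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def symmetric_eigenvalues(matrix, sweeps=50, tol=1e-12):
    """Eigenvalues of a small symmetric matrix via cyclic Jacobi rotations."""
    n = len(matrix)
    a = [row[:] for row in matrix]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                diff = a[q][q] - a[p][p]
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    a[k][p], a[k][q] = c * a[k][p] - s * a[k][q], s * a[k][p] + c * a[k][q]
                for k in range(n):  # rotate rows p and q
                    a[p][k], a[q][k] = c * a[p][k] - s * a[q][k], s * a[p][k] + c * a[q][k]
    return [a[i][i] for i in range(n)]


def vendi_score(items, similarity):
    """exp of the Shannon entropy of the eigenvalues of K / n."""
    n = len(items)
    kernel = [[similarity(x, y) / n for y in items] for x in items]
    eigenvalues = symmetric_eigenvalues(kernel)
    entropy = -sum(lam * math.log(lam) for lam in eigenvalues if lam > 1e-12)
    return math.exp(entropy)


snippets = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n    return a + b\n",
    "import operator\nadd = operator.add\n",
]
grams = [token_trigrams(s) for s in snippets]
score = vendi_score(grams, jaccard)
print(f"Vendi = {score:.2f}, Vendi/N = {score / len(grams):.2f}")
```

In this toy pool two of the three snippets are identical, so the score falls between 1 (all samples the same) and 3 (all samples mutually dissimilar), and dividing by the pool size gives the per-sample effective uniqueness.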
| Pillar | Measurement | Key Finding |
|---|---|---|
| S1 | Per-model vulnerability rate, CVSS-B weighted | 42.9% overall, mean CVSS-B 6.88 |
| S2 | Cross-model CWE-pattern homogeneity (novel) | 100% exact-pattern match on SQL/XML tasks |
| S3 | Slopsquatting / package hallucination | Near-zero on security prompts |
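One plausible reading of the exact-pattern match in S2 is the share of model pairs that produce an identical set of CWE findings for the same prompt. The sketch below implements that pairwise definition as an illustration only; the paper's actual statistic (reported elsewhere as Vendi/N over CWE-pattern signatures) may be defined differently, and the function name and example findings are hypothetical.

```python
from itertools import combinations


def exact_match_rate(signatures):
    """Fraction of model pairs whose CWE signatures are identical.

    `signatures` maps a model name to the set of CWE IDs found in its output
    for one prompt. This pairwise definition is an assumption for illustration.
    """
    pairs = list(combinations(sorted(signatures), 2))
    if not pairs:
        return 1.0
    same = sum(
        1 for a, b in pairs if frozenset(signatures[a]) == frozenset(signatures[b])
    )
    return same / len(pairs)


# Hypothetical findings for a single SQL-handling prompt, not paper data.
per_model = {
    "model_a": {"CWE-89"},
    "model_b": {"CWE-89"},
    "model_c": {"CWE-89", "CWE-798"},
}
print(exact_match_rate(per_model))
```

Under this definition, a rate of 1.0 would mean every model reproduced exactly the same weakness pattern for the prompt.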
| Category | Count | Description |
|---|---|---|
| OPEN_DESIGN | 5 | "Build X" with many valid designs |
| ALGORITHM | 5 | Multiple valid algorithmic approaches |
| REFACTOR | 5 | Improve given code |
| NAMING | 5 | Tasks where naming is unconstrained |
| CREATIVE_TOOL | 5 | Creative coding tasks |
| SYSTEM_DESIGN | 5 | High-level design decompositions |
| SECURITY_ELICITING | 30 | Neutrally-phrased prompts in security-prone domains (SE-01..30) |
- Effective-sample-size collapse: Vendi/N under three kernels across four pools (null, human, inter-LLM, intra-LLM)
- Paired ordering test: Bootstrap-paired delta showing LLM diversity < human diversity
- Family ablation: Leave-one-family-out robustness check across all six model families
- Temperature sweep: Diversity at T=0.0 vs T=0.7 (gap narrows ~15% but does not close)
- Cross-model vulnerability convergence: Same exploitable (CWE, sink) patterns across providers
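The paired ordering test can be sketched as a bootstrap over per-prompt differences between the LLM and human pools. The example below resamples prompts with replacement and reports a percentile interval for the mean difference. The choice of prompts as the resampling unit, the function name, and the example values are assumptions for illustration, not the script's implementation or results from the paper.

```python
import random


def paired_bootstrap_delta(llm_scores, human_scores, n_boot=1000, seed=0):
    """Bootstrap the mean paired difference (LLM minus human) over prompts."""
    if len(llm_scores) != len(human_scores):
        raise ValueError("score lists must be paired per prompt")
    deltas = [a - b for a, b in zip(llm_scores, human_scores)]
    rng = random.Random(seed)
    n = len(deltas)
    # Each resample draws n per-prompt deltas with replacement.
    means = sorted(
        sum(rng.choice(deltas) for _ in range(n)) / n for _ in range(n_boot)
    )
    lower = means[int(0.025 * n_boot)]
    upper = means[min(int(0.975 * n_boot), n_boot - 1)]
    return sum(deltas) / n, (lower, upper)


# Hypothetical Vendi/N values for five prompts, not results from the paper.
llm = [0.21, 0.25, 0.19, 0.30, 0.24]
human = [0.55, 0.61, 0.48, 0.66, 0.58]
mean_delta, (low, high) = paired_bootstrap_delta(llm, human)
print(f"mean delta = {mean_delta:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

An interval that lies entirely below zero supports the ordering LLM diversity < human diversity; pairing by prompt removes between-prompt variation that would otherwise widen the interval.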