Code Origin Detector is a research-grade toolkit that estimates whether a source file was written by a human developer or by an AI assistant, combining language-aware heuristics, handcrafted features, and statistical models to deliver probabilistic verdicts with explainability.

rasikasrimal/code-origin-detector

Code Origin Detector

A research-grade toolkit for estimating whether a source file was written by a human developer or produced by an AI assistant. The system combines language-aware heuristics, handcrafted features, and statistical models to output probabilistic verdicts with supporting rationale.

Status: early prototype. Ships with a heuristic baseline, stylometry/program-analysis feature scaffolding, a Typer CLI, and a redesigned React dashboard for demonstrations.

Screenshots

The Code Origin Detector provides an intuitive web interface for analyzing code authorship with real-time feedback and detailed explanations.

  • Code Origin Detector dashboard: main analysis interface
  • Code analysis settings and configuration panel
  • Analysis results with heuristic explanations

Key Capabilities

  • AST and stylometry feature extraction for Python and JavaScript with a pluggable pipeline for additional languages.
  • Interpretable heuristic signals (naming entropy, whitespace density, comment coverage, idiom usage) that double as model features and explanations.
  • Statistical model hooks (logistic regression, random forest) with calibrated probabilities and optional stacking.
  • Command-line interface for single files, directories, and benchmark manifests with JSON or pretty outputs.
  • Dataset utilities for reproducible collection, hashing-based deduplication, and manifest-driven experiments.
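
To make the heuristic signals listed above more concrete, here is a minimal sketch of how three of them (naming entropy, whitespace density, comment coverage) might be computed. The function names, the regex-based tokenization, and the exact definitions are assumptions for illustration, not the project's actual implementation.

```python
import math
import re
from collections import Counter

# Assumption: identifier-like tokens are matched with a simple regex,
# so language keywords are counted too. A real pipeline would use the AST.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def naming_entropy(source: str) -> float:
    """Shannon entropy (in bits) of the characters used in identifier-like tokens."""
    chars = Counter("".join(IDENTIFIER.findall(source)))
    total = sum(chars.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in chars.values())


def whitespace_density(source: str) -> float:
    """Fraction of characters in the file that are whitespace."""
    if not source:
        return 0.0
    return sum(ch.isspace() for ch in source) / len(source)


def comment_coverage(source: str, marker: str = "#") -> float:
    """Fraction of non-blank lines that start with the comment marker."""
    lines = [line for line in source.splitlines() if line.strip()]
    if not lines:
        return 0.0
    return sum(line.lstrip().startswith(marker) for line in lines) / len(lines)


def extract_signals(source: str) -> dict:
    """Bundle the signals so they can serve as features and as explanations."""
    return {
        "naming_entropy": naming_entropy(source),
        "whitespace_density": whitespace_density(source),
        "comment_coverage": comment_coverage(source),
    }
```

Because each signal is a plain number with an obvious meaning, the same dictionary can be fed to a statistical model and shown to the user as the rationale for a verdict.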

Repository Layout

README.md
pyproject.toml               # Python package metadata
requirements.txt             # Runtime dependencies for the CLI
.github/workflows/           # Continuous integration workflows
docs/                        # Design notes, prompt catalog, and reports
examples/                    # Sample walkthrough inputs and outputs
frontend/                    # React + Vite dashboard implementation
  public/                    # Static assets served by Vite
  src/                       # UI components, hooks, and utilities
  package.json               # Frontend tooling and scripts
notebooks/                   # Exploration and modelling notebooks
scripts/                     # Data collection and preprocessing jobs
src/                         # Python package (CLI, heuristics, features, models)
  detector/                  # CLI entrypoint and core detection logic
tests/                       # Python unit and integration suites

Getting Started

Backend CLI

python -m pip install -r requirements.txt
python -m pip install -e .
code-origin-detector predict ./path/to/project --include "*.py,*.js" --output-format pretty

The default heuristic_v1 model relies on interpretable rules. To use trained scikit-learn baselines, save a calibrated estimator with helpers in src/detector/models/train.py and point the CLI at the exported Joblib artifact (for example --model artifacts/rf_v1.joblib).
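
As a rough illustration of what exporting a calibrated estimator could look like, the sketch below trains a random forest on synthetic data, wraps it in scikit-learn's CalibratedClassifierCV, and round-trips it through Joblib. It does not use the helpers in src/detector/models/train.py, whose interface is not shown here; the synthetic feature matrix and the temporary output path are assumptions.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Assumption: synthetic data stands in for the real stylometry feature matrix.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Wrap the forest so predict_proba returns calibrated probabilities
# (sigmoid / Platt scaling fitted with 3-fold cross-validation).
forest = RandomForestClassifier(n_estimators=50, random_state=0)
calibrated = CalibratedClassifierCV(forest, method="sigmoid", cv=3)
calibrated.fit(X, y)

# Export the fitted estimator as a Joblib artifact (hypothetical location).
artifact = Path(tempfile.mkdtemp()) / "rf_v1.joblib"
joblib.dump(calibrated, str(artifact))

# Reload it the way a consumer of the artifact would, then score a few rows.
restored = joblib.load(str(artifact))
probabilities = restored.predict_proba(X[:5])
print(probabilities)
```

An artifact written this way is the kind of file the --model flag is described as accepting, although the CLI may expect additional metadata alongside the estimator.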

Frontend Dashboard

cd frontend
npm install
npm run dev

Open the Vite URL (default http://localhost:5173) to explore the detector flow, revised copy, and accessibility patterns. The UI runs client-side heuristics only; wire it to APIs as they become available.

Frontend Experience Highlights

  • AnalyzerPanel includes filename editing, drag-and-drop uploads with validation, curated example tiles, model profile selection, and heuristic overlay toggles.
  • ResultPanel surfaces verdict badges, a confidence meter, heuristics (expandable), limitations callouts, copy/download actions, and a selectable history timeline.
  • Tailwind design tokens in tailwind.config.js extend brand/neutral palettes, card shadows, container widths, and motion-safe focus styles for consistency.
  • Analysis settings (model, heuristics, explanation density) and runtime metadata are persisted per prediction for clearer provenance.

Testing and Quality

  • Python: pytest, pytest --cov, ruff check, and mypy (install the tooling with pip install -e ".[ci]"; the quotes keep shells such as zsh from expanding the brackets).
  • Frontend: npm run lint, npm test, npm run build, and npm run format.

Data Roadmap

  1. Human corpus: Sample pre-2020 commits from vetted OSS repositories (license and language filters). Remove generated or minified artifacts, cap per-repo contributions, and tag as human-authored.
  2. AI corpus: Generate program variants via scripted prompts in docs/prompts.md, recording task descriptions, temperatures, and rewrite strategies for reproducibility.
  3. Splits and dedup: Apply SHA-256 hashing plus 20-token shingles to remove duplicates. Split train/validation/test by repository and maintain a temporal hold-out (post-2024 code).
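
The deduplication and splitting step in item 3 can be sketched as follows. This is a minimal illustration under stated assumptions: the whitespace normalization, the regex tokenizer, the Jaccard threshold of 0.8, the 80/10/10 split, and all function names are hypothetical choices rather than the project's actual pipeline, and the temporal hold-out is not modelled here.

```python
import hashlib
import re


def content_hash(source: str) -> str:
    """SHA-256 of the source after trimming trailing whitespace (exact-duplicate key)."""
    normalized = "\n".join(line.rstrip() for line in source.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def shingles(source: str, k: int = 20) -> set:
    """Set of overlapping k-token windows; tokens are words or single punctuation marks."""
    tokens = re.findall(r"\w+|[^\w\s]", source)
    if len(tokens) < k:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(files: dict, k: int = 20, threshold: float = 0.8) -> list:
    """Return the paths to keep, dropping exact and near duplicates.

    Pairwise comparison is quadratic; a large corpus would need MinHash or
    a similar index instead.
    """
    seen_hashes = set()
    kept = []  # (path, shingle set) pairs for files retained so far
    for path, source in files.items():
        digest = content_hash(source)
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        file_shingles = shingles(source, k)
        if any(jaccard(file_shingles, other) >= threshold for _, other in kept):
            continue  # near duplicate of a file that was already kept
        seen_hashes.add(digest)
        kept.append((path, file_shingles))
    return [path for path, _ in kept]


def split_for_repo(repo: str) -> str:
    """Assign a whole repository to one split so its files never straddle splits."""
    bucket = int(hashlib.sha256(repo.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < 80:
        return "train"
    return "validation" if bucket < 90 else "test"
```

Hashing the repository name instead of sampling files at random keeps every file from a repository in the same split, which is what prevents leakage between train and test.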

Metadata schemas live in data/metadata/schema.json for consistent ingestion across tooling.

Responsible Use

The detector produces advisory signals, not definitive judgments. False positives can occur for human-written code (especially boilerplate or generated scaffolding), and AI-generated code can resemble expert human work. Treat probabilities as guidance to focus manual review, not as an automated gate or policy decision.

Citations and References

Related Research

For research on code authorship attribution and AI-generated code detection, refer to:

  • Burrows et al., "Stylometry and Code Authorship Attribution", Digital Investigations 2016
  • Caliskan-Islam et al., "De-anonymizing Programmers via Code Stylometry", USENIX Security 2015
  • Solaiman et al., "Release Strategies and the Social Impacts of Language Models", arXiv 2019

Citation

If you use Code Origin Detector in your research, please cite:

@software{code_origin_detector,
  title = {Code Origin Detector},
  author = {Code Origin Detector Team},
  year = {2025},
  version = {0.1.0},
  url = {https://github.com/rasikasrimal/code-origin-detector}
}

For the full citation metadata, see CITATION.cff.
