
TrueSignal ⚑

Semantic Talent Intelligence & OSINT Verification Engine

TrueSignal is a production-grade hiring intelligence platform designed to eliminate "Resume Inflation" by verifying candidate claims against real-world digital footprints. Built for a high-intensity hackathon, it leverages semantic search, competitive programming APIs, and Large Language Models (LLMs) to provide an ungameable audit of technical talent.


🚀 Core Features

1. Holistic OSINT Verification 🔍

TrueSignal doesn't just read resumes — it verifies them. It cross-references PDF claims against:

  • GitHub: Evaluates repository distribution, commit density, and code quality via the PyGithub API.
  • LeetCode & Codeforces: Fetches live competitive programming stats to verify algorithmic proficiency.
  • Identity Verification: Flags discrepancies between resume-extracted links and user-provided social handles (OSINT cross-check).

2. Glass-Box Audit (AI Reasoning) 🧠

Traditional matching is a "black box." TrueSignal provides a Glass-Box Audit powered by Groq LLaMA-3:

  • HR Deep Analysis: Career trajectory and mentorship potential evaluation.
  • Verified Skills Justification: Each score is backed by specific evidence found in OSINT or resume artifacts.
  • Critical Gaps: Identifies exactly what the candidate is missing relative to the Job Description.

3. Enterprise RAG Talent Pool 📚

Store and query your entire applicant history semantically.

  • Persistent Vector Store: Uses SentenceTransformers to chunk, embed, and index candidate resumes.
  • Chat with your DB: Ask questions like "Who in our database has the strongest AWS experience?" to retrieve top matches instantly via Retrieval-Augmented Generation.

4. Zero-Mock Architecture 🛡️

  • No Hallucinations: All fallback "mock data" has been removed. If a candidate cannot be verified, the system applies a deterministic penalty rather than inventing a score.
  • Anti-Bias Engine: Aggressive LLM-based anonymization strips names, gender, and demographics before the AI evaluates technical merit.

5. Batch Comparison Mode 📊

  • Upload multiple resumes simultaneously.
  • Auto-extract OSINT footprints (GitHub, LeetCode, Codeforces) from each PDF.
  • Rank candidates deterministically on a competitive leaderboard with interactive score distribution charts.

🧬 Technical Architecture & Algorithms

Matching Pipeline — matching_algorithm.py

TrueSignal's core matching engine uses a three-layer scoring pipeline that fuses semantic, deterministic, and verification signals:

Layer 1: Dense Vector Cosine Similarity (Semantic Match)

| Detail | Value |
| --- | --- |
| Model | all-MiniLM-L6-v2 (384-dim embeddings, ~22M params) |
| Metric | Cosine Similarity |
| Complexity | O(d) per pair, where d = embedding dimension |

The Job Description and candidate texts (GitHub summary + resume) are separately encoded into dense 384-dimensional vectors using a pre-trained Sentence-BERT model. Cosine similarity measures the angular distance between these vectors:

cos(θ) = (A · B) / (‖A‖ × ‖B‖)

Why it's efficient: all-MiniLM-L6-v2 is a distilled model optimized for inference speed — ~5x faster than full BERT with only a ~1% accuracy loss on STS benchmarks. The model is loaded once via @st.cache_resource and reused across all requests, eliminating cold-start overhead.
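The cosine formula above can be sketched without the embedding model itself. A minimal helper (the function name is illustrative, not necessarily the one in matching_algorithm.py, and toy 4-dim vectors stand in for the 384-dim MiniLM embeddings):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(θ) = (A · B) / (‖A‖ × ‖B‖), with a zero-vector guard."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        return 0.0
    return float(np.dot(a, b) / denom)

# Toy stand-ins for the JD and candidate embeddings.
jd_vec = np.array([1.0, 0.5, 0.0, 0.2])
candidate_vec = np.array([0.9, 0.4, 0.1, 0.3])
print(round(cosine_similarity(jd_vec, candidate_vec), 3))  # 0.987
```

In the real pipeline the two vectors come from `model.encode(...)` on the cached Sentence-BERT model; the similarity math itself is the same.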

Layer 2: Shannon Entropy — Polyglot Index

Shannon Entropy quantifies the diversity of a candidate's programming language usage from GitHub:

H(X) = -Σ p(xᵢ) × log₂(p(xᵢ))

| Entropy Score | Classification |
| --- | --- |
| H < 1.0 | Deep Specialist — focused on 1-2 languages |
| 1.0 ≤ H < 2.0 | Versatile Developer — balanced across a few languages |
| H ≥ 2.0 | Broad Generalist — significant spread across many languages |

Why it's efficient: Computed in O(n), where n = number of distinct languages. No ML inference required — pure mathematical signal. Gives recruiters an immediate, objective profile classification.
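The entropy formula and the classification bands above translate directly into code. A short sketch (function and band names are illustrative; the input is the per-language byte-count dictionary GitHub ingestion aggregates):

```python
import math

def polyglot_index(languages: dict) -> float:
    """Shannon entropy H(X) over a GitHub language distribution."""
    total = sum(languages.values())
    if total == 0:
        return 0.0
    probs = (count / total for count in languages.values() if count > 0)
    return -sum(p * math.log2(p) for p in probs)

def classify(h: float) -> str:
    """Map entropy to the profile bands used in the table above."""
    if h < 1.0:
        return "Deep Specialist"
    if h < 2.0:
        return "Versatile Developer"
    return "Broad Generalist"

# Four equally used languages -> H = log2(4) = 2.0
langs = {"Python": 100, "Go": 100, "Rust": 100, "TypeScript": 100}
h = polyglot_index(langs)
print(round(h, 2), classify(h))  # 2.0 Broad Generalist
```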

Layer 3: Jaccard Similarity Index — Keyword Overlap

Measures exact keyword overlap between the JD and candidate text using set-theoretic intersection:

J(A, B) = |A ∩ B| / |A ∪ B|

Words are tokenized using regex (\b[a-zA-Z]{3,}\b), lowercased, and compared as sets.

Why it's efficient: O(n + m) time complexity with Python's built-in hash-set operations. Provides a deterministic, LLM-free baseline that catches keyword matches the neural model might abstract away.
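Using the tokenization regex given above, the whole layer fits in a few lines (the function name is illustrative):

```python
import re

def jaccard_similarity(jd_text: str, candidate_text: str) -> float:
    """J(A, B) = |A ∩ B| / |A ∪ B| over lowercased 3+-letter word sets."""
    def tokenize(text):
        return set(re.findall(r"\b[a-zA-Z]{3,}\b", text.lower()))
    a, b = tokenize(jd_text), tokenize(candidate_text)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Shared tokens {python, aws} over a union of 6 tokens -> 2/6
print(round(jaccard_similarity("Python and AWS required",
                               "Python developer with AWS"), 3))  # 0.333
```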

Anti-Fake Multiplier System

TrueSignal applies a verification-based scoring penalty to prevent resume inflation:

| Verification Status | Multiplier | Effect |
| --- | --- | --- |
| No GitHub + No DSA | 0.4x | 60% penalty — completely unverifiable |
| GitHub only, no DSA | 0.75x | 25% penalty — partial verification |
| GitHub + DSA verified | 1.0x | Full score + DSA boost applied |

DSA Boost Formula:

  • LeetCode: +15 + min(total_solved, 300) / 20 → up to +30 points
  • Codeforces: +15 + min(rating, 2000) / 100 → up to +35 points

Final score: min(100, base_semantic × penalty_multiplier + dsa_boost)
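Putting the multiplier table, boost formulas, and cap together, a minimal sketch (the signature is hypothetical, not the one in matching_algorithm.py; the table does not list the no-GitHub-but-DSA-verified case, so this sketch assumes it scores at 1.0x):

```python
from typing import Optional

def final_score(base_semantic: float,
                has_github: bool,
                leetcode_solved: Optional[int] = None,
                cf_rating: Optional[int] = None) -> float:
    """Anti-fake multiplier, then DSA boost, capped at 100."""
    has_dsa = leetcode_solved is not None or cf_rating is not None
    if not has_github and not has_dsa:
        multiplier = 0.4      # completely unverifiable
    elif has_github and not has_dsa:
        multiplier = 0.75     # partial verification
    else:
        multiplier = 1.0      # fully verified

    boost = 0.0
    if leetcode_solved is not None:
        boost += 15 + min(leetcode_solved, 300) / 20   # up to +30
    if cf_rating is not None:
        boost += 15 + min(cf_rating, 2000) / 100       # up to +35

    return min(100.0, base_semantic * multiplier + boost)

print(final_score(80.0, has_github=True))                       # 60.0
print(final_score(80.0, has_github=True, leetcode_solved=300))  # 100.0
```

The second call would reach 110 uncapped (80 × 1.0 + 30), so the min(100, …) cap kicks in.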


RAG Engine — rag_engine.py

The Enterprise RAG (Retrieval-Augmented Generation) system implements a full vector search pipeline:

Chunking Strategy

Resumes are split into overlapping windows for optimal semantic coverage:

| Parameter | Value |
| --- | --- |
| Chunk Size | 500 characters |
| Overlap | 100 characters |
| Strategy | Sliding window |

Why overlap matters: a 100-char overlap means text near a chunk boundary is repeated at the start of the next chunk, so a skill mentioned at a boundary survives intact in at least one chunk (provided the mention is shorter than the overlap).
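The sliding-window strategy amounts to stepping through the text by chunk_size minus overlap. A minimal sketch (the function name is illustrative, not necessarily rag_engine.py's):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list:
    """Sliding window: each chunk starts (chunk_size - overlap) characters
    after the previous one, so neighbouring chunks share `overlap` chars."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

sample = "x" * 1200
print([len(c) for c in chunk_text(sample)])  # [500, 500, 400]
```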

Vector Search

  1. Query is embedded using the same all-MiniLM-L6-v2 model.
  2. Cosine similarity is computed against all stored chunk embeddings.
  3. Results below a 0.1 relevance threshold are filtered out.
  4. Top-K chunks are retrieved and passed to the LLM for synthesis.
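Steps 2-4 above can be sketched with a few numpy operations (the function name and the toy 2-dim "embeddings" are illustrative; the real store holds 384-dim MiniLM vectors):

```python
import numpy as np

def retrieve_top_k(query_vec, chunk_vecs, chunks, k=3, threshold=0.1):
    """Rank stored chunks by cosine similarity to the query embedding,
    keep the best k, and drop anything under the relevance threshold."""
    q = query_vec / np.linalg.norm(query_vec)
    m = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    sims = m @ q                          # cosine similarity per chunk
    order = np.argsort(sims)[::-1][:k]    # best-first indices
    return [(chunks[i], float(sims[i])) for i in order if sims[i] >= threshold]

chunks = ["AWS and Terraform", "baking recipes", "cloud infrastructure"]
vecs = np.array([[1.0, 0.1], [0.0, 1.0], [0.9, 0.3]])
print(retrieve_top_k(np.array([1.0, 0.0]), vecs, chunks, k=2))
```

The filtered, ranked chunks are then concatenated into the LLM prompt for synthesis.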

LLM Synthesis

  • Model: llama-3.3-70b-versatile via Groq API
  • Temperature: 0.2 (low-creativity, factual retrieval)
  • Max Tokens: 800

Why Groq: Groq's LPU (Language Processing Unit) hardware delivers ~10x faster inference than GPU-based LLM hosting, enabling real-time RAG responses within the Streamlit session.


Glass-Box Audit Engine — audit_engine.py

The audit engine produces a fully explainable AI evaluation:

| Component | Detail |
| --- | --- |
| Model | llama-3.1-8b-instant via Groq |
| Output Format | Strict JSON schema (enforced via response_format) |
| Temperature | 0.1 (near-deterministic for consistency) |

Structured Output Fields:

  • confidence_score — Integer 0-100
  • skill_justifications[] — Per-skill score with evidence citations
  • critical_skills_missing[] — Gaps relative to the JD
  • code_quality_assessment — Commit message quality evaluation
  • hr_deep_analysis — Senior HR-level career trajectory analysis
  • bias_check_status — Whether anonymization was applied

Anti-Bias Anonymization

When enabled, an LLM pre-processing step strips:

  • Candidate names, gender markers
  • University/college names
  • Geographic locations
  • All demographic proxies

This ensures decisions rely 100% on technical artifact evidence.


OSINT Data Ingestion

GitHub — github_ingestion.py

  • Authenticates via GITHUB_TOKEN (Personal Access Token)
  • Fetches the top 6 most recently updated repos (owner-only, sorted by update time)
  • Extracts: repo name, description, language, stars, commit count, and most recent 3 commit messages (for code quality assessment)
  • Aggregates a language distribution dictionary for Shannon Entropy

LeetCode — dsa_ingestion.py

  • Queries LeetCode's unofficial GraphQL API
  • Extracts: Easy / Medium / Hard solve counts, total solved, and global ranking
  • Supports both raw usernames and full profile URLs (auto-cleaned)

Codeforces — dsa_ingestion.py

  • Queries the Codeforces REST API and extracts the user's current rating (which feeds the DSA boost formula)

Resume Parser — resume_parser.py

  • Extracts raw text from PDF using PyPDF2
  • OSINT Auto-Extraction: Regex patterns automatically identify GitHub, LeetCode, and Codeforces profile URLs embedded in the resume
  • Extracted profiles are cross-checked against user-provided inputs for identity verification
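The OSINT auto-extraction step boils down to a handful of URL regexes. A sketch with hypothetical patterns (resume_parser.py's actual regexes may differ, e.g. in which URL variants they accept):

```python
import re

# Hypothetical patterns, one capture group per username.
PROFILE_PATTERNS = {
    "github": r"github\.com/([A-Za-z0-9-]+)",
    "leetcode": r"leetcode\.com/(?:u/)?([A-Za-z0-9_-]+)",
    "codeforces": r"codeforces\.com/profile/([A-Za-z0-9_-]+)",
}

def extract_profiles(resume_text: str) -> dict:
    """Return {site: username} for every profile URL found in the text."""
    found = {}
    for site, pattern in PROFILE_PATTERNS.items():
        match = re.search(pattern, resume_text)
        if match:
            found[site] = match.group(1)
    return found

text = "Profiles: github.com/alice, leetcode.com/u/alice_dev"
print(extract_profiles(text))  # {'github': 'alice', 'leetcode': 'alice_dev'}
```

The extracted handles can then be compared against the user-supplied ones for the identity cross-check.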

πŸ› οΈ Tech Stack

| Layer | Technology |
| --- | --- |
| Frontend | Streamlit (wide layout, dark theme compatible) |
| LLM Intelligence | Groq API — LLaMA-3.3 70B (RAG) + LLaMA-3.1 8B (Audit & Anonymization) |
| Embeddings | Sentence-Transformers (all-MiniLM-L6-v2, 384-dim) |
| Data Ingestion | PyGithub, Codeforces REST API, LeetCode GraphQL |
| Vector Store | Custom pickle-based persistent store with cosine similarity search |
| Deterministic Analytics | scikit-learn (cosine similarity), Shannon Entropy, Jaccard Index |
| PDF Parsing | PyPDF2 |
| Report Export | FPDF (latin-1 compatible PDF generation) |
| Visualization | Plotly Express & Plotly Graph Objects |
| Environment | python-dotenv |

πŸ“ System Flow

┌──────────────┐
│  Resume PDF  │──► PyPDF2 Text Extraction ──► OSINT Regex (GitHub/LC/CF URLs)
└──────────────┘                                       │
                                                       ▼
┌──────────────┐                              ┌──────────────────┐
│  GitHub User │──► PyGithub API ────────────►│                  │
└──────────────┘                              │  MATCHING ENGINE │
┌──────────────┐                              │                  │
│  LeetCode ID │──► GraphQL API ─────────────►│  1. Cosine Sim   │
└──────────────┘                              │  2. Entropy      │
┌──────────────┐                              │  3. Jaccard      │
│ Codeforces ID│──► REST API ────────────────►│  4. Anti-Fake    │
└──────────────┘                              └────────┬─────────┘
                                                       │
                                                       ▼
                                              ┌──────────────────┐
┌──────────────┐                              │  GLASS-BOX AUDIT │
│ Job Descript.│──► Bounded Text (1500 chars)►│  (Groq LLaMA-3)  │
└──────────────┘                              └────────┬─────────┘
                                                       │
                                                       ▼
                                              ┌──────────────────┐
                                              │   Dashboard +    │
                                              │   PDF Export     │
                                              └──────────────────┘

📦 Installation & Setup

  1. Clone & Navigate

    git clone <repo-url>
    cd TrueSignal
  2. Environment Configuration: create a .env file in the root:

    GITHUB_TOKEN="your_github_token"
    GROQ_API_KEY="your_groq_api_key"
  3. Install Dependencies

    pip install -r requirements.txt
  4. Launch

    streamlit run app.py

πŸ“ Project Structure

TrueSignal/
├── app.py                  # Main Streamlit application (535 lines)
├── matching_algorithm.py   # Cosine similarity, Shannon entropy, Jaccard index
├── audit_engine.py         # Groq LLM-powered Glass-Box audit & anonymization
├── rag_engine.py           # Enterprise RAG: chunking, vector store, LLM synthesis
├── github_ingestion.py     # GitHub API data fetcher (repos, commits, languages)
├── dsa_ingestion.py        # LeetCode (GraphQL) & Codeforces (REST) stats
├── resume_parser.py        # PDF text extraction & OSINT URL extraction
├── dataset_manager.py      # Sample JDs & skills taxonomy
├── talent_pool.pkl         # Persistent RAG vector database
├── requirements.txt        # Python dependencies
└── .env                    # API keys (GITHUB_TOKEN, GROQ_API_KEY)

πŸ›‘οΈ Anti-Fake & Verification

TrueSignal implements a strict Anti-Fake Multiplier. Candidates with zero verifiable OSINT footprint (no GitHub, no LeetCode, no Codeforces) are penalized by up to 60%, surfacing only those who demonstrate their skills in the real world. This is not a heuristic — it is a deterministic, mathematically enforced scoring constraint.


📄 License

This project was built for a hackathon. Please reach out before using commercially.
