Semantic Talent Intelligence & OSINT Verification Engine
TrueSignal is a production-grade hiring intelligence platform designed to eliminate "Resume Inflation" by verifying candidate claims against real-world digital footprints. Built for a high-intensity hackathon, it leverages semantic search, competitive programming APIs, and Large Language Models (LLMs) to provide an ungameable audit of technical talent.
TrueSignal doesn't just read resumes; it verifies them. It cross-references PDF claims against:
- GitHub: Evaluates repository distribution, commit density, and code quality via the PyGithub API.
- LeetCode & Codeforces: Fetches live competitive programming stats to verify algorithmic proficiency.
- Identity Verification: Flags discrepancies between resume-extracted links and user-provided social handles (OSINT cross-check).
Traditional matching is a "black box." TrueSignal provides a Glass-Box Audit powered by Groq LLaMA-3:
- HR Deep Analysis: Career trajectory and mentorship potential evaluation.
- Verified Skills Justification: Each score is backed by specific evidence found in OSINT or resume artifacts.
- Critical Gaps: Identifies exactly what the candidate is missing relative to the Job Description.
Store and query your entire applicant history semantically.
- Persistent Vector Store: Uses `SentenceTransformers` to chunk, embed, and index candidate resumes.
- Chat with your DB: Ask questions like "Who in our database has the strongest AWS experience?" to retrieve top matches instantly via Retrieval-Augmented Generation.
- No Hallucinations: All fallback "mock data" has been removed. If a candidate cannot be verified, the system applies a deterministic penalty rather than inventing a score.
- Anti-Bias Engine: Aggressive LLM-based anonymization strips names, gender, and demographics before the AI evaluates technical merit.
- Upload multiple resumes simultaneously.
- Auto-extract OSINT footprints (GitHub, LeetCode, Codeforces) from each PDF.
- Rank candidates deterministically on a competitive leaderboard with interactive score distribution charts.
TrueSignal's core matching engine uses a three-layer scoring pipeline that fuses semantic, deterministic, and verification signals:
| Detail | Value |
|---|---|
| Model | all-MiniLM-L6-v2 (384-dim embeddings, ~22M params) |
| Metric | Cosine Similarity |
| Complexity | O(d) per pair, where d = embedding dimension |
The Job Description and candidate texts (GitHub summary + resume) are separately encoded into dense 384-dimensional vectors using a pre-trained Sentence-BERT model. Cosine similarity measures the angular distance between these vectors:
cos(θ) = (A · B) / (‖A‖ × ‖B‖)
Why it's efficient: `all-MiniLM-L6-v2` is a distilled model optimized for inference speed, roughly 5x faster than full BERT with only about a 1% accuracy loss on STS benchmarks. The model is loaded once via `@st.cache_resource` and reused across all requests, eliminating cold-start overhead.
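The similarity arithmetic can be sketched with plain NumPy. The `SentenceTransformer` usage in the comment shows how the app would produce the vectors; the toy vectors below are purely illustrative stand-ins.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(theta) = (A . B) / (||A|| * ||B||); returns 0.0 for a zero vector."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

# In the app, the vectors would come from the cached model, e.g. (assumed usage):
#   model = SentenceTransformer("all-MiniLM-L6-v2")  # wrapped in @st.cache_resource
#   jd_vec, cand_vec = model.encode([jd_text, candidate_text])
jd_vec = np.array([0.2, 0.8, 0.1])        # stand-in for an embedded JD
cand_vec = np.array([0.25, 0.75, 0.05])   # stand-in for an embedded candidate
print(round(cosine_similarity(jd_vec, cand_vec), 3))
```

Because both vectors are 384-dimensional in practice, each comparison stays O(d) exactly as the table states.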
Shannon Entropy quantifies the diversity of a candidate's programming language usage from GitHub:
H(X) = -Σ p(xᵢ) × log₂(p(xᵢ))
| Entropy Score | Classification |
|---|---|
| H < 1.0 | Deep Specialist: focused on 1-2 languages |
| 1.0 ≤ H < 2.0 | Versatile Developer: balanced across a few languages |
| H ≥ 2.0 | Broad Generalist: significant spread across many languages |
Why it's efficient: Computed in O(n) where n = number of distinct languages. No ML inference required; it is a pure mathematical signal that gives recruiters an immediate, objective profile classification.
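A minimal sketch of the entropy signal. The language-byte dictionary and the thresholds mirror the table above, but the function names are illustrative.

```python
import math

def shannon_entropy(lang_bytes: dict) -> float:
    """H(X) = -sum p_i * log2(p_i) over the GitHub language distribution."""
    total = sum(lang_bytes.values())
    probs = [v / total for v in lang_bytes.values() if v > 0]
    return -sum(p * math.log2(p) for p in probs)

def classify(h: float) -> str:
    """Map an entropy score to the classification table."""
    if h < 1.0:
        return "Deep Specialist"
    if h < 2.0:
        return "Versatile Developer"
    return "Broad Generalist"

# Illustrative profile: 60% Python, 30% Go, 10% Shell (byte counts per language)
profile = {"Python": 6000, "Go": 3000, "Shell": 1000}
h = shannon_entropy(profile)
print(round(h, 3), classify(h))  # → 1.295 Versatile Developer
```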
Measures exact keyword overlap between the JD and candidate text using set-theoretic intersection:
J(A, B) = |A ∩ B| / |A ∪ B|
Words are tokenized using regex (`\b[a-zA-Z]{3,}\b`), lowercased, and compared as sets.
Why it's efficient: O(n + m) time complexity with Python's built-in hash-set operations. Provides a deterministic, LLM-free baseline that catches keyword matches the neural model might abstract away.
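A sketch of this keyword layer under the tokenization rule above; the helper names are illustrative.

```python
import re

def tokenize(text: str) -> set:
    """Lowercased words of 3+ letters, per the \\b[a-zA-Z]{3,}\\b rule."""
    return set(re.findall(r"\b[a-zA-Z]{3,}\b", text.lower()))

def jaccard(jd: str, candidate: str) -> float:
    """J(A, B) = |A ∩ B| / |A ∪ B| over the two token sets."""
    a, b = tokenize(jd), tokenize(candidate)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Shared tokens {python, aws} out of 6 distinct tokens overall
print(round(jaccard("Python and AWS experience",
                    "5 years of Python, AWS, Docker"), 3))  # → 0.333
```

Set intersection and union are hash-based, which is where the O(n + m) cost comes from.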
TrueSignal applies a verification-based scoring penalty to prevent resume inflation:
| Verification Status | Multiplier | Effect |
|---|---|---|
| No GitHub + No DSA | 0.4x | 60% penalty: completely unverifiable |
| GitHub only, no DSA | 0.75x | 25% penalty: partial verification |
| GitHub + DSA verified | 1.0x | Full score + DSA boost applied |
DSA Boost Formula:
- LeetCode: `+15 + min(total_solved, 300) / 20` → up to +30 points
- Codeforces: `+15 + min(rating, 2000) / 100` → up to +35 points
Final score: `min(100, base_semantic × penalty_multiplier + dsa_boost)`
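The penalty table and boost formulas combine as sketched below. The function signature is illustrative, and the multiplier for a DSA-only profile (a combination the table does not list) is assumed here to be 1.0.

```python
def dsa_boost(leetcode_solved=None, cf_rating=None) -> float:
    """Boost per the formulas: capped at +30 (LeetCode) and +35 (Codeforces)."""
    boost = 0.0
    if leetcode_solved is not None:
        boost += 15 + min(leetcode_solved, 300) / 20
    if cf_rating is not None:
        boost += 15 + min(cf_rating, 2000) / 100
    return boost

def final_score(base_semantic: float, has_github: bool, has_dsa: bool,
                leetcode_solved=None, cf_rating=None) -> float:
    if not has_github and not has_dsa:
        multiplier = 0.4    # completely unverifiable: 60% penalty
    elif has_github and not has_dsa:
        multiplier = 0.75   # partial verification: 25% penalty
    else:
        multiplier = 1.0    # verified (DSA-only treated as full: an assumption)
    boost = dsa_boost(leetcode_solved, cf_rating) if has_dsa else 0.0
    return min(100.0, base_semantic * multiplier + boost)
```

For example, a base semantic score of 80 with no footprint drops to 32, while the same base with 400 LeetCode solves earns the full +30 boost and caps at 100.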
The Enterprise RAG (Retrieval-Augmented Generation) system implements a full vector search pipeline:
Resumes are split into overlapping windows for optimal semantic coverage:
| Parameter | Value |
|---|---|
| Chunk Size | 500 characters |
| Overlap | 100 characters |
| Strategy | Sliding window |
Why overlap matters: a 100-character overlap means text near a chunk boundary reappears intact at the start of the next chunk, so a skill mentioned at a boundary is very unlikely to be split in half in every chunk that contains it.
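A sliding-window chunker matching these parameters might look like the following sketch (the helper name is illustrative):

```python
def chunk_text(text: str, size: int = 500, overlap: int = 100) -> list:
    """Split text into `size`-char windows; each window starts
    (size - overlap) chars after the previous one, so consecutive
    chunks share `overlap` characters."""
    if not text:
        return []
    step = size - overlap  # 400 with the defaults
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

With the defaults, a 1000-character resume yields three chunks, and the last 100 characters of each chunk reappear at the start of the next.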
- Query is embedded using the same `all-MiniLM-L6-v2` model.
- Cosine similarity is computed against all stored chunk embeddings.
- Results below a 0.1 relevance threshold are filtered out.
- Top-K chunks are retrieved and passed to the LLM for synthesis.
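The retrieval steps above can be sketched with NumPy, assuming chunk embeddings are stored as a 2-D array; the function name and argument layout are illustrative.

```python
import numpy as np

def retrieve(query_vec, chunk_vecs, chunks, top_k=3, threshold=0.1):
    """Rank stored chunks by cosine similarity to the query, drop weak hits."""
    q = np.asarray(query_vec, dtype=float)
    q = q / np.linalg.norm(q)
    m = np.asarray(chunk_vecs, dtype=float)
    m = m / np.linalg.norm(m, axis=1, keepdims=True)
    scores = m @ q                      # cosine similarity per stored chunk
    order = np.argsort(scores)[::-1]    # best first
    return [(chunks[i], float(scores[i]))
            for i in order[:top_k] if scores[i] >= threshold]
```

Normalizing both sides first reduces cosine similarity to a single matrix-vector product, which is why the whole store can be scanned per query.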
- Model: `llama-3.3-70b-versatile` via Groq API
- Temperature: 0.2 (low-creativity, factual retrieval)
- Max Tokens: 800
Why Groq: Groq's LPU (Language Processing Unit) hardware delivers ~10x faster inference than GPU-based LLM hosting, enabling real-time RAG responses within the Streamlit session.
The audit engine produces a fully explainable AI evaluation:
| Component | Detail |
|---|---|
| Model | llama-3.1-8b-instant via Groq |
| Output Format | Strict JSON schema (enforced via `response_format`) |
| Temperature | 0.1 (near-deterministic for consistency) |
Structured Output Fields:
- `confidence_score` — Integer 0-100
- `skill_justifications[]` — Per-skill score with evidence citations
- `critical_skills_missing[]` — Gaps relative to the JD
- `code_quality_assessment` — Commit message quality evaluation
- `hr_deep_analysis` — Senior HR-level career trajectory analysis
- `bias_check_status` — Whether anonymization was applied
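One way the strict schema might be enforced on the consuming side, sketched with the field names above; the validation helper itself is an assumption, not the app's actual code.

```python
import json

# Expected top-level type for each field in the audit schema
REQUIRED_FIELDS = {
    "confidence_score": int,
    "skill_justifications": list,
    "critical_skills_missing": list,
    "code_quality_assessment": str,
    "hr_deep_analysis": str,
    "bias_check_status": str,
}

def parse_audit(raw_json: str) -> dict:
    """Parse the LLM response and fail fast on any schema violation."""
    data = json.loads(raw_json)
    for field, typ in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), typ):
            raise ValueError(f"Bad or missing field: {field}")
    if not 0 <= data["confidence_score"] <= 100:
        raise ValueError("confidence_score out of range")
    return data
```

Failing fast here is what keeps the near-deterministic (temperature 0.1) output trustworthy downstream.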
When enabled, an LLM pre-processing step strips:
- Candidate names, gender markers
- University/college names
- Geographic locations
- All demographic proxies
This ensures decisions rely 100% on technical artifact evidence.
- Authenticates via `GITHUB_TOKEN` (Personal Access Token)
- Fetches the top 6 most recently updated repos (owner-only, sorted by update time)
- Extracts: repo name, description, language, stars, commit count, and most recent 3 commit messages (for code quality assessment)
- Aggregates a language distribution dictionary for Shannon Entropy
- Queries LeetCode's unofficial GraphQL API
- Extracts: Easy / Medium / Hard solve counts, total solved, and global ranking
- Supports both raw usernames and full profile URLs (auto-cleaned)
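A hedged sketch of the handle cleaning and response flattening. The GraphQL response shape shown here is an assumption about the commonly used unofficial endpoint, not a documented contract, and the function names are illustrative.

```python
def clean_handle(raw: str) -> str:
    """Accept either 'jane_doe' or 'https://leetcode.com/u/jane_doe/'."""
    return raw.strip().rstrip("/").rsplit("/", 1)[-1]

def parse_leetcode_stats(resp: dict) -> dict:
    """Flatten the assumed acSubmissionNum list into per-difficulty counts."""
    rows = resp["data"]["matchedUser"]["submitStatsGlobal"]["acSubmissionNum"]
    counts = {row["difficulty"]: row["count"] for row in rows}
    return {
        "easy": counts.get("Easy", 0),
        "medium": counts.get("Medium", 0),
        "hard": counts.get("Hard", 0),
        "total": counts.get("All", 0),
    }

# Illustrative response in the assumed shape
sample = {"data": {"matchedUser": {"submitStatsGlobal": {"acSubmissionNum": [
    {"difficulty": "All", "count": 250},
    {"difficulty": "Easy", "count": 120},
    {"difficulty": "Medium", "count": 100},
    {"difficulty": "Hard", "count": 30},
]}}}}
print(parse_leetcode_stats(sample))  # → {'easy': 120, 'medium': 100, 'hard': 30, 'total': 250}
```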
- Queries Codeforces' official REST API
- Extracts: Current rating, max rating, and rank title (e.g., "Expert", "Candidate Master")
- Extracts raw text from PDF using PyPDF2
- OSINT Auto-Extraction: Regex patterns automatically identify GitHub, LeetCode, and Codeforces profile URLs embedded in the resume
- Extracted profiles are cross-checked against user-provided inputs for identity verification
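The auto-extraction and identity cross-check might be sketched as follows; these regex patterns are simplified stand-ins for the app's actual patterns.

```python
import re

# Simplified, illustrative profile-URL patterns (not the app's exact regexes)
OSINT_PATTERNS = {
    "github": r"github\.com/([A-Za-z0-9-]+)",
    "leetcode": r"leetcode\.com/(?:u/)?([A-Za-z0-9_-]+)",
    "codeforces": r"codeforces\.com/profile/([A-Za-z0-9_-]+)",
}

def extract_osint(resume_text: str) -> dict:
    """Pull the first matching handle for each site out of the resume text."""
    found = {}
    for site, pattern in OSINT_PATTERNS.items():
        m = re.search(pattern, resume_text)
        if m:
            found[site] = m.group(1)
    return found

def cross_check(extracted: dict, provided: dict) -> list:
    """Flag sites where the resume link and the user-entered handle disagree."""
    return [s for s in extracted
            if s in provided and extracted[s].lower() != provided[s].lower()]
```

A mismatch list returned by `cross_check` is exactly the kind of discrepancy the identity-verification step surfaces.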
| Layer | Technology |
|---|---|
| Frontend | Streamlit (wide layout, dark theme compatible) |
| LLM Intelligence | Groq API: LLaMA-3.3 70B (RAG) + LLaMA-3.1 8B (Audit & Anonymization) |
| Embeddings | Sentence-Transformers (all-MiniLM-L6-v2, 384-dim) |
| Data Ingestion | PyGithub, Codeforces REST API, LeetCode GraphQL |
| Vector Store | Custom pickle-based persistent store with cosine similarity search |
| Deterministic Analytics | scikit-learn (cosine similarity), Shannon Entropy, Jaccard Index |
| PDF Parsing | PyPDF2 |
| Report Export | FPDF (latin-1 compatible PDF generation) |
| Visualization | Plotly Express & Plotly Graph Objects |
| Environment | python-dotenv |
```
┌──────────────┐
│  Resume PDF  │──► PyPDF2 Text Extraction ──► OSINT Regex (GitHub/LC/CF URLs)
└──────────────┘                                           │
                                                           ▼
┌──────────────┐                                ┌────────────────────┐
│ GitHub User  │──► PyGithub API ─────────────► │                    │
└──────────────┘                                │  MATCHING ENGINE   │
┌──────────────┐                                │                    │
│ LeetCode ID  │──► GraphQL API ──────────────► │  1. Cosine Sim     │
└──────────────┘                                │  2. Entropy        │
┌──────────────┐                                │  3. Jaccard        │
│ Codeforces ID│──► REST API ─────────────────► │  4. Anti-Fake      │
└──────────────┘                                └─────────┬──────────┘
                                                          │
                                                          ▼
                                                ┌────────────────────┐
┌──────────────┐                                │  GLASS-BOX AUDIT   │
│ Job Descript.│──► Bounded Text (1500 chars) ─►│  (Groq LLaMA-3)    │
└──────────────┘                                └─────────┬──────────┘
                                                          │
                                                          ▼
                                                ┌────────────────────┐
                                                │    Dashboard +     │
                                                │     PDF Export     │
                                                └────────────────────┘
```
1. Clone & Navigate

   ```bash
   git clone <repo-url>
   cd TrueSignal
   ```

2. Environment Configuration: create a `.env` file in the root:

   ```
   GITHUB_TOKEN="your_github_token"
   GROQ_API_KEY="your_groq_api_key"
   ```

3. Install Dependencies

   ```bash
   pip install -r requirements.txt
   ```

4. Launch

   ```bash
   streamlit run app.py
   ```
```
TrueSignal/
├── app.py                 # Main Streamlit application (535 lines)
├── matching_algorithm.py  # Cosine similarity, Shannon entropy, Jaccard index
├── audit_engine.py        # Groq LLM-powered Glass-Box audit & anonymization
├── rag_engine.py          # Enterprise RAG: chunking, vector store, LLM synthesis
├── github_ingestion.py    # GitHub API data fetcher (repos, commits, languages)
├── dsa_ingestion.py       # LeetCode (GraphQL) & Codeforces (REST) stats
├── resume_parser.py       # PDF text extraction & OSINT URL extraction
├── dataset_manager.py     # Sample JDs & skills taxonomy
├── talent_pool.pkl        # Persistent RAG vector database
├── requirements.txt       # Python dependencies
└── .env                   # API keys (GITHUB_TOKEN, GROQ_API_KEY)
```
TrueSignal implements a strict Anti-Fake Multiplier. Candidates with zero verifiable OSINT footprint (no GitHub, no LeetCode, no Codeforces) are penalized by up to 60%, surfacing only those who demonstrate their skills in the real world. This is not a heuristic; it is a deterministic, mathematically enforced scoring constraint.
This project was built for a hackathon. Please reach out before using commercially.