A flexible user profile matching system that uses LLM embeddings, HyDE (Hypothetical Document Embeddings), and directional cross-section similarity to create meaningful connections between people. The system matches users where one person's skills can directly address the other's project needs.
Loose ends from the HyDE / directional-matching work, to pick up later (roughly highest-impact first):
- Existing groups need `--force`. Embedding cache keys on (user set, section names); the section rename invalidates old caches anyway, but don't trust a stale `data/{group}/embeds`.
- Commit hygiene. This work-in-progress mixes the feature with model repricing, a Modal signature change, and regenerated `README.md`/`analysis_report.md` (the latter looks like a run artifact; decide if it belongs in git or `.gitignore`). Consider splitting doc regen from code before pushing.
Resolved: starter-topics bullet char unified on •; cross-key parsing consolidated into utils.parse_cross_key (now ->-aware for multi-word sections); error-path intros use the dual-direction format; deploy_modal.py dead group_name param removed.
- Directional Cross-Matching: Asymmetric need-to-skill matching — "how well can B help A?" is computed independently from "how well can A help B?"
- HyDE Vocabulary Bridging: Automatically bridges the semantic gap between needs ("make my installation respond to movement") and skills ("computer vision, sensor integration") using Hypothetical Document Embeddings
- Configurable Sections: Active/inactive flags on extraction sections — switch between use-cases (need/skill matching vs. social connectivity) via config alone
- Multi-Signal Blending: Combines directional embedding similarity with LLM pair scoring
- Smart Budgeting: Configurable LLM call limits and hash-based caching at every step
- B-Matching Algorithm: Ensures fair degree distribution across users
- Directional Introductions: Each person learns what the other can specifically offer their project
- Extensible: All behavior changes live in config — no code edits needed to switch matching modes
- Setup Environment

      cp .env.example .env
      # Add your OpenRouter API key to .env (https://openrouter.ai/settings/keys)
      pip install -e .
- Add User Profiles (one `.txt` file per user; filename becomes the user ID, e.g. `alice.txt` -> "alice")
  - Folder mode: keep them in any folder you like and point Choreo at it with `--input` (below).
  - Group mode: place them in `data/{group_name}/raw/` and use `--group`.
- Configure Matching
  - Edit `config/config.yaml` to adjust models, budgets, weights, and matching parameters
  - Modify `config/section_prompt.yaml` to customize profile sections (with `active` flags)
  - Update `config/scoring_prompt.yaml` and `config/introduction_prompt.yaml` for prompts
  - Adjust `config/hyde_prompt.yaml` to tune HyDE descriptor generation
- Run Matching

      # Folder mode: group name derived from the folder; outputs go inside it.
      python main.py --input /path/to/folder --force
      # Group mode: reads data/<group>/raw, writes to data/<group>/.
      python main.py --group <group_name> --force
- View Results (under `<folder>/outputs/` in folder mode, or `data/{group_name}/outputs/` in group mode)
  - Individual reports: `…/outputs/{user_id}.md`
  - Cohort summary: `…/outputs/cohort.json`
  - Visualizations: `…/outputs/plots/`
The system implements a 9-step pipeline that transforms raw user profiles into directional, skill-need-aware connections:
- Load raw text files from `data/{group}/raw/` (one `.txt` file per user)
- Each filename becomes a user ID
- Content hashing for change detection
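The loading and change-detection bullets above can be sketched roughly as follows. This is a minimal illustration, assuming SHA-256 content hashes; `load_profiles` and `changed_users` are hypothetical helpers, not the actual `src/ingest.py` API.

```python
import hashlib
from pathlib import Path


def load_profiles(raw_dir):
    """Return {user_id: {"text": ..., "hash": ...}} for every .txt file in raw_dir."""
    profiles = {}
    for path in sorted(Path(raw_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8").strip()
        profiles[path.stem] = {  # filename without extension is the user ID
            "text": text,
            # The hash changes only when the profile content changes.
            "hash": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        }
    return profiles


def changed_users(profiles, previous_hashes):
    """List user IDs whose content hash differs from the one stored on a previous run."""
    return [uid for uid, p in profiles.items() if previous_hashes.get(uid) != p["hash"]]
```

Only the users returned by `changed_users` would need to go through extraction again; everyone else can be served from cache.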
- Use LLM to extract structured sections from each profile
- Only active sections are extracted (controlled via the `active` flag in `section_prompt.yaml`)
- Default active sections for need/skill matching:
  - Skills: Concrete tools, techniques, and expertise the person can contribute
  - Vision: Broader direction (values, long-term interests, collaboration style)
  - Project: Current project description (what they're building, its state, and next step)
  - Needs: Project gaps framed in concrete skill vocabulary (wanted-ad style, so they embed close to others' skills)
- Smart caching prevents re-processing unchanged profiles
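As an illustration of active-section filtering, the sketch below picks the active section names out of an already-parsed `section_prompt.yaml`. The helper name `active_sections` is hypothetical, and treating a missing `active` flag as inactive is an assumption rather than documented behavior.

```python
def active_sections(section_config):
    """Return names of sections whose `active` flag is true, in config order."""
    return [
        name
        for name, spec in section_config.get("sections", {}).items()
        if spec.get("active", False)  # assumption: a missing flag means inactive
    ]


# Mirrors the shape of section_prompt.yaml after parsing (guidelines elided).
config = {
    "sections": {
        "capabilities": {"active": False, "guideline": "..."},
        "skills": {"active": True, "guideline": "..."},
        "needs": {"active": True, "guideline": "..."},
    }
}
print(active_sections(config))  # ['skills', 'needs']
```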
- Only runs when `cross_section_weights` are configured (e.g., `needs_skills: 0.85`)
- For each user's needs, an LLM generates a hypothetical skill descriptor: text written in the vocabulary of the target section (skills), describing what the ideal helper's profile would look like
- This bridges the vocabulary gap: "make my installation respond to audience movement" becomes a skill-vocabulary descriptor that will have high embedding similarity with "computer vision, motion sensors, interactive installations"
- Produces a list of `n_descriptors` HyDE phrasings per user (default 1, configurable)
- Results are cached per user keyed by source text hash
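A rough sketch of the generate-and-cache behavior described above. The function `hyde_descriptors`, the JSON cache layout, and the `generate` callable (a stand-in for the real LLM call) are all assumptions made for illustration; the actual `src/hyde.py` may differ.

```python
import hashlib
import json
from pathlib import Path


def hyde_descriptors(user_id, needs_text, generate, cache_dir, n_descriptors=1):
    """Return a list of HyDE descriptors, reusing the cache when the needs text is unchanged."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{user_id}.json"
    source_hash = hashlib.sha256(needs_text.encode("utf-8")).hexdigest()

    if cache_file.exists():
        cached = json.loads(cache_file.read_text(encoding="utf-8"))
        if cached["source_hash"] == source_hash and len(cached["descriptors"]) == n_descriptors:
            return cached["descriptors"]

    # `generate` stands in for the LLM call: needs text -> one skill-vocabulary descriptor.
    descriptors = [generate(needs_text) for _ in range(n_descriptors)]
    cache_file.write_text(
        json.dumps({"source_hash": source_hash, "descriptors": descriptors}),
        encoding="utf-8",
    )
    return descriptors
```

The return value is always a list, even for `n_descriptors=1`, so downstream code does not need a separate single-descriptor path.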
- Generate vector embeddings for each user's active sections -> 3D tensor `(n_users, n_sections, embedding_dim)`
- Additionally embed HyDE descriptors into a separate tensor per cross-section pair: `(n_users, n_descriptors, embedding_dim)`
- Uses OpenRouter embedding models (default: `google/gemini-embedding-2-preview`)
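To make the two tensor shapes concrete, here is a small NumPy sketch. `build_tensors` is a hypothetical helper, and `toy_embed` is a placeholder that only mimics the shape of a real embedding call (it is not the OpenRouter API and expects non-empty text).

```python
import numpy as np


def build_tensors(sections_by_user, hyde_by_user, section_names, embed):
    """Stack per-section and HyDE embeddings into two tensors with users on axis 0."""
    users = sorted(sections_by_user)
    # Shape: (n_users, n_sections, embedding_dim)
    section_tensor = np.stack(
        [np.stack([embed(sections_by_user[u][s]) for s in section_names]) for u in users]
    )
    # Shape: (n_users, n_descriptors, embedding_dim)
    hyde_tensor = np.stack(
        [np.stack([embed(d) for d in hyde_by_user[u]]) for u in users]
    )
    return users, section_tensor, hyde_tensor


def toy_embed(text, dim=8):
    """Placeholder embedder: a unit vector derived from the text's character codes."""
    vec = np.zeros(dim)
    for i, ch in enumerate(text):
        vec[i % dim] += ord(ch)
    return vec / np.linalg.norm(vec)
```

With two users, two sections, and one descriptor each, `build_tensors` returns arrays of shape `(2, 2, 8)` and `(2, 1, 8)`.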
- Generate t-SNE plots showing user clusters in embedding space per section
- Same-section similarity (symmetric): cosine similarity within each section (e.g., project-to-project)
- Cross-section similarity (ASYMMETRIC): HyDE-bridged needs vs. regular skills embeddings
  - `cross_sim[i][j]` = "how well can j's skills address i's needs" (using i's HyDE descriptor vs. j's skill embedding)
  - `cross_sim[j][i]` = "how well can i's skills address j's needs" (different value!)
  - With `n_descriptors > 1`, max-pooling finds the best-matching descriptor pair
- Weighted fusion produces a directional fused matrix
- A symmetric version `(dir + dir.T) / 2` is derived for b-matching and candidate selection
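The directional similarity and fusion described above might look like the following in NumPy. Function names and argument layout are illustrative assumptions, not the actual `src/candidate.py` interface, and self-pairs on the diagonal are not masked here.

```python
import numpy as np


def cosine_matrix(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T


def directional_cross_sim(hyde_tensor, skills):
    """cross[i, j] = best cosine between any of i's HyDE descriptors and j's skills."""
    n_users, n_desc, dim = hyde_tensor.shape
    sims = cosine_matrix(hyde_tensor.reshape(-1, dim), skills)  # (n_users * n_desc, n_users)
    return sims.reshape(n_users, n_desc, n_users).max(axis=1)   # max-pool over descriptors


def fuse(same_section, section_weights, cross, cross_weight):
    """Weighted sum of symmetric same-section matrices plus the directional cross matrix."""
    directional = cross_weight * cross
    for name, weight in section_weights.items():
        directional = directional + weight * same_section[name]
    symmetric = (directional + directional.T) / 2  # used only for candidate selection and matching
    return directional, symmetric
```

Keeping `directional` and `symmetric` as separate outputs preserves the asymmetry for introductions while giving the matching layer an order-independent score.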
- Candidate pairs selected using the symmetric similarity matrix
- LLM evaluates each pair holistically — produces a single score per pair (not directional)
- The embedding-level asymmetry handles directionality; the LLM excels at holistic "is this a good match?" judgment
- Batch processing for efficiency
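The text does not say how candidates are picked from the symmetric matrix. One plausible approach, shown purely as an assumption, is to take each user's top-k partners and cap the total to respect the LLM call budget; `candidate_pairs`, `top_k`, and `max_pairs` are hypothetical names.

```python
def candidate_pairs(users, sym, top_k=3, max_pairs=None):
    """Pick each user's top_k partners by symmetric similarity, deduplicated as sorted ID pairs."""
    scored = {}
    for i, u in enumerate(users):
        ranked = sorted(
            (j for j in range(len(users)) if j != i),  # skip self-pairs
            key=lambda j: sym[i][j],
            reverse=True,
        )
        for j in ranked[:top_k]:
            pair = tuple(sorted((u, users[j])))
            scored[pair] = sym[i][j]
    # Highest-similarity pairs first, optionally capped to respect an LLM call budget.
    ordered = sorted(scored, key=scored.get, reverse=True)
    return ordered[:max_pairs] if max_pairs is not None else ordered
```

Each returned pair would then be sent to the pair-scoring LLM once, since that score is not directional.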
- Blend normalized embedding scores + LLM scores: `final = embed_weight * embed + llm_weight * llm`
- Run greedy b-matching on symmetric blended scores
- Every user gets between `b_min` and `b_max` connections
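A minimal sketch of blending followed by greedy b-matching. Min-max normalization is an assumption (the text only says "normalized"), and this simplified version reports users left below `b_min` instead of guaranteeing the lower bound the way the real `src/match.py` presumably does.

```python
def min_max(scores):
    """Rescale a {pair: score} dict to [0, 1]; constant inputs map to 0.0."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {p: (s - lo) / span if span else 0.0 for p, s in scores.items()}


def blend(embed_scores, llm_scores, embed_weight=0.35, llm_weight=0.65):
    """final = embed_weight * embed + llm_weight * llm, computed on normalized scores."""
    e, l = min_max(embed_scores), min_max(llm_scores)
    return {p: embed_weight * e[p] + llm_weight * l[p] for p in e}


def greedy_b_matching(final_scores, b_min=2, b_max=4):
    """Accept pairs best-first while both users are under b_max; report users below b_min."""
    ranked = sorted(final_scores, key=final_scores.get, reverse=True)
    degree = {u: 0 for pair in ranked for u in pair}
    matched = []
    for a, b in ranked:
        if degree[a] < b_max and degree[b] < b_max:
            matched.append((a, b))
            degree[a] += 1
            degree[b] += 1
    under_served = [u for u, d in degree.items() if d < b_min]
    return matched, under_served
```

Both score dicts are expected to share the same pair keys, for example the sorted ID tuples produced during candidate selection.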
- For each matched pair, generate directional introductions:
  - `intro_for_a`: What person B can specifically offer person A's project
  - `intro_for_b`: What person A can specifically offer person B's project
- Plus concrete starter topics for collaboration
- Per-user markdown reports with directional match reasoning
- Cohort summary JSON with network statistics
- Similarity heatmaps for same-section and cross-section matrices
- Score correlation plots (embedding vs. LLM scores)
All models are routed through OpenRouter (use provider/model slugs):
    models:
      embedding: "google/gemini-embedding-2-preview"
      embedding_dimensions: 768   # MRL truncation; null = full native size (3072)
      extraction_llm: "google/gemini-3.1-flash-lite"
      pair_llm: "google/gemini-3.1-flash-lite"
      enable_reasoning: false     # true only for reasoning-capable models
    instruction_prompt:
      goal: "We are matching community residents who are working on their finals projects..."
    hyde:
      n_descriptors: 1            # HyDE phrasings per source section (ready for >1 in future)
    recipe:
      section_weights:            # Same-section similarity weights (negative = dissimilarity preferred)
        skills: -0.10
        vision: 0.30
        project: 0.10
        needs: 0.00
      cross_section_weights:      # Cross-section similarity weights (DIRECTIONAL)
        needs_skills: 0.80        # A's needs vs B's skills
    blending:
      embed_weight: 0.35
      llm_weight: 0.65
    matching:
      b_min: 2
      b_max: 4

Each section has an active flag:
    sections:
      capabilities:
        active: false   # Deactivated for need/skill mode
        guideline: "..."
      skills:
        active: true    # Active for need/skill mode
        guideline: "..."

To switch from need/skill matching to a symmetric social-connectivity mode, only config changes are needed:
- In `section_prompt.yaml`: define/activate the sections you want to match on (e.g. interests, goals, persona) via their `active` flags, and deactivate the rest.
- In `config.yaml`: set `section_weights` for those sections; remove/empty `cross_section_weights`.
- Swap scoring and introduction prompts to match the new framing.
No Python code changes required. When cross_section_weights is empty, no HyDE step runs and the pipeline operates in fully symmetric mode.
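The mode switch can be pictured as a single check on the parsed config. The sketch below assumes `cross_section_weights` lives under `recipe`, as the sample config suggests; `matching_mode` and the `interests` weight are hypothetical.

```python
def matching_mode(config):
    """Return "directional" when any cross-section weight is configured, else "symmetric"."""
    recipe = config.get("recipe") or {}
    cross = recipe.get("cross_section_weights") or {}  # absent, None, or {} all count as empty
    return "directional" if cross else "symmetric"


need_skill = {"recipe": {"cross_section_weights": {"needs_skills": 0.80}}}
social = {"recipe": {"section_weights": {"interests": 0.5}, "cross_section_weights": None}}
print(matching_mode(need_skill))  # directional
print(matching_mode(social))      # symmetric
```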
main.py # Pipeline orchestration
├── src/ingest.py # Profile loading & validation
├── src/extract.py # LLM section extraction (active-section filtering)
├── src/hyde.py # HyDE descriptor generation (vocabulary bridging)
├── src/embed.py # Multi-section + HyDE embedding generation
├── src/candidate.py # Directional similarity fusion & candidate generation
├── src/score.py # Batched LLM pair scoring
├── src/match.py # Greedy b-matching algorithm
├── src/introduction.py # Directional introduction generation
├── src/report.py # Report generation & templating
├── src/visualize_similarity.py # Similarity heatmaps
├── src/tsne.py # t-SNE visualization
├── src/llm.py # LLM wrapper with caching & cost tracking
├── src/utils.py # Utilities, score normalization, I/O helpers
└── src/cost_tracker.py # API cost tracking
config/
├── config.yaml # Main pipeline configuration
├── section_prompt.yaml # Section definitions with active flags
├── scoring_prompt.yaml # LLM pair scoring prompt
├── introduction_prompt.yaml # Directional introduction prompt
└── hyde_prompt.yaml # HyDE descriptor generation prompt
- Pair IDs: Always alphabetically sorted for stability (`alice_bob` not `bob_alice`); see `utils.stable_pair_id()`
- Caching: Hash-based change detection at every step (extraction, HyDE, embeddings) prevents re-processing
- Lists by default: HyDE descriptors are always stored as lists, even with `n_descriptors=1`. Same code path handles 1 or many descriptors.
- Directionality preserved: The asymmetric cross-section matrix is never symmetrized during computation; symmetry is an aggregation choice made at the matching layer
- Backward compatible: When `cross_section_weights` is absent/empty, the pipeline behaves identically to the original symmetric mode
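For the pair-ID convention, a helper along these lines would give the same ID regardless of argument order. This is a guess at the behavior of `utils.stable_pair_id()`; its real signature and separator handling are not shown in this document.

```python
def stable_pair_id(user_a, user_b):
    """Join two user IDs in alphabetical order so either argument order gives the same ID."""
    first, second = sorted((user_a, user_b))
    return f"{first}_{second}"


print(stable_pair_id("bob", "alice"))  # alice_bob
print(stable_pair_id("alice", "bob"))  # alice_bob
```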
- Python 3.9+
- OpenRouter API key (in `.env` as `OPENROUTER_API_KEY`)
- See `pyproject.toml` for full dependency list
# Local
python main.py --group <group_name> --force
# Modal (serverless)
modal deploy deploy_modal.py
modal run deploy_modal.py::run_matching_pipeline --user-profiles-json=profiles.json