# markrel: Markov Chain Document Relevance

School your documents with Markov chains!
## What is markrel?

A fast, interpretable Python library that uses Markov chains to predict document relevance. Like a school of mackerel navigating the seas, markrel traces probabilistic paths through similarity space to find the most relevant documents.
- Overview
- Quick Start
- Tutorial
- Why markrel?
- Benchmarks
- Installation
- How It Works
- Advantages & Use Cases
- Limitations
- API Reference
- License
## Overview

markrel predicts whether a document is relevant to a query using Markov chains and similarity metrics. It's designed for:

- **Semantic search re-ranking**: filter top-k results with learned relevance
- **Document classification**: sort documents by relevance to topics
- **Response selection**: pick the best answer from a candidate pool
- **High-throughput filtering**: process 50K+ documents per second
| Feature | Description |
|---|---|
| 8 similarity metrics | Cosine, Euclidean, Jaccard, Overlap, Dice, Manhattan, Chebyshev, Dot Product |
| Markov chain learning | Learns P(relevance) from your data, not generic rules |
| 3 optimization modes | Tune for F1, Recall, or Precision based on your needs |
| Fast inference | 50K+ samples per second after training |
| Embedding agnostic | Works with BERT, OpenAI, sentence-transformers, or TF-IDF |
| Interpretable | See exactly why a document was flagged as relevant |
## Quick Start

```bash
pip install markrel
```

```python
from markrel import MarkovRelevanceModel

# Your data: queries, documents, and relevance labels
queries = ["machine learning tutorial", "baking recipes", "neural networks"]
documents = ["intro to ML", "best chocolate cake", "deep learning guide"]
labels = [1, 0, 1]  # 1 = relevant, 0 = not relevant

# Create and train (using the optimal config from the benchmarks below)
model = MarkovRelevanceModel(
    metrics=["euclidean"],   # best single metric
    n_bins=35,               # optimized for F1
    bin_strategy="uniform"
)
model.fit(queries, documents, labels)

# Predict relevance
probs = model.predict_proba(
    ["deep learning", "pasta recipes"],
    ["neural networks", "italian cooking"]
)
print(probs)  # [0.82, 0.15]
```

With your own embeddings:

```python
from sentence_transformers import SentenceTransformer

# Load BGE-M3 (best model per the benchmarks below)
encoder = SentenceTransformer('BAAI/bge-m3')

# Encode your texts
query_emb = encoder.encode(["what is ML?"])
doc_emb = encoder.encode(["machine learning is..."])

# Train with embeddings (disable TF-IDF)
model = MarkovRelevanceModel(
    metrics=["euclidean"],
    use_text_vectorizer=False  # use your embeddings directly
)
model.fit(query_emb, doc_emb, [1])
```

That's it! You now have a relevance model trained on your data.
## Tutorial

markrel needs (query, document, label) triples:

```python
# Example: question-answer relevance dataset
queries = [
    "What is machine learning?",
    "How does photosynthesis work?",
    "Best pizza recipe?",
    "Explain neural networks",
    "Types of pasta?",
]
documents = [
    "Machine learning is a subset of AI...",
    "Photosynthesis converts sunlight into energy...",
    "Authentic Neapolitan pizza requires...",
    "Neural networks are computing systems...",
    "Popular pasta types include spaghetti...",
]
# Labels: 1 = relevant, 0 = not relevant
labels = [1, 1, 0, 1, 0]
```

Based on our benchmarks, here are recommended configs:
```python
# Option A: Balanced (best F1)
model = MarkovRelevanceModel(
    metrics=["euclidean"],
    n_bins=35,
    bin_strategy="uniform"
)

# Option B: Catch everything (best recall)
model = MarkovRelevanceModel(
    metrics=["euclidean"],
    n_bins=7,
    bin_strategy="uniform"
)

# Option C: Strict filtering (best precision)
model = MarkovRelevanceModel(
    metrics=["cosine", "euclidean"],
    n_bins=24,
    bin_strategy="uniform"
)
```

```python
# Train on your data
model.fit(queries, documents, labels)

# Inspect what the model learned
print(model.summary())
```
```python
# Get probability scores
probabilities = model.predict_proba(
    new_queries,
    new_documents
)

# Apply a threshold (default 0.5, or an optimized threshold from the benchmarks)
threshold = 0.251  # F1-optimized threshold
predictions = probabilities >= threshold

# Or use built-in prediction with a custom threshold
predictions = model.predict(
    new_queries,
    new_documents,
    threshold=0.251
)
```
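The 0.251 value above is the F1-optimized threshold from the benchmarks; in general, a threshold can be chosen by sweeping candidate cutoffs on held-out scores and picking the one that maximizes your target metric. A self-contained sketch with toy scores (all values hypothetical, independent of markrel):

```python
import numpy as np

# Held-out labels and predicted relevance scores (toy data)
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.2, 0.9])

def f1_at(t):
    """F1 score when predicting relevant for scores >= t."""
    pred = scores >= t
    tp = np.sum(pred & (y_true == 1))
    if tp == 0:
        return 0.0
    precision = tp / np.sum(pred)
    recall = tp / np.sum(y_true)
    return 2 * precision * recall / (precision + recall)

# Every distinct score is a candidate threshold
candidates = np.unique(scores)
best = max(candidates, key=f1_at)
print(best)  # 0.35
```

A recall-first application would instead pick the lowest threshold whose recall is still 1.0, and a precision-first one the highest threshold that keeps precision acceptable.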
For best results, use modern embeddings:

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Load encoder (BGE-M3 recommended)
encoder = SentenceTransformer('BAAI/bge-m3')

# Large-scale training
train_queries = encoder.encode(train_query_texts)
train_docs = encoder.encode(train_doc_texts)
test_queries = encoder.encode(test_query_texts)
test_docs = encoder.encode(test_doc_texts)

# Train markrel
model = MarkovRelevanceModel(
    metrics=["euclidean"],
    n_bins=35,
    use_text_vectorizer=False  # important!
)
model.fit(train_queries, train_docs, train_labels)

# Batch prediction (fast!)
probs = model.predict_proba(test_queries, test_docs)
```

A complete worked example: classifying emails by relevance to a query.

```python
from markrel import MarkovRelevanceModel
from sentence_transformers import SentenceTransformer

# Load data
emails = [
    "Urgent: Project deadline moved up",
    "Weekly team newsletter",
    "Invoice #1234 payment required",
]
queries = ["urgent project emails", "team updates", "billing notifications"]
labels = [1, 0, 1]  # which emails are relevant to which query

# Encode with BGE-M3
encoder = SentenceTransformer('BAAI/bge-m3')
email_emb = encoder.encode(emails)
query_emb = encoder.encode(queries)

# Train relevance classifier
model = MarkovRelevanceModel(
    metrics=["euclidean"],
    n_bins=35,
    use_text_vectorizer=False
)
model.fit(query_emb, email_emb, labels)

# Classify new emails
new_emails = encoder.encode([
    "RE: Project timeline discussion",
    "Your Amazon order has shipped",
    "URGENT: Server outage in production"
])
search_query = encoder.encode(["urgent project emails"])
relevance_scores = model.predict_proba(search_query, new_emails)
print(f"Email relevance scores: {relevance_scores}")
# Output: [0.78, 0.12, 0.91]
```

## Why markrel?

Traditional document relevance uses:
- Fixed thresholds: "Cosine > 0.7 = relevant" (ignores domain-specific patterns)
- Linear scoring: Assumes similarity linearly predicts relevance
- Black-box models: Can't explain why a document was selected
markrel instead uses Markov chains to learn non-linear relevance patterns:

```
Similarity Score → Bin Mapping → P(Relevance)

0.95 ──→ Bin 9 ──→ P(rel) = 0.92   Highly relevant
0.75 ──→ Bin 7 ──→ P(rel) = 0.68   Maybe relevant
0.45 ──→ Bin 4 ──→ P(rel) = 0.23   Probably not
0.15 ──→ Bin 1 ──→ P(rel) = 0.05   Not relevant
```

Each bin learns its own probability from your training data, capturing domain-specific patterns.
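The idea can be reproduced mechanically: bin the similarity scores, then estimate P(relevant | bin) from label counts, with Laplace smoothing so empty bins stay well-defined. A minimal sketch of this idea on toy data (an illustration, not markrel's actual implementation):

```python
import numpy as np

def fit_bin_probs(similarities, labels, n_bins=5, smoothing=1.0):
    """Estimate P(relevant | similarity bin) with Laplace smoothing."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bins = np.clip(np.digitize(similarities, edges) - 1, 0, n_bins - 1)
    probs = np.empty(n_bins)
    for b in range(n_bins):
        in_bin = bins == b
        pos = np.sum(labels[in_bin])
        total = np.sum(in_bin)
        # Laplace smoothing: an empty bin defaults to 0.5 instead of 0/0
        probs[b] = (pos + smoothing) / (total + 2 * smoothing)
    return edges, probs

def predict(similarities, edges, probs):
    """Look up each score's bin and return the learned probability."""
    bins = np.clip(np.digitize(similarities, edges) - 1, 0, len(probs) - 1)
    return probs[bins]

# Toy training data: high similarity tends to mean relevant
sims = np.array([0.95, 0.9, 0.8, 0.5, 0.4, 0.2, 0.1, 0.85])
labels = np.array([1, 1, 1, 0, 1, 0, 0, 1])
edges, probs = fit_bin_probs(sims, labels)
print(predict(np.array([0.92, 0.15]), edges, probs))
```

Note the effect of smoothing: a bin seen only with positive labels still gets a probability below 1, which keeps predictions sensible on sparsely populated bins.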
## Benchmarks

Results on 6,165 test samples (4.8% positive class):
| Optimization | F1 | Recall | Precision | Config | Use Case |
|---|---|---|---|---|---|
| Balanced | 0.370 | 0.362 | 0.379 | 35 bins, euclidean | General purpose |
| Recall | 0.091 | 1.000 | 0.048 | 7 bins, euclidean | Catch all relevant |
| Precision | 0.007 | 0.003 | 1.000 | 24 bins, cos+euc | Strict filtering |
Embedding model comparison:

| Model | F1 | AUC | Speed |
|---|---|---|---|
| BGE-M3 | 0.343 | 0.815 | 51K samples/s |
| RoBERTa-large | 0.323 | 0.828 | 54K samples/s |
| MiniLM-L6 | 0.322 | 0.799 | 61K samples/s |

Winner: BGE-M3 for accuracy (F1), MiniLM for speed.
## Installation

```bash
pip install markrel
```

From source:

```bash
git clone https://github.com/yourusername/markrel.git
cd markrel
pip install -e .
```

Development install and tests:

```bash
pip install -e ".[dev]"
pytest tests/ -v
```

Requirements:

```
numpy >= 1.20.0
scikit-learn >= 1.0.0
```

Optional for embeddings:

```
sentence-transformers >= 2.0.0
```
## How It Works

```
                     markrel Pipeline
                     ================

 INPUT                         PROCESSING

 Query:    "What is ML?"  ──┐
                            ├──→  Embed with BGE-M3 (1024-dim)
 Document: "Machine         │                  │
            learning..."  ──┘                  ▼
                                  Similarity computation
                                               │
                                               ▼
                                  Markov chain
                                    Bin 0:  5%
                                    Bin 4: 23%
                                    Bin 7: 68%
                                    Bin 9: 92%
                                               │
                                               ▼
                                  P(Relevant) = 0.75
                                               │
                                               ▼
                                  Relevant ✓

 OUTPUT: probability + prediction
```
Unlike fixed thresholds, markrel learns a probability for each similarity bin:

```
Similarity:    0.0 ─── 0.2 ─── 0.4 ─── 0.6 ─── 0.8 ─── 1.0
                │       │       │       │       │       │
                ▼       ▼       ▼       ▼       ▼       ▼
Bin:          [Bin0]  [Bin1]  [Bin2]  [Bin3]  [Bin4]  [Bin5]
                │       │       │       │       │       │
P(Relevant):   0.05    0.12    0.35    0.68    0.89    0.95
               Not     Not    Maybe   Likely  Highly  Highly
               rel.    rel.   rel.    rel.    rel.    rel.
```
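The ladder above uses uniform (equal-width) bins; the `bin_strategy="quantile"` option instead places bin edges so each bin captures roughly the same number of training scores, which helps when scores cluster in a narrow range. A small sketch of the difference on toy scores (illustrative only, not markrel's internals):

```python
import numpy as np

# Toy similarity scores clustered at the low and high ends
scores = np.array([0.1, 0.12, 0.15, 0.2, 0.7, 0.75, 0.8, 0.9])

# "uniform": equal-width edges over [0, 1]
uniform_edges = np.linspace(0.0, 1.0, 5)  # 4 bins

# "quantile": equal-population edges, adapting to the score distribution
quantile_edges = np.quantile(scores, np.linspace(0.0, 1.0, 5))

print(uniform_edges)
print(quantile_edges)
```

With clustered scores like these, uniform binning leaves the middle bins nearly empty, while quantile edges follow the data so every bin gets enough samples to estimate a probability.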
## Advantages & Use Cases

| Feature | Benefit |
|---|---|
| Domain adaptable | Learns from YOUR data, not generic assumptions |
| Non-linear | Captures complex similarity-to-relevance patterns |
| Tunable | Optimize for F1, Recall, or Precision |
| Fast | 50K+ samples per second after training |
| Interpretable | See P(relevance) per bin; debug predictions |
| Embedding agnostic | Use BERT, OpenAI, or TF-IDF |
| Lightweight | No GPU required; pure NumPy |
| Use Case | Why markrel Works |
|---|---|
| Semantic search re-ranking | Fast second-stage filtering of retrieved docs |
| Email classification | Learn relevance patterns from your mail |
| Document similarity | Semantic matching beyond keywords |
| Chatbot responses | Select the best response from candidates |
| Real-time filtering | High throughput with low latency |
## Limitations

| Limitation | Solution |
|---|---|
| Requires labeled data | Use transfer learning or synthetic labels |
| Class imbalance | Use Recall-optimized config for rare positives |
| No native ranking | Pair with BM25 for initial retrieval |
| Single-pair only | Use cross-encoders for document sets |
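The "pair with BM25" row describes the standard two-stage pattern: a cheap lexical retriever narrows the candidate pool, then the relevance model re-ranks what remains. A self-contained toy sketch of that pattern (term overlap stands in for BM25, and a fixed score table stands in for a trained model's `predict_proba`; all values are hypothetical):

```python
def overlap_score(query, doc):
    """Stage 1: cheap lexical score (a stand-in for BM25)."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d)

docs = [
    "intro to machine learning",
    "best chocolate cake recipe",
    "deep learning and neural networks",
    "machine learning in production",
]
query = "machine learning tutorial"

# Stage 1: keep the top-k candidates by lexical overlap
k = 3
candidates = sorted(docs, key=lambda d: overlap_score(query, d), reverse=True)[:k]

# Stage 2: re-rank candidates by learned relevance probability
# (a fixed lookup standing in for model.predict_proba)
learned_prob = {
    "intro to machine learning": 0.82,
    "machine learning in production": 0.74,
    "deep learning and neural networks": 0.61,
    "best chocolate cake recipe": 0.05,
}
reranked = sorted(candidates, key=lambda d: learned_prob[d], reverse=True)
print(reranked[0])  # "intro to machine learning"
```

The first stage keeps latency low by shrinking the pool; the second stage applies the learned, non-linear relevance scoring only where it matters.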
## API Reference

```python
from markrel import MarkovRelevanceModel

model = MarkovRelevanceModel(
    metrics=["euclidean"],       # similarity metrics to use
    n_bins=35,                   # number of bins (10-50)
    bin_strategy="uniform",      # "uniform" or "quantile"
    smoothing=1.0,               # Laplace smoothing
    combine_rule="bayesian",     # "bayesian" or "mean"
    use_text_vectorizer=True     # auto-vectorize text
)
```

**Methods:**

- `fit(queries, documents, labels)`: train the model
- `predict_proba(queries, documents)`: relevance probabilities in [0, 1]
- `predict(queries, documents, threshold=0.5)`: binary predictions {0, 1}
- `summary()`: model statistics
- `get_metric_probabilities(metric)`: per-bin probabilities for a metric
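How `combine_rule` merges several metrics' probabilities is not spelled out here. A plausible reading, sketched below as an assumption about the semantics rather than markrel's actual code, is that `"mean"` averages the per-metric probabilities while `"bayesian"` combines their odds naive-Bayes style:

```python
import numpy as np

# Hypothetical per-metric P(relevant) for one query-document pair,
# e.g. one value from the cosine bins and one from the euclidean bins
p = np.array([0.8, 0.6])

# "mean": simple average of per-metric probabilities
p_mean = p.mean()  # 0.7

# "bayesian": naive-Bayes-style combination via odds
odds = np.prod(p / (1 - p))  # (0.8/0.2) * (0.6/0.4) = 6.0
p_bayes = odds / (1 + odds)  # ~0.857

print(p_mean, p_bayes)
```

Under this reading, the odds product sharpens agreement between metrics: two moderately confident metrics yield a combined score more confident than either one, while the mean never exceeds the largest input.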
## License

MIT License. See `LICENSE` for details.
**markrel = Markov chain + relevance.**

Like a school of mackerel swimming through the ocean, markrel navigates the sea of documents, tracing probabilistic paths to find the most relevant matches. Each fish (document) follows the currents (similarity scores) toward its destination (relevance).

Ready to school your documents? Head back up to the Quick Start!