🐟 markrel

Markov Chain Document Relevance — School your documents with Markov chains!

🐟 What is markrel? A fast, interpretable Python library that uses Markov chains to predict document relevance. Like a school of mackerel navigating the seas, markrel traces probabilistic paths through similarity space to find the most relevant documents.

    🐟🐟🐟
   🐟🐟🐟🐟
  🐟🐟🐟🐟🐟   ← Your documents
   🐟🐟🐟🐟      swimming through
    🐟🐟🐟       relevance space!

🌊 Overview

markrel predicts whether a document is relevant to a query using Markov chains and similarity metrics. It's designed for:

🔍 Semantic search re-ranking — Filter top-k results with learned relevance
📧 Document classification — Sort documents by relevance to topics
🤖 Response selection — Pick best answers from candidate pool
⚡ High-throughput filtering — Process 50K+ documents/second

Key Features

Feature	Description
🐟 8 Similarity Metrics	Cosine, Euclidean, Jaccard, Overlap, Dice, Manhattan, Chebyshev, Dot Product
🧠 Markov Chain Learning	Learns P(relevance) from your data, not generic rules
🎯 3 Optimization Modes	Tune for F1, Recall, or Precision based on your needs
⚡ Fast Inference	50K+ samples/second after training
🔧 Embedding Agnostic	Works with BERT, OpenAI, sentence-transformers, or TF-IDF
📊 Interpretable	See exactly why a document was flagged as relevant

⚡ Quick Start (3 Minutes)

1. Install

pip install markrel

2. Train & Predict

from markrel import MarkovRelevanceModel

# Your data: queries, documents, and relevance labels
queries = ["machine learning tutorial", "baking recipes", "neural networks"]
documents = ["intro to ML", "best chocolate cake", "deep learning guide"]
labels = [1, 0, 1]  # 1 = relevant, 0 = not relevant

# Create and train (using optimal config from benchmarks)
model = MarkovRelevanceModel(
    metrics=["euclidean"],      # Best single metric
    n_bins=35,                  # Optimized for F1
    bin_strategy="uniform"
)
model.fit(queries, documents, labels)

# Predict relevance
probs = model.predict_proba(
    ["deep learning", "pasta recipes"],
    ["neural networks", "italian cooking"]
)
print(probs)  # [0.82, 0.15]

3. Use with Modern Embeddings

from sentence_transformers import SentenceTransformer

# Load BGE-M3 (best model per benchmarks)
encoder = SentenceTransformer('BAAI/bge-m3')

# Encode your texts
query_emb = encoder.encode(["what is ML?"])
doc_emb = encoder.encode(["machine learning is..."])

# Train with embeddings (disable TF-IDF)
model = MarkovRelevanceModel(
    metrics=["euclidean"],
    use_text_vectorizer=False  # Use your embeddings
)
model.fit(query_emb, doc_emb, [1])

That's it! 🎉 You now have a relevance model trained on your data.

📚 Tutorial: Complete Walkthrough

Step 1: Prepare Your Data

Markrel needs (query, document, label) triples:

# Example: Question-Answer Relevance Dataset
queries = [
    "What is machine learning?",
    "How does photosynthesis work?",
    "Best pizza recipe?",
    "Explain neural networks",
    "Types of pasta?",
]

documents = [
    "Machine learning is a subset of AI...",
    "Photosynthesis converts sunlight into energy...",
    "Authentic Neapolitan pizza requires...",
    "Neural networks are computing systems...",
    "Popular pasta types include spaghetti...",
]

# Labels: 1 = relevant, 0 = not relevant
labels = [1, 1, 0, 1, 0]

Step 2: Choose Your Configuration

Based on our benchmarks, here are recommended configs:

# Option A: Balanced (Best F1)
model = MarkovRelevanceModel(
    metrics=["euclidean"],
    n_bins=35,
    bin_strategy="uniform"
)

# Option B: Catch Everything (Best Recall)
model = MarkovRelevanceModel(
    metrics=["euclidean"],
    n_bins=7,
    bin_strategy="uniform"
)

# Option C: Strict Filtering (Best Precision)
model = MarkovRelevanceModel(
    metrics=["cosine", "euclidean"],
    n_bins=24,
    bin_strategy="uniform"
)

Step 3: Train the Model

# Train on your data
model.fit(queries, documents, labels)

# Inspect what the model learned
print(model.summary())

Step 4: Make Predictions

# Get probability scores
probabilities = model.predict_proba(
    new_queries,
    new_documents
)

# Apply threshold (default 0.5, or optimized threshold from benchmarks)
threshold = 0.251  # F1-optimized threshold
predictions = probabilities >= threshold

# Or use built-in prediction with custom threshold
predictions = model.predict(
    new_queries,
    new_documents,
    threshold=0.251
)

Step 5: Advanced Usage with Embeddings

For best results, use modern embeddings:

from sentence_transformers import SentenceTransformer
import numpy as np

# Load encoder (BGE-M3 recommended)
encoder = SentenceTransformer('BAAI/bge-m3')

# Large-scale training
train_queries = encoder.encode(train_query_texts)
train_docs = encoder.encode(train_doc_texts)
test_queries = encoder.encode(test_query_texts)
test_docs = encoder.encode(test_doc_texts)

# Train markrel
model = MarkovRelevanceModel(
    metrics=["euclidean"],
    n_bins=35,
    use_text_vectorizer=False  # Important!
)
model.fit(train_queries, train_docs, train_labels)

# Batch prediction (fast!)
probs = model.predict_proba(test_queries, test_docs)

Complete Example: Email Classifier

from markrel import MarkovRelevanceModel
from sentence_transformers import SentenceTransformer

# Load data
emails = ["Urgent: Project deadline moved up", "Weekly team newsletter", "Invoice #1234 payment required"]
queries = ["urgent project emails", "team updates", "billing notifications"]
labels = [1, 0, 1]  # Which emails are relevant to which query

# Encode with BGE-M3
encoder = SentenceTransformer('BAAI/bge-m3')
email_emb = encoder.encode(emails)
query_emb = encoder.encode(queries)

# Train relevance classifier
model = MarkovRelevanceModel(
    metrics=["euclidean"],
    n_bins=35,
    use_text_vectorizer=False
)
model.fit(query_emb, email_emb, labels)

# Classify new emails
new_emails = encoder.encode([
    "RE: Project timeline discussion",
    "Your Amazon order has shipped",
    "URGENT: Server outage in production"
])
search_query = encoder.encode(["urgent project emails"])

relevance_scores = model.predict_proba(search_query, new_emails)
print(f"Email relevance scores: {relevance_scores}")
# Output: [0.78, 0.12, 0.91]

🎯 Why markrel?

The Problem

Traditional document relevance uses:

Fixed thresholds: "Cosine > 0.7 = relevant" (ignores domain-specific patterns)
Linear scoring: Assumes similarity linearly predicts relevance
Black-box models: Can't explain why a document was selected

The Solution

Markrel uses Markov chains to learn non-linear relevance patterns:

Similarity Score → Bin Mapping → P(Relevance)

   0.95 ──→ Bin 9 ──→ P(rel)=0.92  ✓ Highly relevant
   0.75 ──→ Bin 7 ──→ P(rel)=0.68  ⚠ Maybe relevant
   0.45 ──→ Bin 4 ──→ P(rel)=0.23  ✗ Probably not
   0.15 ──→ Bin 1 ──→ P(rel)=0.05  ✗ Not relevant

Each bin learns its own probability from your training data, capturing domain-specific patterns.

📊 Benchmarks

WikiQA Question-Answer Relevance

Results on 6,165 test samples (4.8% positive class):

Optimization	F1	Recall	Precision	Config	Use Case
Balanced	0.370	0.362	0.379	35 bins, euclidean	General purpose
Recall	0.091	1.000	0.048	7 bins, euclidean	Catch all relevant
Precision	0.007	0.003	1.000	24 bins, cos+euc	Strict filtering

Embedding Model Comparison

Model	F1	AUC	Speed
BGE-M3 ⭐	0.343	0.815	51K/s
RoBERTa-large	0.323	0.828	54K/s
MiniLM-L6	0.322	0.799	61K/s

Winner: BGE-M3 for accuracy, MiniLM for speed.

🔧 Installation

From PyPI (when published)

pip install markrel

From Source

git clone https://github.com/yourusername/markrel.git
cd markrel
pip install -e .

Development Install

pip install -e ".[dev]"
pytest tests/ -v

Dependencies

numpy >= 1.20.0
scikit-learn >= 1.0.0

Optional for embeddings:

sentence-transformers >= 2.0.0

🎨 How It Works

┌────────────────────────────────────────────────────────────────┐
│                    🐟 markrel Pipeline                          │
└────────────────────────────────────────────────────────────────┘

  📥 INPUT                              🔧 PROCESSING
  ────────                              ───────────
  
  ┌──────────┐                       ┌─────────────┐
  │  Query   │──┐                     │  Embed with │
  │  "What   │  │                     │  BGE-M3     │
  │  is ML?" │  │                     │  (1024-dim) │
  └──────────┘  │                     └─────────────┘
                │                           │
  ┌──────────┐  │                           ▼
  │ Document │──┘                     ┌─────────────┐
  │ "Machine │                        │ Similarity  │
  │ learning │                        │ Computation │
  └──────────┘                        └─────────────┘
                                             │
                                             ▼
                                       ┌─────────────┐
                                       │ 📊 Markov   │
                                       │   Chain     │
                                       │             │
                                       │ Bin 0: 5%   │
                                       │ Bin 4: 23%  │
                                       │ Bin 7: 68%  │
                                       │ Bin 9: 92%  │
                                       └─────────────┘
                                             │
                                             ▼
                                       ┌─────────────┐
                                       │ P(Relevant) │
                                       │    0.75     │
                                       └─────────────┘
                                             │
                                             ▼
                                       ┌─────────────┐
                                       │  ✅ Relevant│
                                       └─────────────┘

  📤 OUTPUT: Probability + Prediction

The Markov Chain

Unlike fixed thresholds, markrel learns a probability for each similarity bin:

Similarity:     0.0 ──→ 0.2 ──→ 0.4 ──→ 0.6 ──→ 0.8 ──→ 1.0
                 │       │       │       │       │       │
                 ▼       ▼       ▼       ▼       ▼       ▼
Bin:          [Bin0]  [Bin1]  [Bin2]  [Bin3]  [Bin4]  [Bin5]
                 │       │       │       │       │       │
P(Relevant):    0.05    0.12    0.35    0.68    0.89    0.95
                 │       │       │       │       │       │
               🚫      ⚠️      ⚠️      ✓       ✓       ✓
            Not Rel.    Maybe       Likely       Highly
                        Relevant    Relevant     Relevant

✅ Advantages & Use Cases

✅ Advantages

Feature	Benefit
🎯 Domain Adaptable	Learns from YOUR data, not generic assumptions
📈 Non-linear	Captures complex similarity→relevance patterns
🔧 Tunable	Optimize for F1, Recall, or Precision
⚡ Fast	50K+ samples/second after training
🔍 Interpretable	See P(relevance) per bin; debug predictions
🧩 Embedding Agnostic	Use BERT, OpenAI, or TF-IDF
📦 Lightweight	No GPU required; pure NumPy

✅ Best Use Cases

Use Case	Why markrel Works
🔍 Semantic Search Re-ranking	Fast second-stage filtering of retrieved docs
📧 Email Classification	Learn relevance patterns from your mail
📄 Document Similarity	Semantic matching beyond keywords
🤖 Chatbot Responses	Select best response from candidates
⚡ Real-time Filtering	High-throughput with low latency

❌ Limitations

Limitation	Solution
Requires labeled data	Use transfer learning or synthetic labels
Class imbalance	Use Recall-optimized config for rare positives
No native ranking	Pair with BM25 for initial retrieval
Single-pair only	Use cross-encoders for document sets

📖 API Reference

MarkovRelevanceModel

from markrel import MarkovRelevanceModel

model = MarkovRelevanceModel(
    metrics=["euclidean"],      # Similarity metrics to use
    n_bins=35,                  # Number of bins (10-50)
    bin_strategy="uniform",     # "uniform" or "quantile"
    smoothing=1.0,              # Laplace smoothing
    combine_rule="bayesian",    # "bayesian" or "mean"
    use_text_vectorizer=True    # Auto-vectorize text
)

Methods:

fit(queries, documents, labels) — Train the model
predict_proba(queries, documents) — Get relevance probabilities [0-1]
predict(queries, documents, threshold=0.5) — Binary predictions {0, 1}
summary() — Model statistics
get_metric_probabilities(metric) — Bin probabilities

📄 License

MIT License — See LICENSE for details.

🐟 About the Name

Markrel = Markov Chain + Relevance

Like a school of mackerel swimming through the ocean, markrel navigates the sea of documents, tracing probabilistic paths to find the most relevant matches. Each fish (document) follows the currents (similarity scores) toward their destination (relevance). 🐟🐟🐟

Ready to school your documents? Get started with Quick Start →

     🐟
   🐟🐟🐟
 🐟🐟🐟🐟🐟
   🐟🐟🐟
     🐟

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
.github/workflows		.github/workflows
data/wikiqa		data/wikiqa
docs		docs
experiments		experiments
markrel		markrel
tests		tests
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
RELEASE.md		RELEASE.md
demo.py		demo.py
pyproject.toml		pyproject.toml

Folders and files

Latest commit

History

Repository files navigation

🐟 markrel

📖 Table of Contents

🌊 Overview

Key Features

⚡ Quick Start (3 Minutes)

1. Install

2. Train & Predict

3. Use with Modern Embeddings

📚 Tutorial: Complete Walkthrough

Step 1: Prepare Your Data

Step 2: Choose Your Configuration

Step 3: Train the Model

Step 4: Make Predictions

Step 5: Advanced Usage with Embeddings

Complete Example: Email Classifier

🎯 Why markrel?

The Problem

The Solution

📊 Benchmarks

WikiQA Question-Answer Relevance

Embedding Model Comparison

🔧 Installation

From PyPI (when published)

From Source

Development Install

Dependencies

🎨 How It Works

The Markov Chain

✅ Advantages & Use Cases

✅ Advantages

✅ Best Use Cases

❌ Limitations

📖 API Reference

MarkovRelevanceModel

📄 License

🐟 About the Name

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages