Skip to content

petabyte/markrel

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

9 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

🐟 markrel

Markov Chain Document Relevance β€” School your documents with Markov chains!

Python 3.8+ License: MIT Tests

🐟 What is markrel? A fast, interpretable Python library that uses Markov chains to predict document relevance. Like a school of mackerel navigating the seas, markrel traces probabilistic paths through similarity space to find the most relevant documents.

    🐟🐟🐟
   🐟🐟🐟🐟
  🐟🐟🐟🐟🐟   ← Your documents
   🐟🐟🐟🐟      swimming through
    🐟🐟🐟       relevance space!

πŸ“– Table of Contents


🌊 Overview

markrel predicts whether a document is relevant to a query using Markov chains and similarity metrics. It's designed for:

  • πŸ” Semantic search re-ranking β€” Filter top-k results with learned relevance
  • πŸ“§ Document classification β€” Sort documents by relevance to topics
  • πŸ€– Response selection β€” Pick best answers from candidate pool
  • ⚑ High-throughput filtering β€” Process 50K+ documents/second

Key Features

Feature Description
🐟 8 Similarity Metrics Cosine, Euclidean, Jaccard, Overlap, Dice, Manhattan, Chebyshev, Dot Product
🧠 Markov Chain Learning Learns P(relevance) from your data, not generic rules
🎯 3 Optimization Modes Tune for F1, Recall, or Precision based on your needs
⚑ Fast Inference 50K+ samples/second after training
πŸ”§ Embedding Agnostic Works with BERT, OpenAI, sentence-transformers, or TF-IDF
πŸ“Š Interpretable See exactly why a document was flagged as relevant

⚑ Quick Start (3 Minutes)

1. Install

pip install markrel

2. Train & Predict

from markrel import MarkovRelevanceModel

# Your data: queries, documents, and relevance labels
queries = ["machine learning tutorial", "baking recipes", "neural networks"]
documents = ["intro to ML", "best chocolate cake", "deep learning guide"]
labels = [1, 0, 1]  # 1 = relevant, 0 = not relevant

# Create and train (using optimal config from benchmarks)
model = MarkovRelevanceModel(
    metrics=["euclidean"],      # Best single metric
    n_bins=35,                  # Optimized for F1
    bin_strategy="uniform"
)
model.fit(queries, documents, labels)

# Predict relevance
probs = model.predict_proba(
    ["deep learning", "pasta recipes"],
    ["neural networks", "italian cooking"]
)
print(probs)  # [0.82, 0.15]

3. Use with Modern Embeddings

from sentence_transformers import SentenceTransformer

# Load BGE-M3 (best model per benchmarks)
encoder = SentenceTransformer('BAAI/bge-m3')

# Encode your texts
query_emb = encoder.encode(["what is ML?"])
doc_emb = encoder.encode(["machine learning is..."])

# Train with embeddings (disable TF-IDF)
model = MarkovRelevanceModel(
    metrics=["euclidean"],
    use_text_vectorizer=False  # Use your embeddings
)
model.fit(query_emb, doc_emb, [1])

That's it! πŸŽ‰ You now have a relevance model trained on your data.


πŸ“š Tutorial: Complete Walkthrough

Step 1: Prepare Your Data

Markrel needs (query, document, label) triples:

# Example: Question-Answer Relevance Dataset
queries = [
    "What is machine learning?",
    "How does photosynthesis work?",
    "Best pizza recipe?",
    "Explain neural networks",
    "Types of pasta?",
]

documents = [
    "Machine learning is a subset of AI...",
    "Photosynthesis converts sunlight into energy...",
    "Authentic Neapolitan pizza requires...",
    "Neural networks are computing systems...",
    "Popular pasta types include spaghetti...",
]

# Labels: 1 = relevant, 0 = not relevant
labels = [1, 1, 0, 1, 0]

Step 2: Choose Your Configuration

Based on our benchmarks, here are recommended configs:

# Option A: Balanced (Best F1)
model = MarkovRelevanceModel(
    metrics=["euclidean"],
    n_bins=35,
    bin_strategy="uniform"
)

# Option B: Catch Everything (Best Recall)
model = MarkovRelevanceModel(
    metrics=["euclidean"],
    n_bins=7,
    bin_strategy="uniform"
)

# Option C: Strict Filtering (Best Precision)
model = MarkovRelevanceModel(
    metrics=["cosine", "euclidean"],
    n_bins=24,
    bin_strategy="uniform"
)

Step 3: Train the Model

# Train on your data
model.fit(queries, documents, labels)

# Inspect what the model learned
print(model.summary())

Step 4: Make Predictions

# Get probability scores
probabilities = model.predict_proba(
    new_queries,
    new_documents
)

# Apply threshold (default 0.5, or optimized threshold from benchmarks)
threshold = 0.251  # F1-optimized threshold
predictions = probabilities >= threshold

# Or use built-in prediction with custom threshold
predictions = model.predict(
    new_queries,
    new_documents,
    threshold=0.251
)

Step 5: Advanced Usage with Embeddings

For best results, use modern embeddings:

from sentence_transformers import SentenceTransformer
import numpy as np

# Load encoder (BGE-M3 recommended)
encoder = SentenceTransformer('BAAI/bge-m3')

# Large-scale training
train_queries = encoder.encode(train_query_texts)
train_docs = encoder.encode(train_doc_texts)
test_queries = encoder.encode(test_query_texts)
test_docs = encoder.encode(test_doc_texts)

# Train markrel
model = MarkovRelevanceModel(
    metrics=["euclidean"],
    n_bins=35,
    use_text_vectorizer=False  # Important!
)
model.fit(train_queries, train_docs, train_labels)

# Batch prediction (fast!)
probs = model.predict_proba(test_queries, test_docs)

Complete Example: Email Classifier

from markrel import MarkovRelevanceModel
from sentence_transformers import SentenceTransformer

# Load data
emails = ["Urgent: Project deadline moved up", "Weekly team newsletter", "Invoice #1234 payment required"]
queries = ["urgent project emails", "team updates", "billing notifications"]
labels = [1, 0, 1]  # Which emails are relevant to which query

# Encode with BGE-M3
encoder = SentenceTransformer('BAAI/bge-m3')
email_emb = encoder.encode(emails)
query_emb = encoder.encode(queries)

# Train relevance classifier
model = MarkovRelevanceModel(
    metrics=["euclidean"],
    n_bins=35,
    use_text_vectorizer=False
)
model.fit(query_emb, email_emb, labels)

# Classify new emails
new_emails = encoder.encode([
    "RE: Project timeline discussion",
    "Your Amazon order has shipped",
    "URGENT: Server outage in production"
])
search_query = encoder.encode(["urgent project emails"])

relevance_scores = model.predict_proba(search_query, new_emails)
print(f"Email relevance scores: {relevance_scores}")
# Output: [0.78, 0.12, 0.91]

🎯 Why markrel?

The Problem

Traditional document relevance uses:

  • Fixed thresholds: "Cosine > 0.7 = relevant" (ignores domain-specific patterns)
  • Linear scoring: Assumes similarity linearly predicts relevance
  • Black-box models: Can't explain why a document was selected

The Solution

Markrel uses Markov chains to learn non-linear relevance patterns:

Similarity Score β†’ Bin Mapping β†’ P(Relevance)

   0.95 ──→ Bin 9 ──→ P(rel)=0.92  βœ“ Highly relevant
   0.75 ──→ Bin 7 ──→ P(rel)=0.68  ⚠ Maybe relevant
   0.45 ──→ Bin 4 ──→ P(rel)=0.23  βœ— Probably not
   0.15 ──→ Bin 1 ──→ P(rel)=0.05  βœ— Not relevant

Each bin learns its own probability from your training data, capturing domain-specific patterns.


πŸ“Š Benchmarks

WikiQA Question-Answer Relevance

Results on 6,165 test samples (4.8% positive class):

Optimization F1 Recall Precision Config Use Case
Balanced 0.370 0.362 0.379 35 bins, euclidean General purpose
Recall 0.091 1.000 0.048 7 bins, euclidean Catch all relevant
Precision 0.007 0.003 1.000 24 bins, cos+euc Strict filtering

Embedding Model Comparison

Model F1 AUC Speed
BGE-M3 ⭐ 0.343 0.815 51K/s
RoBERTa-large 0.323 0.828 54K/s
MiniLM-L6 0.322 0.799 61K/s

Winner: BGE-M3 for accuracy, MiniLM for speed.


πŸ”§ Installation

From PyPI (when published)

pip install markrel

From Source

git clone https://github.com/yourusername/markrel.git
cd markrel
pip install -e .

Development Install

pip install -e ".[dev]"
pytest tests/ -v

Dependencies

numpy >= 1.20.0
scikit-learn >= 1.0.0

Optional for embeddings:

sentence-transformers >= 2.0.0

🎨 How It Works

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    🐟 markrel Pipeline                          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

  πŸ“₯ INPUT                              πŸ”§ PROCESSING
  ────────                              ───────────
  
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚  Query   │──┐                     β”‚  Embed with β”‚
  β”‚  "What   β”‚  β”‚                     β”‚  BGE-M3     β”‚
  β”‚  is ML?" β”‚  β”‚                     β”‚  (1024-dim) β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚                     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                β”‚                           β”‚
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚                           β–Ό
  β”‚ Document β”‚β”€β”€β”˜                     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚ "Machine β”‚                        β”‚ Similarity  β”‚
  β”‚ learning β”‚                        β”‚ Computation β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                             β”‚
                                             β–Ό
                                       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                       β”‚ πŸ“Š Markov   β”‚
                                       β”‚   Chain     β”‚
                                       β”‚             β”‚
                                       β”‚ Bin 0: 5%   β”‚
                                       β”‚ Bin 4: 23%  β”‚
                                       β”‚ Bin 7: 68%  β”‚
                                       β”‚ Bin 9: 92%  β”‚
                                       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                             β”‚
                                             β–Ό
                                       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                       β”‚ P(Relevant) β”‚
                                       β”‚    0.75     β”‚
                                       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                             β”‚
                                             β–Ό
                                       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                       β”‚  βœ… Relevantβ”‚
                                       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

  πŸ“€ OUTPUT: Probability + Prediction

The Markov Chain

Unlike fixed thresholds, markrel learns a probability for each similarity bin:

Similarity:     0.0 ──→ 0.2 ──→ 0.4 ──→ 0.6 ──→ 0.8 ──→ 1.0
                 β”‚       β”‚       β”‚       β”‚       β”‚       β”‚
                 β–Ό       β–Ό       β–Ό       β–Ό       β–Ό       β–Ό
Bin:          [Bin0]  [Bin1]  [Bin2]  [Bin3]  [Bin4]  [Bin5]
                 β”‚       β”‚       β”‚       β”‚       β”‚       β”‚
P(Relevant):    0.05    0.12    0.35    0.68    0.89    0.95
                 β”‚       β”‚       β”‚       β”‚       β”‚       β”‚
               🚫      ⚠️      ⚠️      βœ“       βœ“       βœ“
            Not Rel.    Maybe       Likely       Highly
                        Relevant    Relevant     Relevant

βœ… Advantages & Use Cases

βœ… Advantages

Feature Benefit
🎯 Domain Adaptable Learns from YOUR data, not generic assumptions
πŸ“ˆ Non-linear Captures complex similarityβ†’relevance patterns
πŸ”§ Tunable Optimize for F1, Recall, or Precision
⚑ Fast 50K+ samples/second after training
πŸ” Interpretable See P(relevance) per bin; debug predictions
🧩 Embedding Agnostic Use BERT, OpenAI, or TF-IDF
πŸ“¦ Lightweight No GPU required; pure NumPy

βœ… Best Use Cases

Use Case Why markrel Works
πŸ” Semantic Search Re-ranking Fast second-stage filtering of retrieved docs
πŸ“§ Email Classification Learn relevance patterns from your mail
πŸ“„ Document Similarity Semantic matching beyond keywords
πŸ€– Chatbot Responses Select best response from candidates
⚑ Real-time Filtering High-throughput with low latency

❌ Limitations

Limitation Solution
Requires labeled data Use transfer learning or synthetic labels
Class imbalance Use Recall-optimized config for rare positives
No native ranking Pair with BM25 for initial retrieval
Single-pair only Use cross-encoders for document sets

πŸ“– API Reference

MarkovRelevanceModel

from markrel import MarkovRelevanceModel

model = MarkovRelevanceModel(
    metrics=["euclidean"],      # Similarity metrics to use
    n_bins=35,                  # Number of bins (10-50)
    bin_strategy="uniform",     # "uniform" or "quantile"
    smoothing=1.0,              # Laplace smoothing
    combine_rule="bayesian",    # "bayesian" or "mean"
    use_text_vectorizer=True    # Auto-vectorize text
)

Methods:

  • fit(queries, documents, labels) β€” Train the model
  • predict_proba(queries, documents) β€” Get relevance probabilities [0-1]
  • predict(queries, documents, threshold=0.5) β€” Binary predictions {0, 1}
  • summary() β€” Model statistics
  • get_metric_probabilities(metric) β€” Bin probabilities

πŸ“„ License

MIT License β€” See LICENSE for details.


🐟 About the Name

Markrel = Markov Chain + Relevance

Like a school of mackerel swimming through the ocean, markrel navigates the sea of documents, tracing probabilistic paths to find the most relevant matches. Each fish (document) follows the currents (similarity scores) toward their destination (relevance). 🐟🐟🐟


Ready to school your documents? Get started with Quick Start β†’

     🐟
   🐟🐟🐟
 🐟🐟🐟🐟🐟
   🐟🐟🐟
     🐟

About

A fast, interpretable Python library that uses Markov chains to predict document relevance. Like a school of mackerel navigating the seas, markrel traces probabilistic paths through similarity space to find the most relevant documents.

Resources

License

Contributing

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages