Skip to content

gugarosa/opforch

Repository files navigation

OPForch: A PyTorch-Powered Optimum-Path Forest Classifier

Latest release Open issues License

Welcome to OPForch.

Note that this implementation relies purely on the standard LibOPF. Therefore, if one uses our package, please also cite the original LibOPF authors.

OPForch is a PyTorch-based implementation of the Optimum-Path Forest (OPF) classifier, migrated from the original OPFython package. By replacing per-node Python objects with dense tensors and scalar Numba loops with batched tensor operations, OPForch delivers massive speedups while maintaining zero prediction mismatches against the reference implementation.

Key Highlights

Metric Result
Accuracy Parity 0 prediction mismatches across all 4 classifiers
Predict Speedup Up to 484Γ— faster at N=10,000
Fit Speedup Up to 19Γ— faster at N=10,000
Distance Matrix Up to 413Γ— faster (batched tensor vs NΒ² scalar loop)
GPU Acceleration 12.7Γ— additional speedup on RTX 4070 for distance computation
Device Support CPU, CUDA, and Multi-GPU via DeviceManager

Use OPForch if you need:

  • Graph-based classification without hyperparameter tuning
  • Deterministic training with competitive accuracy
  • GPU-accelerated distance computation and prediction
  • A drop-in replacement for OPFython with orders-of-magnitude speedups

OPForch is compatible with: Python 3.8+ and PyTorch 2.0+.


Package Structure

opforch/
β”œβ”€β”€ core/
β”‚   β”œβ”€β”€ heap.py          # Tensor-backed binary heap
β”‚   β”œβ”€β”€ subgraph.py      # Dense tensor columns (13 state tensors)
β”‚   └── opf.py           # Abstract base (torch.save/load, device)
β”œβ”€β”€ math/
β”‚   β”œβ”€β”€ distance.py      # 47 batched (N,D)Γ—(M,D)β†’(N,M) distance metrics
β”‚   β”œβ”€β”€ general.py       # Accuracy, confusion matrix, normalize, purity
β”‚   └── random.py        # Tensor-based random generators
β”œβ”€β”€ models/
β”‚   β”œβ”€β”€ supervised.py        # MST + competition + batched predict
β”‚   β”œβ”€β”€ knn_supervised.py    # KNN density clustering + k-selection
β”‚   β”œβ”€β”€ semi_supervised.py   # Labeled + unlabeled propagation
β”‚   └── unsupervised.py      # Density clustering + normalized cut
β”œβ”€β”€ stream/
β”‚   β”œβ”€β”€ loader.py        # CSV/TXT/JSON β†’ torch.Tensor
β”‚   β”œβ”€β”€ parser.py        # Extract features + labels
β”‚   └── splitter.py      # Train/test split
β”œβ”€β”€ subgraphs/
β”‚   └── knn.py           # KNNSubgraph (torch.topk, vectorized PDF)
β”œβ”€β”€ utils/
β”‚   β”œβ”€β”€ constants.py     # EPSILON, FLOAT_MAX, status codes
β”‚   β”œβ”€β”€ converter.py     # Binary OPF format converters
β”‚   β”œβ”€β”€ device.py        # DeviceManager (CPU/GPU/multi-GPU)
β”‚   β”œβ”€β”€ exception.py     # Custom exception hierarchy
β”‚   └── logging.py       # Timed rotating file logger
β”œβ”€β”€ report/              # Migration report, benchmarks, and plots
β”œβ”€β”€ examples/            # Usage scripts for all 4 classifiers

Installation

Install from source:

git clone https://github.com/gugarosa/opforch.git
cd opforch
pip install -e .

For GPU support, install PyTorch with CUDA:

pip install torch --index-url https://download.pytorch.org/whl/cu124

Quick Start

Supervised Classification

import torch
from opforch.models import SupervisedOPF
from opforch.stream import loader, parser, splitter

# Load data
data = loader.load_txt("data/boat.txt")
X, Y = parser.parse_loader(data)
X_train, X_test, Y_train, Y_test = splitter.split(X, Y, percentage=0.5)

# Train and predict (CPU)
opf = SupervisedOPF(distance="log_squared_euclidean")
opf.fit(X_train, Y_train)
predictions = opf.predict(X_test)

# GPU β€” just change the device
opf_gpu = SupervisedOPF(distance="euclidean", device="cuda:0")
opf_gpu.fit(X_train.cuda(), Y_train.cuda())
predictions = opf_gpu.predict(X_test.cuda())

Available Classifiers

Classifier Description
SupervisedOPF MST-based prototype detection + cost competition
KNNSupervisedOPF k-NN density clustering with validation-driven k
SemiSupervisedOPF Extends supervised with unlabeled data propagation
UnsupervisedOPF Density-based clustering with normalized cut

All classifiers support fit(), predict(), save(), and load(), and accept a device parameter for CPU/GPU execution.


Benchmarks

Run the benchmark suite to compare performance on your hardware:

# Baseline benchmarks (47 metrics, 4 models, scaling)
python report/benchmark.py

# Extended benchmarks (up to N=10K, GPU, dimensionality)
python report/benchmark_extended.py

# Generate plots
python report/plot_benchmarks.py
python report/plot_extended.py

For the full migration report with detailed analysis, see report/REPORT.md.


Architecture

The key architectural change from OPFython is the elimination of per-node Python objects in favor of dense tensor columns:

OPFython:  subgraph.nodes[i].cost = 5.0        # Python object attribute
OPForch:   subgraph.costs[i] = 5.0             # Tensor element (GPU-ready)

Prediction is fully batched β€” a single tensor operation replaces the O(NΓ—M) Python loop:

dist_matrix = distance_fn(train_features, test_features)      # (N, M)
path_costs = torch.maximum(train_costs[:, None], dist_matrix)  # (N, M)
predictions = train_labels[path_costs.argmin(dim=0)]           # (M,)

For the complete architecture documentation, see ARCHITECTURE.md.


Citation

If you use OPForch to fulfill any of your needs, please cite us:

J. P. Papa, A. X. FalcΓ£o and C. T. N. Suzuki.
Supervised Pattern Classification based on Optimum-Path Forest.
International Journal of Imaging Systems and Technology (2009).

Datasets

Looking for datasets? We have some pre-loaded into OPF file format in the data/ directory. More are available at recogna.tech.


Support

If you ever need to report a bug, talk to us, or suggest improvements, please open an issue. We will do our best to help.


About

🌳 A PyTorch-inspired implementation of the Optimum-Path Forest classifier.

Topics

Resources

License

Code of conduct

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages