TFM-Playground

This repository provides a fully open-source playground for tabular foundation models. It contains a much smaller and simpler implementation of the TabPFNv2 architecture (nanoTabPFN), a training loop, multiple interfaces for loading prior data, and an evaluation pipeline. We plan to rapidly extend the repository with more features, prior interfaces and architectures. It is meant as a starting point for students and researchers who want to learn how tabular foundation models work under the hood.

Clone the repository, then install the dependencies via:

pip install -e .

We offer the same interface as TabPFN:

from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

from tfmplayground import NanoTabPFNClassifier

# Load data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

# Initialize a classifier
clf = NanoTabPFNClassifier()
clf.fit(X_train, y_train)

# Predict probabilities
prediction_probabilities = clf.predict_proba(X_test)
print("ROC AUC:", roc_auc_score(y_test, prediction_probabilities[:, 1]))

# Predict labels
predictions = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))

Our Code

tfmplayground/model.py contains the implementation of the architecture in fewer than 250 lines of code. tfmplayground/train.py implements a simple training loop in under 100 lines, and tfmplayground/priors.py implements a data loader that reads a dump pre-generated from a prior. We will release multiple dumps of different scales soon. We also offer an interface that lets you provide your own get_batch function.
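
To illustrate the idea of a custom get_batch function, here is a minimal, dependency-free sketch of a batch generator for a toy prior. The function signature and the returned dictionary keys (x, y, single_eval_pos) are assumptions made for this illustration and are not taken from tfmplayground/priors.py; check that file for what the training loop actually expects. A real prior would most likely return torch tensors rather than nested lists.

```python
import random


def get_batch(batch_size, num_datapoints, num_features, num_classes=2, seed=None):
    """Hypothetical get_batch: sample `batch_size` toy classification tables.

    Each table gets its own random linear scoring function. Labels are the
    rank bucket of the score, so every table is a different small task.
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(batch_size):
        weights = [rng.gauss(0.0, 1.0) for _ in range(num_features)]
        table = [[rng.gauss(0.0, 1.0) for _ in range(num_features)]
                 for _ in range(num_datapoints)]
        scores = [sum(w * v for w, v in zip(weights, row)) for row in table]
        # Rank-based bucketing keeps the classes roughly balanced.
        order = sorted(range(num_datapoints), key=lambda i: scores[i])
        labels = [0] * num_datapoints
        for rank, idx in enumerate(order):
            labels[idx] = rank * num_classes // num_datapoints
        xs.append(table)
        ys.append(labels)
    # Assumed convention: rows before this index form the training context,
    # the remaining rows are the query points the model must predict.
    single_eval_pos = num_datapoints // 2
    return {"x": xs, "y": ys, "single_eval_pos": single_eval_pos}


batch = get_batch(batch_size=4, num_datapoints=50, num_features=3, num_classes=3, seed=0)
print(len(batch["x"]), len(batch["x"][0]), len(batch["x"][0][0]))  # 4 50 3
```

The shapes mirror the dump described in the pretraining section (50 datapoints, 3 features, up to 3 classes), so a generator like this could stand in for the pre-generated file while experimenting.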

Pretrain your own small nanoTabPFN

First, download 100k pre-generated datasets, each with 50 datapoints, 3 features and up to 3 classes, from here.

Then you can run:

python pretrain_classification.py --epochs 80 --steps 25 --batchsize 50 --priordump 50x3_3_100k_classification.h5

This should take less than 5 minutes on a modern NVIDIA GPU (around 10 minutes on a MacBook M4 Pro GPU and around 40 minutes on an M4 Pro CPU).
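
As a quick sanity check on the flags in the command above, the following sketch multiplies them out. It assumes that each step consumes one batch of --batchsize datasets and that no dataset is reused within a run; under that assumption the command draws exactly as many datasets as the 100k dump contains. The variable names are illustrative only.

```python
epochs = 80            # --epochs
steps_per_epoch = 25   # --steps
batch_size = 50        # --batchsize
datasets_in_dump = 100_000

total_steps = epochs * steps_per_epoch
datasets_consumed = total_steps * batch_size

print(total_steps)                           # 2000
print(datasets_consumed)                     # 100000
print(datasets_consumed / datasets_in_dump)  # 1.0
```

If you change one of the flags, keeping the product at or below the size of the dump is a reasonable rule of thumb under the same assumption.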

We also offer a pre-generated dataset for regression here, containing 1.28M tables with 50 datapoints and 3 features each.

You can pretrain on it using python pretrain_regression.py.

Step by Step Explanation (Classifier)

First, we import the architecture, the prior interface, the training loop and a few helpers:

from tfmplayground.model import NanoTabPFNModel
from tfmplayground.priors import PriorDumpDataLoader
from tfmplayground.train import train
from tfmplayground.utils import get_default_device
from tfmplayground.interface import NanoTabPFNClassifier
from tfmplayground.callbacks import ConsoleLoggerCallback

from torch.nn import CrossEntropyLoss

Then we instantiate our model and loss criterion:

model = NanoTabPFNModel(
    num_attention_heads=6,
    embedding_size=192,
    mlp_hidden_size=768,
    num_layers=6,
    num_outputs=10,
)
criterion = CrossEntropyLoss()
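
The hyperparameters above follow a common transformer sizing pattern, which the short check below makes explicit. It assumes that the embedding is split evenly across attention heads, as in standard multi-head attention; whether NanoTabPFNModel does exactly this is not stated here, so see tfmplayground/model.py for the details.

```python
num_attention_heads = 6
embedding_size = 192
mlp_hidden_size = 768

# Standard multi-head attention needs the embedding to divide evenly by the heads.
assert embedding_size % num_attention_heads == 0

head_dim = embedding_size // num_attention_heads
mlp_expansion = mlp_hidden_size / embedding_size

print(head_dim)       # 32
print(mlp_expansion)  # 4.0
```

When scaling the model up or down, keeping embedding_size a multiple of num_attention_heads avoids shape errors under the same assumption.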

Then we instantiate our prior:

device = get_default_device()
prior = PriorDumpDataLoader(filename='50x3_3_100k_classification.h5', num_steps=25, batch_size=50, device=device)

Finally, we train our model:

trained_model, loss = train(
    model=model,
    prior=prior,
    criterion=criterion,
    epochs=80,
    device=device,
    callbacks=[ConsoleLoggerCallback()]
)

Creating your own datasets

Check out the tabularpriors repository to create your own data using publicly available priors.

