CreditPFN

Continued pretraining of TabPFN (v2.6 / v3) on a curated corpus of real-world credit-risk datasets. The aim is to specialise the tabular foundation model's in-context-learning prior toward the structures, feature distributions, and label noise of credit-risk data, and to test whether a credit-specialised foundation model outperforms generalist TabPFN on downstream PD / LGD tasks.

The whole project is organised as a three-stage pipeline — data → train → eval — where each stage has its own orchestrator, config yaml, and result layout.

1. Overview

Three pipeline stages, each with one config yaml and one CLI script:

Stage	Config	Orchestrator	What it does
Data	`config/data.yaml`	`scripts/data_pipeline.py`	Dedup → register → sanitize → dedup. Writes one sanitized CSV per dataset under `data/processed/`. CPU-only; ~10 minutes for the full 17 PD + 8 LGD corpus.
Train	`config/train.yaml`	`scripts/train_pipeline.py`	Continued pretraining of every `(base × LR × LoRA × query_fraction × accumulate_grad_batches)` tuple in the tunable grid. Reads sanitized CSVs directly, draws a fresh per-epoch subsample for each dataset, then applies TabPFN's official preprocessor (squashing scaler / quantile / SVD) to every step's data before the model sees it. Writes finetuned `.ckpt` files + provenance + per-epoch CSVs. Requires a CUDA GPU.
Eval	`config/eval.yaml`	`scripts/eval_pipeline.py`	K-fold cross-validation of every model (XGBoost, CatBoost, LogReg / LinReg, untuned and trained TabPFN) on every held-out test dataset. Writes one CSV per `(model × dataset × fold)`. Requires a CUDA GPU.

The notebooks under notebooks/ consume the outputs of all three stages and drop publication-quality PDFs under output/figures/. See chapter 4 for the per-folder breakdown.

2. Background

TabPFN is a transformer-based tabular foundation model that performs in-context learning over entire tabular datasets in a single forward pass. Each version ships two separate checkpoints:

a classifier used here for Probability of Default (PD), and
a regressor used here for Loss Given Default (LGD).

The two have different weights and must be adapted independently.

Which base checkpoint? Treated as a training-stage hyperparameter, not a decision baked in at the data-pipeline stage. The default sweep covers v3 (newest, synthetic-only) and v2.6 (synthetic-only). The v2.5 family was dropped on 2026-05-21 — its loaded checkpoint exposes module names PEFT cannot suffix-match for LoRA and its internal scaler produces NaN on constant columns. The full inventory plus the citation chain that grounds each provenance claim lives in docs/CHECKPOINTS.md.

Continued pretraining — as introduced for tabular foundation models in Real-TabPFN (Garg et al., 2025, arXiv:2507.03971) — extends the synthetic-prior pretraining of TabPFN with additional training on a curated corpus of real tabular datasets from a target domain. This project applies the same methodology to credit risk, with the specific objectives:

PD — binary classification of whether an obligor will default within a given horizon.
LGD — regression of the fraction of exposure lost given default.

3. Quick start

You almost certainly cannot run the interesting parts of this repository on a laptop. Continued pretraining and the cross-model benchmark both require a CUDA GPU with ≥ 16 GB VRAM plus substantial system RAM. A laptop with a CPU-only Python install can run the data pipeline, the test suite, and open the notebooks against pre-existing outputs — useful for debugging and dataset curation, but nothing else.

The real workflow lives on an HPC cluster. This project was developed against KU Leuven's VSC (Flemish Supercomputer Centre); step-by-step VSC-specific instructions live in docs/VSC_GUIDE.md. The notes below are general and will adapt to any SLURM-managed cluster with a CUDA GPU partition.

3.1 Install

Python 3.11 or 3.12 is required. torch and scikit-learn don't ship Python-3.14 wheels yet, so on 3.14 pip falls back to compiling them from source, which fails on most platforms.

python3.12 -m venv .venv --prompt CreditPFN          # Linux / macOS
# py -3.12 -m venv .venv --prompt CreditPFN          # Windows / PowerShell
source .venv/bin/activate                            # Linux / macOS
# .venv\Scripts\activate                             # Windows / PowerShell
pip install --upgrade pip
pip install -r requirements.txt

# TabPFN install gotcha (read this).
# This project's src/train/model.py uses the current Prior Labs API
# (4-tuple load_model_criterion_config return; version="v3"/"v2.6";
# download_if_not_exists kwarg). PyPI's `tabpfn` is pinned at 2.2.x with
# an older API and will TypeError on the first model load. Override:
pip install --upgrade "tabpfn @ git+https://github.com/PriorLabs/tabPFN.git@main"

3.2 Verify the install (local laptop is fine)

These three commands run on CPU; a laptop is sufficient. Use them to make sure the project builds and reads its data correctly before submitting any cluster jobs.

pytest -q tests/                                   # ~5 min, no GPU needed
python scripts/data_pipeline.py                    # ~10 min, builds data/processed/
jupyter notebook notebooks/                        # open the data-exploration notebooks

3.3 Real training and eval require a CUDA cluster

The recommended entry point is the chained submitter. It loads config/data.yaml, resolves the storage tier (scratch for fast-purge / data for durable), submits the data job, then submits training and eval as dependents:

# On a cluster login node, after cloning + installing + uploading raw CSVs:
bash scripts/slurm/submit_full_pipeline.sh

To run a single stage by hand, the per-stage SLURM templates live in scripts/slurm/. All of them read the paths.data_source knob in config/data.yaml to decide whether raw and processed data live on fast scratch or durable storage. See section 4.9 for the data layout and docs/VSC_GUIDE.md for VSC-specific paths, partitions, and a failure-mode cheat sheet.

Hydra-style CLI overrides work from any entrypoint (no yaml edits):

python scripts/train_pipeline.py track=lgd train.epochs=30
python scripts/eval_pipeline.py  --method xgboost  --test-dataset 0001.gmsc

4. Repository layout

CreditPFN/
├── README.md
├── requirements.txt
├── src/                      pipeline source code         (see 4.1)
├── config/                   three YAML configs           (see 4.2)
├── scripts/                  CLI + SLURM templates        (see 4.3)
├── notebooks/                exploration + viz notebooks  (see 4.4)
├── tests/                    pytest suite                 (see 4.5)
├── docs/                     long-form documentation      (see 4.6)
├── papers/                   PDF library                  (see 4.7)
├── repositories/             upstream code dumps          (see 4.7)
├── checkpoints/              base + trained TabPFN .ckpt  (see 4.8, gitignored)
├── data/                     raw + processed corpus       (see 4.9, gitignored)
├── output/                   everything the code writes   (see 4.9, gitignored)
└── logs/                     one log file per task        (see 4.9, gitignored)

4.1 `src/` — pipeline source code

Subpackage	Role	Public CLI
`src/data/`	Four data-pipeline stages (dedup pre · register · sanitize · dedup post) plus `preprocessing.py` for per-dataset surgical fixes. Output is one sanitized CSV per dataset under `data/processed/`.	`python -m src.data.<stage>` for any one stage, or `scripts/data_pipeline.py` for the chain.
`src/train/`	The continued-pretraining loop: corpus split (`corpus.py`), the on-the-fly dataloader (`dataloader.py`) that reads sanitized CSVs and draws a fresh random subsample every epoch, TabPFN load/save + LoRA wrapping (`model.py`), training loop with per-epoch monitor (`loop.py`), metrics (`metrics.py`).	`scripts/train_pipeline.py`
`src/model/`	sklearn-style wrappers for every model the eval scores: XGBoost + CatBoost (with Optuna HPO), LogReg / LinReg (default-hyperparam baselines), TabPFN-untuned, TabPFN-trained. Single `base.py::BaselineModel` protocol so the eval loop stays model-agnostic.	importable only
`src/eval/`	The cross-model benchmark: processed-CSV loader, K-fold splitter with inner train/val, comprehensive metrics computation, results-dir routing, skip-existing rerun guard.	`scripts/eval_pipeline.py`
`src/utils/`	Cross-cutting helpers: env-aware path resolver (`paths.py`), one-file-per-task run logging (`run_log.py`), notebook figure sink (`figures.py`), training / eval visualisation helpers (`training_viz.py`, `eval_viz.py`), upstream code refresh (`refresh_repositories.py`).	`python src/utils/refresh_repositories.py`

4.2 `config/` — three YAML configs, one per stage

Every knob lives in one of three YAMLs. Each yaml is the single source of truth for its stage; the corresponding script imports it via OmegaConf and accepts Hydra-style overrides on the CLI (key.nested=value).

File	Drives	Main sections
`config/data.yaml`	`src/data/*` + `scripts/data_pipeline.py`	paths (incl. `data_source: "scratch" \| "data"`), `finetuning.max_rows_per_epoch` + `query_fraction`, dedup detection thresholds, sanitize knobs (max missing rate, FeatureAgglomeration, LGD target clip)
`config/train.yaml`	`src/train/*` + `scripts/train_pipeline.py`	`tunable.*` (sweep axes: base checkpoint × LR × LoRA × query_fraction × accumulate_grad_batches), corpus split (Mode A fractions / Mode B explicit IDs), optimizer + scheduler, LoRA cfg, train loop
`config/eval.yaml`	`src/eval/*` + `scripts/eval_pipeline.py`	enabled baselines, K-fold + inner-val fractions, per-fold Optuna budget, `max_rows_per_model` (per-architecture training-context cap), results dir

What is deliberately not in YAML: anything that never changes across runs (optimizer family AdamW, cosine schedule, metric column order). Those are hardcoded in code — searchable constants near the top of the relevant module.

4.3 `scripts/` — CLI entrypoints and SLURM templates

One orchestrator per pipeline stage, plus SLURM templates for the cluster:

File	What it does
`scripts/data_pipeline.py`	Run all four data stages end-to-end, or just the ones you ask for (`--datasets ...`). Idempotent; `--fresh` rebuilds from scratch.
`scripts/train_pipeline.py`	Iterate the `cfg.tunable` cartesian grid; one trial per call when `--single` or `--trial-index` (SLURM array). Auto-fills missing sanitized CSVs by invoking the data pipeline for just those IDs. Skips trials whose finetuned checkpoint already exists — re-submission is safe.
`scripts/eval_pipeline.py`	Score every model on every test dataset, K-fold CV. Skip-existing by default; `--rerun` to force. Filterable with `--method` / `--test-dataset` / `--task-index`.
`scripts/slurm/*.slurm`	SLURM templates: one per data / train / eval stage, plus `submit_full_pipeline.sh` for the chained submission.
`src/utils/pipeline_clean.py`	Stage-level cleanup utility. `python -m src.utils.pipeline_clean --stages train,eval` wipes the outputs of the named stage(s) so the next re-submit starts those stages from scratch. See chapter 5 for details.

4.4 `notebooks/` — exploration and result visualisations

Six notebooks: two for data exploration (run after the data pipeline), two for training visualisation (run after the training pipeline), and two for eval results (run after the eval pipeline). Every notebook drops its figures as PDFs into output/figures/<notebook-slug>/ via the figure sink helper in src/utils/figures.py; the per-notebook directory is wiped on each re-run, so stale figures never linger. All plotting code lives in the corresponding helper module under src/utils/; the notebook cells contain only function calls so the narrative stays scannable and the logic stays testable.

Notebook	What it shows
`0.0. raw_data_exploration.ipynb`	What did the vendor deliver? Corpus shape-space scatter (features vs rows, per-track → combined, log + linear), per-dataset missing-cell bars, target / class-balance landscape on the raw CSVs.
`0.1. processed_data_exploration.ipynb`	Did sanitisation produce sensible inputs? Same shape + missingness views on the post-sanitize CSVs, plus the 64-feature selection effect and feature-type composition.
`1.0. training_visualization.ipynb`	PD trained variants in one dashboard — per-trial loss / lr / metric curves, cross-trial overlays, LR sweep, LoRA effect, time/accuracy Pareto, convergence diagnostics, leaderboard. `TRACK='pd'`; consumes `output/training/`.
`1.1. training_visualization.ipynb`	LGD counterpart of 1.0 — identical dashboard with `TRACK='lgd'`.
`2.0. final_results.ipynb`	PD eval leaderboard — per-method box plots, per-dataset heatmaps, pairwise win-rate matrix (à la TabPFN-3 Fig 3), trained-vs-untuned scatter (à la Real-TabPFN), fold stability, threshold calibration. `TRACK='pd'`; consumes `output/results/`.
`2.1. final_results.ipynb`	LGD counterpart of 2.0 — identical leaderboard with `TRACK='lgd'`.

Corpus summaries in the data notebooks are memoised so the first cell pays the disk-read cost once and every subsequent plot reads from RAM.

4.5 `tests/` — unit and smoke tests

pytest -q tests/    # ~5 min on a laptop; torch-dependent tests skip when torch is missing

One file per src/ subpackage. Tests lean toward failure-mode coverage over behavioural completeness: a few sharp tests that catch real regressions if a future refactor breaks the contract. Tests requiring a real TabPFN checkpoint on disk are guarded by pytest.importorskip("tabpfn") so the suite stays runnable in a stripped-down CI image.

File	Coverage
`tests/test_data.py`	data pipeline (preprocessing → register → sanitize → dedup) + surgical-fix correctness per dataset
`tests/test_paths.py`	env-aware path resolution (local-vs-cluster routing) + `data_source` cfg knob
`tests/test_train.py`	corpus split (`DatasetRef`), dataloader (`ProcessedDatasetLoader` including per-epoch reshuffle), LR schedule, descriptive name, end-to-end mocked training loop
`tests/test_model.py`	baseline wrappers on synthetic data, model registry
`tests/test_eval.py`	per-cell scoring, K-fold benchmark on synthetic processed CSVs, per-method CSV dirs, rerun-skip with full-fold semantics

4.6 `docs/` — project documentation

File	What it is
`docs/DATA_PIPELINE.md`	Deep-dive on the data stage: one raw CSV's full journey through dedup → register → sanitize, plus the two divergent downstream preprocessing paths (TabPFN vs classical baselines). Companion to `config/data.yaml`.
`docs/CHECKPOINTS.md`	Inventory of every base `.ckpt` we ship (v2.6 / v3): training data (synthetic-only), sample/feature caps, layer counts, licence terms. Cross-referenced to the HF model cards and Grinsztajn et al. 2026 (arXiv:2511.08667), which documents the architecture family.
`docs/LITERATURE.md`	Chronological tour of every paper under `papers/`, with a "For CreditPFN" pointer per paper. The most directly relevant works (Real-TabPFN, TabPFNv2, TabPFN-2.5, TabPFN-3, Rubachev finetuning, TabPFN-Wide) are flagged at the top.
`docs/REPOSITORIES.md`	What each `repositories/*.txt` dump is, why we keep it, and which lines to grep when designing each pipeline stage. Refresh script: `python src/utils/refresh_repositories.py`.
`docs/VSC_GUIDE.md`	VSC-specific deployment guide (KU Leuven's Vlaams Supercomputer Centrum): OnDemand portal, conda env, dataset upload, partition / GPU choice, the SLURM submit chain, failure-mode cheat sheet. Read this only when you're about to deploy on VSC; everything in this README applies to any SLURM cluster.

4.7 `papers/` and `repositories/` — reference material

papers/ — PDF library of every paper we cite. The same set is summarised chronologically in docs/LITERATURE.md with extracted-text versions under papers/text/ for grep-friendly search.
repositories/ — flat-text dumps of the upstream Python packages we depend on (TabPFN, TabPFN extensions, the PFN reference implementation, NanoTabPFN, VSC documentation, …). Catalogued in docs/REPOSITORIES.md; refreshed with python src/utils/refresh_repositories.py. These are read-only references — the project does not import any of them.

4.8 `checkpoints/` — base and trained TabPFN weights (gitignored)

checkpoints/*.ckpt — base weights downloaded from Prior Labs (v2.6, v3 in both classifier and regressor flavours). The inventory and provenance live in docs/CHECKPOINTS.md. The actual .ckpt files are gitignored because they're large; collaborators download them once during environment setup.
checkpoints/trained/{pd,lgd}/*.ckpt — finetuned weights produced by train_pipeline.py. Each is paired with a <file>.ckpt.provenance.json sidecar that records every hyperparameter, the training/test dataset IDs, the GPU used, and the wall-clock time — readable via src.train.model.load_provenance(path).

4.9 Runtime trees: `data/`, `output/`, `logs/` (gitignored)

These directories are populated by the pipeline scripts. They are gitignored because the contents are large and machine-specific.

data/                           # data pipeline input + sanitized output
├── raw/{pd,lgd}/<id>.csv       # hand-curated input corpus (you supply this)
├── processed/{pd,lgd}/         # <id>.sanitized.csv — the on-disk training input
├── dedup/                      # doubles_{track}_{pre,post}.csv (always durable)
└── manifest_{pd,lgd}.csv       # per-track dataset manifest (one row per dataset)

output/                         # everything the code writes (except trained .ckpt)
├── training/
│   ├── manifests/<run>_<track>.csv         one row per trial
│   └── epochs/<track>/<descriptive>.csv    per-epoch (loss, lr, train/test metric)
├── results/<TRACK>/<method>/<run>_<ts>.csv eval-pipeline CSVs (one per task)
└── figures/<notebook-slug>/*.pdf           per-notebook PDF figure dumps

logs/<task>_<ts>[_j<jid>_a<tid>].log        one log file per task (flat dir)

On a laptop, data/ and output/ live under the repo root. On a cluster, they are split between fast and durable storage tiers according to paths.data_source in config/data.yaml:

data_source: "scratch" — data/raw/, data/processed/ on fast scratch storage (subject to monthly purge); dedup files and manifests still on durable storage.
data_source: "data" — everything on durable storage.

The eval results (output/results/), training manifests (output/training/), figures (output/figures/), checkpoints, and logs always live on durable storage regardless of data_source.

5. Re-submitting the pipeline (resume semantics + cleanup)

Every stage is designed to be idempotent — re-submitting picks up where the previous run left off without redoing work it has already done. The skip logic lives in three places, one per stage:

Stage	What is treated as "already done"	What re-submission does
Data	A dataset is skipped if its `data/processed/<track>/<id>.sanitized.csv` exists on disk. Checked per-dataset by `_ensure_processed` in `scripts/train_pipeline.py` and by `scripts/data_pipeline.py` itself.	Datasets already on disk are skipped silently. Missing datasets are sanitized fresh.
Train	A trial is skipped if BOTH `checkpoints/trained/<track>/<descriptive_name>.ckpt` AND `<...>.ckpt.provenance.json` exist. (Added 2026-05-27.)	A "SKIP" row is appended to the manifest for that trial; the SLURM array task exits cleanly without retraining.
Eval	A `(model, dataset)` pair is skipped if every fold has an existing `status="OK"` row in its per-method CSV under `output/results/`. Partial-failure pairs (some folds FAIL) ARE retried.	Pairs already complete are skipped. New trials added to the training manifest after a previous eval run get scored. `--rerun` flag forces fresh scoring.
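The train and eval skip rules in the table can be sketched as two small predicates. This is a minimal illustration, not the project's code: the function names and the `fold` column name are hypothetical, while the file suffixes and the `status="OK"` value come from the table above.

```python
from pathlib import Path


def trial_is_done(ckpt_dir: Path, descriptive_name: str) -> bool:
    """Train rule: skip only when BOTH the checkpoint and its sidecar exist."""
    ckpt = ckpt_dir / f"{descriptive_name}.ckpt"
    sidecar = ckpt_dir / f"{descriptive_name}.ckpt.provenance.json"
    return ckpt.is_file() and sidecar.is_file()


def pair_is_done(rows: list[dict], n_folds: int) -> bool:
    """Eval rule: skip only when every fold has a status="OK" row.

    `rows` are the result rows of one (model, dataset) pair; the "fold"
    key is an assumed column name.
    """
    ok_folds = {row["fold"] for row in rows if row["status"] == "OK"}
    return ok_folds >= set(range(n_folds))


# A pair with one failed fold is NOT complete, so it would be retried.
rows = [{"fold": 0, "status": "OK"}, {"fold": 1, "status": "FAIL"}]
print(pair_is_done(rows, n_folds=2))  # False
```

Requiring both files in the train rule means a job that crashed between writing the checkpoint and writing its sidecar is retrained rather than silently accepted.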

Logs are NEVER auto-deleted between runs. Each SLURM job creates a new logs/<task>_<ts>_j<jid>_a<tid>.log file (suffixed with the trial's descriptive name on rename — see the training pipeline). They accumulate until you explicitly delete them.

To force a fresh start of one or more stages, use the cleanup utility:

# wipe just the training stage (keeps data + eval intact)
python -m src.utils.pipeline_clean --stages train

# wipe training and eval (keeps the sanitized CSVs)
python -m src.utils.pipeline_clean --stages train,eval

# nuclear option — wipe everything
python -m src.utils.pipeline_clean --stages all

# dry-run first to see what would be deleted
python -m src.utils.pipeline_clean --stages all --dry-run

pipeline_clean deletes every output of the listed stage(s) — sanitized CSVs, dedup files, finetuned checkpoints, manifests, per-epoch CSVs, benchmark CSVs, notebook figures, AND that stage's log files. It NEVER touches data/raw/ (the input corpus) or checkpoints/*.ckpt (the base TabPFN weights — only checkpoints/trained/ is in scope). Full file catalogue lives in the module's docstring at src/utils/pipeline_clean.py.

The util has no heavy dependencies (no omegaconf, no torch) — it runs on a bare login node. It reads config/{data,eval,train}.yaml when available (via omegaconf or PyYAML) and falls back to hardcoded defaults otherwise.

Typical re-submit workflow. You change something in config/train.yaml (different LR sweep, different LoRA targets) and want to retrain. The previous run's checkpoints would now be stale, so:
python -m src.utils.pipeline_clean --stages train,eval
sbatch scripts/slurm/train_pd.slurm
Data stays — you didn't change anything that would invalidate the sanitized CSVs. Eval is cleared too because the trained checkpoints changed, so old eval results would mix freshly-trained scores with stale-base scores.

6. Data pipeline

Four stages, in order. The end-to-end driver is scripts/data_pipeline.py; each stage can also run independently via python -m src.data.<stage>. There is no .npz chunking step — the sanitized CSV is the canonical on-disk training input, and the training loop builds batches on the fly.

#	Module	Reads	Writes
1	`src/data/dedup.py` `--pass pre`	`data/raw/{pd,lgd}/*.csv`	`data/dedup/doubles_{track}_pre.csv`
2	`src/data/register.py`	raw CSVs + `DATASET_METADATA`	`data/manifest_{pd,lgd}.csv`
3	`src/data/sanitize.py`	raw CSVs + manifests	`data/processed/{pd,lgd}/<id>.sanitized.csv`
4	`src/data/dedup.py` `--pass post`	processed CSVs	`data/dedup/doubles_{track}_post.csv`

Plus one importable helper used by stages 2 and 3:

src/data/preprocessing.py — DATASET_METADATA (target column, categorical hints, source) and per-dataset surgical fixes (drop ID columns, decode bespoke string formats, parse "5yrs 3mon" → integer months, remove target-leakage columns). Currently registers 17 PD + 8 LGD datasets. No statistical operations here — no log-transforms, no scaling, no clipping.

Stage descriptions

dedup.py: seven detection methods per pass per track (identifier match, column-name Jaccard + identical shape, row-level pandas hash, column-level hash, rounded-row hash, subset detection, fuzzy column-name match). Diagnostic only: the stage writes a doubles_{track}_{pre,post}.csv report listing every flagged pair with the detection method and confidence label (high / medium / low) but does NOT remove any dataset from the corpus. The first occurrence of a dataset within a track is always considered the canonical one; only subsequent duplicates appear in the report. To act on findings, manually delete a CSV from data/raw/<track>/, then run pipeline_clean --stages data followed by sbatch scripts/slurm/data.slurm.
register.py — applies surgical fixes, then computes per-dataset metadata (n_rows / n_cols, missing rate, class balance, target mean/std, content-aware shape hash). Idempotent: re-running updates rows in place and preserves existing IDs not in the current filter.
sanitize.py — surgical fixes, then a dataset-agnostic clean: drop exact-duplicate columns, drop > 90 %-NaN columns, drop constant columns, coerce numeric strings, cast numericals to float32 (out-of-range values become NaN before the cast — no overflow warnings), ±inf → NaN, optional unsupervised feature selection to ≤ 64 real columns (top by scale-free variance, de-correlated; not averaged), label-encode classification targets, clip LGD targets to [0, 1].
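Three of the sanitize steps above (the float32 range guard, ±inf → NaN, and the LGD clip) plus the constant-column test can be illustrated on plain Python lists. This is a simplified sketch with hypothetical helper names; the real stage works on whole CSVs, and treating an all-NaN column as constant is an assumption made here.

```python
import math

FLOAT32_MAX = 3.4028234663852886e38  # largest finite float32 value


def to_float32_safe(values):
    """Replace +/-inf and values outside the float32 range with NaN before a cast."""
    return [
        math.nan if (math.isinf(v) or abs(v) > FLOAT32_MAX) else v
        for v in values
    ]


def clip_lgd_target(values, lo=0.0, hi=1.0):
    """Clip LGD targets to [lo, hi]; NaN passes through unchanged."""
    return [v if math.isnan(v) else min(max(v, lo), hi) for v in values]


def is_constant(values):
    """True when the non-NaN values take at most one distinct value (assumed rule)."""
    return len({v for v in values if not math.isnan(v)}) <= 1


print(clip_lgd_target([-0.2, 0.5, 1.3]))  # [0.0, 0.5, 1.0]
print(is_constant([3.0, 3.0, math.nan]))  # True
```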

The eval pipeline reads the same sanitized CSVs but applies its own K-fold CV split + per-model row cap (see chapter 8). The training loop reads them via src/train/dataloader.py::ProcessedDatasetLoader, which draws a fresh random subsample of finetuning.max_rows_per_epoch rows from each parent dataset every epoch and then runs TabPFN's official preprocessor (TabPFNEnsemblePreprocessor, see src/train/tabpfn_preprocessing.py) on the context split. This is the same squashing-scaler + quantile + SVD + class-permutation pipeline that runs at inference time inside TabPFNClassifier.predict_proba, so training-time inputs match the distribution the model was pretrained on. (Added 2026-05-27; before this, the model was trained on raw heavy-tailed features while inference applied preprocessing, which produced calibration collapse.)
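The per-epoch resampling described above can be sketched with the standard library alone. The function name, the seeding scheme, and the way the draw is split into context and query rows are assumptions for illustration; ProcessedDatasetLoader may do this differently.

```python
import random


def epoch_subsample(n_rows, max_rows, query_fraction, seed, epoch):
    """Draw up to max_rows row indices for one epoch and split them.

    Returns (context_indices, query_indices). The seed is derived from
    (seed, epoch) so each epoch gets a new draw that is still reproducible.
    """
    rng = random.Random(seed * 1_000_003 + epoch)  # assumed seeding scheme
    k = min(n_rows, max_rows)
    picked = rng.sample(range(n_rows), k)
    # Keep at least one context row whenever more than one row was drawn.
    n_query = min(max(1, round(k * query_fraction)), k - 1)
    return picked[n_query:], picked[:n_query]


context, query = epoch_subsample(
    n_rows=150_000, max_rows=10_000, query_fraction=0.2, seed=42, epoch=0
)
print(len(context), len(query))  # 8000 2000
```

In this sketch the preprocessor would then be fitted on the `context` rows only, matching the note above that it runs on the context split.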

What sanitize.py deliberately does NOT do

TabPFN's package handles these internally — see docs/REPOSITORIES.md § "Outlier handling":

Step	Why we don't pre-apply it
Outlier winsorisation	TabPFN's `OUTLIER_REMOVAL_STD = 12.0` (classifier) / `None` (regressor) handles outliers with the right semantics. We invoke it from `src/train/tabpfn_preprocessing.py::apply_outlier_clip` just before each model forward.
`PowerTransformer` / `QuantileTransformer` / `RobustScaler` / `SquashingScaler` / SVD	TabPFN's per-estimator inference ensemble cycles through these on every fit. We invoke them per-step in training (`TabPFNEnsemblePreprocessor.fit_transform_ensemble_members`) and TabPFN re-runs them at inference.
NaN imputation	TabPFN's preprocessor handles NaNs natively (learned default + binary indicator) — no pre-imputation needed.
Regression target z-normalisation	Applied per-step inside `tabpfn_preprocessing.build_ensemble_members` on context-only stats, mirroring TabPFN's `znorm_space_bardist_`.
Class label permutation (data augmentation)	Generated per-step inside `TabPFNEnsemblePreprocessor` via `class_shift_method="shuffle"`. The loop unpermutes the logit columns before the CE loss (matches the inference path at `TabPFN .txt:8511-8523`).

7. Training pipeline

A thin orchestrator over src/train/. The single source of truth for hyperparameters is config/train.yaml, in three layers:

Tunable HPs (tunable.* lists at the top): base checkpoint, learning rate, LoRA on/off, query_fraction, accumulate_grad_batches, and epoch-pass mode. Anything genuinely unknown in advance. The full cartesian product is the default sweep (currently 2 bases × 3 LRs × 2 LoRA × 1 qf × 1 acc × 2 epoch-pass-modes = 24 trials per track).
Fixed HPs (single values under train.*) — epochs, AMP, gradient clipping, warmup fraction, per-epoch monitor subsample, n_estimators_finetune (TabPFN ensemble size during training; 2 per the official FinetunedTabPFNClassifier). Follow TabPFN's defaults where those are well-tuned. The per-step subsample size lives in config/data.yaml (finetuning.max_rows_per_epoch, per-version: 10 000 for v3, 6 000 for v2.6; plus an optional v3-only max_cells_per_epoch cell budget).
Hardcoded in code — optimizer family (AdamW), betas ((0.9, 0.999)), scheduler family (linear-warmup → cosine-decay). Never change between runs.
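The sweep size quoted for the tunable layer can be checked by expanding the grid by hand. The list lengths below match the counts stated above; the concrete values and the `full_pass` key name are placeholders, not the contents of config/train.yaml.

```python
from itertools import product

# Axis lengths follow the README (2 x 3 x 2 x 1 x 1 x 2); values are illustrative.
tunable = {
    "base": ["v3", "v2.6"],
    "lr": [1e-5, 3e-5, 1e-4],          # placeholder learning rates
    "use_lora": [False, True],
    "query_fraction": [0.2],           # placeholder
    "accumulate_grad_batches": [1],
    "full_pass": [False, True],        # hypothetical name for the epoch-pass mode
}

trials = [dict(zip(tunable, combo)) for combo in product(*tunable.values())]
print(len(trials))  # 24


def pick_trial(index):
    """Return the trial at `index`, or None for an out-of-range index (soft no-op)."""
    return trials[index] if 0 <= index < len(trials) else None


print(pick_trial(0)["base"])  # v3  (head of every list)
print(pick_trial(99))         # None
```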

Three invocation modes

# Cartesian product of all tunable lists (default; local sequential).
python scripts/train_pipeline.py

# One trial only — head of every tunable list. Good for smoke tests.
python scripts/train_pipeline.py --single

# One trial picked by index N. Designed for SLURM arrays.
python scripts/train_pipeline.py --trial-index $SLURM_ARRAY_TASK_ID

# How many trials does the current cfg expand to?
python scripts/train_pipeline.py --list-trials

Out-of-range trial indices exit zero cleanly (soft no-op), so an over-sized SLURM array is safe.

Auto-process hook

Before training starts, the pipeline checks that every dataset it needs has a sanitized CSV on disk under data/processed/<track>/. Missing CSVs trigger scripts/data_pipeline.py transparently for just those IDs. Net effect: train_pipeline.py runs from a fresh checkout and fills the processed corpus as needed.

Configurable training datasets

Two paths into the train/test split, both in cfg.corpus:

Mode A — fraction-based (default). train_fraction / test_fraction slice the registered corpus count-wise, deterministic in cfg.seed.
Mode B — explicit lists. Set train_dataset_ids and/or test_dataset_ids to fix specific datasets in one or both buckets:
```
corpus:
  train_dataset_ids: ["0001.gmsc"]
  test_dataset_ids: []
```
IDs unknown to DATASET_METADATA raise a clear error with the full list of valid IDs for the active track — no silent skips.
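A minimal sketch of the two ideas above: a count-wise split that is deterministic in the seed (Mode A), and up-front validation of explicit IDs (Mode B). Function names are hypothetical, and deriving the test bucket as "everything not in train" is a simplifying assumption; the real config also has a separate test_fraction.

```python
import random


def split_by_fraction(all_ids, train_fraction, seed):
    """Mode A sketch: shuffle with a fixed seed, then slice count-wise."""
    ids = sorted(all_ids)                  # sort first so input order cannot matter
    random.Random(seed).shuffle(ids)
    n_train = round(len(ids) * train_fraction)
    return sorted(ids[:n_train]), sorted(ids[n_train:])


def validate_ids(requested, all_ids):
    """Mode B sketch: fail loudly on unknown IDs instead of skipping them."""
    unknown = sorted(set(requested) - set(all_ids))
    if unknown:
        raise ValueError(
            f"Unknown dataset IDs {unknown}; valid IDs are {sorted(all_ids)}"
        )


corpus = [f"{i:04d}.example" for i in range(1, 11)]   # made-up IDs
train, test = split_by_fraction(corpus, train_fraction=0.8, seed=0)
print(len(train), len(test))  # 8 2
```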

Worked recipes

Goal	Command
Debug, 1 dataset, 1 HP set	`python scripts/train_pipeline.py --single corpus.train_dataset_ids='[0001.gmsc]' train.epochs=3`
Debug, 1 dataset, HP grid	`python scripts/train_pipeline.py corpus.train_dataset_ids='[0001.gmsc]' train.epochs=3`
5 specific PD datasets, 1 HP set	`python scripts/train_pipeline.py --single track=pd corpus.train_dataset_ids='[0001.gmsc,0002.taiwan_creditcard,0003.vehicle_loan,0004.lendingclub,0009.bank_status]'`
Full corpus, 1 HP set	`python scripts/train_pipeline.py --single`
Full corpus, full HP grid	`python scripts/train_pipeline.py`
Full corpus, full HP grid, on the cluster	`bash scripts/slurm/submit_full_pipeline.sh` — see `docs/VSC_GUIDE.md`

Hydra-style CLI overrides (key=value) write through the in-memory cfg; they are NOT persisted to config/train.yaml. A debug run does not break a teammate's next full run.
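To make the "in-memory only" behaviour concrete, here is a standard-library imitation of dotted key=value overrides. It is not how the project does it (the scripts use OmegaConf), and `apply_overrides` is a hypothetical helper; the point is only that the merged config is a copy and nothing is written back to disk.

```python
import ast
import copy


def apply_overrides(cfg, overrides):
    """Return a copy of cfg with dotted key=value overrides applied."""
    merged = copy.deepcopy(cfg)                  # the caller's dict stays untouched
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        try:
            value = ast.literal_eval(raw_value)  # "30" -> 30, "0.5" -> 0.5
        except (ValueError, SyntaxError):
            value = raw_value                    # bare words such as "lgd" stay strings
        *parents, leaf = dotted_key.split(".")
        node = merged
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return merged


base = {"track": "pd", "train": {"epochs": 10}}
run_cfg = apply_overrides(base, ["track=lgd", "train.epochs=30"])
print(run_cfg)  # {'track': 'lgd', 'train': {'epochs': 30}}
print(base)     # {'track': 'pd', 'train': {'epochs': 10}}
```

List-valued overrides such as `[0001.gmsc]` are left as raw strings by this sketch; a real config library parses them into lists.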

Outputs

Each trial writes:

Artefact	Path
Final-epoch weights	`checkpoints/trained/<track>/<descriptive_name>.ckpt`
Provenance sidecar (HPs, train/test IDs, GPU, walltime, …)	`<descriptive_name>.ckpt.provenance.json`
Manifest row consumed by the eval pipeline	`output/training/manifests/<run_name>_<track>.csv`
Per-epoch CSV (epoch incl. `-1` = pre-FT baseline; train_loss, lr, train/test primary + secondary metric, epoch_time)	`output/training/epochs/<track>/<descriptive_name>.csv`
Full run log (slurm stdout + python logger)	`logs/train_<track>_<ts>[_j<jid>_a<tid>].log`

Filename schema: <run_name>_<track>_<base-stem>_lr<lr>_seed<seed>[_qf<qf>][_acc<K>][_fullpass][_lora].ckpt. Identical re-runs overwrite in place; trials with different HPs land in distinct files.
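The schema can be read as "mandatory parts, then optional suffixes in a fixed order". The sketch below assembles a name that way; the helper name, the `lr` formatting (`%g`), and rendering `qf` as a percentage are assumptions, so real filenames may differ in those details.

```python
def descriptive_name(run_name, track, base_stem, lr, seed,
                     qf=None, acc=None, full_pass=False, use_lora=False):
    """Assemble a checkpoint filename following the schema above (illustrative)."""
    parts = [run_name, track, base_stem, f"lr{lr:g}", f"seed{seed}"]
    if qf is not None:
        parts.append(f"qf{round(qf * 100)}")   # assumed: fraction shown as percent
    if acc is not None:
        parts.append(f"acc{acc}")
    if full_pass:
        parts.append("fullpass")
    if use_lora:
        parts.append("lora")
    return "_".join(parts) + ".ckpt"


# "mybase" is a made-up base stem.
print(descriptive_name("train", "pd", "mybase", 1e-5, 42, qf=0.2, acc=1, use_lora=True))
# train_pd_mybase_lr1e-05_seed42_qf20_acc1_lora.ckpt
```

Because every hyperparameter that varies across trials is encoded in the name, two trials collide on disk only when they are identical re-runs.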

Trained-checkpoint provenance

Every saved checkpoint at checkpoints/trained/<track>/*.ckpt is paired with a <file>.ckpt.provenance.json sidecar (and an identical copy embedded under the "provenance" key inside the .ckpt itself) recording:

All hyperparameters used (base, lr, weight_decay, betas, scheduler, warmup fraction, epochs, accumulate, grad clip, amp, ctx/query sample sizes, seed, use_lora + LoRA config).
Sorted training-dataset and test-dataset ID lists.
Counts of train/test datasets.
training_time_seconds, GPU name, torch_version, tabpfn_version, ISO-8601 saved_at.

Use src.train.model.load_provenance(path) to read the provenance from either the sidecar or the .ckpt path without loading the model weights.
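Since the sidecar is plain JSON, it can also be inspected without the project installed. The helper below is a hypothetical stand-in, not `load_provenance` itself; it only relies on the `<file>.ckpt.provenance.json` naming and on field names listed above.

```python
import json
from pathlib import Path


def read_sidecar(ckpt_path):
    """Load the JSON sidecar that sits next to a trained checkpoint."""
    sidecar = Path(str(ckpt_path) + ".provenance.json")
    with sidecar.open(encoding="utf-8") as handle:
        return json.load(handle)


def summarise(provenance):
    """Pick a few of the documented fields; missing keys come back as None."""
    keys = ("training_time_seconds", "torch_version", "tabpfn_version", "saved_at")
    return {key: provenance.get(key) for key in keys}
```

A typical use would be `summarise(read_sidecar("checkpoints/trained/pd/<descriptive_name>.ckpt"))` on a login node where torch is not available.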

Design notes (the why)

Linear-warmup → cosine-decay LR — matches HuggingFace's get_cosine_schedule_with_warmup, which is what TabPFN's FinetunedTabPFNClassifier uses internally. Verified in tests/test_train.py::test_warmup_cosine_schedule_landmarks.
No validation set — with ~17 PD + ~8 LGD datasets, holding out a separate val bucket leaves so few datasets to fit on that the early-stopping signal becomes pure noise. We use fixed-epoch training and pick between hyperparameter settings post hoc on the test set in the eval stage.
Per-epoch monitor eval — at the end of every epoch the loop scores the model on a small subsample of each train- and test-dataset (ROC-AUC for PD, RMSE for LGD). Cheap (~500 rows per chunk) but enough to see whether the model is still improving.
Divergence detection + early abort — when the loss stays constant, or AUC pegs at 0.5 (random), or AMP scaler skips >50 % of recent steps, the training loop aborts and records status=DIVERGED in the manifest. This prevents wasting 3+ hours of GPU on a dead model (observed in train_pd_*qf20_acc1* runs of 2026-05-28 before the safeguard was added).
L2-SP anti-forgetting penalty (optional) — optimizer.l2sp_lambda adds 0.5·λ·‖w − w₀‖² to the loss, penalising drift of the weights away from the synthetic-prior start w₀ (not toward zero, which is what weight_decay does). This is the anti-catastrophic-forgetting term from Real-TabPFN (Garg 2025, §4); on our tiny corpus it protects the pretrained prior's calibration. It is on by default for full-FT trials at Real-TabPFN's λ = 0.003 (set optimizer.l2sp_lambda: 0.0 to disable) and is orthogonal to the LoRA axis but full-FT-only — with LoRA the base weights are frozen and cannot drift, so L2-SP is inert and silently skipped (effectively: full-FT gets L2-SP, LoRA doesn't). The penalty enters only the back-prop loss; the logged CE / NLL curves stay the pure data loss. The effective λ is recorded in each checkpoint's provenance. (λ = 0.003 was tuned for v2 / 20k steps; treat it as a starting point on our setup, not a constant.)
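
Two of the notes above reduce to short formulas, sketched here in plain Python. `warmup_cosine_multiplier` follows the shape of HuggingFace's `get_cosine_schedule_with_warmup` at its default half-cycle setting, and `l2sp_penalty` evaluates 0.5·λ·‖w − w₀‖² on flat lists of floats. Both function names and the step counts are illustrative assumptions; the project's own training loop operates on tensors and is not reproduced here.

```python
import math


def warmup_cosine_multiplier(step: int, warmup_steps: int, total_steps: int) -> float:
    """LR multiplier: linear ramp 0 -> 1 during warmup, then cosine decay 1 -> 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


def l2sp_penalty(weights, start_weights, lam):
    """0.5 * lam * ||w - w0||^2, where w0 is the pretrained starting point."""
    return 0.5 * lam * sum((w - w0) ** 2 for w, w0 in zip(weights, start_weights))


# Landmarks: start, mid-warmup, end of warmup, mid-decay, end of training.
landmarks = [warmup_cosine_multiplier(s, 10, 110) for s in (0, 5, 10, 60, 110)]

# One weight has drifted by 1.0 from its starting value; the other has not moved.
penalty = l2sp_penalty([1.0, 2.0], [1.0, 1.0], lam=0.003)
```

The penalty is zero when the weights equal w₀ and grows with drift from w₀, whereas `weight_decay` would pull the same weights toward zero.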

Optimization objective — why CE / NLL, not AUC

PD (classifier): torch.nn.CrossEntropyLoss. Three reasons: (i) CE is differentiable everywhere, AUC isn't (it's a step function in predictions — gradient is 0 almost everywhere); (ii) CE optimizes BOTH rank-ordering AND probability calibration, while AUC optimizes only rank — but credit-risk regulation (Basel III) requires calibrated PD probabilities for expected-loss and capital calculations; (iii) TabPFN was pretrained with CE — switching to a different objective for finetuning would create a train/finetune mismatch that damages the synthetic prior. Differentiable surrogates for AUC (Wilcoxon-style, ROC-Star) exist but are unstable and typically used as a regularizer added to CE, not a replacement. We track AUC as the primary evaluation metric (more interpretable across imbalance levels) but optimize CE.
LGD (regressor): bar-distribution NLL. TabPFN's regressor head emits logits over a discrete histogram (the "bar distribution", Müller et al. 2023). Its NLL −log(density) is what TabPFN was pretrained on, and crucially it trains the uncertainty estimate simultaneously with the point estimate — a Gaussian-NLL alternative would force unimodal homoscedastic predictions; an MSE alternative would throw away the uncertainty quantification that's the whole point of TabPFN's regression head. Regulators care about the full loss distribution, not just the mean. Negative loss values are expected: a bar-distribution NLL = −log(density) can be negative whenever the density exceeds 1.0 (narrow histogram buckets, sharp predictions). RMSE / R² are tracked as evaluation metrics.
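
The claim that the bar-distribution NLL can go negative is easy to check on a toy histogram. The sketch below treats the prediction as a piecewise-constant density, which is a simplification of TabPFN's actual regression head. The bucket edges, probabilities, and the function name `bar_nll` are made up for illustration.

```python
import math


def bar_nll(probs, edges, y):
    """NLL of y under a piecewise-constant density (probability / bucket width)."""
    for p, lo, hi in zip(probs, edges[:-1], edges[1:]):
        if lo <= y < hi:
            return -math.log(p / (hi - lo))
    raise ValueError("y lies outside the histogram support")


# Three buckets on [0, 1); the narrow middle bucket carries most of the mass.
edges = [0.0, 0.45, 0.55, 1.0]
probs = [0.1, 0.8, 0.1]

nll_sharp = bar_nll(probs, edges, 0.5)  # density about 8, so the NLL is negative
nll_tail = bar_nll(probs, edges, 0.9)   # density below 1, so the NLL is positive
```

A density of about 8 on a bucket of width 0.1 gives an NLL of about −log 8, so a sharp and correct prediction is rewarded with a negative loss.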

Methodology & limitations (statistical validity)

The core methodology is sound — preprocessing mirrors TabPFN's inference path exactly (no train/inference skew), the CE / NLL objectives match what the model was pretrained with, and the eval stage's inner/outer split keeps HPO and threshold-tuning out of the test fold. The honest caveats below are limitations of the experimental design, not bugs:

Small test corpus → wide confidence intervals. A 70/30 dataset-level split of ~17 PD + ~8 LGD datasets leaves only ~5 PD and ~2–3 LGD test datasets. Per-dataset K-fold CV tightens the within-dataset estimate, but the across-dataset mean rests on a handful of datasets — report per-dataset results, not just the pooled mean.
Best-of-24 selection on the test set (winner's curse). With no validation set, the best of 24 trials is picked by test performance; the maximum over 24 noisy estimates is upward-biased. Prefer the per-architecture trained-vs-untuned delta over the absolute best, and report the trial distribution.
Within-domain split ≠ Real-TabPFN's external benchmark. We split one credit corpus into train/test datasets, so they may share vendor quirks. This is the right test of "does credit-specialisation help on credit data," but it measures less external generalisation than a cross-benchmark protocol would.
FeatureAgglomeration is fit on the whole dataset (before the eval CV split), so cluster assignments see rows that later land in test folds. It is unsupervised (never sees y), so there is no label leakage — only a small optimistic bias in feature construction.
No multiple-comparison correction across ~20 metrics. Pre-register the primary metric (roc_auc for PD, rmse for LGD) and treat the rest as diagnostic.
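
The winner's-curse caveat can be made concrete with a small simulation. In the sketch below every trial has the same true score, so any apparent advantage of the best trial is pure selection bias. The noise level of 0.01 and the repeat count are arbitrary assumptions for illustration, not measurements from this project.

```python
import random
import statistics


def mean_best_of(n_trials, noise_sd, n_repeats, seed=0):
    """Average, over many repeats, of the max of n_trials zero-mean noisy scores."""
    rng = random.Random(seed)
    bests = [
        max(rng.gauss(0.0, noise_sd) for _ in range(n_trials))
        for _ in range(n_repeats)
    ]
    return statistics.fmean(bests)


# True effect is zero for every trial; only the selection step differs.
bias_best_of_24 = mean_best_of(n_trials=24, noise_sd=0.01, n_repeats=2000)
bias_single = mean_best_of(n_trials=1, noise_sd=0.01, n_repeats=2000)
```

Under these assumptions the best-of-24 score sits well above zero while a single pre-chosen trial averages out near zero, which is why the trained-vs-untuned delta and the full trial distribution are more trustworthy than the single best number.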

8. Eval pipeline

scripts/eval_pipeline.py scores every model on every held-out test dataset using K-fold cross-validation with an inner train/val split.

Row-cap policy

Both the eval and training pipelines read the same sanitized CSVs (data/processed/{pd,lgd}/<id>.sanitized.csv). The eval stage applies its caps inside each CV fold and only to the training partition. The held-out test partition is never capped, so the model predicts on every test row in one predict_proba call (TabPFN-v3's internal inference_row_chunk_size = 2048 handles arbitrarily large test sets gracefully).

Model family	Train-fold cap	Test fold	HPO subsample
`tabpfn-untuned` / `tabpfn-trained`	`cfg.max_rows_per_model[<v>]` (v3: 1 000 000; v2.x: 100 000)	full	n/a
`xgboost` / `catboost`	none	full	`cfg.hpo.<m>.max_rows = 50 000` (stratified subsample of inner-train; HPO only)
`logreg` / `linreg`	none	full	n/a
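
The policy in the table can be sketched as a cap that touches only the training indices of a fold. The helper `cap_train_rows` is hypothetical, and it assumes plain random subsampling; the project's actual sampling scheme for the train-fold cap is not specified here.

```python
import random


def cap_train_rows(train_idx, cap, seed=0):
    """Subsample the training partition to at most `cap` rows; None means no cap."""
    if cap is None or len(train_idx) <= cap:
        return list(train_idx)
    return sorted(random.Random(seed).sample(list(train_idx), cap))


# One illustrative fold: 800 training rows, 200 test rows.
train_idx = list(range(0, 800))
test_idx = list(range(800, 1000))

capped_train = cap_train_rows(train_idx, cap=500)    # e.g. a TabPFN-style cap
uncapped_train = cap_train_rows(train_idx, cap=None)  # e.g. logreg / linreg
# test_idx is passed to the model unchanged in both cases.
```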

CV semantics — 64 / 16 / 20 per fold

Outer K-fold (cfg.cv.n_folds = 5):
    train     → 80% of dataset
    test      → 20%                  ← final metrics computed here

Inner split (cfg.cv.inner_val_fraction = 0.20):
    sub-train → 64% of dataset       ← model fits on this
    val       → 16%                  ← Optuna HPO objective + F1-threshold tuner

Optuna runs once per CV fold (5 studies per (model × dataset) at n_folds=5), each with hpo.<m>.n_trials trials. LogReg / LinReg are intentionally untuned — they're the "what does plain linear modelling do" baseline.
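
The percentages above follow from applying the inner fraction to the outer training partition: 0.20 × 80 % = 16 % for val, leaving 64 % for sub-train. A minimal index-level sketch, assuming a plain shuffled split (the real pipeline may stratify, and the function name is hypothetical):

```python
import random


def outer_inner_splits(n_rows, n_folds=5, inner_val_fraction=0.20, seed=0):
    """Yield (sub_train, val, test) index lists for each outer fold."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        n_val = round(len(train) * inner_val_fraction)
        # The first n_val shuffled training rows become the inner validation set.
        yield train[n_val:], train[:n_val], test


splits = list(outer_inner_splits(1000))
sizes = [(len(sub_train), len(val), len(test)) for sub_train, val, test in splits]
```

For a 1 000-row dataset every fold has 640 sub-train, 160 val, and 200 test rows, and the three partitions never overlap.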

Test-dataset resolution

For tabpfn-trained models the test datasets come from each checkpoint's .provenance.json (each checkpoint scored on its OWN held-out set, recorded at training time). For tabpfn-untuned and classical baselines the test datasets come from the cfg corpus split. Both routes give the same set when seed and fractions match (the default), so the comparison is apples-to-apples by construction.

Comprehensive metrics (one row per model × dataset × fold)

Wide format; NaN where not applicable.

Group	Columns
Classification — threshold-free	`roc_auc`, `log_loss`, `pr_auc`, `brier_score`
Classification — threshold-tuned	`optimal_threshold` (max-F1 on inner-val), then `f1`, `accuracy`, `precision`, `recall`, `specificity`, `balanced_accuracy`, `mcc`, `cohen_kappa` on test
Regression	`rmse`, `mae`, `median_ae`, `mape`, `r2`, `explained_variance`, `pearson_r`, `spearman_r`, `neg_nll` (TabPFN-only)
Bookkeeping	`n_train_rows`, `n_val_rows`, `n_test_rows`, `elapsed_sec`, `status`, `error`, `timestamp`
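
The threshold-tuned metrics rely on a threshold chosen on inner-val and then frozen for the test fold. The sketch below shows one way to do that in plain Python. The function names and toy scores are illustrative, and scanning the distinct predicted probabilities as candidates is an assumption about the search grid rather than the project's documented behaviour.

```python
def f1_at(y_true, y_prob, threshold):
    """F1 of the positive class when predicting 1 for prob >= threshold."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= threshold)
    fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= threshold)
    fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p < threshold)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def tune_threshold(y_val, p_val):
    """Pick the candidate threshold with the highest F1 on the validation split."""
    candidates = sorted(set(p_val))
    return max(candidates, key=lambda t: f1_at(y_val, p_val, t))


# Tune on inner-val only ...
y_val = [0, 0, 1, 0, 1, 1]
p_val = [0.1, 0.2, 0.35, 0.4, 0.7, 0.9]
optimal_threshold = tune_threshold(y_val, p_val)

# ... then apply the frozen threshold to the held-out test fold.
y_test = [0, 1, 1, 0]
p_test = [0.3, 0.8, 0.5, 0.6]
f1_test = f1_at(y_test, p_test, optimal_threshold)
```

Keeping the tuning step on inner-val is what keeps the reported test-fold `f1` free of threshold leakage.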

Re-runs are idempotent

Before scoring, each (model × dataset) pair is checked against existing CSVs under output/results/<TRACK>/<method-dirname>/. Pairs whose folds are all already OK are skipped. So:

First run — scores every baseline + untuned + trained variant.
Re-run after adding a new trained checkpoint — scores only the new checkpoint's pairs; baselines reuse rows from disk.

Force fresh scoring with --rerun. To rescore a single method, delete its directory under output/results/<TRACK>/ and re-submit.
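
The skip rule can be pictured as a check over the rows already on disk for one (model × dataset) pair. The sketch below works on in-memory dicts. The `fold` key and the `ERROR` status value are assumptions for illustration (only `status` and the `OK` value are named in this README), and `pair_is_done` is a hypothetical helper.

```python
def pair_is_done(rows, n_folds):
    """True when every fold 0..n_folds-1 has at least one row with status OK."""
    ok_folds = {int(r["fold"]) for r in rows if r["status"] == "OK"}
    return ok_folds >= set(range(n_folds))


# Rows as they might be read back from earlier result CSVs (illustrative).
complete = [{"fold": k, "status": "OK"} for k in range(5)]
partial = complete[:4] + [{"fold": 4, "status": "ERROR"}]

skip_complete = pair_is_done(complete, n_folds=5)  # all folds OK, pair is skipped
skip_partial = pair_is_done(partial, n_folds=5)    # one fold failed, pair is rescored
```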

Filters and run modes

# Default — every model × every test dataset.
python scripts/eval_pipeline.py track=pd

# Only one method or one dataset.
python scripts/eval_pipeline.py track=pd --method xgboost --test-dataset 0001.gmsc

# SLURM array: ONE (model × dataset) per task.
N=$(python scripts/eval_pipeline.py --list-tasks track=pd)
sbatch --array=0-$((N - 1))%32 scripts/slurm/eval_pd.slurm

Out-of-range --task-index exits zero cleanly, so an over-sized array doesn't fail.

Results layout

Everything the code writes lives under output/. Trained checkpoints stay under checkpoints/trained/ so they can be wiped independently. See section 4.9 for the full tree.

Method-directory names compress the published checkpoint filenames (tabpfn-v3-classifier-v3_default.ckpt → v3-default); the track-specific "classifier"/"regressor" infix is dropped because the parent PD/ or LGD/ already encodes it. Trained variants append __lr<rate>[__lora] so different HPs / LoRA modes land in different folders.

Every benchmark invocation gets a fresh <timestamp> — earlier runs are never overwritten. Aggregate with pandas:

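import glob
import pandas as pd

# Collect every result CSV of the PD track (pd.concat raises on an empty list).
files = sorted(glob.glob("output/results/PD/*/creditpfn_*.csv"))
df = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
# Wide format: rmse is NaN on PD rows, so it aggregates to NaN with count 0.
summary = df.groupby(["model_name", "model_source"])[
    ["roc_auc", "f1", "log_loss", "rmse"]
].agg(["mean", "std", "count"])
print(summary)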

9. References

The full paper library lives under papers/ with a chronological, detailed summary in docs/LITERATURE.md. The most directly relevant works for this project:

Garg et al., 2025. Real-TabPFN — Improving Tabular Foundation Models via Continued Pre-training With Real-World Data. arXiv:2507.03971 — the recipe we follow.
Hollmann et al., 2025. Accurate predictions on small data with a tabular foundation model. (Nature) — TabPFNv2 architecture.
Grinsztajn et al., 2025. TabPFN-2.5: Advancing the State of the Art in Tabular Foundation Models. arXiv:2511.08667 — the successor architecture used by our v2.6 / v3 checkpoints.
Grinsztajn et al., 2026. TabPFN-3: Technical Report. arXiv:2605.13986 — current generation, used by our v3-default base checkpoint.
Rubachev et al., 2025. On Finetuning Tabular Foundation Models. — finetuning hyperparameter ranges that anchor our training stage.
Kolberg et al., 2026. TabPFN-Wide: Continued Pre-Training for Extreme Feature Counts. — continued-pretraining for >500-feature data via a feature-widening synthetic prior. Note: TabPFN-Wide argues against dimensionality reduction (it uses FeatureAgglomeration only as a baseline to beat). Our sanitize.py step is unsupervised feature selection (keep real columns, don't average) — an independent, pragmatic feature cap, not derived from this paper.

Local code dumps under repositories/ (catalogued in docs/REPOSITORIES.md) cover the public TabPFN package, the docs site, the v2.5 / v2.6 / v3 HuggingFace model cards (v2.5 kept for scholarly reference; not used in our sweep), NanoTabPFN, the V2-Finetuning recipe, and the underlying PFN framework. Read-only — refresh with python src/utils/refresh_repositories.py.
