Authors: Yentl Collin & Clément Dureuil
Context: Deep Learning Semester Project
This project investigates two complementary strategies for improving Tabular Foundation Models (TFMs) based on nanoTabPFN:
| Contribution | Goal | Method | Gain |
|---|---|---|---|
| Part I – Intelligent Context Sampling | Better use of the fixed context budget | GBDT-based learned retrieval | ≈4× context compression, +0.18 AUC |
| Part II – Efficient Attention | Break the O(N²) memory/time bottleneck | ISAB / Sparse / α-Entmax | Linear scaling, +0.04 Acc at k=150 |
The benchmark dataset is Bank Marketing (OpenML id=1461), a binary classification task with ~10k balanced samples.
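The raw Bank Marketing dump is heavily imbalanced toward the negative class, so a balancing step (as in `src/data/openml_bank_marketing.py`) is needed to reach the ~10k balanced samples used here. A minimal sketch of majority-class downsampling on toy labels; the function name, the 9:1 toy ratio, and the exact procedure are illustrative, not the repo's actual implementation:

```python
import numpy as np
import pandas as pd

def balance_classes(y: pd.Series, seed: int = 0) -> pd.Index:
    """Return an index selecting an equal number of rows per class,
    by downsampling every class to the size of the rarest one."""
    rng = np.random.default_rng(seed)
    counts = y.value_counts()
    n_min = counts.min()
    keep = []
    for cls in counts.index:
        cls_idx = y.index[y == cls].to_numpy()
        keep.append(rng.choice(cls_idx, size=n_min, replace=False))
    return pd.Index(np.concatenate(keep))

# Toy usage: a strongly imbalanced binary label vector
y_toy = pd.Series(np.repeat([0, 1], [900, 100]))
y_bal = y_toy.loc[balance_classes(y_toy)]
```

After balancing, each class contributes the same number of rows, so accuracy is directly comparable across methods.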
```
Projet_TabPFN_yentl/
│
├── nanoTabPFN/                    # Core model implementations
│   ├── model.py                   # Classic NanoTabPFN (baseline, O(N²) attention)
│   ├── model_isab.py              # ISAB variant: O(N·M) row attention (inducing points)
│   ├── model_sparse.py            # Top-K sparse row attention: O(N·k)
│   ├── model_splash.py            # α-Entmax attention with FIXED alpha (hyperparameter)
│   ├── model_splash_learned.py    # α-Entmax attention with LEARNED α per head (nn.Parameter)
│   ├── train.py                   # Training loop + PriorDumpDataLoader (HDF5)
│   ├── experiment.ipynb           # Reproduce original nanoTabPFN paper results
│   └── README.md                  # Original nanoTabPFN README
│
├── TabICL/                        # In-Context Learning evaluation framework
│   ├── sampling.py                # Random and k-NN context sampling (baselines)
│   ├── retrival_1.py              # LearnedContextSelector: GBDT-based utility predictor
│   ├── eval.py                    # TabICLEvaluator: runs and plots benchmark results
│   └── main copy.ipynb            # Interactive notebook for ICS experiments
│
├── src/
│   ├── data/
│   │   └── openml_bank_marketing.py   # Data loading, balancing, feature selection
│   └── eval/
│       ├── benchmark.py           # Scalability & performance benchmarks
│       └── metrics.py             # Binary classification metrics (Acc, AUC, F1, LogLoss)
│
├── Rendu/
│   ├── poster/
│   │   ├── poster.tex             # Beamer poster (Gemini theme)
│   │   └── image/                 # Figures: tabpfn.png, learned_retrieval.png,
│   │                              #          isab.png, inference_time.png
│   └── report/
│       └── report.tex             # 5-page academic report
│
├── main.ipynb                     # Main experiment notebook
├── benchmark.ipynb                # Scalability benchmark notebook
├── pre_training.ipynb             # Pre-training experiments
├── requirements.txt               # Python dependencies
└── README.md                      # This file
```
```bash
pip install -r requirements.txt

# Additional for Part I (learned retrieval):
pip install tabicl

# Additional for training nanoTabPFN:
pip install schedulefree h5py
```

`requirements.txt` includes:

- `numpy`, `scipy`, `pandas` — scientific stack
- `scikit-learn`, `openml` — datasets and ML baselines
- `torch>=2.0` — deep learning (CPU / Apple Silicon / CUDA compatible)
- `matplotlib`, `seaborn` — visualization
- `tqdm`, `psutil` — progress and memory tracking
```bash
cd nanoTabPFN
curl http://ml.informatik.uni-freiburg.de/research-artifacts/nanoTabPFN/300k_150x5_2.h5 \
    --output 300k_150x5_2.h5
```

This HDF5 file contains 300k synthetic tabular datasets (each 150 rows × 5 features), generated from Gaussian Process / Bayesian Network priors.
```python
from nanoTabPFN.model import NanoTabPFNModel, NanoTabPFNClassifier
from nanoTabPFN.train import PriorDumpDataLoader, train, get_default_device

device = get_default_device()  # auto-detects CUDA > MPS > CPU

model = NanoTabPFNModel(
    embedding_size=128,
    num_attention_heads=8,
    mlp_hidden_size=512,
    num_layers=4,
    num_outputs=3,
)

prior = PriorDumpDataLoader("nanoTabPFN/300k_150x5_2.h5", num_steps=2500, batch_size=32)
model, history = train(model, prior, lr=4e-3, device=device, steps_per_eval=25)
```

Pre-training takes ~5–10 minutes on a GPU (A100 / M-series Apple Silicon).
```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

clf = NanoTabPFNClassifier(model, device)
clf.fit(X_train, y_train)
prob = clf.predict_proba(X_test)
print("AUC:", roc_auc_score(y_test, prob[:, 1]))
print("Acc:", accuracy_score(y_test, prob.argmax(axis=1)))
```

When the training pool exceeds the model's fixed context budget, choosing which examples to place in context becomes critical; Part I addresses this with learned retrieval.
```python
from TabICL.retrival_1 import LearnedContextSelector
from src.data.openml_bank_marketing import load_data, process_data
from sklearn.model_selection import train_test_split

# 1. Load and preprocess Bank Marketing
X, y = load_data()
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_train, X_test = process_data(X_train_raw, X_test_raw, y_train, top_k_features=5, method="mi")

# 2. Train the GBDT selector (offline mining phase — takes a few minutes)
selector = LearnedContextSelector(n_candidates_mining=20)
selector.fit(X_train, y_train, n_training_queries=50)

# 3. At inference time: retrieve k "golden" examples for any query
query = X_test.iloc[0]
X_ctx, y_ctx = selector.select(query, k=16, X_pool=X_train, y_pool=y_train)
```

```python
from TabICL.eval import TabICLEvaluator
from TabICL.sampling import get_random_context, get_knn_context

evaluator = TabICLEvaluator(X_train, y_train, X_test, y_test, n_eval=200)
evaluator.run("Random", get_random_context, context_sizes=[4, 8, 16, 32, 64])
evaluator.run("KNN", get_knn_context, context_sizes=[4, 8, 16, 32, 64])
evaluator.run("Learned", selector.select, context_sizes=[4, 8, 16, 32, 64])
evaluator.plot_results()  # Plots AUC, Accuracy, Log-Loss vs context size
```

All variants share the same sklearn-like interface and training pipeline as the classic model.
| File | Model class | Classifier class | Row-attention | Complexity |
|---|---|---|---|---|
| `model.py` | `NanoTabPFNModel` | `NanoTabPFNClassifier` | Standard softmax | O(N²) |
| `model_isab.py` | `ISABTabPFNModel` | `ISABTabPFNClassifier` | Induced Set (M=32) | O(N·M) |
| `model_sparse.py` | `SparseTabPFNModel` | `SparseTabPFNClassifier` | Top-K mask (k=16) | O(N·k) |
| `model_splash.py` | `SplashTabPFNModel` | `SplashTabPFNClassifier` | α-entmax, fixed α | O(N²)* |
| `model_splash_learned.py` | `LearnedSplashTabPFNModel` | `LearnedSplashTabPFNClassifier` | α-entmax, learned α/head | O(N²)* |

\* Same asymptotic complexity as Classic, but sparser and less noisy attention distributions. Memory savings require a custom sparse CUDA kernel (not included).
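α-entmax generalizes softmax: it finds a threshold τ such that p_i = [(α−1)·z_i − τ]₊^{1/(α−1)} sums to one, so low scores get exactly zero weight when α > 1. A minimal NumPy sketch using plain bisection for the τ root-finding (the repo's models use a Halley-bisection hybrid; this only illustrates the mechanism):

```python
import numpy as np

def alpha_entmax(z, alpha=1.5, n_iter=50):
    """alpha-entmax via bisection: find tau so that
    sum_i [ (alpha-1)*z_i - tau ]_+ ** (1/(alpha-1)) == 1.
    alpha=1 is the softmax limit; alpha=2 is sparsemax."""
    z = np.asarray(z, dtype=float)
    if alpha == 1.0:  # softmax limit, handled in closed form
        e = np.exp(z - z.max())
        return e / e.sum()
    zs = (alpha - 1.0) * z
    # tau is bracketed: at max(zs)-1 the sum is >= 1, at max(zs) it is 0
    lo, hi = zs.max() - 1.0, zs.max()
    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        p = np.clip(zs - tau, 0.0, None) ** (1.0 / (alpha - 1.0))
        if p.sum() >= 1.0:
            lo = tau  # tau too small -> total mass too large
        else:
            hi = tau
    p = np.clip(zs - 0.5 * (lo + hi), 0.0, None) ** (1.0 / (alpha - 1.0))
    return p / p.sum()  # normalize away residual bisection error
```

With α = 1.5, widely separated scores produce exact zeros, which is why the entmax variants yield sparser attention maps than softmax despite the same O(N²) score computation.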
```python
# ISAB variant (recommended for large N)
from nanoTabPFN.model_isab import ISABTabPFNModel

model_isab = ISABTabPFNModel(
    embedding_size=128,
    num_attention_heads=8,
    mlp_hidden_size=512,
    num_layers=4,
    num_outputs=3,
    num_inducing_points=32,  # M — controls accuracy/speed trade-off
)

# Sparse variant
from nanoTabPFN.model_sparse import SparseTabPFNModel

model_sparse = SparseTabPFNModel(
    embedding_size=128,
    num_attention_heads=8,
    mlp_hidden_size=512,
    num_layers=4,
    num_outputs=3,
    k_neighbors=16,  # k — number of attended neighbors per query
)

# α-Entmax, fixed alpha (same for all heads)
from nanoTabPFN.model_splash import SplashTabPFNModel

model_splash = SplashTabPFNModel(
    embedding_size=128,
    num_attention_heads=8,
    mlp_hidden_size=512,
    num_layers=4,
    num_outputs=3,
    alpha=1.5,       # fixed entmax exponent: 1→softmax, 1.5→mild sparse, 2→sparsemax
    entmax_iters=5,  # Halley-bisection iterations for τ root-finding
)

# α-Entmax, one learned α per attention head (nn.Parameter, trained with the model)
from nanoTabPFN.model_splash_learned import LearnedSplashTabPFNModel

model_splash_learned = LearnedSplashTabPFNModel(
    embedding_size=128,
    num_attention_heads=8,
    mlp_hidden_size=512,
    num_layers=4,
    num_outputs=3,
    init_alpha=1.5,  # starting value; each head's α is optimised independently
)

# After training, inspect which heads converged to dense vs. sparse:
print(model_splash_learned.get_learned_alphas())
# {'layer_0': {'feature': tensor([1.43, 1.78, ...]), 'datapoint': tensor([...])}, ...}
```

All variants use the same `train()` function from `train.py` — just swap the model.
```python
# In benchmark.ipynb or src/eval/benchmark.py
from src.eval.benchmark import run_performance_eval, measure_scalability
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

models_dict = {
    "Classic": model_classic,
    "ISAB": model_isab,
    "Sparse": model_sparse,
    "Entmax": model_splash,
}

# Performance on real datasets
df_perf = run_performance_eval(models_dict, device)
print(df_perf.groupby(["Dataset", "Model"])[["Accuracy", "AUC"]].mean())

# Scalability (time + peak RAM vs N)
df_scale = measure_scalability(models_dict, device, n_ranges=[100, 250, 500, 1000, 1500, 2000])
```

`src/eval/metrics.py` provides a clean, torch-compatible interface:
```python
from src.eval.metrics import compute_binary_metrics

metrics = compute_binary_metrics(y_true, prob_pos=prob[:, 1])
# Returns dict: {"accuracy", "f1", "auc", "logloss"}
```

| Method | k=25 Acc | k=25 AUC | k=150 Acc | k=150 AUC | Complexity |
|---|---|---|---|---|---|
| Random Forest | 0.66 | 0.72 | 0.75 | 0.84 | — |
| XGBoost | 0.66 | 0.74 | 0.78 | 0.86 | — |
| ClassicPFN | 0.66 | 0.74 | 0.78 | 0.86 | O(N²) |
| ISABPFN | 0.68 | 0.74 | 0.77 | 0.85 | O(N·M) |
| SparsePFN | 0.70 | 0.77 | 0.76 | 0.85 | O(N·k) |
| EntmaxPFN | 0.68 | 0.74 | 0.82 | 0.87 | O(N²) |
| RetrievalPFN | 0.84 | 0.92 | 0.88 | 0.94 | — |
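For reference, the `compute_binary_metrics` helper used above can be sketched with sklearn backends; this is a plausible minimal version, not necessarily the repo's exact implementation (which may, e.g., also accept torch tensors):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss

def compute_binary_metrics_sketch(y_true, prob_pos, threshold=0.5):
    """Binary classification metrics from positive-class probabilities.
    Hard predictions use a 0.5 threshold; AUC/logloss use raw probabilities."""
    y_true = np.asarray(y_true)
    prob_pos = np.asarray(prob_pos)
    y_pred = (prob_pos >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "auc": roc_auc_score(y_true, prob_pos),
        "logloss": log_loss(y_true, prob_pos),
    }

m = compute_binary_metrics_sketch([0, 1, 0, 1], [0.1, 0.9, 0.2, 0.8])
```

AUC and log-loss are threshold-free, which is why they are reported alongside accuracy in the table above.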
```bash
# Compile with tectonic (recommended — no LaTeX installation required)
cd Rendu/poster && tectonic poster.tex   # → poster.pdf
cd ../report && tectonic report.tex      # → report.pdf

# Or with a standard LaTeX installation
pdflatex poster.tex
pdflatex report.tex
```

- Hollmann et al. (2023). TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second. ICLR 2023.
- Pfefferle et al. (2025). nanoTabPFN: A Lightweight Reimplementation of TabPFN. arXiv:2511.03634.
- Rubin, Herzig & Berant (2022). Learning To Retrieve Prompts for In-Context Learning. NAACL 2022.
- Lee et al. (2019). Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks. ICML 2019.
- Gonçalves, Treviso & Martins (2025). AdaSplash: Adaptive Sparse Flash Attention. ICML 2025 (Oral).