Authors: Yentl Collin & Clément Dureuil
Context: Deep Learning Semester Project
This project investigates two complementary strategies for improving Tabular Foundation Models (TFMs) based on nanoTabPFN:
| Contribution | Goal | Method | Gain |
|---|---|---|---|
| Part I – Intelligent Context Sampling | Better use of the fixed context budget | GBDT-based learned retrieval | ≈4× context compression, +0.18 AUC |
| Part II – Efficient Attention | Break the O(N²) memory/time bottleneck | ISAB / Sparse / α-Entmax | Linear scaling, +0.04 Acc at k=150 |
The benchmark dataset is Bank Marketing (OpenML id=1461), a binary classification task with ~10k balanced samples.
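The raw Bank Marketing dump is heavily imbalanced toward the negative class, so a balancing step (as in `src/data/openml_bank_marketing.py`) is needed to reach the ~10k balanced samples used here. A minimal sketch of majority-class downsampling on toy labels; the function name, the 9:1 toy ratio, and the exact procedure are illustrative, not the repo's actual implementation:

```python
import numpy as np
import pandas as pd

def balance_classes(y: pd.Series, seed: int = 0) -> pd.Index:
    """Return an index selecting an equal number of rows per class,
    by downsampling every class to the size of the rarest one."""
    rng = np.random.default_rng(seed)
    counts = y.value_counts()
    n_min = counts.min()
    keep = []
    for cls in counts.index:
        cls_idx = y.index[y == cls].to_numpy()
        keep.append(rng.choice(cls_idx, size=n_min, replace=False))
    return pd.Index(np.concatenate(keep))

# Toy usage: a strongly imbalanced binary label vector
y_toy = pd.Series(np.repeat([0, 1], [900, 100]))
y_bal = y_toy.loc[balance_classes(y_toy)]
```

After balancing, each class contributes the same number of rows, so accuracy is directly comparable across methods.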
```
Projet_TabPFN_yentl/
│
├── nanoTabPFN/                    # Core model implementations
│   ├── model.py                   # Classic NanoTabPFN (baseline, O(N²) attention)
│   ├── model_isab.py              # ISAB variant: O(N·M) row attention (inducing points)
│   ├── model_sparse.py            # Top-K sparse row attention: O(N·k)
│   ├── model_splash.py            # α-Entmax attention with FIXED alpha (hyperparameter)
│   ├── model_splash_learned.py    # α-Entmax attention with LEARNED α per head (nn.Parameter)
│   ├── train.py                   # Training loop + PriorDumpDataLoader (HDF5)
│   ├── experiment.ipynb           # Reproduce original nanoTabPFN paper results
│   └── README.md                  # Original nanoTabPFN README
│
├── TabICL/                        # In-Context Learning evaluation framework
│   ├── sampling.py                # Random and k-NN context sampling (baselines)
│   ├── retrival_1.py              # LearnedContextSelector: GBDT-based utility predictor
│   ├── eval.py                    # TabICLEvaluator: runs and plots benchmark results
│   └── main copy.ipynb            # Interactive notebook for ICS experiments
│
├── src/
│   ├── data/
│   │   └── openml_bank_marketing.py   # Data loading, balancing, feature selection
│   └── eval/
│       ├── benchmark.py           # Scalability & performance benchmarks
│       └── metrics.py             # Binary classification metrics (Acc, AUC, F1, LogLoss)
│
├── Rendu/
│   ├── poster/
│   │   ├── poster.tex             # Beamer poster (Gemini theme)
│   │   └── image/                 # Figures: tabpfn.png, learned_retrieval.png,
│   │                              #          isab.png, inference_time.png
│   └── report/
│       └── report.tex             # 5-page academic report
│
├── main.ipynb                     # Main experiment notebook
├── benchmark.ipynb                # Scalability benchmark notebook
├── pre_training.ipynb             # Pre-training experiments
├── requirements.txt               # Python dependencies
└── README.md                      # This file
```
```bash
pip install -r requirements.txt

# Additional for Part I (learned retrieval):
pip install tabicl

# Additional for training nanoTabPFN:
pip install schedulefree h5py
```

`requirements.txt` includes:

- `numpy`, `scipy`, `pandas` — scientific stack
- `scikit-learn`, `openml` — datasets and ML baselines
- `torch>=2.0` — deep learning (CPU / Apple Silicon / CUDA compatible)
- `matplotlib`, `seaborn` — visualization
- `tqdm`, `psutil` — progress and memory tracking
```bash
cd nanoTabPFN
curl http://ml.informatik.uni-freiburg.de/research-artifacts/nanoTabPFN/300k_150x5_2.h5 \
    --output 300k_150x5_2.h5
```

This HDF5 file contains 300k synthetic tabular datasets (each 150 rows × 5 features), generated from Gaussian Process / Bayesian Network priors.
```python
from nanoTabPFN.model import NanoTabPFNModel, NanoTabPFNClassifier
from nanoTabPFN.train import PriorDumpDataLoader, train, get_default_device

device = get_default_device()  # auto-detects CUDA > MPS > CPU

model = NanoTabPFNModel(
    embedding_size=128,
    num_attention_heads=8,
    mlp_hidden_size=512,
    num_layers=4,
    num_outputs=3,
)

prior = PriorDumpDataLoader("nanoTabPFN/300k_150x5_2.h5", num_steps=2500, batch_size=32)
model, history = train(model, prior, lr=4e-3, device=device, steps_per_eval=25)
```

Pre-training takes ~5–10 minutes on a GPU (A100 / M-series Apple Silicon).
```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

clf = NanoTabPFNClassifier(model, device)
clf.fit(X_train, y_train)
prob = clf.predict_proba(X_test)
print("AUC:", roc_auc_score(y_test, prob[:, 1]))
print("Acc:", accuracy_score(y_test, prob.argmax(axis=1)))
```

When the training pool exceeds the model's fixed context budget, choosing which examples to place in context becomes critical; Part I addresses this with learned retrieval.
```python
from TabICL.retrival_1 import LearnedContextSelector
from src.data.openml_bank_marketing import load_data, process_data
from sklearn.model_selection import train_test_split

# 1. Load and preprocess Bank Marketing
X, y = load_data()
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_train, X_test = process_data(X_train_raw, X_test_raw, y_train, top_k_features=5, method="mi")

# 2. Train the GBDT selector (offline mining phase — takes a few minutes)
selector = LearnedContextSelector(n_candidates_mining=20)
selector.fit(X_train, y_train, n_training_queries=50)

# 3. At inference time: retrieve k "golden" examples for any query
query = X_test.iloc[0]
X_ctx, y_ctx = selector.select(query, k=16, X_pool=X_train, y_pool=y_train)
```

```python
from TabICL.eval import TabICLEvaluator
from TabICL.sampling import get_random_context, get_knn_context

evaluator = TabICLEvaluator(X_train, y_train, X_test, y_test, n_eval=200)
evaluator.run("Random", get_random_context, context_sizes=[4, 8, 16, 32, 64])
evaluator.run("KNN", get_knn_context, context_sizes=[4, 8, 16, 32, 64])
evaluator.run("Learned", selector.select, context_sizes=[4, 8, 16, 32, 64])
evaluator.plot_results()  # Plots AUC, Accuracy, Log-Loss vs context size
```

All variants share the same sklearn-like interface and training pipeline as the classic model.
| File | Model class | Classifier class | Row-attention | Complexity |
|---|---|---|---|---|
| `model.py` | `NanoTabPFNModel` | `NanoTabPFNClassifier` | Standard softmax | O(N²) |
| `model_isab.py` | `ISABTabPFNModel` | `ISABTabPFNClassifier` | Induced Set (M=32) | O(N·M) |
| `model_sparse.py` | `SparseTabPFNModel` | `SparseTabPFNClassifier` | Top-K mask (k=16) | O(N·k) |
| `model_splash.py` | `SplashTabPFNModel` | `SplashTabPFNClassifier` | α-entmax, fixed α | O(N²)* |
| `model_splash_learned.py` | `LearnedSplashTabPFNModel` | `LearnedSplashTabPFNClassifier` | α-entmax, learned α/head | O(N²)* |

\* Same asymptotic complexity as Classic, but sparser and less noisy attention distributions. Memory savings require a custom sparse CUDA kernel (not included).
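α-entmax generalizes softmax: it finds a threshold τ such that p_i = [(α−1)·z_i − τ]₊^{1/(α−1)} sums to one, so low scores get exactly zero weight when α > 1. A minimal NumPy sketch using plain bisection for the τ root-finding (the repo's models use a Halley-bisection hybrid; this only illustrates the mechanism):

```python
import numpy as np

def alpha_entmax(z, alpha=1.5, n_iter=50):
    """alpha-entmax via bisection: find tau so that
    sum_i [ (alpha-1)*z_i - tau ]_+ ** (1/(alpha-1)) == 1.
    alpha=1 is the softmax limit; alpha=2 is sparsemax."""
    z = np.asarray(z, dtype=float)
    if alpha == 1.0:  # softmax limit, handled in closed form
        e = np.exp(z - z.max())
        return e / e.sum()
    zs = (alpha - 1.0) * z
    # tau is bracketed: at max(zs)-1 the sum is >= 1, at max(zs) it is 0
    lo, hi = zs.max() - 1.0, zs.max()
    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        p = np.clip(zs - tau, 0.0, None) ** (1.0 / (alpha - 1.0))
        if p.sum() >= 1.0:
            lo = tau  # tau too small -> total mass too large
        else:
            hi = tau
    p = np.clip(zs - 0.5 * (lo + hi), 0.0, None) ** (1.0 / (alpha - 1.0))
    return p / p.sum()  # normalize away residual bisection error
```

With α = 1.5, widely separated scores produce exact zeros, which is why the entmax variants yield sparser attention maps than softmax despite the same O(N²) score computation.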
```python
# ISAB variant (recommended for large N)
from nanoTabPFN.model_isab import ISABTabPFNModel

model_isab = ISABTabPFNModel(
    embedding_size=128,
    num_attention_heads=8,
    mlp_hidden_size=512,
    num_layers=4,
    num_outputs=3,
    num_inducing_points=32,  # M — controls accuracy/speed trade-off
)

# Sparse variant
from nanoTabPFN.model_sparse import SparseTabPFNModel

model_sparse = SparseTabPFNModel(
    embedding_size=128,
    num_attention_heads=8,
    mlp_hidden_size=512,
    num_layers=4,
    num_outputs=3,
    k_neighbors=16,  # k — number of attended neighbors per query
)

# α-Entmax, fixed alpha (same for all heads)
from nanoTabPFN.model_splash import SplashTabPFNModel

model_splash = SplashTabPFNModel(
    embedding_size=128,
    num_attention_heads=8,
    mlp_hidden_size=512,
    num_layers=4,
    num_outputs=3,
    alpha=1.5,       # fixed entmax exponent: 1→softmax, 1.5→mild sparse, 2→sparsemax
    entmax_iters=5,  # Halley-bisection iterations for τ root-finding
)

# α-Entmax, one learned α per attention head (nn.Parameter, trained with the model)
from nanoTabPFN.model_splash_learned import LearnedSplashTabPFNModel

model_splash_learned = LearnedSplashTabPFNModel(
    embedding_size=128,
    num_attention_heads=8,
    mlp_hidden_size=512,
    num_layers=4,
    num_outputs=3,
    init_alpha=1.5,  # starting value; each head's α is optimised independently
)

# After training, inspect which heads converged to dense vs. sparse:
print(model_splash_learned.get_learned_alphas())
# {'layer_0': {'feature': tensor([1.43, 1.78, ...]), 'datapoint': tensor([...])}, ...}
```

All variants use the same `train()` function from `train.py` — just swap the model.
```python
# In benchmark.ipynb or src/eval/benchmark.py
from src.eval.benchmark import run_performance_eval, measure_scalability
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

models_dict = {
    "Classic": model_classic,
    "ISAB": model_isab,
    "Sparse": model_sparse,
    "Entmax": model_splash,
}

# Performance on real datasets
df_perf = run_performance_eval(models_dict, device)
print(df_perf.groupby(["Dataset", "Model"])[["Accuracy", "AUC"]].mean())

# Scalability (time + peak RAM vs N)
df_scale = measure_scalability(models_dict, device, n_ranges=[100, 250, 500, 1000, 1500, 2000])
```

`src/eval/metrics.py` provides a clean, torch-compatible interface:
```python
from src.eval.metrics import compute_binary_metrics

metrics = compute_binary_metrics(y_true, prob_pos=prob[:, 1])
# Returns dict: {"accuracy", "f1", "auc", "logloss"}
```

| Method | k=25 Acc | k=25 AUC | k=150 Acc | k=150 AUC | Complexity |
|---|---|---|---|---|---|
| Random Forest | 0.66 | 0.72 | 0.75 | 0.84 | — |
| XGBoost | 0.66 | 0.74 | 0.78 | 0.86 | — |
| ClassicPFN | 0.66 | 0.74 | 0.78 | 0.86 | O(N²) |
| ISABPFN | 0.68 | 0.74 | 0.77 | 0.85 | O(N·M) |
| SparsePFN | 0.70 | 0.77 | 0.76 | 0.85 | O(N·k) |
| EntmaxPFN | 0.68 | 0.74 | 0.82 | 0.87 | O(N²) |
| RetrievalPFN | 0.84 | 0.92 | 0.88 | 0.94 | — |
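For reference, the `compute_binary_metrics` helper used above can be sketched with sklearn backends; this is a plausible minimal version, not necessarily the repo's exact implementation (which may, e.g., also accept torch tensors):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss

def compute_binary_metrics_sketch(y_true, prob_pos, threshold=0.5):
    """Binary classification metrics from positive-class probabilities.
    Hard predictions use a 0.5 threshold; AUC/logloss use raw probabilities."""
    y_true = np.asarray(y_true)
    prob_pos = np.asarray(prob_pos)
    y_pred = (prob_pos >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "auc": roc_auc_score(y_true, prob_pos),
        "logloss": log_loss(y_true, prob_pos),
    }

m = compute_binary_metrics_sketch([0, 1, 0, 1], [0.1, 0.9, 0.2, 0.8])
```

AUC and log-loss are threshold-free, which is why they are reported alongside accuracy in the table above.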
```bash
# Compile with tectonic (recommended — no LaTeX installation required)
cd Rendu/poster && tectonic poster.tex   # → poster.pdf
cd ../report && tectonic report.tex      # → report.pdf

# Or with a standard LaTeX installation
pdflatex poster.tex
pdflatex report.tex
```

- Hollmann et al. (2023). TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second. ICLR 2023.
- Pfefferle et al. (2025). nanoTabPFN: A Lightweight Reimplementation of TabPFN. arXiv:2511.03634.
- Rubin, Herzig & Berant (2022). Learning To Retrieve Prompts for In-Context Learning. NAACL 2022.
- Lee et al. (2019). Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks. ICML 2019.
- Gonçalves, Treviso & Martins (2025). AdaSplash: Adaptive Sparse Flash Attention. ICML 2025 (Oral).