Overcoming tabular data sparsity and cold-start problems with Flow Matching-based generative models

Many real‑world machine learning tasks rely on tabular data, yet data sparsity remains a persistent challenge, especially in early‑stage or seasonal applications where collecting sufficient observations can take years. Synthetic data generation can mitigate these limitations by augmenting scarce training data. We evaluate synthetic tabular augmentation using ForestFlow, a computationally efficient Flow Matching-based generative model, across 26 public datasets and six real-world horticultural small- and medium-sized enterprise datasets. We examine (i) whether hyperparameter tuning of ForestFlow’s underlying XGBoost model improves synthetic data quality, and (ii) how different synthetic‑to‑real ratios affect downstream regression and classification. Hyperparameter optimization yields only marginal gains in quality and no meaningful improvement in predictive performance, indicating that ForestFlow’s default settings are sufficient. In contrast, synthetic augmentation substantially boosts performance in low‑data regimes, enabling models to match or exceed baselines with fewer years of observations. All experiments run on standard CPUs, highlighting the method’s practical applicability.
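As a rough illustration of what a synthetic-to-real ratio means in practice, the sketch below appends `round(ratio * n_real)` generated rows to a small real training set. It is not the code in `pipeline.py`. The names `augment_with_synthetic` and `jitter_generator` are hypothetical, and the generator is a simple noise-based placeholder standing in for a trained ForestFlow model, not ForestFlow itself.

```python
import random


def augment_with_synthetic(real_rows, generate_synthetic, ratio, seed=0):
    """Return the real rows followed by round(ratio * len(real_rows)) synthetic rows.

    `generate_synthetic` is any callable (real_rows, n, rng) -> list of rows.
    It is a hypothetical hook; a trained generative model would be plugged in here.
    """
    if ratio < 0:
        raise ValueError("ratio must be non-negative")
    n_synthetic = round(ratio * len(real_rows))
    rng = random.Random(seed)  # fixed seed so repeated runs are comparable
    synthetic_rows = generate_synthetic(real_rows, n_synthetic, rng)
    return list(real_rows) + list(synthetic_rows)


def jitter_generator(real_rows, n, rng, scale=0.05):
    """Placeholder generator (an assumption, NOT ForestFlow):
    resample real rows with replacement and add small Gaussian noise."""
    rows = []
    for _ in range(n):
        base = rng.choice(real_rows)
        rows.append([value + rng.gauss(0.0, scale) for value in base])
    return rows


# Toy "real" data: four rows with two numeric features each.
real = [[1.0, 2.0], [1.5, 1.8], [0.9, 2.2], [1.2, 2.1]]

# Sweep a few synthetic-to-real ratios; ratio 0.0 is the real-only baseline.
for ratio in (0.0, 0.5, 1.0, 2.0):
    train = augment_with_synthetic(real, jitter_generator, ratio)
    print(f"ratio={ratio}: {len(train)} training rows")
```

In an actual experiment, the downstream model would be trained on each augmented set and evaluated on held-out real data only, so that the effect of the ratio is measured against real observations.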

Contributors

This pipeline is developed and maintained by members of the Bioinformatics lab led by Prof. Dr. Dominik Grimm:

Josef Eiglsperger, M.Sc.

Citation

Overcoming Tabular Data Sparsity and Cold-Start Problems with Flow Matching-Based Generative Models.
J Eiglsperger, GH Vu, F Haselbeck, DG Grimm.
Currently under review, 2026.

