grimmlab/FrostBreakerFlow

Overcoming tabular data sparsity and cold-start problems with Flow Matching-based generative models

Many real‑world machine learning tasks rely on tabular data, yet data sparsity remains a persistent challenge, especially in early‑stage or seasonal applications where collecting sufficient observations can take years. Synthetic data generation can mitigate these limitations by augmenting scarce training data. We evaluate synthetic tabular augmentation using ForestFlow, a computationally efficient Flow Matching-based generative model, across 26 public datasets and six real-world horticultural small- and medium-sized enterprise datasets. We examine (i) whether hyperparameter tuning of ForestFlow’s underlying XGBoost model improves synthetic data quality, and (ii) how different synthetic‑to‑real ratios affect downstream regression and classification. Hyperparameter optimization yields only marginal gains in quality and no meaningful improvement in predictive performance, indicating that ForestFlow’s default settings are sufficient. In contrast, synthetic augmentation substantially boosts performance in low‑data regimes, enabling models to match or exceed baselines with fewer years of observations. All experiments run on standard CPUs, highlighting the method’s practical applicability.
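To make the notion of a synthetic-to-real ratio concrete, here is a minimal sketch of how a real training set might be augmented at a chosen ratio. It is not the repository's code: `placeholder_sampler` and `augment` are hypothetical names, and the sampler is a simple bootstrap-plus-noise stand-in used only so the example runs without extra dependencies. In practice a trained ForestFlow model would take its place.

```python
import numpy as np


def placeholder_sampler(data, n_samples, rng):
    """Stand-in generator (NOT ForestFlow): resample rows and add small Gaussian noise."""
    idx = rng.integers(0, data.shape[0], size=n_samples)
    scale = 0.05 * data.std(axis=0)  # per-column noise scale
    noise = rng.normal(0.0, scale, size=(n_samples, data.shape[1]))
    return data[idx] + noise


def augment(X_real, y_real, ratio, sampler=placeholder_sampler, seed=0):
    """Append round(ratio * n_real) synthetic rows to the real training data.

    Features and target are sampled jointly so that each synthetic row
    carries its own target value.
    """
    rng = np.random.default_rng(seed)
    data = np.column_stack([X_real, y_real])
    n_synthetic = int(round(ratio * data.shape[0]))
    if n_synthetic == 0:
        return X_real, y_real
    synthetic = sampler(data, n_synthetic, rng)
    combined = np.vstack([data, synthetic])
    return combined[:, :-1], combined[:, -1]


# Toy regression data: 20 real rows, 3 features (assumed for illustration).
rng = np.random.default_rng(42)
X_real = rng.normal(size=(20, 3))
y_real = X_real @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=20)

# A ratio of 2.0 adds two synthetic rows per real row.
X_aug, y_aug = augment(X_real, y_real, ratio=2.0)
print(X_aug.shape, y_aug.shape)
```

The augmented arrays can then be passed to any downstream regressor or classifier. Only the training split should be augmented, with evaluation done on held-out real data.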

Contributors

This pipeline is developed and maintained by members of the Bioinformatics lab led by Prof. Dr. Dominik Grimm.

Citation

Overcoming Tabular Data Sparsity and Cold-Start Problems with Flow Matching-Based Generative Models.
J Eiglsperger, GH Vu, F Haselbeck, DG Grimm.
Currently under review, 2026.
