Many real‑world machine learning tasks rely on tabular data, yet data sparsity remains a persistent challenge, especially in early‑stage or seasonal applications where collecting sufficient observations can take years. Synthetic data generation can mitigate these limitations by augmenting scarce training data. We evaluate synthetic tabular augmentation using ForestFlow, a computationally efficient Flow Matching-based generative model, across 26 public datasets and six real-world horticultural small- and medium-sized enterprise datasets. We examine (i) whether hyperparameter tuning of ForestFlow’s underlying XGBoost model improves synthetic data quality, and (ii) how different synthetic‑to‑real ratios affect downstream regression and classification. Hyperparameter optimization yields only marginal gains in quality and no meaningful improvement in predictive performance, indicating that ForestFlow’s default settings are sufficient. In contrast, synthetic augmentation substantially boosts performance in low‑data regimes, enabling models to match or exceed baselines with fewer years of observations. All experiments run on standard CPUs, highlighting the method’s practical applicability.
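To make the notion of a synthetic‑to‑real ratio concrete, the following is a minimal sketch of how synthetic rows could be appended to a small real training set at a chosen ratio. The function names `augment_with_ratio` and `jitter_sampler` are hypothetical and not part of this pipeline, and the jitter-based sampler is only a stand-in for a trained generative model such as ForestFlow; it does not reproduce the method evaluated here.

```python
import numpy as np


def augment_with_ratio(X_real, y_real, sample_synthetic, ratio, seed=0):
    """Append round(ratio * n_real) synthetic rows to the real training set.

    `sample_synthetic(n, seed)` is an assumed callable returning (X, y) with n rows.
    """
    n_synth = int(round(ratio * len(X_real)))
    if n_synth == 0:
        # A ratio of 0 corresponds to the real-data-only baseline.
        return X_real.copy(), y_real.copy()
    X_syn, y_syn = sample_synthetic(n_synth, seed)
    return np.vstack([X_real, X_syn]), np.concatenate([y_real, y_syn])


def jitter_sampler(X_real, y_real, noise=0.05):
    """Hypothetical placeholder generator: bootstrap real rows and add Gaussian jitter.

    This is NOT ForestFlow; it only stands in for a trained generative model.
    """
    def sample(n, seed):
        rng = np.random.default_rng(seed)
        idx = rng.integers(0, len(X_real), size=n)
        scale = noise * X_real.std(axis=0)
        X = X_real[idx] + scale * rng.standard_normal((n, X_real.shape[1]))
        return X, y_real[idx]
    return sample


# Toy data: 20 real observations with 3 features.
rng = np.random.default_rng(42)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)

# A ratio of 2.0 adds two synthetic rows per real row.
X_aug, y_aug = augment_with_ratio(X, y, jitter_sampler(X, y), ratio=2.0)
print(X_aug.shape)  # (60, 3)
```

In a real experiment, the sampler would be replaced by draws from the fitted generative model, and the downstream regressor or classifier would be trained on the augmented set and evaluated on held-out real data only.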
This pipeline is developed and maintained by members of the Bioinformatics lab led by Prof. Dr. Dominik Grimm. It accompanies the following publication:
Overcoming Tabular Data Sparsity and Cold-Start Problems with Flow Matching-Based Generative Models.
J Eiglsperger, GH Vu, F Haselbeck, DG Grimm.
Currently under review, 2026.
