-Our experimental results were obtained using Qwen and Gemini models as the backbone for our LLM ensembles. Both models were accessed via an internal API system at Inter that routed requests to the respective LLM providers. Many commercial LLM providers do not guarantee deterministic outputs even when random seeds are provided. As a result, **exact numerical reproduction of our paper results is not guaranteed**, even when using the same configuration files and seeds. Despite these limitations, our ablation studies demonstrate that CodeEvolve consistently achieves **state-of-the-art results across multiple seeds and experimental runs** on all considered benchmarks. The core algorithmic contributions remain robust to LLM stochasticity.
0 commit comments