| 1 | +# Blosc2 vs DuckDB Indexes |
| 2 | + |
| 3 | +This note summarizes the benchmark comparisons we ran between Blosc2 indexes and DuckDB indexing/pruning |
| 4 | +mechanisms on a 10M-row structured dataset. |
| 5 | + |
| 6 | +The goal is not to claim a universal winner, but to document the tradeoffs observed so far in: |
| 7 | + |
| 8 | +- index creation time |
| 9 | +- lookup latency |
| 10 | +- total storage footprint |
| 11 | +- sensitivity to query shape |
| 12 | + |
| 13 | + |
| 14 | +## Benchmark Setup |
| 15 | + |
| 16 | +### Dataset |
| 17 | + |
| 18 | +- Rows: `10,000,000` |
| 19 | +- Schema: |
| 20 | + - `id`: indexed field, `float64` |
| 21 | + - `payload`: deterministic nontrivial ramp payload |
| 22 | +- Distribution: `random` |
| 23 | + - true random shuffle of `id` |
| 24 | +- Query widths tested: |
| 25 | + - `50` |
| 26 | + - `1` |
| 27 | + |
| 28 | +### Blosc2 |
| 29 | + |
| 30 | +- Script: `index_query_bench.py` |
| 31 | +- Index kinds: |
| 32 | + - `ultralight` |
| 33 | + - `light` |
| 34 | + - `medium` |
| 35 | + - `full` |
| 36 | +- Default geometry in these runs: |
| 37 | + - `chunks=1,250,000` |
| 38 | + - `blocks=10,000` |
| 39 | + |
| 40 | +### DuckDB |
| 41 | + |
| 42 | +- Script: `duckdb_query_bench.py` |
| 43 | +- Layouts: |
| 44 | + - `zonemap` |
| 45 | + - `art-index` |
| 46 | +- Batch size used while loading: |
| 47 | + - `1,250,000` |
| 48 | + |
| 49 | + |
| 50 | +## Important Context |
| 51 | + |
| 52 | +Two DuckDB predicate shapes behave very differently in these runs: |
| 53 | + |
| 54 | +- range form: |
| 55 | + - `id >= lo AND id <= hi` |
| 56 | +- single-value form: |
| 57 | + - `id = value` |
| 58 | + |
| 59 | +For Blosc2, switching between a collapsed width-1 range and `==` makes almost no practical difference (for example, `full` cold lookup: `0.618 ms` vs `0.603 ms`). |
| 60 | + |
| 61 | +For DuckDB, the difference is large: |
| 62 | + |
| 63 | +- `art-index` took `13.641 ms` with the width-1 range form |
| 64 | +- `art-index` took `0.755 ms` with the single-value `=` predicate, about `18x` faster |
| 65 | + |
| 66 | +So any DuckDB comparison must state which predicate shape was used. |
| 67 | + |
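To make the two predicate shapes concrete, here is a minimal sketch of how a benchmark driver might build them. The helper `build_predicate` is hypothetical and is not taken from `duckdb_query_bench.py`; it assumes that a width-1 range collapses to `lo == hi` and that the query uses `?` placeholders.

```python
from typing import Tuple


def build_predicate(lo: float, hi: float, single_value: bool = False) -> Tuple[str, tuple]:
    """Return a parameterized WHERE clause on `id` plus its parameters.

    Hypothetical helper: it only mirrors the two shapes described above.
    """
    if single_value:
        # The equality form is only meaningful for a collapsed range.
        if lo != hi:
            raise ValueError("single-value form needs lo == hi")
        return "id = ?", (lo,)
    # Generic range form, also used for width-1 queries by default.
    return "id >= ? AND id <= ?", (lo, hi)


range_sql, range_params = build_predicate(42.0, 42.0)
point_sql, point_params = build_predicate(42.0, 42.0, single_value=True)
print(range_sql, range_params)   # id >= ? AND id <= ? (42.0, 42.0)
print(point_sql, point_params)   # id = ? (42.0,)
```

Both clauses select the same rows when `lo == hi`; the difference reported in this note lies in how the engine plans them.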
| 68 | + |
| 69 | +## Width-50 Comparison |
| 70 | + |
| 71 | +### DuckDB |
| 72 | + |
| 73 | +Command: |
| 74 | + |
| 75 | +```bash |
| 76 | +python duckdb_query_bench.py \ |
| 77 | + --size 10M \ |
| 78 | + --outdir /tmp/duckdb-bench-smoke2 \ |
| 79 | + --dist random \ |
| 80 | + --query-width 50 \ |
| 81 | + --layout all \ |
| 82 | + --repeats 1 |
| 83 | +``` |
| 84 | + |
| 85 | +Observed results: |
| 86 | + |
| 87 | +- `zonemap` |
| 88 | + - build: `1180.630 ms` |
| 89 | + - filtered lookup: `13.326 ms` |
| 90 | + - DB size: `56,111,104` bytes |
| 91 | +- `art-index` |
| 92 | + - build: `2844.010 ms` |
| 93 | + - filtered lookup: `12.419 ms` |
| 94 | + - DB size: `478,687,232` bytes |
| 95 | + |
| 96 | +### Blosc2 |
| 97 | + |
| 98 | +Command: |
| 99 | + |
| 100 | +```bash |
| 101 | +python index_query_bench.py \ |
| 102 | + --size 10M \ |
| 103 | + --outdir /tmp/indexes-10M \ |
| 104 | + --kind light \ |
| 105 | + --query-width 50 \ |
| 106 | + --in-mem \ |
| 107 | + --dist random |
| 108 | +``` |
| 109 | + |
| 110 | +Observed `light` results: |
| 111 | + |
| 112 | +- build: `705.193 ms` |
| 113 | +- cold lookup: `6.370 ms` |
| 114 | +- warm lookup: `6.250 ms` |
| 115 | +- base array size: about `31 MB` |
| 116 | +- `light` index sidecars: about `27 MB` |
| 117 | +- total footprint: about `58 MB` |
| 118 | + |
| 119 | +### Interpretation |
| 120 | + |
| 121 | +For this moderately selective random workload: |
| 122 | + |
| 123 | +- Blosc2 `light` is about `2x` faster than DuckDB `zonemap` (`6.37 ms` vs `13.33 ms`) |
| 124 | +- Blosc2 `light` has a total footprint similar to DuckDB `zonemap` (about `58 MB` vs `56 MB`) |
| 125 | +- DuckDB `art-index` is only slightly faster than `zonemap` here (`12.42 ms` vs `13.33 ms`), but its DB file is about `8.5x` larger |
| 126 | + |
| 127 | +This suggests that Blosc2 `light` is more than a simple zonemap. It behaves like an active lossy lookup |
| 128 | +structure rather than only coarse pruning metadata. |
| 129 | + |
| 130 | + |
| 131 | +## Width-1 Comparison: Generic Range Form |
| 132 | + |
| 133 | +### DuckDB |
| 134 | + |
| 135 | +Command: |
| 136 | + |
| 137 | +```bash |
| 138 | +python duckdb_query_bench.py \ |
| 139 | + --size 10M \ |
| 140 | + --outdir /tmp/duckdb-bench-smoke2 \ |
| 141 | + --dist random \ |
| 142 | + --query-width 1 \ |
| 143 | + --layout all \ |
| 144 | + --repeats 3 |
| 145 | +``` |
| 146 | + |
| 147 | +Observed results: |
| 148 | + |
| 149 | +- `zonemap` |
| 150 | + - filtered lookup: `12.612 ms` |
| 151 | +- `art-index` |
| 152 | + - filtered lookup: `13.641 ms` |
| 153 | + |
| 154 | +### Blosc2 |
| 155 | + |
| 156 | +Command: |
| 157 | + |
| 158 | +```bash |
| 159 | +python index_query_bench.py \ |
| 160 | + --size 10M \ |
| 161 | + --outdir /tmp/indexes-10M \ |
| 162 | + --kind all \ |
| 163 | + --query-width 1 \ |
| 164 | + --dist random |
| 165 | +``` |
| 166 | + |
| 167 | +Observed results: |
| 168 | + |
| 169 | +- `light` |
| 170 | + - cold lookup: `1.463 ms` |
| 171 | + - warm lookup: `1.286 ms` |
| 172 | +- `medium` |
| 173 | + - cold lookup: `1.089 ms` |
| 174 | + - warm lookup: `0.986 ms` |
| 175 | +- `full` |
| 176 | + - cold lookup: `0.618 ms` |
| 177 | + - warm lookup: `0.544 ms` |
| 178 | + |
| 179 | +### Interpretation |
| 180 | + |
| 181 | +With the generic range form, Blosc2 is much faster than DuckDB (cold lookups compared against `zonemap` at `12.612 ms`): |
| 182 | + |
| 183 | +- Blosc2 `light` is already about `9x` faster than DuckDB `zonemap` |
| 184 | +- Blosc2 exact indexes are faster still: `medium` about `12x`, `full` about `20x` |
| 185 | +- DuckDB `art-index` does not show its real point-lookup behavior in this predicate form |
| 186 | + |
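The speedup factors for the width-1 range form can be recomputed directly from the latencies listed above. This is only a small arithmetic sketch; the variable and function names are illustrative, and the baseline is the DuckDB `zonemap` lookup time from this section.

```python
# Latencies in milliseconds, copied from the width-1 range-form results above.
DUCKDB_ZONEMAP_MS = 12.612
blosc2_cold_ms = {"light": 1.463, "medium": 1.089, "full": 0.618}


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


for kind, ms in blosc2_cold_ms.items():
    print(f"{kind:>6}: {speedup(DUCKDB_ZONEMAP_MS, ms):.1f}x faster than zonemap")
# light: 8.6x, medium: 11.6x, full: 20.4x
```

Using the warm timings instead gives somewhat larger factors, so the cold figures are the conservative reading.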
| 187 | + |
| 188 | +## Width-1 Comparison: Single-Value Predicate |
| 189 | + |
| 190 | +### DuckDB |
| 191 | + |
| 192 | +Command: |
| 193 | + |
| 194 | +```bash |
| 195 | +python duckdb_query_bench.py \ |
| 196 | + --size 10M \ |
| 197 | + --outdir /tmp/duckdb-bench-smoke2 \ |
| 198 | + --dist random \ |
| 199 | + --query-width 1 \ |
| 200 | + --layout all \ |
| 201 | + --repeats 3 \ |
| 202 | + --query-single-value |
| 203 | +``` |
| 204 | + |
| 205 | +Observed results: |
| 206 | + |
| 207 | +- `zonemap` |
| 208 | + - build: `1193.665 ms` |
| 209 | + - filtered lookup: `8.646 ms` |
| 210 | + - DB size: `56,111,104` bytes |
| 211 | +- `art-index` |
| 212 | + - build: `2849.869 ms` |
| 213 | + - filtered lookup: `0.755 ms` |
| 214 | + - DB size: `478,687,232` bytes |
| 215 | + |
| 216 | +### Blosc2 |
| 217 | + |
| 218 | +Command: |
| 219 | + |
| 220 | +```bash |
| 221 | +python index_query_bench.py \ |
| 222 | + --size 10M \ |
| 223 | + --outdir /tmp/indexes-10M \ |
| 224 | + --kind all \ |
| 225 | + --query-width 1 \ |
| 226 | + --dist random \ |
| 227 | + --query-single-value |
| 228 | +``` |
| 229 | + |
| 230 | +Observed results: |
| 231 | + |
| 232 | +- `light` |
| 233 | + - build: `1225.637 ms` |
| 234 | + - cold lookup: `1.290 ms` |
| 235 | + - warm lookup: `2.351 ms` |
| 236 | + - index sidecars: `27,497,393` bytes |
| 237 | +- `medium` |
| 238 | + - build: `5511.863 ms` |
| 239 | + - cold lookup: `1.081 ms` |
| 240 | + - warm lookup: `0.964 ms` |
| 241 | + - index sidecars: `37,645,201` bytes |
| 242 | +- `full` |
| 243 | + - build: `10954.844 ms` |
| 244 | + - cold lookup: `0.603 ms` |
| 245 | + - warm lookup: `0.525 ms` |
| 246 | + - index sidecars: `29,888,673` bytes |
| 247 | + |
| 248 | +### Interpretation |
| 249 | + |
| 250 | +Once DuckDB is allowed to use the more planner-friendly single-value predicate: |
| 251 | + |
| 252 | +- `art-index` becomes very fast (`0.755 ms`, down from `13.641 ms` with the range form) |
| 253 | +- `art-index` is now faster than Blosc2 `light` |
| 254 | +- Blosc2 `full` remains slightly faster than DuckDB `art-index` on this measured point-lookup case (`0.603 ms` cold vs `0.755 ms`) |
| 255 | + |
| 256 | +However, the storage costs are very different: |
| 257 | + |
| 258 | +- DuckDB `art-index` database size: about `478.7 MB` |
| 259 | +- DuckDB zonemap baseline size: about `56.1 MB` |
| 260 | +- estimated ART overhead over baseline: about `422.6 MB` |
| 261 | +- Blosc2 `full` base + index footprint: about `31 MB + 29.9 MB = 60.9 MB` |
| 262 | + |
| 263 | +So for true point lookups: |
| 264 | + |
| 265 | +- DuckDB `art-index` is competitive on latency |
| 266 | +- Blosc2 `full` is still faster in the measured run |
| 267 | +- Blosc2 `full` is much smaller overall (about `7.9x`) |
| 268 | +- DuckDB `art-index` is about `3.8x` faster to build than Blosc2 `full` |
| 269 | + |
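The storage and build-time figures in this section follow from simple arithmetic on the reported numbers. The sketch below reproduces them; it assumes decimal megabytes and takes the Blosc2 base array as exactly 31 MB, although the text only gives "about `31 MB`".

```python
MB = 1_000_000  # decimal megabytes, matching the "about 478.7 MB" figures

duckdb_zonemap_bytes = 56_111_104
duckdb_art_bytes = 478_687_232
blosc2_base_bytes = 31 * MB          # approximate; the text says "about 31 MB"
blosc2_full_index_bytes = 29_888_673

# ART overhead is estimated as the difference between the two DuckDB files.
art_overhead = duckdb_art_bytes - duckdb_zonemap_bytes
blosc2_full_total = blosc2_base_bytes + blosc2_full_index_bytes

print(f"ART overhead:      {art_overhead / MB:.1f} MB")                      # 422.6 MB
print(f"Blosc2 full total: {blosc2_full_total / MB:.1f} MB")                 # 60.9 MB
print(f"size ratio:        {duckdb_art_bytes / blosc2_full_total:.1f}x")     # 7.9x
print(f"build ratio:       {10954.844 / 2849.869:.1f}x")                     # 3.8x
```

Because DuckDB keeps table data and index in one file, the overhead figure is an estimate rather than a direct measurement of index bytes.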
| 270 | + |
| 271 | +## Blosc2 Light vs DuckDB Zonemap |
| 272 | + |
| 273 | +This is the cleanest cross-system comparison, because both are lossy pruning structures rather than exact |
| 274 | +secondary indexes. |
| 275 | + |
| 276 | +Main observations: |
| 277 | + |
| 278 | +- storage footprint is in roughly the same ballpark |
| 279 | + - DuckDB zonemap DB: about `56 MB` |
| 280 | + - Blosc2 base + `light`: about `58 MB` |
| 281 | +- Blosc2 `light` lookup speed is much better |
| 282 | + - width `50`: about `6.25 ms` vs `13.33 ms` |
| 283 | + - width `1`: about `1.3-1.5 ms` vs `8.6-12.6 ms` |
| 284 | + |
| 285 | +Conclusion: |
| 286 | + |
| 287 | +- DuckDB zonemap is closer in spirit to Blosc2 `light` than DuckDB ART is |
| 288 | +- even so, Blosc2 `light` is a materially stronger lookup structure on these workloads |
| 289 | + |
| 290 | + |
| 291 | +## Blosc2 Full vs DuckDB ART |
| 292 | + |
| 293 | +This is the most relevant exact-index comparison. |
| 294 | + |
| 295 | +Main observations: |
| 296 | + |
| 297 | +- point-lookup latency |
| 298 | + - DuckDB `art-index`: `0.755 ms` |
| 299 | + - Blosc2 `full`: `0.603 ms` cold, `0.525 ms` warm |
| 300 | +- build time |
| 301 | + - DuckDB `art-index`: `2849.869 ms` |
| 302 | + - Blosc2 `full`: `10954.844 ms` |
| 303 | +- footprint |
| 304 | + - DuckDB `art-index` DB: about `478.7 MB` |
| 305 | + - Blosc2 `full` base + index: about `60.9 MB` |
| 306 | + |
| 307 | +Conclusion: |
| 308 | + |
| 309 | +- DuckDB ART wins on build time |
| 310 | +- Blosc2 `full` wins on storage efficiency |
| 311 | +- Blosc2 `full` was slightly faster on the measured point lookup |
| 312 | +- DuckDB ART is much more sensitive to predicate shape |
| 313 | + |
| 314 | + |
| 315 | +## Why `--query-single-value` Matters More in DuckDB |
| 316 | + |
| 317 | +Observed behavior: |
| 318 | + |
| 319 | +- Blosc2: |
| 320 | + - width-1 range form and `==` are nearly equivalent in performance |
| 321 | +- DuckDB: |
| 322 | + - width-1 range form was much slower than `id = value` |
| 323 | + |
| 324 | +Practical implication: |
| 325 | + |
| 326 | +- Blosc2 benchmarks are fairly robust to whether a point lookup is written as `==` or as a collapsed range |
| 327 | +- DuckDB benchmarks must distinguish those two forms explicitly; otherwise ART performance is understated |
| 328 | + |
| 329 | + |
| 330 | +## Caveats |
| 331 | + |
| 332 | +- These results come from one hardware/software setup and one dataset shape. |
| 333 | +- DuckDB stores table data and indexes in one DB file, so payload and index bytes cannot be separated as cleanly |
| 334 | + as in Blosc2. |
| 335 | +- DuckDB zonemap is built-in table pruning metadata, not a separately managed index. |
| 336 | +- Blosc2 and DuckDB are not identical systems: |
| 337 | +  - the Blosc2 benchmark operates over compressed array storage and explicit index sidecars |
| 338 | +  - the DuckDB benchmark operates over a columnar SQL engine with its own optimizer behavior |
| 339 | + |
| 340 | + |
| 341 | +## Current Takeaways |
| 342 | + |
| 343 | +1. Blosc2 `light` is very competitive against DuckDB zonemap-like pruning. |
| 344 | +2. Blosc2 `light` offers much faster selective lookups than DuckDB zonemap (about `2x` at width 50, roughly `7x` to `9x` cold at width 1) at a similar total storage cost. |
| 345 | +3. DuckDB `art-index` becomes strong only when queries are written as true equality predicates. |
| 346 | +4. Blosc2 `full` compares very well against DuckDB `art-index` on point lookups: |
| 347 | + - slightly faster in the measured run |
| 348 | + - much smaller on disk |
| 349 | + - slower to build |
| 350 | +5. Query-shape sensitivity is a major difference: |
| 351 | + - small for Blosc2 |
| 352 | + - large for DuckDB ART |