Commit 0c5711c

Comparison with DuckDB and moved benchmarks to bench/indexing
1 parent 4da9547 commit 0c5711c

5 files changed

Lines changed: 1398 additions & 2 deletions

Lines changed: 352 additions & 0 deletions
@@ -0,0 +1,352 @@
# Blosc2 vs DuckDB Indexes

This note summarizes the benchmark comparisons we ran between Blosc2 indexes and DuckDB indexing/pruning
mechanisms on a 10M-row structured dataset.

The goal is not to claim a universal winner, but to document the current observed tradeoffs around:

- index creation time
- lookup latency
- total storage footprint
- sensitivity to query shape


## Benchmark Setup

### Dataset

- Rows: `10,000,000`
- Schema:
  - `id`: indexed field, `float64`
  - `payload`: deterministic nontrivial ramp payload
- Distribution: `random`
  - true random shuffle of `id`
- Query widths tested:
  - `50`
  - `1`

### Blosc2

- Script: `index_query_bench.py`
- Index kinds:
  - `ultralight`
  - `light`
  - `medium`
  - `full`
- Default geometry in these runs:
  - `chunks=1,250,000`
  - `blocks=10,000`

### DuckDB

- Script: `duckdb_query_bench.py`
- Layouts:
  - `zonemap`
  - `art-index`
- Batch size used while loading:
  - `1,250,000`


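To make the geometry concrete, the sketch below works out how many chunks, blocks, and load batches the settings above imply for 10M rows. It is plain arithmetic on the numbers listed in this section; the variable names are illustrative and are not taken from either benchmark script.

```python
# Partition counts implied by the benchmark settings above.
# Names are illustrative; only the numeric values come from this note.
import math

n_rows = 10_000_000
chunk_len = 1_250_000      # Blosc2 `chunks` setting
block_len = 10_000         # Blosc2 `blocks` setting
duckdb_batch = 1_250_000   # DuckDB batch size used while loading

n_chunks = math.ceil(n_rows / chunk_len)
blocks_per_chunk = math.ceil(chunk_len / block_len)
n_blocks = n_chunks * blocks_per_chunk
n_batches = math.ceil(n_rows / duckdb_batch)

print(f"Blosc2 chunks:        {n_chunks}")
print(f"blocks per chunk:     {blocks_per_chunk}")
print(f"Blosc2 blocks total:  {n_blocks}")
print(f"DuckDB load batches:  {n_batches}")
```

With these settings, one DuckDB load batch holds the same number of rows as one Blosc2 chunk.
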
## Important Context

Two DuckDB query shapes behave very differently in these runs:

- range form:
  - `id >= lo AND id <= hi`
- single-value form:
  - `id = value`

For Blosc2, switching between a collapsed width-1 range and `==` makes almost no practical difference.

For DuckDB, the difference is large:

- `art-index` was much slower with the range form (`13.641 ms` at width 1)
- `art-index` became much faster with the single-value `=` predicate (`0.755 ms`)

So any DuckDB comparison must state which predicate shape was used.


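The two shapes can be illustrated with a small helper that builds the `WHERE` fragment for each. This is a hypothetical sketch, not code from `duckdb_query_bench.py`: the function name, the `?` placeholders, and the assumption that a width-`w` range is inclusive with `hi = lo + w - 1` (so it collapses to `lo == hi` at width 1) are all illustrative.

```python
# Hypothetical helper: build the WHERE fragment for the two query shapes.
# Assumes an inclusive range whose upper bound collapses onto the lower bound
# when width == 1; the real benchmark scripts may define width differently.

def make_predicate(lo, width, single_value=False):
    """Return (sql_fragment, params) for a lookup starting at `lo`."""
    if single_value:
        if width != 1:
            raise ValueError("single-value form only makes sense for width 1")
        return "id = ?", (float(lo),)
    hi = lo + width - 1
    return "id >= ? AND id <= ?", (float(lo), float(hi))


range_form = make_predicate(1000, width=1)
point_form = make_predicate(1000, width=1, single_value=True)
print(range_form)
print(point_form)
```

For width 1 both fragments select the same rows; only the predicate text handed to the query planner differs.
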
## Width-50 Comparison

### DuckDB

Command:

```bash
python duckdb_query_bench.py \
  --size 10M \
  --outdir /tmp/duckdb-bench-smoke2 \
  --dist random \
  --query-width 50 \
  --layout all \
  --repeats 1
```

Observed results:

- `zonemap`
  - build: `1180.630 ms`
  - filtered lookup: `13.326 ms`
  - DB size: `56,111,104` bytes
- `art-index`
  - build: `2844.010 ms`
  - filtered lookup: `12.419 ms`
  - DB size: `478,687,232` bytes

### Blosc2

Command:

```bash
python index_query_bench.py \
  --size 10M \
  --outdir /tmp/indexes-10M \
  --kind light \
  --query-width 50 \
  --in-mem \
  --dist random
```

Observed `light` results:

- build: `705.193 ms`
- cold lookup: `6.370 ms`
- warm lookup: `6.250 ms`
- base array size: about `31 MB`
- `light` index sidecars: about `27 MB`
- total footprint: about `58 MB`

### Interpretation

For this moderately selective random workload:

- Blosc2 `light` is about `2x` faster than DuckDB `zonemap`
- Blosc2 `light` has a total footprint similar to DuckDB `zonemap`
- DuckDB `art-index` is only slightly faster than `zonemap` here, but its database file is about `8.5x` larger

This suggests that Blosc2 `light` is more than a simple zonemap. It behaves like an active lossy lookup
structure rather than only coarse pruning metadata.


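The ratios quoted in the interpretation above can be reproduced directly from the reported width-50 figures. The snippet below is only arithmetic on those numbers; the helper name `ratio` is illustrative.

```python
# Speedup and size ratios for the width-50 run, using the figures reported above.
zonemap_ms = 13.326
art_ms = 12.419
light_cold_ms = 6.370
light_warm_ms = 6.250

zonemap_bytes = 56_111_104
art_bytes = 478_687_232


def ratio(larger, smaller):
    """How many times `larger` exceeds `smaller`."""
    return larger / smaller


print(f"light (cold) vs zonemap lookup: {ratio(zonemap_ms, light_cold_ms):.2f}x")
print(f"light (warm) vs zonemap lookup: {ratio(zonemap_ms, light_warm_ms):.2f}x")
print(f"art-index vs zonemap lookup:    {ratio(zonemap_ms, art_ms):.2f}x")
print(f"art-index vs zonemap DB size:   {ratio(art_bytes, zonemap_bytes):.2f}x")
```

Note that the DuckDB run used `--repeats 1`, so these ratios rest on a single measurement per layout.
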
## Width-1 Comparison: Generic Range Form

### DuckDB

Command:

```bash
python duckdb_query_bench.py \
  --size 10M \
  --outdir /tmp/duckdb-bench-smoke2 \
  --dist random \
  --query-width 1 \
  --layout all \
  --repeats 3
```

Observed results:

- `zonemap`
  - filtered lookup: `12.612 ms`
- `art-index`
  - filtered lookup: `13.641 ms`

### Blosc2

Command:

```bash
python index_query_bench.py \
  --size 10M \
  --outdir /tmp/indexes-10M \
  --kind all \
  --query-width 1 \
  --dist random
```

Observed results:

- `light`
  - cold lookup: `1.463 ms`
  - warm lookup: `1.286 ms`
- `medium`
  - cold lookup: `1.089 ms`
  - warm lookup: `0.986 ms`
- `full`
  - cold lookup: `0.618 ms`
  - warm lookup: `0.544 ms`

### Interpretation

With the generic range form, Blosc2 is much faster than DuckDB:

- Blosc2 `light` is already about `9x` faster than DuckDB `zonemap`
- Blosc2 exact indexes (`medium`, `full`) are faster still, at about `12x` and `20x` faster than `zonemap` on cold lookups
- DuckDB `art-index` does not show its real point-lookup behavior in this predicate form


## Width-1 Comparison: Single-Value Predicate

### DuckDB

Command:

```bash
python duckdb_query_bench.py \
  --size 10M \
  --outdir /tmp/duckdb-bench-smoke2 \
  --dist random \
  --query-width 1 \
  --layout all \
  --repeats 3 \
  --query-single-value
```

Observed results:

- `zonemap`
  - build: `1193.665 ms`
  - filtered lookup: `8.646 ms`
  - DB size: `56,111,104` bytes
- `art-index`
  - build: `2849.869 ms`
  - filtered lookup: `0.755 ms`
  - DB size: `478,687,232` bytes

### Blosc2

Command:

```bash
python index_query_bench.py \
  --size 10M \
  --outdir /tmp/indexes-10M \
  --kind all \
  --query-width 1 \
  --dist random \
  --query-single-value
```

Observed results:

- `light`
  - build: `1225.637 ms`
  - cold lookup: `1.290 ms`
  - warm lookup: `2.351 ms`
  - index sidecars: `27,497,393` bytes
- `medium`
  - build: `5511.863 ms`
  - cold lookup: `1.081 ms`
  - warm lookup: `0.964 ms`
  - index sidecars: `37,645,201` bytes
- `full`
  - build: `10954.844 ms`
  - cold lookup: `0.603 ms`
  - warm lookup: `0.525 ms`
  - index sidecars: `29,888,673` bytes

### Interpretation

Once DuckDB is allowed to use the more planner-friendly single-value predicate:

- `art-index` becomes very fast
- `art-index` is now faster than Blosc2 `light`
- Blosc2 `full` remains slightly faster than DuckDB `art-index` on this measured point-lookup case

However, the storage costs are very different:

- DuckDB `art-index` database size: about `478.7 MB`
- DuckDB zonemap baseline size: about `56.1 MB`
- estimated ART overhead over baseline: about `422.6 MB`
- Blosc2 `full` base + index footprint: about `31 MB + 29.9 MB = 60.9 MB`

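The storage figures above follow from the reported byte counts. The sketch below redoes that arithmetic, assuming decimal megabytes (1 MB = 10^6 bytes) and taking the Blosc2 base array as exactly 31 MB, although the note only reports it as "about `31 MB`".

```python
# Storage arithmetic behind the figures above (decimal MB, 1 MB = 10**6 bytes).
MB = 10**6

art_db_bytes = 478_687_232      # DuckDB database with the ART index
zonemap_db_bytes = 56_111_104   # DuckDB database without the ART index
full_index_bytes = 29_888_673   # Blosc2 `full` index sidecars
base_array_mb = 31              # assumption: reported only as "about 31 MB"

# The ART overhead is estimated as the difference between the two DB files.
art_overhead_bytes = art_db_bytes - zonemap_db_bytes
blosc2_full_mb = base_array_mb + full_index_bytes / MB
size_ratio = (art_db_bytes / MB) / blosc2_full_mb

print(f"estimated ART overhead: {art_overhead_bytes / MB:.1f} MB")
print(f"Blosc2 full footprint:  {blosc2_full_mb:.1f} MB")
print(f"DuckDB art-index DB is about {size_ratio:.1f}x larger")
```

Because DuckDB keeps table data and index in one file, the overhead is an estimate obtained by subtraction rather than a directly measured index size.
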
So for true point lookups:

- DuckDB `art-index` is competitive on latency
- Blosc2 `full` is still faster in the measured run
- Blosc2 `full` is much smaller overall (about `60.9 MB` vs `478.7 MB`)
- DuckDB `art-index` is about `3.8x` faster to build than Blosc2 `full`


## Blosc2 Light vs DuckDB Zonemap

This is the cleanest cross-system comparison, because both are lossy pruning structures rather than exact
secondary indexes.

Main observations:

- storage footprint is in roughly the same ballpark
  - DuckDB zonemap DB: about `56 MB`
  - Blosc2 base + `light`: about `58 MB`
- Blosc2 `light` lookup speed is much better
  - width `50`: about `6.25 ms` vs `13.33 ms`
  - width `1`: about `1.3-1.5 ms` vs `8.6-12.6 ms` (excluding the `2.351 ms` warm lookup in the single-value run)

Conclusion:

- DuckDB zonemap is closer in spirit to Blosc2 `light` than DuckDB ART is
- but Blosc2 `light` is a materially stronger lookup structure on these workloads


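The difference between coarse pruning metadata and a stronger lookup structure is easier to see with a toy zonemap. The sketch below is a generic min/max-per-block zonemap in pure Python. It is not how DuckDB or Blosc2 implement anything, and the sizes and function names are made up for illustration. It shows why min/max pruning alone tends to skip few blocks when `id` is randomly shuffled, which is the distribution used in these runs.

```python
# Generic zonemap illustration in pure Python (not DuckDB or Blosc2 code).
import random


def build_zonemap(values, block_len):
    """Return one (min, max) pair per block of `block_len` values."""
    return [
        (min(values[i:i + block_len]), max(values[i:i + block_len]))
        for i in range(0, len(values), block_len)
    ]


def candidate_blocks(zonemap, lo, hi):
    """Indices of blocks whose [min, max] interval overlaps [lo, hi]."""
    return [i for i, (bmin, bmax) in enumerate(zonemap) if bmax >= lo and bmin <= hi]


n_rows, block_len = 100_000, 1_000   # scaled-down, illustrative sizes
sorted_ids = [float(i) for i in range(n_rows)]
shuffled_ids = sorted_ids[:]
random.Random(0).shuffle(shuffled_ids)  # fixed seed for repeatability

lo, hi = 50_000.0, 50_049.0          # a width-50 query

sorted_hits = candidate_blocks(build_zonemap(sorted_ids, block_len), lo, hi)
shuffled_hits = candidate_blocks(build_zonemap(shuffled_ids, block_len), lo, hi)

n_blocks = n_rows // block_len
print(f"sorted ids:   {len(sorted_hits)} of {n_blocks} blocks to scan")
print(f"shuffled ids: {len(shuffled_hits)} of {n_blocks} blocks to scan")
```

On shuffled data nearly every block's min/max interval spans the query range, so a pure zonemap cannot rule out many blocks. A structure that answers selective lookups quickly on such data has to carry more information than per-block extremes.
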
## Blosc2 Full vs DuckDB ART

This is the most relevant exact-index comparison.

Main observations:

- point-lookup latency
  - DuckDB `art-index`: `0.755 ms`
  - Blosc2 `full`: `0.603 ms` cold, `0.525 ms` warm
- build time
  - DuckDB `art-index`: `2849.869 ms`
  - Blosc2 `full`: `10954.844 ms`
- footprint
  - DuckDB `art-index` DB: about `478.7 MB`
  - Blosc2 `full` base + index: about `60.9 MB`

Conclusion:

- DuckDB ART wins on build time
- Blosc2 `full` wins on storage efficiency
- Blosc2 `full` was slightly faster on the measured point lookup
- DuckDB ART is much more sensitive to predicate shape


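One way to weigh the build-time and lookup-latency results against each other is a back-of-the-envelope amortization: how many point lookups it would take for the faster Blosc2 `full` lookups to repay its longer build. The sketch below assumes the single measured latencies are representative of every lookup and ignores storage cost entirely; the function name is illustrative.

```python
# Back-of-the-envelope amortization using the point-lookup figures above.
# Assumes every lookup costs the single measured latency; ignores storage.
art_build_ms, art_lookup_ms = 2849.869, 0.755
full_build_ms = 10954.844
full_lookup_ms = {"cold": 0.603, "warm": 0.525}

extra_build_ms = full_build_ms - art_build_ms


def break_even_lookups(saving_per_lookup_ms):
    """Lookups needed before the per-lookup saving repays the extra build time."""
    return extra_build_ms / saving_per_lookup_ms


for label, ms in full_lookup_ms.items():
    n = break_even_lookups(art_lookup_ms - ms)
    print(f"{label}: break-even after about {n:,.0f} lookups")
```

Under these assumptions the break-even point is in the tens of thousands of lookups, so for short-lived or rarely queried datasets the build-time advantage of DuckDB ART may matter more than the latency gap.
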
## Why `--query-single-value` Matters More in DuckDB

Observed behavior:

- Blosc2:
  - width-1 range form and `==` are nearly equivalent in performance
- DuckDB:
  - width-1 range form was much slower than `id = value`

Practical implication:

- Blosc2 benchmarks are fairly robust to whether a point lookup is written as `==` or as a collapsed range
- DuckDB benchmarks must distinguish those two forms explicitly, otherwise ART performance is understated


## Caveats

- These results come from one hardware/software setup and one dataset shape.
- DuckDB stores table data and indexes in one DB file, so payload and index bytes cannot be separated as cleanly
  as in Blosc2.
- DuckDB zonemap is built-in table pruning metadata, not a separately managed index.
- Blosc2 and DuckDB are not identical systems:
  - the Blosc2 benchmark operates over compressed array storage and explicit index sidecars
  - the DuckDB benchmark operates over a columnar SQL engine with its own optimizer behavior


## Current Takeaways

1. Blosc2 `light` is very competitive against DuckDB zonemap-like pruning.
2. Blosc2 `light` offers much faster selective lookups than DuckDB zonemap at a similar total storage cost.
3. DuckDB `art-index` becomes strong only when queries are written as true equality predicates.
4. Blosc2 `full` compares very well against DuckDB `art-index` on point lookups:
   - slightly faster in the measured run
   - much smaller on disk
   - slower to build
5. Query-shape sensitivity is a major difference:
   - small for Blosc2
   - large for DuckDB ART
