@@ -0,0 +1,55 @@
{
"search_directories": [
"../big-ann-benchmarks/data/wikipedia_cohere"
],
Comment thread
harsha-simhadri marked this conversation as resolved.
"jobs": [
{
"type": "graph-index-build-pq",
"content": {
"index_operation": {
"source":{
"index-source": "Build",
"data_type": "float32",
"data": "wikipedia_base.bin.crop_nb_1000000",
Contributor

It would be nice if we could run these through the dry-run functionality to ensure they don't get out of date. The one problem is that file-path validation would fail unless the data/query/groundtruth files exist.

Contributor Author

I will look into dry run checks.

The goal of the new files is to illustrate the parameter choices to be used for larger datasets. The params in the config files for test datasets are insufficient for larger datasets. If, on the other hand, you are opposed to adding these entries here, please let me know and I can drop them and not spend time on dry run checks.

Contributor

Relatedly, it looks like there is no example config that builds a PQ index using the example datasets in this repo, although there is one for spherical. Is that something we should change?

Contributor

@hildebrandmw hildebrandmw May 11, 2026

I think adding examples is great and would really help reduce friction for people trying out the library! My concern is around ensuring these examples are up-to-date, so in the future users don't run into the unfortunate situation where they try an example and it immediately fails for reasons other than the files not being in the correct place.

Apologies if my original comment didn't leave much direction. The main issue with the existing dry-run functionality is that file paths are also validated (at least for existence) as part of the dry-run check, which makes it harder to use it out-of-the-box.

The issue boils down to this check here. If we had a way to turn it off when running a CI validation run, we'd be good. The problem is how to turn it off cleanly in a way that is somewhat general. Maybe a "--no-environment" flag in addition to --dry-run that skips system-dependent environment checks, with that setting made available in the Checker struct? That would provide a nice place for generally disabling all kinds of system-dependent checks for CI purposes.
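
To make the suggestion concrete, here is a minimal sketch of a single switch that skips file-existence checks while still resolving paths. It is written in Python for brevity rather than in the library's own language, and the `Checker` class, its `check_environment` field, and the `resolve_input` method are hypothetical names for illustration only; they are not the library's actual API.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Checker:
    """Hypothetical validator; names are illustrative, not the library's API."""

    search_directories: list = field(default_factory=list)
    # False would correspond to passing the proposed "--no-environment" flag.
    check_environment: bool = True

    def resolve_input(self, name: str) -> Path:
        """Resolve a config file name against the search directories."""
        if not name:
            raise ValueError("file name must be non-empty")

        # Fall back to the bare name when no search directory is configured.
        candidates = [Path(d) / name for d in self.search_directories] or [Path(name)]

        if not self.check_environment:
            # System-dependent check skipped: report where the file would be
            # looked up first, without touching the file system.
            return candidates[0]

        for candidate in candidates:
            if candidate.exists():
                return candidate
        raise FileNotFoundError(f"{name!r} not found in {self.search_directories}")


# CI-style validation: no data files need to exist.
ci_checker = Checker(["../big-ann-benchmarks/data/wikipedia_cohere"], check_environment=False)
resolved = ci_checker.resolve_input("wikipedia_query.bin")
```

Keeping the switch on the checker, rather than at each call site, is what would let every system-dependent check consult the same setting.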

"distance": "inner_product",
"start_point_strategy": "medoid",
"max_degree": 32,
"l_build": 100,
"alpha": 1.2,
"backedge_ratio": 1.0,
"num_threads": 32,
"multi_insert": {
"batch_size": 128,
"batch_parallelism": 32,
"intra_batch_candidates": "all"
}
},
"search_phase": {
"search-type": "topk",
"queries": "wikipedia_query.bin",
"groundtruth": "wikipedia-1M",
"reps": 1,
"num_threads": [
32
],
"runs": [
{
"search_n": 10,
"search_l": [
100,
150,
200,
300
],
"recall_k": 10
}
]
}
},
"num_pq_chunks": 192,
"seed": 13076402859301299683,
"max_fp_vecs_per_prune": 0,
"use_fp_for_search": false
}
}
]
}
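
For anyone trying the config above, it may help to list the input files it expects to find under `search_directories`. The sketch below walks a trimmed excerpt of the job and collects those names. The helper `referenced_inputs` and the choice of `data`, `queries`, and `groundtruth` as the file-naming keys are assumptions made for illustration, not part of the library.

```python
import json

# Trimmed excerpt of the job above; only the keys that name input files are kept.
CONFIG_EXCERPT = """
{
  "search_directories": ["../big-ann-benchmarks/data/wikipedia_cohere"],
  "jobs": [
    {
      "type": "graph-index-build-pq",
      "content": {
        "index_operation": {
          "source": {"data": "wikipedia_base.bin.crop_nb_1000000"},
          "search_phase": {
            "queries": "wikipedia_query.bin",
            "groundtruth": "wikipedia-1M"
          }
        }
      }
    }
  ]
}
"""

# Assumption: these three keys are the ones that name files on disk.
FILE_KEYS = ("data", "queries", "groundtruth")


def referenced_inputs(node):
    """Recursively collect string values stored under any of FILE_KEYS."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key in FILE_KEYS and isinstance(value, str):
                found.append(value)
            else:
                found.extend(referenced_inputs(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(referenced_inputs(item))
    return found


config = json.loads(CONFIG_EXCERPT)
inputs = referenced_inputs(config)
for name in inputs:
    print(name)
```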
@@ -0,0 +1,57 @@
{
"search_directories":[
"../big-ann-benchmarks/data/wikipedia_cohere"
],
"jobs": [
{
"type": "graph-index-build-spherical-quantization",
"content": {
"build": {
"data_type": "float32",
"data": "wikipedia_base.bin.crop_nb_1000000",
"distance": "inner_product",
"start_point_strategy": "medoid",
"max_degree": 32,
"l_build": 100,
"alpha": 1.2,
"backedge_ratio": 1.0,
"num_threads": 32,
"multi_insert": null
},
"search_phase": {
"search-type": "topk",
"queries": "wikipedia_query.bin",
"groundtruth": "wikipedia-1M",
"reps": 5,
"num_threads": [
32
],
"runs": [
{
"search_n": 10,
"search_l": [
50,
100,
150,
200,
300
],
"recall_k": 10
}
]
},
"seed": 12648430,
"transform_kind": {
"padding_hadamard": "same"
},
"query_layouts": [
"same_as_data",
"full_precision"
],
"num_bits": 2,
"pre_scale": {
"some": 0.00390625
}
}
}
]
}
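
As a rough guide to what the quantization parameters in the two configs in this PR imply, the sketch below works out the per-vector storage. It rests on assumptions that are not stated in the PR: an embedding dimension of 768, one byte per PQ chunk, `num_bits` meaning bits per dimension, and a transform that keeps the dimension unchanged. Any per-vector metadata is ignored.

```python
DIM = 768  # assumption: embedding dimension of the wikipedia_cohere vectors

NUM_PQ_CHUNKS = 192      # from the PQ config
NUM_BITS = 2             # from the spherical quantization config
PRE_SCALE = 0.00390625   # from the spherical quantization config

# float32 baseline: 4 bytes per dimension.
full_precision_bytes = DIM * 4

# Assumption: each PQ chunk is encoded as a single byte.
pq_bytes = NUM_PQ_CHUNKS * 1
dims_per_chunk = DIM // NUM_PQ_CHUNKS

# Assumption: num_bits is bits per dimension and the dimension is unchanged.
spherical_bytes = DIM * NUM_BITS // 8

print(f"full precision: {full_precision_bytes} bytes/vector")
print(f"PQ: {pq_bytes} bytes/vector ({dims_per_chunk} dims per chunk)")
print(f"spherical, {NUM_BITS}-bit: {spherical_bytes} bytes/vector")
print(f"pre_scale is 2**-8: {PRE_SCALE == 2 ** -8}")
```

Under these assumptions both configs store the same number of bytes per vector, which would make their recall numbers directly comparable at equal memory.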