add config examples to use PQ and SQ indexing and search for wiki-1M with Cohere embeddings #1047
Changes from all commits
a623513
bd3e73f
fdcbff1
10c4f6f
b32f1ad
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,55 @@ | ||
| { | ||
| "search_directories": [ | ||
| "../big-ann-benchmarks/data/wikipedia_cohere" ], | ||
| "jobs": [ | ||
| { | ||
| "type": "graph-index-build-pq", | ||
| "content": { | ||
| "index_operation": { | ||
| "source":{ | ||
| "index-source": "Build", | ||
| "data_type": "float32", | ||
| "data": "wikipedia_base.bin.crop_nb_1000000", | ||
|
Contributor
It would be nice if we could run these through the dry-run functionality to ensure they don't get out of date. The one problem is that file-path validation would fail without the data/query/groundtruth files present.
Contributor
Author
I will look into dry-run checks. The goal of the new files is to illustrate the parameter choices to use for larger datasets; the params in the config files for the test datasets are insufficient for larger datasets. If, on the other hand, you are opposed to adding these entries here, please let me know and I can drop them rather than spend time on dry-run checks.
Contributor
Relatedly, it looks like there is no example config that builds a PQ index using the example datasets in this repo, although there is one for spherical. Is that something we should change?
Contributor
I think adding examples is great and would really help reduce friction for people trying out the library! My concern is around ensuring these examples are up-to-date, so in the future users don't run into the unfortunate situation where they try an example and it immediately fails for reasons other than the files not being in the correct place. Apologies if my original comment didn't leave much direction. The main issue with the existing The issue boils down to this check here. We want a way to turn it off when running a CI validation run, then we'd be good. The problem is how to turn it off cleanly in a way that is somewhat general. Maybe an additional "--no-environment" flag in addition to |
||
| "distance": "inner_product", | ||
| "start_point_strategy": "medoid", | ||
| "max_degree": 32, | ||
| "l_build": 100, | ||
| "alpha": 1.2, | ||
| "backedge_ratio": 1.0, | ||
| "num_threads": 32, | ||
| "multi_insert": { | ||
| "batch_size": 128, | ||
| "batch_parallelism": 32, | ||
| "intra_batch_candidates": "all" | ||
| } | ||
| }, | ||
| "search_phase": { | ||
| "search-type": "topk", | ||
| "queries": "wikipedia_query.bin", | ||
| "groundtruth": "wikipedia-1M", | ||
| "reps": 1, | ||
| "num_threads": [ | ||
| 32 | ||
| ], | ||
| "runs": [ | ||
| { | ||
| "search_n": 10, | ||
| "search_l": [ | ||
| 100, | ||
| 150, | ||
| 200, | ||
| 300 | ||
| ], | ||
| "recall_k": 10 | ||
| } | ||
| ] | ||
| } | ||
| }, | ||
| "num_pq_chunks": 192, | ||
| "seed": 13076402859301299683, | ||
| "max_fp_vecs_per_prune": 0, | ||
| "use_fp_for_search": false | ||
| } | ||
| } | ||
| ] | ||
| } | ||
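
The concern raised in the review thread above (example configs going stale while the data, query, and groundtruth files are absent) could be partly covered by a structural check that never touches the filesystem. The following is a minimal Python sketch of that idea, not the project's dry-run mode. `check_config` and `find_search_phases` are hypothetical names, and the two rules enforced (`search_l` at least `search_n`, `recall_k` at most `search_n`) are assumptions about what the benchmark tool expects.

```python
import json


def find_search_phases(node):
    """Yield every dict stored under a "search_phase" key, at any depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "search_phase" and isinstance(value, dict):
                yield value
            else:
                yield from find_search_phases(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_search_phases(item)


def check_config(text):
    """Return a list of problems found; an empty list means the shape looks sane.

    Hypothetical helper: it parses the JSON and inspects keys only, so file
    paths such as "data" or "queries" are never resolved or opened.
    """
    problems = []
    config = json.loads(text)
    if not isinstance(config.get("search_directories"), list):
        problems.append("search_directories must be a list")
    jobs = config.get("jobs")
    if not isinstance(jobs, list) or not jobs:
        problems.append("jobs must be a non-empty list")
        return problems
    for i, job in enumerate(jobs):
        if not isinstance(job.get("type"), str):
            problems.append(f"jobs[{i}].type must be a string")
        phases = list(find_search_phases(job.get("content", {})))
        if not phases:
            problems.append(f"jobs[{i}] has no search_phase")
        for phase in phases:
            for run in phase.get("runs", []):
                # Assumed rule: the search list must hold at least search_n entries.
                too_small = [l for l in run["search_l"] if l < run["search_n"]]
                if too_small:
                    problems.append(f"jobs[{i}]: search_l {too_small} below search_n")
                # Assumed rule: recall cannot be measured beyond what is returned.
                if run["recall_k"] > run["search_n"]:
                    problems.append(f"jobs[{i}]: recall_k exceeds search_n")
    return problems


# Trimmed copy of the PQ config above, keeping only the keys the check reads.
example = """
{
  "search_directories": ["../big-ann-benchmarks/data/wikipedia_cohere"],
  "jobs": [{
    "type": "graph-index-build-pq",
    "content": {
      "index_operation": {
        "search_phase": {
          "runs": [{"search_n": 10, "search_l": [100, 150, 200, 300], "recall_k": 10}]
        }
      },
      "num_pq_chunks": 192
    }
  }]
}
"""

print(check_config(example))
```

Because the search looks for `search_phase` at any depth, the same check would apply to both layouts in this PR: nested under `index_operation` in the PQ config and directly under `content` in the spherical one.
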
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,57 @@ | ||
| { | ||
| "search_directories":[ | ||
| "../big-ann-benchmarks/data/wikipedia_cohere" | ||
| ], | ||
| "jobs": [ | ||
| { | ||
| "type": "graph-index-build-spherical-quantization", | ||
| "content": { | ||
| "build": { | ||
| "data_type": "float32", | ||
| "data": "wikipedia_base.bin.crop_nb_1000000", | ||
| "distance": "inner_product", | ||
| "start_point_strategy": "medoid", | ||
| "max_degree": 32, | ||
| "l_build": 100, | ||
| "alpha": 1.2, | ||
| "backedge_ratio": 1.0, | ||
| "num_threads": 32, | ||
| "multi_insert": null | ||
| }, | ||
| "search_phase": { | ||
| "search-type": "topk", | ||
| "queries": "wikipedia_query.bin", | ||
| "groundtruth": "wikipedia-1M", | ||
| "reps": 5, | ||
| "num_threads": [ | ||
| 32 | ||
| ], | ||
| "runs": [ | ||
| { | ||
| "search_n": 10, | ||
| "search_l": [ | ||
| 50, | ||
| 100, | ||
| 150, | ||
| 200,300 | ||
| ], | ||
| "recall_k": 10 | ||
| } | ||
| ] | ||
| }, | ||
| "seed": 12648430, | ||
| "transform_kind": { | ||
| "padding_hadamard": "same" | ||
| }, | ||
| "query_layouts": [ | ||
| "same_as_data", | ||
| "full_precision" | ||
| ], | ||
| "num_bits": 2, | ||
| "pre_scale": { | ||
| "some": 0.00390625 | ||
| } | ||
| } | ||
| } | ||
| ] | ||
| } |
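
To put the two quantization settings side by side, the per-vector code size can be estimated with a little arithmetic. This is a rough sketch under stated assumptions: the vector dimension is taken to be 768 (the configs do not state it), each PQ chunk is assumed to be stored as one byte, the `padding_hadamard: "same"` transform is assumed to leave the dimension unchanged, and any per-vector metadata the index may store alongside the codes is ignored. The function names are hypothetical.

```python
# Assumption: 768-dimensional float32 vectors; the configs above do not state this.
DIM = 768


def full_precision_bytes(dim):
    """Bytes for one float32 vector."""
    return dim * 4


def pq_bytes(num_pq_chunks, bits_per_chunk=8):
    """Bytes for one PQ code, assuming one 8-bit codebook index per chunk."""
    return num_pq_chunks * bits_per_chunk // 8


def scalar_bytes(dim, num_bits):
    """Bytes for one scalar-quantized code with num_bits per dimension, rounded up."""
    return (dim * num_bits + 7) // 8


full = full_precision_bytes(DIM)
pq = pq_bytes(192)          # "num_pq_chunks": 192
sq = scalar_bytes(DIM, 2)   # "num_bits": 2

print(full, pq, sq)
```

Under these assumptions both configs compress each vector to the same code size, which would make their recall numbers comparable at equal memory cost.
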