add config examples to use PQ and SQ indexing and search for wiki-1M with Cohere embeddings #1047
harsha-simhadri wants to merge 7 commits into
Conversation
Pull request overview
Adds new benchmark configuration examples intended to demonstrate product quantization (PQ) and spherical quantization workflows for the wikipedia-1M Cohere embedding dataset.
Changes:
- Added a PQ graph-index build+search example JSON for wikipedia-1M.
- Added an exhaustive spherical-quantization example JSON (currently named as a wiki1M graph-index example).
Reviewed changes
Copilot reviewed 1 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| diskann-benchmark/example/graph-index-spherical-quantization-wiki1M.json | Adds an exhaustive spherical quantization benchmark config (but currently references siftsmall test data / exhaustive tag). |
| diskann-benchmark/example/graph-index-product-quantization-wiki1M.json | Adds a PQ graph-index benchmark config for wikipedia-1M (currently uses an unregistered job type). |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
The PR accidentally renamed spherical-exhaustive.json to graph-index-spherical-quantization-wiki1M.json, breaking the spherical_quantization_intergration test which references the old name.
- Restore spherical-exhaustive.json with its original content
- Update graph-index-spherical-quantization-wiki1M.json with proper content

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Codecov Report ✅
All modified and coverable lines are covered by tests.

Additional details and impacted files

@@ Coverage Diff @@
## main #1047 +/- ##
==========================================
- Coverage 90.60% 89.47% -1.13%
==========================================
Files 461 461
Lines 85494 85559 +65
==========================================
- Hits 77462 76558 -904
- Misses 8032 9001 +969
Flags with carried forward coverage won't be shown.
@@ -0,0 +1,55 @@
{
  "search_directories": [
    "../big-ann-benchmarks/data/wikipedia_cohere" ],
Minor: In the spirit of self‑service, consider including guidance on how to download the dataset, so that even an AI agent can follow the steps and retrieve the required data.
where do you prefer I add this?
added to benchmark readme. please mark as resolved if this is acceptable
hildebrandmw left a comment
Thanks. While a pre-built self-serve config like this is nice, I worry that if we proliferate these kinds of auxiliary files without a CI check for compatibility, they will continually break as we evolve diskann-benchmark. Is this just a convenience for not using the inputs functionality of diskann-benchmark, and if so, is there something we can do to the CLI instead to make it more ergonomic?
| "source":{ | ||
| "index-source": "Build", | ||
| "data_type": "float32", | ||
| "data": "wikipedia_base.bin.crop_nb_1000000", |
Would be nice if we could run these through the dry run functionality to ensure they don't get out of date. The one problem would be file-path validation would fail without the existence of the data/query/groundtruth files.
I will look into dry run checks.
The goal of the new files is to illustrate the parameter choices to be used for larger datasets. The params in the config files for test datasets are insufficient for larger datasets. If, on the other hand, you are opposed to adding these entries here, please let me know and I can drop it and not spend time on dry run checks.
Relatedly, it looks like there is no example config that builds a PQ index using the example datasets in this repo, although there is one for spherical. Is that something we should change?
I think adding examples is great and would really help reduce friction for people trying out the library! My concern is around ensuring these examples are up-to-date, so in the future users don't run into the unfortunate situation where they try an example and it immediately fails for reasons other than the files not being in the correct place.
Apologies if my original comment didn't leave much direction. The main issue with the existing dry-run functionality is that file paths are also validated (at least for existence) as part of the dry-run check, which makes it harder to use it out-of-the-box.
The issue boils down to this check here. We want a way to turn it off when running a CI validation run; then we'd be good. The problem is how to turn it off cleanly in a way that is somewhat general. Maybe an additional "--no-environment" flag, in addition to --dry-run, to skip system-dependent environment checks, and making that available in the Checker struct? That would provide a nice place for generally disabling all kinds of system-dependent checks for CI purposes.
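To make the suggestion concrete, here is a minimal sketch of what such a flag-aware Checker could look like. Everything here is hypothetical: the field and method names are illustrative and do not reflect diskann-benchmark's actual API; only `--dry-run` and the idea of a "Checker struct" come from this thread.

```rust
use std::path::Path;

/// Hypothetical sketch of a validation Checker that can skip
/// system-dependent environment checks (e.g. file existence)
/// when run in CI with a `--no-environment` style flag.
#[allow(dead_code)]
struct Checker {
    dry_run: bool,        // validate the config without running the job
    no_environment: bool, // additionally skip system-dependent checks
}

impl Checker {
    /// Validate a config's data path. With `no_environment` set, only
    /// the shape of the path is checked, not its existence on disk,
    /// so CI can validate example configs without the datasets present.
    fn check_data_path(&self, path: &str) -> Result<(), String> {
        if path.is_empty() {
            return Err("data path must be non-empty".to_string());
        }
        if self.no_environment {
            // CI validation run: skip the existence check entirely.
            return Ok(());
        }
        if Path::new(path).exists() {
            Ok(())
        } else {
            Err(format!("file not found: {path}"))
        }
    }
}
```

With this shape, a CI job could construct `Checker { dry_run: true, no_environment: true }` and validate every example config in the repo, while local users keep the full path-existence checks.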