Skip to content
64 changes: 64 additions & 0 deletions .github/copilot-instructions.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,64 @@
When performing a code review, check that:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can copilot instructions point to the existing AGENTS.md files where applicable? It looks like a lot of instructions are manually duplicated here, so we risk them getting out of sync.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1.
Do we really need .github/copilot-instructions.md? if so, how is it semantically different from /AGENTS.md? Why can't we keep all agent-related instructions in /AGENTS.md?


## SemVer and API Compatibility

The workspace obeys SemVer. Removing or changing public API signatures (functions, types, re-exports) requires a version bump and a description of the suggested migration impact in the PR description.

## Dependency and Build Hygiene
- Check for changes that introduce new dependencies or increase build times. If a dependency is added, it must be justified in the PR description.


## Error Handling

- Do not introduce `panic!` paths for recoverable errors — propagate with `Result` instead.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably deserves a mention/move to the top-level AGENTS.md

- Keep error types small. Avoid large enums/structs that blow up the stack; look for ways to reduce field sizes (e.g., compute derivable fields, use enums instead of `&'static str`).
- Prefer `ANNError::new(ANNErrorKind::…, e)` over the old `log_*`-style constructors, which force eager string formatting and double-log errors.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: There is not double-log from these APIs (the name is a misnomer). The biggest issue is eager string formatting.

- When using `thiserror`, rely on `#[from]` for automatic `Error::source` chaining — do not format the inner error in the `#[error("…")]` display string.
- Include relevant context values (e.g., the kind, key, or dimension) in error messages for debuggability.

## Documentation

- Doc comments and README examples must match actual API signatures and serialized shapes.
- Stale examples that fail to compile or deserialize are treated as bugs.
- Do not leave dead references to APIs that no longer exist.
- When changing a function signature or removing a parameter, update all doc comments that mention the old signature.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More suggestions on documentation:

  • Doc and inline comments should describe what the code does. Do not describe behavior by contrasting it with other code, except when referencing documented external behavior.
  • Avoid comments that simply restate what is already clear from function signatures or where clauses.
  • Do not list functions or types in module docs that rustdoc already documents.
  • Module-level docs should describe the purpose and structure of the module, not its contents.


## Constants and Assumptions

- Do not hardcode magic values — make them configurable with sensible defaults and document the rationale.
- Add assertions for invariants that callers or maintainers would otherwise have to discover by reading the implementation.

## SIMD and Platform Portability

- Do not assume specific SIMD lane widths (e.g., `f32s::LANES == 8`). Code must be correct on AVX2, AVX-512, and ARM/NEON.
- When touching architecture-specific intrinsics, verify cross-platform behavior per `diskann-wide/README.md`.

## Testing
- Do not drop existing unit tests without a strong reason.
- Keep test helpers close to the code they exercise, typically in a `mod tests` at the bottom of the file or in an adjacent test module, guarded with `#[cfg(test)]`.
- Do not add tests for derived traits (`Clone`, `Debug`, `PartialEq`) or enums unless they have explicit behavior beyond the derive.
- Test edge cases like empty inputs (e.g., empty iterators) to lock in defined behavior and prevent divide-by-zero or NaN results.

## Rayon and Parallelism

- Never use the global Rayon thread pool. Always execute parallel work within the provided `RayonThreadPool` or `RayonThreadPoolRef`.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This guidance is more for diskann-providers/diskann-disk than lower level crates. Lower level crates like diskann-quantization should use the dynamically scoped thread pool, but advertise this using the Parallelism enum.

- Preserve deadlock-avoidance intent when modifying nested parallel loops.
- Be aware that combining blocking synchronization (e.g., mutex acquisition) with Rayon work-stealing can cause deadlocks.

## Unsafe Code and Safety

- Every `unsafe` block must have a `// SAFETY:` comment directly above it explaining why the operation is sound.
- Safety comments must be specific and verifiable — state the concrete precondition that makes the operation safe (e.g., `// SAFETY: i + width <= len ensures this read is in-bounds`). Do not use vague justifications like `// SAFETY: this is safe`.
- Safety contracts on `unsafe fn` signatures must be internally consistent — if the documented precondition says `scratch.len() >= n`, ensure the implementation does not write beyond `n` elements (e.g., due to rounding up to a panel/block size).
- Prefer safe abstractions over raw `unsafe` when possible. Use `unsafe` only when there is a measurable performance benefit or when interfacing with FFI/intrinsics.
- For pointer arithmetic, prefer `offset_from` to express bounds rather than `wrapping_add` unless wrapping behavior is intentionally needed — and document why.
- When calling SIMD intrinsics or FFI, list the specific preconditions being satisfied (alignment, length, non-null, valid initialization).

## License Headers
- Each file has a license header.
```
/*
* Copyright (c) Microsoft Corporation.
* Licensed under the MIT license.
*/
```
11 changes: 0 additions & 11 deletions .github/instructions.md

This file was deleted.

182 changes: 182 additions & 0 deletions AGENTS.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,182 @@
# DiskANN Repository - Agent Onboarding Guide
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would explicitly follow next structure in this document:

  1. How code is organized (Crate Organization)
  2. How to contribute (aka How we write code) - and move all stuff from copilot-instructions to this section.
  3. How to validate code before submitting a PR + briefly mention that CI will perform extensive validation in .github/workflows/ci.yml.


**Last Updated**: 2026-05-04 (based on v0.50.1, Rust 1.92)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like a lot of previous commentary on testing is removed and not reproduced anywhere else. Is this purposeful? It seems like useful information that shouldn't be lost.

Same comment on code coverage. Should we add something on both topics to the "Always" list?



## Crate Organization

**Tier 1: Foundation**
- `diskann-wide/` - Low-level SIMD, bit manipulation, type width abstractions
- `diskann-vector/` - Vector primitives and operations
- `diskann-platform/` - Platform-specific utilities

**Tier 2: Core Libraries**
- `diskann-linalg/` - Linear algebra operations
- `diskann-utils/` - Shared utilities (Reborrow, MatrixView traits)
- `diskann-quantization/` - Vector quantization

**Tier 3: Algorithm & Providers**
- `diskann/` - Core indexing logic
- `diskann-providers/` - Hodge-podge of stuff, will be dismantled
- `diskann-disk/` - Disk-based provider for the index with io_uring support
- `diskann-label-filter/` - Inverted index for filtered search
- `diskann-garnet/` - Garnet (Redis-compatible) Provider and FFI endpoints for vector sets

**Tier 4: Infrastructure & Tools**
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Infrastructure" is a bit of a vague name here, it kind of seems like every tier contains something that could be called infrastructure. Maybe "Benchmarks and Tools" would be better?

- `diskann-benchmark-runner/` - Test runner infrastructure
- `diskann-benchmark-core/` - Benchmark framework
- `diskann-benchmark-simd/` - SIMD-specific benchmarks
- `diskann-benchmark/` - Benchmark definitions and runners
- `diskann-tools/` - CLI utilities (autotuner, etc.)
- `vectorset/` - Garnet client for benchmarking vector set workloads

---

### Internal Dependencies


- Tier 1 and Tier 2 crates may be added as dependencies of any internal crate
- `diskann` may be added as a dependency of any equal or higher tier internal crate except those below
- Do not add Tier 3 crates as dependencies of these Tier 4 crates:
- `diskann-benchmark-runner`
- `diskann-benchmark-core` (`diskann` is allowed)
- `diskann-benchmark-simd`

---

## Boundaries

### Never

- Modify files in `diskann/tests/generated/` by hand — these are auto-generated baselines. Regenerate with `DISKANN_TEST=overwrite`.
- Modify `rust-toolchain.toml`, `.github/workflows/`, or `.codecov.yml` without explicit approval.
- Use the global Rayon thread pool — use `RayonThreadPool`/`RayonThreadPoolRef` (enforced by `clippy.toml` disallowed methods).
- Use `rand::thread_rng` — use the project's `random.rs` utilities instead (enforced by `clippy.toml`).
- Use `vfs::PhysicalFS::new` or `VirtualStorageProvider::new_physical()` in tests — use `VirtualStorageProvider::new_overlay()`.
Comment on lines +49 to +55
- Remove or weaken existing tests without a strong, documented reason.
- Commit secrets, credentials, or API keys.

### Ask First

- Adding new workspace dependencies to `Cargo.toml` — justify the addition.
- Changing public API signatures in any `diskann-*` crate — requires SemVer analysis (may need a major version bump).
- Modifying tier dependency rules (e.g., adding a Tier 3 dependency to a Tier 4 benchmark crate).
- Changing `clippy.toml` or `rustfmt.toml` lint/formatting configuration.

### Always

- Include the MIT license header in every new source file.
- Run `cargo fmt --all` and `cargo clippy --workspace --all-targets -- -Dwarnings` before committing.
- Update doc comments when changing function signatures or removing parameters.
- Add a `// SAFETY:` comment above every `unsafe` block with specific, verifiable preconditions.

---




## Error Handling

There are three regimes of error handling and the strategy to use depends on the regime.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds like these categories may conform neatly to the tiers discussed above, although maybe there are exceptions. If so it could be good to state that explicitly.


### Low-Level

Low-level crates should use bespoke, precise, non-allocating error types.
Use `thiserror` for boilerplate. Chain with `std::error::Error::source`.
`diskann::ANNError` is not a suitable low-level error type.

```rust
// ✅ Good — bespoke error type with thiserror, uses #[from] for source chaining
#[derive(Debug, thiserror::Error)]
pub enum SgemmError {
#[error("dimension overflow: {expected_rows}×{expected_cols} exceeds usize")]
DimensionOverflow { expected_rows: usize, expected_cols: usize },

#[error(transparent)]
Allocation(#[from] AllocatorError),
}

// ❌ Bad — single crate-level enum, formats inner error in display string
#[derive(Debug, thiserror::Error)]
pub enum MyLibError {
#[error("sgemm failed: {0}")] // Don't format the inner error here
Sgemm(#[source] SgemmError),
#[error("io failed: {0}")]
Io(#[from] std::io::Error),
// ... 20 more variants — too broad
}
```

### Mid-Level (diskann algorithms)

Use `diskann::ANNError` and its context machinery. This type:

- Is 16 bytes with niche optimization for `Option<ANNError>`, allowing return in registers.
- Records stack trace of its first construction under `RUST_BACKTRACE=1`.
- Precisely records file and line of creation via `#[track_caller]`.
- Has context layering machinery to add additional information as an error is passed up the stack.
- Is backed internally by `anyhow::Error`.

When converting to `ANNError`, use `#[track_caller]` for better source reporting.
Prefer `ANNError::new(ANNErrorKind::…, e)` over the old `log_*`-style constructors, which force eager string formatting and double-log errors.

```rust
// ✅ Good — deferred formatting, precise kind, track_caller
#[track_caller]
fn process_vectors(data: &[f32]) -> Result<(), ANNError> {
validate(data).map_err(|e| ANNError::new(ANNErrorKind::IndexError, e))?;
Ok(())
}

// ❌ Bad — eager string formatting, double-logs on creation
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment from Mark about double-logging

fn process_vectors(data: &[f32]) -> Result<(), ANNError> {
validate(data).map_err(|e| ANNError::log_index_error(format!("validation failed: {e}")))?;
Ok(())
}
```

Traits with associated error types should consider constraining with `diskann::error::ToRanked` instead of `Into<ANNError>` if non-critical errors should be supported.
`ANNError` is the mid-level propagated error type; use `ToRanked` and `RankedError` to distinguish transient/recoverable failures from fatal ones.

### High Level (tooling)

At this level `anyhow::Error` is appropriate for binaries and CLI entry points.
Note that some tooling helpers still use `ANNError` for compatibility.

### Do Not

Do not use a single crate-level error enum. Problems:

- Provides no documentation on how an individual function could fail
- Encourages **worse** error messages than bespoke types
- Generates large structs that blow up the stack
- Branch-heavy `Drop` implementations which bloat code


## Document unsafe usage

```rust
// ✅ Good — specific, verifiable precondition
// SAFETY: `i + width <= len` ensures this read is in-bounds.
let val = unsafe { ptr.add(i).read() };

// ✅ Good — FFI with listed preconditions
// SAFETY: `a` and `b` are non-null, properly aligned, and do not alias.
// `m`, `n`, `k` match the actual matrix dimensions.
unsafe { ffi::sgemm(m, n, k, a.as_ptr(), b.as_ptr(), c.as_mut_ptr()) };

// ❌ Bad — vague, unverifiable
// SAFETY: this is safe
let val = unsafe { ptr.add(i).read() };
```


## CI Pipeline

CI workflow is defined in [`.github/workflows/ci.yml`](.github/workflows/ci.yml). Key jobs include:
- Format and clippy checks
- Tests on multiple platforms (Linux, Windows)
- [Code coverage](.codecov.yml)
- Architecture compatibility (SDE)

**Note**: CI uses `cargo-nextest` for running tests. See [`.cargo/nextest.toml`](.cargo/nextest.toml) for test configuration (timeouts, retries, etc.).
Loading
Loading