- Development testing: Use the `sparo` catalog (small monorepo, fast iteration)
- Final verification: Use the `rushstack` catalog (large monorepo, hours to crawl)
- Qdrant collection: Named `monodex` (not `rushstack`)
monodex is a semantic search indexer for Rush monorepos, using Qdrant vector database with local embeddings. It supports label-based semantic indexing where each label defines a queryable fileset (like a Git commit or branch head) within a catalog.
A label is a named, queryable fileset within a catalog. Examples:

- catalog = `rushstack`
- label = `main`
- label = `feature/login-flow`
A search is scoped by both catalog and label.
Catalog names are strict kebab-case identifiers:
`^[a-z0-9]+(?:-[a-z0-9]+)*$`
- Length 1–64 characters
- Lowercase ASCII alphanumeric words separated by single `-`
- No leading, trailing, or consecutive `-`
Examples:
- ✅ Valid: `my-repo`, `frontend`, `backend-api`
- ❌ Invalid: `My-Repo` (uppercase), `foo--bar` (consecutive `-`), `foo-` (trailing separator)
Label names are Git-like identifiers:
`^[a-z0-9]+(?:[./=-][a-z0-9]+)*$`
- Length 1–128 characters
- Lowercase ASCII alphanumeric words separated by `.`, `/`, `-`, or `=`
- No leading, trailing, or consecutive separators
Examples:
- ✅ Valid: `main`, `feature/login-flow`, `release/v1.2.3`, `branch=main`
- ❌ Invalid: `feature_login` (underscore), `FOO` (uppercase), `foo//bar` (consecutive separators)
Reserved characters: The characters `:`, `@`, `+`, `#`, whitespace, and control characters are forbidden in both catalog and label names. They are reserved for future grammar extensions (see below).
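These rules can be checked without a regex engine. The sketch below uses hypothetical helper names (not the actual Monodex implementation) and a single shared word/separator scan; reserved and other disallowed characters are rejected because they are neither lowercase alphanumerics nor separators:

```rust
/// Validate a catalog name: kebab-case words, 1–64 chars total.
fn is_valid_catalog(name: &str) -> bool {
    is_valid_identifier(name, &['-'], 64)
}

/// Validate a label name: words joined by '.', '/', '-', or '=', 1–128 chars.
fn is_valid_label(name: &str) -> bool {
    is_valid_identifier(name, &['.', '/', '=', '-'], 128)
}

/// Shared rule: lowercase ASCII alphanumeric words separated by single
/// separator characters; no leading, trailing, or consecutive separators.
/// Reserved characters (':', '@', '+', '#', whitespace, control) fall through
/// to the rejection branch because they are neither words nor separators.
fn is_valid_identifier(name: &str, separators: &[char], max_len: usize) -> bool {
    if name.is_empty() || name.len() > max_len {
        return false;
    }
    // Treat the start as "just after a separator" so a leading separator fails.
    let mut prev_was_sep = true;
    for c in name.chars() {
        if c.is_ascii_lowercase() || c.is_ascii_digit() {
            prev_was_sep = false;
        } else if separators.contains(&c) {
            if prev_was_sep {
                return false; // leading or consecutive separator
            }
            prev_was_sep = true;
        } else {
            return false; // uppercase, underscore, reserved, or other character
        }
    }
    !prev_was_sep // a trailing separator leaves prev_was_sep = true
}
```

This mirrors the two regexes above term for term: the word class is `[a-z0-9]`, and only the separator set and the length cap differ between catalogs and labels.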
The `=` character in labels: The `=` character is permitted but not interpreted. A user who types `--label branch=main` gets a label literally named `branch=main`. This permits users to adopt a `kind=payload` naming convention in their own automation and ensures compatibility when Monodex later adds native typed-label support.
Planned grammar extensions: The full identifier syntax (see issue #25) includes:
- Typed labels: `kind=payload` (e.g., `branch=main`, `commit=abc123`, `tag=v1.2.3`)
- Cross-catalog references: `@catalog:label`
- Path references: `label:path`, `@catalog:label:path`
- Additional reserved characters: `+` and `#`
None of this grammar is parsed today. The reserved characters are rejected now to prevent breaking changes when these features land.
The `label_id` is the fully qualified storage key `<catalog>:<label>`. This is an internal representation used only in:
- Qdrant payload fields (`active_label_ids`, `label_id`)
- UUID derivation for label-metadata points
- Internal log/debug output
Users never type or see the qualified form directly. The CLI accepts `--catalog` and `--label` as two separate flags.
Key principle:
- Chunks = immutable content (text, embeddings, metadata)
- Labels = mutable membership (which chunks belong to which queryable fileset)
When a label is refreshed to a new commit:
- Existing chunks remain (immutable)
- Membership (`active_label_ids`) is updated (mutable)
- Orphaned chunks (no labels) can be garbage collected
This separation allows efficient branch switching without re-embedding identical content.
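The refresh steps above can be sketched against an in-memory stand-in for the chunk store (illustrative only; the real implementation updates Qdrant payloads in place):

```rust
use std::collections::HashMap;

/// Minimal stand-in for a chunk's mutable membership payload.
struct Chunk {
    active_label_ids: Vec<String>,
}

/// Refresh `label_id` to point at a new fileset: chunks in `new_members`
/// gain the label, chunks that drop to zero labels are returned as
/// garbage-collection candidates. Chunk content is never touched.
fn refresh_label(
    chunks: &mut HashMap<String, Chunk>,
    label_id: &str,
    new_members: &[&str],
) -> Vec<String> {
    // Add membership for chunks in the new fileset (idempotent).
    for id in new_members {
        if let Some(chunk) = chunks.get_mut(*id) {
            if !chunk.active_label_ids.iter().any(|l| l == label_id) {
                chunk.active_label_ids.push(label_id.to_string());
            }
        }
    }
    // Remove membership from chunks no longer in the fileset; collect orphans.
    let mut orphans = Vec::new();
    for (id, chunk) in chunks.iter_mut() {
        if !new_members.contains(&id.as_str()) {
            chunk.active_label_ids.retain(|l| l != label_id);
            if chunk.active_label_ids.is_empty() {
                orphans.push(id.clone()); // eligible for garbage collection
            }
        }
    }
    orphans
}
```

Note that switching a label back and forth between two commits only toggles membership; no chunk is ever re-embedded.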
For Git-backed code catalogs, crawling reads from Git objects, not the working tree:
- Enumerate files from the commit tree (`git ls-tree`)
- Read file content from Git blobs (`git cat-file --batch`)
- Deterministic and reproducible
- Ignores uncommitted working tree changes
Identical file content may require re-crawl when contextual identity changes:
- `blob_id`: identity of raw Git file content (provenance, diagnostics)
- `chunk_id`: identity of the indexed semantic artifact (depends on context)
We optimize for switching between Git branches with overlap, NOT for generalized pattern matching or data compression. Path renames that affect breadcrumb context will create new chunks.
Qdrant is the only authoritative state store:
- Label metadata lives in Qdrant
- File completion state lives in Qdrant
- Label membership lives in Qdrant
No Git refs, JSON sidecars, or SQLite in this phase.
```rust
pub struct PointPayload {
    pub text: String,
    pub source_type: String, // "code"
    // Label membership
    pub catalog: String,
    pub label_id: String, // Transitional: the initiating label. Prefer active_label_ids.
    pub active_label_ids: Vec<String>, // All labels this chunk belongs to (authoritative)
    // Implementation identity
    pub embedder_id: String, // e.g., "jina-embeddings-v2-base-code:v1"
    pub chunker_id: String, // e.g., "typescript-partitioner:v1"
    // Provenance
    pub blob_id: String, // Git blob SHA
    pub content_hash: String, // Hash of chunk text
    // File identity
    pub file_id: String, // Semantic file identity (for grouping chunks)
    // Path context (for retrieval without Git)
    pub relative_path: String,
    pub package_name: String,
    pub source_uri: String, // Useful for locating in Git/GitHub, but NOT a key
    // Chunk metadata
    pub chunk_ordinal: usize, // 1-indexed position in file
    pub chunk_count: usize,
    pub start_line: usize,
    pub end_line: usize,
    // Semantic context
    pub symbol_name: Option<String>,
    pub chunk_type: String, // AST node type: function, class, method, etc.
    pub chunk_kind: String, // content, imports, changelog, config
    pub breadcrumb: Option<String>, // Human-readable: package:File.ts:Symbol
    // Sentinel for incremental crawl
    pub file_complete: bool, // Only true on chunk_ordinal=1
}
```

Field notes:
- `source_uri`: Best-effort display/debug locator for Git/GitHub links. Not guaranteed stable or canonical. Not a key.
- `chunk_ordinal`: Renamed from `chunk_number` for clarity. Always use `chunk_ordinal`.
- `file_id`: Semantic file identity for grouping chunks. Used for sentinel checks and file-level operations.
- `label_id`: Transitional field. Prefer `active_label_ids` for label membership queries.
Label metadata is stored as special points in the main Qdrant collection:
```rust
pub struct LabelMetadata {
    pub source_type: String, // "label-metadata"
    pub catalog: String,
    pub label_id: String, // e.g., "rushstack:main" (internal storage form)
    pub label: String, // e.g., "main" (bare label name)
    pub commit_oid: String, // Resolved commit SHA
    pub source_kind: String, // "git-commit"
    pub crawl_complete: bool,
    pub updated_at_unix_secs: u64,
}
```

Point ID: The `label_id` string is converted to a UUID via `string_to_uuid()` for Qdrant compatibility, allowing deterministic lookup. Both `upsert_label_metadata()` and `get_label_metadata()` use this same conversion.
Vector: Metadata points store a zero-vector of exactly 768 dimensions (matching the collection's vector size): [0.0; 768]. Qdrant requires vectors for all points, but these points are never used in similarity search. The dimension MUST match the collection's configured vector size to avoid insertion errors.
Why single collection: Using the main collection (rather than a separate metadata collection) avoids managing multiple Qdrant collections and keeps all state in one place. The tradeoff is mixing vector-bearing chunk records with metadata-only records. This is acceptable because:
- The `source_type` discriminator clearly separates them
- Metadata points are few (one per label) compared to millions of chunks
- Query code filters by `source_type` when needed
ID semantics note: Both chunks and labels use UUID-shaped strings derived deterministically from their content:
- Chunks: `string_to_uuid(format!("{}:{}", file_id, chunk_ordinal))`
- Labels: `string_to_uuid(label_id)`
This provides uniformity in point ID format while maintaining deterministic lookup for both types.
Point IDs for code chunks use deterministic hashes:
```rust
pub fn compute_file_id(
    embedder_id: &str,
    chunker_id: &str,
    blob_id: &str,
    relative_path: &str,
) -> String
```

The file ID represents a semantic version of a file. Individual chunks are identified by `(file_id, chunk_ordinal)`.
Point ID formula:
```rust
point_id = string_to_uuid(format!("{}:{}", file_id, chunk_ordinal))
```

This allows upsert-by-ID semantics: if the same file content at the same path is crawled under multiple labels, we update `active_label_ids` rather than creating duplicates.
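An illustrative sketch of the derivation chain follows. The hash here is `std`'s `DefaultHasher` and the UUID shaping is invented for demonstration; the real `compute_file_id` and `string_to_uuid` use their own stable hashing. What the sketch does show correctly is that all four inputs feed the file identity, and the point ID is derived from `(file_id, chunk_ordinal)`:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Illustrative stand-in for compute_file_id: all four inputs participate,
/// so changing any one of them yields a new file identity.
fn compute_file_id(
    embedder_id: &str,
    chunker_id: &str,
    blob_id: &str,
    relative_path: &str,
) -> String {
    let mut h = DefaultHasher::new();
    (embedder_id, chunker_id, blob_id, relative_path).hash(&mut h);
    format!("{:016x}", h.finish())
}

/// Illustrative stand-in for string_to_uuid: a deterministic, UUID-shaped
/// string derived from the input (the real implementation differs).
fn string_to_uuid(s: &str) -> String {
    let mut h1 = DefaultHasher::new();
    s.hash(&mut h1);
    let a = h1.finish();
    let mut h2 = DefaultHasher::new();
    (s, a).hash(&mut h2);
    let b = h2.finish();
    format!(
        "{:08x}-{:04x}-{:04x}-{:04x}-{:012x}",
        (a >> 32) as u32,
        (a >> 16) as u16,
        a as u16,
        (b >> 48) as u16,
        b & 0xffff_ffff_ffff
    )
}

/// Point ID for a chunk: derived from file identity plus ordinal.
fn chunk_point_id(file_id: &str, chunk_ordinal: usize) -> String {
    string_to_uuid(&format!("{}:{}", file_id, chunk_ordinal))
}
```

Because the derivation is deterministic, re-crawling the same content at the same path under a different label reproduces the same point IDs, and the upsert lands on the existing points.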
Important: Reuse only occurs when ALL of these match:
- Same content (`blob_id`)
- Same path (`relative_path`)
- Same implementation (`embedder_id`, `chunker_id`)
Path changes will produce new chunks even if content is identical. This is intentional: semantic context outweighs deduplication.
File Identity (computed once per file):
file_id = hash(embedder_id + chunker_id + blob_id + relative_path)
Chunk Identity (file_id + ordinal):
chunk identity = (file_id, chunk_ordinal)
This is a file-oriented model: the ID identifies a semantic version of a file, and chunk ordinal selects within it.
- Depends on implementation semantics (embedder_id, chunker_id)
- Depends on content (blob_id)
- Depends on path context (relative_path affects breadcrumb)
- Stable across sessions and machines
Explicitly stated: Path and breadcrumb context are semantically meaningful. This design:
- Does NOT optimize for reuse across path moves
- Does optimize for switching between Git branches with overlapping files
- Accepts that path renames will create new chunks
If a file moves from `libraries/foo/src/A.ts` to `libraries/bar/src/A.ts`, the breadcrumb changes from `@scope/foo:A.ts:Symbol` to `@scope/bar:A.ts:Symbol`. These are different semantic contexts, so producing different chunks is the correct behavior.
The `blob_id` remains useful for provenance, diagnostics, and future optimization opportunities, but it is not the full identity of the indexed artifact, because context matters.
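A breadcrumb of the `package:File.ts:Symbol` shape can be assembled from payload fields. This is a hypothetical helper mirroring the format described above, not the verified source:

```rust
/// Build a human-readable breadcrumb like "@scope/foo:A.ts:Symbol" from
/// chunk payload fields. The file name is the last path segment; chunks
/// without a symbol (e.g., imports) omit the final component.
fn build_breadcrumb(
    package_name: &str,
    relative_path: &str,
    symbol_name: Option<&str>,
) -> String {
    let file_name = relative_path.rsplit('/').next().unwrap_or(relative_path);
    match symbol_name {
        Some(symbol) => format!("{}:{}:{}", package_name, file_name, symbol),
        None => format!("{}:{}", package_name, file_name),
    }
}
```

Since `relative_path` and `package_name` both feed this string, the path-move example above necessarily produces a different breadcrumb, and therefore different chunk context.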
The sentinel is chunk 1 of a file:
- Point ID = hash of (file_id, chunk_ordinal=1)
- `file_complete = true` only on chunk 1
- Existence check = direct lookup of sentinel point ID
A file is considered fully indexed when:
- Sentinel exists (chunk_ordinal = 1)
- Sentinel has `file_complete = true`
- `chunk_count` on sentinel indicates total chunks
This preserves resumable crawl semantics.
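The completeness check can be sketched against an in-memory stand-in for the collection; the real code issues a Qdrant point lookup by the sentinel's deterministic UUID, and the key format below is a simplification:

```rust
use std::collections::HashMap;

/// Minimal stand-in for a stored chunk payload.
struct StoredChunk {
    file_complete: bool,
    chunk_count: usize,
}

/// A file is fully indexed iff its sentinel (chunk_ordinal = 1) exists and
/// is marked complete. On success, returns the expected total chunk count
/// recorded on the sentinel; otherwise the file must be (re-)crawled.
fn file_is_complete(
    points: &HashMap<String, StoredChunk>,
    file_id: &str,
) -> Option<usize> {
    // Stand-in for string_to_uuid(format!("{}:{}", file_id, 1)).
    let sentinel_key = format!("{}:{}", file_id, 1);
    match points.get(&sentinel_key) {
        Some(s) if s.file_complete => Some(s.chunk_count),
        _ => None, // missing or incomplete: re-crawl this file
    }
}
```

An interrupted crawl leaves later chunks present but the sentinel unmarked, so the resumed crawl re-processes exactly the unfinished files.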
Source-defined identifiers for the embedder and chunker:
```rust
const EMBEDDER_ID: &str = "jina-embeddings-v2-base-code:v1";
const CHUNKER_ID: &str = "typescript-partitioner:v1";
```

If behavior changes in a way that should invalidate reuse, the constant changes. These are not user-authored config values.
- Resolve `--commit` to a full 40-character commit SHA (e.g., using `git rev-parse`)
- Compute `label_id = <catalog>:<label>`
- Upsert label metadata with `crawl_complete = false` (in-progress state)
  - This marks the crawl as in-progress before any work begins
- Run `git ls-tree -r -z <commit>` to enumerate all files
- Filter for files that pass the catalog's path filtering rules
- For each file, obtain `blob_id` and `relative_path`
Build a fast package lookup for the commit:
```rust
pub struct PackageIndex {
    pub package_name_by_dir: HashMap<String, String>,
}
```

See "Git Package Index" section for implementation details.
For each file:
- Resolve `package_name` using the package index
- Read content from Git blob
- Chunk content using the implementation identified by `chunker_id`
- Compute chunk payloads with path/package/breadcrumb context
- Derive chunk identity using `embedder_id`, `chunker_id`, `blob_id`, `relative_path`, `chunk_ordinal`
For each file:
- Compute `file_id`
- Look up the sentinel point by (file_id, chunk_ordinal=1):
  - Point ID = `string_to_uuid(format!("{}:{}", file_id, 1))`
  - Query Qdrant for the chunk with `file_id` AND `chunk_ordinal = 1` AND `source_type = "code"`
- If the sentinel exists and `file_complete = true`:
  - Skip re-embedding
  - Retrieve all chunks for the file by filtering on `file_id` (with `source_type = "code"`)
  - Add the label to `active_label_ids` for each chunk (if not present)
- If the sentinel does not exist or is not complete:
  - Read content from Git blob
  - Chunk and embed all chunks
  - Compute point ID for each chunk: `hash(file_id + chunk_ordinal)`
  - Upsert all chunks
  - Mark sentinel `file_complete = true`
  - Add the label to `active_label_ids` for each chunk
Critical: This step runs ONLY after a fully successful crawl completion. Partial crawls must NOT trigger reassignment.
- Track all file IDs touched during the crawl (in a HashSet)
- Scan all chunks where `active_label_ids` contains the label
  - Filter: `source_type = "code"` (exclude metadata points)
- For each chunk:
  - Extract the `file_id` field from the payload
  - If the file_id is NOT in the touched set:
    - Remove the label from `active_label_ids`
    - If `active_label_ids` becomes empty, delete the chunk
Failure behavior: If the crawl is interrupted or fails:
- Do NOT run reassignment
- Label may temporarily have stale chunks (acceptable)
- Next successful crawl will clean up
When crawl completes successfully:
- Mark `crawl_complete = true`
- Store the resolved commit OID
- Store the update timestamp
Given a Git commit, efficiently build a mapping from directory paths to package names:
```rust
HashMap<String, String>
// "libraries/node-core-library" -> "@rushstack/node-core-library"
```

- Enumerate all `package.json` entries with `git ls-tree -r -z`
- Batch-read blob contents with `git cat-file --batch`
- Parse JSON to extract the `"name"` field
```rust
pub fn build_package_index_for_commit(
    repo_root: &std::path::Path,
    commit: &str,
) -> anyhow::Result<PackageIndex>
```

Key details:

- Keys are repo-relative directory paths (e.g., `"libraries/node-core-library"`)
- This ensures portability and independence from filesystem location
- For the repo-root package.json, the key is the empty string `""`
Git protocol:

- `git ls-tree -r -z <commit>` returns NUL-delimited entries: `<mode> <type> <object_id>\t<path>\0`
- `git cat-file --batch` returns, for each blob: `<oid> <type> <size>\n<raw bytes>\n`
For a file path like `libraries/node-core-library/src/JsonFile.ts`, check directories in order:

1. `libraries/node-core-library/src`
2. `libraries/node-core-library` (match found here)
3. `libraries`
4. `""` (repo root)
Return the first match. This reproduces the rule that the nearest ancestor package.json governs the file.
```rust
pub fn find_package_name_from_index(
    relative_path: &str,
    package_index: &PackageIndex,
) -> Option<&str>
```

- One full tree enumeration
- One long-lived `git cat-file --batch` process
- No per-file `git show`
- No filesystem traversal
The `use` command sets a default catalog and label context to avoid repeating flags:
```shell
monodex use --catalog rushstack --label main
```

After running this, subsequent commands use the default context:
```shell
# Instead of:
monodex search --catalog rushstack --label main --text "query"

# You can run:
monodex search --text "query"
```

Default context storage: Stored in `~/.config/monodex/context.json`:
```json
{
  "catalog": "rushstack",
  "label": "main"
}
```

Priority: Explicit `--catalog` / `--label` flags override the default context.
```shell
monodex search --catalog rushstack --label main --text "how does package lookup work?"
```

Qdrant filter:

```
source_type == "code"
AND catalog == "rushstack"
AND active_label_ids CONTAINS "rushstack:main"
```
Important: All search queries must filter `source_type = "code"` to exclude label metadata points from results.
```shell
monodex view --id <file_id>[:<selector>]
```

Selector syntax:

- `:N` — single chunk N
- `:N-M` — chunks N through M
- `:N-end` — chunk N through last
Chunks are filtered by `active_label_ids` and sorted by `chunk_ordinal`.

File reconstruction: To reconstruct an entire file, view all chunks using the `file_id` without a selector; ordering by `chunk_ordinal` yields the original file content.
Filtering: View queries must filter `source_type = "code"` to exclude label metadata points.
Note: Path-based view (querying by `--path` instead of `--id`) is intentionally deferred to a later phase. The primary workflow is search → view using file IDs from search results.
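The selector grammar can be parsed with a small helper. This is a hypothetical sketch (the real CLI parser may differ); an open upper bound (`end`) is represented as `None`:

```rust
/// Parsed view selector: an inclusive range of 1-indexed chunk ordinals,
/// where an open upper bound (None) means "through the last chunk".
#[derive(Debug, PartialEq)]
struct Selector {
    start: usize,
    end: Option<usize>,
}

/// Parse "N", "N-M", or "N-end" (the leading ':' already stripped).
fn parse_selector(s: &str) -> Option<Selector> {
    match s.split_once('-') {
        None => {
            // ":N" selects a single chunk.
            let n: usize = s.parse().ok()?;
            Some(Selector { start: n, end: Some(n) })
        }
        Some((n, "end")) => Some(Selector { start: n.parse().ok()?, end: None }),
        Some((n, m)) => {
            let (start, end): (usize, usize) = (n.parse().ok()?, m.parse().ok()?);
            if start <= end {
                Some(Selector { start, end: Some(end) })
            } else {
                None // reject inverted ranges like ":5-2"
            }
        }
    }
}
```

The resolved range is then applied after filtering by `active_label_ids` and sorting by `chunk_ordinal`.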
```shell
monodex crawl --catalog rushstack --label main --commit HEAD
```

| Command | Purpose |
|---|---|
| `use` | Set default catalog/label context |
| `crawl` | Index a commit into a label |
| `search` | Semantic search within a label |
| `view` | View chunks by file ID |
All commands respect the default context set by `use`, but explicit flags override defaults.
| Property | Value |
|---|---|
| Max tokens | 8192 |
| Dimensions | 768 |
| Model size | ~612MB (FP32 ONNX) |
| License | Apache 2.0 |
| Trained on | Code + documentation |
Divide a file into chunks that fit the embedding budget (6000 chars), splitting only at meaningful AST boundaries.
Chunk Land (sizing/selection):
- File as sequence of line ranges
- Measures size, knows budget
- Simple bookkeeping
AST Land (structure/meaning):
- Walks syntax tree
- Provides candidate split points at semantic boundaries
- No opinions about sizes
- Start with one chunk = entire file
- While any chunk exceeds budget:
- Find meaningful split points from AST
- Split at the point that best balances sizes
- Done
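The loop above can be sketched with line ranges and a list of candidate split lines. This is illustrative: the real chunker sizes by characters (6000-char budget) and pulls candidates from the AST, while the sketch works in lines and takes candidates as a plain slice:

```rust
/// A chunk is a half-open range of line indices [start, end).
type Range = (usize, usize);

/// Split (0, total_lines) until every chunk fits `budget` lines, choosing
/// the candidate split point closest to the midpoint of each oversized chunk.
fn split_to_budget(total_lines: usize, candidates: &[usize], budget: usize) -> Vec<Range> {
    let mut chunks: Vec<Range> = vec![(0, total_lines)];
    loop {
        // Find an oversized chunk, if any.
        let Some(pos) = chunks.iter().position(|&(s, e)| e - s > budget) else {
            return chunks; // everything fits the budget
        };
        let (start, end) = chunks[pos];
        let mid = start + (end - start) / 2;
        // Best candidate strictly inside the chunk, nearest the midpoint:
        // this is the "best balances sizes" choice from Chunk Land, using
        // split points supplied by AST Land.
        let best = candidates
            .iter()
            .copied()
            .filter(|&c| c > start && c < end)
            .min_by_key(|&c| c.abs_diff(mid));
        match best {
            Some(split) => {
                // Replace the oversized chunk with its two halves.
                chunks.remove(pos);
                chunks.insert(pos, (split, end));
                chunks.insert(pos, (start, split));
            }
            // No AST split available inside this chunk: this is where the
            // [fallback-split] line-based recovery would take over.
            None => return chunks,
        }
    }
}
```

Each split strictly shrinks the oversized chunk, so the loop always terminates: either every chunk fits the budget or no candidate remains.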
- No marker: Good AST split
- `:[degraded-ast-split]`: AST split with poor geometry (tiny chunks)
- `:[fallback-split]`: No AST split found, used line-based recovery (failure mode)
The `try_for_each(...)?` pattern exits early without flushing remaining chunks. A cleanup wrapper is needed to ensure that:

- `stop_flag` is set
- Channels are closed
- The uploader thread joins
- Remaining chunks are flushed
`upload_batch()` errors are only logged, not propagated. Retry or abort logic is needed.
The crawl accumulates points for 60 seconds with no size limit. Batch size constraints are needed.
For changed files, existing chunks are deleted before re-indexing. If chunking, embedding, or upload fails, the file is permanently missing. Consider a "replace after success" pattern.
Files missing chunk 1 are invisible to the catalog view, but their orphaned chunks remain. Garbage collection is needed.
Inline cleanup (during crawl):
- Label reassignment removes stale label membership
- Chunks with empty `active_label_ids` are deleted
- Runs automatically after a successful crawl
Offline GC (separate command, future): `monodex gc --catalog rushstack`

- Scan for chunks with empty `active_label_ids`
- Delete orphaned chunks
- Report storage recovered
- Useful for cleanup after interrupted crawls or manual operations
| Metric | Estimate |
|---|---|
| Files per catalog | ~200,000 |
| Chunks per catalog | ~600,000 |
| Chunks per file | 1-20 (avg 3) |
| Embedding time | ~12ms per chunk (parallel) |
| Full crawl time | ~15-30 minutes |
`~/.config/monodex/config.json`:
```json
{
  "qdrant": {
    "url": "http://localhost:6333",
    "collection": "monodex"
  },
  "catalogs": {
    "sparo": {
      "type": "monorepo",
      "path": "/path/to/sparo"
    },
    "rushstack": {
      "type": "monorepo",
      "path": "/path/to/rushstack"
    }
  }
}
```

Note: Use `sparo` for development testing. `rushstack` is for final verification only.
File extension: All config files use the `.json` extension (not `.jsonc`) per Rush Stack conventions.
- `catalog` is a user-defined name in config
- For Git operations, the `path` field points to the repository root
- Future: `repo_id` could be derived from the Git remote for cross-machine identity
Working directory crawling indexes uncommitted changes from the filesystem rather than Git objects. This is useful for:
- Indexing work-in-progress before committing
- Comparing uncommitted changes with committed code
- AI assistants that need to understand the current state of the codebase
Working directory files use a different identity model than commit-based files:
| Property | Commit-Based | Working Directory |
|---|---|---|
| `blob_id` | Git blob SHA | `sha256:<hash>` (content hash) |
| `commit_oid` | Resolved commit SHA | `""` (empty string) |
| `source_kind` | `"git-commit"` | `"working-directory"` |
Key insight: The `file_id` is computed from `(embedder_id, chunker_id, blob_id, relative_path)`. For working directory files, the "blob_id" is actually a content hash. This means:
- Same content at same path → same `file_id` (can share chunks)
- Different content at same path → different `file_id` (new chunks)
- Same content at different path → different `file_id` (breadcrumb context matters)
```rust
LabelMetadata {
    source_kind: "working-directory".to_string(),
    commit_oid: "".to_string(), // No commit
    crawl_complete: true,
    // ... other fields
}
```

Working directory labels are mutable:
- Re-crawling updates indexed content based on current filesystem state
- Content hash changes trigger new chunks
- Label reassignment removes stale chunks
Commit-based labels are immutable (for a given commit):
- Re-crawling the same commit is idempotent
- Same commit always produces same chunks
```shell
# Index working directory
monodex crawl --catalog rushstack --label working --working-dir

# Search working directory content
monodex search --catalog rushstack --label working --text "uncommitted feature"

# Compare with committed code
monodex search --catalog rushstack --label main --text "same query"
monodex search --catalog rushstack --label working --text "same query"
```

Crawl policy (file types, exclusions, overrides) is externalized from Rust code into a JSON config file. This enables:
- Per-repo customization without code changes
- Easy sharing of configs between repos or teams
- Deterministic, debuggable behavior
File: monodex-crawl.json (JSON format, .json extension per Rush Stack conventions)
```json
{
  "version": 1,
  "fileTypes": {
    ".ts": "typescript",
    ".tsx": "typescript",
    ".md": "markdown",
    ".yaml": "lineBased"
  },
  "patternsToExclude": [
    "node_modules/",
    "dist/",
    "build/",
    "lib/",
    "*.snap",
    "*.test.ts",
    "*.spec.ts",
    "package-lock.json",
    "pnpm-lock.yaml",
    "yarn.lock"
  ],
  "patternsToKeep": [
    "src/",
    "test/"
  ]
}
```

| Field | Required | Description |
|---|---|---|
| `version` | Yes | Config schema version (must be 1) |
| `fileTypes` | Yes | Map of file suffix → chunking strategy |
| `patternsToExclude` | Yes | Array of glob patterns for paths to skip |
| `patternsToKeep` | Yes | Array of glob patterns that override exclusion |
```
shouldCrawl = matchesFileType
           && (matchesPatternsToKeep || !matchesPatternsToExclude)
```

Key properties:

- `fileTypes` is the primary filter (allowlist)
- `patternsToKeep` only overrides exclusion; it does NOT force unsupported file types
- No multi-layer include/exclude semantics (single tier only)
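That decision can be sketched with simple prefix/suffix matching standing in for `globset` (illustrative only; the real implementation compiles glob sets, and the `matches` closure here handles just directory-prefix patterns like `src/` and `*`-suffix patterns like `*.test.ts`):

```rust
/// Decide whether a repo-relative path is crawled, given the three config
/// fields. A "pattern" matches if the path starts with it (directory
/// patterns) or ends with its suffix after stripping a leading '*'.
fn should_crawl(
    path: &str,
    file_type_suffixes: &[&str],
    patterns_to_exclude: &[&str],
    patterns_to_keep: &[&str],
) -> bool {
    let matches = |pattern: &str| {
        path.starts_with(pattern) || path.ends_with(pattern.trim_start_matches('*'))
    };
    let matches_file_type = file_type_suffixes.iter().any(|s| path.ends_with(*s));
    let matches_keep = patterns_to_keep.iter().any(|p| matches(p));
    let matches_exclude = patterns_to_exclude.iter().any(|p| matches(p));
    // fileTypes is the allowlist; patternsToKeep only overrides exclusion.
    matches_file_type && (matches_keep || !matches_exclude)
}
```

Note that a path matching `patternsToKeep` but not `fileTypes` is still skipped, since keep never forces an unsupported file type.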
Valid strategy names (from `src/engine/config.rs`):

| Strategy | File Types | Description |
|---|---|---|
| `typescript` | `.ts`, `.tsx` | AST-based semantic chunking |
| `markdown` | `.md` | Heading-based chunking |
| `lineBased` | `.txt`, `.css`, `.scss`, `.yml`, `.yaml` | Generic line-based chunking |
- Patterns use Rust glob semantics via the `globset` crate
- Matching is against repo-relative paths (not absolute)
- The path separator is `/`
- Paths must be normalized before matching
- Matching is case-sensitive (v1)
- Invalid patterns → config validation error
Exactly one config is used. No merging. Precedence:
1. Repo-local config: `<repo-root>/monodex-crawl.json`
2. User-global config: `~/.config/monodex/crawl.json`
3. Built-in default: Embedded in the binary (same JSON format)
Strict validation (no silent fallback):
- Required fields must be present
- Unknown fields → error
- Incorrect types → error
- Unsupported `version` → error
- Unknown strategy names → error
- Invalid glob patterns → error
The same crawl config applies to both:
- Commit-based crawling (`--commit`)
- Working directory crawling (`--working-dir`)
The working directory is treated as a "degenerate commit": the same filtering rules apply.
Given config:

```json
{
  "fileTypes": { ".ts": "typescript" },
  "patternsToExclude": ["*.test.ts"],
  "patternsToKeep": ["src/"]
}
```

| Path | Result | Reason |
|---|---|---|
| `src/utils.test.ts` | Crawled | Matches patternsToKeep (overrides exclude) |
| `lib/utils.test.ts` | Skipped | Matches patternsToExclude, no keep override |
| `src/utils.ts` | Crawled | No exclusion match |
| `lib/utils.ts` | Crawled | No exclusion match |
- GitHub Issues
- Zulip Discussions
- Meeting Notes
Long-term architecture: Qdrant + SQLite, but not required for this phase.
Query chunks by path instead of file ID:
```shell
monodex view --catalog rushstack --label main --path libraries/node-core-library/src/JsonFile.ts
```

Deferred until use cases are clearer. The primary workflow is search → view.