Commit 3fcdf54 (parent 4b50c23)

docs: add new codebase article

9 files changed: 1840 additions & 2 deletions
README.md

Lines changed: 1 addition & 1 deletion

@@ -196,7 +196,7 @@ We carefully select and integrate the following excellent open source components
 ### 🛠️ Practice Projects

 - [Implementing Deep Search](doc/en/tutorial/function-call-deep-search.md)
-- [Codebase Feature Technical Deep Analysis](doc/en/tutorial/codebase-technical-deep-dive.md)
+- [Codebase Feature Technical Deep Analysis](doc/en/tutorial/codebase-implementation.md)
 - Custom Embedding Model Integration
 - Extending MCP Protocol Support
 - Implementing Custom Search Engines
README_ZH.md

Lines changed: 1 addition & 1 deletion

@@ -199,7 +199,7 @@ VoidMuse is committed to practicing, through **open-source component integration**, at **minimal development cost**…
 ### 🛠️ Practice Projects

 - [Implementing Deep Search](doc/tutorial/function%20call的实践-实现深度搜索.md)
-- [Codebase Feature Implementation](doc/tutorial/codebase功能的技术深度解析.md)
+- [Codebase Feature Implementation](doc/zh/tutorial/codebase实现.md)
 - Custom Embedding Model Integration
 - Extending MCP Protocol Support
 - Implementing Custom Search Engines
New file

Lines changed: 108 additions & 0 deletions
## 1. Preface

Why build this open-source project?

AI has progressed rapidly over the past two years, bringing concepts such as embeddings, RAG, and function calling. We were curious about their real value in an actual project and how to apply them in practice. The best way to learn is by doing, so we created the VoidMuse project to implement an AI IDE coding plugin: similar in spirit to Cursor but limited to plugin form, and closer to the open-source Cline plugin, giving us a place to practice the ideas we want to learn.

Why not extend Cline directly? Because Cline's code is complex, and the cost of experimenting on top of it is high. Starting from scratch with a simpler project is enough to achieve our goals.

This article explains our real implementation of the codebase feature: calling the codebase from the chat page, retrieving relevant code, and injecting it into the LLM context. It covers many optimization details, such as pitfalls in choosing an embedding model and the limitations of pure vector search, and shows how hybrid search improves retrieval accuracy. After reading, you should have a deeper understanding of how a codebase feature is actually implemented.

![Codebase usage in IntelliJ](../../img/tutorial/codebase/codebase效果.png)
## 2. Full Pipeline

![Full pipeline screenshot](../../img/tutorial/codebase/codebase-流程.png)

The codebase feature has two key parts:

1. Build the codebase index: vectorize all code files. This involves file selection strategies, chunking, and embeddings.
2. Search the codebase index: perform vector search plus text search. We adopt hybrid search to improve accuracy and use prompt weighting to boost keyword importance.

Below are the detailed implementation steps for both parts.
## 3. Building the Codebase Index

![Indexing flow screenshot](../../img/tutorial/codebase/codebase-建立索引流程.png)

### 3.1 Remove Files

A repository contains many files that are useless for code search: system hidden files, dependencies (e.g., `node_modules`), and generated sources. For example, Java projects using protobuf can produce huge generated Java files. We remove all of these in advance.

Filtering strategy:

1. `.gitignore`: most repositories have a `.gitignore` listing files and directories to ignore. We drop everything it matches; most dependency files are excluded at this stage.
2. Exclude protobuf-generated files: these generated Java files can be huge and are unhelpful for search. Our project uses protobuf heavily, so we remove them explicitly.
3. Exclude empty and oversized files: we skip files larger than `1MB`.

Reference code: [`CheckAutoIndexingTask.startCheckAll`](../../../extensions/intellij/src/main/java/com/voidmuse/idea/plugin/codebase/CheckAutoIndexingTask.java)
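As a rough illustration, the three filtering rules above can be expressed as a pure predicate over a repo-relative path and a file size. This is a hypothetical sketch, not VoidMuse's actual code: the class name, the generated-file heuristic, and the hard-coded directory check stand in for real `.gitignore` evaluation.

```java
import java.nio.file.Path;

// Hypothetical sketch of the indexing filter described above.
public class IndexFileFilter {
    private static final long MAX_SIZE_BYTES = 1024 * 1024; // 1MB cap

    // Pure check over a repo-relative path and the file's size in bytes.
    public static boolean shouldIndex(Path relativePath, long sizeBytes) {
        for (Path part : relativePath) {
            String p = part.toString();
            // 1. Hidden files/dirs and dependency dirs. A real implementation
            //    would evaluate the repo's .gitignore patterns here instead.
            if (p.startsWith(".") || p.equals("node_modules")) return false;
        }
        // 2. Protobuf-generated sources, matched by a filename heuristic.
        String name = relativePath.getFileName().toString();
        if (name.endsWith("OuterClass.java")) return false;
        // 3. Empty and oversized files.
        return sizeBytes > 0 && sizeBytes <= MAX_SIZE_BYTES;
    }
}
```

In the real pipeline this check runs once per file during the index build, before any chunking or embedding work is spent on the file.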
### 3.2 Code Chunking

Files typically contain multiple functions and vary widely in length; some have thousands of lines. Embedding models have context length limits, so we must chunk files and vectorize each chunk separately.

We considered two chunking strategies (inspired by the `continue` plugin):

1. Structure-aware chunking via AST (functions/classes): turn code into structured units that carry richer features and preserve complete function boundaries.
2. Character- or line-based chunking: the `continue` plugin's fallback method.

In `VoidMuse`, we implemented the simplest line-based approach: split files into randomly sized chunks of 35–65 lines. This has an obvious drawback: a complete function may be split across multiple chunks.

A note on embedding model pitfalls: we initially wanted Chinese support and chose the open-source `bge-large-zh-v1.5` model, but its context limit is 512 tokens. After chunking, it is easy to exceed this limit, which hurts embedding accuracy. We later switched to `Qwen3-Embedding-0.6B`, which runs locally, has a 32,000-token context window, and a 1536-dimensional embedding, enabling more expressive and accurate representations.
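The line-based strategy above fits in a few lines of Java. This is an illustrative sketch (the class name and a seeded `Random`, used here only for reproducibility, are our own choices, not VoidMuse's actual code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of line-based chunking: split a file into randomly sized
// chunks of 35-65 lines each.
public class LineChunker {
    private static final int MIN_LINES = 35;
    private static final int MAX_LINES = 65;

    public static List<String> chunk(List<String> lines, long seed) {
        Random random = new Random(seed); // seeded so the sketch is reproducible
        List<String> chunks = new ArrayList<>();
        int start = 0;
        while (start < lines.size()) {
            // Pick a random chunk size in [35, 65]; the last chunk may be shorter.
            int size = MIN_LINES + random.nextInt(MAX_LINES - MIN_LINES + 1);
            int end = Math.min(start + size, lines.size());
            chunks.add(String.join("\n", lines.subList(start, end)));
            start = end;
        }
        return chunks;
    }
}
```

The random size keeps chunk boundaries from always landing at the same offsets, but it does nothing to respect function boundaries, which is exactly the drawback noted above.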
### 3.3 Embedding and Storing in Lucene

Each chunk is embedded separately and stored in a Lucene index. Each index entry contains the chunk text, its vector, the file path, line numbers, and so on.

We chose Lucene because it is a Java-based full-text search engine with strong performance and good Chinese support. We call Lucene's APIs directly for retrieval and leverage its combined text and vector search capabilities to improve accuracy.

Reference code: [`LuceneVectorStore.java`](../../../extensions/intellij/src/main/java/com/voidmuse/idea/plugin/codebase/vector/LuceneVectorStore.java)
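To make the stored fields concrete, here is a hypothetical in-memory stand-in for the index: a record with the fields each Lucene entry holds, plus a brute-force cosine-similarity lookup in place of Lucene's vector search. `ChunkStore` and its API are illustrative only; the real `LuceneVectorStore` delegates all of this to Lucene.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// In-memory stand-in for the Lucene index described above (illustrative).
public class ChunkStore {
    // The fields stored per chunk: text, embedding vector, file path, line range.
    public record Chunk(String text, float[] vector, String filePath,
                        int startLine, int endLine) {}

    private final List<Chunk> chunks = new ArrayList<>();

    public void add(Chunk chunk) { chunks.add(chunk); }

    // Cosine similarity between the query vector and a stored chunk vector.
    static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Brute-force nearest neighbors; Lucene replaces this with an ANN index.
    public List<Chunk> vectorSearch(float[] query, int topK) {
        return chunks.stream()
                .sorted(Comparator.comparingDouble((Chunk c) -> -cosine(query, c.vector())))
                .limit(topK)
                .toList();
    }
}
```

Keeping the file path and line range alongside each vector is what lets search results be mapped straight back to clickable code locations in the IDE.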
## 4. Searching the Codebase Index

![Search flow screenshot](../../img/tutorial/codebase/codebase-检索流程.png)

### 4.1 Optimize the Query

In testing, we found that embedding queries directly did not work very well. Before embedding, we rewrite each query into a form better suited to vector search. One trick is repeating key terms three times to boost their weight, a technique borrowed from long prompts, where repeating an important rule increases its influence.

For example, if the user's query is "Find information related to order number OrderID generation", we optimize it to:
`OrderID OrderID OrderID order number generation order number generation rules order number generation logic order number generation docs`, repeating `OrderID` three times to boost its emphasis.

Prompt for optimization: [`codebaseOptimizePrompt.txt`](../../../gui/src/config/prompts/codebaseOptimizePrompt.txt)
Optimization pipeline in the GUI: [`buildWithCodebaseContext` in `IDEService.ts`](../../../gui/src/api/IDEService.ts)
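The repetition trick above can be sketched as a small helper. Note the assumptions: in VoidMuse the key terms and expansions are produced by an LLM via `codebaseOptimizePrompt.txt`, whereas this hypothetical `QueryBooster` takes them as ready-made inputs.

```java
import java.util.List;

// Illustrative sketch of keyword boosting: repeat each key term three times,
// then append rephrasings of the query intent.
public class QueryBooster {
    public static String boost(List<String> keyTerms, List<String> expansions) {
        StringBuilder sb = new StringBuilder();
        for (String term : keyTerms) {
            for (int i = 0; i < 3; i++) {          // repeat each key term 3 times
                sb.append(term).append(' ');
            }
        }
        for (String e : expansions) {              // rephrased intents add recall
            sb.append(e).append(' ');
        }
        return sb.toString().trim();
    }
}
```

The resulting string, not the user's raw query, is what gets embedded in the next step.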
### 4.2 Embedding the Query

This step simply embeds the optimized query; the details are the same as during indexing.
### 4.3 Hybrid Search (the Key Step)

We initially tried vector-only search, but it has limitations:

1. Semantic understanding limits of embeddings
   - Semantic generalization: embedding models capture semantic information, but code search requires high precision. Models may generalize class or method names, hurting exact matches.
   - Loss of symbol precision: embedding models may blur the exactness of symbols (class and method names), especially when those symbols carry specific meaning. Even when a query names a symbol explicitly, the model may miss it.

2. Context dependency of embeddings
   - Context bias: when queries contain many details (class and method names), embeddings may over-rely on those details and miss the broader structure or functionality.
   - Overfitting: overly detailed queries can cause embeddings to fixate on narrow details and miss broader relevance, producing overly narrow results.

3. The structured nature of code
   - Importance of symbol search: symbols in a codebase follow structured naming conventions. Pure vector search does not exploit this structure, reducing effectiveness.
   - Necessity of hybrid search: combining symbol/text search with embedding search gives exact matches from the former and semantic matches from the latter, balancing precision and semantic understanding.

In short, code search should use hybrid search: symbol/text search plus vector search.

Fortunately, Lucene supports both text and vector search out of the box, and we leverage this directly.

Concrete code: [`LuceneVectorStore.hybridSearch`](../../../extensions/intellij/src/main/java/com/voidmuse/idea/plugin/codebase/vector/LuceneVectorStore.java)

Key features:

- Search both text and vectors. For example, the query "What is the concrete implementation of twoStageHybridSearch" retrieves chunks containing the literal token `twoStageHybridSearch` as well as chunks semantically similar to the embedded sentence, then combines the scores.
- Expand the candidate set: increase recall by fetching a larger candidate pool. If you expect 10 results, fetch 30 candidates first, then rerank.
- Rerank: compute a mixed score for each retrieved chunk and keep the top-k.
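A minimal sketch of the rerank step described above: take the (already enlarged) candidate pools from the text search and the vector search, make their scores comparable via min-max normalization, combine them with a weight, and keep the top-k. `HybridRanker`, the weight parameter, and the normalization choice are our own illustrative assumptions; the real logic lives in `LuceneVectorStore.hybridSearch`.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative hybrid rerank: combine normalized text and vector scores.
public class HybridRanker {

    // Min-max normalize scores to [0, 1] so text and vector scores are comparable.
    static Map<String, Double> normalize(Map<String, Double> scores) {
        double max = scores.values().stream().mapToDouble(d -> d).max().orElse(1);
        double min = scores.values().stream().mapToDouble(d -> d).min().orElse(0);
        double range = Math.max(max - min, 1e-9);
        Map<String, Double> out = new HashMap<>();
        scores.forEach((id, s) -> out.put(id, (s - min) / range));
        return out;
    }

    // Union both candidate pools, score each chunk id, and keep the top-k.
    public static List<String> rerank(Map<String, Double> textScores,
                                      Map<String, Double> vectorScores,
                                      double textWeight, int topK) {
        Map<String, Double> text = normalize(textScores);
        Map<String, Double> vec = normalize(vectorScores);
        Set<String> ids = new HashSet<>(text.keySet());
        ids.addAll(vec.keySet());
        return ids.stream()
                .sorted(Comparator.comparingDouble((String id) ->
                        -(textWeight * text.getOrDefault(id, 0.0)
                          + (1 - textWeight) * vec.getOrDefault(id, 0.0))))
                .limit(topK)
                .toList();
    }
}
```

A chunk that scores well on only one side can still win, which is the point: an exact symbol hit with weak semantic similarity (or vice versa) stays in the running.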
## 5. Further Reading: Why Cline Doesn’t Use Vector Search
93+
Cline is an open-source plugin, but unlike Cursor, it does not index the codebase and does not use vector search.
94+
95+
Instead, Cline uses another approach: it reads relevant files by following dependency chains (e.g., `import`), similar to how humans navigate code by jumping to dependencies. Downsides include reading lots of files when dependencies are deep, consuming a lot of context, and knowing when to stop. This is based on AST parsing. We’ve used similar strategies for autocomplete, because using dependency context helps predict the next token more accurately.
96+
97+
It’s interesting: Cursor extracts relevance via project-wide vector search (a technical correlation approach), whereas Cline extracts relevance via dependency traversal (a human reading approach).
98+
99+
## 6. Summary
100+
In short, the codebase feature is RAG: vectorize content and retrieve via embeddings. To get good results, you need to optimize many details: chunking strategies, query keyword weighting, and adding text search. Through this practice, we gained a solid understanding of how codebase features in AI coding tools like Cursor work. Their engineering is certainly more complex, but the overall flow is now clear.
101+
102+
Next, we hope to explore memory capabilities. LLM context is limited, and starting a new chat loses accumulated information. Memory can retain key information from previous conversations and use it in subsequent chats to produce more accurate answers.
103+
104+
## 7. References
105+
- Why Cline doesn’t index your codebase (Chinese): https://zhuanlan.zhihu.com/p/1919489523823407360
106+
- Official explanation: https://cline.bot/blog/why-cline-doesnt-index-your-codebase-and-why-thats-a-good-thing
107+
- Discussion on pros/cons of Cursor’s codebase and Cline’s mechanism: https://news.ycombinator.com/item?id=44106944
108+
- Hybrid search: https://www.pinecone.io/blog/hybrid-search/