
Commit 2961215

Merge pull request #585 from Goodnight77/docs/fix-documentation-typos
docs: fix typos, grammatical errors
2 parents 9cfe576 + 8cfa9ee commit 2961215

10 files changed

Lines changed: 40 additions & 40 deletions

docs/articles/Vector-Indexes.md

Lines changed: 3 additions & 3 deletions
@@ -8,7 +8,7 @@ Running AI applications depends on vectors, often called [embeddings](https://su
 
 ![What is a vector index](../assets/use_cases/vector_indexes/vector_index1.png)
 
-Vector indexing, by creating groups of matching elements, speeds up similarity search - which calculate vector closeness using metrics like Euclidean or Jacobian distance. (In small datasets where accuracy is more important than efficiency, you can use K-Nearest Neighbors to pinpoint your query's closest near neighbors. As datasets get bigger and efficiency becomes an issue, an [Approximate Nearest Neighbor](https://superlinked.com/vectorhub/building-blocks/vector-search/nearest-neighbor-algorithms) (ANN) approach will *very quickly* return accurate-enough results.)
+Vector indexing, by creating groups of matching elements, speeds up similarity search - which calculates vector closeness using metrics like Euclidean or Jaccard distance. (In small datasets where accuracy is more important than efficiency, you can use K-Nearest Neighbors to pinpoint your query's closest near neighbors. As datasets get bigger and efficiency becomes an issue, an [Approximate Nearest Neighbor](https://superlinked.com/vectorhub/building-blocks/vector-search/nearest-neighbor-algorithms) (ANN) approach will *very quickly* return accurate-enough results.)
 
 Vector indexes are crucial to efficient, relevant, and accurate search in various common applications, including Retrieval Augmented Generation ([RAG](https://superlinked.com/vectorhub/articles/advanced-retrieval-augmented-generation)), [semantic search in image databases](https://superlinked.com/vectorhub/articles/retrieval-from-image-text-modalities) (e.g., in smartphones), large text documents, advanced e-commerce websites, and so on.
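Editor's note: the exact-KNN-versus-metric distinction mentioned in this hunk can be sketched in a few lines of plain Python. This is an illustrative toy (the vectors, sets, and `k` value are made up), not code from the documentation:

```python
import math

def euclidean(a, b):
    # Straight-line distance between two dense vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def jaccard_distance(a, b):
    # For set-valued (binary) data: 1 - |A intersect B| / |A union B|
    return 1 - len(a & b) / len(a | b)

def knn(query, vectors, k=2):
    # Exact K-Nearest Neighbors: scans every vector, so O(n) per query -
    # fine for small datasets; ANN indexes exist to avoid this scan at scale.
    return sorted(range(len(vectors)), key=lambda i: euclidean(query, vectors[i]))[:k]

vectors = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
print(knn((0.9, 1.1), vectors))  # indices of the two nearest vectors: [1, 0]
```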

@@ -77,9 +77,9 @@ IVF_SQ makes sense when dealing with medium to large datasets where memory effic
 
 ### DiskANN
 
-Most ANN algorithms - including those above - are designed for in-memory computation. But when you're dealing with *big data*, in-memory computation can be a bottleneck. Disk-based ANN ([DiskANN](https://suhasjs.github.io/files/diskann_neurips19.pdf)) is built to leverage Solid-State Drives' (SSDs') large memory and high-speed capabilities. DiskANN indexes vectors using the Vamana algorithm, a graph-based indexing structure that minimizes the number of sequential disk reads required during, by creating a graph with a smaller search "diameter" - the max distance between any two nodes (representing vectors), measured as the least number of hops (edges) to get from one to the other. This makes the search process more efficient, especially for the kind of large-scale datasets that are stored on SSDs.
+Most ANN algorithms - including those above - are designed for in-memory computation. But when you're dealing with *big data*, in-memory computation can be a bottleneck. Disk-based ANN ([DiskANN](https://suhasjs.github.io/files/diskann_neurips19.pdf)) is built to leverage Solid-State Drives' (SSDs') large memory and high-speed capabilities. DiskANN indexes vectors using the Vamana algorithm, a graph-based indexing structure that minimizes the number of sequential disk reads required, by creating a graph with a smaller search "diameter" - the max distance between any two nodes (representing vectors), measured as the least number of hops (edges) to get from one to the other. This makes the search process more efficient, especially for the kind of large-scale datasets that are stored on SSDs.
 
-By using a SSD to store and search its graph index, DiskANN can be cost-effective, scalable, and efficient.
+By using an SSD to store and search its graph index, DiskANN can be cost-effective, scalable, and efficient.
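Editor's note: the "small search diameter" idea in this hunk can be illustrated with a toy greedy walk over a neighbor graph. This is a simplified sketch of the search side of a Vamana-style index, not DiskANN's actual implementation; the graph, vectors, and entry point below are made up:

```python
import math

def greedy_search(graph, vectors, query, entry):
    # Greedy walk: repeatedly hop to the neighbor closest to the query.
    # Fewer hops (a smaller graph "diameter") means fewer sequential reads
    # when, as in DiskANN, neighbor lists live on an SSD.
    dist = lambda v: math.dist(v, query)
    current, hops = entry, 0
    while True:
        better = [n for n in graph[current] if dist(vectors[n]) < dist(vectors[current])]
        if not better:
            return current, hops  # local optimum: no neighbor is closer
        current = min(better, key=lambda n: dist(vectors[n]))
        hops += 1

# A path graph: each hop halves the remaining distance to the query.
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
vectors = [(0, 0), (2, 0), (4, 0), (6, 0)]
print(greedy_search(graph, vectors, (5.5, 0), entry=0))  # (3, 3): node 3 after 3 hops
```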
 
 ### SPTAG-based Approximate Nearest Neighbor Search (SPANN)

docs/articles/advanced_retrieval_augmented_generation.md

Lines changed: 3 additions & 3 deletions
@@ -109,7 +109,7 @@ embed_model = HuggingFaceEmbedding(model_name="mixedbread-ai/mxbai-embed-large-v
 Settings.embed_model = embed_model
 ```
 
-Specifically, we selected "mixedbread-ai/mxbai-embed-large-v1", a model that strikes a balance between retrieval accuracy and computational efficiency, according to recent performance evaluations in the Huggingface [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard).
+Specifically, we selected "mixedbread-ai/mxbai-embed-large-v1", a model that strikes a balance between retrieval accuracy and computational efficiency, according to recent performance evaluations in the Hugging Face [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard).
 
 ### Indexing

@@ -164,7 +164,7 @@ Another way to enhance retrieval accuracy is through [hybrid search](https://sup
 
 This hybrid approach captures both the semantic richness of embeddings and the direct match precision of keyword search, leading to improved relevance in retrieved documents.
 
-So far we've seen how careful preretrieval (data preparation, chunking, embedding, indexing) and retrieval (hybrid search) can help improve RAG retrieval results. What about _after_ we've done our retrieval?
+So far we've seen how careful pre-retrieval (data preparation, chunking, embedding, indexing) and retrieval (hybrid search) can help improve RAG retrieval results. What about _after_ we've done our retrieval?
 
 ## Post-retrieval

@@ -259,7 +259,7 @@ display(Markdown(f"<b>{response}</b>"))
 
 "Based on the context provided, the dangers of hallucinations in the context of machine learning and natural language processing are that they can lead to inaccurate or incorrect results, particularly in customer support and content creation. These hallucinations, which are false pieces of information generated by a generative model, can have disastrous consequences in use cases where there's more at stake than simple internet searches. In short, machine hallucinations can be dangerous because they can lead to false information being presented as fact, which can have serious consequences in real-world applications."
 
-Our advanced RAG pipeline result appears to be relatively precise, avoid hallucinations, and effectively integrate retrieved context into generated output. Note: generation is not a fully deterministic process, so if you run this code yourself, you may receive slightly different output.
+Our advanced RAG pipeline result appears to be relatively precise, avoids hallucinations, and effectively integrates retrieved context into generated output. Note: generation is not a fully deterministic process, so if you run this code yourself, you may receive slightly different output.
 
 ## Conclusion

docs/articles/airbnb-search-benchmarking.md

Lines changed: 6 additions & 6 deletions
@@ -2,7 +2,7 @@
 
 ## Introduction & Motivation
 
-Imagine you are searching for the ideal Airbnb for a weekend getaway. You open the website and adjust sliders and checkboxes but still encounter lists of options that nearly match your need but never are never truly what you are looking for. Although it is straightforward to specify a filter such as: "price less than two hundred dollars", rigid tags and thresholds for more complex search queries, make it a much more difficult task to figure out what the user is looking for.
+Imagine you are searching for the ideal Airbnb for a weekend getaway. You open the website and adjust sliders and checkboxes but still encounter lists of options that nearly match your need but are never truly what you are looking for. Although it is straightforward to specify a filter such as "price less than two hundred dollars", rigid tags and thresholds make it much more difficult to figure out what the user is looking for in more complex search queries.
 
 Converting a mental image of a luxury apartment near the city's finest cafés or an affordable business-ready suite with good reviews into numerical filters often proves frustrating. Natural language is inherently unstructured and must be transformed into numerical representations to uncover user intent. At the same time, the rich structured data associated with each listing must also be encoded numerically to reveal relationships between location, comfort, price, and reviews.

@@ -49,7 +49,7 @@ def create_text_description(row):
     """Create a unified text description from listing attributes."""
     text = f"{row['listing_name']} is a {row['accommodation_type']} "
     text += f"For {row['max_guests']} guests. "
-    text += f"It costs ${row['price']} per night with a rating of {row['rating']} with {row['review_count']} nymber of reviews. "
+    text += f"It costs ${row['price']} per night with a rating of {row['rating']} from {row['review_count']} reviews. "
     text += f"Description: {row['description']} "
     text += f"Amenities include: {', '.join(row['amenities_list'])}"
     return text
@@ -259,7 +259,7 @@ If neither of the two approaches produces satisfactory results on structured dat
 
 <figcaption>Figure 10: Hybrid search results for "luxury places with good reviews"</figcaption>
 </figure>
 
-The results indicate that hybrid search effectively balances semantic understanding with keyword precision. By combining vector search's ability to grasp concepts like "luxury" with BM25's strength in finding exact term matches, the hybrid approach delivers more comprehensive results. However, the fundamental limitations remain: the system still cannot reliably interpret numerical constraints (Figure 11) or make sophisticated judgments about what constitutes "good reviews" in terms of both rating quality and quantity. Additionaly, finding the optimal alpha value for the weighted combination requires careful tuning and may need adjustment based on specific use cases or datasets. Implementing hybrid search also requires maintaining two separate index structures and ensuring proper score normalization and fusion. This suggests that while hybrid search improves upon its component approaches, we need a more advanced solution to truly understand structured data attributes and their relationships.
+The results indicate that hybrid search effectively balances semantic understanding with keyword precision. By combining vector search's ability to grasp concepts like "luxury" with BM25's strength in finding exact term matches, the hybrid approach delivers more comprehensive results. However, the fundamental limitations remain: the system still cannot reliably interpret numerical constraints (Figure 11) or make sophisticated judgments about what constitutes "good reviews" in terms of both rating quality and quantity. Additionally, finding the optimal alpha value for the weighted combination requires careful tuning and may need adjustment based on specific use cases or datasets. Implementing hybrid search also requires maintaining two separate index structures and ensuring proper score normalization and fusion. This suggests that while hybrid search improves upon its component approaches, we need a more advanced solution to truly understand structured data attributes and their relationships.
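Editor's note: the alpha-weighted combination discussed in this hunk can be sketched as a simple score-fusion step. The min-max normalization and example scores below are illustrative assumptions; real systems may use other fusion schemes (e.g. reciprocal rank fusion):

```python
def hybrid_scores(vector_scores, bm25_scores, alpha=0.5):
    # Min-max normalize each list so the two retrievers' scores are comparable,
    # then blend: alpha weights the semantic (vector) side, 1 - alpha the BM25 side.
    def norm(scores):
        lo, hi = min(scores), max(scores)
        return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]
    v, b = norm(vector_scores), norm(bm25_scores)
    return [alpha * vs + (1 - alpha) * bs for vs, bs in zip(v, b)]

# Document 0 wins on vector similarity, document 1 on BM25; alpha decides the balance.
print(hybrid_scores([0.9, 0.2, 0.5], [2.0, 8.0, 5.0], alpha=0.7))
```

Tuning `alpha` toward 1.0 favors semantic matches ("luxury"), toward 0.0 exact keyword matches - the trade-off the paragraph above describes.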
 
 <figure style="text-align: center; margin: 20px 0;">
 <img src="../assets/use_cases/airbnb_search/hybrid_filter.png" alt="Hybrid Search Results" style="width: 100%;">
@@ -324,7 +324,7 @@ The cross-encoder reranking results demonstrate a notable improvement in result
 
 Most impressively, for the numerical constraints query, the cross-encoder makes progress in understanding specific requirements. Despite the first result exceeding the price constraint (2632 > 2000), the reranking correctly identifies more listings matching the "5 guests" requirement and prioritizes them appropriately. This shows the effectiveness of using cross-encoders, since they re-calculate the similarity between the query and the documents after the initial retrieval based on vector search. In other words, the model can make finer distinctions when examining query-document pairs together rather than separately. However, the cross-encoder still does not perfectly understand all numerical constraints. Additionally, despite the improvements, cross-encoder reranking has significant computational drawbacks. It requires evaluating each query-document pair individually through a transformer-based model, which increases latency and resource requirements, especially as the candidate pool grows, making the search challenging to scale for large datasets or real-time applications with strict performance requirements. These takeaways suggest that while this approach represents a significant improvement, a more structured approach to handling multi-attribute data could yield better results.
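Editor's note: the pairwise rescoring described in this hunk can be sketched generically. Here `score_fn` is a stand-in for a real cross-encoder (e.g. a sentence-transformers `CrossEncoder.predict` call); the toy token-overlap scorer and documents are illustrative only:

```python
def rerank(query, candidates, score_fn, top_k=3):
    # Cross-encoder reranking: score each (query, document) pair jointly,
    # then reorder the first-stage retrieval results by that score.
    # Cost grows linearly with the candidate pool - hence the latency concern.
    return sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)[:top_k]

def overlap_score(query, doc):
    # Toy stand-in for a cross-encoder: fraction of query tokens in the document.
    q = set(query.lower().split())
    return len(q & set(doc.lower().split())) / len(q)

docs = ["cozy loft for 5 guests", "studio for 2 guests", "villa for 5 guests with pool"]
print(rerank("place for 5 guests", docs, overlap_score, top_k=2))
```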
 
 <figure style="text-align: center; margin: 20px 0;">
-    <img <img src="../assets/use_cases/airbnb_search/cross_filter.png" alt="Cross-Encoder Results for Numerical Query" style="width: 100%;">
+    <img src="../assets/use_cases/airbnb_search/cross_filter.png" alt="Cross-Encoder Results for Numerical Query" style="width: 100%;">
 <figcaption>Figure 15: Cross-encoder results for numerical constraints query</figcaption>
 </figure>

@@ -349,7 +349,7 @@ During offline indexing, each listing is passed through a BERT model to produce
 
 <figcaption>Figure 17: Colbert Multi-Vector Retrieval</figcaption>
 </figure>
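Editor's note: the late-interaction scoring at the heart of this retrieval scheme is ColBERT's MaxSim operator - each query token embedding is matched against its best document token embedding, and the maxima are summed. A minimal sketch with toy 2-d "embeddings" (real ColBERT uses normalized BERT token vectors):

```python
def maxsim(query_vecs, doc_vecs):
    # ColBERT late interaction: for each query token embedding, take its best
    # (maximum dot-product) match over all document token embeddings, then sum
    # those per-token maxima into one relevance score.
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

query = [(1.0, 0.0), (0.0, 1.0)]  # two query "token embeddings"
doc = [(1.0, 0.0), (0.5, 0.5)]    # two document "token embeddings"
print(maxsim(query, doc))          # 1.0 + 0.5 = 1.5
```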
 
-Here is how we implment the multi-vecotr search by ColBERT:
+Here is how we implement the multi-vector search with ColBERT:
 
 ```python
 class ColBERTSearch:
@@ -462,7 +462,7 @@ At query time, Superlinked uses a large language model to interpret the user’s
 
 To ensure that non-negotiable constraints are respected, Superlinked first applies hard filters to eliminate listings that do not meet specific criteria, such as guest capacity or maximum price. Only the listings that pass these filters are considered in the final ranking stage. The system then performs a weighted nearest neighbors search, comparing the multi-attribute embeddings of these candidates against the weighted query representation to rank them by overall relevance. This combination of modality-aware encoding, constraint filtering, and weighted ranking allows Superlinked to produce accurate, context-aware results that reflect both the structure of the underlying data and the nuanced preferences of the user.
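Editor's note: the two-stage flow described in this hunk - hard filters first, weighted ranking second - can be sketched abstractly. The listing fields, filter predicates, and scoring functions below are illustrative stand-ins, not Superlinked's actual API:

```python
def filtered_weighted_search(query, listings, hard_filters, attr_scores, weights, top_k=2):
    # Stage 1: hard filters eliminate listings that violate non-negotiable
    # constraints (e.g. guest capacity, maximum price) before any ranking.
    candidates = [l for l in listings if all(f(l) for f in hard_filters)]
    # Stage 2: rank survivors by a weighted sum of per-attribute similarities,
    # mirroring a weighted nearest-neighbor search over multi-attribute embeddings.
    total = lambda l: sum(w * attr_scores[a](query, l) for a, w in weights.items())
    return sorted(candidates, key=total, reverse=True)[:top_k]

listings = [
    {"name": "A", "price": 2632, "guests": 6, "rating": 4.9},  # fails price filter
    {"name": "B", "price": 1500, "guests": 5, "rating": 4.7},  # passes both filters
    {"name": "C", "price": 1200, "guests": 2, "rating": 5.0},  # fails guest filter
]
hard = [lambda l: l["price"] <= 2000, lambda l: l["guests"] >= 5]
scores = {"rating": lambda q, l: l["rating"] / 5.0}
print(filtered_weighted_search("place for 5 guests under $2000", listings, hard, scores, {"rating": 1.0}))
```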
 
-Here is how we implment the Superlinked for our Airbnb search:
+Here is how we implement Superlinked for our Airbnb search:
 
 We first need to define a schema that captures the structure of our dataset. The schema outlines both the fields we'll use for embedding and those we'll use for filtering:
