Commit 3a41f4a ("Docs")
1 parent bd5645d

4 files changed, 307 additions & 0 deletions

secretary/crdt.md (1 addition & 0 deletions)

@@ -5,6 +5,7 @@
  * [CRDTs: The Hard Parts](https://www.youtube.com/watch?v=x7drE24geUw)
  * [CRDTs and the Quest for Distributed Consistency](https://www.youtube.com/watch?v=B5NULPSiOGw)
  * [A CRDT Primer: Defanging Order Theory](https://www.youtube.com/watch?v=OOlnp2bZVRs)
+ * [Conflict-Free Replicated Data Types (CRDT) for Distributed JavaScript Apps.](https://www.youtube.com/watch?v=M8-WFTjZoA0)

  * [Loro Is Local-First State With CRDT](https://www.youtube.com/watch?v=NB7HRfyufLk)
  * [How Yjs works from the inside out](https://www.youtube.com/watch?v=0l5XgnQ6rB4)

secretary/kademilia.md (5 additions & 0 deletions)

@@ -4,6 +4,8 @@
  * [Kademlia Paper](https://pdos.csail.mit.edu/~petar/papers/maymounkov-kademlia-lncs.pdf)
  * [Distributed Hash Tables with Kademlia](https://codethechange.stanford.edu/guides/guide_kademlia.html#supporting-dynamic-leaves-and-joins)

+ * [P2P Networks](https://www.youtube.com/playlist?list=PLL8woMHwr36F-1h7BE92ynHHOE3zebGpA)
+ * [Kademlia: A Peer-to-Peer Information System Based on the XOR Metric](https://www.youtube.com/watch?v=NxhZ_c8YX8E&list=PLL8woMHwr36F-1h7BE92ynHHOE3zebGpA&index=9)
  * [Kademlia, Explained](https://www.youtube.com/watch?v=1QdKhNpsj8M)
  * [Kademlia - a Distributed Hash Table implementation | Paper Dissection and Deep-dive](https://www.youtube.com/watch?v=_kCHOpINA5g&list=PLsdq-3Z1EPT1rNeq2GXpnivaWINnOaCd0&index=7)
  * [Playlist](https://www.youtube.com/playlist?list=PLiYqQVdgdw_sSDkdIZzDRQR9xZlsukIxD)

@@ -19,6 +21,9 @@
  * https://github.com/pdelong/Kademlia
  * https://github.com/prettymuchbryce/kademlia

+ * [Consistent Hashing with Bounded Loads](https://research.google/blog/consistent-hashing-with-bounded-loads/)
+ * https://github.com/buraksezer/consistent
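The consistent-hashing links above pair naturally with a minimal sketch. This is an illustrative hash ring in Python (my own toy names, not the buraksezer/consistent API, and no virtual nodes or bounded loads), showing how keys map to the nearest node clockwise and how removing a node only remaps that node's keys:

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    """Map a string to a point on the ring via MD5 (choice is illustrative)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    """Minimal consistent-hash ring sketch."""
    def __init__(self, nodes):
        self.ring = sorted((_hash(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        """First node clockwise from the key's position (wraps around)."""
        points = [p for p, _ in self.ring]
        i = bisect.bisect(points, _hash(key)) % len(self.ring)
        return self.ring[i][1]

    def remove(self, node: str):
        self.ring = [(p, n) for p, n in self.ring if n != node]

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("some-object-key")
ring.remove(owner)
# Only keys owned by the removed node move; all others keep their assignment.
```

The bounded-loads variant in the Google post adds a capacity check on top of exactly this lookup, spilling overloaded keys to the next node clockwise.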

###

You’re looking for a Kademlia DHT implementation in Golang. I’ll explain the key components and then provide an implementation outline.
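The key component is Kademlia's XOR distance metric, which also determines which k-bucket a contact belongs in. A minimal sketch (in Python for brevity; function names are illustrative):

```python
# Minimal sketch of Kademlia's XOR metric (illustrative, not a full DHT).
def xor_distance(a: int, b: int) -> int:
    """Distance between two node IDs is simply their bitwise XOR."""
    return a ^ b

def bucket_index(self_id: int, other_id: int) -> int:
    """Index of the k-bucket other_id falls into: the position of the
    highest differing bit between the two IDs."""
    d = xor_distance(self_id, other_id)
    if d == 0:
        raise ValueError("a node does not bucket itself")
    return d.bit_length() - 1

# The XOR metric is symmetric and satisfies the triangle inequality,
# which is what lets lookups halve the remaining distance each hop.
print(xor_distance(0b1010, 0b0110))  # -> 12 (0b1100)
print(bucket_index(0b1010, 0b0110))  # -> 3 (highest differing bit)
```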

secretary/lsmtree.md (142 additions & 0 deletions, new file)

Object stores typically do not use B-trees the way databases do. Instead, they use hash-based indexing or LSM-trees (Log-Structured Merge Trees), depending on the use case. Here’s why:

1. Hash-Based Indexing (Common for Object Stores)
• Object stores like Amazon S3, MinIO, Ceph, and Swift use a distributed key-value store where objects are accessed using a unique key (often a hash of the object name).
• Hash maps allow for fast lookups (O(1) complexity) but do not support range queries efficiently.
• Metadata (such as object name, size, timestamp) is often stored in a metadata store, which may use databases like PostgreSQL, MySQL, or embedded key-value stores.

2. LSM-Trees (Used in Some Object Stores)
• Some object stores use LSM-Trees (Log-Structured Merge Trees) to handle high write throughput efficiently.
• Example: RocksDB and LevelDB, which are used in systems like Ceph’s BlueStore.

3. B-Trees (More Common in Databases)
• B+ trees are optimized for range queries and ordered data, which is common in databases but not in object stores.
• Object stores generally don’t need range queries on keys because they retrieve objects by unique identifiers.

When Might Object Stores Use B-Trees?

Some hybrid systems or metadata layers within object stores might use B-Trees:
• Metadata storage: If the metadata layer is implemented in a traditional database (PostgreSQL, MySQL, etc.), it may use B+ trees for indexing.
• Filesystems used by object stores: If the object store is built on top of a filesystem (e.g., XFS, ext4), those filesystems may use B-trees internally.

Conclusion
• B-Trees are great for databases with ordered key access and range queries.
• Object stores prioritize fast lookups and scalability, favoring hash maps or LSM-Trees.
• Some object store metadata layers might use B-Trees if they rely on traditional databases.

Are you thinking about designing your own object store?

Internals of LSM-Tree vs. B-Tree: A Deep Dive

Both LSM-Trees (Log-Structured Merge-Trees) and B-Trees are used for indexing data, but they are optimized for different workloads. Understanding their internals helps in choosing the right one for a database, key-value store, or file system.

1. Internals of B-Tree

A B-Tree is a self-balancing tree data structure optimized for disk-based storage. It maintains sorted keys and allows efficient point lookups, range queries, insertions, and deletions.

1.1. Structure of a B-Tree
• A B-Tree of order d (degree d) has:
  • A root node (which may have fewer keys)
  • Internal nodes (branching factor of up to 2d)
  • Leaf nodes (which contain actual data or pointers to data)
• Each node contains up to 2d keys and 2d+1 child pointers.
• Keys in a node are sorted, making binary search within a node possible.

1.2. Operations in B-Tree

1.2.1. Search (O(log N))
• Start at the root and perform binary search within the node.
• If the key is found, return it.
• If not, follow the correct child pointer and repeat.

1.2.2. Insertion (O(log N))
1. Search for the correct leaf node to insert the key.
2. If there is space, insert it.
3. If the leaf node is full:
  • Split the node into two.
  • Push the middle key to the parent.
  • If the parent is full, repeat the split upwards (recursively).

1.2.3. Deletion (O(log N))
1. Find the key in the leaf node.
2. If removing it causes an underflow (too few keys), borrow a key from a sibling.
3. If borrowing is not possible, merge the node with a sibling.
4. If the parent becomes underfilled, merge upwards recursively.
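The search procedure above can be sketched directly. A toy node layout in Python (dicts standing in for disk pages; purely illustrative):

```python
import bisect

# A node is {"keys": [...], "children": [...]}; leaves have no "children".
# Keys are sorted, so we binary-search within a node, then descend.
def btree_search(node, key):
    while True:
        keys = node["keys"]
        i = bisect.bisect_left(keys, key)
        if i < len(keys) and keys[i] == key:
            return True                      # found in this node
        if "children" not in node:
            return False                     # reached a leaf without finding it
        node = node["children"][i]           # descend into the i-th subtree

# Small example: root with one key and two leaf children.
root = {
    "keys": [10],
    "children": [
        {"keys": [3, 7]},
        {"keys": [12, 15, 20]},
    ],
}
print(btree_search(root, 15))  # -> True
print(btree_search(root, 8))   # -> False
```

Each loop iteration corresponds to one disk read in a real B-tree, which is why large nodes (one page each) keep the tree shallow.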
1.3. Characteristics of B-Tree
• Disk-efficient: Minimizes disk reads by keeping nodes large (typically 4KB, matching disk page sizes).
• Well-suited for range queries due to the sorted structure.
• Balanced: Ensures O(log N) time complexity for operations.
• Mutable: Supports in-place updates without rewriting whole files.

2. Internals of LSM-Tree (Log-Structured Merge Tree)

The LSM-Tree is optimized for high write throughput by deferring and batching writes instead of modifying disk structures in-place.

2.1. Structure of LSM-Tree

Instead of modifying data in-place like B-Trees, LSM-Trees follow a write-append strategy with multiple levels of sorted structures.
1. MemTable (Memory Table)
  • An in-memory sorted data structure (usually a Red-Black Tree or Skip List).
  • Writes go here first.
  • Fast inserts, but limited in size.
2. SSTables (Sorted String Tables) on Disk
  • When the MemTable fills up, it is flushed to disk as an immutable sorted file (SSTable).
  • SSTables are sorted and allow efficient range scans.
3. Compaction Process
  • Multiple SSTables are periodically merged (compacted) into larger SSTables, removing old versions of keys.
  • This reduces read amplification.

2.2. Operations in LSM-Tree

2.2.1. Insertion (O(1) amortized)
1. Write to the MemTable (fast, in-memory).
2. When the MemTable is full, it is flushed to disk as an SSTable.
3. Periodic compaction merges SSTables to optimize read efficiency.

2.2.2. Search (O(log N) or worse due to multiple SSTables)
1. Check the MemTable first (fast, in-memory).
2. If not found, search recent SSTables on disk.
3. If still not found, search older SSTables.
4. Bloom filters are used to skip SSTables that cannot contain the key.

2.2.3. Deletion (Tombstones)
1. Instead of deleting immediately, a tombstone (delete marker) is written.
2. The actual data is removed later during compaction.
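The three operations above fit in a compact sketch. A toy LSM store in Python (a plain dict stands in for the MemTable, lists of sorted pairs for SSTables; real engines add a WAL, Bloom filters, and leveled compaction):

```python
# Toy LSM-tree: illustrative only (no WAL, Bloom filters, or leveling).
TOMBSTONE = object()

class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}      # stands in for a skip list / red-black tree
        self.sstables = []      # newest first; each a sorted list of (key, value)
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value              # writes always hit memory first
        if len(self.memtable) >= self.limit:
            self._flush()

    def delete(self, key):
        self.put(key, TOMBSTONE)                # deletion = writing a tombstone

    def get(self, key):
        if key in self.memtable:                # 1. check the MemTable
            v = self.memtable[key]
            return None if v is TOMBSTONE else v
        for table in self.sstables:             # 2. newest SSTable to oldest
            for k, v in table:                  # (a real engine binary-searches)
                if k == key:
                    return None if v is TOMBSTONE else v
        return None

    def _flush(self):
        self.sstables.insert(0, sorted(self.memtable.items(),
                                       key=lambda kv: kv[0]))
        self.memtable = {}

    def compact(self):
        """Merge all SSTables; newest value wins, tombstones are dropped."""
        merged = {}
        for table in reversed(self.sstables):   # oldest first, newer overwrites
            merged.update(table)
        self.sstables = [sorted(
            (kv for kv in merged.items() if kv[1] is not TOMBSTONE),
            key=lambda kv: kv[0])]
```

Note how `get` must consult the MemTable and then every SSTable from newest to oldest: that is the read amplification discussed below, and `compact` is what keeps it bounded.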
2.3. Characteristics of LSM-Tree
• Optimized for high write throughput (batching and append-only writes).
• Immutable SSTables prevent in-place fragmentation.
• Compaction reduces read latency but adds extra background work.
• Higher read amplification compared to B-Trees (must search multiple SSTables).

3. B-Tree vs. LSM-Tree: Key Differences

| Feature | B-Tree | LSM-Tree |
| --- | --- | --- |
| Write speed | Slower (in-place updates, multiple disk I/Os) | Faster (writes to MemTable, append-only SSTables) |
| Read speed | Faster (single lookup, O(log N)) | Slower (may scan multiple SSTables, higher read amplification) |
| Range queries | Efficient (sorted, contiguous leaves) | Less efficient (data spread across SSTables) |
| Disk usage | More fragmentation (frequent updates) | More compact (compaction removes old versions) |
| Compaction overhead | No compaction needed | Requires background compaction (CPU, I/O overhead) |
| Write amplification | A whole page is rewritten per small update | Writes are batched, though compaction rewrites data in the background |
| Read amplification | Lower (fewer disk reads) | Higher (may read multiple SSTables) |
| Best for | Balanced workloads (mix of reads & writes) | Write-heavy workloads (logging, key-value stores) |

4. When to Use Which?

| Use Case | Best Choice |
| --- | --- |
| Relational databases (PostgreSQL, MySQL, etc.) | B-Tree (supports transactions, indexing, and range queries) |
| Key-value stores (RocksDB, LevelDB, etc.) | LSM-Tree (handles high write throughput efficiently) |
| File systems (XFS, NTFS, ext4, etc.) | B-Tree (supports random access and metadata storage) |
| Log storage (Cassandra, ScyllaDB, HBase, etc.) | LSM-Tree (high write performance and durability) |
| Distributed databases (Bigtable, CockroachDB, etc.) | LSM-Tree (optimized for distributed writes and merges) |

5. Conclusion
• B-Trees are best when reads and writes are balanced, supporting low-latency reads and efficient range scans.
• LSM-Trees are best when writes dominate, using batching and compaction to optimize disk usage.
• Many modern databases use hybrid approaches, such as B-Trees for metadata and LSM-Trees for logs.

Would you like a deeper dive into a specific area, such as compaction strategies, optimizations, or real-world implementations? 🚀

secretary/tfidf.md (159 additions & 0 deletions, new file)

* https://github.com/blevesearch/bleve

TF-IDF and BM25: Concepts & Differences

TF-IDF and BM25 are information retrieval techniques used to rank documents based on their relevance to a query. They are commonly used in search engines, chatbots, and text analysis.

1. TF-IDF (Term Frequency - Inverse Document Frequency)

TF-IDF is a statistical measure that evaluates how important a word is in a document relative to a collection (corpus) of documents.

1.1. How TF-IDF Works
1. TF (Term Frequency)
  • Measures how often a word appears in a document.
  • Formula: TF(t, d) = (occurrences of t in d) / (total terms in d)
  • Example:
    • “apple” appears 3 times in a document of 100 words → TF = 3/100 = 0.03.
2. IDF (Inverse Document Frequency)
  • Measures how unique or rare a word is across all documents.
  • Formula: IDF(t) = log₁₀(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t.
  • Example:
    • If “apple” appears in 10 out of 1000 documents: IDF = log₁₀(1000 / 10) = 2.
3. TF-IDF Score Calculation
  • The final score is: TF-IDF(t, d) = TF(t, d) × IDF(t)
  • Example:
    • If TF = 0.03 and IDF = 2, then TF-IDF = 0.03 × 2 = 0.06.
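The worked example above can be reproduced in a few lines of Python (a minimal sketch using the plain log₁₀ IDF variant from the formulas above, not any particular library's smoothed version):

```python
import math

def tf(term, doc_tokens):
    """Term frequency: occurrences of term / total tokens in the document."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    """Inverse document frequency: log10(N / number of docs containing term)."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log10(len(corpus) / df)

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

# Mirror the running example: "apple" 3 times in a 100-word document,
# appearing in 10 of 1000 documents.
doc = ["apple"] * 3 + ["filler"] * 97
corpus = [{"apple"}] * 10 + [{"other"}] * 990
print(tf("apple", doc))              # -> 0.03
print(idf("apple", corpus))          # -> 2.0
print(tf_idf("apple", doc, corpus))  # -> 0.06
```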
31+
32+
1.2. Uses of TF-IDF
33+
• Search engines (Google, Elasticsearch)
34+
• Keyword extraction
35+
• Text classification
36+
37+
2. BM25 (Best Matching 25)
38+
39+
BM25 is an improved version of TF-IDF that introduces additional parameters to adjust for document length and term saturation. It’s widely used in modern search engines.
40+
41+
2.1. How BM25 Works
42+
43+
BM25 improves TF-IDF by:
44+
1. Adjusting Term Frequency (TF) with Saturation
45+
• In TF-IDF, if a word appears 100 times, it gets 100 times the weight.
46+
• BM25 limits the effect of repeated words using a saturation function.
47+
48+
• k₁ is a tuning parameter (usually 1.2–2.0).
49+
• This prevents a single frequent word from dominating the ranking.
50+
2. Length Normalization
51+
• Long documents naturally contain more words.
52+
• BM25 normalizes document length using a parameter b (0 ≤ b ≤ 1).
53+
54+
• If b = 1, normalization is fully applied.
55+
• If b = 0, no length normalization is applied.
56+
3. BM25 Score Calculation
57+
58+
where:
59+
• t is the term.
60+
• IDF(t) is computed like TF-IDF.
61+
• k₁ and b are hyperparameters.
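Putting the pieces together, a minimal BM25 scorer in Python (a sketch of the formula above using the plain log₁₀ IDF; production systems like Lucene use a smoothed IDF):

```python
import math

def bm25_score(query_terms, doc_tokens, corpus, k1=1.2, b=0.75):
    """Score one document against a query using the BM25 formula above.

    corpus: list of token lists, used for IDF and average document length.
    """
    avg_len = sum(len(d) for d in corpus) / len(corpus)
    score = 0.0
    for t in query_terms:
        tf = doc_tokens.count(t)
        df = sum(1 for d in corpus if t in d)
        if tf == 0 or df == 0:
            continue
        idf = math.log10(len(corpus) / df)
        norm = 1 - b + b * (len(doc_tokens) / avg_len)
        score += idf * (tf * (k1 + 1)) / (tf + k1 * norm)
    return score

# Term saturation: 100 repeats scores far less than 100x a single occurrence.
corpus = [["apple", "pie"], ["apple"] * 100, ["banana", "bread"], ["cherry"]]
once = bm25_score(["apple"], corpus[0], corpus)
many = bm25_score(["apple"], corpus[1], corpus)
print(many < 100 * once)  # -> True: extra occurrences add diminishing value
```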
2.2. Why BM25 is Better than TF-IDF
• Better ranking accuracy (avoids giving too much weight to very frequent words).
• Handles long documents better (prevents bias against short documents).
• More flexible with parameters (you can tune k₁ and b for different datasets).

2.3. Uses of BM25
• Search engines (Elasticsearch, Apache Solr, Lucene)
• Document ranking (Wikipedia, PubMed search)
• Question answering systems (chatbots)

3. TF-IDF vs. BM25: Key Differences

| Feature | TF-IDF | BM25 |
| --- | --- | --- |
| Formula | Multiplicative (TF × IDF) | More advanced, with saturation & normalization |
| Handles long documents? | No | Yes (length normalization) |
| Handles frequent terms? | No | Yes (TF saturation) |
| Weighting | Linear scaling | Saturated scaling |
| Customization | No parameters to tune | k₁ and b can be adjusted |
| Use case | Simple text analysis | Advanced search ranking |

4. When to Use Which?
• Use TF-IDF when:
  • You need a quick, simple ranking method.
  • You’re working on small datasets (like keyword extraction).
  • You don’t need to tune ranking parameters.
• Use BM25 when:
  • You need high-quality search rankings.
  • You’re working with large-scale datasets (search engines, knowledge bases).
  • Document length varies significantly.

5. Conclusion
• TF-IDF is simple and effective for basic keyword weighting.
• BM25 improves upon TF-IDF by handling document length variations and term saturation, making it better for real-world search engines.
• BM25 is widely used in modern search engines like Elasticsearch, Solr, and Lucene.

Would you like a code example for TF-IDF or BM25 in Python? 🚀
Why Lucene Uses BM25 Instead of TF-IDF

Apache Lucene is a powerful text search library used in Elasticsearch, Solr, and OpenSearch. It switched from TF-IDF to BM25 as its default ranking function because BM25 provides better search relevance.

1. How Lucene Uses BM25

Lucene uses BM25 to rank search results by computing a score for each document based on how well it matches a given query. The ranking formula is:

score(doc, query) = Σ over t in query of IDF(t) × (TF(t) × (k₁ + 1)) / (TF(t) + k₁ × (1 − b + b × |doc| / avgDocLength))

Where:
• TF(t) → Term frequency (how many times term t appears in a document)
• IDF(t) → Inverse document frequency (importance of t across all documents)
• |doc| → Length of the document
• avgDocLength → Average length of all documents
• k₁ and b → Hyperparameters (default: k₁ = 1.2, b = 0.75)

2. Why Lucene Switched from TF-IDF to BM25

Lucene originally used TF-IDF, but it had limitations in real-world search ranking. The main problems with TF-IDF were:

2.1. Overemphasis on Term Frequency
• TF-IDF problem → If a word appears many times in a document, it gets a very high weight.
• BM25 solution → Uses saturation, meaning extra occurrences of a term add diminishing value.
Example:
• TF-IDF: A document with “apple” 100 times gets 100× the weight of a single occurrence.
• BM25: A document with “apple” 100 times gets only limited extra weight (because of TF saturation).

2.2. No Document Length Normalization in TF-IDF
• TF-IDF problem → With length-relative term frequencies, long documents are unfairly penalized because the same term counts are diluted across more words.
• BM25 solution → Normalizes scores based on document length.
Example:
• A short article (300 words) and a long article (3000 words) might discuss “Lucene” equally.
• TF-IDF favors the short document (high term density).
• BM25 adjusts scores more fairly by normalizing for document length.

2.3. Better Tuning for Different Use Cases
• TF-IDF problem → No tuning parameters.
• BM25 solution → k₁ and b can be adjusted based on the dataset.
Example:
• News search (where long documents matter) → set b = 0.75 (normal length normalization).
• Short document search (tweets, forum posts) → set b = 0.25 (less normalization).

3. Real-World Benefits of BM25 in Lucene

✅ More relevant search results → BM25 ranks important documents better.
✅ Handles long documents better → Fair scoring for long vs. short content.
✅ Reduces bias from frequent words → TF saturation prevents one word from dominating.
✅ More configurable → You can tweak k₁ and b for different datasets.

4. BM25 in Elasticsearch and Solr
• Elasticsearch 5.0+ and Solr 7+ both use BM25 by default (instead of TF-IDF).
• You can still switch back to a TF-IDF-style similarity if needed, but BM25 generally performs better.

Would you like a code example showing BM25 vs. TF-IDF in Python using Lucene-style search? 🚀
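In that spirit, here is a side-by-side sketch in Python (toy scorers built from the formulas in earlier sections, not Lucene's actual similarity classes; the corpus and ratios are made up for illustration):

```python
import math

def rank(query, corpus, use_bm25, k1=1.2, b=0.75):
    """Score each document for a one-term query with TF-IDF or BM25."""
    avg_len = sum(len(d) for d in corpus) / len(corpus)
    df = sum(1 for d in corpus if query in d)
    idf = math.log10(len(corpus) / df)
    out = []
    for doc in corpus:
        if use_bm25:
            tf = doc.count(query)               # BM25 uses the raw count...
            norm = 1 - b + b * (len(doc) / avg_len)
            out.append(idf * (tf * (k1 + 1)) / (tf + k1 * norm))
        else:
            tf = doc.count(query) / len(doc)    # ...TF-IDF a length-relative one
            out.append(tf * idf)
    return out

short = ["lucene"] * 3 + ["filler"] * 27        # 30 words, dense mentions
long_ = ["lucene"] * 10 + ["filler"] * 290      # 300 words, same topic
corpus = [short, long_, ["other"] * 30]

tfidf = rank("lucene", corpus, use_bm25=False)
bm25 = rank("lucene", corpus, use_bm25=True)
# TF-IDF favors the dense short doc ~3x; BM25 narrows the gap to near parity.
print(tfidf[0] / tfidf[1])  # ratio ≈ 3.0
print(bm25[0] / bm25[1])    # ratio ≈ 1.07
```

The narrowing ratio is exactly the length-normalization effect described in section 2.2 above.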
