Object stores typically do not use B-trees the way databases do. Instead, they rely on hash-based indexing or LSM-trees (Log-Structured Merge Trees), depending on the use case. Here’s why:

1. Hash-Based Indexing (Common for Object Stores)
    • Object stores like Amazon S3, MinIO, Ceph, and Swift are built on distributed key-value layers where objects are located by a unique key (often a hash of the object name).
    • Hash maps allow fast lookups (O(1) average complexity) but do not support range queries efficiently.
    • Metadata (object name, size, timestamps) is often kept in a separate metadata store, which may be a database such as PostgreSQL or MySQL, or an embedded key-value store.
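As a minimal sketch of the idea, the snippet below hashes an object key to pick a storage node, which gives O(1) routing but no meaningful key ordering. The node names and the `node_for` helper are made up for illustration; real systems typically use consistent hashing or placement algorithms like Ceph's CRUSH.

```python
import hashlib

# Hypothetical cluster: node names are made up for this sketch.
NODES = ["node-a", "node-b", "node-c"]

def node_for(key: str) -> str:
    # Hash the object key and map it onto one of the nodes.
    # Lookup is O(1), but there is no efficient way to scan a key range.
    digest = hashlib.sha256(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]
```

Because the placement depends only on the key's hash, any client can compute it independently, with no central index to consult.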

2. LSM-Trees (Used in Some Object Stores)
    • Some object stores use LSM-Trees (Log-Structured Merge Trees) to absorb high write throughput efficiently.
    • Examples: RocksDB and LevelDB, embedded key-value stores used in systems like Ceph’s BlueStore.

3. B-Trees (More Common in Databases)
    • B+ trees are optimized for range queries and ordered data, which are common needs in databases but not in object stores.
    • Object stores generally don’t need range queries over keys because they retrieve objects by unique identifier.

When Might Object Stores Use B-Trees?

Some hybrid systems or metadata layers within object stores do use B-Trees:
    • Metadata storage: if the metadata layer is implemented in a traditional database (PostgreSQL, MySQL, etc.), it likely uses B+ trees for indexing.
    • Filesystems underneath the object store: if the object store is built on top of a filesystem (e.g., XFS, ext4), that filesystem may use B-tree variants internally.

Conclusion
    • B-Trees excel in databases that need ordered key access and range queries.
    • Object stores prioritize fast lookups and scalability, favoring hash maps or LSM-Trees.
    • Object store metadata layers may still use B-Trees when they rely on traditional databases.

Are you thinking about designing your own object store?

Internals of LSM-Tree vs. B-Tree: A Deep Dive

Both LSM-Trees (Log-Structured Merge-Trees) and B-Trees index data, but they are optimized for different workloads. Understanding their internals helps in choosing the right one for a database, key-value store, or file system.

1. Internals of B-Tree

A B-Tree is a self-balancing tree data structure optimized for disk-based storage (the “B” is commonly read as “balanced”). It maintains sorted keys and supports efficient point lookups, range queries, insertions, and deletions.

1.1. Structure of a B-Tree
    • A B-Tree of order d (degree d) has:
        • a root node (which may hold fewer keys),
        • internal nodes (branching factor of up to 2d+1),
        • leaf nodes (containing actual data or pointers to data).
    • Each node holds up to 2d keys and up to 2d+1 child pointers (exact conventions for “order” vary between textbooks).
    • Keys within a node are sorted, making binary search inside a node possible.
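The node shape described above can be sketched in a few lines. This is a toy representation, not a production layout (real implementations pack keys and child offsets into fixed-size disk pages); the class and field names are invented for illustration.

```python
class BTreeNode:
    """Toy B-Tree node of 'order' d: up to 2d sorted keys, up to 2d+1 children."""

    def __init__(self, d: int, leaf: bool = True):
        self.d = d            # branching parameter
        self.keys = []        # kept sorted; at most 2*d keys
        self.children = []    # at most 2*d + 1 child nodes; empty in a leaf
        self.leaf = leaf

    def is_full(self) -> bool:
        # A full node triggers a split on insertion (see 1.2.2 below).
        return len(self.keys) == 2 * self.d
```

On disk, one such node is sized to match a page (e.g., 4 KB), so following one child pointer costs one disk read.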

1.2. Operations in B-Tree

1.2.1. Search (O(log N))
    • Start at the root and perform binary search within the node.
    • If the key is found, return it.
    • If not, follow the appropriate child pointer and repeat.
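The three steps above map directly onto a short recursive function. This sketch uses a plain dict per node (sorted `keys`, plus `children` for internal nodes) rather than a real on-disk page; `btree_search` is an illustrative name, not a standard API.

```python
import bisect

def btree_search(node, key) -> bool:
    # Binary search within the current node.
    i = bisect.bisect_left(node["keys"], key)
    if i < len(node["keys"]) and node["keys"][i] == key:
        return True                                # found in this node
    if not node["children"]:
        return False                               # reached a leaf: not present
    return btree_search(node["children"][i], key)  # descend the correct child
```

Each recursive step descends one level, so the number of node visits (and disk reads) is proportional to the tree height, O(log N).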

1.2.2. Insertion (O(log N))
    1. Search for the leaf node where the key belongs.
    2. If the leaf has space, insert the key.
    3. If the leaf node is full:
        • Split the node into two.
        • Push the middle key up to the parent.
        • If the parent is also full, repeat the split upwards (recursively).
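The core of step 3 is the split itself: a full set of keys becomes two half-full siblings plus one middle key that is promoted to the parent. The sketch below shows just that step on a plain sorted list; `split_leaf` is a hypothetical helper, and parent bookkeeping is omitted.

```python
def split_leaf(keys):
    """Split a full, sorted key list into (left_half, promoted_middle, right_half)."""
    mid = len(keys) // 2
    # The middle key moves up to the parent; the halves become two sibling nodes.
    return keys[:mid], keys[mid], keys[mid + 1:]
```

Because each split promotes exactly one key, a chain of splits can propagate to the root; splitting the root is the only way a B-Tree grows in height, which keeps it balanced.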

1.2.3. Deletion (O(log N))
    1. Find the key in the leaf node and remove it.
    2. If removal causes an underflow (too few keys), borrow a key from a sibling.
    3. If borrowing is not possible, merge the node with a sibling.
    4. If the parent becomes underfilled, merge upwards recursively.
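Step 2, borrowing from a sibling, is a small rotation through the parent: the separator key moves down into the underfull node, and the sibling's nearest key moves up to replace it. This heavily simplified sketch operates on bare key lists; `borrow_from_left` is an invented name, and child-pointer handling is omitted.

```python
def borrow_from_left(separator, left_sibling, node):
    """Rotate one key through the parent: returns the new separator key."""
    promoted = left_sibling.pop()   # sibling's largest key moves up to the parent
    node.insert(0, separator)       # old separator moves down into the poor node
    return promoted
```

After the rotation both nodes satisfy the minimum-occupancy invariant again, with no change to the tree's height.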

1.3. Characteristics of B-Tree
    • Disk-efficient: minimizes disk reads by keeping nodes large (typically 4 KB, matching disk page sizes).
    • Well suited to range queries thanks to the sorted structure.
    • Balanced: guarantees O(log N) time for search, insertion, and deletion.
    • Mutable: supports in-place updates without rewriting the whole structure.

2. Internals of LSM-Tree (Log-Structured Merge Tree)

The LSM-Tree is optimized for high write throughput by deferring and batching writes instead of modifying on-disk structures in place.

2.1. Structure of LSM-Tree

Instead of modifying data in place like a B-Tree, an LSM-Tree follows an append-oriented strategy with multiple levels of sorted structures:
    1. MemTable (memory table)
        • An in-memory sorted data structure (usually a red-black tree or skip list).
        • Writes go here first.
        • Fast inserts, but limited in size.
    2. SSTables (Sorted String Tables) on disk
        • When the MemTable fills up, it is flushed to disk as an immutable sorted file (SSTable).
        • SSTables are sorted, which allows efficient range scans within a table.
    3. Compaction
        • Multiple SSTables are periodically merged (compacted) into larger SSTables, discarding old versions of keys.
        • This reduces read amplification and reclaims space.

2.2. Operations in LSM-Tree

2.2.1. Insertion (fast; amortized sequential I/O)
    1. Write to the MemTable (in-memory; O(log M) in a skip list, where M is the MemTable size).
    2. When the MemTable is full, it is flushed to disk as an SSTable in one sequential write.
    3. Periodic compaction merges SSTables to keep reads efficient.
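The write path can be condensed into a toy class. Here a plain dict stands in for the sorted MemTable, a sorted list of pairs stands in for an SSTable file, and the `TinyLSM` name and flush threshold are made up; real engines such as RocksDB add a write-ahead log for durability before the MemTable write.

```python
MEMTABLE_LIMIT = 4  # made-up flush threshold; real engines use megabytes

class TinyLSM:
    def __init__(self):
        self.memtable = {}   # stands in for a skip list / red-black tree
        self.sstables = []   # newest first; each is an immutable sorted list of pairs

    def put(self, key, value):
        self.memtable[key] = value           # fast in-memory write
        if len(self.memtable) >= MEMTABLE_LIMIT:
            self.flush()

    def flush(self):
        # One sequential "disk write": the MemTable becomes an immutable sorted run.
        self.sstables.insert(0, sorted(self.memtable.items()))
        self.memtable = {}
```

Note that `put` never touches existing on-disk data; all disk I/O happens in large sequential batches, which is exactly where the write-throughput advantage comes from.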

2.2.2. Search (O(log N), or worse with many SSTables)
    1. Check the MemTable first (fast, in-memory).
    2. If not found, search recent SSTables on disk.
    3. If still not found, search older SSTables.
    4. Bloom filters are used to skip SSTables that cannot contain the key.
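A condensed read path following those four steps might look like the sketch below. A plain set of keys stands in for the Bloom filter (a real Bloom filter answers “possibly present” with some false positives but no false negatives), and the `lsm_get` name and run layout are invented for illustration.

```python
def lsm_get(memtable, sstables, key):
    # Step 1: the MemTable always has the freshest value.
    if key in memtable:
        return memtable[key]
    # Steps 2-3: scan runs from newest to oldest; first hit wins.
    for run in sstables:
        # Step 4: skip runs whose filter says the key cannot be there.
        if key not in run["filter"]:
            continue
        for k, v in run["data"]:   # sorted; a real engine binary-searches here
            if k == key:
                return v
    return None
```

Searching newest-first matters for correctness: an older SSTable may still hold a stale value for the same key, superseded by a newer run that compaction has not yet merged away.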

2.2.3. Deletion (Tombstones)
    1. Instead of deleting immediately, a tombstone (delete marker) is written.
    2. The actual data is removed later during compaction.
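The interplay between tombstones and compaction can be shown in a few lines. Each run is represented as a dict, with runs ordered newest-first as above; the `TOMBSTONE` sentinel and `compact` helper are invented for this sketch, and real compaction streams sorted files rather than merging dicts in memory.

```python
TOMBSTONE = object()  # sentinel delete marker written in place of a value

def compact(runs):
    """Merge runs (newest first) into one sorted run, dropping deleted keys."""
    merged = {}
    for run in reversed(runs):   # apply oldest first so newer entries overwrite
        merged.update(run)
    # Tombstoned keys are physically removed only now, during compaction.
    return sorted((k, v) for k, v in merged.items() if v is not TOMBSTONE)
```

Until compaction runs, the tombstone must be retained: dropping it early would let the stale value in an older SSTable “resurrect” the deleted key.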

2.3. Characteristics of LSM-Tree
    • Optimized for high write throughput (batched, append-only writes).
    • Immutable SSTables avoid in-place fragmentation; all disk writes are large and sequential.
    • Compaction keeps read latency in check but adds background CPU and I/O work, which is itself a source of write amplification.
    • Higher read amplification than B-Trees (a lookup may touch multiple SSTables).

3. B-Tree vs. LSM-Tree: Key Differences

| Feature | B-Tree | LSM-Tree |
|---|---|---|
| Write speed | Slower (in-place updates, multiple disk I/Os) | Faster (writes to MemTable, append-only SSTables) |
| Read speed | Faster (single O(log N) path) | Slower (may scan multiple SSTables) |
| Range queries | Efficient (sorted, contiguous leaves) | Less efficient (data spread across SSTables) |
| Disk usage | More fragmentation (frequent in-place updates) | More compact (compaction removes old versions) |
| Compaction overhead | None | Background compaction (CPU, I/O overhead) |
| Write amplification | Page-sized rewrites even for small updates | Batched sequential writes, though compaction re-writes data |
| Read amplification | Lower (fewer disk reads per lookup) | Higher (may read multiple SSTables) |
| Best for | Balanced workloads (mix of reads and writes) | Write-heavy workloads (logging, key-value stores) |

4. When to Use Which?

| Use case | Best choice |
|---|---|
| Relational databases (PostgreSQL, MySQL, etc.) | B-Tree (transactions, indexing, range queries) |
| Key-value stores (RocksDB, LevelDB, etc.) | LSM-Tree (handles high write throughput efficiently) |
| File systems (XFS, NTFS, ext4, etc.) | B-Tree variants (random access and metadata storage) |
| Log-structured stores (Cassandra, ScyllaDB, HBase, etc.) | LSM-Tree (high write performance and durability) |
| Distributed databases (Bigtable, CockroachDB, etc.) | LSM-Tree (optimized for distributed writes and merges) |

5. Conclusion
    • B-Trees are best when reads and writes are balanced, offering low-latency reads and efficient range scans.
    • LSM-Trees are best when writes dominate, using batching and compaction to optimize disk usage.
    • Many modern systems combine the two, e.g., B-Trees for metadata and LSM-Trees for logs.

Would you like a deeper dive into a specific area, such as compaction strategies, optimizations, or real-world implementations? 🚀