
Commit 14612f8

Alexandru Dima authored and committed
Make formulas compatible with github
1 parent dc17057 commit 14612f8

2 files changed

Lines changed: 53 additions & 56 deletions

File tree

Fine-Tuning/Decoder Fine-Tuning/README.md

Lines changed: 32 additions & 32 deletions
@@ -27,15 +27,15 @@ Self-attention is a mechanism that enables each token in a sequence to compute a

Given an input sequence represented as embeddings:

-$
+$$
X = [x_1, x_2, \dots, x_n] \in \mathbb{R}^{n \times d}
-$
+$$

Three linear projections are applied:

-$
+$$
Q = XW^Q, \quad K = XW^K, \quad V = XW^V
-$
+$$

where:

@@ -45,11 +45,11 @@ where:

The attention mechanism computes:

-$
+$$
\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
-$
+$$

-* ($QK^T$) computes similarity scores.
+* $QK^T$ computes similarity scores.
* Division by $\sqrt{d_k}$ stabilizes gradients.
* The softmax produces attention weights.
* These weights are used to compute a weighted sum of value vectors.
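A minimal sketch of this computation, assuming PyTorch and illustrative shapes (the function and variable names below are not from the repository):

```python
import torch

def attention(Q, K, V):
    # Q, K, V: [n, d_k] projections of the input X (shapes assumed for illustration)
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # QK^T / sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)          # attention weights
    return weights @ V                               # weighted sum of value vectors
```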
@@ -99,9 +99,9 @@ x4 → [ • • • • ]

The upper-triangular portion is masked (set to $-\infty$ before softmax), ensuring:

-$
+$$
x_t \text{ attends only to } x_{\leq t}
-$
+$$

This produces an autoregressive model suitable for text generation.
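A minimal sketch of such a causal mask, assuming PyTorch and a toy sequence length:

```python
import torch

T = 4
scores = torch.randn(T, T)                                    # raw attention scores
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))              # hide future positions
weights = torch.softmax(scores, dim=-1)                       # row t attends only to x_{<=t}
```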

@@ -114,21 +114,21 @@ A single attention operation may not capture all relevant relational patterns in

Instead of a single set of projections $(W^Q, W^K, W^V)$, we define (h) sets:

-$
+$$
Q_i = XW_i^Q, \quad K_i = XW_i^K, \quad V_i = XW_i^V
-$
+$$

Each head computes:

-$
+$$
\text{head}_i = \text{Attention}(Q_i, K_i, V_i)
-$
+$$

The outputs are concatenated:

-$
+$$
\text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O
-$
+$$

where $W^O$ is a learned projection matrix.
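A compact sketch of multi-head attention, assuming PyTorch and made-up dimensions; the per-head projections are stacked into single linear layers for convenience:

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.n_heads, self.d_k = n_heads, d_model // n_heads
        self.W_q = nn.Linear(d_model, d_model)   # stacks the h projections W_i^Q
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)   # output projection W^O

    def forward(self, X):                                   # X: [batch, n, d_model]
        B, n, _ = X.shape
        split = lambda t: t.view(B, n, self.n_heads, self.d_k).transpose(1, 2)
        Q, K, V = split(self.W_q(X)), split(self.W_k(X)), split(self.W_v(X))
        scores = Q @ K.transpose(-2, -1) / self.d_k ** 0.5  # per-head QK^T / sqrt(d_k)
        heads = torch.softmax(scores, dim=-1) @ V           # head_i = Attention(Q_i, K_i, V_i)
        concat = heads.transpose(1, 2).reshape(B, n, -1)    # Concat(head_1, ..., head_h)
        return self.W_o(concat)

out = MultiHeadAttention()(torch.randn(2, 10, 512))          # example usage
```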

@@ -160,9 +160,9 @@ Each head may learn to specialize in distinct types of relationships (e.g., synt

The training objective of a decoder-only transformer can be formulated as a sequence of classification tasks. At each position $t$, given the previous tokens $x_{<t} = (x_1, \dots, x_{t-1})$, the model predicts a probability distribution over the entire vocabulary $\mathcal{V}$:

-$
+$$
P(x_t \mid x_{<t})
-$
+$$

Since the vocabulary contains $|\mathcal{V}|$ discrete tokens, this prediction corresponds to a **multi-class classification problem**, where:

@@ -188,15 +188,15 @@ Training aims to minimize the **cross-entropy loss** between the predicted distr

For a single position $t$, the loss is:

-$
-\mathcal{L}*t = - \log P*\theta(x_t \mid x_{<t})
-$
+$$
+\mathcal{L}_t = - \log P_\theta(x_t \mid x_{<t})
+$$

-For a sequence of length $ T $:
+For a sequence of length $T$:

-$
+$$
\mathcal{L} = - \sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t})
-$
+$$

This is equivalent to maximizing the likelihood of the training data (Maximum Likelihood Estimation).
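A minimal sketch of this objective, assuming PyTorch and illustrative shapes (here averaged per token rather than summed):

```python
import torch
import torch.nn.functional as F

B, T, V = 2, 8, 100                       # batch, sequence length, vocab size (assumed)
logits = torch.randn(B, T, V)             # model scores for P_theta(x_t | x_<t)
targets = torch.randint(0, V, (B, T))     # ground-truth next tokens x_t

# -log P_theta(x_t | x_<t) accumulated over all positions
loss = F.cross_entropy(logits.view(-1, V), targets.view(-1))
```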

@@ -216,7 +216,7 @@ Conceptually:
* Input to predict token 3 is tokens 1–2.
* And so forth.

-Although the full sequence is fed into the model in parallel, causal masking ensures that position $ t $ only attends to positions $ < t $.
+Although the full sequence is fed into the model in parallel, causal masking ensures that position $t$ only attends to positions $<t$.

---

@@ -237,20 +237,20 @@ Example:

**Input tokens:**

-$
+$$
[tok_1, tok_2, \dots, tok_n, tok_{n+1}, \dots, tok_T]
-$
+$$

where:

-* $ tok_1 \dots tok_n $ = prompt
-* $ tok_{n+1} \dots tok_T $ = target completion
+* $tok_1 \dots tok_n$ = prompt
+* $tok_{n+1} \dots tok_T$ = target completion

**Labels:**

-$
+$$
[-100, -100, \dots, -100, tok_{n+1}, \dots, tok_T]
-$
+$$

Here:

@@ -259,9 +259,9 @@ Here:

Thus, the loss becomes:

-$
+$$
\mathcal{L} = - \sum_{t=n+1}^{T} \log P_\theta(x_t \mid x_{<t})
-$
+$$

The model still *conditions on the prompt*, but it is only penalized for errors in generating the response.
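A minimal sketch of this label masking, assuming PyTorch and made-up token ids; positions labeled `-100` are ignored by the cross-entropy loss:

```python
import torch
import torch.nn.functional as F

prompt_ids = [101, 2054, 2003]             # tok_1 .. tok_n (assumed ids)
completion_ids = [1996, 3437, 102]         # tok_{n+1} .. tok_T (assumed ids)

input_ids = torch.tensor(prompt_ids + completion_ids)
labels = torch.tensor([-100] * len(prompt_ids) + completion_ids)

V = 30522                                   # vocab size (assumed)
logits = torch.randn(len(input_ids), V)     # stand-in for the model's outputs
# ignore_index skips the prompt positions, so only the completion is penalized
loss = F.cross_entropy(logits, labels, ignore_index=-100)
```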

Fine-Tuning/Encoder Fine-Tuning/README.md

Lines changed: 21 additions & 24 deletions
@@ -30,63 +30,61 @@ Unlike decoder-only transformers, which are almost always trained with next-toke
An encoder-only transformer can be viewed as a **general-purpose feature extractor** that produces contextualized token representations. On top of this shared backbone, different **task-specific heads** can be attached.

Given an input sequence
-$
+$$
x = (x_1, \dots, x_T),
-$
+$$
the encoder produces hidden states:

-$
+$$
H = (h_1, \dots, h_T).
-$
+$$

These representations can then be fed into one or more classification heads, depending on the task.

#### (a) Sequence-Level Classification

-A special token (e.g., [CLS]) is prepended to the sequence. Its final hidden representation $ h_{\text{CLS}} $ serves as an aggregate summary of the input. This representation can then be fed into a classification head. Alternatively, instead of relying solely on the `[CLS]` token, one may construct the input to the classification head using a learned linear combination (or pooling) of all token representations in the final layer.
+A special token (e.g., [CLS]) is prepended to the sequence. Its final hidden representation $h_{\text{CLS}}$ serves as an aggregate summary of the input. This representation can then be fed into a classification head. Alternatively, instead of relying solely on the `[CLS]` token, one may construct the input to the classification head using a learned linear combination (or pooling) of all token representations in the final layer.

A classification head (typically a linear layer + softmax) predicts:

-$
+$$
P_\theta(y \mid x)
-$
+$$

The loss is standard cross-entropy:

-$
+$$
\mathcal{L} = - \log P_\theta(y \mid x)
-$
+$$

This setup is used for tasks such as sentiment analysis, topic classification, or natural language inference.
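A minimal sketch of such a head, assuming PyTorch and illustrative dimensions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_size, num_labels = 768, 2                   # assumed dimensions
H = torch.randn(1, 12, hidden_size)                 # encoder outputs (h_1, ..., h_T)

classifier = nn.Linear(hidden_size, num_labels)     # sequence-level classification head
h_cls = H[:, 0]                                      # hidden state of the prepended [CLS] token
logits = classifier(h_cls)                           # scores for P_theta(y | x)
loss = F.cross_entropy(logits, torch.tensor([1]))    # -log P_theta(y | x) for gold label y
```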

#### (b) Token-Level Classification

For tasks like named entity recognition or part-of-speech tagging, a classification head is applied **independently to each token representation**:
-
-$
+$$
P_\theta(y_t \mid x)
-$
+$$

The total loss becomes:

-$
+$$
\mathcal{L} = - \sum_{t=1}^{T} \log P_\theta(y_t \mid x)
-$
+$$

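Again as a sketch with assumed names and shapes: the same linear head applied at every position, with the loss summed over tokens:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_size, num_tags, T = 768, 9, 12        # assumed dimensions
H = torch.randn(1, T, hidden_size)            # one hidden state h_t per token
tag_head = nn.Linear(hidden_size, num_tags)

logits = tag_head(H)                           # [1, T, num_tags]: a prediction per token
labels = torch.randint(0, num_tags, (1, T))    # gold tags y_t
loss = F.cross_entropy(logits.view(-1, num_tags), labels.view(-1), reduction="sum")
```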
#### (c) Multi-Task Learning with Multiple Heads

A single encoder backbone can simultaneously support multiple objectives by attaching **multiple classification heads**, each with its own loss:
-
-$
+$$
\mathcal{L} = \sum_i \lambda_i \mathcal{L}_i
-$
+$$

where:

-* $ \mathcal{L}_i $ is the loss for task $ i $,
-* $ \lambda_i $ controls the relative importance of each task.
+* $\mathcal{L}_i$ is the loss for task $i$,
+* $\lambda_i$ controls the relative importance of each task.

In this setting, the encoder learns shared representations that transfer across tasks, while each head specializes in a particular prediction problem.
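A minimal sketch of the weighted combination, with placeholder per-task losses and assumed weights:

```python
import torch

# Per-task losses produced by their own heads (random stand-ins here)
task_losses = {"sentiment": torch.tensor(0.7), "ner": torch.tensor(1.2)}
task_weights = {"sentiment": 1.0, "ner": 0.5}    # the lambda_i coefficients

total_loss = sum(task_weights[k] * task_losses[k] for k in task_losses)
```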

@@ -107,16 +105,15 @@ Instead of adding extra classification heads over the transformer backbone, we
> "Question: How many planets are in our solar system? Answer: [MASK] planets."

The encoder processes the entire sequence simultaneously, attending to both the question and the unmasked parts of the answer. The objective is to predict:
-
-$
+$$
P(x_t \mid x_{\setminus \mathcal{M}}), \quad t \in \mathcal{M}
-$
+$$

The loss is computed only on the masked answer tokens:

-$
+$$
\mathcal{L} = - \sum_{t \in \mathcal{M}} \log P_\theta(x_t \mid x_{\setminus \mathcal{M}})
-$
+$$
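A minimal sketch of this masked-token loss, assuming a BERT-style model from Hugging Face Transformers and an illustrative gold answer ("eight"):

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

text = "Question: How many planets are in our solar system? Answer: [MASK] planets."
inputs = tokenizer(text, return_tensors="pt")

# Labels: -100 everywhere except the masked position, which carries the gold token
labels = torch.full_like(inputs["input_ids"], -100)
mask_positions = inputs["input_ids"] == tokenizer.mask_token_id
labels[mask_positions] = tokenizer.convert_tokens_to_ids("eight")

outputs = model(**inputs, labels=labels)
loss = outputs.loss    # cross-entropy over the masked token(s) only
```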

In this formulation:
