Commit dc4eb76

Alexandru Dima authored and committed
Fixed formulas rendering
1 parent 2bd232e commit dc4eb76

2 files changed

Lines changed: 9 additions & 10 deletions

Fine-Tuning/Decoder Fine-Tuning/README.md

Lines changed: 5 additions & 5 deletions
@@ -158,10 +158,10 @@ Each head may learn to specialize in distinct types of relationships (e.g., synt
  ### 1. Next-Token Prediction as a Classification Problem

- The training objective of a decoder-only transformer can be formulated as a sequence of classification tasks. At each position $t$, given the previous tokens $x_{<t} = (x_1, \dots, x_{t-1})$, the model predicts a probability distribution over the entire vocabulary $\mathcal{V}$:
+ The training objective of a decoder-only transformer can be formulated as a sequence of classification tasks. At each position $t$, given the previous tokens $(x_1, \dots, x_{t-1})$, the model predicts a probability distribution over the entire vocabulary $\mathcal{V}$:

  $$
- P(x_t \mid x_{<t})
+ P(x_t \mid x_1, \dots, x_{t-1})
  $$

  Since the vocabulary contains $|\mathcal{V}|$ discrete tokens, this prediction corresponds to a **multi-class classification problem**, where:
@@ -189,13 +189,13 @@ Training aims to minimize the **cross-entropy loss** between the predicted distr
  For a single position $t$, the loss is:

  $$
- \mathcal{L}_t = - \log P_\theta(x_t \mid x_{<t})
+ \mathcal{L}_t = - \log P_\theta(x_t \mid x_1, \dots, x_{t-1})
  $$

  For a sequence of length $T$:

  $$
- \mathcal{L} = - \sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t})
+ \mathcal{L} = - \sum_{t=1}^{T} \log P_\theta(x_t \mid x_1, \dots, x_{t-1})
  $$

  This is equivalent to maximizing the likelihood of the training data (Maximum Likelihood Estimation).
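The summed negative log-likelihood can be sketched in plain Python. This is a toy illustration with hand-picked per-position distributions, not a real model; the helper name `sequence_nll` and the tiny vocabulary are made up for the example:

```python
import math

def sequence_nll(stepwise_probs, targets):
    """Negative log-likelihood of a token sequence.

    stepwise_probs[t] is the model's distribution over the vocabulary
    at position t, conditioned on x_1, ..., x_{t-1};
    targets[t] is the index of the observed token x_t.
    """
    return -sum(math.log(dist[tok]) for dist, tok in zip(stepwise_probs, targets))

# Toy example: vocabulary of 3 tokens, sequence of length T = 2.
stepwise_probs = [
    [0.7, 0.2, 0.1],  # distribution at the first position
    [0.1, 0.8, 0.1],  # P(x_2 | x_1)
]
targets = [0, 1]      # the observed tokens x_1, x_2
loss = sequence_nll(stepwise_probs, targets)
# Minimizing this sum of -log terms maximizes the product of the
# per-step probabilities, i.e. the likelihood of the sequence (MLE).
```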
@@ -260,7 +260,7 @@ Here:
  Thus, the loss becomes:

  $$
- \mathcal{L} = - \sum_{t=n+1}^{T} \log P_\theta(x_t \mid x_{<t})
+ \mathcal{L} = - \sum_{t=n+1}^{T} \log P_\theta(x_t \mid x_1, \dots, x_{t-1})
  $$

  The model still *conditions on the prompt*, but it is only penalized for errors in generating the response.
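The prompt-masked sum $t = n+1, \dots, T$ can be sketched the same way. Again a toy example with hand-picked probabilities; `prompt_len` plays the role of the prompt length $n$:

```python
import math

def response_nll(stepwise_probs, targets, prompt_len):
    """Sum -log P(x_t | x_1, ..., x_{t-1}) only over positions t > n.

    The model still conditions on the prompt tokens (they shape the
    later distributions), but the first `prompt_len` positions
    contribute nothing to the loss.
    """
    pairs = list(zip(stepwise_probs, targets))[prompt_len:]
    return -sum(math.log(dist[tok]) for dist, tok in pairs)

# Toy sequence of length T = 3 where the first token is the prompt (n = 1).
stepwise_probs = [
    [0.9, 0.1],  # prompt position: ignored by the loss
    [0.2, 0.8],  # response position
    [0.6, 0.4],  # response position
]
targets = [0, 1, 0]
loss = response_nll(stepwise_probs, targets, prompt_len=1)
```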

Fine-Tuning/Encoder Fine-Tuning/README.md

Lines changed: 4 additions & 5 deletions
@@ -29,11 +29,7 @@ Unlike decoder-only transformers, which are almost always trained with next-toke
  An encoder-only transformer can be viewed as a **general-purpose feature extractor** that produces contextualized token representations. On top of this shared backbone, different **task-specific heads** can be attached.

- Given an input sequence
- $$
- x = (x_1, \dots, x_T),
- $$
- the encoder produces hidden states:
+ Given an input sequence $x = (x_1, \dots, x_T)$, the encoder produces hidden states:

  $$
  H = (h_1, \dots, h_T).
@@ -64,6 +60,7 @@ This setup is used for tasks such as sentiment analysis, topic classification, o
  #### (b) Token-Level Classification

  For tasks like named entity recognition or part-of-speech tagging, a classification head is applied **independently to each token representation**:
+
  $$
  P_\theta(y_t \mid x)
  $$
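The per-token head amounts to one shared linear layer plus softmax applied to each hidden state $h_t$ in turn. A minimal sketch with toy dimensions; the parameters `W` and `b` and the two-label setup are hypothetical:

```python
import math

def softmax(logits):
    # Numerically stable softmax over a small list of logits.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def token_level_head(hidden_states, W, b):
    """Apply the same classification head independently to every h_t.

    hidden_states: T vectors (the encoder outputs h_1, ..., h_T)
    W: num_labels x hidden_dim weight matrix, b: num_labels biases
    Returns one distribution P(y_t | x) per token.
    """
    out = []
    for h in hidden_states:
        logits = [sum(w_i * h_i for w_i, h_i in zip(row, h)) + b_k
                  for row, b_k in zip(W, b)]
        out.append(softmax(logits))
    return out

# Toy setup: T = 2 tokens, hidden_dim = 2, num_labels = 2 (e.g. O vs ENTITY).
H = [[1.0, 0.0], [0.0, 1.0]]
W = [[2.0, 0.0], [0.0, 2.0]]
b = [0.0, 0.0]
probs = token_level_head(H, W, b)  # one distribution per token
```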
@@ -77,6 +74,7 @@ $$
  #### (c) Multi-Task Learning with Multiple Heads

  A single encoder backbone can simultaneously support multiple objectives by attaching **multiple classification heads**, each with its own loss:
+
  $$
  \mathcal{L} = \sum_i \lambda_i \mathcal{L}_i
  $$
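The combined objective is just a $\lambda$-weighted sum of the per-head losses, all backpropagated through the shared encoder. A minimal sketch; the task names and weight values are made up for illustration:

```python
# Per-head losses L_i, e.g. from a sentence-level head and a token-level head
# attached to the same encoder backbone.
head_losses = {"sentiment": 0.80, "ner": 1.20}

# Weights lambda_i trade the tasks off against each other.
lambdas = {"sentiment": 1.0, "ner": 0.5}

# L = sum_i lambda_i * L_i
total_loss = sum(lambdas[name] * loss for name, loss in head_losses.items())
```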
@@ -105,6 +103,7 @@ Instead of adding extra classification heads over the transformer backbone, we
  > "Question: How many planets are in our solar system? Answer: [MASK] planets."

  The encoder processes the entire sequence simultaneously, attending to both the question and the unmasked parts of the answer. The objective is to predict:
+
  $$
  P(x_t \mid x_{\setminus \mathcal{M}}), \quad t \in \mathcal{M}
  $$
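The masked objective is again a negative log-likelihood, but summed only over the masked positions $\mathcal{M}$, with each distribution conditioned on the unmasked tokens. A toy sketch (the candidate vocabulary and probabilities are made up; index 0 stands for the token "Eight" behind the `[MASK]`):

```python
import math

def masked_nll(probs_at_masked, targets):
    """Sum -log P(x_t | x_without_M) over the masked positions t in M.

    probs_at_masked: for each masked position, the encoder's distribution
    over the vocabulary given only the *unmasked* part of the sequence.
    targets: the original tokens hidden behind each [MASK].
    """
    return -sum(math.log(dist[tok]) for dist, tok in zip(probs_at_masked, targets))

# Toy case: one masked position ("Answer: [MASK] planets."),
# 3 candidate tokens with index 0 = "Eight".
probs_at_masked = [[0.9, 0.05, 0.05]]
targets = [0]
loss = masked_nll(probs_at_masked, targets)
```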
