Commit dc4eb76

Alexandru Dima authored and committed
Fixed formulas rendering
1 parent 2bd232e commit dc4eb76

2 files changed

Lines changed: 9 additions & 10 deletions

Fine-Tuning/Decoder Fine-Tuning/README.md

Lines changed: 5 additions & 5 deletions
@@ -158,10 +158,10 @@ Each head may learn to specialize in distinct types of relationships (e.g., synt
  ### 1. Next-Token Prediction as a Classification Problem

- The training objective of a decoder-only transformer can be formulated as a sequence of classification tasks. At each position $t$, given the previous tokens $x_{<t} = (x_1, \dots, x_{t-1})$, the model predicts a probability distribution over the entire vocabulary $\mathcal{V}$:
+ The training objective of a decoder-only transformer can be formulated as a sequence of classification tasks. At each position $t$, given the previous tokens $(x_1, \dots, x_{t-1})$, the model predicts a probability distribution over the entire vocabulary $\mathcal{V}$:

  $$
- P(x_t \mid x_{<t})
+ P(x_t \mid x_1, \dots, x_{t-1})
  $$

  Since the vocabulary contains $|\mathcal{V}|$ discrete tokens, this prediction corresponds to a **multi-class classification problem**, where:
@@ -189,13 +189,13 @@ Training aims to minimize the **cross-entropy loss** between the predicted distr
  For a single position $t$, the loss is:

  $$
- \mathcal{L}_t = - \log P_\theta(x_t \mid x_{<t})
+ \mathcal{L}_t = - \log P_\theta(x_t \mid x_1, \dots, x_{t-1})
  $$

  For a sequence of length $T$:

  $$
- \mathcal{L} = - \sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t})
+ \mathcal{L} = - \sum_{t=1}^{T} \log P_\theta(x_t \mid x_1, \dots, x_{t-1})
  $$

  This is equivalent to maximizing the likelihood of the training data (Maximum Likelihood Estimation).
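The summed negative log-likelihood can be sketched in plain Python. This is a toy illustration with hand-picked per-position distributions, not a real model; the helper name `sequence_nll` and the tiny vocabulary are made up for the example:

```python
import math

def sequence_nll(stepwise_probs, targets):
    """Negative log-likelihood of a token sequence.

    stepwise_probs[t] is the model's distribution over the vocabulary
    at position t, conditioned on x_1, ..., x_{t-1};
    targets[t] is the index of the observed token x_t.
    """
    return -sum(math.log(dist[tok]) for dist, tok in zip(stepwise_probs, targets))

# Toy example: vocabulary of 3 tokens, sequence of length T = 2.
stepwise_probs = [
    [0.7, 0.2, 0.1],  # distribution at the first position
    [0.1, 0.8, 0.1],  # P(x_2 | x_1)
]
targets = [0, 1]      # the observed tokens x_1, x_2
loss = sequence_nll(stepwise_probs, targets)
# Minimizing this sum of -log terms maximizes the product of the
# per-step probabilities, i.e. the likelihood of the sequence (MLE).
```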
@@ -260,7 +260,7 @@ Here:
  Thus, the loss becomes:

  $$
- \mathcal{L} = - \sum_{t=n+1}^{T} \log P_\theta(x_t \mid x_{<t})
+ \mathcal{L} = - \sum_{t=n+1}^{T} \log P_\theta(x_t \mid x_1, \dots, x_{t-1})
  $$

  The model still *conditions on the prompt*, but it is only penalized for errors in generating the response.
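The prompt-masked sum $t = n+1, \dots, T$ can be sketched the same way. Again a toy example with hand-picked probabilities; `prompt_len` plays the role of the prompt length $n$:

```python
import math

def response_nll(stepwise_probs, targets, prompt_len):
    """Sum -log P(x_t | x_1, ..., x_{t-1}) only over positions t > n.

    The model still conditions on the prompt tokens (they shape the
    later distributions), but the first `prompt_len` positions
    contribute nothing to the loss.
    """
    pairs = list(zip(stepwise_probs, targets))[prompt_len:]
    return -sum(math.log(dist[tok]) for dist, tok in pairs)

# Toy sequence of length T = 3 where the first token is the prompt (n = 1).
stepwise_probs = [
    [0.9, 0.1],  # prompt position: ignored by the loss
    [0.2, 0.8],  # response position
    [0.6, 0.4],  # response position
]
targets = [0, 1, 0]
loss = response_nll(stepwise_probs, targets, prompt_len=1)
```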

Fine-Tuning/Encoder Fine-Tuning/README.md

Lines changed: 4 additions & 5 deletions
@@ -29,11 +29,7 @@ Unlike decoder-only transformers, which are almost always trained with next-toke
  An encoder-only transformer can be viewed as a **general-purpose feature extractor** that produces contextualized token representations. On top of this shared backbone, different **task-specific heads** can be attached.

- Given an input sequence
- $$
- x = (x_1, \dots, x_T),
- $$
- the encoder produces hidden states:
+ Given an input sequence $x = (x_1, \dots, x_T)$, the encoder produces hidden states:

  $$
  H = (h_1, \dots, h_T).
@@ -64,6 +60,7 @@ This setup is used for tasks such as sentiment analysis, topic classification, o
  #### (b) Token-Level Classification

  For tasks like named entity recognition or part-of-speech tagging, a classification head is applied **independently to each token representation**:
+
  $$
  P_\theta(y_t \mid x)
  $$
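The per-token head amounts to one shared linear layer plus softmax applied to each hidden state $h_t$ in turn. A minimal sketch with toy dimensions; the parameters `W` and `b` and the two-label setup are hypothetical:

```python
import math

def softmax(logits):
    # Numerically stable softmax over a small list of logits.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def token_level_head(hidden_states, W, b):
    """Apply the same classification head independently to every h_t.

    hidden_states: T vectors (the encoder outputs h_1, ..., h_T)
    W: num_labels x hidden_dim weight matrix, b: num_labels biases
    Returns one distribution P(y_t | x) per token.
    """
    out = []
    for h in hidden_states:
        logits = [sum(w_i * h_i for w_i, h_i in zip(row, h)) + b_k
                  for row, b_k in zip(W, b)]
        out.append(softmax(logits))
    return out

# Toy setup: T = 2 tokens, hidden_dim = 2, num_labels = 2 (e.g. O vs ENTITY).
H = [[1.0, 0.0], [0.0, 1.0]]
W = [[2.0, 0.0], [0.0, 2.0]]
b = [0.0, 0.0]
probs = token_level_head(H, W, b)  # one distribution per token
```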
@@ -77,6 +74,7 @@ $$
  #### (c) Multi-Task Learning with Multiple Heads

  A single encoder backbone can simultaneously support multiple objectives by attaching **multiple classification heads**, each with its own loss:
+
  $$
  \mathcal{L} = \sum_i \lambda_i \mathcal{L}_i
  $$
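The combined objective is just a $\lambda$-weighted sum of the per-head losses, all backpropagated through the shared encoder. A minimal sketch; the task names and weight values are made up for illustration:

```python
# Per-head losses L_i, e.g. from a sentence-level head and a token-level head
# attached to the same encoder backbone.
head_losses = {"sentiment": 0.80, "ner": 1.20}

# Weights lambda_i trade the tasks off against each other.
lambdas = {"sentiment": 1.0, "ner": 0.5}

# L = sum_i lambda_i * L_i
total_loss = sum(lambdas[name] * loss for name, loss in head_losses.items())
```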
@@ -105,6 +103,7 @@ Instead of adding extra classification heads over the transformer backbone, we
  > "Question: How many planets are in our solar system? Answer: [MASK] planets."

  The encoder processes the entire sequence simultaneously, attending to both the question and the unmasked parts of the answer. The objective is to predict:
+
  $$
  P(x_t \mid x_{\setminus \mathcal{M}}), \quad t \in \mathcal{M}
  $$
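The masked objective is again a negative log-likelihood, but summed only over the masked positions $\mathcal{M}$, with each distribution conditioned on the unmasked tokens. A toy sketch (the candidate vocabulary and probabilities are made up; index 0 stands for the token "Eight" behind the `[MASK]`):

```python
import math

def masked_nll(probs_at_masked, targets):
    """Sum -log P(x_t | x_without_M) over the masked positions t in M.

    probs_at_masked: for each masked position, the encoder's distribution
    over the vocabulary given only the *unmasked* part of the sequence.
    targets: the original tokens hidden behind each [MASK].
    """
    return -sum(math.log(dist[tok]) for dist, tok in zip(probs_at_masked, targets))

# Toy case: one masked position ("Answer: [MASK] planets."),
# 3 candidate tokens with index 0 = "Eight".
probs_at_masked = [[0.9, 0.05, 0.05]]
targets = [0]
loss = masked_nll(probs_at_masked, targets)
```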
