
Commit 14612f8

Alexandru Dima authored and committed
Make formulas compatible with github
1 parent dc17057 commit 14612f8

2 files changed

Lines changed: 53 additions & 56 deletions

File tree

Fine-Tuning/Decoder Fine-Tuning/README.md

Lines changed: 32 additions & 32 deletions
@@ -27,15 +27,15 @@ Self-attention is a mechanism that enables each token in a sequence to compute a

Given an input sequence represented as embeddings:

-$
+$$
X = [x_1, x_2, \dots, x_n] \in \mathbb{R}^{n \times d}
-$
+$$

Three linear projections are applied:

-$
+$$
Q = XW^Q, \quad K = XW^K, \quad V = XW^V
-$
+$$

where:

@@ -45,11 +45,11 @@ where:

The attention mechanism computes:

-$
+$$
\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
-$
+$$

-* ($QK^T$) computes similarity scores.
+* $QK^T$ computes similarity scores.
* Division by $\sqrt{d_k}$ stabilizes gradients.
* The softmax produces attention weights.
* These weights are used to compute a weighted sum of value vectors.
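A minimal sketch of this computation, assuming PyTorch and illustrative shapes (the function and variable names below are not from the repository):

```python
import torch

def attention(Q, K, V):
    # Q, K, V: [n, d_k] projections of the input X (shapes assumed for illustration)
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # QK^T / sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)          # attention weights
    return weights @ V                               # weighted sum of value vectors
```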
@@ -99,9 +99,9 @@ x4 → [ • • • • ]

The upper-triangular portion is masked (set to $-\infty$ before softmax), ensuring:

-$
+$$
x_t \text{ attends only to } x_{\leq t}
-$
+$$

This produces an autoregressive model suitable for text generation.
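A minimal sketch of such a causal mask, assuming PyTorch and a toy sequence length:

```python
import torch

T = 4
scores = torch.randn(T, T)                                    # raw attention scores
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))              # hide future positions
weights = torch.softmax(scores, dim=-1)                       # row t attends only to x_{<=t}
```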

@@ -114,21 +114,21 @@ A single attention operation may not capture all relevant relational patterns in

Instead of a single set of projections $(W^Q, W^K, W^V)$, we define (h) sets:

-$
+$$
Q_i = XW_i^Q, \quad K_i = XW_i^K, \quad V_i = XW_i^V
-$
+$$

Each head computes:

-$
+$$
\text{head}_i = \text{Attention}(Q_i, K_i, V_i)
-$
+$$

The outputs are concatenated:

-$
+$$
\text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O
-$
+$$

where $W^O$ is a learned projection matrix.
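A compact sketch of multi-head attention, assuming PyTorch and made-up dimensions; the per-head projections are stacked into single linear layers for convenience:

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.n_heads, self.d_k = n_heads, d_model // n_heads
        self.W_q = nn.Linear(d_model, d_model)   # stacks the h projections W_i^Q
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)   # output projection W^O

    def forward(self, X):                                   # X: [batch, n, d_model]
        B, n, _ = X.shape
        split = lambda t: t.view(B, n, self.n_heads, self.d_k).transpose(1, 2)
        Q, K, V = split(self.W_q(X)), split(self.W_k(X)), split(self.W_v(X))
        scores = Q @ K.transpose(-2, -1) / self.d_k ** 0.5  # per-head QK^T / sqrt(d_k)
        heads = torch.softmax(scores, dim=-1) @ V           # head_i = Attention(Q_i, K_i, V_i)
        concat = heads.transpose(1, 2).reshape(B, n, -1)    # Concat(head_1, ..., head_h)
        return self.W_o(concat)

out = MultiHeadAttention()(torch.randn(2, 10, 512))          # example usage
```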

@@ -160,9 +160,9 @@ Each head may learn to specialize in distinct types of relationships (e.g., synt

The training objective of a decoder-only transformer can be formulated as a sequence of classification tasks. At each position $t$, given the previous tokens $x_{<t} = (x_1, \dots, x_{t-1})$, the model predicts a probability distribution over the entire vocabulary $\mathcal{V}$:

-$
+$$
P(x_t \mid x_{<t})
-$
+$$

Since the vocabulary contains $|\mathcal{V}|$ discrete tokens, this prediction corresponds to a **multi-class classification problem**, where:

@@ -188,15 +188,15 @@ Training aims to minimize the **cross-entropy loss** between the predicted distr

For a single position $t$, the loss is:

-$
-\mathcal{L}*t = - \log P*\theta(x_t \mid x_{<t})
-$
+$$
+\mathcal{L}_t = - \log P_\theta(x_t \mid x_{<t})
+$$

-For a sequence of length $ T $:
+For a sequence of length $T$:

-$
+$$
\mathcal{L} = - \sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t})
-$
+$$

This is equivalent to maximizing the likelihood of the training data (Maximum Likelihood Estimation).
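A minimal sketch of this objective, assuming PyTorch and illustrative shapes (here averaged per token rather than summed):

```python
import torch
import torch.nn.functional as F

B, T, V = 2, 8, 100                       # batch, sequence length, vocab size (assumed)
logits = torch.randn(B, T, V)             # model scores for P_theta(x_t | x_<t)
targets = torch.randint(0, V, (B, T))     # ground-truth next tokens x_t

# -log P_theta(x_t | x_<t) accumulated over all positions
loss = F.cross_entropy(logits.view(-1, V), targets.view(-1))
```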

@@ -216,7 +216,7 @@ Conceptually:
* Input to predict token 3 is tokens 1–2.
* And so forth.

-Although the full sequence is fed into the model in parallel, causal masking ensures that position $ t $ only attends to positions $ < t $.
+Although the full sequence is fed into the model in parallel, causal masking ensures that position $t$ only attends to positions $<t$.

---

@@ -237,20 +237,20 @@ Example:

**Input tokens:**

-$
+$$
[tok_1, tok_2, \dots, tok_n, tok_{n+1}, \dots, tok_T]
-$
+$$

where:

-* $ tok_1 \dots tok_n $ = prompt
-* $ tok_{n+1} \dots tok_T $ = target completion
+* $tok_1 \dots tok_n$ = prompt
+* $tok_{n+1} \dots tok_T$ = target completion

**Labels:**

-$
+$$
[-100, -100, \dots, -100, tok_{n+1}, \dots, tok_T]
-$
+$$

Here:

@@ -259,9 +259,9 @@ Here:

Thus, the loss becomes:

-$
+$$
\mathcal{L} = - \sum_{t=n+1}^{T} \log P_\theta(x_t \mid x_{<t})
-$
+$$

The model still *conditions on the prompt*, but it is only penalized for errors in generating the response.
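A minimal sketch of this label masking, assuming PyTorch and made-up token ids; positions labeled `-100` are ignored by the cross-entropy loss:

```python
import torch
import torch.nn.functional as F

prompt_ids = [101, 2054, 2003]             # tok_1 .. tok_n (assumed ids)
completion_ids = [1996, 3437, 102]         # tok_{n+1} .. tok_T (assumed ids)

input_ids = torch.tensor(prompt_ids + completion_ids)
labels = torch.tensor([-100] * len(prompt_ids) + completion_ids)

V = 30522                                   # vocab size (assumed)
logits = torch.randn(len(input_ids), V)     # stand-in for the model's outputs
# ignore_index skips the prompt positions, so only the completion is penalized
loss = F.cross_entropy(logits, labels, ignore_index=-100)
```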

Fine-Tuning/Encoder Fine-Tuning/README.md

Lines changed: 21 additions & 24 deletions
@@ -30,63 +30,61 @@ Unlike decoder-only transformers, which are almost always trained with next-toke
An encoder-only transformer can be viewed as a **general-purpose feature extractor** that produces contextualized token representations. On top of this shared backbone, different **task-specific heads** can be attached.

Given an input sequence
-$
+$$
x = (x_1, \dots, x_T),
-$
+$$
the encoder produces hidden states:

-$
+$$
H = (h_1, \dots, h_T).
-$
+$$

These representations can then be fed into one or more classification heads, depending on the task.

#### (a) Sequence-Level Classification

-A special token (e.g., [CLS]) is prepended to the sequence. Its final hidden representation $ h_{\text{CLS}} $ serves as an aggregate summary of the input. This representation can then be fed into a classification head. Alternatively, instead of relying solely on the `[CLS]` token, one may construct the input to the classification head using a learned linear combination (or pooling) of all token representations in the final layer.
+A special token (e.g., [CLS]) is prepended to the sequence. Its final hidden representation $h_{\text{CLS}}$ serves as an aggregate summary of the input. This representation can then be fed into a classification head. Alternatively, instead of relying solely on the `[CLS]` token, one may construct the input to the classification head using a learned linear combination (or pooling) of all token representations in the final layer.

A classification head (typically a linear layer + softmax) predicts:

-$
+$$
P_\theta(y \mid x)
-$
+$$

The loss is standard cross-entropy:

-$
+$$
\mathcal{L} = - \log P_\theta(y \mid x)
-$
+$$

This setup is used for tasks such as sentiment analysis, topic classification, or natural language inference.
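A minimal sketch of such a head, assuming PyTorch and illustrative dimensions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_size, num_labels = 768, 2                   # assumed dimensions
H = torch.randn(1, 12, hidden_size)                 # encoder outputs (h_1, ..., h_T)

classifier = nn.Linear(hidden_size, num_labels)     # sequence-level classification head
h_cls = H[:, 0]                                      # hidden state of the prepended [CLS] token
logits = classifier(h_cls)                           # scores for P_theta(y | x)
loss = F.cross_entropy(logits, torch.tensor([1]))    # -log P_theta(y | x) for gold label y
```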

#### (b) Token-Level Classification

For tasks like named entity recognition or part-of-speech tagging, a classification head is applied **independently to each token representation**:
-
-$
+$$
P_\theta(y_t \mid x)
-$
+$$

The total loss becomes:

-$
+$$
\mathcal{L} = - \sum_{t=1}^{T} \log P_\theta(y_t \mid x)
-$
+$$

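Again as a sketch with assumed names and shapes: the same linear head applied at every position, with the loss summed over tokens:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_size, num_tags, T = 768, 9, 12        # assumed dimensions
H = torch.randn(1, T, hidden_size)            # one hidden state h_t per token
tag_head = nn.Linear(hidden_size, num_tags)

logits = tag_head(H)                           # [1, T, num_tags]: a prediction per token
labels = torch.randint(0, num_tags, (1, T))    # gold tags y_t
loss = F.cross_entropy(logits.view(-1, num_tags), labels.view(-1), reduction="sum")
```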
#### (c) Multi-Task Learning with Multiple Heads

A single encoder backbone can simultaneously support multiple objectives by attaching **multiple classification heads**, each with its own loss:
-
-$
+$$
\mathcal{L} = \sum_i \lambda_i \mathcal{L}_i
-$
+$$

where:

-* $ \mathcal{L}_i $ is the loss for task $ i $,
-* $ \lambda_i $ controls the relative importance of each task.
+* $\mathcal{L}_i$ is the loss for task $i$,
+* $\lambda_i$ controls the relative importance of each task.

In this setting, the encoder learns shared representations that transfer across tasks, while each head specializes in a particular prediction problem.
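A minimal sketch of the weighted combination, with placeholder per-task losses and assumed weights:

```python
import torch

# Per-task losses produced by their own heads (random stand-ins here)
task_losses = {"sentiment": torch.tensor(0.7), "ner": torch.tensor(1.2)}
task_weights = {"sentiment": 1.0, "ner": 0.5}    # the lambda_i coefficients

total_loss = sum(task_weights[k] * task_losses[k] for k in task_losses)
```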

@@ -107,16 +105,15 @@ Instead of adding extra classification heads over the transformer backbone, we
> "Question: How many planets are in our solar system? Answer: [MASK] planets."

The encoder processes the entire sequence simultaneously, attending to both the question and the unmasked parts of the answer. The objective is to predict:
-
-$
+$$
P(x_t \mid x_{\setminus \mathcal{M}}), \quad t \in \mathcal{M}
-$
+$$

The loss is computed only on the masked answer tokens:

-$
+$$
\mathcal{L} = - \sum_{t \in \mathcal{M}} \log P_\theta(x_t \mid x_{\setminus \mathcal{M}})
-$
+$$
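A minimal sketch of this masked-token loss, assuming a BERT-style model from Hugging Face Transformers and an illustrative gold answer ("eight"):

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

text = "Question: How many planets are in our solar system? Answer: [MASK] planets."
inputs = tokenizer(text, return_tensors="pt")

# Labels: -100 everywhere except the masked position, which carries the gold token
labels = torch.full_like(inputs["input_ids"], -100)
mask_positions = inputs["input_ids"] == tokenizer.mask_token_id
labels[mask_positions] = tokenizer.convert_tokens_to_ids("eight")

outputs = model(**inputs, labels=labels)
loss = outputs.loss    # cross-entropy over the masked token(s) only
```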

In this formulation:
