Fine-Tuning/Decoder Fine-Tuning/README.md
5 additions & 5 deletions
@@ -158,10 +158,10 @@ Each head may learn to specialize in distinct types of relationships (e.g., synt
### 1. Next-Token Prediction as a Classification Problem

-The training objective of a decoder-only transformer can be formulated as a sequence of classification tasks. At each position $t$, given the previous tokens $x_{<t} = (x_1, \dots, x_{t-1})$, the model predicts a probability distribution over the entire vocabulary $\mathcal{V}$:
+The training objective of a decoder-only transformer can be formulated as a sequence of classification tasks. At each position $t$, given the previous tokens $(x_1, \dots, x_{t-1})$, the model predicts a probability distribution over the entire vocabulary $\mathcal{V}$:

$$
-P(x_t \mid x_{<t})
+P(x_t \mid x_1, \dots, x_{t-1})
$$

Since the vocabulary contains $|\mathcal{V}|$ discrete tokens, this prediction corresponds to a **multi-class classification problem**, where:
@@ -189,13 +189,13 @@ Training aims to minimize the **cross-entropy loss** between the predicted distr
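The classification framing in the hunk above can be made concrete with a minimal, stdlib-only sketch (toy logits, not a real model): at one position, the model's vocabulary logits are turned into a distribution with softmax, and the cross-entropy loss is the negative log-probability of the true next token.

```python
import math

def softmax(logits):
    # Numerically stabilized softmax over the vocabulary logits.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def next_token_cross_entropy(logits, target_index):
    # Cross-entropy at one position: -log P(x_t | x_1, ..., x_{t-1}),
    # where the distribution comes from softmax over |V| logits.
    probs = softmax(logits)
    return -math.log(probs[target_index])

# Toy vocabulary of 4 tokens; one logit per vocabulary entry.
logits = [2.0, 0.5, -1.0, 0.1]
loss = next_token_cross_entropy(logits, target_index=0)
```

The loss is small when the target token already has high probability and grows as probability mass shifts to other tokens, which is exactly the multi-class classification behavior described above.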
Fine-Tuning/Encoder Fine-Tuning/README.md
4 additions & 5 deletions
@@ -29,11 +29,7 @@ Unlike decoder-only transformers, which are almost always trained with next-toke
An encoder-only transformer can be viewed as a **general-purpose feature extractor** that produces contextualized token representations. On top of this shared backbone, different **task-specific heads** can be attached.

-Given an input sequence
-$$
-x = (x_1, \dots, x_T),
-$$
-the encoder produces hidden states:
+Given an input sequence $x = (x_1, \dots, x_T)$, the encoder produces hidden states:

$$
H = (h_1, \dots, h_T).
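The feature-extractor view can be illustrated with a deliberately tiny stand-in for an encoder (toy arithmetic, not a transformer): each hidden state $h_t$ mixes the token's own embedding with information from the whole sequence, so every $h_t$ is contextual, and downstream heads consume `H`.

```python
def toy_encoder(token_embeddings):
    # Toy bidirectional "encoder": each hidden state blends the token's
    # own embedding with the mean over the whole sequence, so h_t depends
    # on every position (unlike a causal decoder). Not a real model.
    T = len(token_embeddings)
    dim = len(token_embeddings[0])
    mean = [sum(e[d] for e in token_embeddings) / T for d in range(dim)]
    return [[0.5 * e[d] + 0.5 * mean[d] for d in range(dim)]
            for e in token_embeddings]

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # T=3 tokens, hidden dim 2
H = toy_encoder(x)  # H = (h_1, ..., h_T), one vector per token
```

Changing any single input token changes every hidden state, which is the property task-specific heads rely on.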
@@ -64,6 +60,7 @@ This setup is used for tasks such as sentiment analysis, topic classification, o
#### (b) Token-Level Classification

For tasks like named entity recognition or part-of-speech tagging, a classification head is applied **independently to each token representation**:
+
$$
P_\theta(y_t \mid x)
$$
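A minimal sketch of such a head, assuming toy shapes (hidden size 2, three tag classes): the same linear projection plus softmax is applied to each hidden state $h_t$ separately, yielding one distribution $P_\theta(y_t \mid x)$ per token.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def token_classification_head(H, W, b):
    # Apply the same linear head independently to each hidden state:
    # P_theta(y_t | x) = softmax(W h_t + b), one distribution per token.
    out = []
    for h in H:
        logits = [sum(w_i * h_i for w_i, h_i in zip(row, h)) + b_j
                  for row, b_j in zip(W, b)]
        out.append(softmax(logits))
    return out

# Hypothetical values: hidden states for T=2 tokens, 3 tag classes.
H = [[0.3, -0.1], [1.2, 0.4]]
W = [[0.5, 0.1], [-0.2, 0.7], [0.0, 0.3]]  # 3 x 2 weight matrix
b = [0.0, 0.1, -0.1]
probs = token_classification_head(H, W, b)  # T distributions over 3 tags
```

Because the head has no interaction across positions, contextual information must come entirely from the encoder's hidden states.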
@@ -77,6 +74,7 @@ $$
#### (c) Multi-Task Learning with Multiple Heads

A single encoder backbone can simultaneously support multiple objectives by attaching **multiple classification heads**, each with its own loss:
+
$$
\mathcal{L} = \sum_i \lambda_i \mathcal{L}_i
$$
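The combined objective is just a weighted sum, which a one-line sketch makes explicit (the per-head losses and weights below are made-up illustration values):

```python
def multi_task_loss(losses, weights):
    # Weighted sum of per-head losses: L = sum_i lambda_i * L_i.
    assert len(losses) == len(weights)
    return sum(lam * l for lam, l in zip(weights, losses))

# Hypothetical per-head losses, e.g. sentence classification and tagging.
total = multi_task_loss([0.8, 1.4], [1.0, 0.5])  # 0.8*1.0 + 1.4*0.5 = 1.5
```

The weights $\lambda_i$ let one trade off heads against each other; tuning them is a common practical concern in multi-task fine-tuning.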
@@ -105,6 +103,7 @@ Instead of adding extra classification heads over the transformer backbone, we
> "Question: How many planets are in our solar system? Answer: [MASK] planets."

The encoder processes the entire sequence simultaneously, attending to both the question and the unmasked parts of the answer. The objective is to predict:
+
$$
P(x_t \mid x_{\setminus \mathcal{M}}), \quad t \in \mathcal{M}
$$
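This masked-prediction objective can be sketched with toy numbers (a length-4 sequence, vocabulary size 3, one masked position; the logits are made up): cross-entropy is computed only at positions $t \in \mathcal{M}$, using distributions conditioned on the unmasked context.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def mlm_loss(logits_per_position, masked_positions, targets):
    # Average cross-entropy over masked positions only:
    # sum over t in M of -log P(x_t | x_{\M}).
    loss = 0.0
    for t in masked_positions:
        probs = softmax(logits_per_position[t])
        loss += -math.log(probs[targets[t]])
    return loss / len(masked_positions)

# Toy sequence of length 4, vocab size 3; position 2 is the [MASK].
logits = [[0.0] * 3, [0.0] * 3, [2.0, 0.1, -1.0], [0.0] * 3]
targets = {2: 0}  # true token id at the masked position
loss = mlm_loss(logits, masked_positions=[2], targets=targets)
```

Unmasked positions contribute nothing to the loss; they serve only as bidirectional context for filling in the mask.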