The training objective of a decoder-only transformer can be formulated as a sequence of classification tasks. At each position $t$, given the previous tokens $x_{<t} = (x_1, \dots, x_{t-1})$, the model predicts a probability distribution over the entire vocabulary $\mathcal{V}$:

$$
P(x_t \mid x_{<t})
$$

Since the vocabulary contains $|\mathcal{V}|$ discrete tokens, this prediction corresponds to a **multi-class classification problem**, where:

* the classes are the $|\mathcal{V}|$ tokens of the vocabulary, and
* the target at position $t$ is the observed next token $x_t$.

Training aims to minimize the **cross-entropy loss** between the predicted distribution and the true next token.
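
A minimal PyTorch sketch of this view, treating every position as an independent $|\mathcal{V}|$-way classification problem; the tensor shapes and variable names are illustrative assumptions, not taken from the repository:

```python
import torch
import torch.nn.functional as F

# Toy dimensions: 2 sequences, 6 tokens each, vocabulary of 100 tokens.
batch_size, seq_len, vocab_size = 2, 6, 100

# Logits as a decoder-only transformer would emit them: one distribution
# over the vocabulary at every position t, conditioned on x_{<t}.
logits = torch.randn(batch_size, seq_len, vocab_size)
tokens = torch.randint(0, vocab_size, (batch_size, seq_len))

# The target at position t is the next token, so shift predictions and labels by one.
pred_logits = logits[:, :-1, :]   # predictions for positions 1 .. T-1
targets = tokens[:, 1:]           # the "class label" at each position

# Every position is an ordinary |V|-way classification;
# the loss is cross-entropy averaged over all positions.
loss = F.cross_entropy(pred_logits.reshape(-1, vocab_size), targets.reshape(-1))
print(loss.item())
```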

### Fine-Tuning/Encoder Fine-Tuning/README.md

Unlike decoder-only transformers, which are almost always trained with next-token prediction, encoder-only transformers can be fine-tuned with a variety of task-specific objectives.

An encoder-only transformer can be viewed as a **general-purpose feature extractor** that produces contextualized token representations. On top of this shared backbone, different **task-specific heads** can be attached.

Given an input sequence

$$
x = (x_1, \dots, x_T),
$$

the encoder produces hidden states:

$$
H = (h_1, \dots, h_T).
$$

These representations can then be fed into one or more classification heads, depending on the task.
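
As a hedged sketch of this feature-extractor view, the hidden states $H$ can be obtained with the Hugging Face `transformers` library; the `bert-base-uncased` checkpoint and the example sentence are illustrative choices, not prescribed by this README:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load an encoder-only backbone; the checkpoint is an illustrative choice.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")

with torch.no_grad():
    outputs = encoder(**inputs)

# H = (h_1, ..., h_T): one contextualized vector per input token.
H = outputs.last_hidden_state   # shape: (1, T, hidden_size)
print(H.shape)
```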
#### (a) Sequence-Level Classification

A special token (e.g., `[CLS]`) is prepended to the sequence. Its final hidden representation $h_{\text{CLS}}$ serves as an aggregate summary of the input. This representation can then be fed into a classification head. Alternatively, instead of relying solely on the `[CLS]` token, one may construct the input to the classification head using a learned linear combination (or pooling) of all token representations in the final layer.

A classification head (typically a linear layer + softmax) predicts:

$$
P_\theta(y \mid x)
$$

The loss is standard cross-entropy:

$$
\mathcal{L} = - \log P_\theta(y \mid x)
$$

This setup is used for tasks such as sentiment analysis, topic classification, or natural language inference.
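
A minimal PyTorch sketch of such a head; the hidden size of 768, three classes, and all class and variable names are assumptions made purely for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SequenceClassificationHead(nn.Module):
    """Linear layer over the [CLS] representation; softmax is applied inside the loss."""
    def __init__(self, hidden_size: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, T, hidden_size); position 0 is the [CLS] token.
        h_cls = hidden_states[:, 0]
        return self.classifier(h_cls)   # unnormalized logits over classes

head = SequenceClassificationHead(hidden_size=768, num_classes=3)
hidden_states = torch.randn(4, 16, 768)   # stand-in for encoder output H
labels = torch.tensor([0, 2, 1, 0])

logits = head(hidden_states)
# Cross-entropy applies log-softmax internally, matching L = -log P(y | x).
loss = F.cross_entropy(logits, labels)
print(loss.item())
```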
#### (b) Token-Level Classification
For tasks like named entity recognition or part-of-speech tagging, a classification head is applied **independently to each token representation**:

$$
P_\theta(y_t \mid x), \quad t = 1, \dots, T
$$

A single encoder backbone can simultaneously support multiple objectives by attaching **multiple classification heads**, each with its own loss:

$$
\mathcal{L} = \sum_i \lambda_i \mathcal{L}_i
$$

where:

* $\mathcal{L}_i$ is the loss for task $i$,
* $\lambda_i$ controls the relative importance of each task.

In this setting, the encoder learns shared representations that transfer across tasks, while each head specializes in a particular prediction problem.
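
A short sketch of the weighted multi-task objective, assuming one sequence-level (sentiment) head and one token-level (NER) head; the task names, sizes, and $\lambda_i$ values are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_size, num_sentiment_classes, num_ner_tags = 768, 3, 9

# Two task-specific heads sharing one encoder backbone.
sentiment_head = nn.Linear(hidden_size, num_sentiment_classes)   # sequence-level head
ner_head = nn.Linear(hidden_size, num_ner_tags)                  # token-level head

# Stand-in for the shared encoder output H: batch of 4, T = 16 tokens.
H = torch.randn(4, 16, hidden_size)
sentiment_labels = torch.randint(0, num_sentiment_classes, (4,))
ner_labels = torch.randint(0, num_ner_tags, (4, 16))

# Per-task losses: [CLS] position for the sequence task, every position for the token task.
loss_sentiment = F.cross_entropy(sentiment_head(H[:, 0]), sentiment_labels)
loss_ner = F.cross_entropy(ner_head(H).reshape(-1, num_ner_tags), ner_labels.reshape(-1))

# L = sum_i lambda_i * L_i, with illustrative weights.
lambdas = {"sentiment": 1.0, "ner": 0.5}
loss = lambdas["sentiment"] * loss_sentiment + lambdas["ner"] * loss_ner
print(loss.item())
```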

Instead of adding extra classification heads over the transformer backbone, we can frame the task directly as masked-token prediction, for example:

> "Question: How many planets are in our solar system? Answer: [MASK] planets."
The encoder processes the entire sequence simultaneously, attending to both the question and the unmasked parts of the answer. The objective is to predict:

$$
P(x_t \mid x_{\setminus \mathcal{M}}), \quad t \in \mathcal{M}
$$

where $\mathcal{M}$ denotes the set of masked positions.

The loss is computed only on the masked answer tokens:

$$
\mathcal{L} = - \sum_{t \in \mathcal{M}} \log P(x_t \mid x_{\setminus \mathcal{M}})
$$
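
As an illustration, this example prompt can be run directly with the Hugging Face `fill-mask` pipeline; the choice of checkpoint is an assumption:

```python
from transformers import pipeline

# Masked-token prediction with an encoder-only model; the checkpoint is illustrative.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

predictions = fill_mask(
    "Question: How many planets are in our solar system? Answer: [MASK] planets."
)

# Each candidate scores a token for the [MASK] position,
# i.e., P(x_t | x_{\M}) restricted to the masked position t.
for p in predictions:
    print(p["token_str"], round(p["score"], 3))
```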