# The Decoder-Only Transformer

Large Language Models (LLMs) are neural networks trained on vast amounts of text data to learn the statistical patterns of language. By predicting and generating words based on context, they can perform a wide range of tasks, including text completion, translation, summarization, and reasoning. Modern LLMs are typically built on the decoder-only transformer architecture, which enables efficient modeling of long-range dependencies in text through self-attention mechanisms. Compared to the original transformer decoder, contemporary LLMs primarily differ in scale, architectural refinements, and the use of extensive pretraining and alignment procedures. Therefore, understanding how the transformer decoder operates is a fundamental step toward understanding the design and behavior of current LLMs.

## Architecture

The model begins by converting input tokens into continuous vectors through a token embedding layer. To encode word order, a positional embedding is added to these token representations. The resulting vectors are then passed into a stack of identical transformer blocks (repeated $N$ times).

Each decoder block consists of two main sublayers:

1. **Masked Multi-Head Self-Attention**: This layer allows each token to attend only to previous tokens in the sequence (via causal masking), ensuring autoregressive behavior for next-token prediction. Multiple attention heads operate in parallel to capture different types of relationships.

2. **Feed-Forward Network (FFN)**: A position-wise fully connected network that processes each token representation independently after attention has mixed contextual information.

Both sublayers are wrapped with residual connections (skip connections) and followed by layer normalization, which stabilizes training and improves gradient flow. After passing through the stacked decoder blocks, the final hidden states are fed into a linear projection (often tied to the embedding matrix) to produce logits over the vocabulary, enabling next-token prediction.
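
To make this data flow concrete, here is a minimal sketch of a toy decoder-only model in PyTorch. Layer sizes and names such as `TinyDecoderLM` are illustrative, `torch.nn.MultiheadAttention` stands in for a hand-written attention module, and the residual + LayerNorm placement follows the description above.

```python
import torch
import torch.nn as nn


class DecoderBlock(nn.Module):
    """One decoder block: masked self-attention + FFN, each wrapped with residual + LayerNorm."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n = x.size(1)
        # Causal mask: True marks positions a token is NOT allowed to attend to (the future).
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool, device=x.device), diagonal=1)
        attn_out, _ = self.attn(x, x, x, attn_mask=mask)
        x = self.ln1(x + attn_out)      # residual connection + layer normalization
        x = self.ln2(x + self.ffn(x))   # residual connection + layer normalization
        return x


class TinyDecoderLM(nn.Module):
    """Token + positional embeddings -> N decoder blocks -> tied linear projection to logits."""

    def __init__(self, vocab_size=1000, d_model=64, n_heads=4, d_ff=256, n_layers=2, max_len=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.blocks = nn.ModuleList(DecoderBlock(d_model, n_heads, d_ff) for _ in range(n_layers))
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.tok_emb.weight   # weight tying with the embedding matrix

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        positions = torch.arange(ids.size(1), device=ids.device)
        x = self.tok_emb(ids) + self.pos_emb(positions)
        for block in self.blocks:
            x = block(x)
        return self.lm_head(x)          # logits over the vocabulary, one distribution per position


logits = TinyDecoderLM()(torch.randint(0, 1000, (1, 16)))
print(logits.shape)                     # torch.Size([1, 16, 1000])
```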

## Attention
### Self-Attention
#### Conceptual Overview

Self-attention is a mechanism that enables each token in a sequence to compute a contextualized representation by attending to all other tokens in the same sequence. Self-attention allows direct pairwise interactions between all positions, enabling efficient modeling of long-range dependencies.

#### Mathematical Formulation

Given an input sequence represented as embeddings:

$$
X = [x_1, x_2, \dots, x_n] \in \mathbb{R}^{n \times d}
$$

Three linear projections are applied:

$$
Q = XW^Q, \quad K = XW^K, \quad V = XW^V
$$

where:

* $Q$ = Queries
* $K$ = Keys
* $V$ = Values

The attention mechanism computes:

$$
\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

* $QK^T$ computes similarity scores.
* Division by $\sqrt{d_k}$ stabilizes gradients.
* The softmax produces attention weights.
* These weights are used to compute a weighted sum of value vectors.
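
As a concrete illustration of these steps, the snippet below implements the formula for a single, unbatched sequence in PyTorch. The projection matrices are random placeholders; in a real model they are learned parameters.

```python
import torch
import torch.nn.functional as F


def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention for one sequence X of shape (n, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv        # linear projections to queries, keys, values
    d_k = Q.size(-1)
    scores = Q @ K.T / d_k ** 0.5           # (n, n) similarity scores, scaled by sqrt(d_k)
    weights = F.softmax(scores, dim=-1)     # attention weights; each row sums to 1
    return weights @ V                      # weighted sum of value vectors


n, d, d_k = 4, 8, 8
X = torch.randn(n, d)
Wq, Wk, Wv = torch.randn(d, d_k), torch.randn(d, d_k), torch.randn(d, d_k)
print(self_attention(X, Wq, Wk, Wv).shape)  # torch.Size([4, 8])
```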

#### Structural Interpretation

Each token representation is updated as a weighted combination of all tokens:

```
      x1  x2  x3  x4
x1 → [ •   •   •   • ]
x2 → [ •   •   •   • ]
x3 → [ •   •   •   • ]
x4 → [ •   •   •   • ]
```

Each row corresponds to the attention distribution of a token over the sequence. Thus, self-attention produces context-aware token embeddings.

### Masked Self-Attention (Causal Attention)
#### Motivation

In autoregressive language modeling, the objective is next-token prediction. Therefore, when predicting the token at position $t$, the model must not access information from positions $> t$. Standard self-attention allows full bidirectional access. To prevent information leakage, a **causal mask** is applied.

#### Causal Mask Mechanism

The attention matrix is modified by masking future positions before applying softmax.

Instead of:

```
      x1  x2  x3  x4
x1 → [ •   •   •   • ]
x2 → [ •   •   •   • ]
x3 → [ •   •   •   • ]
x4 → [ •   •   •   • ]
```

We enforce:

```
      x1  x2  x3  x4
x1 → [ •   0   0   0 ]
x2 → [ •   •   0   0 ]
x3 → [ •   •   •   0 ]
x4 → [ •   •   •   • ]
```

The upper-triangular portion is masked (set to $-\infty$ before softmax), ensuring:

$$
x_t \text{ attends only to } x_{\leq t}
$$

This produces an autoregressive model suitable for text generation.
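
A minimal sketch of how the mask is typically realized in code: future positions are set to $-\infty$ before the softmax, so their attention weights become exactly zero.

```python
import torch
import torch.nn.functional as F

n = 4
scores = torch.randn(n, n)                                        # stands in for QK^T / sqrt(d_k)
causal_mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal_mask, float("-inf"))           # mask the upper triangle
weights = F.softmax(scores, dim=-1)
print(weights)   # row t has non-zero weights only for positions <= t
```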

### Masked Multi-Head Self-Attention
#### Motivation

A single attention operation may not capture all relevant relational patterns in the data. Different linguistic or structural relationships may require distinct representational subspaces. Multi-head attention addresses this by running several independent attention mechanisms in parallel.

#### Mechanism

Instead of a single set of projections $(W^Q, W^K, W^V)$, we define $h$ sets:

$$
Q_i = XW_i^Q, \quad K_i = XW_i^K, \quad V_i = XW_i^V
$$

Each head computes:

$$
\text{head}_i = \text{Attention}(Q_i, K_i, V_i)
$$

The outputs are concatenated:

$$
\text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O
$$

where $W^O$ is a learned projection matrix.

#### Structural Illustration

```
              Input X
                 |
    ---------------------------------
    |         |         |           |
  Head1     Head2     Head3  ...  Headh
    |         |         |           |
 Attention Attention Attention     ...
    |         |         |           |
    ---------------------------------
              Concatenate
                 |
          Linear Projection
                 |
               Output
```

Each head may learn to specialize in distinct types of relationships (e.g., syntactic dependencies, semantic similarity, positional structure). In decoder-only transformer architectures, masked multi-head self-attention constitutes the core operation enabling scalable, parallelizable, and autoregressive language modeling.
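
The module below sketches masked multi-head self-attention end to end: per-head projections, scaled dot-product attention with a causal mask, concatenation of the heads, and the output projection $W^O$. Dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedMultiHeadSelfAttention(nn.Module):
    """Minimal masked multi-head self-attention over inputs of shape (batch, seq, d_model)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.Wq = nn.Linear(d_model, d_model, bias=False)   # all heads' W_i^Q stacked together
        self.Wk = nn.Linear(d_model, d_model, bias=False)
        self.Wv = nn.Linear(d_model, d_model, bias=False)
        self.Wo = nn.Linear(d_model, d_model, bias=False)   # output projection W^O

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, _ = x.shape

        def split_heads(t):  # (b, n, d_model) -> (b, n_heads, n, d_head)
            return t.view(b, n, self.n_heads, self.d_head).transpose(1, 2)

        Q, K, V = split_heads(self.Wq(x)), split_heads(self.Wk(x)), split_heads(self.Wv(x))
        scores = Q @ K.transpose(-2, -1) / self.d_head ** 0.5
        causal = torch.triu(torch.ones(n, n, dtype=torch.bool, device=x.device), diagonal=1)
        scores = scores.masked_fill(causal, float("-inf"))
        heads = F.softmax(scores, dim=-1) @ V                # each head attends independently
        concat = heads.transpose(1, 2).reshape(b, n, -1)     # concatenate the heads
        return self.Wo(concat)


out = MaskedMultiHeadSelfAttention(d_model=64, n_heads=8)(torch.randn(2, 10, 64))
print(out.shape)   # torch.Size([2, 10, 64])
```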

## Training Objective

### 1. Next-Token Prediction as a Classification Problem

The training objective of a decoder-only transformer can be formulated as a sequence of classification tasks. At each position $t$, given the previous tokens $x_{<t} = (x_1, \dots, x_{t-1})$, the model predicts a probability distribution over the entire vocabulary $\mathcal{V}$:

$$
P(x_t \mid x_{<t})
$$

Since the vocabulary contains $|\mathcal{V}|$ discrete tokens, this prediction corresponds to a **multi-class classification problem**, where:

* The classes are all tokens in the vocabulary,
* The model outputs a probability distribution via a softmax layer,
* The correct class is the actual next token in the sequence.

This process occurs at every time step of the sequence. At a high level, the model learns to continue a given text prefix. During inference, generation proceeds iteratively:

1. Provide an initial prompt.
2. Predict a distribution over the next token.
3. Sample or select the most likely token.
4. Append it to the input.
5. Repeat.

A minimal sketch of this generation loop is shown below.

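In the sketch, `model` is assumed to be any module that maps token ids of shape `(1, length)` to logits of shape `(1, length, vocab_size)`, for example the toy model from the Architecture section; replacing the `argmax` with sampling (top-k, nucleus) gives stochastic generation.

```python
import torch


@torch.no_grad()
def generate(model, prompt_ids, max_new_tokens=20, eos_id=None):
    """Greedy autoregressive decoding: repeatedly append the most likely next token."""
    ids = prompt_ids.clone()                                       # (1, prompt_len)
    for _ in range(max_new_tokens):
        logits = model(ids)                                        # (1, len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)    # distribution -> chosen token
        ids = torch.cat([ids, next_id], dim=1)                     # append it to the input
        if eos_id is not None and next_id.item() == eos_id:
            break                                                  # stop at end-of-sequence
    return ids
```
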
### 2. Cross-Entropy Loss

Training aims to minimize the **cross-entropy loss** between the predicted distribution and the true token at each position.

For a single position $t$, the loss is:

$$
\mathcal{L}_t = - \log P_\theta(x_t \mid x_{<t})
$$

For a sequence of length $T$:

$$
\mathcal{L} = - \sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t})
$$

This is equivalent to maximizing the likelihood of the training data (Maximum Likelihood Estimation).
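
To see the correspondence numerically, the toy check below confirms that PyTorch's cross-entropy over a sequence of logits equals the summed negative log-probability of the target tokens (random data, illustrative shapes):

```python
import torch
import torch.nn.functional as F

T, vocab_size = 5, 10
logits = torch.randn(T, vocab_size)                 # one logit vector per position
targets = torch.randint(0, vocab_size, (T,))        # the true next tokens

log_probs = F.log_softmax(logits, dim=-1)
nll = -log_probs[torch.arange(T), targets].sum()    # -sum_t log P(x_t | x_<t)
ce = F.cross_entropy(logits, targets, reduction="sum")
print(torch.allclose(nll, ce))                      # True
```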

### 3. Training

During pretraining, the input and target sequences are identical, with the targets shifted by one position.

Example:

**Text sequence:**

> "There are 8 planets."

Conceptually:

* Input to predict token 2 is token 1.
* Input to predict token 3 is tokens 1–2.
* And so forth.

Although the full sequence is fed into the model in parallel, causal masking ensures that position $t$ only attends to positions $\leq t$.
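
In code, the shift is usually produced simply by slicing one tensor of token ids, as in this sketch (the ids themselves are made up for illustration):

```python
import torch

# Token ids for "There are 8 planets." under some hypothetical tokenizer.
tokens = torch.tensor([[101, 42, 7, 311, 13]])

inputs = tokens[:, :-1]    # what the model sees at each position
targets = tokens[:, 1:]    # the token it must predict at that position
# Position t of `inputs` predicts targets[t]; causal masking keeps it from peeking ahead.
```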

---

During supervised fine-tuning, training data often consists of:

* A **prompt**
* A desired **completion**

Example:

**Input:**

> "Respond to the following question: How many planets are in our solar system? There are 8 planets."

If we compute loss over the entire sequence, the model would be trained to reproduce both the prompt and the answer. However, the objective in instruction tuning is to train the model to generate only the completion, conditioned on the prompt. To prevent computing loss on the prompt portion, the labels corresponding to the prompt tokens are masked.

Example:

**Input tokens:**

$$
[\text{tok}_1, \text{tok}_2, \dots, \text{tok}_n, \text{tok}_{n+1}, \dots, \text{tok}_T]
$$

where:

* $\text{tok}_1 \dots \text{tok}_n$ = prompt
* $\text{tok}_{n+1} \dots \text{tok}_T$ = target completion

**Labels:**

$$
[-100, -100, \dots, -100, \text{tok}_{n+1}, \dots, \text{tok}_T]
$$

Here:

* The special value (e.g., -100 in PyTorch) indicates that loss should **not** be computed for those positions.
* Gradients are propagated only for the completion tokens.

Thus, the loss becomes:

$$
\mathcal{L} = - \sum_{t=n+1}^{T} \log P_\theta(x_t \mid x_{<t})
$$

The model still *conditions on the prompt*, but it is only penalized for errors in generating the response.
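
The sketch below shows how such labels are typically built and consumed; `-100` matches the default `ignore_index` of PyTorch's cross-entropy, the prompt length `n` is illustrative, and the one-position shift from the pretraining section is omitted for brevity.

```python
import torch
import torch.nn.functional as F

token_ids = torch.tensor([[12, 87, 5, 901, 33, 44, 2]])   # prompt + completion (made-up ids)
n = 4                                                       # number of prompt tokens

labels = token_ids.clone()
labels[:, :n] = -100                                        # mask the prompt positions

logits = torch.randn(1, token_ids.size(1), 1000)            # stand-in for the model's output
loss = F.cross_entropy(
    logits.view(-1, logits.size(-1)),                       # (T, vocab_size)
    labels.view(-1),                                        # (T,)
    ignore_index=-100,                                      # masked positions contribute no loss
)
print(loss)
```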

## Parameter-Efficient Fine-Tuning
### Why is it necessary?

As the scale of Large Language Models (LLMs) has expanded, **Parameter-Efficient Fine-Tuning (PEFT)** has emerged as a critical methodology to mitigate the computational and storage problems associated with fine-tuning. Historically, the models utilized in standard NLP pipelines were significantly smaller than today's state-of-the-art (SOTA) decoder architectures. To put this into perspective:

* **BERT-base:** 110 million parameters.
* **FLAN-T5-small:** 77 million parameters.
* **Qwen (0.5B variant):** approximately 494 million parameters.

In contrast, current LLMs have moved from millions to billions of parameters, typically starting at **7 billion** and scaling up to **hundreds of billions**. At this magnitude, updating every weight in the network (full fine-tuning) is no longer viable for most organizations due to memory constraints and high GPU costs.

### How PEFT Optimizes Adaptation

Instead of performing a global update of all model weights, PEFT strategies focus on modifying only a specialized, task-specific subset of parameters. The core weights of the original pre-trained model remain **frozen**, which yields several advantages:

* **Resource Efficiency:** By only training a fraction of the total parameters (often less than 1%), the memory footprint during training is drastically reduced.
* **Mitigation of Catastrophic Forgetting:** Keeping the base weights fixed helps the model retain its general knowledge while learning new tasks.
* **Modular Portability:** Since the original model remains unchanged, practitioners only need to store small "checkpoints" (adapters) for each specific task, rather than multiple full-sized versions of the model.

### Low-Rank Adaptation (LoRA)

In the context of modern chat-based models, **Low-Rank Adaptation (LoRA)** has become the industry-standard PEFT technique. LoRA functions by injecting trainable low-rank matrices into the transformer layers. This allows the model to learn task-specific nuances through a highly compressed mathematical representation, ensuring that even billion-parameter models can be adapted on consumer-grade hardware.
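
Concretely, for a frozen weight matrix $W \in \mathbb{R}^{d \times k}$, LoRA learns an additive update $\Delta W = BA$ with $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and a small rank $r \ll \min(d, k)$, so only $B$ and $A$ are trained. A minimal sketch of attaching LoRA adapters with the `peft` library follows; the model checkpoint, target modules, and hyperparameters are illustrative choices, not prescriptions.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative base model; any causal LM checkpoint can be adapted the same way.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B", torch_dtype=torch.bfloat16)

lora_config = LoraConfig(
    r=8,                                   # rank of the low-rank update
    lora_alpha=16,                         # scaling factor applied to the update
    target_modules=["q_proj", "v_proj"],   # inject adapters into the attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # typically well under 1% of all parameters
```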

## Dockerfile

```
FROM ubuntu:24.04

# System Python and tooling needed to create an isolated environment
RUN apt-get update && apt-get install -y python3-pip python3-venv && apt-get clean

# All Python dependencies live in a virtual environment on the PATH
RUN python3 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

COPY requirements.txt /tmp/
RUN pip install --upgrade pip wheel
# CUDA 12.8 build of PyTorch; adjust the index URL to match the target GPU/CUDA setup
RUN pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128
RUN pip install -r /tmp/requirements.txt
```

## Requirements

```
transformers>=5.2.0
lightning>=2.6.1
datasets>=4.5.0
evaluate>=0.4.6
bitsandbytes>=0.49.0
peft>=0.18.0
wandb>=0.25.0
```