- Overview
- Learning Paradigms
- Model Evaluation
- Regularization
- Theoretical Foundations of Generalization
- Feature Engineering
- Supervised Learning Algorithms
- Unsupervised Learning Algorithms
- Ensemble Methods
- Reinforcement Learning
- References
Machine learning is a field of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed. This document covers fundamental machine learning concepts organized into key domains: supervised learning (prediction from labeled data), unsupervised learning (discovering structure in unlabeled data), and reinforcement learning (sequential decision-making). The document addresses both practical algorithms and theoretical foundations including generalization, regularization, and the bias-variance tradeoff.
The model learns from a labeled dataset, trying to predict outcomes for new, unseen data based on knowledge acquired from the training dataset. Common tasks include classification and regression.
Key characteristics:
- Requires labeled training data
- Goal is to learn mapping from inputs to outputs
- Performance measured against known correct answers
The model works with unlabeled data and tries to find underlying patterns or groupings in the data, without explicit instructions on what to find. Common tasks include clustering and dimensionality reduction.
Key characteristics:
- No labeled data required
- Discovers hidden structure in data
- Used for exploratory data analysis
A combination of supervised and unsupervised learning, where the model learns from a partially labeled dataset, using the labeled data to learn the structure and make predictions about the unlabeled data.
Applications:
- When labeling data is expensive
- Large amounts of unlabeled data available
- Medical image analysis, speech recognition
The model learns to make decisions by performing actions and receiving feedback in the form of rewards or penalties, aiming to maximize cumulative reward.
Key components:
- Agent: decision-making entity
- Environment: world the agent interacts with
- Actions: choices available to agent
- Rewards: feedback signal
Mean Absolute Error (MAE): $$ MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i| $$
Average of absolute differences between predicted and actual values.
Mean Squared Error (MSE): $$ MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 $$
Average of squared differences, penalizes larger errors more than MAE.
Root Mean Squared Error (RMSE): $$ RMSE = \sqrt{MSE} $$
In same units as response variable, gives high weight to large errors.
R-squared (Coefficient of Determination): $$ R^2 = 1 - \frac{SS_{residual}}{SS_{total}} $$
Measures proportion of variance in dependent variable predictable from independent variables.
Accuracy: $$ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} $$
Proportion of correct predictions. Good when classes are balanced but misleading with imbalanced datasets.
Precision (Positive Predictive Value): $$ Precision = \frac{TP}{TP + FP} $$
Ratio of true positives to total positive predictions. Important when false positives are costly.
Recall (Sensitivity, True Positive Rate): $$ Recall = \frac{TP}{TP + FN} $$
Ratio of true positives to total actual positives. Important when false negatives are costly.
F1 Score: $$ F1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall} $$
Harmonic mean of precision and recall. Particularly useful with uneven class distribution.
Confusion Matrix:
| Predicted Positive | Predicted Negative | |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
ROC Curve and AUC:
ROC (Receiver Operating Characteristic) curve plots true positive rate against false positive rate at various threshold settings. AUC (Area Under Curve) provides aggregate measure of performance across all classification thresholds.
k-Fold Cross-Validation:
- Split dataset into k equal-sized folds
- For each fold:
- Use fold as validation set
- Use remaining k-1 folds as training set
- Train and evaluate model
- Average results across all k folds
Provides robust estimate of model performance and reduces overfitting.
The bias-variance tradeoff describes the relationship between model complexity, prediction accuracy, and generalization to unseen data.
Overfitting:
- High variance, low bias
- Model too complex
- Fits training data too closely
- Poor generalization to unseen data
Underfitting:
- High bias, low variance
- Model too simple
- Fails to capture patterns in training data
- Low accuracy on both training and test data
Total Error: $$ Error = Bias^2 + Variance + Irreducible\ Error $$
L1 Regularization (Lasso): $$ Cost = Loss(Y, \hat{Y}) + \lambda \sum_{i} |w_i| $$
- Adds penalty equal to absolute value of coefficients
- Can lead to sparse models (some coefficients exactly zero)
- Performs automatic feature selection
L2 Regularization (Ridge): $$ Cost = Loss(Y, \hat{Y}) + \lambda \sum_{i} w_i^2 $$
- Adds penalty equal to square of coefficients
- Encourages smaller coefficients
- Does not eliminate features entirely
Elastic Net: $$ Cost = Loss(Y, \hat{Y}) + \lambda_1 \sum_{i} |w_i| + \lambda_2 \sum_{i} w_i^2 $$
- Combines L1 and L2 penalties
- Useful when features are correlated
- Benefits of both feature selection and coefficient shrinkage
Other Prevention Techniques:
- Cross-validation for hyperparameter tuning
- Increasing training dataset size
- Feature selection or reduction
- Early stopping during training
- Dropout (for neural networks)
The expected test error of a model can be decomposed into three components: $$ \text{Expected Test Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error} $$
Bias:
Error from incorrect assumptions in the learning algorithm. High bias corresponds to underfitting.
- Simple models have high bias
- Model cannot capture true relationship
- Both training and test errors are high
Variance:
Error from sensitivity to small fluctuations in the training set. High variance corresponds to overfitting.
- Complex models have high variance
- Model learns noise in training data
- Low training error, high test error
Tradeoff:
- Increasing model complexity decreases bias but increases variance
- Optimal model balances both sources of error
- Sweet spot minimizes total expected test error
Underfitting:
The model is too simple to capture the underlying structure of the data.
Symptoms:
- High training error
- High test error
- Model has high bias
Solutions:
- Increase model complexity
- Add more features
- Reduce regularization
Overfitting:
The model learns the training data too well, including noise and spurious patterns.
Symptoms:
- Low training error
- High test error
- Model has high variance
Solutions:
- Get more training data
- Reduce model complexity
- Apply regularization
- Use cross-validation
- Early stopping
A modern observation that challenges the classical U-shaped bias-variance curve.
Classical View:
Test error decreases as model complexity increases, reaches minimum, then increases (U-shape).
Modern Observation:
For highly overparameterized models (parameters > data points):
- Underparameterized regime: Classical bias-variance tradeoff applies
- Interpolation threshold: Model can perfectly fit training data, test error peaks
- Overparameterized regime: Test error decreases again as model size increases further
Key Insights:
- Very large models can generalize well despite perfect training fit
- Observed in deep neural networks
- Challenges traditional statistical wisdom
- Implicit regularization from optimization plays a role
Vapnik-Chervonenkis (VC) dimension measures the capacity or expressive power of a hypothesis class.
Definition:
The VC dimension of a hypothesis class
Shattering:
A set of points is shattered if the hypothesis class can realize every possible labeling of those points.
Examples:
- Linear classifier in
$\mathbb{R}^2$ : VC dimension = 3 - Linear classifier in
$\mathbb{R}^d$ : VC dimension =$d+1$
Sample Complexity:
The number of training examples needed for good generalization is typically linear in the VC dimension: $$ m = O\left(\frac{d}{\epsilon}\log\frac{1}{\delta}\right) $$
where
Implications:
- Higher VC dimension requires more training data
- Provides theoretical bounds on generalization
- Connects model capacity to sample requirements
The phenomenon where the optimization algorithm itself introduces a bias toward solutions with better generalization.
Characteristics:
- No explicit regularization term in loss function
- Optimization dynamics (e.g., SGD) prefer certain solutions
- Mini-batch noise acts as implicit regularizer
- Early stopping provides implicit regularization
Examples:
- SGD tends toward flat minima in deep learning
- Gradient descent on linear models finds minimum norm solution
- Architecture choices (e.g., depth) provide inductive biases
Creating New Features:
- Use domain knowledge to engineer features
- Polynomial features
- Interaction terms
- Aggregations and transformations
Normalization and Scaling:
-
Min-Max scaling:
$x' = \frac{x - x_{min}}{x_{max} - x_{min}}$ -
Z-score normalization:
$x' = \frac{x - \mu}{\sigma}$ - Robust scaling: Uses median and IQR
Handling Missing Data:
- Mean/median/mode imputation
- Forward/backward fill
- Model-based imputation
- Creating missingness indicator feature
Dimensionality Reduction:
- Principal Component Analysis (PCA)
- t-SNE for visualization
- Feature selection techniques
Label Encoding:
- Converts categories to integers (0, 1, 2, ...)
- Suitable for ordinal data with inherent order
- Not recommended for nominal data
One-Hot Encoding:
- Creates binary column for each category
- Suitable for nominal data
- Increases dimensionality
Target Encoding:
- Encodes based on target variable statistics
- Can lead to overfitting if not done carefully
- Useful for high-cardinality features
Methods to Calculate Importance:
- Mean decrease in impurity (decision trees)
- Coefficient weights (linear models)
- SHAP (SHapley Additive exPlanations) values
- Permutation importance
- Feature elimination
Applications:
- Feature selection
- Model interpretability
- Understanding data relationships
A fundamental distinction exists in how classification algorithms approach the learning problem:
Discriminative Algorithms:
Learn the conditional probability
Examples: Logistic Regression, Support Vector Machines, Decision Trees
Characteristics:
- More robust to incorrect modeling assumptions
- Typically require less training data when assumptions don't hold
- Focus directly on the classification task
Generative Algorithms:
Model the class priors
Examples: Gaussian Discriminant Analysis, Naive Bayes
Characteristics:
- More data-efficient when modeling assumptions are correct
- Can generate new data samples
- Can handle missing features more naturally
Gaussian Discriminant Analysis (GDA):
Assumes
GDA and Logistic Regression often produce similar decision boundaries, but GDA is more data-efficient if its Gaussian assumption holds, while Logistic Regression is more robust when the assumption is violated.
Linear regression models the relationship between input variables and a continuous output variable using a linear function.
Model: $$ y = \mathbf{w}^T\mathbf{x} + b $$
Matrix Form: $$ \mathbf{y} = \mathbf{X}\mathbf{w} $$
Least Squares Solution: $$ \mathbf{w} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} $$
Gradient: $$ \frac{\partial J}{\partial \mathbf{w}} = 2\mathbf{X}^T(\mathbf{X}\mathbf{w} - \mathbf{y}) $$
Assumptions:
- Linearity between features and target
- Independence of residuals
- Homoscedasticity (constant variance of residuals)
- Normal distribution of residuals
Statistical View: $$ (Y|\mathbf{X}=\mathbf{x}) \sim N(\mathbf{x} \cdot \mathbf{w}, \sigma^2) $$
Optimization Methods:
Gradient Descent: Iterative algorithm that updates parameters in the direction of steepest decrease: $$ \mathbf{w} := \mathbf{w} - \alpha\frac{\partial J}{\partial \mathbf{w}} $$
- Batch Gradient Descent: Updates after scanning entire training set
- Stochastic Gradient Descent (SGD): Updates after each training example, more efficient for large datasets
Normal Equations: Closed-form analytical solution: $$ \mathbf{w} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} $$
Requires
Probabilistic Interpretation:
Under the assumption that errors
Locally Weighted Linear Regression (LWR) is a non-parametric regression method that fits a different linear model at each query point.
Objective:
For each query point
where
Characteristics:
- Weights determined by distance from query point
- Bandwidth parameter
$\tau$ controls locality - Non-parametric: must store entire training set
- Prediction computationally expensive (requires fitting at each query)
Logistic regression models the probability of a binary outcome.
Model: $$ P(y=1|\mathbf{x}) = \frac{1}{1+e^{-\mathbf{w}^T\mathbf{x}}} = \frac{e^{\mathbf{w}^T\mathbf{x}}}{1+e^{\mathbf{w}^T\mathbf{x}}} $$
Loss Function (Cross-Entropy): $$ Loss = -\sum_{i=1}^m [y_i\log(p_1) + (1-y_i)\log(p_0)] $$
KL Divergence Interpretation: $$ D_{KL}(p||q) = \sum_x p(x)\log\frac{p(x)}{q(x)} $$
Assumptions:
- Binary dependent variable
- Independent observations
- No multicollinearity
- Linearity between features and log-odds
Decision trees recursively partition the feature space based on feature values to make predictions.
Splitting Criteria:
Entropy (Information Gain): $$ H(D) = -\sum_{k} p_k\log p_k $$
Gini Impurity: $$ Gini(D) = 1 - \sum_{k}p_k^2 $$
Information Gain Ratio (C4.5): $$ Gain_ratio(D,a) = \frac{Gain(D,a)}{IV(a)} $$
where
Tree Algorithms:
- ID3: Uses entropy and information gain
- C4.5: Extension of ID3 with gain ratio, handles continuous attributes
- CART: Uses Gini impurity, supports regression
Pruning:
- Pre-pruning: Set maximum depth, minimum samples per split
- Post-pruning: Build full tree then remove branches
SVM finds the optimal hyperplane that maximally separates classes.
Objective: $$ \min_{\mathbf{w},b} \frac{1}{2}|\mathbf{w}|^2 $$
Subject to: $$ y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1, \quad i=1,2,...,m $$
Lagrangian: $$ L(\mathbf{w}, b, \alpha) = \frac{1}{2}|\mathbf{w}|^2 + \sum_i\alpha_i(1 - y_i(\mathbf{w}^T\mathbf{x}_i + b)) $$
Dual Form: $$ \max_\alpha \sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j\alpha_i\alpha_jy_iy_j\mathbf{x}_i^T\mathbf{x}_j $$
Key Concepts:
- Hyperplane: Decision boundary separating classes
- Margin: Distance from hyperplane to nearest data points
- Support Vectors: Data points closest to decision boundary
- Kernel Trick: Map data to higher dimensions for non-linear separation
Common Kernels:
- Linear:
$K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^T\mathbf{x}_j$ - Polynomial:
$K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i^T\mathbf{x}_j + c)^d$ - RBF (Gaussian):
$K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma|\mathbf{x}_i - \mathbf{x}_j|^2)$
Soft Margin Classification:
To handle non-separable data and outliers, slack variables
The parameter
Kernel methods enable algorithms to operate in high-dimensional feature spaces without explicitly computing the transformation.
Feature Mapping:
A feature map
Kernel Trick:
Instead of explicitly computing
This allows algorithms to implicitly work in very high (even infinite) dimensional spaces efficiently.
Mercer's Theorem:
A function
Valid Kernels:
- Must be symmetric:
$K(\mathbf{x}, \mathbf{z}) = K(\mathbf{z}, \mathbf{x})$ - Kernel matrix must be positive semi-definite
- Sum and product of valid kernels are valid kernels
Applications:
- Support Vector Machines
- Kernel Ridge Regression
- Kernel PCA
- Gaussian Processes
k-NN classifies data points based on the majority class of their k nearest neighbors.
Algorithm:
- Choose number of neighbors k
- Calculate distance to all training points
- Select k nearest neighbors
- For classification: majority vote
- For regression: average of neighbor values
Distance Metrics:
- Euclidean:
$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i(x_i - y_i)^2}$ - Manhattan:
$d(\mathbf{x}, \mathbf{y}) = \sum_i|x_i - y_i|$ - Minkowski:
$d(\mathbf{x}, \mathbf{y}) = (\sum_i|x_i - y_i|^p)^{1/p}$
Characteristics:
- Non-parametric (no training phase)
- Lazy learning (computation at prediction time)
- Sensitive to feature scaling
- Computationally expensive for large datasets
Naive Bayes applies Bayes' theorem with the "naive" assumption of feature independence.
Bayes' Theorem: $$ P(C_k|\mathbf{x}) = \frac{P(\mathbf{x}|C_k) \cdot P(C_k)}{P(\mathbf{x})} $$
Naive Independence Assumption: $$ P(C_k|x_1, x_2, ..., x_n) \propto P(C_k) \cdot \prod_{i=1}^{n}P(x_i|C_k) $$
Variants:
- Gaussian NB: Assumes features follow normal distribution
- Multinomial NB: For discrete counts (e.g., word counts)
- Bernoulli NB: For binary features
Advantages:
- Fast training and prediction
- Works well with high-dimensional data
- Requires small training data
- Handles missing values well
K-means partitions data into K clusters by minimizing within-cluster variance.
Algorithm:
- Initialize K centroids randomly
- Assignment: Assign each point to nearest centroid
- Update: Recalculate centroids as mean of assigned points
- Repeat steps 2-3 until convergence
Objective Function: $$ J = \sum_{i=1}^K\sum_{\mathbf{x}\in C_i}|\mathbf{x} - \mu_i|^2 $$
Characteristics:
- Simple and fast
- Requires specifying K in advance
- Sensitive to initialization (use k-means++)
- Assumes spherical clusters
- Sensitive to outliers
Builds hierarchy of clusters without requiring K to be specified in advance.
Agglomerative (Bottom-up):
- Start with each point as its own cluster
- Merge closest pair of clusters
- Update distance matrix
- Repeat until single cluster remains
Linkage Methods:
- Single linkage: Minimum distance between clusters
- Complete linkage: Maximum distance between clusters
- Average linkage: Average distance between clusters
- Ward's method: Minimizes within-cluster variance
Advantages:
- No need to specify K
- Produces dendrogram for visualization
- Can capture non-spherical clusters
PCA transforms data to new coordinate system where variance is maximized along new axes.
Covariance Matrix: $$ \Sigma = \frac{1}{n}\sum_{i=1}^n(\mathbf{x}_i - \mu)(\mathbf{x}_i - \mu)^T $$
Eigendecomposition: $$ \Sigma = \sum_{i=1}^d\lambda_i\mathbf{v}_i\mathbf{v}_i^T $$
where
Dimensionality Reduction:
Project data onto top k eigenvectors: $$ \mathbf{z} = \mathbf{V}_k^T(\mathbf{x} - \mu) $$
Variance Explained: $$ \frac{\sum_{i=1}^k\lambda_i}{\sum_{i=1}^d\lambda_i} $$
Applications:
- Dimensionality reduction
- Data visualization
- Noise reduction
- Feature extraction
Any matrix can be decomposed as: $$ \mathbf{A} = \mathbf{U}\Sigma\mathbf{V}^T $$
where:
-
$\mathbf{U}$ : left singular vectors -
$\Sigma$ : diagonal matrix of singular values -
$\mathbf{V}$ : right singular vectors
Matrix Approximation: $$ |\mathbf{A} - \mathbf{A}_k|F^2 = \sum{i=k+1}^{rank(\mathbf{A})}\sigma_i^2 $$
Relationship to PCA: $$ \mathbf{A}^T\mathbf{A} = \mathbf{V}\Sigma^2\mathbf{V}^T $$
t-SNE (t-Distributed Stochastic Neighbor Embedding) is a dimensionality reduction technique optimized for visualization.
Key Features:
- Converts similarities to joint probabilities
- Uses Gaussian distribution in high-dimensional space
- Uses t-distribution in low-dimensional space
- Minimizes KL divergence between distributions
- Preserves local structure
Applications:
- Visualizing high-dimensional data
- Exploring clusters in data
- Quality assessment of embeddings
Limitations:
- Computationally expensive
- Non-deterministic (different runs give different results)
- Mainly for visualization, not for dimensionality reduction
The Expectation-Maximization (EM) algorithm is used for finding maximum likelihood estimates of parameters in models with latent (unobserved) variables.
Algorithm:
The EM algorithm alternates between two steps:
-
E-step (Expectation): Compute the posterior probability of the latent variables given the data and current parameters $$ Q_i(z^{(i)}) = P(z^{(i)}|\mathbf{x}^{(i)};\theta) $$
-
M-step (Maximization): Update model parameters to maximize the expected log-likelihood $$ \theta := \arg\max_\theta \sum_i\sum_{z^{(i)}}Q_i(z^{(i)})\log\frac{P(\mathbf{x}^{(i)},z^{(i)};\theta)}{Q_i(z^{(i)})} $$
Key Properties:
- Iteratively maximizes a lower bound (ELBO) on the data's log-likelihood
- Guaranteed to converge (monotonically increases likelihood)
- May converge to local maximum
- Sensitive to initialization
Applications:
- Gaussian Mixture Models (GMM)
- Hidden Markov Models
- Missing data imputation
- Clustering with probabilistic assignments
Mixture of Gaussians Example:
In GMM, the E-step computes "soft" cluster assignments, and the M-step updates cluster means, covariances, and mixing coefficients.
Independent Component Analysis (ICA) aims to separate a multivariate signal into additive, independent, non-Gaussian subcomponents.
Problem Formulation:
Given observed mixed signals
-
$\mathbf{s}$ : vector of independent source signals -
$\mathbf{A}$ : unknown mixing matrix -
$\mathbf{x}$ : observed mixed signals
Goal: Find unmixing matrix
Key Assumptions:
- Source signals are statistically independent
- Source signals are non-Gaussian
- Mixing matrix is square and invertible
Classic Application - Cocktail Party Problem:
Separate individual speakers' voices from multiple microphone recordings where each microphone records a mixture of all speakers.
Contrast with PCA:
- PCA finds orthogonal components maximizing variance
- ICA finds independent components (stronger condition)
- PCA only requires second-order statistics
- ICA requires higher-order statistics (non-Gaussianity)
The modern paradigm that has revolutionized AI, particularly in Natural Language Processing and Computer Vision.
Core Idea:
A large, general-purpose foundation model is first pre-trained on massive amounts of broad, unlabeled data using a self-supervised objective. This model is then adapted to a wide range of specific downstream tasks, often with very little labeled data.
Pre-training Methods:
Contrastive Learning (Vision):
- Learn similar representations for different augmented views of the same image (positive pairs)
- Learn dissimilar representations for views from different images (negative pairs)
- Examples: SimCLR, MoCo, BYOL
Language Modeling (NLP):
- Predict next word in a sequence:
$P(x_t|x_1,...,x_{t-1})$ - Learns grammar, semantics, and world knowledge from text
- Examples: GPT, BERT (masked language modeling)
Adaptation Methods:
Linear Probe:
- Freeze pre-trained model
- Train new linear classifier on top of feature representations
- Tests quality of learned representations
Fine-tuning:
- Use pre-trained weights as initialization
- Train all or part of model on downstream task's labeled data
- Most common adaptation method
Zero-Shot and In-Context Learning (Language Models):
- No weight updates required
- Task described in natural language as a prompt
- Few examples included in prompt for in-context learning
- Model performs task on new queries
Advantages:
- Dramatically reduces labeled data requirements
- Transfers knowledge across tasks
- State-of-the-art performance on many benchmarks
- Enables few-shot and zero-shot learning
Examples of Foundation Models:
- Vision: CLIP, DINO, MAE
- Language: GPT-3/4, BERT, T5
- Multimodal: CLIP, Flamingo
Algorithm:
- Create M bootstrap samples from training data
- Train model on each bootstrap sample
- Aggregate predictions (voting for classification, averaging for regression)
Average Prediction: $$ f_{avg}(\mathbf{x}) = \frac{1}{M}\sum_{t=1}^M f_t(\mathbf{x}) $$
Variance Reduction: $$ E[(f_{avg}(\mathbf{X}) - Y)^2] = \frac{1}{M}\sum_{t=1}^M E[(f_t(\mathbf{X}) - Y)^2] - \frac{1}{2M^2}\sum_{s=1}^M\sum_{t=1}^M E[(f_s(\mathbf{X}) - f_t(\mathbf{X}))^2] $$
Examples:
- Random Forest
- Bagged Decision Trees
Boosting sequentially trains weak learners, each focusing on examples misclassified by previous learners.
AdaBoost Algorithm:
For each iteration t = 1, 2, ..., T:
- Train weak learner
$h_t$ on weighted data$D_t$ - Calculate error:
$\epsilon_t = \sum_{i=1}^n D_t(i) \cdot \mathbb{1}{h_t(\mathbf{x}_i) \neq y_i}$ - Calculate weight:
$\alpha_t = \frac{1}{2}\log\frac{1-\epsilon_t}{\epsilon_t}$ - Update distribution:
$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \exp(-\alpha_t y_i h_t(\mathbf{x}_i))$
Final Classifier: $$ f(\mathbf{x}) = \text{sign}\left(\sum_{t=1}^T \alpha_t h_t(\mathbf{x})\right) $$
Examples:
- AdaBoost
- Gradient Boosting
- XGBoost
Reinforcement Learning (RL) is a framework for training an agent to make a sequence of decisions in an environment to maximize cumulative reward.
An MDP formalizes the RL problem as a tuple
Components:
-
$S$ : Set of states -
$A$ : Set of actions -
$P_{sa}$ : State transition probabilities$P(s'|s,a)$ -
$\gamma$ : Discount factor$(0 \leq \gamma < 1)$ -
$R$ : Reward function$R(s)$ or$R(s,a)$
Goal:
Learn a policy
Value Function:
The value of a state under policy
Bellman Equation: $$ V^\pi(s) = R(s) + \gamma\sum_{s'}P_{s\pi(s)}(s')V^\pi(s') $$
Optimal Value Function: $$ V^*(s) = \max_\pi V^\pi(s) $$
Optimal Policy: $$ \pi^(s) = \arg\max_a \sum_{s'}P_{sa}(s')V^(s') $$
These algorithms assume knowledge of transition probabilities
Value Iteration:
Iteratively computes the optimal value function by applying the Bellman backup operator: $$ V(s) := R(s) + \max_a \gamma\sum_{s'}P_{sa}(s')V(s') $$
Algorithm:
- Initialize
$V(s) = 0$ for all states - Repeat until convergence:
- For each state, update
$V(s)$ using Bellman equation
- For each state, update
- Extract policy:
$\pi(s) = \arg\max_a \sum_{s'}P_{sa}(s')V(s')$
Policy Iteration:
Alternates between policy evaluation and policy improvement:
-
Policy Evaluation: Compute
$V^\pi$ for current policy by solving: $$ V^\pi(s) = R(s) + \gamma\sum_{s'}P_{s\pi(s)}(s')V^\pi(s') $$ -
Policy Improvement: Update policy to be greedy with respect to
$V^\pi$ : $$ \pi(s) := \arg\max_a \sum_{s'}P_{sa}(s')V^\pi(s') $$
For problems with continuous states (e.g., robotics), discretization suffers from the curse of dimensionality.
Value Function Approximation:
Approximate
where
Fitted Value Iteration:
- Sample states from environment
- Compute target values using Bellman equation
- Use supervised learning to fit
$V_\theta$ to target values - Repeat until convergence
These algorithms learn directly from interaction without knowing
Policy Gradient Methods (REINFORCE):
Directly optimize the policy
where
Variance Reduction with Baselines:
Subtract a baseline
Q-Learning:
Learn action-value function
Advantages:
- Model-free: doesn't require knowing dynamics
- Can learn from experience
- Handles stochastic environments
Challenges:
- Sample efficiency
- Credit assignment
- Exploration vs exploitation
LQR (Linear Quadratic Regulation):
For finite-horizon MDPs with linear dynamics and quadratic costs: $$ s_{t+1} = A_ts_t + B_ta_t $$ $$ \text{cost} = s_T^TU_Ts_T + \sum_{t=0}^{T-1}(s_t^TU_ts_t + a_t^TV_ta_t) $$
Optimal controller is linear:
DDP (Differential Dynamic Programming):
For non-linear systems:
- Linearize dynamics around nominal trajectory
- Apply LQR to linearized system
- Update nominal trajectory
- Iterate until convergence
LQG (Linear Quadratic Gaussian):
Extends LQR to partially observable systems:
- Use Kalman Filter to maintain belief over states
- Apply LQR controller to mean of belief
- Separation principle: estimation and control can be solved independently
Course Materials:
- CS229 Machine Learning - Stanford University
- COMS W4721 Machine Learning for Data Science - Columbia University
Foundational Resources:
- Christopher Bishop, Pattern Recognition and Machine Learning
- Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning