Classification

Fri 27 March 2026

by Daniel Melichar

in Crash Course

tagged statistics, classification, machine learning

Part of the Statistics for Machine Learning series.

Linear Classification

Linear Classification

Fundamentals

The response takes distinct values $y \in \mathbb{N}^K$ with $K \geq 1$. Let our predictor $G(x)$ be a function taking in an observation $\boldsymbol{x}^T = (1, x_1, x_2, \dots, x_p)$, i.e. a row $\boldsymbol{x} \in X$ from the design matrix $X \in \mathbb{R}^{n \times (p+1)}$ as follows

$$y_k = G(\boldsymbol{x}) = \begin{cases} 1 & \text{if } \arg\max_{k} f_1(\boldsymbol{x}) = \boldsymbol{x}^T\boldsymbol{\beta}1 \text{ otherwise } 0 \ 1 & \text{if } \arg\max} f_2(\boldsymbol{x}) = \boldsymbol{x}^T\boldsymbol{\beta2 \text{ otherwise } 0 \ & \cdots \ 1 & \text{if } \arg\max 0 \end{cases}$$} f_k(\boldsymbol{x}) = \boldsymbol{x}^T\boldsymbol{\beta}_k \text{ otherwise

for all $k \leq K$.

A hyperplane is a generalization of lines and planes. It is given by $\mathbf{w}\cdot\mathbf{x} + b = 0$. We can also define a hyperplane, known as decision boundary $D$, to separate two categories $i, j \in {1, 2, \dots, k}$ by the set of points where $f_i(\boldsymbol{x}) = f_j(\boldsymbol{x})$ and thus

$$\begin{split} D &= { \boldsymbol{x} \mid f_j(\boldsymbol{x}) - f_i(\boldsymbol{x}) = 0 } \ &= { \boldsymbol{x} \mid \boldsymbol{x}^T\beta_j - \boldsymbol{x}^T\beta_i = 0 } \ &= { \boldsymbol{x} \mid \boldsymbol{x}^T(\beta_j - \beta_i) = 0 } \end{split}$$

A more advanced and general formulation is here.

Indicator Matrix

Assume that the response variable can take a distinct number of values $y \in \mathbb{N}^K$ where $K > 1$ represents the number of categories. We define the indicator matrix $Y \in \mathbb{R}^{n \times K}$ as follows: for any observation's response $y_i$ with $i \leq n$, if it belongs to category $1 \leq k \leq K$ then mark it

$$y_i = \begin{cases} 1 & \text{if } i = k \ 0 & \text{otherwise} \end{cases}$$

So, we also need coefficients for each category

$$\hat{\beta}_k = (X^TX)^{-1}X^Ty_k \quad \text{with } k = 1, 2, \dots, K$$

which we collect in a matrix $\hat{B} \in \mathbb{R}^{(p+1) \times K}$ like so

$$\begin{split} \hat{B} &= (\hat{\beta}_1, \hat{\beta}_2, \dots, \hat{\beta}_K) \ &= (X^TX)^{-1}X^TY \end{split}$$

and we estimate using $\hat{Y} = X\hat{B}$ where the category membership is given by the highest value.

Generative

The name Discriminant Analysis is just historic, see here.
The methods are like PCA for classification.
Note that there are no coefficients, hence no inference and just classification.

Linear Discriminant Analysis

Assumptions

We assume that each response $y_k$ is normally distributed.
We wish to determine the category $k \in {1, 2, \dots, K}$ for every observation.
So, if we collect all $Y = {y_1, y_2, \dots, y_k}$ responses we have $K$ normal distributions which together form a multivariate normal distribution $Y \sim \mathcal{N}(\mu, \Omega)$ with $\mu$ the vector containing the means of each category, and the homoscedastic $\Omega$ the $K \times K$ covariance matrix (variance does not increase in each $k$).
We also assume multicollinearity and independence.

Let $\pi_k$ denote the prior probability of an observation belonging to category $k$. Let $\varphi_k(x)$ be the density function of the multivariate normal. We have that the conditional probability of category membership given an observation $\boldsymbol{x}$ which is a row of the design matrix:

$$P(G = k \mid x) = \frac{\varphi_k(x)\pi_k}{\sum_{i=1}^K\varphi_i(x)\pi_k}$$

The decision boundary $D$ for two categories $l, k \in {1, 2, \dots, K}$ is given by $0 = P(G = k \mid x) - P(G = l \mid X)$.

Taking the log, noting $\log(1) = 0$ and expanding we get

$$\begin{split} \log(1) &= \log\frac{P(G = k \mid x)}{P(G = l \mid x)} \ &= \log\frac{\varphi_k(x)\pi_k}{\varphi_l(x)\pi_l} \ &= \log\varphi_k(x) + \log\pi_k - \log\varphi_l(x) - \log\pi_k \end{split}$$

Applying the density function of the multivariate normal to the above and simplifying gives the linear discriminant function

$$\delta_k(x) = (x^T\Omega^{-1}\mu_k) - \frac{1}{2}(\mu_k^T\Omega^{-1}\mu_k) + \log\pi_k$$

Then we classify according to $G(X) = \arg\max_k\delta_k(x)$.

Quadratic Discriminant Analysis

We drop the assumption of homoscedastic covariance matrix $\Omega$, i.e. the variances of each category is not static and each category covariance matrix is distinct. Then we can derive the new discriminant function

$$\delta_k(x) = -\frac{1}{2}\log|\Omega_k| - \frac{1}{2}\left(\Omega_k^{-1}|x - \mu_k|_2^2\right) + \log\pi_k$$

We can regularize the covariance matrix by introducing a control parameter $\alpha \in [0,1]$ and $\hat{\Omega}$ an average over the category covariance matrices $\hat{\Omega}_k$

$$\hat{\Omega}_k(\alpha) = \alpha\hat{\Omega}_k + (1-\alpha)\hat{\Omega}$$

Discriminative

These methods allow for inference.
They are more like traditional LS.
The difference between models is the procedure for determining (training) the optimal weights/coefficients.

Logit and Logistic Function

A standard logistic function is a S-shaped function (a.k.a. sigmoid function) with the form

$$f(x) = \frac{1}{1+\exp{-x}} = \frac{\exp{x}}{1+\exp{x}}$$

The inverse function $f^{-1}$ is called the logit function.

$$\textit{logit}(x) = f^{-1}(x) = \ln\frac{x}{1-x}$$

Proof. Let $y = \textit{logit}(x) = \ln\frac{x}{1-x}$. Then

$$\begin{split} \exp{y} &= \frac{x}{1-x} \ 1 + \exp{y} &= 1 + \frac{x}{1-x} \ 1 + \exp{y} &= \frac{1-x}{1-x} + \frac{x}{1-x} \ 1 + \exp{y} &= \frac{1}{1-x} \ 1 - x &= \frac{1}{1+\exp{y}} \ x &= 1 - \frac{1}{1+\exp{y}} \ x &= \frac{\exp{y}}{1+\exp{y}} \end{split}$$

Simple Logistic Regression

We will model this using a GLM of the form

$$\mathbb{E}(Y \mid X) = g^{-1}(\boldsymbol{x}^T\boldsymbol{\beta}) \Leftrightarrow g(\mathbb{E}(Y \mid X)) = \boldsymbol{x}^T\boldsymbol{\beta}$$

where the response takes up two $K = 2$ categories $y \in {0, 1}$.

We also assume $\boldsymbol{x}^T = (1, x_1, x_2, \dots, x_p)$ and $\boldsymbol{\beta} \in \mathbb{R}^{p+1}$. The link function $g$ will be the logit function, and so we get just as in the proof above

$$\begin{split} \mathbb{E}(Y \mid X) &= \frac{\exp{\boldsymbol{x}^T\boldsymbol{\beta}}}{1+\exp{\boldsymbol{x}^T\boldsymbol{\beta}}} \ &\Leftrightarrow \ \ln\left(\frac{\mathbb{E}(Y \mid X)}{1-\mathbb{E}(Y \mid X)}\right) &= \boldsymbol{x}^T\boldsymbol{\beta} \end{split}$$

For estimation, we start with the analogous (i.e. binary choice) pmf of the Bernoulli.

$$f(k; p) = kp + (1-k)(1-p)$$

Let $p_k(x) = \mathbb{E}(Y = k \mid X)$ with $k \in {0, 1}$. Our error function is to maximise the positive log-likelihood over all $n$ samples.

$$\begin{split} l(\boldsymbol{\beta}) &= \arg\min_\beta\left{\ln f(k; p)\right} \ &= \arg\min_\beta\left{\sum_{i=1}^n y_i\ln p_k(x_i) + \left(1 - \sum_{i=1}^n y_i\right)\ln\left(1 - p_k(x_i)\right)\right} \ &= \arg\min_\beta\sum_{i=1}^n\left{y_i\boldsymbol{x}i^T\boldsymbol{\beta} + (1-y_i)\ln\left(1 - \frac{\exp{\boldsymbol{x}_i^T\boldsymbol{\beta}}}{1+\exp{\boldsymbol{x}_i^T\boldsymbol{\beta}}}\right)\right} \ &= \arg\min\beta\sum_{i=1}^n\left{y_i\boldsymbol{x}_i^T\boldsymbol{\beta} - \ln\left(1 + \exp\left{\boldsymbol{x}_i^T\boldsymbol{\beta}\right}\right)\right} \end{split}$$

To estimate the parameters $\beta$ we differentiate, set to zero and so on. There is no closed-form solution, we approximate with Iterative Reweighted Least Squares. For matrix notation, we introduce the following

$\mathbf{X} \in \mathbb{R}^{n \times (p+1)}$ the design matrix
$\mathbf{y}$ the response vector
vector of (current) predictions $\mathbf{p}^T = (p_k(\mathbf{x_1}), p_k(\mathbf{x_2}), \cdots, p_k(\mathbf{x_n}))$
The diagonal weight matrix $W \in \mathbb{R}^{n \times n}$ where $$W = \begin{pmatrix} p_k(x_1)(1-p_k(x_1)) & 0 & 0 & 0 \ 0 & p_k(x_2)(1-p_k(x_2)) & 0 & 0 \ 0 & 0 & \cdots & 0 \ 0 & 0 & 0 & p_k(x_n)(1-p_k(x_n)) \end{pmatrix}$$

We get the following expressions

$$\begin{split} \frac{\partial l(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} &= \sum_{i=1}^n (y_i - p_k(\boldsymbol{x_i}))\boldsymbol{x_i}^T = 0 \ &= \boldsymbol{X}^T(\mathbf{y} - \mathbf{p}) = 0 \ \frac{\partial^2 l(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}\partial\boldsymbol{\beta}^T} &= \boldsymbol{X}^T\boldsymbol{W}\boldsymbol{X} \end{split}$$

We then update iteratively for some number $i \in \mathbb{N}$ with the following rule

$$\begin{split} \boldsymbol{\beta}_{i+1} &= \boldsymbol{\beta}_i - \left(\frac{\partial^2 l(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}\partial\boldsymbol{\beta}^T}\right)^{-1} \cdot \frac{\partial l(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}} \ &= \boldsymbol{\beta}_i + (\boldsymbol{X}^T\boldsymbol{W}\boldsymbol{X})^{-1}\boldsymbol{X}^T(\mathbf{y} - \mathbf{p}) \end{split}$$

Tree-Based Methods

Given $y \in \mathbb{R}^n$ and $X \in \mathbb{R}^{n \times p}$ in the form $(y_i, x_{i1}, x_{i2}, \dots, x_{ip})_{i=1}^n$ we partition the range of $X$ into $M > 1$ regions $R_1, R_2, \dots, R_m$ and define each region to have its own response $c_m$.

$$y = f(X) = \sum_{m=1}^{M} c_m I(x, m)$$

with the indicator function $I(x, m) = 1$ if $x \in R_m$ and $0$ otherwise. An estimate is given by the average over all values in the respective region.

$$\hat{c}m = \frac{1}{|R_m|}\sum y_i$$

The core computation is finding the best input variable and split point in each region. Consider a variable $x_j \in X$ and w.l.o.g. pick two regions $R_1, R_2$. Our goal is to find a split point $s \in \mathbb{R}$ such that

$$R_1(j,s) := {X \mid x_j \leq s} \land {X \mid x_j \geq s} =: R_2(j,s)$$

which we can think of as a node with decision rule in a binary tree $T = (V, E)$ and using the least squares goal we can find the parameters $(j, s)$ of each node by

$$(j, s) = \arg\min_{j,s}\left(\min_{c_1}\sum_{y_i \in R_1(j,s)}(y_i - c_1)^2 + \min_{c_2}\sum_{y_i \in R_2(j,s)}(y_i - c_2)^2\right)$$

In the case of classification on $k = 1, \dots, K$ categories where $y \in \mathbb{N}^K$ we can define the number of categories in each region as an average and then pick the region with the highest probability

$$\begin{split} & P(Y = k \mid R_m) = \frac{1}{|R_m|}\sum_{y_i \in R_m} I(y_i = k) \ & \implies \ & P(Y = k \mid X) = \arg\max_{R_m} P(Y = k \mid R_m) \end{split}$$

If the number of iterations on the parameter search is too large, we might run into overfitting. So, a control parameter is introduced that regulates the number of parameter searches and thus on the number of nodes in the tree. There are different quality or cost measures for the tree

$$Q_m(T) = \begin{cases} \displaystyle\frac{1}{|R_m|}\sum_{y_i \in R_m}(y_i - \hat{c}m)^2 & \text{Mean sum of squares} \ \displaystyle\frac{1}{|R_m|}\sum \ \displaystyle\sum_{k=1}^K P(Y = i \mid R_m)(1 - P(Y = k \mid R_m)) & \text{Gini Index} \ \displaystyle\sum_{k=1}^K P(Y = i \mid R_m)\log P(Y = i \mid R_m) & \text{Cross entropy} \end{cases}$$} I(y_i \neq k) & \text{Misclassification error

and regulate the size of the tree with the parameter $\alpha \geq 0$ where the larger $\alpha$ is the smaller the tree will be

$$c_\alpha(T) = \sum_{m=1}^{|T|} \alpha|T| + Q_m(T)|R_m|$$

A random forest is a set of decision trees $F = {T_1, T_2, \cdots, T_q}$ where each $T_i$ is trained using bootstrap samples for train/test on a random subset of the input variables $x' \subset X$. Variable importance is measured by calculating $Q_m(T_i)$ on the train set, and, after shuffling the input variables $x_i$ calculating the quality on the test set. For classification, we take the majority over all trees and for regression we take the average.

Support Vector Machines

Assume a dataset of $n$ points $(\mathbf{x_i}, \mathbf{y_i})_{i=1}^n$ with $\mathbf{y_i} \in {0, 1}$ and $\mathbf{x_i} \in \mathbb{R}^p$. A hyperplane is a generalization of lines and planes. It is the set of points $\mathbf{x}$ where

$$\mathbf{w}^T\mathbf{x} - b = 0$$

with the vector $\mathbf{w}$ being orthogonal to the plane. Multiplying with the norm $1/|\mathbf{w}|$ we get

$$\hat{\mathbf{w}}\cdot\mathbf{x} + b' = 0$$

with the orthonormal unit vector $\hat{\mathbf{w}} = \mathbf{w}/|\mathbf{w}|$ and the distance vector from the plane to the origin $b' = b/|\mathbf{w}|$. We will use the following model consisting of two planes

$$y = \begin{cases} 0 & \text{if } f(\mathbf{x}) = \mathbf{w}\cdot\mathbf{x} + b \geq 1 \ 1 & \text{if } f(\mathbf{x}) = \mathbf{w}\cdot\mathbf{x} + b \leq -1 \end{cases}$$

If we subtract both of these equations from another and multiply by the norm, we get the distance between the planes

$$\begin{split} \mathbf{w}\cdot\mathbf{x} &= 2 \ \frac{\mathbf{w}}{|\mathbf{w}|}\cdot\mathbf{x} &= \frac{2}{|\mathbf{w}|} \end{split}$$

We can combine the equations with for the following optimization goal which, in essence, is to have the line maximize the distance between the categories. This is equivalent to a minimal distance (measured by the length of $\mathbf{w}$) between the planes.

$$\begin{split} & \arg\min_{w}|\mathbf{w}|_2^2 \ \text{s.t. } & \mathbf{y_i}(\mathbf{w}\cdot\mathbf{x} + b) \geq 1 \quad \text{for } i = 1, 2, \dots, n \end{split}$$

If the data is not linearly separable we apply the hinge loss function $h_i = \left(0, 1 - \mathbf{y_i}(\mathbf{w}\cdot\mathbf{x} + b)\right)$ which is zero if the prediction matches the response

$$\begin{split} & \arg\min_{w}|\mathbf{w}|2^2 + C\sum^n \max h_i \ \text{s.t. } & \mathbf{y_i}(\mathbf{w}\cdot\mathbf{x} + b) \geq 1 - h_i \quad \text{for } i = 1, 2, \dots, n \end{split}$$

where $C$ is a control parameter. To move beyond linearly we apply the kernel method. The input variables are transformed by a kernel function

$$\kappa(x_i, x_j) = \varphi(x_i)\cdot\varphi(x_j)$$