Regression
Part of the Statistics for Machine Learning series.
Regression
:::Definition (Multiple Linear Regression).
The parametric regression model is of the form $\mathbf{y} = f(\mathbf{X}) + \epsilon$. In the linear regression model the response is a linear combination of the inputs:
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon} \quad \Leftrightarrow \quad y_i = \boldsymbol{x}_i^T\boldsymbol{\beta} + \epsilon_i \quad \text{for } i = 1, 2, \dots, n$$
with Gaussian noise $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$.
We have the following terminology
- The response vector $y \in \mathbb{R}^n$ is $\mathbf{y} = (y_1, y_2, \dots, y_n)^T$
- The design matrix $X \in \mathbb{R}^{n \times p+1}$ containing the inputs or functions thereof and $1$s for the intercept $$\mathbf{X} = \begin{bmatrix}\mathbf{x}1^T \ \mathbf{x}_2^T \ \vdots \ \mathbf{x}_n^T\end{bmatrix} = \begin{bmatrix}1 & x$$} & \cdots & x_{1p} \ 1 & x_{21} & \cdots & x_{2p} \ \vdots & \vdots & \ddots & \vdots \ 1 & x_{n1} & \cdots & x_{np}\end{bmatrix
- The coefficient or parameter vector $\boldsymbol{\beta} = (\beta_0, \beta_1, \beta_2, \dots, \beta_p)^T$ of dimension $(p+1)$ with $\beta_0$ being the intercept.
- The residuals or error terms $\boldsymbol{\varepsilon} = (\varepsilon_1, \varepsilon_2, \dots, \varepsilon_n)^T$ of size $n$.
The assumptions of the least squares model are
- Weak exogeneity. Predictor variables $x_i$ are fixed values, rather than random variables. Meaning that the input variables are without measurement errors.
- Linearity. The mean of the response variable is a linear combination of the parameters (regression coefficients) and the predictor variables.
- Constant variance (or homoscedasticity). $\mathbb{E}(\epsilon_i\mid X) = \sigma^2$. The variance of the errors does not depend on the values of the predictor variables.
- Independence of errors. This assumes that the errors of the response variables are uncorrelated with each other.
- No linear dependence or lack of perfect multicollinearity in the predictors. For standard least squares estimation methods, the design matrix $X \in \mathbb{R}^{n \times p+1}$ must have full column rank $\text{rank}(X) = p+1$; otherwise the Gram Matrix $X^TX$ could not be inverted.
- Normality of residuals sometimes additionally assumed: $\varepsilon \mid X \sim \mathcal{N}(0, \sigma^2 I_n)$. When the errors are normal, the OLS estimator is equivalent to the maximum likelihood estimator.
- Nonsingular $X$. Even if $X \in \mathbb{R}^{m \times n}$ is non-square $m \neq n$, the multiplication with the transpose will make them so $X^TX \in \mathbb{R}^{n \times n}$. For further use $X^TX$ needs to be invertible.
There are multiple derivations to estimate the parameters. I think of the first one as pure matrix analysis and the second two taking a probabilistic perspective.
:::
:::Definition (Parameter estimate).
Assuming a multiple linear regression model with $\boldsymbol{\epsilon} \sim \mathcal{N}(0, \sigma^2)$ we wish to find the coefficient vector $\boldsymbol{\beta}$ with minimal error $\boldsymbol{\varepsilon} = \mathbf{y} - \mathbf{X}\boldsymbol{\beta}$. The sum of squared errors is defined as
$$\begin{split} \boldsymbol{\hat{\beta}} &= \arg\min_{\boldsymbol{\beta}}|\epsilon|^2 \ &= \arg\min_{\boldsymbol{\beta}}|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|^2 \ &= (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \ &= \mathbf{y}^T\mathbf{y} - \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y} - \mathbf{y}^T\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} \ &= \mathbf{y}^T\mathbf{y} - 2\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} \end{split}$$
and after taking the derivative $\frac{\partial\boldsymbol{\hat{\beta}}}{\partial \boldsymbol{\beta}}$ of the last line with respect to $\boldsymbol{\beta}$ and setting the gradient to zero we get
$$\begin{split} 0 &= -2\boldsymbol{X}^T\mathbf{y} + 2\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{\beta} \ \boldsymbol{X}^T\boldsymbol{X}\boldsymbol{\beta} &= \boldsymbol{X}^T\mathbf{y} \ \boldsymbol{\beta} &= (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\mathbf{y} \end{split}$$
and thus we have that $\boldsymbol{\hat{\beta}} = (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\mathbf{y}$. To prove that the $\boldsymbol{\hat{\beta}}$ obtained is indeed the local minimum, one needs to differentiate once more to obtain the Hessian matrix and show that it is positive definite.
:::
:::Definition (Maximum Likelihood Estimate).
If we have a linear regression model of $n$ independent, identically distributed instances with Gaussian noise
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \qquad \epsilon_i \sim \mathcal{N}(0, \sigma^2)$$
we can denote this through a conditional probability
$$p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}) = \prod_{i=1}^n \mathcal{N}(y_n \mid \mathbf{x}_n^T\boldsymbol{\beta}, \sigma^2)$$
The negative log-likelihood is then
$$\begin{split} \mathcal{L}(\boldsymbol\beta) &= -\log p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}) \ &= -\log \prod_{i=1}^n \mathcal{N}(y_n \mid \mathbf{x}n^T\boldsymbol{\beta}, \sigma^2) \ &= -\sum}^n \log \mathcal{N}(y_n \mid \mathbf{xn^T\boldsymbol{\beta}, \sigma^2) \ &= -\sum}^n \log \frac{1}{\sqrt{2\pi\sigma^2}} - \sum_{i=1}^n \log \exp\left(-\frac{1}{2}\left(\frac{y_n - \mathbf{xn^T\boldsymbol{\beta}}{\sigma^2}\right)^2\right) \ &= -\sum\right)^2 \end{split}$$}^n \log \frac{1}{\sqrt{2\pi\sigma^2}} + \frac{1}{2\sigma^2}\sum_{i=1}^n \left(y_n - \mathbf{x}_n^T\boldsymbol{\beta
Now if we assume that $\sigma^2$ is given, then the left side of the addition is a constant and so we can see that minimizing $\mathcal{L}(\boldsymbol\beta)$ corresponds to solving the least squares problem.
:::
:::Definition (Maximum A Posteriori Estimate).
We will assume that the parameters follow a normal distribution $\boldsymbol{\beta} \sim \mathcal{N}(0, b^2 I)$ where $b$ is just an estimate of the variance of the parameters. To obtain the distribution of the parameters given the data and model we follow the maximum log a posteriori estimate
$$\log p(\boldsymbol{\beta} \mid \mathbf{y}, \mathbf{X}) \propto \log p(\mathbf{y} \mid \boldsymbol{\beta}, \mathbf{X}) \log p(\boldsymbol{\beta})$$
The gradient of the negative log-posterior with respect to $\boldsymbol{\beta}$ is
$$-\frac{d \log p(\boldsymbol{\beta} \mid \mathbf{y}, \mathbf{X})}{d \boldsymbol{\beta}} = \frac{1}{\sigma^2}(\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X} - \mathbf{y}^T\mathbf{X}) + \frac{1}{b^2}\boldsymbol{\beta}$$
If we set this to zero we obtain after some operations that
$$\boldsymbol{\beta}_{MAP} = (\mathbf{X}^T\mathbf{X} + \frac{\sigma^2}{b^2}\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$$
Thus we can see that the MAP estimate corresponds to a regularized least squares solution.
:::
:::Definition (Transformation matrices for ordinary least squares).
Assume a linear regression model and least squares estimation. Then the estimated parameters, responses, and residuals are given by
$$\begin{split} \hat{\beta} &= (X^T X)^{-1} X^T y \ \hat{y} &= X(X^T X)^{-1} X^T y = Py \ \hat{\epsilon} &= (I_n - P)y \end{split}$$
where $P = X(X^T X)^{-1} X^T$ is called the projection or hat matrix. For each $i$-th row and $j$-th column of $P$ we have that $p_{ij} = \frac{\mathbb{C}ov(y, \hat{y})}{\mathbb{V}ar(y)}$.
:::
:::Definition (Distributions of ordinary least squares).
Assume a linear regression model and least squares estimation. We can derive to get
$$\begin{split} y &\sim \mathcal{N}(X\beta, \sigma^2) \ \hat{\beta} &\sim \mathcal{N}\left(\beta, \sigma^2(X^T X)^{-1}\right) \ \hat{y} &\sim \mathcal{N}\left(X\beta, \sigma^2 P\right) \ \hat{\varepsilon} &\sim \mathcal{N}\left(0, \sigma^2(I_n - P)\right) \end{split}$$
:::
:::Definition (Multivariate Least Squares).
We drop the independence or multicollinearity assumption from the linear regression $y = X\beta + \epsilon$ such that $\epsilon \sim \mathcal{N}(0, \Sigma)$ is the multivariate normal distribution where the covariance matrix
$$\begin{split} \Sigma = \mathbb{C}ov(X) &= \mathbb{E}[(X - \mathbb{E}[X])(X - \mathbb{E}[X])^T] \ &= \frac{1}{n-1}X^TX \end{split}$$
informs about the linear dependence. The diagonal consists of the variables variance $\mathbb{V}ar(X_i)$ and each column about the respective covariance $\mathbb{C}ov(X_i, X_j)$ for $i \neq j$.
We estimate as before but using a special case of Euclidean norm $|\cdot|$, namely the Mahalanobis distance
$$\begin{split} \hat{\beta} &= \arg\min_{\beta}|\epsilon|^2 \ &= \arg\min_{\beta}|y - X\beta|^2 \ &= \arg\min_{\beta}(y - X\beta)^T\Sigma^{-1}(y - X\beta) \ &= \left(X^T\Sigma^{-1}X\right)^{-1}X^T\Sigma^{-1}y \end{split}$$
:::
:::Definition (Partial Least Squares).
A different method, generally similar to principal components in the sense of finding orthonormal vectors, but including the response $y$ and with the least squares method on a reduced number of parameters. We use a score matrix $T = XW$ with $T \in \mathbb{R}^{n \times q}$ and a reduced number of coefficients $\gamma \in \mathbb{R}^{q}$ where $q \leq p$.
$$\begin{split} y &= T\gamma + \epsilon \ &= (XW)\gamma + \epsilon \ &= X(W\gamma) + \epsilon \ &\approx X\beta + \epsilon \end{split}$$
:::
:::Definition (Continuum Regression).
The least squares method, principal components and partial least squares can be generalized as follows. Our model is
$$y = Q\gamma + \epsilon$$
with $\gamma \in \mathbb{R}^q$. We wish to find a matrix $W \in \mathbb{R}^{n \times q}$ with $q \leq p$ where each $w_i \in W$ is an orthonormal vector $|w_i| = 1$ and $\mathbb{C}ov(Xw_i, Xw_j) = 0$ for $i < j$ such that $Q = XW$ yields maximum variance. Phrasing this as an optimization problem
$$w_i = \arg\max_w \left[\left(\mathbb{C}ov(y, Xw)^2\right) \left(\mathbb{V}ar(Xw)^{\frac{\psi}{1-\psi}-1}\right)\right]$$
The control parameter $\psi$ regulates how much the information of the response $y$ is included in finding the coefficients.
$$\psi = \begin{cases} 0 \implies \frac{\mathbb{C}ov(y, Xw)^2}{\mathbb{V}ar(Xw)} & \text{Least Squares} \ 0.5 \implies \mathbb{C}ov(y, Xw)^2 & \text{Partial Least Squares} \ 1 \implies \mathbb{V}ar(Xw) & \text{Principal components} \end{cases}$$
:::
:::Definition (Ridge and Lasso regression).
This method helps overcome non-singularity of $(X^TX)^{-1}$ by adding a constant to it. We use a mean-centered linear model
$$\boldsymbol{y_} = \boldsymbol{X}_\boldsymbol{\beta}_{R \lor L} + \boldsymbol{\epsilon}$$
with $X \in \mathbb{R}^{n \times p+1}$ and $\epsilon \sim \mathcal{N}(0, \Sigma)$. To fit the model we calculate the intercept as mean $\beta_0 = \frac{1}{n}\sum y$ and for the remaining $p$ coefficients in $\beta_R$ we use the design matrix $\mathbb{R}^{n \times p-1}$ where $\lambda \geq 0$ is the control parameter; the larger it is the smaller the coefficients. We can estimate using the described method to get a ridge coefficient $\beta_R$
$$\begin{split} \boldsymbol\beta_R &= \arg\min_{\boldsymbol{\beta}}|\epsilon|^2 \ &= \arg\min_{\boldsymbol{\beta}}|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|^2 + \lambda|\boldsymbol{\beta}|_2^2 \end{split}$$
and a lasso coefficient $\beta_L$
$$\begin{split} \boldsymbol\beta_L &= \arg\min_{\boldsymbol{\beta}}|\epsilon|^2 \ &= \arg\min_{\boldsymbol{\beta}}|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|^2 + \lambda|\boldsymbol{\beta}|_1 \end{split}$$
with the norms
$$\begin{split} |x|_1 &= \sum |x_i| \ |x|_2 &= \sqrt{\sum x_i^2} \end{split}$$
We can then estimate using
$$\beta_{R \lor L} = (X^TX + \lambda I)^{-1}X^Ty$$
The intercept has been left out because of centering. Penalization of the intercept would make the procedure depend on the origin chosen for $y$. If $X$ is orthogonal then ridge simply scales.
:::
Dimensionality Reduction
Eigendecomposition of Covariance Matrix
To center a data matrix, we subtract the mean for each observation. Suppose we have a centered data matrix $X \in \mathbb{R}^{n \times p}$ with sample means $\overline{x}_n$. The unbiased sample covariance is then defined to be
$$\Sigma = \frac{1}{n-1}X^TX$$
Since $\Sigma$ is a symmetric $n \times n$ matrix, it has an eigendecomposition
$$\Sigma = V\Lambda V^T$$
where $\Lambda = \operatorname*{diag}(\lambda_1, \dots, \lambda_n)$ is an $n \times n$ diagonal matrix consisting of the eigenvalues of $\Sigma$ and $V$ is an orthogonal matrix $VV^T = V^TV = I$ whose columns are unit $|v_i| = 1$ mutually orthogonal $v_i v_j = 0$ eigenvectors $v_1, \dots, v_n$ of $\Sigma$ and so form an orthonormal basis. Note: A covariance matrix is always positive semi-definite and thus always has non-negative eigenvalues.
Eigendecomposition of Correlation Matrix
The $n \times n$ correlation matrix $C$ is a standardized covariance matrix. Each entry is defined as
$$c_{ij} = \frac{\operatorname*{cov}(X_i, X_j)}{\sigma_{X_i}\sigma_{X_j}}$$
with the properties of each diagonal entry being one. We can obtain $C$ from a covariance matrix $\Sigma$ by letting $D = \sqrt{\operatorname*{diag}(\Sigma)}$ and then $C = D^{-1}\Sigma D^{-1}$.
The major difference is that with $C$ we have scale invariance. The benefit of standardizing is to compare quantities which may have different range values.
Singular Value Decomposition
We can avoid computing the covariance or correlation matrix altogether if we apply the singular value decomposition on the centered data matrix $X$ given by
$$X = U\Delta V^T$$
with the $p \times p$ matrix $\Delta = \operatorname*{diag}(\delta_1, \dots, \delta_p)$ with non-negative singular values of $X$, the $n \times p$ orthonormal matrix $U$ with left singular vectors (i.e. eigenvectors of $XX^T$) and the $p \times p$ orthonormal $V$ with right singular vectors (i.e. eigenvectors of $X^TX$). To make this relationship clear, consider the covariance up to a factor $1/(n-1)$ can be factorized
$$\Sigma = X^TX = V\Lambda V^T$$
which requires us to get the eigenvalues (also known as principal values) in the matrix $\Lambda$. We get it if we just square the singular values: $\Lambda = \Delta^2$. Then
$$\Sigma = X^TX = V\Delta^T U^T U\Delta V^T = V\Delta\Delta V^T = V\Lambda V^T$$
We simply consider the $k$ first eigenvectors, i.e. principal components of $V$, giving the $p \times k$ matrix $V_k$. We then project in the corresponding space and obtain a new matrix of the first $k$ principal component scores
$$Z_k = XV_k = U_k\Delta_k V_k^T V = U_k\Delta_k$$
To do regression we use the linear model
$$y = Z\alpha + \epsilon$$
with the parameter estimate by least squares given by
$$\begin{split} \hat{\alpha} &= (Z^TZ)^{-1}Z^Ty \ &= (\Delta_k^T U_k U_k \Delta_k)^{-1}\Delta_k^T U_k^T y \ &= \Sigma_k^{-1} U_k^T y \end{split}$$
The predictions will be
$$\begin{split} \hat{y} &= Z\hat{\alpha} \ &= U_k\Delta_k\Sigma_k^{-1}U_k^T y \ &= U_k U_k^T y \end{split}$$