Model Evaluation & Selection


Part of the Statistics for Machine Learning series.

Models

:::Definition (Decomposition of variance).

In a linear regression model with an intercept, fitted by least squares, the total sum of squares can be partitioned into an explained component and an unexplained component.

$$\text{TSS} = \text{RSS} + \text{ESS}$$

where we have

  • Total Sum of Squares $\text{TSS} = \sum_{i=1}^n (y_i - \overline{y})^2$
  • Residual Sum of Squares $\text{RSS} = \sum_{i=1}^n (y_i - \hat{y}_i)^2$
  • Explained Sum of Squares $\text{ESS} = \sum_{i=1}^n (\hat{y}_i - \overline{y})^2$

:::
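The decomposition above can be checked numerically. The following is a minimal sketch on synthetic data (the data, seed, and variable names are assumptions made for illustration), using NumPy's least squares solver with an explicit intercept column.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y is linear in x plus noise.
rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 1.5 + 2.0 * x + rng.normal(scale=0.5, size=n)

# Design matrix with an intercept column; the decomposition relies on the intercept.
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat

tss = np.sum((y - y.mean()) ** 2)      # total sum of squares
rss = np.sum((y - y_hat) ** 2)         # residual (unexplained) sum of squares
ess = np.sum((y_hat - y.mean()) ** 2)  # explained sum of squares

print(f"TSS = {tss:.4f}, RSS + ESS = {rss + ess:.4f}")
```

Dropping the column of ones from `X` generally breaks the equality, which is why the intercept matters here.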

:::Definition (Testing coefficients to be zero).

Assume two nested models

$$\begin{split} M_0 : y &= \beta_0 + x_1\beta_1 + \cdots + x_{p_0}\beta_{p_0} + \epsilon \\ M_1 : y &= \beta_0 + x_1\beta_1 + \cdots + x_{p_1}\beta_{p_1} + \epsilon \end{split}$$

where $p_0 < p_1$. The null hypothesis is that the additional coefficients $\beta_{p_0+1}, \dots, \beta_{p_1}$ are all zero, i.e. that the smaller model $M_0$ explains the variance equally well with fewer coefficients. It is rejected at level $\alpha$ if the statistic

$$\begin{split} F &= \frac{\text{explained variance}}{\text{unexplained variance}} \\ &= \frac{(RSS_{M_0} - RSS_{M_1}) / (p_1 - p_0)}{(RSS_{M_1}) / (n - p_1 - 1)} \end{split}$$

is bigger than the $(1-\alpha)$-quantile of the $F_{p_1-p_0,\, n-p_1-1}$ distribution, or equivalently if its p-value is smaller than $\alpha$.

:::
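As a rough sketch of the test above, the F statistic can be computed from the two residual sums of squares. The helper names `rss` and `nested_f_statistic` and the synthetic data are assumptions for illustration. The quantile itself is not computed here; if SciPy is available, `scipy.stats.f.ppf(1 - alpha, p1 - p0, n - p1 - 1)` should provide it.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of the least squares fit of y on X."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta_hat) ** 2))

def nested_f_statistic(X0, X1, y):
    """F statistic for M_0 (columns X0) nested in M_1 (columns X1).

    Both design matrices are assumed to contain an intercept column,
    so the number of inputs is the column count minus one.
    """
    n = len(y)
    p0, p1 = X0.shape[1] - 1, X1.shape[1] - 1
    rss0, rss1 = rss(X0, y), rss(X1, y)
    return ((rss0 - rss1) / (p1 - p0)) / (rss1 / (n - p1 - 1))

rng = np.random.default_rng(1)
n = 100
x = rng.normal(size=(n, 3))
# Only the first input truly matters in this synthetic example.
y = 1.0 + 2.0 * x[:, 0] + rng.normal(size=n)

X0 = np.column_stack([np.ones(n), x[:, :1]])  # M_0: intercept + x_1
X1 = np.column_stack([np.ones(n), x])         # M_1: intercept + x_1, x_2, x_3
F = nested_f_statistic(X0, X1, y)
print(f"F = {F:.3f} on ({X1.shape[1] - X0.shape[1]}, {n - X1.shape[1]}) degrees of freedom")
```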

:::Definition (Explained Variance).

The coefficient of determination $R^2$ describes the proportion of the variance of $y$ that is explained by the model

$$\begin{split} R^2 &= 1 - \frac{RSS}{TSS} \\ &= \frac{ESS}{TSS} \\ &= \text{Cor}^2(y, \hat{y}) \end{split}$$

where $\text{Cor}^2$ is the square of the Pearson correlation coefficient $\text{Cor}(y, \hat{y}) = \frac{\sigma_{y\hat{y}}}{\sigma_y\sigma_{\hat{y}}}$. The closer the score is to 1, the better the fit. Because $R^2$ never decreases when inputs are added, the adjusted R-squared penalizes the number of inputs $p$

$$\bar{R}^2 = 1 - \frac{RSS / (n-p-1)}{TSS / (n-1)}$$

:::
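The three expressions for $R^2$ and the adjusted variant can be compared numerically. This is a minimal sketch on synthetic data; the function names `r_squared` and `adjusted_r_squared` are assumptions for illustration, and $p$ counts the inputs without the intercept.

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - RSS / TSS."""
    rss = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - np.mean(y)) ** 2)
    return float(1.0 - rss / tss)

def adjusted_r_squared(y, y_hat, p):
    """Adjusted R^2 for a model with p inputs (intercept not counted)."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - np.mean(y)) ** 2)
    return float(1.0 - (rss / (n - p - 1)) / (tss / (n - 1)))

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(3)
n, p = 80, 2
x = rng.normal(size=(n, p))
y = 0.5 + x @ np.array([1.0, -2.0]) + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat

r2 = r_squared(y, y_hat)
r2_adj = adjusted_r_squared(y, y_hat, p)
cor2 = float(np.corrcoef(y, y_hat)[0, 1] ** 2)  # squared Pearson correlation
print(f"R^2 = {r2:.4f}, Cor^2 = {cor2:.4f}, adjusted R^2 = {r2_adj:.4f}")
```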

:::Definition (Information criteria).

Assume we want to approximate a true stochastic function $f$ with a parametric one, $f(x) \approx g(x; \beta)$, for $\beta \in \mathbb{R}^{p+1}$ and $x \in \mathbb{R}^p$ with $p > 0$. We can quantify the information lost by our approximation with the Kullback-Leibler divergence

$$\begin{split} I(f, g) &= \int f(x) \log\left(\frac{f(x)}{g(x; \beta)}\right) dx \\ &= C - \mathbb{E}_f(\log(g(x; \beta))) \end{split}$$

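Here $C = \mathbb{E}_f(\log(f(x)))$ does not depend on $\beta$, so minimizing $I(f, g)$ is equivalent to maximizing the expected log-likelihood. Estimating that expectation from data with the maximized log-likelihood gives, up to scaling,

$$I(f, g) \approx C - \max_\beta \log(\mathcal{L}(y \mid x, \beta))$$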

Two criteria drop the unknown constant $C$, so they only measure information relative to other candidate models, and add a penalty that grows with the number of parameters $p$ (smaller values are better). They are defined as follows

$$\begin{split} \text{AIC} &: -2\max_\beta\log(\mathcal{L}(y \mid x, \beta)) + 2p \\ \text{BIC} &: -2\max_\beta\log(\mathcal{L}(y \mid x, \beta)) + \log(n)p \end{split}$$

If the residual variance $\sigma^2$ in least squares is known, then up to an additive constant that is the same for every model we have the following, where $C_p$ is Mallows's $C_p$

$$\begin{split} \text{AIC} &: \frac{\text{RSS}}{\sigma^2} + 2p \\ \text{BIC} &: \frac{\text{RSS}}{\sigma^2} + \log(n)p \\ C_p &: \frac{\text{RSS}}{\sigma^2} + 2p - n \end{split}$$

:::
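To make the known-variance formulas concrete, here is a small sketch that scores three candidate models. The candidate names, their RSS values, and $\sigma^2 = 1$ are invented purely for illustration. They are chosen so that AIC and BIC disagree, which shows the heavier BIC penalty for $\log(n) > 2$.

```python
import math

def aic(rss, sigma2, p):
    """AIC for least squares with known residual variance sigma2."""
    return rss / sigma2 + 2 * p

def bic(rss, sigma2, p, n):
    """BIC for least squares with known residual variance sigma2."""
    return rss / sigma2 + math.log(n) * p

def mallows_cp(rss, sigma2, p, n):
    """Mallows's C_p, which equals the AIC above minus n."""
    return rss / sigma2 + 2 * p - n

# Hypothetical candidates: name -> (number of inputs p, RSS). Values are made up.
n, sigma2 = 100, 1.0
candidates = {"small": (2, 112.0), "medium": (4, 105.0), "large": (8, 101.0)}

for name, (p, rss) in candidates.items():
    print(name, aic(rss, sigma2, p), round(bic(rss, sigma2, p, n), 2),
          mallows_cp(rss, sigma2, p, n))

best_aic = min(candidates, key=lambda m: aic(candidates[m][1], sigma2, candidates[m][0]))
best_bic = min(candidates, key=lambda m: bic(candidates[m][1], sigma2, candidates[m][0], n))
print("AIC picks:", best_aic, "| BIC picks:", best_bic)
```

Since $C_p$ differs from this AIC only by the constant $n$, both always rank models identically.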

:::Definition (Variable selection).

For an input matrix $X \in \mathbb{R}^{n \times (p+1)}$, two methods for reducing $p$ are

Stepwise algorithm

Backward

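  1. Generate $\hat{\beta}$ with all inputs $X \in \mathbb{R}^{n \times (p+1)}$
  2. Calculate F-statistic (or $R^2$, AIC, BIC)
  3. Remove the variable $j \in \{1, \dots, p\}$ that contributes least according to the statistic
  4. Repeat from step 2 until no removal improves the statistic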

Forward

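  1. Generate $\hat{\beta}$ with no inputs, i.e. fit the intercept-only model with $X = \boldsymbol{1} \in \mathbb{R}^{n}$
  2. Calculate F-statistic (or $R^2$, AIC, BIC)
  3. Add the variable $j \in \{1, \dots, p\}$ that improves the statistic the most
  4. Repeat from step 2 until no addition improves the statistic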
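The forward variant can be sketched as a greedy loop. This is only an illustration under stated assumptions: it uses AIC in the known-variance form $\text{RSS}/\sigma^2 + 2p$ as the statistic, treats $\sigma^2$ as known, and stops when no candidate lowers the AIC. The function names, the synthetic data, and the 0-based column indices are choices made for this example.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of the least squares fit of y on X."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta_hat) ** 2))

def forward_selection(x, y, sigma2):
    """Greedy forward selection minimizing AIC = RSS / sigma2 + 2p."""
    n, p = x.shape

    def design(cols):
        # Intercept column plus the currently selected inputs.
        return np.column_stack([np.ones(n)] + [x[:, j] for j in cols])

    selected = []
    best = rss(design(selected), y) / sigma2  # intercept-only model, p = 0
    while len(selected) < p:
        scores = {
            j: rss(design(selected + [j]), y) / sigma2 + 2 * (len(selected) + 1)
            for j in range(p) if j not in selected
        }
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best:
            break  # no remaining variable improves the criterion
        selected.append(j_best)
        best = scores[j_best]
    return selected, best

# Synthetic data (an assumption): only columns 0 and 3 influence y.
rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=(n, 5))
y = 1.0 + 3.0 * x[:, 0] - 2.0 * x[:, 3] + rng.normal(size=n)

selected, score = forward_selection(x, y, sigma2=1.0)
print("selected columns (0-based):", selected)
```

Backward elimination follows the same pattern, starting from all columns and removing one per iteration.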

Best subset selection finds the subset of the parameters $k \subseteq \{0, 1, \dots, p\}$ that gives the best test statistic, using the leaps and bounds algorithm to avoid evaluating every subset. Idea of the algorithm:

  1. Iterate over the power set $\mathcal{P}(\{0, 1, \dots, p\})$, organized as a tree
  2. Calculate the test statistic for individual variables
  3. Successively add other variables
  4. Remember the value of the best statistic so far (the bound)
  5. If a node representing a combination of variables has a test statistic worse than the bound, the node's branch can be discarded from further analysis

:::
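For small $p$, the best subset can also be found by plain exhaustive search over all subsets. The sketch below does exactly that and is not an implementation of leaps and bounds, since it performs no pruning. It assumes the intercept is always kept, uses BIC in the known-variance form as the statistic, and runs on synthetic data. All names are illustrative.

```python
from itertools import combinations

import numpy as np

def rss(X, y):
    """Residual sum of squares of the least squares fit of y on X."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta_hat) ** 2))

def best_subset(x, y, sigma2):
    """Exhaustive search over all 2^p input subsets, scored by BIC."""
    n, p = x.shape
    best_cols, best_score = (), np.inf
    for size in range(p + 1):
        for cols in combinations(range(p), size):
            X = np.column_stack([np.ones(n)] + [x[:, j] for j in cols])
            score = rss(X, y) / sigma2 + np.log(n) * size  # BIC, sigma2 known
            if score < best_score:
                best_cols, best_score = cols, score
    return best_cols, best_score

# Synthetic data (an assumption): only columns 0 and 3 influence y.
rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=(n, 5))
y = 1.0 + 3.0 * x[:, 0] - 2.0 * x[:, 3] + rng.normal(size=n)

cols, score = best_subset(x, y, sigma2=1.0)
print("best subset (0-based columns):", cols)
```

The cost doubles with every additional input, which is the motivation for the pruning that leaps and bounds performs.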

Data

:::Definition (Cross validation).

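Partition the data set $S = \{(x_i, y_i)\}_{i=1}^N$ into a training set and a validation set, $S = T \cup V$, such that they do not overlap, $T \cap V = \emptyset$. In $K$-fold cross validation we split $S$ into $K$ chunks of roughly equal size such that in any given iteration $T$ consists of $K-1$ chunks and $V$ of the remaining one. The special case $K = N$, where each validation set holds a single observation, is known as leave-one-out cross validation. In iteration $k$, training on $T^{(k)}$ gives a predictor $h^{(k)}$, which is applied to the validation set $V^{(k)}$ to compute its risk. The cross-validation risk is the average over all folds

$$L_{V}(h) := \frac{1}{K}\sum_{k=1}^{K} L_{V^{(k)}}(h^{(k)})$$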

:::
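A minimal sketch of cross validation with $K$ folds, assuming a least squares predictor and squared error as the loss. The function name `k_fold_risk`, the shuffling step, and the synthetic data are choices made for this example.

```python
import numpy as np

def k_fold_risk(x, y, K, seed=0):
    """Average squared-error risk of least squares over K validation folds."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), K)  # K disjoint index chunks
    risks = []
    for k in range(K):
        val = folds[k]                                        # V^(k)
        train = np.concatenate([folds[j] for j in range(K) if j != k])  # T^(k)
        X_train = np.column_stack([np.ones(len(train)), x[train]])
        X_val = np.column_stack([np.ones(len(val)), x[val]])
        # Predictor h^(k), fitted on the training chunks only.
        beta_hat, *_ = np.linalg.lstsq(X_train, y[train], rcond=None)
        risks.append(np.mean((y[val] - X_val @ beta_hat) ** 2))
    return float(np.mean(risks))

# Synthetic data (an assumption): linear signal with noise variance 0.25.
rng = np.random.default_rng(4)
n = 60
x = rng.normal(size=(n, 1))
y = 1.0 + 2.0 * x[:, 0] + rng.normal(scale=0.5, size=n)

risk_5 = k_fold_risk(x, y, K=5)
risk_loo = k_fold_risk(x, y, K=n)  # leave-one-out as the special case K = n
print(f"5-fold risk: {risk_5:.3f}, leave-one-out risk: {risk_loo:.3f}")
```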

:::Definition (Bootstrap).

Draw $m \leq n$ observations from a data set $X \in \mathbb{R}^{n \times p}$ with replacement (i.e. an observation may appear multiple times). Repeat the drawing $q$ times such that there are $q$ data sets, each with $m$ observations. Fit the model on each data set. The error $E = y - \hat{y}$ is then averaged over all data sets, e.g. as squared error

$$\hat{E} = \frac{1}{q}\frac{1}{m}\sum_{k=1}^{q}\sum_{i=1}^{m}(y_i - \hat{y}_i)^2$$

To improve the estimate, we can calculate the error of a data set $q_k$ only in terms of the observations that are not part of that data set, similar to a train/test split. Let $I_k$ be the index set of all observations $i \notin q_k$

$$\hat{E} = \frac{1}{q}\sum_{k=1}^q \underbrace{\frac{1}{|I_k|}\sum_{i \in I_k}(y_i - \hat{y}_i)^2}_{q_k}$$

:::
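The second estimator, which scores each bootstrap fit only on the observations it did not see, can be sketched as follows. The example assumes $m = n$, a least squares model, and synthetic data. The function name `bootstrap_oob_error` is illustrative.

```python
import numpy as np

def bootstrap_oob_error(x, y, q, seed=0):
    """Mean squared error over q bootstrap fits, using unseen observations only."""
    n = len(y)
    rng = np.random.default_rng(seed)
    X = np.column_stack([np.ones(n), x])
    errors = []
    for _ in range(q):
        idx = rng.integers(0, n, size=n)           # draw m = n indices with replacement
        unseen = np.setdiff1d(np.arange(n), idx)   # index set of observations not drawn
        if unseen.size == 0:
            continue  # extremely unlikely, but then there is nothing to score
        beta_hat, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        errors.append(np.mean((y[unseen] - X[unseen] @ beta_hat) ** 2))
    return float(np.mean(errors))

# Synthetic data (an assumption): linear signal with noise variance 0.25.
rng = np.random.default_rng(5)
n = 60
x = rng.normal(size=(n, 1))
y = 1.0 + 2.0 * x[:, 0] + rng.normal(scale=0.5, size=n)

err = bootstrap_oob_error(x, y, q=200)
print(f"bootstrap error on unseen observations: {err:.3f}")
```

With $m = n$, roughly a third of the observations are left out of each bootstrap data set, so every fit has a reasonably sized evaluation set.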

:::Definition (Centering and Scaling).

The following regularization methods for the linear regression model $\boldsymbol{y} = \boldsymbol{x}^T\boldsymbol{\beta} + \boldsymbol{\epsilon}$ typically require mean-centering, where the respective mean is subtracted from each input variable and from the response variable

$$\begin{split} \boldsymbol{x}_*^T &= (x_1 - \overline{x}_1, x_2 - \overline{x}_2, \dots, x_p - \overline{x}_p) \\ \boldsymbol{y}_* &= (y_1 - \overline{y}, y_2 - \overline{y}, \dots, y_n - \overline{y}) \end{split}$$

such that $\boldsymbol{y}_* = \boldsymbol{x}_*^T\boldsymbol{\beta} + \boldsymbol{\epsilon}$ holds without an intercept term. Sometimes scaling is also required, where each input variable $x_i$ with $i \leq p$ is mapped to $[0, 1]$ using the minimum and maximum over its observations

$$\tilde{x}_i = \frac{x_i - \text{min}(x_i)}{\text{max}(x_i) - \text{min}(x_i)}$$

:::
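Both transformations are one-liners with NumPy broadcasting. The tiny data set below is made up for illustration; columns are input variables and rows are observations.

```python
import numpy as np

# Made-up data: 3 observations of 2 input variables, plus a response.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [4.0, 60.0]])
y = np.array([3.0, 5.0, 10.0])

# Mean-centering: subtract each column's mean, and the mean of the response.
X_centered = X - X.mean(axis=0)
y_centered = y - y.mean()

# Min-max scaling: map each column to [0, 1] using its own minimum and maximum.
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_centered)
print(y_centered)
print(X_scaled)
```

A constant column would make the denominator zero, so such columns need to be dropped or handled separately before scaling.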