Statistical Inference & Hypothesis Testing


Part of the Statistics for Machine Learning series.

Parameter Estimation

:::Definition (Maximum Likelihood Estimation).

The essential idea is to define a function which gives a numeric value (not a distribution!) for how well the parameters $\theta$ of some model fit the data $S = (x_1, x_2, \dots, x_n)$. This is called the likelihood $\mathcal{L}$ of the parameters.

$$\mathcal{L}_x(\theta) = p(x \mid \theta)$$

If the data samples are independent and identically distributed we can formulate the entire data set as

$$\mathcal{L}_S(\theta) = \prod_{i=1}^n p(x_i \mid \theta)$$

Note that $p(x \mid \theta)$ is a distribution over $x$ when $\theta$ is held fixed. Here we do the opposite: we fix the observed data $x$ and treat the expression as a function of $\theta$. The goal is to find the parameters that maximise the likelihood of obtaining $x$

$$\hat{\theta}_{ML} = \arg\max_\theta \mathcal{L}_S(\theta)$$

In Machine Learning the negative log likelihood is often used, for two reasons. First, the logarithm turns the product into a sum, since $\log(ab) = \log(a)+\log(b)$, and a sum is cheaper to compute and less prone to numerical underflow than a product of many small probabilities. Second, optimization problems are conventionally posed as minimizations, so the negative sign turns maximizing the likelihood into minimizing a loss.

$$\mathcal{L}_x(\theta) = - \log p(x \mid \theta)$$

There are some further properties of the MLE that can be analysed. See Wikipedia for more info. In general, MLE may suffer from overfitting and if a non-Gaussian distribution is used then no closed-form solution may exist.

  • Asymptotic consistency: The MLE converges to the true value in the limit of infinitely many observations, plus a random error that is approximately normal.
  • The size of the samples necessary to achieve these properties can be quite large.
  • The error's variance decays in $1/N$, where $N$ is the number of data points.
  • Especially, in the "small" data regime, maximum likelihood estimation can lead to overfitting.

:::
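As an illustration of the definition above, the following minimal sketch assumes a Gaussian model $N(\mu, \sigma^2)$, for which the maximiser of the likelihood is available in closed form. The function names `gaussian_nll` and `gaussian_mle` and the simulated data are hypothetical and chosen only for this example.

```python
import math
import random


def gaussian_nll(data, mu, sigma):
    """Negative log likelihood of i.i.d. data under N(mu, sigma^2)."""
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return (
        n * math.log(sigma)
        + 0.5 * n * math.log(2 * math.pi)
        + squared_error / (2 * sigma ** 2)
    )


def gaussian_mle(data):
    """Closed-form maximum likelihood estimates for a Gaussian model."""
    n = len(data)
    mu_hat = sum(data) / n
    # The MLE of the variance divides by n, not n - 1.
    var_hat = sum((x - mu_hat) ** 2 for x in data) / n
    return mu_hat, math.sqrt(var_hat)


# Simulated data with a fixed seed (illustrative assumption, not real data).
rng = random.Random(0)
data = [rng.gauss(5.0, 2.0) for _ in range(1000)]

mu_hat, sigma_hat = gaussian_mle(data)
best_nll = gaussian_nll(data, mu_hat, sigma_hat)
```

Moving either parameter away from the estimate should only increase the negative log likelihood, which is a quick way to sanity-check the estimate.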

:::Definition (Maximum A Posteriori Estimation).

If we have prior knowledge about the distribution of the parameters $\theta$, we can multiply the likelihood by an additional term. This additional term is a prior probability distribution on the parameters, $p(\theta)$. We can then apply Bayes' theorem to obtain the posterior distribution

$$p(\theta \mid x) = \frac{p(x \mid \theta)p(\theta)}{p(x)}$$

We are interested in $\theta$ and $p(x)$ does not depend on that. So we simply ignore $p(x)$ and obtain

$$p(\theta \mid x) \propto p(x \mid \theta)p(\theta)$$

which means that the former is proportional to the latter but not exactly the same: the former is a distribution over $\theta$ (it integrates to 1) and the latter is not, because we dropped the normalizing factor $p(x)$. Since the normalizing factor does not change the location of the maximum, we can formulate the estimator as

$$\hat{\theta}_{MAP} = \arg\max_\theta \mathcal{L}_x(\theta)\, p(\theta)$$

:::
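To make the effect of the prior concrete, here is a small sketch under strong assumptions: the data are Gaussian with known standard deviation `sigma`, and the prior on the mean is Gaussian with mean `prior_mean` and standard deviation `prior_std`. In this conjugate case the maximiser of likelihood times prior is a precision-weighted average of the prior mean and the sample mean. The function name and the numbers are hypothetical.

```python
def map_gaussian_mean(data, sigma, prior_mean, prior_std):
    """MAP estimate of a Gaussian mean with known sigma and a Gaussian prior."""
    n = len(data)
    sample_mean = sum(data) / n
    prior_precision = 1.0 / prior_std ** 2  # weight of the prior
    data_precision = n / sigma ** 2         # weight of the data
    return (prior_precision * prior_mean + data_precision * sample_mean) / (
        prior_precision + data_precision
    )


data = [2.1, 1.9, 2.4, 2.0]  # made-up observations
mle = sum(data) / len(data)
map_estimate = map_gaussian_mean(data, sigma=1.0, prior_mean=0.0, prior_std=1.0)
# A very wide prior carries almost no information, so MAP approaches the MLE.
map_wide_prior = map_gaussian_mean(data, sigma=1.0, prior_mean=0.0, prior_std=1e6)
```

The estimate is pulled from the sample mean towards the prior mean, and the pull weakens as the number of observations grows or the prior becomes wider.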

Hypothesis Testing

:::Definition (p-values).

The $p$-value is the probability, computed under the null hypothesis $H_0$ (for example $\mu = \mu_0$), of observing a test score at least as extreme as the one obtained from our data. To quantify whether we can reject a hypothesis given our data we define a significance level $\alpha \in (0,1)$. Let $s$ be the score of our hypothesis test. We can interpret the significance level through the $\alpha$-quantile $q_\alpha$ of the score's distribution under $H_0$, defined by $P_{H_0}(s \leq q_\alpha) = \alpha$. The rejection region $R$ can be

  • two-sided $R = (-\infty, q_{\alpha/2}] \cup [q_{1-\alpha/2}, +\infty)$ where $H_0 : \mu = \mu_0$
  • left-sided $R = (-\infty, q_{\alpha}]$ where $H_0 : \mu \geq \mu_0$
  • right-sided $R = [q_{1-\alpha}, +\infty)$ where $H_0 : \mu \leq \mu_0$

We define the following decision rule

  • $p \leq \alpha \Leftrightarrow s \in R$ and we reject $H_0$
  • $p > \alpha \Leftrightarrow s \notin R$ and we do not reject $H_0$

:::
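The equivalence between the $p$-value rule and the rejection region can be checked numerically. The sketch below assumes the score is standard normal under $H_0$ (other tests would use a different reference distribution) and relies only on `statistics.NormalDist` from the Python standard library. The function names and the string labels for the three test types are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # assumption: the score is N(0, 1) under H0


def in_rejection_region(s, alpha, kind):
    """Check whether score s lies in the rejection region R."""
    if kind == "two-sided":
        q = STD_NORMAL.inv_cdf(1 - alpha / 2)  # by symmetry q_{alpha/2} = -q
        return s <= -q or s >= q
    if kind == "left":
        return s <= STD_NORMAL.inv_cdf(alpha)
    if kind == "right":
        return s >= STD_NORMAL.inv_cdf(1 - alpha)
    raise ValueError(f"unknown test kind: {kind}")


def p_value(s, kind):
    """Probability under H0 of a score at least as extreme as s."""
    if kind == "two-sided":
        return 2 * (1 - STD_NORMAL.cdf(abs(s)))
    if kind == "left":
        return STD_NORMAL.cdf(s)
    if kind == "right":
        return 1 - STD_NORMAL.cdf(s)
    raise ValueError(f"unknown test kind: {kind}")


alpha = 0.05
scores = [-3.0, -1.0, 0.0, 1.0, 2.5]
# For each score, both formulations of the decision rule should agree.
agreement = [
    (p_value(s, kind) <= alpha) == in_rejection_region(s, alpha, kind)
    for s in scores
    for kind in ("two-sided", "left", "right")
]
```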

:::Definition (Type I and Type II errors).

We typically use

  • $\beta$ = probability of a Type II error, known as a "false negative"
  • $1-\beta$ = probability of a "true positive", i.e., correctly rejecting the null hypothesis. Also known as the power of the test.
  • $\alpha$ = probability of a Type I error, known as a "false positive"
  • $1-\alpha$ = probability of a "true negative", i.e., correctly not rejecting the null hypothesis

See also: Type I and Type II Errors, Sensitivity and specificity, Power of a test.

:::
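The trade-off between $\alpha$ and $\beta$ can be made tangible with a small calculation. The sketch below assumes a right-sided test whose score is normally distributed with known $\sigma$, and a true mean that exceeds $\mu_0$ by `delta`; under these assumptions the power has a closed form. The function name and parameters are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_right_sided(delta, sigma, n, alpha):
    """Power 1 - beta of a right-sided test with a normal score.

    delta is the assumed true shift of the mean above mu_0.
    """
    std_normal = NormalDist()
    q = std_normal.inv_cdf(1 - alpha)   # rejection threshold for the score
    shift = delta * sqrt(n) / sigma     # mean of the score under the alternative
    return 1 - std_normal.cdf(q - shift)


alpha = 0.05
power_no_effect = power_right_sided(delta=0.0, sigma=1.0, n=25, alpha=alpha)
power_small_n = power_right_sided(delta=0.5, sigma=1.0, n=10, alpha=alpha)
power_large_n = power_right_sided(delta=0.5, sigma=1.0, n=40, alpha=alpha)
beta_large_n = 1 - power_large_n  # probability of a Type II error
```

With no true effect the rejection probability equals $\alpha$ (the Type I error rate), and for a fixed effect the power grows with the sample size.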

:::Definition (Z-Test).

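Given $X_1, \cdots, X_n$ i.i.d. random variables with $X_i \sim N(\mu, \sigma^2)$ with $\mu \in \mathbb{R}$ and known variance $0 < \sigma^2 < \infty$, the sample mean is exactly normally distributed (for non-normal data the Central Limit Theorem gives the same result approximately for large $n$)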

$$\overline{X} = \frac{1}{n}\sum_{i=1}^n X_i \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$

Under the null hypothesis $H_0 : \mu = \mu_0$ we ask how probable a sample mean at least as extreme as the observed one would be. First we calculate the z-score with $\mu_0$ as follows

$$Z = \frac{\overline{X}-\mu_0}{\sigma / \sqrt{n}}$$

such that $Z \sim N(0,1)$ under $H_0$. Then we calculate the two-sided p-value $P_{H_0}(|Z| \geq |z|)$, where $z$ is the observed value of $Z$, using the normal distribution's CDF. Finally we apply the decision rule above to decide whether to reject $H_0$.

:::
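The steps of the Z-Test can be sketched in a few lines. This example assumes $\sigma$ is known and uses only the Python standard library; the function name `z_test` and the measurements are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist, fmean


def z_test(data, mu_0, sigma):
    """Two-sided Z-Test for the mean with known standard deviation sigma."""
    n = len(data)
    z = (fmean(data) - mu_0) / (sigma / sqrt(n))
    # Two-sided p-value: probability of |Z| >= |z| under N(0, 1).
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


data = [5.1, 4.9, 5.6, 5.3, 5.2, 5.4, 5.0, 5.5, 5.3]  # hypothetical measurements
z, p = z_test(data, mu_0=5.0, sigma=0.3)
reject_h0 = p <= 0.05  # decision rule with alpha = 0.05
```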

:::Definition (T-Test, one sample).

In the Z-Test we assume that $\sigma$ is known, but in practice that is rarely the case, so we estimate it using the empirical standard deviation $S$. This is plugged into the statistic above such that

$$T = \frac{\overline{X}-\mu_0}{S / \sqrt{n}}$$

where $T \sim t(n-1)$ under $H_0$, i.e. a $t$-distribution with $n-1$ degrees of freedom. The remainder of the method is the same, with the $t$-distribution taking the place of the normal distribution.

:::
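A minimal sketch of the one-sample statistic, using only the Python standard library. It stops at the statistic and the degrees of freedom because the standard library has no $t$-distribution CDF; the p-value would have to come from a $t$-table or a statistics package. The function name and data are hypothetical.

```python
from math import sqrt
from statistics import fmean, stdev


def one_sample_t(data, mu_0):
    """Return the t-statistic and its degrees of freedom."""
    n = len(data)
    s = stdev(data)  # empirical standard deviation (n - 1 in the denominator)
    t = (fmean(data) - mu_0) / (s / sqrt(n))
    return t, n - 1


data = [5.1, 4.9, 5.6, 5.3, 5.2, 5.4, 5.0, 5.5, 5.3]  # hypothetical measurements
t, df = one_sample_t(data, mu_0=5.0)
```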

:::Definition (SEM).

We define the standard error of the mean

$$SEM = S / \sqrt{n}$$

The $t$-statistic can be intuitively interpreted as the number of SEMs (i.e. estimated standard deviations of the sample mean) by which the sample mean $\overline{X}$ differs from the hypothesised mean $\mu_0$

$$|T| \cdot SEM = |\overline{X}-\mu_0|$$

Since the $t$-distribution approaches the normal distribution as $n$ grows, the more SEMs the discrepancy spans, the less likely it is under $H_0$.

:::
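A short worked example with made-up numbers: assume $n = 25$, $S = 10$, $\overline{X} = 104$ and $\mu_0 = 100$. Then

$$SEM = \frac{10}{\sqrt{25}} = 2, \qquad T = \frac{104 - 100}{2} = 2$$

so the sample mean lies two SEMs above the hypothesised mean.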

:::Definition (Confidence Intervals).

Let $X_1, \cdots, X_n$ be i.i.d. random variables with $X_i \sim N(\mu, \sigma^2)$ with $\mu \in \mathbb{R}$ and $0 < \sigma^2 < \infty$. Let $q_{1-\alpha/2}$ be the $(1-\alpha/2)$-quantile of the $t(n-1)$ distribution. Under $H_0 : \mu = \mu_0$, the confidence interval

$$I = (\overline{X} - q_{1-\alpha/2} \cdot SEM, \; \overline{X} + q_{1-\alpha/2} \cdot SEM)$$

covers the parameter $\mu_0$ with probability $1-\alpha$. In other words

$$P(\mu_0 \in I) = 1-\alpha$$

:::
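The interval can be computed directly once the quantile is known. In this sketch the quantile is passed in by the caller, because the Python standard library has no $t$-distribution; the value 2.306 is the 0.975-quantile of $t(8)$ taken from a standard $t$-table, which is an assumption tied to $n = 9$ and $\alpha = 0.05$. The function name and data are hypothetical.

```python
from math import sqrt
from statistics import fmean, stdev


def confidence_interval(data, quantile):
    """Interval mean +/- quantile * SEM for the given quantile."""
    n = len(data)
    sem = stdev(data) / sqrt(n)
    center = fmean(data)
    return center - quantile * sem, center + quantile * sem


data = [5.1, 4.9, 5.6, 5.3, 5.2, 5.4, 5.0, 5.5, 5.3]  # hypothetical measurements
T_975_DF8 = 2.306  # table value: 0.975-quantile of t(8)
lower, upper = confidence_interval(data, T_975_DF8)
mu_0 = 5.0
covers_mu_0 = lower <= mu_0 <= upper
```

If the interval does not cover $\mu_0$, the two-sided test at the same level $\alpha$ rejects $H_0$, and vice versa.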

:::Definition (T-Test, two sample).

We can extend the idea of the one-sample case to the two-sample case. Let $X_1, \dots, X_n$ and $Y_1, \dots, Y_m$ be independent random variables where for $i = 1, 2, \dots, n$ we have $X_i \sim N(\mu_1, \sigma_1^2)$ and for $j = 1, 2, \dots, m$ we have $Y_j \sim N(\mu_2, \sigma_2^2)$. If we define the hypothesis that there is no difference between the means of these two populations, $H_0 : \mu_1 = \mu_2$, then the hypothesised discrepancy between them is $d = \mu_2 - \mu_1 = 0$. Now we introduce two models for the test:

  • Welch ($\sigma_1^2 \neq \sigma_2^2$): Under the hypothesis $H_0$ it holds approximately that $$T = \frac{(\overline{Y}-\overline{X})-d}{\sqrt{SEM_y^2+SEM_x^2}} \sim t(v)$$ where the degrees of freedom $v$ are given by the Welch-Satterthwaite approximation (statistical software such as R calculates $v$ automatically). Let $q_{1-\alpha/2}$ be the $(1-\alpha/2)$-quantile of $t(v)$. Then the confidence interval $$I = \left( (\overline{Y}-\overline{X}) - q_{1-\alpha/2} \cdot \sqrt{SEM_y^2+SEM_x^2},\; (\overline{Y}-\overline{X}) + q_{1-\alpha/2} \cdot \sqrt{SEM_y^2+SEM_x^2} \right)$$ covers $d$ with approximate probability $1-\alpha$.

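  • Student ($\sigma_1^2 = \sigma_2^2$): Under the hypothesis $H_0$ it holds exactly that $$T = \frac{(\overline{Y}-\overline{X})-d}{S_p \cdot \sqrt{\frac{1}{m}+\frac{1}{n}}} \sim t(m+n-2)$$ where the empirical pooled variance is defined as $$S_{p}^{2}=\frac{(n-1)S_x^2+(m-1)S_y^2}{n+m-2}$$ with $S_x^2$ and $S_y^2$ the empirical variances of the two samples. This is the two-group case of the general pooled variance over $k$ groups of sizes $n_1, \dots, n_k$ with empirical variances $s_1^2, \dots, s_k^2$: $$S_{p}^{2}=\frac{\sum_{i=1}^{k}(n_i-1)s_{i}^{2}}{\sum_{i=1}^{k}(n_i-1)}=\frac{(n_1-1)s_1^2+(n_2-1)s_2^2+\cdots+(n_k-1)s_k^2}{n_1+n_2+\cdots+n_k-k}$$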

:::
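Both variants of the two-sample statistic can be sketched with the Python standard library. The degrees of freedom for the Welch variant use the Welch-Satterthwaite approximation, and the pooled variance is written out for two groups. As before, the p-value itself would require a $t$-distribution CDF, which is not part of the standard library. Function names and data are hypothetical.

```python
from math import sqrt
from statistics import fmean, variance


def welch_t(x, y, d=0.0):
    """Welch statistic and Welch-Satterthwaite degrees of freedom."""
    n, m = len(x), len(y)
    sem_x_sq = variance(x) / n  # squared SEM of the first sample
    sem_y_sq = variance(y) / m  # squared SEM of the second sample
    t = ((fmean(y) - fmean(x)) - d) / sqrt(sem_y_sq + sem_x_sq)
    v = (sem_x_sq + sem_y_sq) ** 2 / (
        sem_x_sq ** 2 / (n - 1) + sem_y_sq ** 2 / (m - 1)
    )
    return t, v


def student_t(x, y, d=0.0):
    """Student statistic with pooled variance and n + m - 2 degrees of freedom."""
    n, m = len(x), len(y)
    pooled_var = ((n - 1) * variance(x) + (m - 1) * variance(y)) / (n + m - 2)
    t = ((fmean(y) - fmean(x)) - d) / sqrt(pooled_var * (1 / m + 1 / n))
    return t, n + m - 2


x = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up sample with variance 2.5
y = [2.0, 3.0, 4.0, 5.0, 6.0]  # same spread, mean shifted by 1
t_welch, v_welch = welch_t(x, y)
t_student, df_student = student_t(x, y)
```

For equal sample sizes and equal empirical variances, as in this toy data, the two statistics and their degrees of freedom coincide; they diverge once the variances or sample sizes differ.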

:::Remark

It is not strictly necessary that the random variables have a normal distribution since the CLT ensures that the mean will be approximately normal given a large enough sample size. Other hypotheses can also be defined such that $d \in \mathbb{R}$. For example, if you are comparing the effectiveness of a new drug versus a placebo, a non-zero difference in means could indicate that the drug has a significant impact on the outcome being measured.

:::

:::Definition ($\chi^2$-Test).

For a random vector $\mathcal{X} = (X_1, \cdots, X_d) \sim \text{mult}(n,p)$ with $p \in (0,1)^d$ and $\sum_{k=1}^d p_k = 1$ we define the hypothesis $H_0 : p = p_0 := (p_{0,1}, \dots, p_{0,d})$. For example, a fair distribution would be one where each of the $d$ categories is equally likely, i.e. $p_0 := (1/d, \dots, 1/d)$. For $k \in \{1, \dots, d\}$ each $X_k$ has the expected value $\mathbb{E}_{H_0}(X_k) = n \cdot p_{0,k}$. The $\chi^2$-statistic measures the discrepancy between the observed frequencies and the frequencies expected under the null hypothesis. Under $H_0$ it holds approximately, for large enough $n$, that

$$\chi^2 = \sum_{k=1}^d \frac{(X_k-\mathbb{E}_{H_0}(X_k))^2}{\mathbb{E}_{H_0}(X_k)} \sim \chi^2(d-1)$$

This can be formulated more generally as

$$\chi^2 = \sum_{i=1}^{d}\frac{(O_i - E_i)^2}{E_i}$$

where $O_i$ is the number of observations of type $i$ and $E_i$ is the expected (theoretical) frequency of type $i$ asserted by the null hypothesis.

:::
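As a small illustration of the statistic, the sketch below tests whether a die is fair. The observed counts are made up, and the critical value 11.070 is the 0.95-quantile of $\chi^2(5)$ taken from a standard table; it is an assumption tied to $d = 6$ categories and $\alpha = 0.05$. The function name is hypothetical.

```python
def chi_square_statistic(observed, p0):
    """Sum of (O_k - E_k)^2 / E_k with E_k = n * p0_k."""
    n = sum(observed)
    expected = [n * p for p in p0]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [8, 12, 9, 11, 14, 6]  # hypothetical counts from 60 die rolls
p0 = [1 / 6] * 6                  # H0: the die is fair
statistic = chi_square_statistic(observed, p0)

CRITICAL_95_DF5 = 11.070  # table value: 0.95-quantile of chi^2(5)
reject_h0 = statistic >= CRITICAL_95_DF5
```

The test is right-sided: only large values of the statistic, i.e. large deviations from the expected counts, speak against $H_0$.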