Statistical Inference & Hypothesis Testing

Fri 27 March 2026

by Daniel Melichar

in Crash Course

tagged statistics, inference, hypothesis testing, machine learning

Part of the Statistics for Machine Learning series.

Parameter Estimation
Hypothesis Testing

Parameter Estimation

:::Definition (Maximum Likelihood Estimation).

The essential idea is to define a function which gives a numeric value (not a distribution!) for how well the parameters $\theta$ of some model fit the data $S = (x_1, x_2, \dots, x_n)$. This is called the likelihood $\mathcal{L}$ of the parameters.

$$\mathcal{L}_x(\theta) = p(x \mid \theta)$$

If the data samples are independent and identically distributed we can formulate the entire data set as

$$\mathcal{L}_S(\theta) = \prod_{i=1}^n p(x_i \mid \theta)$$

Note that $p$ is a distribution. If we fix $\theta$ then $p(x \mid \theta)$ is a distribution over $x$, from which intervals for $x$ can be obtained. Here we instead fix $x$ and choose the parameters $\theta$ that maximise the likelihood of obtaining $x$. The goal is to find parameters that maximise the likelihood

$$\hat{\theta}_{ML} = \arg\max_\theta \mathcal{L}_S(\theta)$$

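In Machine Learning the negative log-likelihood is often used, for two reasons. First, the logarithm turns the product into a sum, $\log(ab) = \log(a)+\log(b)$, which is cheaper to compute and avoids numerical underflow when many small probabilities are multiplied. Second, optimization problems are conventionally posed as minimizations, so we add the negative sign: maximizing the likelihood is equivalent to minimizing the negative log-likelihood.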

$$-\log \mathcal{L}_x(\theta) = - \log p(x \mid \theta)$$

There are some further properties of the MLE that can be analysed. See Wikipedia for more info. In general, MLE may suffer from overfitting and if a non-Gaussian distribution is used then no closed-form solution may exist.

Asymptotic consistency: the MLE converges to the true value in the limit of infinitely many observations. For finite samples it deviates from the true value by a random error that is approximately normal.
The sample size necessary to achieve these properties can be quite large.
The error's variance decays as $1/n$, where $n$ is the number of data points.
In the "small" data regime especially, maximum likelihood estimation can lead to overfitting.

:::
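As a minimal sketch of the idea above, assume the model is a Gaussian $N(\mu, \sigma^2)$, for which the maximum likelihood estimates have a closed form. The function names and the data values are hypothetical and chosen only for illustration. The loop checks that nudging either parameter away from the estimate does not lower the negative log-likelihood.

```python
import math
from statistics import fmean


def gaussian_nll(data, mu, sigma):
    """Negative log-likelihood of i.i.d. data under N(mu, sigma^2)."""
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return 0.5 * n * math.log(2 * math.pi * sigma ** 2) + squared_error / (2 * sigma ** 2)


def gaussian_mle(data):
    """Closed-form maximum likelihood estimates (mu_hat, sigma_hat)."""
    mu_hat = fmean(data)
    # The MLE of the variance divides by n, not by n - 1.
    var_hat = sum((x - mu_hat) ** 2 for x in data) / len(data)
    return mu_hat, math.sqrt(var_hat)


data = [4.9, 5.1, 5.3, 4.7, 5.0]  # hypothetical sample
mu_hat, sigma_hat = gaussian_mle(data)
best_nll = gaussian_nll(data, mu_hat, sigma_hat)

# Moving either parameter away from the MLE should not decrease the NLL.
for d_mu, d_sigma in [(0.05, 0.0), (-0.05, 0.0), (0.0, 0.05), (0.0, -0.05)]:
    assert gaussian_nll(data, mu_hat + d_mu, sigma_hat + d_sigma) >= best_nll
```

For other models a closed form may not exist, in which case the same negative log-likelihood would be handed to a numerical optimizer instead.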

:::Definition (Maximum A Posteriori Estimation).

If we have prior knowledge about the distribution of the parameters $\theta$, we can multiply the likelihood by an additional term. This additional term is a prior probability distribution on the parameters, $p(\theta)$. We can then apply Bayes' theorem to obtain the posterior distribution

$$p(\theta \mid x) = \frac{p(x \mid \theta)p(\theta)}{p(x)}$$

We are interested in $\theta$ and $p(x)$ does not depend on that. So we simply ignore $p(x)$ and obtain

$$p(\theta \mid x) \propto p(x \mid \theta)p(\theta)$$

which means that the former is proportional to the latter but not exactly the same: the former is a distribution (it integrates to 1) and the latter is not, because we dropped the normalizing factor $p(x)$. Since dropping a constant factor does not change the location of the maximum, we can formulate the estimate as

$$\hat{\theta}_{MAP} = \arg\max_\theta \mathcal{L}_x(\theta)\,p(\theta)$$

:::
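To illustrate how a prior changes the estimate, here is a small sketch that assumes a Bernoulli likelihood (coin flips) with a Beta($a$, $b$) prior. This model choice is not from the text above; it is picked because the posterior is again a Beta distribution and its mode is available in closed form. All names and numbers are hypothetical.

```python
def bernoulli_mle(successes, trials):
    """Maximum likelihood estimate of the success probability."""
    return successes / trials


def bernoulli_map(successes, trials, a, b):
    """Mode of the Beta(a + k, b + n - k) posterior (assumes a, b >= 1)."""
    return (successes + a - 1) / (trials + a + b - 2)


# Three successes in three trials: a very small data set.
k, n = 3, 3
mle_estimate = bernoulli_mle(k, n)             # the MLE claims success is certain
map_estimate = bernoulli_map(k, n, a=2, b=2)   # the prior pulls the estimate towards 0.5

# A uniform prior, Beta(1, 1), adds no information, so MAP equals the MLE.
uniform_prior_estimate = bernoulli_map(k, n, a=1, b=1)
```

With so little data the MLE overfits to the three observed successes, while the MAP estimate stays away from the boundary. As more data arrives, the influence of the prior shrinks.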

Hypothesis Testing

:::Definition (p-values).

The $p$-value is the probability, computed under the null hypothesis $H_0$ (for example $\mu = \mu_0$), of obtaining a test score at least as extreme as the one actually observed. Informally, it measures how compatible the hypothesis $\mu_0$ is with our data. To decide whether we can reject a hypothesis given our data we fix a significance level $\alpha \in (0,1)$. Let $w$ be the score of our hypothesis test and let $q_\alpha$ denote the $\alpha$-quantile of its distribution under $H_0$, such that $P(w < q_\alpha) = \alpha$. The rejection region $R$ can be

two-sided $R = (-\infty, q_{\alpha/2}] \cup [q_{1-\alpha/2}, +\infty)$ where $H_0 : \mu = \mu_0$
left-sided $R = (-\infty, q_{\alpha}]$ where $H_0 : \mu \geq \mu_0$
right-sided $R = [q_{1-\alpha}, +\infty)$ where $H_0 : \mu \leq \mu_0$

We define the following decision rule

$p \leq \alpha \Leftrightarrow w \in R$ and we reject $H_0$
$p > \alpha \Leftrightarrow w \notin R$ and we do not reject $H_0$

:::

:::Definition (Type I and Type II errors).

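We typically use the following notation, where each probability is conditional on whether $H_0$ is actually true or false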

$\beta$ = probability of a Type II error, known as a "false negative"
$1-\beta$ = probability of a "true positive", i.e., correctly rejecting the null hypothesis. Also known as the power of the test.
$\alpha$ = probability of a Type I error, known as a "false positive"
$1-\alpha$ = probability of a "true negative", i.e., correctly not rejecting the null hypothesis

:::
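The two error rates can be estimated by simulation. The sketch below is an illustration under stated assumptions: it uses a two-sided z-test with known $\sigma$ (defined later in this article) as the test, normally distributed data, and hypothetical values for the sample size, effect size and number of trials.

```python
import random
from statistics import NormalDist, fmean


def z_test_rejects(sample, mu0, sigma, alpha):
    """Two-sided z-test with known sigma; True if H0: mu = mu0 is rejected."""
    n = len(sample)
    z = (fmean(sample) - mu0) / (sigma / n ** 0.5)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value <= alpha


def rejection_rate(true_mu, mu0=0.0, sigma=1.0, n=25, alpha=0.05, trials=2000, seed=0):
    """Fraction of simulated data sets for which H0 is rejected."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mu, sigma) for _ in range(n)]
        if z_test_rejects(sample, mu0, sigma, alpha):
            rejections += 1
    return rejections / trials


# H0 is true: every rejection is a false positive, so this estimates alpha.
type_one_rate = rejection_rate(true_mu=0.0)

# H0 is false: every rejection is a true positive, so this estimates the power 1 - beta.
power = rejection_rate(true_mu=0.5)
type_two_rate = 1 - power
```

The first rate should land near the chosen $\alpha = 0.05$, while the power depends on the effect size, the noise level and the sample size.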

:::Definition (Z-Test).

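Given $X_1, \dots, X_n$ i.i.d. random variables with $X_i \sim N(\mu, \sigma^2)$ with $\mu \in \mathbb{R}$ and $\sigma^2 < \infty$, the sample mean is itself normally distributed (for non-normal data the Central Limit Theorem gives the same result approximately for large $n$)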

$$\overline{X} = \frac{1}{n}\sum_{i=1}^n X_i \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$

We use the hypothesis $H_0 : \mu = \mu_0$ to calculate how probable data at least as extreme as ours would be if the hypothesis were true. First we calculate the z-score with $\mu_0$ as follows

$$Z = \frac{\overline{X}-\mu_0}{\sigma / \sqrt{n}}$$

such that $Z \sim N(0,1)$ under $H_0$. Then we calculate the two-sided p-value $P_{H_0}(|Z| \geq |z|)$, where $z$ is the observed value of $Z$, using the normal distribution's CDF. We use the decision rule above to decide whether to reject $H_0$.

:::
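The Z-test steps can be sketched with Python's standard library, whose `statistics.NormalDist` provides the normal CDF and its inverse. The sample values, `mu0`, and the assumed known `sigma` are hypothetical.

```python
from statistics import NormalDist, fmean


def z_test(sample, mu0, sigma):
    """Return the z-score and the two-sided p-value for H0: mu = mu0."""
    n = len(sample)
    z = (fmean(sample) - mu0) / (sigma / n ** 0.5)
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_sided


sample = [5.2, 4.8, 5.5, 5.1, 4.9, 5.3, 5.0, 5.4, 5.2]  # hypothetical measurements
alpha = 0.05
z, p = z_test(sample, mu0=5.0, sigma=0.3)

# Decision rule: comparing p with alpha is equivalent to checking
# whether z falls into the two-sided rejection region.
reject_by_p_value = p <= alpha
q = NormalDist().inv_cdf(1 - alpha / 2)  # the (1 - alpha/2)-quantile of N(0, 1)
reject_by_region = abs(z) >= q
```

For this sample the z-score is roughly 1.56, which lies inside the acceptance region at $\alpha = 0.05$, so $H_0$ is not rejected.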

:::Definition (T-Test, one sample).

In the Z-Test we assume that $\sigma$ is known, but in practice that is not the case and we estimate it using the empirical standard deviation $S$. This is plugged into the statistic above such that

$$T = \frac{\overline{X}-\mu_0}{S / \sqrt{n}}$$

where $T \sim t(n-1)$ under $H_0$, i.e. $T$ follows a $t$-distribution with $n-1$ degrees of freedom. The remainder of the method is largely the same.

:::

:::Definition (SEM).

We define the standard error of the mean

$$SEM = S / \sqrt{n}$$

The $t$-statistic can be intuitively interpreted as how many SEMs (i.e. empirical standard deviations of the mean) the sample mean lies away from the hypothesised mean, which is the discrepancy

$$|T| \cdot SEM = |\overline{X}-\mu_0|$$

Since the $t$-distribution approaches the normal distribution for large $n$, the more SEMs the discrepancy spans, the less likely it is under $H_0$.

:::
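A short sketch of the one-sample $t$-statistic and the SEM, using only the standard library. Because the standard library has no $t$-distribution, the critical value 2.262 is taken from a $t$-table as the 0.975-quantile of $t(9)$; it is hard-coded here and only valid for $n = 10$ and $\alpha = 0.05$. The data are hypothetical.

```python
import math
from statistics import fmean, stdev


def one_sample_t(sample, mu0):
    """Return the t-statistic, the SEM and the degrees of freedom."""
    n = len(sample)
    sem = stdev(sample) / math.sqrt(n)  # stdev uses the n - 1 denominator
    t = (fmean(sample) - mu0) / sem
    return t, sem, n - 1


sample = [5.2, 4.8, 5.5, 5.1, 4.9, 5.3, 5.0, 5.4, 5.2, 5.1]  # hypothetical, n = 10
mu0 = 5.0
t, sem, df = one_sample_t(sample, mu0)

# The discrepancy between sample mean and mu0, measured in units of SEM.
discrepancy = fmean(sample) - mu0
assert math.isclose(t * sem, discrepancy)

T_CRIT_DF9 = 2.262  # tabulated 0.975-quantile of t(9), two-sided alpha = 0.05
reject = abs(t) >= T_CRIT_DF9
```

Here the sample mean lies about 2.18 SEMs above $\mu_0$, just short of the critical value, so $H_0$ is not rejected at the 5% level.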

:::Definition (Confidence Intervals).

Let $X_1, \dots, X_n$ be i.i.d. random variables with $X_i \sim N(\mu, \sigma^2)$ with $\mu \in \mathbb{R}$ and $0 < \sigma^2 < \infty$. Let $q_{1-\alpha/2}$ be the $(1-\alpha/2)$-quantile of the $t(n-1)$ distribution, as used by the two-sided test. Under $H_0 : \mu = \mu_0$, the confidence interval

$$I = (\overline{X} - q_{1-\alpha/2} \cdot SEM, \; \overline{X} + q_{1-\alpha/2} \cdot SEM)$$

covers the parameter $\mu_0$ with probability $1-\alpha$. In other words

$$P(\mu_0 \in I) = 1-\alpha$$

:::
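The interval can be computed directly from the sample. In this sketch the quantile is passed in as an argument, since the standard library offers no $t$-distribution; the value 2.262 is the tabulated 0.975-quantile of $t(9)$ and therefore only fits $n = 10$ and $\alpha = 0.05$. The data and function name are hypothetical.

```python
import math
from statistics import fmean, stdev


def t_confidence_interval(sample, t_quantile):
    """Two-sided interval: mean +/- t_quantile * SEM."""
    n = len(sample)
    center = fmean(sample)
    half_width = t_quantile * stdev(sample) / math.sqrt(n)
    return center - half_width, center + half_width


sample = [5.2, 4.8, 5.5, 5.1, 4.9, 5.3, 5.0, 5.4, 5.2, 5.1]  # hypothetical, n = 10
low, high = t_confidence_interval(sample, t_quantile=2.262)

# Duality with the two-sided t-test: mu0 lies inside the interval
# exactly when |T| is below the same quantile.
mu0 = 5.0
sem = stdev(sample) / math.sqrt(len(sample))
t = (fmean(sample) - mu0) / sem
assert (low < mu0 < high) == (abs(t) < 2.262)
```

This duality is why a 95% confidence interval that contains $\mu_0$ corresponds to not rejecting $H_0$ at $\alpha = 0.05$.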

:::Definition (T-Test, two sample).

We can extend the idea of the one-sample case to the two-sample case. Let $X_1, \dots, X_n$ and $Y_1, \dots, Y_m$ be independent random variables where for $i = 1, 2, \dots, n$ we have $X_i \sim N(\mu_1, \sigma_1^2)$ and for $j = 1, 2, \dots, m$ we have $Y_j \sim N(\mu_2, \sigma_2^2)$. If we define the hypothesis that there is no difference between the means of these two populations, $H_0 : \mu_1 = \mu_2$, then the discrepancy between them is $d = \mu_2 - \mu_1 = 0$. We test it using the empirical means $\overline{X}$ and $\overline{Y}$. Now we introduce two models for the test:

Welch ($\sigma_1^2 \neq \sigma_2^2$): Under the hypothesis $H_0$ it holds approximately that $$T = \frac{(\overline{Y}-\overline{X})-d}{\sqrt{SEM_y^2+SEM_x^2}} \sim t(v)$$ where the degrees of freedom $v$ come from the Welch-Satterthwaite approximation, which statistical software such as R computes automatically. Let $q_{1-\alpha/2}$ be the $(1-\alpha/2)$-quantile of $t(v)$. Then the confidence interval $$I = \left( (\overline{Y}-\overline{X}) - q_{1-\alpha/2} \cdot \sqrt{SEM_y^2+SEM_x^2},\; (\overline{Y}-\overline{X}) + q_{1-\alpha/2} \cdot \sqrt{SEM_y^2+SEM_x^2} \right)$$ covers $d$ with approximate probability $1-\alpha$.
Student ($\sigma_1^2 = \sigma_2^2$): Under the hypothesis $H_0$ it holds exactly that $$T = \frac{(\overline{Y}-\overline{X})-d}{S_p \cdot \sqrt{\frac{1}{m}+\frac{1}{n}}} \sim t(m+n-2)$$ where the empirical pooled variance of the two samples, with $S_x^2$ and $S_y^2$ the empirical variances of the $X_i$ and $Y_j$, is defined as $$S_{p}^{2}=\frac{(n-1)S_x^2+(m-1)S_y^2}{n+m-2}$$

:::
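Both variants can be sketched in a few lines. The degrees of freedom in the Welch case use the Welch-Satterthwaite formula $v = \frac{(SEM_x^2 + SEM_y^2)^2}{SEM_x^4/(n-1) + SEM_y^4/(m-1)}$, which is not spelled out in the text above and is added here as an assumption about what the software computes. Function names and data are hypothetical.

```python
import math
from statistics import fmean, variance


def welch_t(x, y, d=0.0):
    """Welch's t-statistic and Welch-Satterthwaite degrees of freedom."""
    n, m = len(x), len(y)
    sem_x_sq = variance(x) / n  # squared SEM of the first sample
    sem_y_sq = variance(y) / m  # squared SEM of the second sample
    t = ((fmean(y) - fmean(x)) - d) / math.sqrt(sem_x_sq + sem_y_sq)
    v = (sem_x_sq + sem_y_sq) ** 2 / (
        sem_x_sq ** 2 / (n - 1) + sem_y_sq ** 2 / (m - 1)
    )
    return t, v


def student_t(x, y, d=0.0):
    """Student's t-statistic with pooled variance and n + m - 2 degrees of freedom."""
    n, m = len(x), len(y)
    pooled_var = ((n - 1) * variance(x) + (m - 1) * variance(y)) / (n + m - 2)
    t = ((fmean(y) - fmean(x)) - d) / math.sqrt(pooled_var * (1 / m + 1 / n))
    return t, n + m - 2


x = [5.1, 4.9, 5.0, 5.2, 4.8]  # hypothetical control group
y = [5.6, 5.4, 5.5, 5.7, 5.3]  # hypothetical treatment group
t_welch, v = welch_t(x, y)
t_student, df = student_t(x, y)
```

In this example both samples have the same size and the same empirical variance, so the two statistics coincide and the Welch degrees of freedom equal $n + m - 2 = 8$. With unequal variances or sample sizes the two tests would differ.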

:::Remark

It is not strictly necessary that the random variables have a normal distribution since the CLT ensures that the mean will be approximately normal given a large enough sample size. Other hypotheses can also be defined such that $d \in \mathbb{R}$. For example, if you are comparing the effectiveness of a new drug versus a placebo, a non-zero difference in means could indicate that the drug has a significant impact on the outcome being measured.

:::

:::Definition ($\chi^2$-Test).

For a random vector $\mathcal{X} = (X_1, \dots, X_d) \sim \text{mult}(n,p)$ with $p \in (0,1)^d$ and $\sum_{k=1}^d p_k = 1$ we define the hypothesis $H_0 : p = p_0 := (p_{0,1}, \dots, p_{0,d})$. For example, a fair distribution would be one where each of the $d$ categories is equally likely, i.e. $p_0 := (1/d, \dots, 1/d)$. For $k \in \{1, \dots, d\}$ each $X_k$ has the expected value $\mathbb{E}_{H_0}(X_k) = n \cdot p_{0,k}$. The $\chi^2$-statistic measures the discrepancy between the observed frequencies and the frequencies expected under the null hypothesis. Under $H_0$ it holds for large enough $n$ that

$$\chi^2 = \sum_{k=1}^d \frac{(X_k-\mathbb{E}_{H_0}(X_k))^2}{\mathbb{E}_{H_0}(X_k)} \sim \chi^2(d-1)$$

This can be formulated more generally as

$$\chi^2 = \sum_{i=1}^{d}\frac{(O_i - E_i)^2}{E_i}$$

where $O_i$ is the number of observations of type $i$ and $E_i$ is the expected (theoretical) frequency of type $i$ asserted by the null hypothesis.

:::
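As a worked sketch, consider testing whether a six-sided die is fair. The observed counts are hypothetical, and since the standard library has no $\chi^2$-distribution the critical value 11.070 is the tabulated 0.95-quantile of $\chi^2(5)$, valid only for $d = 6$ categories and $\alpha = 0.05$.

```python
def chi_square_statistic(observed, probabilities):
    """Sum over categories of (observed - expected)^2 / expected."""
    n = sum(observed)
    expected = [n * p for p in probabilities]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [8, 12, 9, 11, 10, 10]  # hypothetical counts from 60 die rolls
fair_die = [1 / 6] * 6             # H0: every face is equally likely
stat = chi_square_statistic(observed, fair_die)

CHI2_CRIT_DF5 = 11.070  # tabulated 0.95-quantile of chi^2(5)
reject = stat >= CHI2_CRIT_DF5
```

Each face is expected 10 times, so the statistic is $(4 + 4 + 1 + 1 + 0 + 0)/10 = 1$, far below the critical value, and the fairness hypothesis is not rejected.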