Probability Foundations
Part of the Statistics for Machine Learning series.
Probability Theory
:::Definition (Probability space)
A triple $(\Omega, \mathcal{E}, P)$ is called a probability space where
- $\Omega$ is called the sample space and is the non-empty set of all possible outcomes
- the event space or $\sigma$-algebra with $\mathcal{E} \subseteq \mathcal{P}(\Omega)$ is the set of events
- a probability measure $P: \mathcal{E} \to [0,1]$ which has to uphold the Kolmogorov axioms of probability
- (positivity) $P(A) \in \mathbb{R}^+$ for all $A \in \mathcal{E}$
- (unitarity) $P(\Omega) = 1$
- ($\sigma$-additivity) $P(\bigcup_{i=1}^\infty A_i) = \sum_{i=1}^\infty P(A_i)$ for all $A_i \cap A_{i+1} = \emptyset$ with $A_i \in \mathcal{E}$
:::
Other properties
- $P(\Omega) = 1$ the almost surely event
- $P(\emptyset) = 0$ the almost never event
- $P(A^\complement) = 1 - P(A)$
- $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
- $A \subseteq B \implies P(A) \leq P(B)$
:::Definition (Experiments and trials)
A process that selects an event from a set of possible outcomes $\Omega$ is called a trial. We call $\omega \in \Omega$ the elementary events. We call the sample space discrete if $|\Omega| = n$ for $n \in \mathbb{N}$, and continuous if $x \leq |\Omega| \leq y$ for some $x,y \in \mathbb{R}$.
:::
If the experiment consists of just one flip of a fair coin, then the outcome is either heads or tails: $\Omega = {\text{H}, \text{T}}$. The event space $\mathcal{E}$ contains $2^2 = 4$ events, namely:
- ${\text{H}}$ ("heads"),
- ${\text{T}}$ ("tails"),
- ${}$ ("neither heads nor tails"), and
- ${\text{H}, \text{T}}$ ("either heads or (!!!) tails")
Therefore, $\mathcal{E} = \{ \text{H, T, H, T} \}$. There is a fifty percent chance of tossing heads and fifty percent for tails, so the probability measure in this example is $P({}) = 0$, $P({\text{H}}) = 0.5$, $P({\text{T}}) = 0.5$, $P({\text{H}, \text{T}}) = 1$.
:::Definition (Odds and chances).
Typically odds describe the ratio of two events, whereas chances describe the ratio of an event to all other events.
For example: let the number of distinct but equally probable events be finite $|\Omega| = N$, denote a subset of favourable events as $A \subseteq \Omega$ and unfavourable events as $B = \Omega - A$. We say the odds of winning are $|A| : |B|$ and we say the chances of winning are $|A| : N$.
:::
:::Definition (Laplace experiment).
Let $(\Omega, \mathcal{E}, P)$ be a finite probability space $|\Omega| = n$ and $P(A) = p$ for all $A \in \Omega$. Then for any event $x \in \mathcal{E}$ we have
$$P(x) = \frac{\text{günstige}}{\text{mögliche}} = \frac{|x|}{|\Omega|}$$
This is also known as the classical interpretation of probability.
:::
:::Definition (Conditional probability).
$$P(A|B) = \frac{P(A \cap B)}{P(B)} \qquad \text{with } P(B) > 0$$
:::
:::Definition (Multiplication rule).
$$P(A \cap B) = P(A|B) \cdot P(B)$$
:::
:::Definition (Bayes' theorem).
Let $A, B$ be some event. If we combine the conditional probability with the multiplication rule we get
$$P(A|B) = \frac{P(B|A)\cdot P(A)}{P(B)}$$
with $P(B) > 0$.
We have the following convention
- $P(A\mid B)$ is a conditional probability: the probability of event $A$ occurring given that $B$ is true. It is also called the posterior probability of $A$ given $B$.
- $P(B\mid A)$ is also a conditional probability: the probability of event $B$ occurring given that $A$ is true. It can also be interpreted as the likelihood of $A$ given a fixed $B$ because $P(B\mid A) = L(A\mid B)$.
- $P(A)$ and $P(B)$ are the probabilities of observing $A$ and $B$ respectively without any given conditions; they are known as the prior probability and marginal probability.
:::
:::Definition (Independence).
Two events $A, B$ are said to be independent if
$$P(A|B) = P(A) \Leftrightarrow P(A \cap B) = P(A) \cdot P(B)$$
for any $A, B \subseteq \mathcal{E}$. More generally, events $A_1, A_2, \dots, A_n$ are independent if and only if
$$P \left (\bigcap_{i \in I} A_i \right ) = \prod_{i \in I} P(E_i) \quad \forall I \subseteq {1, \dots, n}$$
:::
:::Definition (Law of Total Probability).
Let $C_1, C_2, \dots$ with $P(C_i) > 0$ be a partition, such that $C_i \cap C_j = \emptyset$ for $i \neq j$ and also that $\bigcup_{i=1}^\infty C_i = \Omega$. Then the total probability of an event $A$ is given by
$$P(A) = \sum_{i=1}^\infty P(A \cap C_i) = \sum_{i=1}^\infty P(A|C_i) \cdot P(C_i)$$
:::
:::Observation
The law of total probability is used for the marginal probability in Bayes' theorem since
$$P(B) = \sum_{i=1}^\infty P(B|C_i) \cdot P(C_i)$$
we get that
$$P(A|B) = \frac{P(B|A)\cdot P(A)}{\sum_{i=1}^\infty P(B|C_i) \cdot P(C_i)}$$
:::
:::Definition (Union bound).
Let $A_1, A_2, \dots, A_n$ be events of some probability space. Then we define
$$P \left (\bigcup_{i=1}^n A_i \right ) \leq \sum_{i = 1}^{n} P(A_i)$$
:::