Descriptive Statistics


Part of the Statistics for Machine Learning series.

:::Definition (Empirical mean).

Let $(x_1, \dots, x_n)$ be a sample. The empirical mean (or center of mass) of the data is given by

$$\overline{x} = \frac{1}{n}\sum_{i=1}^n x_i$$

:::

:::Definition (Empirical variance).

The empirical variance with Bessel's correction of the sample is given by

$$s^2 = \frac{1}{n-1}\sum_{i=1}^n(x_i-\overline{x})^2$$

The (empirical) standard deviation is the square root of the empirical variance:

$$s = \sqrt{s^2}$$

:::
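As a minimal sketch, the mean, the Bessel-corrected variance and the standard deviation can be computed directly from their definitions. The function names `empirical_mean`, `empirical_variance` and `empirical_std` and the sample values are illustrative choices, not part of the definitions above.

```python
import math
import statistics


def empirical_mean(xs):
    """Arithmetic mean: (1/n) * sum of x_i."""
    return sum(xs) / len(xs)


def empirical_variance(xs):
    """Empirical variance with Bessel's correction (divide by n - 1)."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    x_bar = empirical_mean(xs)
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)


def empirical_std(xs):
    """Empirical standard deviation: square root of the variance."""
    return math.sqrt(empirical_variance(xs))


# Hypothetical sample, chosen only for illustration.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print(empirical_mean(sample))      # 5.0
print(empirical_variance(sample))  # 32/7, approximately 4.571
print(empirical_std(sample))       # approximately 2.138

# Cross-check against the standard library, which also divides by n - 1.
assert math.isclose(empirical_variance(sample), statistics.variance(sample))
assert math.isclose(empirical_std(sample), statistics.stdev(sample))
```

The standard library's `statistics.variance` and `statistics.stdev` use the same $n-1$ denominator, which is why they agree with the hand-written versions here.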

:::Definition (Empirical covariance).

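For a sample of $N$ observations of $K$ variables, let $x_{ij}$ denote the value of the $j$-th variable in the $i$-th observation and $\bar{x}_j$ the sample mean of the $j$-th variable. The sample covariance matrix is a $K \times K$ matrix $\mathbf{Q}=\left[ q_{jk}\right]$ with entries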

$$q_{jk}=\frac{1}{N-1}\sum_{i=1}^{N}\left( x_{ij}-\bar{x}_j \right) \left( x_{ik}-\bar{x}_k \right)$$

where $q_{jk}$ is an estimate of the covariance between the $j$-th variable and the $k$-th variable of the population underlying the data. In terms of the observation vectors, the sample covariance is

$$\mathbf{Q} = \frac{1}{N-1}\sum_{i=1}^{N}(\mathbf{x}_{i} - \mathbf{\bar{x}})(\mathbf{x}_{i} - \mathbf{\bar{x}})^{\mathrm{T}}$$

:::
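The entrywise formula can be sketched in plain Python as follows. This assumes the data are stored as a list of $N$ rows with $K$ values each; the names `column_means` and `sample_covariance_matrix` and the example data are hypothetical.

```python
import math


def column_means(rows):
    """Sample mean of each of the K variables (columns)."""
    n = len(rows)
    k = len(rows[0])
    return [sum(row[j] for row in rows) / n for j in range(k)]


def sample_covariance_matrix(rows):
    """K x K sample covariance matrix with the 1/(N-1) normalisation."""
    n = len(rows)
    if n < 2:
        raise ValueError("need at least two observations")
    k = len(rows[0])
    means = column_means(rows)
    q = [[0.0] * k for _ in range(k)]
    for row in rows:
        # Deviation of this observation from the mean vector.
        dev = [row[j] - means[j] for j in range(k)]
        # Accumulate the outer product dev * dev^T.
        for j in range(k):
            for l in range(k):
                q[j][l] += dev[j] * dev[l]
    return [[value / (n - 1) for value in q_row] for q_row in q]


# Hypothetical data: 4 observations of 2 variables, second = 2 * first.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
Q = sample_covariance_matrix(data)

# Diagonal entries are the empirical variances of each variable,
# and the matrix is symmetric.
assert math.isclose(Q[0][0], 5 / 3)
assert math.isclose(Q[1][1], 20 / 3)
assert math.isclose(Q[0][1], Q[1][0])
```

Accumulating the outer product of each deviation vector mirrors the vector form of the definition; the nested loops are written for clarity rather than speed.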

:::Definition (Empirical correlation coefficient).

For realizations of random variables $(x_i, y_i)_{i=1}^n$ we estimate $\mathbb{C}ov(X,Y)$ by

$$\begin{split} \mathbb{C}ov(X,Y) &= \mathbb{E}((X-\mathbb{E}(X))\cdot(Y-\mathbb{E}(Y))) \\ &\approx \frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-\overline{x})(y_{i}-\overline{y}) \end{split}$$

Dividing this estimate by the empirical standard deviations $s_x$ and $s_y$ gives an estimate of the correlation coefficient $\rho$:

$$r = \frac{\frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-\overline{x})(y_{i}-\overline{y})}{s_x \cdot s_y}$$

:::
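A small sketch of the correlation estimate, assuming two equally long lists of paired observations. The name `pearson_r` and the example values are illustrative; the check for zero standard deviation is an added safeguard, since $r$ is undefined when either variable is constant.

```python
import math


def pearson_r(xs, ys):
    """Empirical correlation: sample covariance divided by s_x * s_y."""
    n = len(xs)
    if n != len(ys):
        raise ValueError("xs and ys must have the same length")
    if n < 2:
        raise ValueError("need at least two observations")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    if s_x == 0 or s_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / (s_x * s_y)


# Hypothetical paired observations.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

r = pearson_r(xs, ys)
print(r)  # 6 / sqrt(60), approximately 0.775

# An exact increasing linear relationship gives r = 1 (up to rounding).
assert math.isclose(pearson_r(xs, [2 * x + 1 for x in xs]), 1.0)
```

Because the same $\frac{1}{n-1}$ factor appears in the covariance and in both standard deviations, it cancels in the ratio, so $r$ does not depend on whether Bessel's correction is used.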

:::Definition (Median).

Let $(x_1, \dots, x_n)$ be a finite list of numbers such that $x_1 \leq \dots \leq x_n$. If $n$ is odd then

$$\text{median}(x) = x_{(n+1)/2}$$

and if $n$ is even then

$$\text{median}(x) = \frac{x_{n/2}+x_{(n/2)+1}}{2}$$

:::
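The two cases of the median can be sketched as below. The definition uses 1-based positions in the sorted list, while Python lists are 0-based, so position $(n+1)/2$ becomes index `n // 2`. The function name `median` is an illustrative choice; the input does not need to be sorted beforehand because the function sorts a copy.

```python
import statistics


def median(xs):
    """Median of a finite list of numbers."""
    s = sorted(xs)
    n = len(s)
    if n == 0:
        raise ValueError("median of an empty list is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: 1-based position (n + 1) / 2 is 0-based index n // 2.
        return s[mid]
    # Even n: average the 1-based positions n / 2 and n / 2 + 1.
    return (s[mid - 1] + s[mid]) / 2


print(median([3, 1, 2]))     # 2
print(median([4, 1, 3, 2]))  # 2.5

# Cross-check against the standard library.
assert median([7, 1, 5, 3, 9]) == statistics.median([7, 1, 5, 3, 9])
```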

:::Definition (Mode).

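The mode is the most frequent value in a data set. It need not be unique: if several values share the highest frequency, each of them is a mode.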

:::
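A short sketch of finding the mode by counting occurrences. Since a data set can have several most frequent values, the hypothetical helper `modes` returns all of them as a list, in order of first appearance.

```python
from collections import Counter


def modes(xs):
    """All values that occur with the highest frequency."""
    counts = Counter(xs)
    if not counts:
        raise ValueError("mode of an empty data set is undefined")
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]


print(modes([1, 2, 2, 3]))     # [2]
print(modes([1, 2, 2, 3, 3]))  # [2, 3]
print(modes(["a", "b", "a"]))  # ['a']
```

Unlike the mean and the median, the mode is also defined for non-numeric data, as the last call illustrates.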