This is the first note of a series whose goal is to describe the close relationship between machine learning and statistics. Here I describe statistics in a nutshell, going from simple examples to a more advanced mathematical description.
The goal of statistics is, given a set of samples, that is, points in a "space" \(X\), to find a "good" probability distribution that could have produced such samples. Let us look at two simple examples.
Imagine that our samples \((x_i)_{i=1}^N\) are daily records of the weather in Paris (either good weather or bad weather) for one year.
Imagine that we have a set of real values \((x_i)_{i=1}^N\) representing the heights of a population. Their distribution is shown in the following histogram.
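For readers who want to reproduce such a plot, here is a minimal Python sketch using synthetic heights as a stand-in for the real data (which is not reproduced in this note):

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for the height samples (in cm); the real data set is not shown here.
rng = np.random.default_rng(0)
heights = rng.normal(loc=170.0, scale=8.0, size=1000)

plt.hist(heights, bins=30, edgecolor="black")
plt.xlabel("height (cm)")
plt.ylabel("count")
plt.title("Histogram of the height samples")
plt.show()
```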
The reader familiar with statistics may also use the following variance estimator \[ \widetilde{\sigma^2} = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \overline{x})^2 \] which is unbiased, but it is not what the maximum likelihood method provides.
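To make the difference concrete, here is a minimal NumPy sketch on synthetic height data, comparing the maximum likelihood estimator \(\frac{1}{N}\sum_{i=1}^{N}(x_i - \overline{x})^2\) with the unbiased estimator above:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=170.0, scale=8.0, size=50)  # synthetic height samples
N = len(x)

x_bar = x.mean()
var_mle = np.sum((x - x_bar) ** 2) / N             # maximum likelihood estimator (biased)
var_unbiased = np.sum((x - x_bar) ** 2) / (N - 1)  # unbiased estimator

# NumPy exposes both through the ddof argument of np.var:
assert np.isclose(var_mle, np.var(x, ddof=0))
assert np.isclose(var_unbiased, np.var(x, ddof=1))

print(var_mle, var_unbiased)
```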
Let us formulate this in mathematical terms. We have \(N\) samples in \(X\), which together form a single point \((x_i)_{i=1}^N\) in the product space \(X^N\). Our statistical model consists of a parametrized set of probability distributions on \(X^N\) \[ p_\theta^{(N)}(z_1, \ldots, z_N), \ \theta \in \Theta, \ (z_1, \ldots, z_N) \in X^N \] where \(\Theta\) is the parameter space; our goal is to find a parameter \(\widehat{\theta} \in \Theta\) that maximises the probability of the observed sample sequence \((x_i)_{i=1}^N\): \[ \widehat{\theta} = \mathrm{argmax}_\theta \left(p_\theta^{(N)}(x_1, \ldots, x_N) \right). \] In most cases we assume that the samples are independent and identically distributed on \(X\). Putting that information into our statistical model just means that \(p_\theta^{(N)}(z_1, \ldots, z_N)\) has the form \[ p_\theta^{(N)}(z_1, \ldots, z_N) = \prod_{i=1}^N p_\theta(z_i) \] where \(p_\theta\) is a parametrized probability distribution on \(X\). In that context, we want to find the \(\widehat{\theta}\) that maximises the probability the model assigns to the actual samples, \[ \mathcal L_\theta = \prod_{i=1}^{N} p_\theta(x_i), \] which is called the likelihood \(\mathcal L_\theta\). Equivalently, we want to maximise the log-likelihood, which is the sum \[ \log\mathcal L_\theta = \sum_{i=1}^{N} \log(p_\theta(x_i)). \] Let us go back to the two examples of the first paragraph.
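For the weather example, a Bernoulli model is natural. Here is a minimal sketch, on synthetic weather records (the real data is not reproduced here), that maximises the log-likelihood over a grid of parameter values:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic weather records for one year: 1 = good weather, 0 = bad weather.
x = rng.binomial(n=1, p=0.6, size=365)

# Bernoulli model on X = {0, 1}: p_theta(1) = theta, p_theta(0) = 1 - theta.
def log_likelihood(theta, samples):
    return np.sum(samples * np.log(theta) + (1 - samples) * np.log(1 - theta))

# Maximise the log-likelihood over a grid of parameter values.
thetas = np.linspace(0.001, 0.999, 999)
theta_hat = thetas[np.argmax([log_likelihood(t, x) for t in thetas])]

# The maximiser is (up to grid resolution) the empirical frequency of good weather.
print(theta_hat, x.mean())
```

In closed form, the maximiser of the Bernoulli log-likelihood is exactly the fraction of days with good weather; the grid search above is only meant to make the maximisation explicit.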
We can imagine that our samples are produced independently by a probability distribution \(p\). In that context, a mathematical result called the law of large numbers tells us that when \(N\) is large the normalized log-likelihood \(\frac{1}{N}\log \mathcal{L}_\theta\) converges to the expectation of the quantity \(\log\left({p_\theta(x)}\right)\) for \(x\) distributed according to the probability distribution \(p\): \[ \frac{1}{N}\log \mathcal{L}_\theta = \frac{1}{N}\sum_{i} \log\left({p_\theta(x_i)}\right) \xrightarrow{N \to \infty} \int_{x \in X} p(x)\log\left({p_\theta(x)}\right) dx. \] The negative of such a quantity is called the cross-entropy of \(p\) with respect to \(p_\theta\): for two probability distributions \(q_1, q_2\) the cross-entropy of \(q_1\) with respect to \(q_2\) is \[ H(q_1, q_2) = -\int_{x \in X} q_1(x)\log\left({q_2(x)}\right) dx = \int_{x \in X} q_1(x)\log\left(\frac{1}{q_2(x)}\right) dx. \] We can consider the cross-entropy as some kind of distance between \(q_1\) and \(q_2\). However, it does not have the usual properties of a distance in mathematics. For instance, it is not zero when \(q_1=q_2\); instead it is equal to the entropy of \(q_1\) \[ H(q_1) = H(q_1,q_1) = \int_{x \in X} q_1(x)\log\left(\frac{1}{q_1(x)}\right) dx. \] To fix that, we can subtract the entropy \(H(q_1)\) from the cross-entropy \(H(q_1, q_2)\) to obtain the Kullback-Leibler divergence (also called the relative entropy of \(q_1\) with respect to \(q_2\)) \[ KL(q_1||q_2) = H(q_1, q_2)-H(q_1) = \int_{x \in X} q_1(x)\log\left(\frac{q_1(x)}{q_2(x)}\right) dx. \] Nor is the Kullback-Leibler divergence a distance in the mathematical sense of the word (it is not symmetric, for instance), but the concavity of the log function implies that \[ KL(q_1||q_2) \geq 0, \] with equality if and only if \(q_1 = q_2\).
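For concreteness, here is a small sketch computing the cross-entropy, the entropy and the Kullback-Leibler divergence of two discrete distributions (on a finite space, so the integrals become sums; the two distributions are arbitrary illustrative values):

```python
import numpy as np

# Two probability distributions on a finite space X = {0, 1, 2, 3}.
q1 = np.array([0.1, 0.4, 0.3, 0.2])
q2 = np.array([0.25, 0.25, 0.25, 0.25])

cross_entropy = -np.sum(q1 * np.log(q2))  # H(q1, q2)
entropy = -np.sum(q1 * np.log(q1))        # H(q1) = H(q1, q1)
kl = np.sum(q1 * np.log(q1 / q2))         # KL(q1 || q2) = H(q1, q2) - H(q1)

print(cross_entropy, entropy, kl)
assert np.isclose(kl, cross_entropy - entropy)
assert kl >= 0  # Gibbs' inequality: KL is zero only when q1 == q2
```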
Let us shift our point of view towards the Bayesian perspective. In that perspective, before any knowledge of the samples, we postulate a probability distribution on \(X^N\) \[ q(z_1, \ldots, z_N),\ (z_1, \ldots, z_N) \in X^N \] that does not represent a fictional probability distribution that would produce the samples but rather our own expectation of seeing a particular set of samples appear; we use the letter \(q\) instead of the letter \(p\) to emphasize this shift in interpretation. In the context of parametrized Bayesian statistics, and if we postulate independence of the samples, \(q(z_1, \ldots, z_N)\) is built from a parametrized distribution \[ q_\theta(z_1,\ldots, z_N) = \prod_{i=1}^{N}q_\theta(z_i), \] that is, a set of distributions on \(X\) that depends on a parameter \(\theta \in \Theta\). Before seeing the actual samples, we initially have only a vague idea of what the parameter \(\theta\) should be; this idea is encoded in a probability distribution \(q_{\mathrm{prior}}(\theta)\) on \(\Theta\) called the prior probability. Combining the parametrized distribution \(q_\theta\) on \(X^N\) and the prior distribution \(q_{\mathrm{prior}}(\theta)\) on \(\Theta\) gives us a probability distribution on the product \(X^N \times \Theta\) \[ Q(z_1, \ldots, z_N, \theta) = q_{\mathrm{prior}}(\theta) \prod_{i=1}^{N}q_\theta(z_i). \] Then, \(q(z_1,\ldots, z_N)\) is the marginal probability distribution on \(X^N\) from \(Q\): \begin{align*} q(z_1, \ldots, z_N) =& \int_{\theta \in \Theta} Q(z_1, \ldots, z_N, \theta)\,d\theta \\ =&\int_{\theta \in \Theta} q_{\mathrm{prior}}(\theta) \times \prod_{i=1}^{N} q_{\theta}(z_i)\,d\theta. \end{align*} Moreover, \(q_\theta(z_1, \ldots, z_N)\) is actually the conditional probability distribution on \(X^N\) from \(Q\) with respect to \(\theta\) \[ q_\theta(z_1, \ldots, z_N) = Q(z_1, \ldots, z_N |\theta) = \frac{Q(z_1, \ldots, z_N, \theta)}{q_{\mathrm{prior}}(\theta)}. \] The effect of seeing the samples \((x_i)_{i=1}^N\) is to make us revise our view on the parameters. Concretely, our new expectation on the parameters \(\theta \in \Theta\), which we call the posterior probability distribution \(q_{\mathrm{post}}(\theta)\), is the conditional probability from \(Q\) with respect to these samples \[ q_{\mathrm{post}}(\theta) = Q(\theta|x_1, \ldots, x_N) = \frac{Q(x_1, \ldots, x_N, \theta)}{q(x_1, \ldots, x_N)}. \] This can be rewritten (with the renowned Bayes rule) as \[ q_{\mathrm{post}}(\theta) = \frac{q_{\mathrm{prior}}(\theta)\prod_{i=1}^N q_\theta(x_i)}{q(x_1, \ldots, x_N)} = \frac{q_{\mathrm{prior}}(\theta)\mathcal{L}_\theta}{q(x_1, \ldots, x_N)} \] where \(\mathcal{L}_\theta = \prod_{i=1}^N q_\theta(x_i)\) is the likelihood. The posterior probability is actually intractable in most real-life contexts (as it requires computing \(q(x_1, \ldots, x_N)\), which is an integral over \(\Theta\)). To circumvent this difficulty, we can instead search for the parameter value \(\widehat{\theta}\) that maximises the posterior probability distribution \(q_{\mathrm{post}}\) given the actual samples \(x_1, \ldots, x_N\); equivalently, it maximises the product of \(q_{\mathrm{prior}}(\theta)\) with the likelihood \(\mathcal{L}_\theta\) (since \(q(x_1, \ldots, x_N)\) does not depend on \(\theta\)); or, taking the log, \(\widehat{\theta}\) maximises the sum \[ \log \mathcal{L}_\theta + \log(q_{\mathrm{prior}}(\theta)) = \sum_{i=1}^N \log(q_\theta(x_i)) + \log(q_{\mathrm{prior}}(\theta)). \] Such a method is called maximum a posteriori (MAP) estimation.
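As an illustration, here is a minimal sketch of maximum a posteriori estimation on the weather example, assuming a Bernoulli model and a Beta(2, 2) prior on \(\theta\) (both the synthetic data and the choice of prior are purely illustrative):

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(1)
x = rng.binomial(n=1, p=0.6, size=365)  # synthetic weather records (1 = good weather)

# Prior belief on theta: a Beta(2, 2) distribution, mildly favouring values near 0.5.
def log_prior(theta):
    return beta.logpdf(theta, a=2, b=2)

# Bernoulli log-likelihood of the observed samples.
def log_likelihood(theta):
    return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

# Maximum a posteriori: maximise log-likelihood + log-prior over a grid.
thetas = np.linspace(0.001, 0.999, 999)
theta_map = thetas[np.argmax([log_likelihood(t) + log_prior(t) for t in thetas])]
theta_mle = thetas[np.argmax([log_likelihood(t) for t in thetas])]

print(theta_map, theta_mle)  # with 365 samples the two estimates are very close
```

With a flat (uniform) prior, the \(\log(q_{\mathrm{prior}}(\theta))\) term is constant and the MAP estimate coincides with the maximum likelihood estimate.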
The additional term \(\log(q_{\mathrm{prior}}(\theta))\), which reflects our prior beliefs on \(\theta\), is called the regularisation term. Usually, the more specific the prior is, the more this term pulls the maximum a posteriori parameter away from the maximum likelihood parameter. In extreme cases: