This is the first note of a series whose goal is to describe the close relationship between machine learning and statistics. Here I describe statistics in a nutshell, going from simple examples to a more advanced mathematical formulation.
The goal of statistics is, given a set of samples, that is, points in a "space" \(X\), to find a "good" probability distribution that would have produced such samples. Let us look at two simple examples.
Imagine that your samples \((x_i)_{i=1}^N\) are daily records of the weather (either good weather or bad weather) in Paris for one year.
In that context, the space you are working with
is just the two-element set:
\[
\{\mathrm{bad}, \mathrm{good}\}.
\]
A probability distribution on such a simple space consists of two numbers: the
probability that the daily weather is good, \(p(x=\mathrm{good})\),
and the probability that it is bad
\[
p(x=\mathrm{bad}) = 1 - p(x=\mathrm{good}).
\]
Intuitively, the best probability distribution fitting our data is the one
where the probability of good weather is the empirical frequency of
good-weather days:
\begin{align*}
p(x=\mathrm{good}) = \frac{\#\{i|\ x_i = \mathrm{good}\} }{N},
\\
p(x=\mathrm{bad}) = \frac{\#\{i|\ x_i = \mathrm{bad}\} }{N}.
\end{align*}
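In code, this amounts to counting frequencies. Here is a minimal Python sketch (the sample list below is made up for illustration):

```python
# Minimal sketch of the weather example; the sample list is made up.
samples = ["good", "bad", "good", "good", "bad"]
N = len(samples)

# empirical frequencies, i.e. the intuitive "best" distribution
p_good = sum(1 for x in samples if x == "good") / N
p_bad = 1 - p_good
print(p_good, p_bad)  # 0.6 and 0.4
```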
Imagine that we have a set of real values \((x_i)_{i=1}^N\) representing the heights of individuals in a population. Their distribution is shown in the following histogram.
Such a distribution looks like that of a Gaussian measure
of the form
\[
p_{\mu, \sigma}(x) = \frac{1}{\sigma \sqrt{2\pi}}
\exp\left(\frac{-(x-\mu)^2}{2\sigma^2}\right)
\]
where \(\mu\) is the average value and \(\sigma>0\) is the standard
deviation around that average value.
We can then try to find the best Gaussian approximation
of our height distribution, that is, to find the parameters
\(\mu, \sigma\) for which \(p_{\mu, \sigma}\) is most likely
to produce our data \((x_i)_{i=1}^N\) as independent samples.
Intuitively, one would choose \(\mu\)
to be the empirical average of the data
and \(\sigma^2\) to be the empirical variance around this empirical
average value:
\begin{align*}
\overline{x} &= \frac{1}{N} \sum_{i=1}^{N}x_i
\\
\overline{\sigma^2} &= \frac{1}{N} \sum_{i=1}^{N} (x_i - \overline{x})^2.
\end{align*}
The reader familiar with statistics may also use the following variance estimator
\[
\widetilde{\sigma^2} = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \overline{x})^2
\]
which is unbiased, but it is not what the maximum likelihood method provides.
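To see both estimators side by side, here is a minimal Python sketch; the heights are simulated rather than measured (the mean of 170 cm and standard deviation of 8 cm are made-up values):

```python
import numpy as np

# Minimal sketch of the Gaussian fit on simulated heights (in cm).
rng = np.random.default_rng(1)
heights = rng.normal(170.0, 8.0, size=1_000)

mu_hat = heights.mean()                       # empirical average
var_mle = ((heights - mu_hat) ** 2).mean()    # maximum-likelihood variance (1/N)
var_unbiased = heights.var(ddof=1)            # unbiased estimator (1/(N-1))
print(mu_hat, var_mle, var_unbiased)
```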
Let us formulate this in mathematical terms. We have \(N\) samples in \(X\), which together form a single point \((x_i)_{i=1}^N\) in the product space \(X^N\). Our statistical model consists of a parametrized set of probability distributions on \(X^N\)
\[
p_\theta^{(N)}(z_1, \ldots, z_N), \ \theta \in \Theta, \ (z_1, \ldots, z_N) \in X^N
\]
where \(\Theta\) is the parameter space; our goal is to find a parameter \(\widehat{\theta} \in \Theta\) that maximises the probability of the sample sequence \((x_i)_{i=1}^N\)
\[
\widehat{\theta} = \mathrm{argmax}_\theta \left(p_\theta^{(N)}(x_1, \ldots, x_N) \right).
\]
In most cases we assume that the samples are independent and identically distributed on \(X\). Putting that information into our statistical model just means that \(p_\theta^{(N)}(z_1, \ldots, z_N)\) has the form
\[
p_\theta^{(N)}(z_1, \ldots, z_N) = \prod_{i=1}^N p_\theta(z_i)
\]
where \(p_\theta\) is a parametrized probability distribution on \(X\). In that context, we want to find the \(\widehat{\theta}\) that maximises the probability, under the model, of obtaining the actual samples
\[
\mathcal L_\theta = \prod_{i=1}^{N} p_\theta(x_i),
\]
a quantity called the likelihood. Equivalently, we can maximise the log-likelihood, which is the sum
\[
\log\mathcal L_\theta = \sum_{i=1}^{N} \log(p_\theta(x_i)).
\]
Let us go back to the two examples of the first paragraph.
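As a quick check that the formalism recovers our earlier intuition (a standard computation, sketched here for the weather example): writing \(\theta = p_\theta(x=\mathrm{good})\) and \(k = \#\{i|\ x_i = \mathrm{good}\}\), the log-likelihood is
\begin{align*}
\log\mathcal L_\theta &= k \log(\theta) + (N-k)\log(1-\theta),
\\
\frac{d}{d\theta}\log\mathcal L_\theta &= \frac{k}{\theta} - \frac{N-k}{1-\theta},
\end{align*}
which vanishes exactly at \(\widehat{\theta} = k/N\), the empirical frequency given above. A similar computation with \(p_{\mu,\sigma}\) recovers the empirical average \(\overline{x}\) and the empirical variance \(\overline{\sigma^2}\).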
We can imagine that our samples are produced independently by a probability distribution \(p\). In that context, a mathematical result called the law of large numbers tells us that when \(N\) is large enough, the normalized log-likelihood \(\frac{1}{N}\log \mathcal{L}_\theta\) converges to the expectation of the quantity \(\log\left({p_\theta(x)}\right)\) for \(x\) distributed according to the probability distribution \(p\):
\[
\frac{1}{N}\log \mathcal{L}_\theta = \frac{1}{N}\sum_{i} \log\left({p_\theta(x_i)}\right) \xrightarrow{N \to \infty} \int_{x \in X} p(x)\log\left({p_\theta(x)}\right) dx.
\]
The opposite of such a quantity is called the cross-entropy of \(p\) with respect to \(p_\theta\): for two probability distributions \(q_1, q_2\), the cross-entropy of \(q_1\) with respect to \(q_2\) is
\[
H(q_1, q_2) = -\int_{x \in X} q_1(x)\log\left({q_2(x)}\right) dx = \int_{x \in X} q_1(x)\log\left(\frac{1}{q_2(x)}\right) dx.
\]
We can consider the cross-entropy as some kind of distance between \(q_1\) and \(q_2\). However, it does not have the usual properties of distances in mathematics. For instance, it is not zero when \(q_1=q_2\); instead it is equal to the entropy of \(q_1\)
\[
H(q_1) = H(q_1,q_1) = \int_{x \in X} q_1(x)\log\left(\frac{1}{q_1(x)}\right) dx.
\]
To fix that, we can subtract the entropy \(H(q_1)\) from the cross-entropy \(H(q_1, q_2)\) to obtain the Kullback–Leibler divergence
\[
KL(q_1||q_2) = H(q_1, q_2)-H(q_1) = \int_{x \in X} q_1(x)\log\left(\frac{q_1(x)}{q_2(x)}\right) dx.
\]
It is still not a distance in the mathematical sense of the word (it is not symmetric, for instance), but the concavity of the log function implies that \(KL(q_1||q_2) \geq 0\), with equality if and only if \(q_1 = q_2\). In particular, for large \(N\), maximising the likelihood amounts to minimising the cross-entropy \(H(p, p_\theta)\), and hence the Kullback–Leibler divergence \(KL(p||p_\theta)\), since the entropy \(H(p)\) does not depend on \(\theta\).
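Here is a minimal numerical illustration of this convergence, in the Bernoulli setting of the weather example (the parameter values below are made up):

```python
import numpy as np

# The normalized log-likelihood under p_theta, for samples drawn from p,
# approaches minus the cross-entropy H(p, p_theta) as N grows.
rng = np.random.default_rng(0)
p, theta = 0.7, 0.6                      # true parameter vs. model parameter
N = 100_000
x = rng.random(N) < p                    # boolean samples: True = "good"

# (1/N) * sum_i log p_theta(x_i)
norm_log_lik = np.where(x, np.log(theta), np.log(1 - theta)).mean()

# exact cross-entropy H(p, p_theta) and entropy H(p)
H_cross = -(p * np.log(theta) + (1 - p) * np.log(1 - theta))
H_p = -(p * np.log(p) + (1 - p) * np.log(1 - p))

print(norm_log_lik, -H_cross)            # close for large N
print(H_cross - H_p)                     # KL(p || p_theta), always >= 0
```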
Let us shift our point of view towards the Bayesian perspective. In that perspective, before any knowledge of the samples, we postulate a probability distribution on \(X^N\)
\[
q(z_1, \ldots, z_N),\ (z_1, \ldots, z_N) \in X^N
\]
that does not represent a fictional probability distribution that would produce the samples, but rather our own expectation of seeing a particular set of samples appear; we use the letter \(q\) instead of the letter \(p\) to emphasize this shift in interpretation. In the context of parametrized Bayesian statistics, and if we postulate independence of the samples, \(q(z_1, \ldots, z_N)\) is built from a parametrized distribution
\[
q_\theta(z_1,\ldots, z_N) = \prod_{i=1}^{N}q_\theta(z_i),
\]
that is, a set of distributions on \(X\) that depends on a parameter \(\theta \in \Theta\). Before seeing the actual samples, we initially have a vague idea of what the parameter \(\theta\) should be; this idea is encoded in a probability distribution \(q_{\mathrm{prior}}(\theta)\) on \(\Theta\) called the prior probability. Combining the parametrized distribution \(q_\theta\) on \(X^N\) and the prior distribution \(q_{\mathrm{prior}}(\theta)\) on \(\Theta\) gives us a probability distribution on the product \(X^N \times \Theta\)
\[
Q(z_1, \ldots, z_N, \theta) = q_{\mathrm{prior}}(\theta) \prod_{i=1}^{N}q_\theta(z_i).
\]
Then \(q(z_1,\ldots, z_N)\) is the marginal probability distribution on \(X^N\) obtained from \(Q\):
\begin{align*}
q(z_1, \ldots, z_N) =& \int_{\theta \in \Theta} Q(z_1, \ldots, z_N, \theta)d\theta
\\
=&\int_{\theta \in \Theta} q_{\mathrm{prior}}(\theta) \times \prod_{i=1}^{N} q_{\theta}(z_i)d\theta.
\end{align*}
Moreover, \(q_\theta(z_1, \ldots, z_N)\) is actually the conditional probability distribution on \(X^N\) obtained from \(Q\) given \(\theta\)
\[
q_\theta(z_1, \ldots, z_N) = Q(z_1, \ldots, z_N |\theta) = \frac{Q(z_1, \ldots, z_N, \theta)}{q_{\mathrm{prior}}(\theta)}.
\]
The effect of seeing the samples \((x_i)_{i=1}^N\) is to make us revise our view on the parameters. Concretely, our new expectation on the parameters \(\theta \in \Theta\), which we call the posterior probability distribution \(q_{\mathrm{post}}(\theta)\), is the conditional probability from \(Q\) given these samples
\[
q_{\mathrm{post}}(\theta) = Q(\theta|x_1, \ldots, x_N) = \frac{Q(x_1, \ldots, x_N, \theta)}{q(x_1, \ldots, x_N)}.
\]
This can be rewritten (with the renowned Bayes rule) as
\[
q_{\mathrm{post}}(\theta) = \frac{q_{\mathrm{prior}}(\theta)\prod_{i=1}^N q_\theta(x_i)}{q(x_1, \ldots, x_N)} = \frac{q_{\mathrm{prior}}(\theta)\mathcal{L}_\theta}{q(x_1, \ldots, x_N)}
\]
where \(\mathcal{L}_\theta = \prod_{i=1}^N q_\theta(x_i)\) is the likelihood. The posterior probability is actually intractable in most real-life contexts, as it requires computing \(q(x_1, \ldots, x_N)\), which is an integral over \(\Theta\). To circumvent this difficulty, we can instead search for the parameter value \(\widehat{\theta}\) that maximises the posterior probability distribution \(q_{\mathrm{post}}\) given the actual samples \(x_1, \ldots, x_N\); equivalently, it maximises the product of \(q_{\mathrm{prior}}(\theta)\) with the likelihood \(\mathcal{L}_\theta\) (since \(q(x_1, \ldots, x_N)\) does not depend on \(\theta\)); or, taking the log, \(\widehat{\theta}\) maximises the sum
\[
\log \mathcal{L}_\theta + \log(q_{\mathrm{prior}}(\theta)) = \sum_{i=1}^N \log(q_\theta(x_i)) + \log(q_{\mathrm{prior}}(\theta)).
\]
Such a method is called maximum a posteriori estimation.
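Here is a minimal maximum a posteriori sketch for the weather example, assuming a Beta prior on \(\theta = q_\theta(x=\mathrm{good})\); the counts and the hyperparameters \(\alpha = \beta = 2\) are made-up choices for illustration:

```python
import numpy as np

# Model: x_i ~ Bernoulli(theta); prior: theta ~ Beta(alpha, beta).
def log_posterior(theta, k, N, alpha=2.0, beta=2.0):
    # log-likelihood + log-prior, dropping theta-independent constants
    log_lik = k * np.log(theta) + (N - k) * np.log(1 - theta)
    log_prior = (alpha - 1) * np.log(theta) + (beta - 1) * np.log(1 - theta)
    return log_lik + log_prior

k, N = 3, 10                                   # 3 good-weather days out of 10
thetas = np.linspace(1e-3, 1 - 1e-3, 10_000)   # grid search over (0, 1)
theta_map = thetas[np.argmax(log_posterior(thetas, k, N))]

# For alpha, beta > 1 the maximiser has a known closed form:
theta_closed = (k + 2.0 - 1) / (N + 2.0 + 2.0 - 2)
print(theta_map, theta_closed)                 # both ~ 0.333; the MLE is k/N = 0.3
```

The closed form exists because the Beta prior is conjugate to the Bernoulli likelihood: the posterior is itself a Beta distribution, whose mode is known explicitly.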
The additional term \(\log(q_{\mathrm{prior}}(\theta))\), which reflects our prior beliefs on \(\theta\), is called the regularisation term. Usually, the more specific the prior is, the more this term pulls the maximum a posteriori parameter away from the maximum likelihood parameter. In extreme cases: a uniform prior makes \(\log(q_{\mathrm{prior}}(\theta))\) constant, so the maximum a posteriori parameter coincides with the maximum likelihood parameter; conversely, a prior concentrated around a single value of \(\theta\) dominates the likelihood, and the maximum a posteriori parameter stays close to that value whatever the samples say.