This is the second note in our series on the close relationship between machine learning and statistics. In it, I describe how supervised learning can be understood within the framework of statistics.
Let us suppose that we have a space of features \(X\) and a space of labels \(Y\), together with samples \[ \left((x_i, y_i)\right)_{i=1}^N \in (X \times Y)^N \] that are pairs \((x,y)\) of a feature element \(x \in X\) and a label \(y \in Y\). In supervised learning, we usually assume that the samples are produced independently by a probability distribution \(p\) on \(X \times Y\) that we refer to as the empirical distribution. We do not know \(p\); we only have information about it through its samples. Our goal is to find a "good" conditional probability distribution on \(Y\) given the elements \(x\) of \(X\) that fits our data. More concretely, we have a parametrized set of functions from \(X\) to probability distributions on \(Y\): \[ p_\theta(y|x),\quad x \in X, \ y \in Y,\ \theta \in \Theta \] where \(\Theta\) is our set of parameters. Combining this theoretical conditional probability distribution with the empirical distribution on \(X\) yields a mixed parametrized probability distribution on the product \(X \times Y\): \[ p_\theta (x,y) = p(x)p_\theta(y|x). \]

Our goal can then be described as finding the parameter \(\theta\) that minimises some kind of distance between the empirical distribution and the mixed probability distribution on \(X \times Y\). The common practice is to use the Kullback-Leibler divergence. Let us compute it: \begin{align*} KL(p||p_\theta) &= \int_{(x, y)\in X \times Y } p(x,y) \log \left(\frac{p(x, y)}{p_\theta(x,y)}\right) \\ &= \int_{(x, y)\in X \times Y } p(x,y) \log \left(\frac{p(y|x)}{p_\theta(y|x)}\right) \\ &= -\int_{(x, y)\in X \times Y } p(x,y) \log \left({p_\theta(y|x)}\right) + \int_{(x, y)\in X \times Y } p(x,y) \log \left({p(y|x)}\right) \\ &=-\int_{(x, y)\in X \times Y } p(x,y) \log \left({p_\theta(y|x)}\right) + \mathrm{cst} \end{align*} where \(\mathrm{cst}\) denotes a value that does not depend on \(\theta\). We cannot compute the Kullback-Leibler divergence as it stands, since \(p\) is unknown. However, we can approximate the integral \[ \int_{(x, y)\in X \times Y } p(x,y) \log \left({p_\theta(y|x)}\right) \] from our data with a simple Monte Carlo technique. Indeed, if \(N\) is large enough, the law of large numbers tells us that the mean value of \(\log \left({p_\theta(y_i|x_i)}\right)\) converges to this integral: \[ \frac{1}{N}\sum_{i=1}^N \log \left({p_\theta(y_i|x_i)}\right) \xrightarrow{N \to \infty} \int_{(x, y)\in X \times Y } p(x,y) \log \left({p_\theta(y|x)}\right). \] Our task thus becomes finding the parameter \(\theta \in \Theta\) that maximises this mean value. In practice, we minimise its negative, which we call the loss \begin{align*} \mathrm{Loss}(\theta) &= - \frac{1}{N}\sum_{i=1}^N \log \left({p_\theta(y_i|x_i)}\right) \\ &= \frac{1}{N}\sum_{i=1}^N \log \left(\frac{1}{p_\theta(y_i|x_i)}\right) \end{align*} as we will see in the context of regression and classification.

In many cases, the parametrized conditional distribution \(p_\theta(y|x)\) has the form \[ p_\theta(y|x) = \frac{\exp\left(-\frac{E_\theta(y|x)}{T}\right)}{Z_{\theta, T}(x)} \] where the function \(E_\theta(y|x)\) is called the energy, in reference to Statistical Physics, and where \(Z_{\theta, T}(x)\), called the partition function, is the normalisation integral \[ Z_{\theta, T}(x) = \int_{z\in Y}\exp\left(-\frac{E_\theta(z|x)}{T}\right)dz. \] Moreover, \(T\) is a hyperparameter called the temperature (again in reference to Statistical Physics) that may be modified at inference time to control the level of uncertainty of the model.
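As a quick illustration of the Monte Carlo approximation above, here is a minimal NumPy sketch; the Gaussian conditional model \(p_\theta(y|x)\) used here is a hypothetical stand-in, not something prescribed by the discussion.

```python
import numpy as np

def loss(log_p_theta, xs, ys):
    """Monte Carlo estimate of -E_p[log p_theta(y|x)]:
    the mean negative log-likelihood over the samples."""
    return -np.mean([log_p_theta(y, x) for x, y in zip(xs, ys)])

# Hypothetical conditional model p_theta(y|x) = N(theta * x, 1).
def make_log_p(theta):
    return lambda y, x: -0.5 * (y - theta * x) ** 2 - 0.5 * np.log(2 * np.pi)

rng = np.random.default_rng(0)
xs = rng.normal(size=10_000)
ys = 2.0 * xs + rng.normal(size=10_000)  # data generated with theta = 2

# The loss is smaller for the parameter that matches the data.
print(loss(make_log_p(2.0), xs, ys))     # approx 1.42
print(loss(make_log_p(0.0), xs, ys))     # noticeably larger
```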
With the energy-based parametrization above, the loss becomes \[ \mathrm{Loss}(\theta) = \frac{1}{NT} \sum_{i=1}^{N} E_\theta(y_i|x_i) + \frac{1}{N} \sum_{i=1}^{N} \log(Z_{\theta, T}(x_i)). \] In some cases, for instance regression, \(Z_{\theta, T}(x_i)\) does not depend on \(\theta\), and we are left with minimising the mean energy. Besides, the parameter \(\theta\) is usually optimised through gradient descent. For that purpose, one needs to compute the gradient of the loss with respect to the parameters \begin{align*} \frac{\partial \mathrm{Loss}(\theta)}{\partial \theta} &= \frac{1}{NT} \sum_{i=1}^{N} \frac{\partial E_\theta(y_i|x_i)}{\partial \theta} + \frac{1}{N} \sum_{i=1}^{N} \frac{1}{Z_{\theta, T}(x_i)} \frac{\partial Z_{\theta, T}(x_i)}{\partial \theta} \\ &= \frac{1}{N} \sum_{i=1}^{N} \left(\frac{1}{T} \frac{\partial E_\theta(y_i|x_i)}{\partial \theta} +\frac{1}{Z_{\theta, T}(x_i)} \frac{\partial Z_{\theta, T}(x_i)}{\partial \theta} \right). \end{align*} Note that the second term is itself an expectation under the model, \( \frac{1}{Z_{\theta, T}(x)}\frac{\partial Z_{\theta, T}(x)}{\partial \theta} = -\frac{1}{T}\int_{z\in Y} p_\theta(z|x)\frac{\partial E_\theta(z|x)}{\partial \theta}dz \), so a gradient descent step lowers the energy of the observed labels while raising the expected energy under the model.
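Since the loss is minimised by gradient descent throughout these notes, here is a minimal sketch of the update rule; the quadratic toy loss is only there to check that the loop works.

```python
import numpy as np

def gradient_descent(grad_loss, theta0, lr=0.1, n_steps=500):
    """Repeatedly step against the gradient: theta <- theta - lr * grad."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        theta = theta - lr * grad_loss(theta)
    return theta

# Toy check: Loss(theta) = (1/2)||theta - 3||^2 has gradient theta - 3,
# so gradient descent should converge to theta = 3.
print(gradient_descent(lambda t: t - 3.0, theta0=[0.0]))  # -> approx [3.]
```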
We can also adopt a Bayesian point of view. In that perspective, the parametrized conditional probability distribution \[ p_\theta(y|x), \quad x \in X,\ y \in Y,\ \theta \in \Theta \] represents the possible ways that we assume \(y\) may depend on \(x\). Our initial beliefs about such a dependence are represented by a probability distribution on the set of parameters \(\Theta\) called the prior distribution \(p_{\mathrm{prior}}(\theta)\). Then, the appearance of samples will make us revise these beliefs. Concretely, the combination of the parametrized conditional probability distribution \(p_\theta(y|x)\) and the prior distribution \(p_{\mathrm{prior}}(\theta)\) yields a probability distribution on \(\Theta\times Y^N\) \[ P(\theta, v_1, \ldots, v_N|u_1, \ldots, u_N) = p_{\mathrm{prior}}(\theta)\prod_{i=1}^{N}p_\theta(v_i|u_i), \quad \theta \in \Theta,\ (v_1, \ldots, v_N) \in Y^N,\ (u_1, \ldots, u_N) \in X^N \] conditional on the elements \((u_1, \ldots, u_N) \in X^N\). The observation of the actual samples \((x_1, y_1), \ldots, (x_N,y_N)\) then yields a posterior probability distribution on the parameters \begin{align*} p_{\mathrm{post}}(\theta) &= P(\theta|x_1, y_1, \ldots, x_N,y_N) \\ &=\frac{P(\theta, y_1, \ldots, y_N|x_1, \ldots, x_N)}{P(y_1, \ldots, y_N|x_1, \ldots, x_N)} \\ &= \frac{p_{\mathrm{prior}}(\theta)\prod_{i=1}^{N}p_\theta(y_i|x_i)}{\int_{\tau \in \Theta} p_{\mathrm{prior}}(\tau)\prod_{i=1}^{N}p_\tau(y_i|x_i)d\tau} \end{align*} which represents our revised beliefs about the dependence of \(y\) on \(x\).

When the samples arrive in a sequential fashion, this update of beliefs may also be performed sequentially. Imagine that we start again from our prior probability distribution and that we observe only the first sample. We then update our beliefs into \[ p_{\mathrm{post}}^{(1)}(\theta) = \frac{p_{\mathrm{prior}}(\theta)p_\theta(y_1|x_1)}{\int_{\tau \in \Theta} p_{\mathrm{prior}}(\tau)p_\tau(y_1|x_1)d\tau}. \] Next, we observe the second sample and update our beliefs again \begin{align*} p_{\mathrm{post}}^{(2)}(\theta) &= \frac{p_{\mathrm{post}}^{(1)}(\theta)p_\theta(y_2|x_2)}{\int_{\tau \in \Theta} p_{\mathrm{post}}^{(1)}(\tau)p_\tau(y_2|x_2)d\tau} \\ &= \frac{\frac{p_{\mathrm{prior}}(\theta)p_\theta(y_1|x_1)p_\theta(y_2|x_2)}{\int_{\tau \in \Theta} p_{\mathrm{prior}}(\tau)p_\tau(y_1|x_1)d\tau}}{\int_{\tau \in \Theta} \frac{p_{\mathrm{prior}}(\tau)p_\tau(y_1|x_1)}{\int_{\tau' \in \Theta} p_{\mathrm{prior}}(\tau')p_{\tau'}(y_1|x_1)d\tau'}p_\tau(y_2|x_2)d\tau} \\ &= \frac{p_{\mathrm{prior}}(\theta)p_\theta(y_1|x_1)p_\theta(y_2|x_2)}{\int_{\tau \in \Theta} p_{\mathrm{prior}}(\tau)p_\tau(y_1|x_1)p_\tau(y_2|x_2)d\tau}. \end{align*} We do the same when we observe the third, fourth, \ldots\ samples, and after \(N\) iterations, we recover the posterior probability distribution \[ p_{\mathrm{post}}(\theta) = p^{(N)}_{\mathrm{post}}(\theta) = \frac{p_{\mathrm{prior}}(\theta)p_\theta(y_1|x_1)\cdots p_\theta(y_N|x_N)}{\int_{\tau \in \Theta} p_{\mathrm{prior}}(\tau)p_\tau(y_1|x_1)\cdots p_\tau(y_N|x_N)d\tau}. \]

However, it is much less convenient to work with a probability distribution on the set of parameters \(\Theta\) than with a single parameter \(\theta\). Worse, the posterior distribution is often intractable or hard to sample from. To circumvent this difficulty, one can approximate the posterior distribution by the maximum a posteriori value, that is, the parameter \(\theta\) which maximises it: \[ \widehat{\theta}_{\mathrm{MAP}} = \mathrm{argmax}_\theta \left(p_{\mathrm{post}}(\theta)\right). \]
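Here is a minimal sketch of this sequential updating on a discrete grid of parameters, assuming a hypothetical Gaussian conditional model \(p_\theta(y|x) = \mathcal N(\theta x, 1)\); as the algebra above shows, the sequential posterior coincides with the batch posterior.

```python
import numpy as np

rng = np.random.default_rng(0)
thetas = np.linspace(-5.0, 5.0, 201)       # discrete grid over Theta
xs = rng.normal(size=20)
ys = 1.5 * xs + rng.normal(size=20)        # data generated with theta = 1.5

def likelihood(theta, x, y):
    """Hypothetical model p_theta(y|x) = N(theta * x, 1)."""
    return np.exp(-0.5 * (y - theta * x) ** 2) / np.sqrt(2.0 * np.pi)

# Sequential updates, starting from a uniform prior on the grid.
posterior = np.full_like(thetas, 1.0 / len(thetas))
for x, y in zip(xs, ys):
    posterior *= likelihood(thetas, x, y)  # multiply by the new likelihood
    posterior /= posterior.sum()           # renormalise (the integral over tau)

# Batch posterior computed in one shot: identical up to rounding.
batch = np.full_like(thetas, 1.0 / len(thetas))
for x, y in zip(xs, ys):
    batch *= likelihood(thetas, x, y)
batch /= batch.sum()

print(np.allclose(posterior, batch))       # True
print(thetas[np.argmax(posterior)])        # MAP estimate, close to 1.5
```

The last line is the grid version of \(\widehat{\theta}_{\mathrm{MAP}}\).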
Equivalently, the maximum a posteriori value minimises the normalised regularised loss \begin{align*} \mathrm{Loss}_{\mathrm{reg}}(\theta) &= -\frac{1}{N}\log\left(p_{\mathrm{prior}}(\theta)p_\theta(y_1|x_1)\cdots p_\theta(y_N|x_N)\right) \\ &= \frac{1}{N}\log\left(\frac{1}{p_{\mathrm{prior}}(\theta)}\right) + \frac{1}{N}\sum_{i=1}^{N} \log\left(\frac{1}{p_{\theta}(y_i|x_i)}\right) \\ &= \mathrm{Reg}_{\theta, N} + \mathrm{Loss}(\theta) \end{align*} which is the sum of the loss described in the previous paragraph and a regularisation term \[ \mathrm{Reg}_{\theta, N} = \frac{1}{N}\log\left(\frac{1}{p_{\mathrm{prior}}(\theta)}\right) \] that prevents the maximum a posteriori value from becoming overly improbable with respect to the prior distribution.
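To make the regularisation term concrete, suppose for instance that the prior is a centred Gaussian \(\mathcal N(0, \sigma^2 I)\) on \(\Theta = \mathbb R^d\). Then \[ \mathrm{Reg}_{\theta, N} = \frac{1}{N}\log\left(\frac{1}{p_{\mathrm{prior}}(\theta)}\right) = \frac{\|\theta\|^2}{2\sigma^2 N} + \mathrm{cst}, \] which is the familiar \(L^2\) regularisation (weight decay), whose strength fades as the number of samples \(N\) grows.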
Regression is a setting of supervised learning where \(Y\) is the set of real numbers \(\mathbb R\) and the parametrized conditional probability distribution on \(Y=\mathbb R\) \[ p_\theta(y|x) = \frac{1}{\sqrt{2\pi}}\exp\left(\frac{-(y-f_\theta(x))^2}{2}\right) \] is the Gaussian distribution \( \mathcal N (f_\theta(x), 1)\) with variance \(1\) and mean given by a parametrized function \[ f_\theta(x)\in Y,\quad \theta \in \Theta,\ x \in X. \] In Statistical Physics terms, the energy of the model is \[ E_\theta(y|x) = \frac{1}{2}(y-f_\theta(x))^2 \] with temperature \(1\). In that context, the loss function and its gradient with respect to \(\theta\) are respectively \begin{align*} \mathrm{Loss}(\theta) &= \frac{1}{2N} \sum_{i=1}^N (y_i - f_\theta(x_i))^2 + \mathrm{cst} \\ \frac{\partial \mathrm{Loss}(\theta)}{\partial \theta} &= \frac{1}{N} \sum_{i=1}^N \frac{\partial f_\theta(x_i)}{\partial \theta} (f_\theta(x_i)-y_i). \end{align*} We thus recover the familiar mean squared error loss.
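As an illustration, here is a minimal NumPy sketch of gradient descent on this loss for a hypothetical linear model \(f_\theta(x) = \theta \cdot x\) on synthetic data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for a hypothetical linear model f_theta(x) = theta . x.
X = rng.normal(size=(200, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + rng.normal(size=200)  # Gaussian noise with variance 1

theta = np.zeros(3)
lr = 0.1
for _ in range(1000):
    residual = X @ theta - y               # f_theta(x_i) - y_i
    grad = X.T @ residual / len(y)         # (1/N) sum_i (df/dtheta)(f - y)
    theta -= lr * grad                     # gradient descent step

print(theta)                               # close to true_theta
```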
Classification is another setting of supervised learning, where \(Y\) is a finite set \(\{0, \ldots, K-1\}\). In Statistical Physics terms, the parametrized conditional probability distribution \(p_\theta(y|x)\) on \(Y\) is built from energy functions \[ E_{\theta, k}(x)\in \mathbb{R},\quad \theta \in \Theta,\ x \in X, \ k \in \{0, \ldots, K-1\} \] via \[ p_\theta(y|x) = \frac{\exp\left(-E_{\theta, y}(x)\right)}{ \sum_{k=0}^{K-1} \exp\left(-E_{\theta, k}(x)\right)}. \] The associated loss function and its gradient with respect to \(\theta\) have respectively the form \begin{align*} \mathrm{Loss}(\theta) &= \frac{1}{N}\sum_{i=1}^{N} \left(E_{\theta, y_i}(x_i) + \log \left(\sum_{k=0}^{K-1} \exp\left(-E_{\theta, k}(x_i)\right)\right)\right) \\ \frac{\partial \mathrm{Loss}(\theta)}{\partial \theta} &= \frac{1}{N}\sum_{i=1}^{N} \left(\frac{\partial E_{\theta, y_i}(x_i)}{\partial \theta} - \sum_{k=0}^{K-1} \frac{\partial E_{\theta, k}(x_i)}{\partial \theta} p_\theta(k|x_i)\right) \\ &=\frac{1}{N}\sum_{i=1}^{N}\sum_{k=0}^{K-1} \frac{\partial E_{\theta, k}(x_i)}{\partial \theta} \left(\delta_{y_i, k} - p_\theta(k|x_i) \right) \end{align*} where the Kronecker symbol \(\delta_{y_i, k}\) is \(1\) if \(y_i = k\) and \(0\) otherwise. This is the usual cross-entropy loss, with logits \(-E_{\theta, k}(x)\). If one adds to the energy a function \(f(x)\) that does not depend on \(y\), \[ E_{\theta, y}(x) \leftarrow E_{\theta, y}(x) + f(x), \] the model probabilities \(p_\theta(y|x)\) remain unchanged. In that sense, functions of \(x\) form a gauge group of our model. In many problems, there is a special class that represents the base case: for instance, a negative result in a medical screening test, or a legitimate transaction in fraud detection. Let us say that it is the class numbered \(0\). In that context, one can perform a gauge fixing by subtracting from our energy functions the energy relative to this class, \(E_{\theta, 0}(x)\): \[ E_{\theta, y}(x) \leftarrow E_{\theta,y}(x) - E_{\theta, 0}(x). \] In other words, we force the energy relative to the class \(0\) to be zero. In binary classification, this gauge fixing yields the probabilities \begin{align*} p_\theta(1|x) &= \sigma(-E_{\theta, 1}(x)) \\ p_\theta(0|x) &= 1 - p_\theta(1|x) = 1- \sigma(-E_{\theta, 1}(x)) = \sigma (E_{\theta, 1}(x)) \end{align*} where \(\sigma\) is the sigmoid function \[ \sigma(x) = \frac{e^x}{1 + e^x}. \]
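The \(\delta_{y_i, k} - p_\theta(k|x_i)\) structure of the gradient translates directly into code. Here is a minimal NumPy sketch that takes a matrix of energies \(E_{\theta, k}(x_i)\) as input; by the chain rule, contracting the returned gradient with \(\partial E_{\theta, k}(x_i)/\partial \theta\) gives the gradient with respect to \(\theta\).

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def loss_and_grad(E, labels):
    """Cross-entropy from an (N, K) matrix of energies E[i, k] = E_{theta,k}(x_i).
    The gradient w.r.t. the energies is (delta_{y_i,k} - p_theta(k|x_i)) / N."""
    N, K = E.shape
    p = softmax(-E)                                   # p_theta(k|x_i); logits are -E
    loss = -np.mean(np.log(p[np.arange(N), labels]))  # (1/N) sum_i -log p(y_i|x_i)
    grad_E = (np.eye(K)[labels] - p) / N              # dLoss/dE[i, k]
    return loss, grad_E

# Example: 4 samples, 3 classes; uniform energies give loss = log(3).
E = np.zeros((4, 3))
labels = np.array([0, 1, 2, 0])
print(loss_and_grad(E, labels))
```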
In the context of classification, the Kullback-Leibler divergence between the empirical distribution \(p(x, k)\) on the feature-label space \(X\times Y\) and the mixed probability distribution \(p_\theta(x, k) = p(x)p_\theta(k|x)\) may be rewritten as the weighted sum \begin{align*} KL(p(x, k)||p_\theta(x, k)) &= \sum_{k=0}^{K-1}\int_{x\in X} p(x, k)\log \left(\frac{p(x,k)}{p_\theta(x, k)}\right) \\ &= \sum_{k=0}^{K-1}p(k)\int_{x\in X}p(x| k)\log \left(\frac{p(x,k)}{p_\theta(x, k)}\right) \\ &= \sum_{k=0}^{K-1}p(k)KL_k \end{align*} where \(KL_k\) is the class-specific divergence \[ KL_k = \int_{x\in X}p(x| k)\log \left(\frac{p(x,k)}{p_\theta(x, k)}\right) = -\int_{x\in X} p(x| k)\log \left(p_\theta(k| x)\right) + \mathrm{cst}. \] In many cases, classes are equally important even if they are highly imbalanced in the dataset. For instance, in medical screening, there are usually many more negative results than positive results, but the two classes are equally important. In that context, minimising \(KL_k\) is as important as minimising \(KL_{k'}\) for two classes \(k \neq k'\), contrary to the behaviour of the Kullback-Leibler divergence, whose minimisation disregards the behaviour of the model on small classes, since they carry small weights \(p(k)\). For instance, in the medical screening problem, the classifier that always predicts a negative result is a good model through the lens of the Kullback-Leibler divergence. We should thus change our notion of distance between probability distributions. We can, for instance, use the sum of the class-specific divergences with equal weights \[ \frac{1}{K}\sum_{k=0}^{K-1} KL_k, \] which, up to a constant, may be approximated by the balanced loss \[ \mathrm{Loss}_{\mathrm{balanced}}(\theta) = \frac{1}{K}\sum_{k=0}^{K-1}\frac{1}{N_k}\sum_{i,\, y_i=k} - \log \left(p_\theta(y_i| x_i)\right) \] where \(N_k\) denotes the number of samples whose label is \(k\). Another way to handle class imbalance is to modify the dataset, for instance by oversampling the small classes and/or undersampling the large ones.
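Here is a minimal sketch of the balanced loss, assuming we already have the per-sample log-probabilities \(\log p_\theta(y_i|x_i)\); each class contributes with weight \(1/K\) regardless of its size.

```python
import numpy as np

def balanced_loss(log_p, labels, K):
    """Balanced negative log-likelihood:
    (1/K) * sum_k (1/N_k) * sum_{i: y_i = k} -log p_theta(y_i|x_i)."""
    total = 0.0
    for k in range(K):
        mask = labels == k
        if mask.any():                        # skip classes absent from the batch
            total += -log_p[mask].mean() / K  # mean over the N_k samples of class k
    return total
```

Equivalently, one can reweight each sample by \(1/(K N_{y_i})\) in the ordinary loss.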