Machine learning as Statistics on steroids 3

Statistical encoding

This is the third note in our series on the close relationship between machine learning and statistics. We deal with a framework of unsupervised learning that we call statistical encoding and that encompasses autoencoders, variational autoencoders and the EM algorithm.

The framework of statistical encoding.

Let us consider a dataset that consists of elements \(x_1, \ldots, x_N\) in a space \(X\). We assume that such elements are independent samples of a probability distribution \(p\) on \(X\) that we call the empirical probability distribution. The presentation of our dataset within the space \(X\) may be cumbersome, and one would like to get a better description. For instance, one can imagine that our samples lie on a submanifold of \(X\) that deploys itself in a quirky way within \(X\); in that context, describing the samples directly in the coordinates of the submanifold may be much more informative than the presentation provided by \(X\). In this line of thought, our goal is to find a latent space \(Z\) such that our samples in \(X\) are produced by samples drawn from a simple probability distribution on \(Z\) that are then mapped to \(X\). One also wants, given an actual sample \(x \in X\), to recover approximately the latent point in \(Z\) that would have produced \(x\). For that purpose, in addition to the space \(Z\) we introduce a parametrized probability distribution on the latent space \(Z\) \[ p_\lambda (z),\quad \lambda \in \Lambda, z \in Z, \] and two parametrized conditional probability distributions: a decoder \(p_\theta(x|z)\), \(\theta \in \Theta\), which produces samples in \(X\) from latent points in \(Z\), and an encoder \(p_\phi(z|x)\), \(\phi \in \Phi\), which associates to a sample in \(X\) a distribution over latent points in \(Z\).

In that context, given three parameters \(\theta \in \Theta\), \(\phi \in \Phi\) and \(\lambda \in \Lambda\), one has two ways to produce samples in \(X\times Z\): one can draw \(x\) from the empirical distribution \(p(x)\) and then \(z\) from the encoder \(p_\phi(z|x)\), which yields the joint distribution \(p_\phi(x,z) = p(x)p_\phi(z|x)\); or one can draw \(z\) from the latent distribution \(p_\lambda(z)\) and then \(x\) from the decoder \(p_\theta(x|z)\), which yields the joint distribution \(p_{\theta, \lambda}(x,z) = p_\theta(x|z)p_\lambda(z)\). Our goal is then to find three parameters that minimise the distance between these two probability distributions on \(X \times Z\). The Kullback-Leibler divergence between \(p_\phi(x, z)\) and \(p_{\theta, \lambda}(x, z)\) is \begin{align*} &KL(p_{\phi}(x, z)|| p_{\theta, \lambda}(x, z)) \\ &= \int_{x\in X, z \in Z}p_{\phi}(x,z)\log\left(\frac{p_{\phi}(x,z)}{p_{\theta, \lambda}(x,z)}\right) dxdz \\ &= \int_{x\in X, z \in Z}p(x)p_{\phi}(z|x) \log\left(\frac{p_\phi(z|x)}{p_\theta(x|z)p_\lambda(z)}\right) dxdz - H(p(x)). \end{align*} Using the law of large numbers (the samples are assumed to be independent), the quantity \[ KL(p_{\phi}(x, z)|| p_{\theta, \lambda}(x, z)) + H(p(x)) = \int_{x \in X}p(x) \int_{z \in Z}p_{\phi}(z|x)\log\left(\frac{p_\phi(z|x)}{p_\theta(x|z)p_\lambda(z)}\right)dzdx \] may be approximated by the loss function \[ \mathrm{Loss}(\theta, \lambda, \phi) = \frac{1}{N} \sum_{i=1}^{N} \int_{z \in Z} p_\phi(z|x_i) \log\left(\frac{p_\phi(z|x_i)}{p_\theta(x_i|z)p_\lambda(z)}\right)dz. \] Since the entropy \(H(p(x))\) does not depend on the parameters, minimising this loss amounts to minimising the divergence. The loss seems hard to handle, as it involves computing as many integrals as there are samples; however, it may be simplified or further approximated in some situations.
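To fix ideas, here is a minimal sketch (ours, not part of the original note) of how this loss can be estimated by Monte Carlo when the encoder, decoder and latent distribution are all Gaussian; the maps enc_mean, enc_var and dec_mean are hypothetical placeholders for whatever parametrized functions one chooses.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_gaussian(u, mean, var):
    """Log-density of a diagonal Gaussian, summed over the components."""
    return -0.5 * np.sum((u - mean) ** 2 / var + np.log(2 * np.pi * var), axis=-1)

def loss_estimate(x, enc_mean, enc_var, dec_mean, dec_var=1.0, n_mc=32):
    """Monte Carlo estimate of Loss(theta, lambda, phi) for a dataset x of shape (N, F).

    Assumed model choices (not prescribed by the note):
      encoder  p_phi(z|x)   = N(enc_mean(x), diag(enc_var(x)))
      decoder  p_theta(x|z) = N(dec_mean(z), dec_var * Id)
      latent   p_lambda(z)  = N(0, Id)
    """
    total = 0.0
    for xi in x:
        m, v = enc_mean(xi), enc_var(xi)                    # encoder parameters at x_i
        z = m + np.sqrt(v) * rng.standard_normal((n_mc, m.shape[0]))  # z ~ p_phi(z|x_i)
        log_enc = log_gaussian(z, m, v)                     # log p_phi(z|x_i)
        log_dec = log_gaussian(xi, dec_mean(z), dec_var)    # log p_theta(x_i|z)
        log_lat = log_gaussian(z, 0.0, 1.0)                 # log p_lambda(z)
        total += np.mean(log_enc - log_dec - log_lat)       # inner integral over z
    return total / x.shape[0]
```

The inner integral over \(z\) is replaced by an average over n_mc samples drawn from the encoder; the autoencoder and variational autoencoder sections below correspond to specific choices of these maps.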

Several perspectives on the loss function.

The likelihood and the Bayesian encoder.

The purely parametrized probability distribution \(p_{\theta, \lambda}(x,z)\) on \(X \times Z\) yields a marginal probability distribution on \(X\), \[ p_{\theta, \lambda}(x) = \int_{z \in Z} p_\theta(x|z)p_\lambda(z)dz, \] and a conditional probability distribution on \(Z\) given \(x\), which we call the Bayes encoder, \[ p_{\theta, \lambda}(z|x) = \frac{p_\theta(x|z)p_\lambda(z)}{p_{\theta, \lambda}(x)}. \]

The Kullback-Leibler divergence between \(p_{\phi}(x, z)\) and \(p_{\theta, \lambda}(x, z)\) that we try to minimise may be decomposed as the sum of the divergence from the empirical distribution \(p(x)\) to the parametrized marginal distribution \(p_{\theta, \lambda}(x)\) and the average divergence between the encoder and the Bayes encoder: \begin{align*} KL(p_{\phi}(x, z)|| p_{\theta, \lambda}(x, z)) =& \int_{x \in X}p(x) \int_{z \in Z}p_{\phi}(z|x)\log\left(\frac{p(x)}{p_{\theta, \lambda}(x)}\right)dzdx \\ &+ \int_{x \in X}p(x) \int_{z \in Z}p_{\phi}(z|x)\log\left(\frac{p_\phi(z|x)}{p_{\theta, \lambda}(z|x)}\right)dzdx \\ =& KL\left(p(x)|| p_{\theta, \lambda}(x)\right) + \int_{x \in X}p(x)\, KL \left(p_\phi(z|x)|| p_{\theta, \lambda}(z|x)\right)dx. \end{align*} In the same line of thought, we can decompose the loss as the sum \[ \mathrm{Loss}(\theta, \lambda, \phi) = - \frac{1}{N}\log\mathcal{L}_{\theta, \lambda}^X + \frac{1}{N}\sum_{i=1}^{N} KL(p_{\phi}(z|x_i)||p_{\theta, \lambda}(z|x_i)) \] of the rescaled opposite of the log-likelihood \[ \log\mathcal{L}_{\theta, \lambda}^X = \sum_{i=1}^{N}\log p_{\theta, \lambda}(x_i) \] and the average over the \(x_i\) of the Kullback-Leibler divergences with respect to \(z\) between the encoder \(p_\phi(z|x_i)\) and the Bayes encoder \(p_{\theta, \lambda}(z|x_i)\). Thus, minimising the loss may be understood as pursuing two objectives: maximising the likelihood of the samples under the parametrized distribution \(p_{\theta, \lambda}(x)\), and bringing the encoder \(p_\phi(z|x_i)\) as close as possible to the Bayes encoder \(p_{\theta, \lambda}(z|x_i)\).
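To see where this decomposition of the loss comes from, one can use the factorisation \(p_\theta(x_i|z)p_\lambda(z) = p_{\theta, \lambda}(x_i, z) = p_{\theta, \lambda}(x_i)\, p_{\theta, \lambda}(z|x_i)\) inside the logarithm: \[ \log\left(\frac{p_\phi(z|x_i)}{p_\theta(x_i|z)p_\lambda(z)}\right) = -\log p_{\theta, \lambda}(x_i) + \log\left(\frac{p_\phi(z|x_i)}{p_{\theta, \lambda}(z|x_i)}\right); \] integrating against \(p_\phi(z|x_i)\) and averaging over the samples then yields the two terms above.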

Remark

The opposite of the loss is always bounded above by the rescaled log-likelihood \[ -\mathrm{Loss}(\theta, \lambda, \phi) = \frac{1}{N}\log\mathcal{L}_{\theta, \lambda}^X - \frac{1}{N}\sum_{i=1}^{N} KL(p_{\phi}(z|x_i)||p_{\theta, \lambda}(z|x_i)) \leq \frac{1}{N}\log\mathcal{L}_{\theta, \lambda}^X \] as the Kullback-Leibler divergence is non-negative. For that reason, this opposite of the loss is often called the evidence lower bound (ELBO).

The encoder-decoder perspective.

One can decompose the Kullback-Leibler divergence between \(p_{\phi}(x, z)\) and \(p_{\theta, \lambda}(x, z)\) that we try to minimise as \begin{align*} KL(p_{\phi}(x, z)|| p_{\theta, \lambda}(x, z)) =& - \int_{x \in X}p(x) \int_{z \in Z}p_{\phi}(z|x)\log\left(p_{\theta}(x|z)\right)dzdx \\ & + \int_{x \in X}p(x) KL(p_{\phi}(z|x) || p_\lambda(z))dx \\ &- H(p(x)). \end{align*} Minimising the first term tends, for every \(x\in X\), to make the following sampling procedure yield a \(y\in X\) that is close to \(x\): first draw \(z \in Z\) from the encoder \(p_\phi(z|x)\), then draw \(y \in X\) from the decoder \(p_\theta(y|z)\).
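This three-term decomposition is obtained by splitting the logarithm inside the divergence, \[ \log\left(\frac{p(x)\,p_\phi(z|x)}{p_\theta(x|z)\,p_\lambda(z)}\right) = -\log p_\theta(x|z) + \log\left(\frac{p_\phi(z|x)}{p_\lambda(z)}\right) + \log p(x), \] and integrating each term against \(p(x)p_\phi(z|x)\).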

For instance, if \(p_\theta(x|z)\) is a Gaussian distribution with a fixed isotropic covariance matrix and whose mean is a parametrized function \(g_\theta(z)\), this first term becomes \[ - \int_{x \in X}p(x) \int_{z \in Z}p_{\phi}(z|x)\log\left(p_{\theta}(x|z)\right)dzdx =\kappa \int_{x \in X}p(x) \int_{z \in Z}p_{\phi}(z|x)\lVert x - g_\theta(z) \rVert^2 dzdx + \kappa' \] where \(\kappa, \kappa'\) are constants. Besides, minimising the second term of the Kullback-Leibler divergence tends to bring the encoder \(p_\phi(z|x)\) closer to the distribution \(p_\lambda(z)\) on the latent space. The last term of the divergence does not depend on the parameters.
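The constants can be made explicit: if the covariance matrix of the decoder is \(\sigma^2\,\mathrm{Id}\), then \[ -\log p_\theta(x|z) = \frac{1}{2\sigma^2}\lVert x - g_\theta(z)\rVert^2 + \frac{F}{2}\log(2\pi\sigma^2), \] where \(F\) is the dimension of \(X\), so that \(\kappa = \frac{1}{2\sigma^2}\) and \(\kappa' = \frac{F}{2}\log(2\pi\sigma^2)\).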

The EM-algorithm.

The Expectation-Maximisation (EM) algorithm (see [1]) consists in minimising the loss alternately with respect to \(\phi\) and with respect to \(\theta\) and \(\lambda\). Precisely, it consists in the following two steps, performed alternately. In the E-step, \(\theta\) and \(\lambda\) are kept fixed and the loss is minimised with respect to \(\phi\); by the decomposition of the previous section, this amounts to bringing the encoder \(p_\phi(z|x_i)\) as close as possible to the Bayes encoder \(p_{\theta, \lambda}(z|x_i)\) for every sample. In the M-step, \(\phi\) is kept fixed and the loss is minimised with respect to \(\theta\) and \(\lambda\), that is, one maximises \[ \frac{1}{N}\sum_{i=1}^{N} \int_{z \in Z} p_\phi(z|x_i) \log\left(p_\theta(x_i|z)p_\lambda(z)\right)dz. \]

Autoencoders.

The autoencoder architecture (see [2]) is a particular incarnation of the statistical encoding framework where \(X\) and \(Z\) are Euclidean spaces: \(X= \mathbb R^F\) and \(Z = \mathbb R^J\). Moreover, the encoder is deterministic: \(p_\phi(z|x)\) is the Dirac mass at \(f_\phi(x)\), where \(f_\phi : X \to Z\) is a parametrized function (for instance a neural network); the decoder \(p_\theta(x|z)\) is a Gaussian distribution with identity covariance matrix and mean \(g_\theta(z)\), where \(g_\theta : Z \to X\) is another parametrized function; and the latent distribution \(p_\lambda(z)\) is a centred Gaussian distribution with covariance matrix \(S\,\mathrm{Id}\), where \(S\) is a fixed, very large number (so there is no parameter \(\lambda\) to learn).

In that context, the Kullback-Leibler divergence between \(p_{\phi}(x, z)\) and \(p_{\theta, \lambda}(x, z)\) is \begin{align*} KL(p_{\phi}(x, z)|| p_{\theta, \lambda}(x, z)) =& \int_{x \in X}p(x) \log\left(\frac{p(x)}{p_{\theta}(x|f_\phi(x)) p_{\lambda}(f_\phi(x))}\right)dx \\ =& \frac{1}{2}\int_{x \in X} p(x)\left( \lVert x- g_\theta f_\phi(x)\rVert^2 + \frac{1}{S}\lVert f_\phi(x)\rVert^2\right)dx + \mathrm{cst} \end{align*} and the loss function (which does not depend on \(\lambda\)) becomes \[ \mathrm{Loss}(\theta, \phi) = \frac{1}{2N}\sum_{i=1}^{N}\lVert x_i- g_\theta f_\phi(x_i)\rVert^2 + \frac{1}{2NS} \sum_{i=1}^{N} \lVert f_\phi(x_i)\rVert^2 + \mathrm{cst}. \] As \(S\) is very large, the second term is negligible, and we thus actually try to minimise the autoencoder loss \[ \widetilde{\mathrm{Loss}}(\theta, \phi)= \frac{1}{2N}\sum_{i=1}^{N}\lVert x_i- g_\theta f_\phi(x_i)\rVert^2. \] A modification of the classical autoencoder is given by the denoising autoencoder ([3]), where the encoder consists in applying some random noise to the input \(x\) and then applying the function \(f_\phi\). The encoder conditional probability distribution is then the probability distribution associated to this sampling procedure.
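As an illustration, here is a minimal PyTorch sketch of an autoencoder trained on \(\widetilde{\mathrm{Loss}}\); it is our own toy example, and the layer sizes, activation and optimiser settings are arbitrary choices, not prescriptions from [2].

```python
import torch
from torch import nn

F_DIM, J_DIM = 784, 32  # dimensions of X = R^F and Z = R^J (arbitrary choices)

# f_phi : X -> Z (encoder) and g_theta : Z -> X (decoder)
f_phi = nn.Sequential(nn.Linear(F_DIM, 128), nn.ReLU(), nn.Linear(128, J_DIM))
g_theta = nn.Sequential(nn.Linear(J_DIM, 128), nn.ReLU(), nn.Linear(128, F_DIM))

optimiser = torch.optim.Adam(list(f_phi.parameters()) + list(g_theta.parameters()), lr=1e-3)

def train_step(x_batch):
    """One gradient step on (1/2N) sum_i ||x_i - g_theta(f_phi(x_i))||^2 for a batch of shape (N, F)."""
    optimiser.zero_grad()
    reconstruction = g_theta(f_phi(x_batch))
    loss = 0.5 * ((x_batch - reconstruction) ** 2).sum(dim=1).mean()
    loss.backward()
    optimiser.step()
    return loss.item()

# usage sketch: train_step(torch.randn(64, F_DIM))
```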

Variational autoencoders.

Like autoencoders, variational autoencoders (see [4]) assume that \(X\) and \(Z\) are Euclidean spaces \(X= \mathbb R^F\) and \(Z = \mathbb R^J\). Besides, the latent distribution \(p_\lambda(z)\) is the standard Gaussian distribution on \(Z\) (so that, again, there is no parameter \(\lambda\) to learn); the encoder \(p_\phi(z|x)\) is a Gaussian distribution with diagonal covariance matrix, whose mean \(\mu_\phi(x) \in Z\) and componentwise standard deviations \(\sigma_\phi(x)\) are parametrized functions of \(x\) (typically neural networks); and the decoder \(p_\theta(x|z)\) is a parametrized conditional distribution whose density can be evaluated, for instance a Gaussian distribution whose mean is a parametrized function \(g_\theta(z)\).

In that context, for every \(x \in X\), the Kullback-Leibler divergence with respect to \(z\) between \(p_{\phi}(z|x)\) and \(p_\lambda(z)\) is given by the formula \[ KL\left(p_{\phi}(z|x)||p_{\lambda}(z)\right) = \frac{1}{2} \sum_{j=1}^{J} \left( \mu_\phi^{(j)}(x)^2 + \sigma_\phi^{(j)}(x)^2 - 1 - \log\left(\sigma_\phi^{(j)}(x)^2\right) \right). \] Besides, for every \(x \in X\), the integral \(\int_{z \in Z}p_{\phi}(z|x)\log\left(p_{\theta}(x|z)\right)dz\) may be approximated by a Monte Carlo technique. Indeed, if a random variable \(\epsilon \in Z\) follows a standard Gaussian law, then the random variable \[ \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon \in Z, \] whose \(j\)th component is \[ \left(\mu_\phi(x) + \sigma_\phi(x) \odot \epsilon\right)^{(j)} = \mu_\phi^{(j)}(x) + \sigma_\phi^{(j)}(x) \epsilon^{(j)}, \] follows the law \(p_\phi(z|x)\). Then, one can approximate the integral by computing the sum \[ \frac{1}{L}\sum_{l=1}^{L} \log\left(p_{\theta}(x|\mu_\phi(x) + \sigma_\phi(x) \odot \epsilon_{x,l})\right) \] where \(\epsilon_{x,1}, \ldots, \epsilon_{x,L}\) are independent samples of the standard Gaussian distribution on \(Z\). Thus, in the encoder-decoder perspective, the loss function is given by the sum \begin{align*} \mathrm{Loss}(\theta, \phi) =& - \frac{1}{NL}\sum_{i=1}^{N}\sum_{l=1}^{L} \log\left(p_{\theta}(x_i|\mu_\phi(x_i) + \sigma_\phi(x_i) \odot \epsilon_{i,l})\right) \\ & + \frac{1}{2N} \sum_{i=1}^{N}\sum_{j=1}^{J} \left( \mu_\phi^{(j)}(x_i)^2 + \sigma_\phi^{(j)}(x_i)^2 - 1 - \log\left(\sigma_\phi^{(j)}(x_i)^2\right) \right). \end{align*}
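Concretely, this leads to a training loss like the following PyTorch sketch (ours, with a unit-variance Gaussian decoder, so that \(-\log p_\theta(x|z)\) reduces to \(\frac{1}{2}\lVert x - g_\theta(z)\rVert^2\) up to a constant; the network sizes are arbitrary).

```python
import torch
from torch import nn

F_DIM, J_DIM = 784, 32  # X = R^F and Z = R^J (arbitrary sizes)

# encoder outputs (mu_phi(x), log sigma_phi(x)^2); decoder outputs g_theta(z)
encoder = nn.Sequential(nn.Linear(F_DIM, 128), nn.ReLU(), nn.Linear(128, 2 * J_DIM))
decoder = nn.Sequential(nn.Linear(J_DIM, 128), nn.ReLU(), nn.Linear(128, F_DIM))

def vae_loss(x, L=1):
    """Monte Carlo estimate of the loss for a batch x of shape (N, F)."""
    mu, log_var = encoder(x).chunk(2, dim=1)     # mu_phi(x) and log sigma_phi(x)^2
    sigma = torch.exp(0.5 * log_var)
    reconstruction = torch.zeros(x.shape[0])
    for _ in range(L):
        eps = torch.randn_like(mu)               # epsilon ~ N(0, Id) on Z
        z = mu + sigma * eps                     # reparameterisation: z ~ p_phi(z|x)
        reconstruction = reconstruction + 0.5 * ((x - decoder(z)) ** 2).sum(dim=1)
    reconstruction = reconstruction / L          # -(1/L) sum_l log p_theta(x|z_l), up to a constant
    kl = 0.5 * (mu ** 2 + sigma ** 2 - 1.0 - log_var).sum(dim=1)  # KL(p_phi(z|x) || N(0, Id))
    return (reconstruction + kl).mean()

# usage sketch: vae_loss(torch.randn(64, F_DIM)).backward()
```

Gradients flow through \(\mu_\phi\) and \(\sigma_\phi\) because the randomness enters only through \(\epsilon\); this is the point of the reparameterisation.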

The Gaussian mixture model.

The Gaussian mixture model is a soft clustering method that leverages the framework described above. We assume that \(X\) is a Euclidean space \(X= \mathbb{R}^F\), and \(Z= \{0, \ldots, K-1\}\) is a finite set that represents the clusters. Besides, the latent distribution is given by mixture weights, \(p_\lambda(z) = \pi_z\); the decoder \(p_\theta(x|z)\) is, for each cluster \(z\), a Gaussian distribution on \(X\) with mean \(\mu_z\) and covariance matrix \(S_z\); and the encoder has the form of the Bayes encoder of such a model, namely \[ p_\phi(z|x) = \frac{\pi_z^{(e)}\, \mathcal{N}(x; \mu_z^{(e)}, S_z^{(e)})}{\sum_{z'=0}^{K-1} \pi_{z'}^{(e)}\, \mathcal{N}(x; \mu_{z'}^{(e)}, S_{z'}^{(e)})}, \] where \(\mathcal{N}(x; \mu, S)\) denotes the density at \(x\) of the Gaussian distribution with mean \(\mu\) and covariance matrix \(S\).
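With these choices, the marginal distribution on \(X\) of the purely parametrized model and the corresponding Bayes encoder are explicit: \[ p_{\theta, \lambda}(x) = \sum_{z=0}^{K-1} \pi_z\, \mathcal{N}(x; \mu_z, S_z), \qquad p_{\theta, \lambda}(z|x) = \frac{\pi_z\, \mathcal{N}(x; \mu_z, S_z)}{p_{\theta, \lambda}(x)}. \]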

Thus the whole parameter space is made up of the tuples \[ \left(\pi_0, \mu_0, S_0,\pi_0^{(e)}, \mu_0^{(e)}, S_0^{(e)}, \ldots, \pi_{K-1}, \mu_{K-1}, S_{K-1},\pi_{K-1}^{(e)}, \mu_{K-1}^{(e)}, S_{K-1}^{(e)} \right) \] where, for \(z \in Z\), \(\pi_z, \pi_z^{(e)}\) are non-negative numbers, \(\mu_z, \mu_z^{(e)}\) are vectors in \(X\), \(S_z, S_z^{(e)}\) are positive definite symmetric \(F\times F\) matrices and \[ \sum_{z=0}^{K-1} \pi_z = 1 = \sum_{z=0}^{K-1} \pi_z^{(e)}. \] The elements that have an \((e)\) superscript are the parameters of the encoder. In this context, one can perform the EM algorithm as follows. In the E-step, the encoder parameters are set equal to the current decoder and latent parameters, \(\pi_z^{(e)} = \pi_z\), \(\mu_z^{(e)} = \mu_z\), \(S_z^{(e)} = S_z\), so that the encoder coincides with the Bayes encoder; the resulting values \[ r_{i,z} = p_\phi(z|x_i) = \frac{\pi_z\, \mathcal{N}(x_i; \mu_z, S_z)}{\sum_{z'=0}^{K-1} \pi_{z'}\, \mathcal{N}(x_i; \mu_{z'}, S_{z'})} \] are called the responsibilities. In the M-step, the encoder is kept fixed and the loss is minimised with respect to the decoder and latent parameters, which gives the classical updates \[ \pi_z = \frac{1}{N}\sum_{i=1}^{N} r_{i,z}, \qquad \mu_z = \frac{\sum_{i=1}^{N} r_{i,z}\, x_i}{\sum_{i=1}^{N} r_{i,z}}, \qquad S_z = \frac{\sum_{i=1}^{N} r_{i,z}\,(x_i - \mu_z)(x_i - \mu_z)^\top}{\sum_{i=1}^{N} r_{i,z}}. \]
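The following NumPy sketch (our own, with an ad hoc \(10^{-6}\,\mathrm{Id}\) regularisation of the covariances and a random initialisation of the means) implements these two steps.

```python
import numpy as np

def em_gmm(x, K, n_iter=100, seed=0):
    """EM for a Gaussian mixture on data x of shape (N, F).

    E-step: copy (pi, mu, S) into the encoder parameters and evaluate the
    responsibilities r[i, z] = p_phi(z|x_i).  M-step: re-estimate (pi, mu, S).
    """
    rng = np.random.default_rng(seed)
    N, F = x.shape
    pi = np.full(K, 1.0 / K)
    mu = x[rng.choice(N, size=K, replace=False)]            # K samples as initial means
    S = np.stack([np.cov(x.T) + 1e-6 * np.eye(F)] * K)

    def density(x, mean, cov):
        """Gaussian density N(x; mean, cov), evaluated at every row of x."""
        diff = x - mean
        quad = np.einsum("nf,fg,ng->n", diff, np.linalg.inv(cov), diff)
        return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** F * np.linalg.det(cov))

    for _ in range(n_iter):
        # E-step: responsibilities for the current parameters (Bayes encoder)
        r = np.stack([pi[z] * density(x, mu[z], S[z]) for z in range(K)], axis=1)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weights, means and covariances re-estimated from the responsibilities
        n_z = r.sum(axis=0)
        pi = n_z / N
        mu = (r.T @ x) / n_z[:, None]
        for z in range(K):
            diff = x - mu[z]
            S[z] = (r[:, z, None] * diff).T @ diff / n_z[z] + 1e-6 * np.eye(F)
    return pi, mu, S, r
```

For instance, em_gmm(x, K=3) returns the mixture weights, means, covariances and the final responsibilities, which give the soft clustering of the samples.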

Bibliography.

  1. Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). "Maximum Likelihood from Incomplete Data via the EM Algorithm". Journal of the Royal Statistical Society, Series B, 39(1): 1–38.
  2. Hinton, Geoffrey E. and Salakhutdinov, Ruslan R. (2006). "Reducing the dimensionality of data with neural networks". Science, 313(5786): 504–507.
  3. Vincent, Pascal, Larochelle, Hugo, Bengio, Yoshua and Manzagol, Pierre-Antoine (2008). "Extracting and composing robust features with denoising autoencoders". Proceedings of the 25th International Conference on Machine Learning, 1096–1103. doi:10.1145/1390156.1390294.
  4. Kingma, Diederik P. and Welling, Max (2014). "Auto-Encoding Variational Bayes". ICLR 2014.
  5. Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning. Springer, New York.