Machine learning as Statistics on steroids 3
Statistical encoding
This is the third note of our series about the
close relation between machine learning and statistics.
We deal with a framework of unsupervised learning that we call statistical encoding
and that encompasses autoencoders, variational autoencoders and
the EM algorithm.
The framework of statistical encoding.
Let us consider a dataset that consists of elements \(x_1, \ldots, x_N\)
in a space \(X\). We assume that such elements are independent samples
of a probability distribution \(p\) on \(X\) that we call the empirical
probability distribution.
The presentation of our dataset within the space \(X\)
may be cumbersome and one would like to get a better description.
For instance, one can imagine that our samples lie on
a submanifold of \(X\) that unfolds within \(X\) in a quirky way;
in that context, describing the samples directly in the
coordinates of the submanifold may be much more informative
than the presentation provided by \(X\).
In this line of thought, our goal is to find a latent space \(Z\)
so that our samples in \(X\) are produced by samples drawn from
a simple probability distribution
on \(Z\) that are then mapped to \(X\). One also wants, given an actual sample \(x \in X\),
to recover, at least approximately, the latent point in \(Z\) that would have produced \(x\).
For that purpose, in addition to the space \(Z\)
we introduce a parametrized probability
distribution on the latent space \(Z\)
\[
p_\lambda (z),\quad \lambda \in \Lambda, z \in Z,
\]
and two parametrized conditional probability distributions:
- On the one hand, the encoder that produces samples in \(Z\) from samples in \(X\).
It is a collection of probability distributions on \(Z\) that depend on \(x \in X\)
and parameters \(\phi \in \Phi\)
\[
p_\phi(z|x), \quad z \in Z.
\]
- On the other hand, the decoder that produces samples in \(X\) from samples in \(Z\).
It is a collection of probability distributions on \(X\) that depend on \(z \in Z\)
and parameters \(\theta \in \Theta\)
\[
p_\theta(x|z), \quad x \in X.
\]
In that context, given three parameters \(\theta \in \Theta\), \(\phi \in \Phi\) and
\(\lambda \in \Lambda\), one has two ways to produce samples in \(X\times Z\):
- On the one hand, one can sample an element \(x\) in \(X\) from
the empirical distribution \(p\), and then use the encoder
on \(x\) to sample an element \(z \in Z\) using \(p_\phi(z|x)\). This amounts to drawing
a sample \((x,z)\) in \(X \times Z\) from the probability distribution
\[
p_\phi(x, z) = p(x)p_\phi(z|x).
\]
- On the other hand, one can sample an element \(z\) in \(Z\) from
the distribution \(p_\lambda\), and then use the decoder
on \(z\) to sample an element \(x \in X\) using \(p_\theta(x|z)\). This amounts to drawing
a sample \((x,z)\) in \(X \times Z\) from the probability distribution
\[
p_{\theta, \lambda}(x, z) = p_\lambda(z)p_\theta(x|z).
\]
In that context, our goal is to find three parameters that minimise
the distance between the two probability distributions on \(X \times Z\)
which are \(p_\phi(x, z)\) and \(p_{\theta, \lambda}(x, z)\).
The Kullback-Leibler divergence between \(p_\phi(x, z)\)
and \(p_{\theta, \lambda}(x, z)\) is
\begin{align*}
&KL(p_{\phi}(x, z)|| p_{\theta, \lambda}(x, z))
\\
&= \int_{x\in X, z \in Z}p_{\phi}(x,z)\log
\left(\frac{p_{\phi}(x,z)}{p_{\theta, \lambda}(x,z)}\right)dxdz
\\
&= \int_{x\in X, z \in Z}p(x)p_{\phi}(z|x)
\log\left(\frac{p_\phi(z|x)}{p_\theta(x|z)p_\lambda(z)}\right)
dxdz - H(p(x)),
\end{align*}
where \(H(p(x)) = -\int_{x \in X}p(x)\log\left(p(x)\right)dx\) is the entropy of the empirical distribution.
Since the samples are assumed to be independent, the law of large numbers shows
that the quantity
\[
KL(p_{\phi}(x, z)|| p_{\theta, \lambda}(x, z))- H(p(x))
= \int_{x \in X}p(x) \int_{z \in Z}p_{\phi}(z|x)\log\left(\frac{p_\phi(z|x)}{p_\theta(x|z)p_\lambda(z)}\right)dzdx
\]
may be approximated by the loss function
\[
\mathrm{Loss}(\theta, \lambda, \phi) = \frac{1}{N} \sum_{i=1}^{N} \int_{z \in Z} p_\phi(z|x_i) \log\left(\frac{p_\phi(z|x_i)}{p_\theta(x_i|z)p_\lambda(z)}\right)dz.
\]
This loss seems hard to handle, as it involves computing as many integrals
as there are samples. However, it may be simplified
or further approximated in some situations.
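To make the structure of this loss concrete, here is a minimal numpy sketch under the illustrative assumption of a finite latent space \(Z = \{0, \ldots, K-1\}\), so that the integral over \(z\) becomes a sum; the arrays `enc`, `dec` and `prior` are hypothetical placeholders for the encoder, the decoder and the latent distribution evaluated on a toy dataset, not models fitted to anything.

```python
# Loss for a finite latent space: the integral over z reduces to a sum over K values.
import numpy as np

rng = np.random.default_rng(0)
N, K = 5, 3                                   # number of samples, size of Z

enc = rng.random((N, K))                      # p_phi(z|x_i), one row per sample
enc /= enc.sum(axis=1, keepdims=True)         # each row is a distribution over z

dec = rng.random((N, K))                      # p_theta(x_i|z), density of x_i under component z

prior = rng.random(K)                         # p_lambda(z)
prior /= prior.sum()

# Loss = (1/N) sum_i sum_z p_phi(z|x_i) log( p_phi(z|x_i) / (p_theta(x_i|z) p_lambda(z)) )
loss = np.mean(np.sum(enc * np.log(enc / (dec * prior)), axis=1))
print(loss)
```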
Several perspectives on the loss function.
The likelihood and the Bayes encoder.
The purely parametrized probability distribution
\(p_{\theta, \lambda}(x,z)\)
on \(X \times Z\) yields
- a parametrized probability distribution on \(X\) which is the
marginal probability distribution
\[
p_{\theta, \lambda}(x) = \int_{z \in Z}p_{\theta,\lambda}(x,z)dz
=\int_{z \in Z}p_{\theta}(x|z)p_{\lambda}(z)dz, \quad x \in X;
\]
- another encoder, that is a probability distribution on \(Z\) conditional
to elements \(x \in X\), obtained through the Bayes rule and
that we call the Bayes encoder
\[
p_{\theta, \lambda}(z|x) = \frac{p_{\theta, \lambda}(x, z)}{p_{\theta, \lambda}(x)}
= \frac{p_{\theta}(x|z)p_{\lambda}(z)}{\int_{u \in Z}p_{\theta}(x|u)p_{\lambda}(u)du}.
\]
The Kullback-Leibler divergence
between \(p_{\phi}(x, z)\) and \(p_{\theta, \lambda}(x, z)\)
that we try to minimise may be decomposed as
the sum of the divergence from the empirical distribution \(p(x)\) to the parametrized
distribution
\begin{align*}
KL(p_{\phi}(x, z)|| p_{\theta, \lambda}(x, z))
=& \int_{x \in X}p(x) \int_{z \in Z}p_{\phi}(z|x)\log\left(\frac{p(x)}{p_{\theta, \lambda}(x)}\right)dzdx
\\
&+ \int_{x \in X}p(x) \int_{z \in Z}p_{\phi}(z|x)\log\left(\frac{p_\phi(z|x)}{p_{\theta, \lambda}(z|x)}\right)dzdx
\\
=& KL\left(p(x)|| p_{\theta, \lambda}(x)\right)
+ \int_{x \in X}p(x) KL \left(p_\phi(z|x)|| p_{\theta, \lambda}(z|x)\right)dx.
\end{align*}
In the same line of thought, we can decompose the loss as the sum
\[
\mathrm{Loss}(\theta, \lambda, \phi)
= - \frac{1}{N}\log\mathcal{L}_{\theta, \lambda}^X
+ \frac{1}{N}\sum_{i=1}^{N} KL(p_{\phi}(z|x_i)||p_{\theta, \lambda}(z|x_i))
\]
of the rescaled opposite of the log likelihood
\[
\log\mathcal{L}_{\theta, \lambda}^X = \sum_{i=1}^{N}\log p_{\theta, \lambda}(x_i)
\]
and the average with respect to \(x_i\)
of the Kullback-Leibler divergences with respect to \(z\) between the
encoder \(p_\phi(z|x_i)\) and the Bayes encoder \(p_{\theta, \lambda}(z|x_i)\).
Thus, minimising the loss may be understood as achieving two objectives:
- modelling the empirical distribution \(p(x)\) with the parametrized
distribution \(p_{\theta, \lambda}(x)\);
- this model yields a Bayes encoder \(p_{\theta, \lambda}(z|x)\); our second goal
is to bring it closer to the parametrized encoder \(p_\phi(z|x)\).
Remark
The opposite of the loss is always lower than the rescaled log likelihood
\[
-\mathrm{Loss}(\theta, \lambda, \phi)
= \frac{1}{N}\log\mathcal{L}_{\theta, \lambda}^X
- \frac{1}{N}\sum_{i=1}^{N} KL(p_{\phi}(z|x_i)||p_{\theta, \lambda}(z|x_i))
\leq \frac{1}{N}\log\mathcal{L}_{\theta, \lambda}^X
\]
as the Kullback-Leibler divergence is non-negative. For that reason, this opposite of the
loss is often called the evidence lower bound (ELBO).
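In the finite latent space toy setting introduced earlier, this decomposition and the evidence lower bound property can be checked numerically. The sketch below rebuilds the same hypothetical arrays (same seed) so that it runs on its own.

```python
# Check: Loss = -(1/N) log-likelihood + average KL(encoder || Bayes encoder),
# and hence -Loss lower-bounds the rescaled log-likelihood.
import numpy as np

rng = np.random.default_rng(0)
N, K = 5, 3
enc = rng.random((N, K)); enc /= enc.sum(axis=1, keepdims=True)   # p_phi(z|x_i)
dec = rng.random((N, K))                                          # p_theta(x_i|z)
prior = rng.random(K); prior /= prior.sum()                       # p_lambda(z)

loss = np.mean(np.sum(enc * np.log(enc / (dec * prior)), axis=1))

marginal = dec @ prior                          # p_{theta,lambda}(x_i) = sum_z p_theta(x_i|z) p_lambda(z)
bayes_enc = dec * prior / marginal[:, None]     # Bayes encoder p_{theta,lambda}(z|x_i)

neg_loglik = -np.mean(np.log(marginal))         # -(1/N) log-likelihood
avg_kl = np.mean(np.sum(enc * np.log(enc / bayes_enc), axis=1))

assert np.isclose(loss, neg_loglik + avg_kl)    # the decomposition of the loss
assert -loss <= -neg_loglik                     # the evidence lower bound
```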
The encoder-decoder perspective.
One can decompose the Kullback-Leibler divergence
between \(p_{\phi}(x, z)\) and \(p_{\theta, \lambda}(x, z)\)
that we try to minimise
as
\begin{align*}
KL(p_{\phi}(x, z)|| p_{\theta, \lambda}(x, z))
=& - \int_{x \in X}p(x) \int_{z \in Z}p_{\phi}(z|x)\log\left(p_{\theta}(x|z)\right)dzdx
\\
& + \int_{x \in X}p(x) KL(p_{\phi}(z|x) || p_\lambda(z))dx
\\
&- H(p(x)).
\end{align*}
Minimising the first term tends, for every \(x\in X\),
to make the following sampling procedure yield a \(y\in X\) that is close to \(x\):
- first take \(x\);
- then sample \(z \in Z\) from the encoder \(p_\phi(z|x)\);
- finally sample \(y \in X\) from the decoder \(p_\theta(y |z)\).
For instance, if \(p_\theta(x|z)\) is a Gaussian distribution
with fixed isotropic covariance matrix
and average a parametrized function \(g_\theta(z)\), this first term becomes
\[
- \int_{x \in X}p(x) \int_{z \in Z}p_{\phi}(z|x)\log\left(p_{\theta}(x|z)\right)dzdx
=\kappa \int_{x \in X}p(x) \int_{z \in Z}p_{\phi}(z|x)\lVert x - g_\theta(z) \rVert^2 dzdx + \kappa'
\]
where \(\kappa, \kappa'\) are constants.
Besides, minimising the second term of the Kullback-Leibler divergence tends
to bring the encoder \(p_\phi(z|x)\) closer to the distribution on the latent space
\(p_\lambda(z)\). The last term of the divergence is constant.
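The three-step sampling procedure above can be written down in a few lines. In the sketch below, the encoder and decoder means are hypothetical linear maps standing in for whatever parametrized functions one would use in practice, and both conditional distributions are Gaussian with a small fixed noise level.

```python
# Encode-then-decode sampling: x -> z ~ p_phi(z|x) -> y ~ p_theta(y|z).
import numpy as np

rng = np.random.default_rng(1)
F, J = 4, 2                                    # dimensions of X and Z

W_enc = rng.normal(size=(J, F))                # placeholder mean map of the encoder
W_dec = rng.normal(size=(F, J))                # placeholder mean map g_theta of the decoder

x = rng.normal(size=F)                         # start from a point x in X
z = W_enc @ x + 0.1 * rng.normal(size=J)       # sample z from the encoder p_phi(z|x)
y = W_dec @ z + 0.1 * rng.normal(size=F)       # sample y from the decoder p_theta(y|z)

# Minimising the reconstruction term pushes y to stay close to x.
print(np.linalg.norm(x - y))
```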
The EM-algorithm.
The Expectation-Maximisation (EM) algorithm (see [1]) consists in minimising the loss alternately
with respect to \(\phi\) and with respect to \(\theta\) and \(\lambda\). Precisely,
it consists in the following two steps, performed alternately (a toy code illustration is given after the two steps):
- The expectation step consists in minimising the loss with respect to \(\phi\).
This amounts to minimising, with respect to \(\phi\) only, the average Kullback-Leibler divergence between the
encoder \(p_\phi(z|x)\) and the Bayes encoder
\[
\frac{1}{N} \sum_{i=1}^{N} KL(p_\phi(z|x_i) || p_{\theta, \lambda}(z|x_i)).
\]
- The maximisation step consists in minimising the loss with respect to \(\theta\) and \(\lambda\),
leaving \(\phi\) constant.
This amounts to maximising the quantity
\[
\frac{1}{N} \sum_{i=1}^{N} \int_{z \in Z} p_\phi(z|x_i) \log\left({p_\theta(x_i|z)p_\lambda(z)}\right)dz
\]
with respect to \(\theta\) and \(\lambda\).
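As an illustration, the sketch below runs both steps in a fully discrete toy setting where \(X = \{0, \ldots, M-1\}\) and \(Z = \{0, \ldots, K-1\}\), so that the encoder, the decoder and the latent distribution are plain tables (a mixture of categorical distributions); this finite choice is an assumption made only to keep the example short.

```python
# EM on a discrete toy model: alternate the expectation and maximisation steps.
import numpy as np

rng = np.random.default_rng(2)
N, M, K = 200, 6, 3
x = rng.integers(0, M, size=N)                     # toy dataset x_1, ..., x_N

prior = np.full(K, 1.0 / K)                        # p_lambda(z)
dec = rng.dirichlet(np.ones(M), size=K)            # p_theta(x|z), shape (K, M), rows sum to 1

for _ in range(50):
    # Expectation step: set the encoder to the Bayes encoder p_{theta,lambda}(z|x_i).
    resp = prior[None, :] * dec[:, x].T            # proportional to p_lambda(z) p_theta(x_i|z)
    resp /= resp.sum(axis=1, keepdims=True)        # p_phi(z|x_i), shape (N, K)

    # Maximisation step: maximise (1/N) sum_i sum_z p_phi(z|x_i) (log p_lambda(z) + log p_theta(x_i|z)).
    prior = resp.mean(axis=0)
    counts = resp.T @ np.eye(M)[x]                 # counts[z, m] = sum over i with x_i = m of p_phi(z|x_i)
    dec = counts / counts.sum(axis=1, keepdims=True)

print(prior, dec, sep="\n")
```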
Autoencoders.
The autoencoder architecture (see [2]) is a particular incarnation
of the statistical encoding framework
where \(X\) and \(Z\) are Euclidean spaces:
\(X= \mathbb R^F\) and \(Z = \mathbb R^J\). Moreover,
- the space \(\Phi\) parametrizes a function
\(f_\phi(x)\in Z\) so that the probability measure \(p_\phi(z|x)\)
on \(Z\) is the Dirac measure at \(f_\phi(x)\), in the sense that for any function
\(G(z)\) on \(Z\)
\[
\int_{z \in Z}p_\phi(z|x) G(z)dz = G(f_\phi(x));
\]
- the parameter space \(\Lambda\) has a single element \(\lambda\)
and the related distribution \(p_\lambda(z)\)
on the latent space is almost invariant by translation; for
instance, one can take a Gaussian measure \(\mathcal{N}(0, S\mathcal I)\) with average \(0\)
and an isotropic covariance matrix having a very large scale \(S \gg 1\):
\[
p_\lambda(z) = \frac{1}{\sqrt{2\pi S}^J} \exp \left(-\frac{\lVert z\rVert^2}{2S}\right)
\]
- the space \(\Theta\) parametrizes a function \(g_\theta(z)\in X\)
so that the decoder \(p_\theta(x|z)\) is the Gaussian distribution with fixed
isotropic covariance matrix \(\mathcal I\)
and average the parametrized function \(g_\theta(z)\):
\[
p_\theta(x|z) = \frac{1}{\sqrt{2\pi}^F} \exp \left(-\frac{\lVert x- g_\theta(z)\rVert^2}{2}\right).
\]
In that context,
the Kullback-Leibler divergence between \(p_{\phi}(x, z)\) and \(p_{\theta, \lambda}(x, z)\)
is
\begin{align*}
KL(p_{\phi}(x, z)|| p_{\theta, \lambda}(x, z))
=& \int_{x \in X}p(x) \log\left(\frac{p(x)}{p_{\theta}(x|f_\phi(x)) p_{\lambda}(f_\phi(x))}\right)dx
\\
=& \frac{1}{2}\int_{x \in X} p(x)\left( \lVert x- g_\theta f_\phi(x)\rVert^2 + \frac{1}{S}\lVert f_\phi(x)\rVert^2\right)dx + \mathrm{cst}
\end{align*}
and
the Loss function (which does not depend on \(\lambda\)) becomes
\[
\mathrm{Loss}(\theta, \phi) =
\frac{1}{2N}\sum_{i=1}^{N}\lVert x_i- g_\theta f_\phi(x_i)\rVert^2
+ \frac{1}{2NS} \sum_{i=1}^{N} \lVert f_\phi(x_i)\rVert^2 + \mathrm{cst}.
\]
As \(S\) is very large, the second term is negligible, and we thus actually
try to minimise the autoencoder loss
\[
\widetilde{\mathrm{Loss}}(\theta, \phi)=
\frac{1}{2N}\sum_{i=1}^{N}\lVert x_i- g_\theta f_\phi(x_i)\rVert^2.
\]
A modification of the classical autoencoder is given by the denoising
autoencoder ([3]), where the encoder consists in applying some random noise to the input \(x\)
and then applying the function \(f_\phi\). The encoder conditional probability
distribution is then the one associated with this sampling procedure.
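A minimal sketch of this stochastic encoder, under the assumptions that the corruption is additive Gaussian noise and that \(f_\phi\) is a placeholder linear map, could look as follows.

```python
# Denoising encoder: corrupt the input, then apply the deterministic map f_phi.
import numpy as np

rng = np.random.default_rng(6)
F, J, noise_scale = 10, 3, 0.3                    # illustrative dimensions and noise level

A = rng.normal(size=(J, F))                       # placeholder encoder map f_phi(x) = A x

def denoising_encode(x, rng):
    """Sample z from the denoising encoder p_phi(z|x)."""
    x_corrupted = x + noise_scale * rng.normal(size=x.shape)
    return A @ x_corrupted

x = rng.normal(size=F)
print(denoising_encode(x, rng))
```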
Variational autoencoders
Like autoencoders, variational autoencoders (see [4]) assume that
\(X\) and \(Z\) are Euclidean spaces
\(X= \mathbb R^F\) and \(Z = \mathbb R^J\).
Besides
- the space \(\Phi\) parametrizes \(J+1\) functions
\(\mu_\phi(x)\in Z\) and
\(\sigma_\phi^{(j)}(x) \in \mathbb{R}_+^\ast\) for \(j=1, \ldots, J\),
so that the probability measure \(p_\phi(z|x)\)
on \(Z\) is the Gaussian measure with average \(\mu_\phi(x)\)
and covariance the diagonal matrix with diagonal elements
\((\sigma_\phi^{(j)}(x)^2)_{j=1}^J\):
\[
p_\phi(z|x) = \frac{1}{\sqrt{2\pi}^J\sigma_\phi^{(1)}(x)\cdots \sigma_\phi^{(J)}(x)}
\exp \left(-\sum_{j=1}^{J}\frac{\left(z^{(j)}- \mu_\phi^{(j)}(x)\right)^2}{2\sigma_\phi^{(j)}(x)^2}\right)
\]
where \(z^{(j)}\) represents the jth component of the vector \(z \in Z = \mathbb R^J\);
- the parameter space \(\Lambda\) has a single element \(\lambda\)
and the related distribution \(p_\lambda(z)\)
on the latent space is the standard Gaussian measure \(\mathcal{N}(0, \mathcal I)\):
\[
p_\lambda(z) = \frac{1}{\sqrt{2\pi }^J} \exp \left(-\frac{\lVert z\rVert^2}{2}\right)
\]
- we do not make such strong assumptions on the decoder, but a usual setting
is similar to that of the encoder: \(\Theta\) parametrizes \(F+1\) functions
\(\mu_\theta(z)\in X\) and
\(\sigma_\theta^{(k)}(z) \in \mathbb{R}_+^\ast\) for \(k=1, \ldots, F\),
so that the probability measure \(p_\theta(x|z)\)
on \(X\) is the Gaussian measure with average \(\mu_\theta(z)\)
and covariance the diagonal matrix with diagonal elements
\((\sigma_\theta^{(k)}(z)^2)_{k=1}^F\).
In that context, for every \(x \in X\), the Kullback-Leibler divergence
with respect to \(z\) between \(p_{\phi}(z|x)\) and \(p_\lambda(z)\)
is given in closed form by
\[
KL\left(p_{\phi}(z|x)||p_{\lambda}(z)\right) =
\frac{1}{2} \sum_{j=1}^{J} \left( \mu_\phi^{(j)}(x)^2 + \sigma_\phi^{(j)}(x)^2 - 1 - \log\left(\sigma_\phi^{(j)}(x)^2\right) \right).
\]
Besides, for every \(x \in X\), the integral
\(\int_{z \in Z}p_{\phi}(z|x)\log\left(p_{\theta}(x|z)\right)dz\)
may be approximated by a Monte Carlo technique. Indeed, if a random variable \(\epsilon \in Z\)
follows the standard Gaussian law on \(Z\), then the random variable
\[
\mu_\phi(x) + \sigma_\phi(x) \odot \epsilon \in Z
\]
whose jth component is
\[
\left(\mu_\phi(x) + \sigma_\phi(x) \odot \epsilon\right)^{(j)} =
\mu_\phi(x)^{(j)} + \sigma_\phi^{(j)}(x) \epsilon^{(j)}
\]
follows the law \(p_\phi(z|x)\). Then, one can
approximate the integral by computing the sum
\[
\frac{1}{L}\sum_{l=1}^{L} \log\left(p_{\theta}(x|\mu_\phi(x) + \sigma_\phi(x) \odot \epsilon_{x,l})\right)
\]
where \(\epsilon_{x,1}, \ldots, \epsilon_{x,L}\) are
independent samples of the
standard Gaussian distribution on \(Z\).
Thus, the Loss function
is, in the encoder-decoder perspective, approximated by the sum
\begin{align*}
\mathrm{Loss}(\theta, \phi)
=& - \frac{1}{NL}\sum_{i=1}^{N}\sum_{l=1}^{L} \log\left(p_{\theta}(x_i|\mu_\phi(x_i) + \sigma_\phi(x_i) \odot \epsilon_{i,l})\right)
\\
& + \frac{1}{2N} \sum_{i=1}^{N}\sum_{j=1}^{J}
\left( \mu_\phi^{(j)}(x_i)^2 + \sigma_\phi^{(j)}(x_i)^2 - 1 - \log\left(\sigma_\phi^{(j)}(x_i)^2\right) \right).
\end{align*}
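The sketch below evaluates this loss once with numpy, using hypothetical linear maps in place of the neural networks that usually parametrize \(\mu_\phi\), \(\log\sigma_\phi\) and \(\mu_\theta\), and a unit-variance Gaussian decoder; everything in it (dimensions, maps, number of Monte Carlo samples \(L\)) is an illustrative assumption.

```python
# Variational autoencoder loss: reparameterised reconstruction term + closed-form KL term.
import numpy as np

rng = np.random.default_rng(4)
N, F, J, L = 100, 8, 2, 5
X = rng.normal(size=(N, F))                              # toy dataset, one sample per row

W_mu = rng.normal(size=(J, F))                           # placeholder map for mu_phi
W_logsig = 0.01 * rng.normal(size=(J, F))                # placeholder map for log sigma_phi
W_dec = rng.normal(size=(F, J))                          # placeholder map for mu_theta

mu = X @ W_mu.T                                          # mu_phi(x_i), shape (N, J)
sig = np.exp(X @ W_logsig.T)                             # sigma_phi^(j)(x_i), shape (N, J)

# Reconstruction term, estimated with the reparameterisation trick.
eps = rng.normal(size=(L, N, J))                         # epsilon_{i,l}, standard Gaussian on Z
z = mu[None] + sig[None] * eps                           # samples from p_phi(z|x_i)
x_hat = z @ W_dec.T                                      # mu_theta(z), shape (L, N, F)
log_dec = -0.5 * np.sum((X[None] - x_hat) ** 2, axis=-1) - 0.5 * F * np.log(2 * np.pi)
reconstruction = -np.mean(log_dec)                       # average over both i and l

# Closed-form KL term between the encoder and the standard Gaussian prior.
kl = 0.5 * np.mean(np.sum(mu ** 2 + sig ** 2 - 1.0 - np.log(sig ** 2), axis=1))

print(reconstruction + kl)
```

In practice one would implement the same computation in an automatic differentiation framework and minimise it by stochastic gradient descent over the parameters of the three maps.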
The Gaussian mixture model.
The Gaussian mixture model is a soft clustering method
that leverages the framework described above. We assume that \(X\) is a Euclidean space
\(X= \mathbb{R}^F\), and \(Z= \{0, \ldots, K-1\}\) is a finite set that represents
the clusters. Besides
- The parameter space \(\Lambda\) is the set
\[
\Lambda = \left\{(\pi_0, \ldots, \pi_{K-1})| \pi_k \geq 0,\ \sum_k \pi_k = 1\right\}.
\]
It is isomorphic to the set of all probability distributions on \(Z\)
and for \(\lambda = (\pi_0, \ldots, \pi_{K-1})\) and \(z\in Z\)
\[
p_\lambda (z) = \pi_z.
\]
- The parameter space \(\Theta\) is the product set whose elements are tuples
\(\theta = (\mu_0, S_0, \ldots, \mu_{K-1}, S_{K-1})\) made up of vectors
\(\mu_0, \ldots, \mu_{K-1}\) in \(X = \mathbb R^F\) and \(F \times F\) covariance matrices
(that is, positive definite symmetric matrices)
\(S_0, \ldots, S_{K-1}\). Then, the decoder probability distribution conditional to \(z \in Z\)
is the Gaussian distribution with average \(\mu_z\) and covariance matrix \(S_z\)
\[
p_\theta(x|z) = \mathcal{N}(x|\mu_z, S_z)
= \frac{1}{\sqrt{2\pi}^F\sqrt{\det(S_z)}} \exp \left(-\frac{1}{2} (x-\mu_z)^tS_z^{-1}(x-\mu_z)\right).
\]
- The parameter space \(\Phi\) is a copy of the product space \(\Theta \times \Lambda\)
and given
\[
\phi = (\theta^{(e)}, \lambda^{(e)}),
\]
where
\begin{align*}
\theta^{(e)} = (\mu_0^{(e)}, S_0^{(e)}, \ldots, \mu_{K-1}^{(e)}, S_{K-1}^{(e)})
\\
\lambda^{(e)} = (\pi_0^{(e)}, \ldots, \pi_{K-1}^{(e)})
\end{align*}
then the encoder conditional probability distribution
is the Bayes encoder associated with \((\theta^{(e)}, \lambda^{(e)})\)
\[
p_\phi(z|x) = \frac{\pi_z^{(e)} \mathcal{N}(x|\mu_z^{(e)}, S_z^{(e)})}{\sum_{k=0}^{K-1} \pi_k^{(e)} \mathcal{N}(x|\mu_k^{(e)}, S_k^{(e)})}.
\]
Thus the whole parameter space is made up of the tuples
\[
\left(\pi_0, \mu_0, S_0,\pi_0^{(e)}, \mu_0^{(e)}, S_0^{(e)},
\ldots, \pi_{K-1}, \mu_{K-1}, S_{K-1},\pi_{K-1}^{(e)}, \mu_{K-1}^{(e)}, S_{K-1}^{(e)} \right)
\]
where, for \(z \in Z\), \(\pi_z, \pi_z^{(e)}\) are non-negative numbers,
\(\mu_z, \mu_z^{(e)}\) are vectors in \(X\), \(S_z, S_z^{(e)}\)
are positive definite symmetric \(F\times F\) matrices and
\[
\sum_{z=0}^{K-1} \pi_z = 1 = \sum_{z=0}^{K-1} \pi_z^{(e)}.
\]
The elements that have an \((e)\) superscript are the parameters of the encoder.
In this context, one can perform the EM-algorithm as follows (a code sketch of the resulting procedure is given after the two steps):
- The expectation step consists in replacing the encoder parameters
by their counterparts in the decoder and the latent distributions
\begin{align*}
\pi_z^{(e)} &\leftarrow \pi_z
\\
\mu_z^{(e)} &\leftarrow \mu_z
\\
S_z^{(e)} &\leftarrow S_z.
\end{align*}
- The maximisation step consists in maximising
\[
\frac{1}{N} \sum_{i=1}^{N} \sum_{z=0}^{K-1} p_\phi(z|x_i) \left(\log(\pi_z) + \log\mathcal{N}(x_i|\mu_z, S_z)\right)
\]
with respect to \(\pi_z, \mu_z, S_z\) for \(z=0, \ldots, K-1\), leaving \(\phi\),
that is \(\pi_z^{(e)}, \mu_z^{(e)}, S_z^{(e)}\) for \(z=0, \ldots, K-1\), constant.
This yields for every cluster \(z\)
\begin{align*}
\pi_z &= \frac{1}{N} \sum_{i=1}^{N} p_\phi(z|x_i)
\\
\mu_z &= \frac{\sum_{i=1}^{N} p_\phi(z|x_i) x_i}{\sum_{i=1}^{N} p_\phi(z|x_i)}
\\
S_z &= \frac{\sum_{i=1}^{N} p_\phi(z|x_i) (x_i-\mu_z)(x_i-\mu_z)^t}{\sum_{i=1}^{N} p_\phi(z|x_i)}.
\end{align*}
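The sketch below runs this EM procedure with numpy on a hypothetical two-dimensional dataset made of three well-separated blobs; the initialisation and the number of iterations are illustrative choices.

```python
# EM for the Gaussian mixture model: responsibilities (E-step) and closed-form updates (M-step).
import numpy as np

rng = np.random.default_rng(5)
N, F, K = 300, 2, 3
X = np.concatenate([rng.normal(loc=4.0 * k, size=(N // K, F)) for k in range(K)])

pi = np.full(K, 1.0 / K)                                  # pi_z
mu = X[rng.choice(N, size=K, replace=False)]              # mu_z, shape (K, F)
S = np.stack([np.eye(F) for _ in range(K)])               # S_z, shape (K, F, F)

def gaussian_density(X, m, C):
    """N(x | m, C) evaluated at every row of X."""
    d = X - m
    quad = np.einsum('nf,fg,ng->n', d, np.linalg.inv(C), d)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** F * np.linalg.det(C))

for _ in range(50):
    # E-step: the encoder parameters are set so that p_phi(z|x_i) is the Bayes encoder.
    dens = np.stack([pi[k] * gaussian_density(X, mu[k], S[k]) for k in range(K)], axis=1)
    resp = dens / dens.sum(axis=1, keepdims=True)          # p_phi(z|x_i), shape (N, K)

    # M-step: closed-form updates of pi_z, mu_z, S_z given the responsibilities.
    Nk = resp.sum(axis=0)                                  # sum_i p_phi(z|x_i)
    pi = Nk / N
    mu = resp.T @ X / Nk[:, None]
    for k in range(K):
        d = X - mu[k]
        S[k] = (resp[:, k, None] * d).T @ d / Nk[k]

print(pi)
print(mu)
```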
Bibliography.
- [1] Dempster, A.P., Laird, N.M., Rubin, D.B. (1977). "Maximum Likelihood from Incomplete Data via the EM Algorithm". Journal of the Royal Statistical Society, Series B, 39(1), 1–38.
- [2] Hinton, G.E., Salakhutdinov, R.R. (2006). "Reducing the dimensionality of data with neural networks". Science, 313(5786), 504–507.
- [3] Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.-A. (2008). "Extracting and composing robust features with denoising autoencoders". Proceedings of the 25th International Conference on Machine Learning, 1096–1103.
- [4] Kingma, D.P., Welling, M. (2014). "Auto-encoding variational Bayes". ICLR 2014.
- [5] Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer.