Machine learning as Statistics on steroids 3
Statistical encoding
This is the third note of our series about the
close relation between machine learning and statistics.
We deal with a framework of unsupervised learning that we call statistical encoding
and that encompasses autoencoders, variational autoencoders and
the EM algorithm.
The framework of statistical encoding.
Let us consider a dataset that consists of elements \(x_1, \ldots, x_N\)
in a space \(X\). We assume that such elements are independent samples
of a probability distribution \(p\) on \(X\) that we call the empirical
probability distribution.
The presentation of our dataset within the space \(X\)
may be cumbersome and one would like to get a better description.
For instance, one can imagine that our samples lie on
a submanifold of \(X\) that unfolds within \(X\) in a quirky way;
in that context, describing the samples directly in the
coordinates of the submanifold may be much more informative
than the presentation provided by \(X\).
In this line of thought, our goal is to find a latent space \(Z\)
so that our samples in \(X\) are produced by samples drawn from
a simple probability distribution
on \(Z\) that are then mapped to \(X\). One also wants, given an actual sample \(x \in X\),
to recover, at least approximately, the latent point in \(Z\) that would have produced \(x\).
For that purpose, in addition to the space \(Z\)
we introduce a parametrized probability
distribution on the latent space \(Z\)
\[
p_\lambda (z),\quad \lambda \in \Lambda, z \in Z,
\]
and two parametrized conditional probability distributions:
- On the one hand, the encoder that produces samples in \(Z\) from samples in \(X\).
It is a collection of probability distributions on \(Z\) that depend on \(x \in X\)
and parameters \(\phi \in \Phi\)
\[
p_\phi(z|x), \quad z \in Z.
\]
- On the other hand, the decoder that produces samples in \(X\) from samples in \(Z\).
It is a collection of probability distributions on \(X\) that depend on \(z \in Z\)
and parameters \(\theta \in \Theta\)
\[
p_\theta(x|z), \quad x \in X.
\]
In that context, given three parameters \(\theta \in \Theta\), \(\phi \in \Phi\) and
\(\lambda \in \Lambda\), one has two ways to produce samples in \(X\times Z\):
- On the one hand, one can sample an element \(x\) in \(X\) from
the empirical distribution \(p\), and then use the encoder
on \(x\) to sample an element \(z \in Z\) using \(p_\phi(z|x)\). This amounts to drawing
a sample \((x,z)\) in \(X \times Z\) from the probability distribution
\[
p_\phi(x, z) = p(x)p_\phi(z|x).
\]
- On the other hand, one can sample an element \(z\) in \(Z\) from
the distribution \(p_\lambda\), and then use the decoder
on \(z\) to sample an element \(x \in X\) using \(p_\theta(x|z)\). This amounts to drawing
a sample \((x,z)\) in \(X \times Z\) from the probability distribution
\[
p_{\theta, \lambda}(x, z) = p_\lambda(z)p_\theta(x|z).
\]
In that context, our goal is to find three parameters that minimise
the distance between the two probability distributions on \(X \times Z\)
which are \(p_\phi(x, z)\) and \(p_{\theta, \lambda}(x, z)\).
The Kullback-Leibler divergence between \(p_\phi(x, z)\)
and \(p_{\theta, \lambda}(x, z)\) is
\begin{align*}
&KL(p_{\phi}(x, z)|| p_{\theta, \lambda}(x, z))
\\
&= \int_{x\in X, z \in Z}p_{\phi}(x,z)\log
\left(\frac{p_{\phi}(x,z)}{p_{\theta, \lambda}(x,z)}\right)dxdz
\\
&= \int_{x\in X, z \in Z}p(x)p_{\phi}(z|x)
\log\left(\frac{p_\phi(z|x)}{p_\theta(x|z)p_\lambda(z)}\right)
dxdz - H(p(x)),
\end{align*}
where \(H(p(x)) = -\int_{x \in X}p(x)\log\left(p(x)\right)dx\) is the entropy of the empirical distribution.
Since the samples are assumed to be independent, the law of large numbers shows
that the quantity
\[
KL(p_{\phi}(x, z)|| p_{\theta, \lambda}(x, z))- H(p(x))
= \int_{x \in X}p(x) \int_{z \in Z}p_{\phi}(z|x)\log\left(\frac{p_\phi(z|x)}{p_\theta(x|z)p_\lambda(z)}\right)dzdx
\]
may be approximated by the loss function
\[
\mathrm{Loss}(\theta, \lambda, \phi) = \frac{1}{N} \sum_{i=1}^{N} \int_{z \in Z} p_\phi(z|x_i) \log\left(\frac{p_\phi(z|x_i)}{p_\theta(x_i|z)p_\lambda(z)}\right)dz.
\]
This loss seems hard to handle, as it involves computing as many integrals
as there are samples. However, it may be simplified
or further approximated in some situations.
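To make the structure of this loss concrete, here is a minimal numpy sketch under the illustrative assumption of a finite latent space \(Z = \{0, \ldots, K-1\}\), so that the integral over \(z\) becomes a sum; the arrays `enc`, `dec` and `prior` are hypothetical placeholders for the encoder, the decoder and the latent distribution evaluated on a toy dataset, not models fitted to anything.

```python
# Loss for a finite latent space: the integral over z reduces to a sum over K values.
import numpy as np

rng = np.random.default_rng(0)
N, K = 5, 3                                   # number of samples, size of Z

enc = rng.random((N, K))                      # p_phi(z|x_i), one row per sample
enc /= enc.sum(axis=1, keepdims=True)         # each row is a distribution over z

dec = rng.random((N, K))                      # p_theta(x_i|z), density of x_i under component z

prior = rng.random(K)                         # p_lambda(z)
prior /= prior.sum()

# Loss = (1/N) sum_i sum_z p_phi(z|x_i) log( p_phi(z|x_i) / (p_theta(x_i|z) p_lambda(z)) )
loss = np.mean(np.sum(enc * np.log(enc / (dec * prior)), axis=1))
print(loss)
```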
Several perspectives on the loss function.
The likelihood and the Bayes encoder.
The purely parametrized probability distribution
\(p_{\theta, \lambda}(x,z)\)
on \(X \times Z\) yields
- a parametrized probability distribution on \(X\) which is the
marginal probability distribution
\[
p_{\theta, \lambda}(x) = \int_{z \in Z}p_{\theta,\lambda}(x,z)dz
=\int_{z \in Z}p_{\theta}(x|z)p_{\lambda}(z)dz, \quad x \in X;
\]
- another encoder, that is a probability distribution on \(Z\) conditional
to elements \(x \in X\), obtained through the Bayes rule and
that we call the Bayes encoder
\[
p_{\theta, \lambda}(z|x) = \frac{p_{\theta, \lambda}(x, z)}{p_{\theta, \lambda}(x)}
= \frac{p_{\theta}(x|z)p_{\lambda}(z)}{\int_{u \in Z}p_{\theta}(x|u)p_{\lambda}(u)du}.
\]
The Kullback-Leibler divergence
between \(p_{\phi}(x, z)\) and \(p_{\theta, \lambda}(x, z)\)
that we try to minimise may be decomposed as
the sum of the divergence from the empirical distribution \(p(x)\) to the parametrized
distribution
\begin{align*}
KL(p_{\phi}(x, z)|| p_{\theta, \lambda}(x, z))
=& \int_{x \in X}p(x) \int_{z \in Z}p_{\phi}(z|x)\log\left(\frac{p(x)}{p_{\theta, \lambda}(x)}\right)dzdx
\\
&+ \int_{x \in X}p(x) \int_{z \in Z}p_{\phi}(z|x)\log\left(\frac{p_\phi(z|x)}{p_{\theta, \lambda}(z|x)}\right)dzdx
\\
=& KL\left(p(x)|| p_{\theta, \lambda}(x)\right)
+ \int_{x \in X}p(x) KL \left(p_\phi(z|x)|| p_{\theta, \lambda}(z|x)\right)dx.
\end{align*}
In the same line of thought, we can decompose the loss as the sum
\[
\mathrm{Loss}(\theta, \lambda, \phi)
= - \frac{1}{N}\log\mathcal{L}_{\theta, \lambda}^X
+ \frac{1}{N}\sum_{i=1}^{N} KL(p_{\phi}(z|x_i)||p_{\theta, \lambda}(z|x_i))
\]
of the rescaled opposite of the log likelihood
\[
\log\mathcal{L}_{\theta, \lambda}^X = \sum_{i=1}^{N}\log p_{\theta, \lambda}(x_i)
\]
and the average with respect to \(x_i\)
of the Kullback-Leibler divergences with respect to \(z\) between the
encoder \(p_\phi(z|x_i)\) and the Bayes encoder \(p_{\theta, \lambda}(z|x_i)\).
Thus, minimising the loss may be understood as achieving two objectives:
- modelling the empirical distribution \(p(x)\) with the parametrized
distribution \(p_{\theta, \lambda}(x)\);
- this model yields a Bayes encoder \(p_{\theta, \lambda}(z|x)\); our second goal
is to bring it closer to the parametrized encoder \(p_\phi(z|x)\).
Remark
The opposite of the loss is always lower than the rescaled log likelihood
\[
-\mathrm{Loss}(\theta, \lambda, \phi)
= \frac{1}{N}\log\mathcal{L}_{\theta, \lambda}^X
- \frac{1}{N}\sum_{i=1}^{N} KL(p_{\phi}(z|x_i)||p_{\theta, \lambda}(z|x_i))
\leq \frac{1}{N}\log\mathcal{L}_{\theta, \lambda}^X
\]
as the Kullback-Leibler divergence is non-negative. For that reason, this opposite of the
loss is often called the evidence lower bound (ELBO).
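In the finite latent space toy setting introduced earlier, this decomposition and the evidence lower bound property can be checked numerically. The sketch below rebuilds the same hypothetical arrays (same seed) so that it runs on its own.

```python
# Check: Loss = -(1/N) log-likelihood + average KL(encoder || Bayes encoder),
# and hence -Loss lower-bounds the rescaled log-likelihood.
import numpy as np

rng = np.random.default_rng(0)
N, K = 5, 3
enc = rng.random((N, K)); enc /= enc.sum(axis=1, keepdims=True)   # p_phi(z|x_i)
dec = rng.random((N, K))                                          # p_theta(x_i|z)
prior = rng.random(K); prior /= prior.sum()                       # p_lambda(z)

loss = np.mean(np.sum(enc * np.log(enc / (dec * prior)), axis=1))

marginal = dec @ prior                          # p_{theta,lambda}(x_i) = sum_z p_theta(x_i|z) p_lambda(z)
bayes_enc = dec * prior / marginal[:, None]     # Bayes encoder p_{theta,lambda}(z|x_i)

neg_loglik = -np.mean(np.log(marginal))         # -(1/N) log-likelihood
avg_kl = np.mean(np.sum(enc * np.log(enc / bayes_enc), axis=1))

assert np.isclose(loss, neg_loglik + avg_kl)    # the decomposition of the loss
assert -loss <= -neg_loglik                     # the evidence lower bound
```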
The encoder-decoder perspective.
One can decompose the Kullback-Leibler divergence
between \(p_{\phi}(x, z)\) and \(p_{\theta, \lambda}(x, z)\)
that we try to minimise
as
\begin{align*}
KL(p_{\phi}(x, z)|| p_{\theta, \lambda}(x, z))
=& - \int_{x \in X}p(x) \int_{z \in Z}p_{\phi}(z|x)\log\left(p_{\theta}(x|z)\right)dzdx
\\
& + \int_{x \in X}p(x) KL(p_{\phi}(z|x) || p_\lambda(z))dx
\\
&- H(p(x)).
\end{align*}
Minimising the first term tends, for every \(x\in X\),
to make the following sampling procedure yield a \(y\in X\) that is close to \(x\):
- first take \(x\);
- then sample \(z \in Z\) from the encoder \(p_\phi(z|x)\);
- finally sample \(y \in X\) from the decoder \(p_\theta(y |z)\).
For instance, if \(p_\theta(x|z)\) is a Gaussian distribution
with fixed isotropic covariance matrix
and average a parametrized function \(g_\theta(z)\), this first term becomes
\[
- \int_{x \in X}p(x) \int_{z \in Z}p_{\phi}(z|x)\log\left(p_{\theta}(x|z)\right)dzdx
=\kappa \int_{x \in X}p(x) \int_{z \in Z}p_{\phi}(z|x)\lVert x - g_\theta(z) \rVert^2 dzdx + \kappa'
\]
where \(\kappa, \kappa'\) are constants.
Besides, minimising the second term of the Kullback-Leibler divergence tends
to bring the encoder \(p_\phi(z|x)\) closer to the distribution on the latent space
\(p_\lambda(z)\). The last term of the divergence is constant.
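The three-step sampling procedure above can be written down in a few lines. In the sketch below, the encoder and decoder means are hypothetical linear maps standing in for whatever parametrized functions one would use in practice, and both conditional distributions are Gaussian with a small fixed noise level.

```python
# Encode-then-decode sampling: x -> z ~ p_phi(z|x) -> y ~ p_theta(y|z).
import numpy as np

rng = np.random.default_rng(1)
F, J = 4, 2                                    # dimensions of X and Z

W_enc = rng.normal(size=(J, F))                # placeholder mean map of the encoder
W_dec = rng.normal(size=(F, J))                # placeholder mean map g_theta of the decoder

x = rng.normal(size=F)                         # start from a point x in X
z = W_enc @ x + 0.1 * rng.normal(size=J)       # sample z from the encoder p_phi(z|x)
y = W_dec @ z + 0.1 * rng.normal(size=F)       # sample y from the decoder p_theta(y|z)

# Minimising the reconstruction term pushes y to stay close to x.
print(np.linalg.norm(x - y))
```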
The EM-algorithm.
The Expectation-Maximisation (EM) algorithm (see [1]) consists in minimising the loss alternately
with respect to \(\phi\) and with respect to \(\theta\) and \(\lambda\). Precisely,
it consists in the following two steps, performed alternately (a toy code illustration is given after the two steps):
- The expectation step consists in minimising the loss with respect to \(\phi\).
This amounts to minimising, with respect to \(\phi\) only, the average Kullback-Leibler divergence between the
encoder \(p_\phi(z|x)\) and the Bayes encoder
\[
\frac{1}{N} \sum_{i=1}^{N} KL(p_\phi(z|x_i) || p_{\theta, \lambda}(z|x_i)).
\]
- The maximisation step consists in minimising the loss with respect to \(\theta\) and \(\lambda\),
leaving \(\phi\) constant.
This amounts to maximising the quantity
\[
\frac{1}{N} \sum_{i=1}^{N} \int_{z \in Z} p_\phi(z|x_i) \log\left({p_\theta(x_i|z)p_\lambda(z)}\right)dz
\]
with respect to \(\theta\) and \(\lambda\).
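As an illustration, the sketch below runs both steps in a fully discrete toy setting where \(X = \{0, \ldots, M-1\}\) and \(Z = \{0, \ldots, K-1\}\), so that the encoder, the decoder and the latent distribution are plain tables (a mixture of categorical distributions); this finite choice is an assumption made only to keep the example short.

```python
# EM on a discrete toy model: alternate the expectation and maximisation steps.
import numpy as np

rng = np.random.default_rng(2)
N, M, K = 200, 6, 3
x = rng.integers(0, M, size=N)                     # toy dataset x_1, ..., x_N

prior = np.full(K, 1.0 / K)                        # p_lambda(z)
dec = rng.dirichlet(np.ones(M), size=K)            # p_theta(x|z), shape (K, M), rows sum to 1

for _ in range(50):
    # Expectation step: set the encoder to the Bayes encoder p_{theta,lambda}(z|x_i).
    resp = prior[None, :] * dec[:, x].T            # proportional to p_lambda(z) p_theta(x_i|z)
    resp /= resp.sum(axis=1, keepdims=True)        # p_phi(z|x_i), shape (N, K)

    # Maximisation step: maximise (1/N) sum_i sum_z p_phi(z|x_i) (log p_lambda(z) + log p_theta(x_i|z)).
    prior = resp.mean(axis=0)
    counts = resp.T @ np.eye(M)[x]                 # counts[z, m] = sum over i with x_i = m of p_phi(z|x_i)
    dec = counts / counts.sum(axis=1, keepdims=True)

print(prior, dec, sep="\n")
```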
Autoencoders.
The autoencoder architecture (see [2]) is a particular incarnation
of the statistical encoding framework
where \(X\) and \(Z\) are Euclidean spaces:
\(X= \mathbb R^F\) and \(Z = \mathbb R^J\). Moreover,
- the space \(\Phi\) parametrizes a function
\(f_\phi(x)\in Z\) so that the probability measure \(p_\phi(z|x)\)
on \(Z\) is the Dirac measure at \(f_\phi(x)\), in the sense that for any function
\(G(z)\) on \(Z\)
\[
\int_{z \in Z}p_\phi(z|x) G(z)dz = G(f_\phi(x));
\]
- the parameter space \(\Lambda\) has a single element \(\lambda\)
and the related distribution \(p_\lambda(z)\)
on the latent space is almost invariant by translation; for
instance, one can take a Gaussian measure \(\mathcal{N}(0, S\mathcal I)\) with average \(0\)
and an isotropic covariance matrix having a very large scale \(S \gg 1\):
\[
p_\lambda(z) = \frac{1}{\sqrt{2\pi S}^J} \exp \left(-\frac{\lVert z\rVert^2}{2S}\right)
\]
- the space \(\Theta\) parametrizes a function \(g_\theta(z)\in X\)
so that the decoder \(p_\theta(x|z)\) is the Gaussian distribution with fixed
isotropic covariance matrix \(\mathcal I\)
and average the parametrized function \(g_\theta(z)\):
\[
p_\theta(x|z) = \frac{1}{\sqrt{2\pi}^F} \exp \left(-\frac{\lVert x- g_\theta(z)\rVert^2}{2}\right).
\]
In that context,
the Kullback-Leibler divergence between \(p_{\phi}(x, z)\) and \(p_{\theta, \lambda}(x, z)\)
is
\begin{align*}
KL(p_{\phi}(x, z)|| p_{\theta, \lambda}(x, z))
=& \int_{x \in X}p(x) \log\left(\frac{p(x)}{p_{\theta}(x|f_\phi(x)) p_{\lambda}(f_\phi(x))}\right)dx
\\
=& \frac{1}{2}\int_{x \in X} p(x)\left( \lVert x- g_\theta f_\phi(x)\rVert^2 + \frac{1}{S}\lVert f_\phi(x)\rVert^2\right)dx + \mathrm{cst}
\end{align*}
and
the Loss function (which does not depend on \(\lambda\)) becomes
\[
\mathrm{Loss}(\theta, \phi) =
\frac{1}{2N}\sum_{i=1}^{N}\lVert x_i- g_\theta f_\phi(x_i)\rVert^2
+ \frac{1}{2NS} \sum_{i=1}^{N} \lVert f_\phi(x_i)\rVert^2 + \mathrm{cst}.
\]
As \(S\) is very large, the second term is negligible, and we thus actually
try to minimise the autoencoder loss
\[
\widetilde{\mathrm{Loss}}(\theta, \phi)=
\frac{1}{2N}\sum_{i=1}^{N}\lVert x_i- g_\theta f_\phi(x_i)\rVert^2.
\]
A modification of the classical autoencoder is given by the denoising
autoencoder ([3]), where the encoder consists in applying some random noise to the input \(x\)
and then applying the function \(f_\phi\). The encoder conditional probability
distribution is then the one associated with this sampling procedure.
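A minimal sketch of this stochastic encoder, under the assumptions that the corruption is additive Gaussian noise and that \(f_\phi\) is a placeholder linear map, could look as follows.

```python
# Denoising encoder: corrupt the input, then apply the deterministic map f_phi.
import numpy as np

rng = np.random.default_rng(6)
F, J, noise_scale = 10, 3, 0.3                    # illustrative dimensions and noise level

A = rng.normal(size=(J, F))                       # placeholder encoder map f_phi(x) = A x

def denoising_encode(x, rng):
    """Sample z from the denoising encoder p_phi(z|x)."""
    x_corrupted = x + noise_scale * rng.normal(size=x.shape)
    return A @ x_corrupted

x = rng.normal(size=F)
print(denoising_encode(x, rng))
```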
Variational autoencoders
Like autoencoders, variational autoencoders (see [4]) assume that
\(X\) and \(Z\) are Euclidean spaces
\(X= \mathbb R^F\) and \(Z = \mathbb R^J\).
Besides
- the space \(\Phi\) parametrizes \(J+1\) functions
\(\mu_\phi(x)\in Z\) and
\(\sigma_\phi^{(j)}(x) \in \mathbb{R}_+^\ast\) for \(j=1, \ldots, J\),
so that the probability measure \(p_\phi(z|x)\)
on \(Z\) is the Gaussian measure with average \(\mu_\phi(x)\)
and covariance the diagonal matrix with diagonal elements
\((\sigma_\phi^{(j)}(x)^2)_{j=1}^J\):
\[
p_\phi(z|x) = \frac{1}{\sqrt{2\pi}^J\sigma_\phi^{(1)}(x)\cdots \sigma_\phi^{(J)}(x)}
\exp \left(-\sum_{j=1}^{J}\frac{\left(z^{(j)}- \mu_\phi^{(j)}(x)\right)^2}{2\sigma_\phi^{(j)}(x)^2}\right)
\]
where \(z^{(j)}\) represents the jth component of the vector \(z \in Z = \mathbb R^J\);
- the parameter space \(\Lambda\) has a single element \(\lambda\)
and the related distribution \(p_\lambda(z)\)
on the latent space is the standard Gaussian measure \(\mathcal{N}(0, \mathcal I)\):
\[
p_\lambda(z) = \frac{1}{\sqrt{2\pi }^J} \exp \left(-\frac{\lVert z\rVert^2}{2}\right)
\]
- we do not make such strong assumptions on the decoder, but a usual setting
is similar to that of the encoder: \(\Theta\) parametrizes \(F+1\) functions
\(\mu_\theta(z)\in X\) and
\(\sigma_\theta^{(k)}(z) \in \mathbb{R}_+^\ast\) for \(k=1, \ldots, F\),
so that the probability measure \(p_\theta(x|z)\)
on \(X\) is the Gaussian measure with average \(\mu_\theta(z)\)
and covariance the diagonal matrix with diagonal elements
\((\sigma_\theta^{(k)}(z)^2)_{k=1}^F\).
In that context, for every \(x \in X\), the Kullback-Leibler divergence
with respect to \(z\) between \(p_{\phi}(z|x)\) and \(p_\lambda(z)\)
is given in closed form by
\[
KL\left(p_{\phi}(z|x)||p_{\lambda}(z)\right) =
\frac{1}{2} \sum_{j=1}^{J} \left( \mu_\phi^{(j)}(x)^2 + \sigma_\phi^{(j)}(x)^2 - 1 - \log\left(\sigma_\phi^{(j)}(x)^2\right) \right).
\]
Besides, for every \(x \in X\), the integral
\(\int_{z \in Z}p_{\phi}(z|x)\log\left(p_{\theta}(x|z)\right)dz\)
may be approximated by a Monte Carlo technique. Indeed, if a random variable \(\epsilon \in Z\)
follows the standard Gaussian law on \(Z\), then the random variable
\[
\mu_\phi(x) + \sigma_\phi(x) \odot \epsilon \in Z
\]
whose jth component is
\[
\left(\mu_\phi(x) + \sigma_\phi(x) \odot \epsilon\right)^{(j)} =
\mu_\phi(x)^{(j)} + \sigma_\phi^{(j)}(x) \epsilon^{(j)}
\]
follows the law \(p_\phi(z|x)\). Then, one can
approximate the integral by computing the sum
\[
\frac{1}{L}\sum_{l=1}^{L} \log\left(p_{\theta}(x|\mu_\phi(x) + \sigma_\phi(x) \odot \epsilon_{x,l})\right)
\]
where \(\epsilon_{x,1}, \ldots, \epsilon_{x,L}\) are
independent samples of the
standard Gaussian distribution on \(Z\).
Thus, the Loss function
is, in the encoder-decoder perspective, approximated by the sum
\begin{align*}
\mathrm{Loss}(\theta, \phi)
=& - \frac{1}{NL}\sum_{i=1}^{N}\sum_{l=1}^{L} \log\left(p_{\theta}(x_i|\mu_\phi(x_i) + \sigma_\phi(x_i) \odot \epsilon_{i,l})\right)
\\
& + \frac{1}{2N} \sum_{i=1}^{N}\sum_{j=1}^{J}
\left( \mu_\phi^{(j)}(x_i)^2 + \sigma_\phi^{(j)}(x_i)^2 - 1 - \log\left(\sigma_\phi^{(j)}(x_i)^2\right) \right).
\end{align*}
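The sketch below evaluates this loss once with numpy, using hypothetical linear maps in place of the neural networks that usually parametrize \(\mu_\phi\), \(\log\sigma_\phi\) and \(\mu_\theta\), and a unit-variance Gaussian decoder; everything in it (dimensions, maps, number of Monte Carlo samples \(L\)) is an illustrative assumption.

```python
# Variational autoencoder loss: reparameterised reconstruction term + closed-form KL term.
import numpy as np

rng = np.random.default_rng(4)
N, F, J, L = 100, 8, 2, 5
X = rng.normal(size=(N, F))                              # toy dataset, one sample per row

W_mu = rng.normal(size=(J, F))                           # placeholder map for mu_phi
W_logsig = 0.01 * rng.normal(size=(J, F))                # placeholder map for log sigma_phi
W_dec = rng.normal(size=(F, J))                          # placeholder map for mu_theta

mu = X @ W_mu.T                                          # mu_phi(x_i), shape (N, J)
sig = np.exp(X @ W_logsig.T)                             # sigma_phi^(j)(x_i), shape (N, J)

# Reconstruction term, estimated with the reparameterisation trick.
eps = rng.normal(size=(L, N, J))                         # epsilon_{i,l}, standard Gaussian on Z
z = mu[None] + sig[None] * eps                           # samples from p_phi(z|x_i)
x_hat = z @ W_dec.T                                      # mu_theta(z), shape (L, N, F)
log_dec = -0.5 * np.sum((X[None] - x_hat) ** 2, axis=-1) - 0.5 * F * np.log(2 * np.pi)
reconstruction = -np.mean(log_dec)                       # average over both i and l

# Closed-form KL term between the encoder and the standard Gaussian prior.
kl = 0.5 * np.mean(np.sum(mu ** 2 + sig ** 2 - 1.0 - np.log(sig ** 2), axis=1))

print(reconstruction + kl)
```

In practice one would implement the same computation in an automatic differentiation framework and minimise it by stochastic gradient descent over the parameters of the three maps.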
The Gaussian mixture model.
The Gaussian mixture model is a soft clustering method
that leverages the framework described above. We assume that \(X\) is a Euclidean space
\(X= \mathbb{R}^F\), and \(Z= \{0, \ldots, K-1\}\) is a finite set that represents
the clusters. Besides
- The parameter space \(\Lambda\) is the set
\[
\Lambda = \left\{(\pi_0, \ldots, \pi_{K-1})| \pi_k \geq 0,\ \sum_k \pi_k = 1\right\}.
\]
It is isomorphic to the set of all probability distributions on \(Z\)
and for \(\lambda = (\pi_0, \ldots, \pi_{K-1})\) and \(z\in Z\)
\[
p_\lambda (z) = \pi_z.
\]
- The parameter space \(\Theta\) is the product set whose elements are tuples
\(\theta = (\mu_0, S_0, \ldots, \mu_{K-1}, S_{K-1})\) made up of vectors
\(\mu_0, \ldots, \mu_{K-1}\) in \(X = \mathbb R^F\) and \(F \times F\) covariance matrices
(that is, positive definite symmetric matrices)
\(S_0, \ldots, S_{K-1}\). Then, the decoder probability distribution conditional to \(z \in Z\)
is the Gaussian distribution with average \(\mu_z\) and covariance matrix \(S_z\)
\[
p_\theta(x|z) = \mathcal{N}(x|\mu_z, S_z)
= \frac{1}{\sqrt{2\pi}^F\sqrt{\det(S_z)}} \exp \left(-\frac{1}{2} (x-\mu_z)^tS_z^{-1}(x-\mu_z)\right).
\]
- The parameter space \(\Phi\) is a copy of the product space \(\Theta \times \Lambda\)
and given
\[
\phi = (\theta^{(e)}, \lambda^{(e)}),
\]
where
\begin{align*}
\theta^{(e)} = (\mu_0^{(e)}, S_0^{(e)}, \ldots, \mu_{K-1}^{(e)}, S_{K-1}^{(e)})
\\
\lambda^{(e)} = (\pi_0^{(e)}, \ldots, \pi_{K-1}^{(e)})
\end{align*}
then the encoder conditional probability distribution
is the Bayes encoder associated with \((\theta^{(e)}, \lambda^{(e)})\)
\[
p_\phi(z|x) = \frac{\pi_z^{(e)} \mathcal{N}(x|\mu_z^{(e)}, S_z^{(e)})}{\sum_{k=0}^{K-1} \pi_k^{(e)} \mathcal{N}(x|\mu_k^{(e)}, S_k^{(e)})}.
\]
Thus the whole parameter space is made up of the tuples
\[
\left(\pi_0, \mu_0, S_0,\pi_0^{(e)}, \mu_0^{(e)}, S_0^{(e)},
\ldots, \pi_{K-1}, \mu_{K-1}, S_{K-1},\pi_{K-1}^{(e)}, \mu_{K-1}^{(e)}, S_{K-1}^{(e)} \right)
\]
where, for \(z \in Z\), \(\pi_z, \pi_z^{(e)}\) are non-negative numbers,
\(\mu_z, \mu_z^{(e)}\) are vectors in \(X\), \(S_z, S_z^{(e)}\)
are positive definite symmetric \(F\times F\) matrices and
\[
\sum_{z=0}^{K-1} \pi_z = 1 = \sum_{z=0}^{K-1} \pi_z^{(e)}.
\]
The elements that have an \((e)\) superscript are the parameters of the encoder.
In this context, one can perform the EM-algorithm as follows (a code sketch of the resulting procedure is given after the two steps):
- The expectation step consists in replacing the encoder parameters
by their counterparts in the decoder and the latent distributions
\begin{align*}
\pi_z^{(e)} &\leftarrow \pi_z
\\
\mu_z^{(e)} &\leftarrow \mu_z
\\
S_z^{(e)} &\leftarrow S_z.
\end{align*}
- The maximisation step consists in maximising
\[
\frac{1}{N} \sum_{i=1}^{N} \sum_{z=0}^{K-1} p_\phi(z|x_i) \left(\log(\pi_z) + \log\mathcal{N}(x_i|\mu_z, S_z)\right)
\]
with respect to \(\pi_z, \mu_z, S_z\) for \(z=0, \ldots, K-1\), leaving \(\phi\),
that is \(\pi_z^{(e)}, \mu_z^{(e)}, S_z^{(e)}\) for \(z=0, \ldots, K-1\), constant.
This yields for every cluster \(z\)
\begin{align*}
\pi_z &= \frac{1}{N} \sum_{i=1}^{N} p_\phi(z|x_i)
\\
\mu_z &= \frac{\sum_{i=1}^{N} p_\phi(z|x_i) x_i}{\sum_{i=1}^{N} p_\phi(z|x_i)}
\\
S_z &= \frac{\sum_{i=1}^{N} p_\phi(z|x_i) (x_i-\mu_z)(x_i-\mu_z)^t}{\sum_{i=1}^{N} p_\phi(z|x_i)}.
\end{align*}
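The sketch below runs this EM procedure with numpy on a hypothetical two-dimensional dataset made of three well-separated blobs; the initialisation and the number of iterations are illustrative choices.

```python
# EM for the Gaussian mixture model: responsibilities (E-step) and closed-form updates (M-step).
import numpy as np

rng = np.random.default_rng(5)
N, F, K = 300, 2, 3
X = np.concatenate([rng.normal(loc=4.0 * k, size=(N // K, F)) for k in range(K)])

pi = np.full(K, 1.0 / K)                                  # pi_z
mu = X[rng.choice(N, size=K, replace=False)]              # mu_z, shape (K, F)
S = np.stack([np.eye(F) for _ in range(K)])               # S_z, shape (K, F, F)

def gaussian_density(X, m, C):
    """N(x | m, C) evaluated at every row of X."""
    d = X - m
    quad = np.einsum('nf,fg,ng->n', d, np.linalg.inv(C), d)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** F * np.linalg.det(C))

for _ in range(50):
    # E-step: the encoder parameters are set so that p_phi(z|x_i) is the Bayes encoder.
    dens = np.stack([pi[k] * gaussian_density(X, mu[k], S[k]) for k in range(K)], axis=1)
    resp = dens / dens.sum(axis=1, keepdims=True)          # p_phi(z|x_i), shape (N, K)

    # M-step: closed-form updates of pi_z, mu_z, S_z given the responsibilities.
    Nk = resp.sum(axis=0)                                  # sum_i p_phi(z|x_i)
    pi = Nk / N
    mu = resp.T @ X / Nk[:, None]
    for k in range(K):
        d = X - mu[k]
        S[k] = (resp[:, k, None] * d).T @ d / Nk[k]

print(pi)
print(mu)
```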
Bibliography.
- [1] Dempster, A.P., Laird, N.M., Rubin, D.B. (1977). "Maximum Likelihood from Incomplete Data via the EM Algorithm". Journal of the Royal Statistical Society, Series B, 39(1), 1–38.
- [2] Hinton, G.E., Salakhutdinov, R.R. (2006). "Reducing the dimensionality of data with neural networks". Science, 313(5786), 504–507.
- [3] Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.-A. (2008). "Extracting and composing robust features with denoising autoencoders". Proceedings of the 25th International Conference on Machine Learning, 1096–1103.
- [4] Kingma, D.P., Welling, M. (2014). "Auto-encoding variational Bayes". ICLR 2014.
- [5] Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer.