What’s the difference between GANs and normalizing flows?
GANs are trained to take a random vector (ex. sampled from Gaussian noise) and produce a data point (ex. an image), using a generator that produces images and a discriminator that judges their quality. Normalizing flow models are trained to take a data point (ex. an image) and map it to a point under a simple distribution (ex. a Gaussian), by minimizing the negative log-likelihood of the data under the model. At inference time for GANs you use the same procedure as during training: sample noise and pass it to the generator to produce an image. For normalizing flow models, the entire model is invertible, so you run the reverse of the training process: sample from the noise distribution and invert the flow to produce an image.
Unlike autoregressive models and variational autoencoders, deep normalizing flow models require specific architectural structures:
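As a minimal sketch of these two sampling directions (the toy functions `generator`, `flow_forward`, and `flow_inverse` below are made-up stand-ins, not any particular library's API):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "generator": any function noise -> data works for a GAN at inference time.
def generator(z):
    return np.tanh(z @ np.array([[2.0, 0.0], [0.0, 0.5]]) + 1.0)

# Toy "flow": must be invertible, so both directions are defined explicitly.
A = np.array([[2.0, 0.0], [0.3, 0.5]])   # an invertible matrix
b = np.array([1.0, -1.0])

def flow_inverse(x):          # f: data -> latent (the direction used during training)
    return (x - b) @ np.linalg.inv(A).T

def flow_forward(z):          # g = f^{-1}: latent -> data (used for sampling)
    return z @ A.T + b

z = rng.standard_normal((4, 2))       # sample simple (Gaussian) noise
x_gan  = generator(z)                 # GAN inference: noise -> image
x_flow = flow_forward(z)              # flow inference: noise -> image via g

# Unlike the GAN generator, the flow can be run backwards exactly:
assert np.allclose(flow_inverse(x_flow), z)
```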
- The input and output dimensions must be the same.
- The transformation must be invertible.
- Computing the determinant of the Jacobian needs to be efficient (and differentiable). This is due to the change of variables formula (see below).
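To make these three requirements concrete, here is a sketch (not from the original text) of an affine coupling layer in the style of RealNVP, written with NumPy: the input and output have the same dimension, the inverse is available in closed form, and the Jacobian is triangular so its log-determinant is just a sum.

```python
import numpy as np

def net(x_a):
    """Stand-in for a neural network; any function of x_a works here because
    it never needs to be inverted."""
    scale = np.tanh(x_a)          # log-scale s(x_a)
    shift = x_a ** 2              # shift t(x_a)
    return scale, shift

def coupling_forward(x):
    """x -> z. Splits x in half; transforms one half conditioned on the other."""
    d = x.shape[-1] // 2
    x_a, x_b = x[..., :d], x[..., d:]
    s, t = net(x_a)
    z_b = x_b * np.exp(s) + t     # elementwise affine transform
    log_det = s.sum(axis=-1)      # Jacobian is triangular: log|det| = sum of log-scales
    return np.concatenate([x_a, z_b], axis=-1), log_det

def coupling_inverse(z):
    """z -> x, the exact inverse of coupling_forward."""
    d = z.shape[-1] // 2
    z_a, z_b = z[..., :d], z[..., d:]
    s, t = net(z_a)               # z_a == x_a, so the same s, t are recovered
    x_b = (z_b - t) * np.exp(-s)
    return np.concatenate([z_a, x_b], axis=-1)

x = np.random.default_rng(0).standard_normal((3, 4))
z, log_det = coupling_forward(x)
assert z.shape == x.shape                         # same input/output dimension
assert np.allclose(coupling_inverse(z), x)        # invertible
```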
Math
In simple words, a normalizing flow is a series of simple functions that are invertible, i.e., whose analytical inverses can be calculated.
What is an invertible function?
For example, a function like $f(x) = x + 2$ is a reversible function because for each input a unique output exists and vice versa, whereas $f(x) = x^2$ is not a reversible function because an output of 9 can correspond to either 3 or -3. The inverse of $f$ exists if and only if $f$ is a bijective function (it maps each input to exactly one output and vice versa).
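A quick numerical illustration of this difference (a sketch using the example functions above):

```python
# f(x) = x + 2 is bijective: every output comes from exactly one input.
f     = lambda x: x + 2
f_inv = lambda y: y - 2
assert f_inv(f(3)) == 3 and f_inv(f(-3)) == -3

# g(x) = x**2 is not bijective: two different inputs give the same output,
# so no single-valued inverse exists.
g = lambda x: x ** 2
assert g(3) == g(-3) == 9
```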
Let $\mathbf{x}$ be a high-dimensional random vector with an unknown true distribution $\mathbf{x} \sim p^*(\mathbf{x})$. We collect an i.i.d. dataset $\mathcal{D}$, and choose a model $p_\theta(\mathbf{x})$ with parameters $\theta$.
In the context of a dataset of images, $\mathbf{x}$ would represent a high-dimensional vector that encodes the pixel values of an image. Each element of the vector corresponds to a pixel in the image, and the dimensionality of the vector is equal to the total number of pixels in the image. The true distribution $p^*(\mathbf{x})$ would represent the distribution of all possible images that could be generated from the dataset (this is unknown), and the goal of a flow-based generative model would be to learn a model $p_\theta(\mathbf{x})$ that can generate new images that are similar to the images in the dataset.
The dataset being i.i.d. means it is “independent and identically distributed.” In the context of an image dataset, it means that the presence of one image doesn’t affect the probability of the next image, and that every image is drawn from the same underlying distribution. Stats Stack Exchange Post.
In the case of discrete data $\mathbf{x}$, the log-likelihood objective is then equivalent to minimizing:

$$\mathcal{L}(\mathcal{D}) = \frac{1}{N} \sum_{i=1}^{N} -\log p_\theta(\mathbf{x}^{(i)})$$

This is taking the average (summing over all examples and then dividing by the number of examples $N$) of the negative log of the probability that your learned model $p_\theta$ assigns to training example $\mathbf{x}^{(i)}$.
Optimization is done through stochastic gradient descent using minibatches of data.
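Here is a toy sketch of that training loop, assuming a one-dimensional affine flow $z = (x - \mu) e^{-\log s}$ with a standard normal base density (the dataset, parameter names, and learning rate are made up for illustration; gradients are written out by hand instead of using autodiff):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=2.0, size=10_000)    # toy "dataset"

# 1-D affine flow: z = f_theta(x) = (x - mu) * exp(-log_s), base p(z) = N(0, 1).
mu, log_s = 0.0, 0.0                                   # parameters theta
lr, batch_size = 0.1, 256

def nll(x, mu, log_s):
    """-log p_theta(x) = -log N(z; 0, 1) - log|dz/dx| for this flow."""
    z = (x - mu) * np.exp(-log_s)
    return 0.5 * z**2 + 0.5 * np.log(2 * np.pi) + log_s

for step in range(500):
    batch = rng.choice(data, size=batch_size)          # minibatch of data
    z = (batch - mu) * np.exp(-log_s)
    grad_mu    = np.mean(-z * np.exp(-log_s))          # d(mean NLL)/d mu
    grad_log_s = np.mean(1.0 - z**2)                   # d(mean NLL)/d log_s
    mu    -= lr * grad_mu
    log_s -= lr * grad_log_s

print(mu, np.exp(log_s))   # should approach the data's mean (3) and std (2)
```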
In most flow-based generative models the generative process is defined as:

$$\mathbf{z} \sim p_\theta(\mathbf{z}), \qquad \mathbf{x} = g_\theta(\mathbf{z})$$
where:
- $\mathbf{z}$ is the latent variable
- $p_\theta(\mathbf{z})$ has a simple density (ex. a standard normal distribution $\mathcal{N}(\mathbf{z}; 0, \mathbf{I})$).
- $g_\theta$ is an invertible (aka bijective) function, so to produce a latent variable $\mathbf{z}$ from a datapoint $\mathbf{x}$ you compute $\mathbf{z} = f_\theta(\mathbf{x}) = g_\theta^{-1}(\mathbf{x})$. This means $f_\theta$ is the inverse of $g_\theta$.
What is a latent variable?
A latent variable, in the context of statistics and data analysis, is a variable that is not directly observed but is inferred or estimated from other observed variables.
During the training process of the model, you take an image $\mathbf{x}$ and pass it to the inverse of $g_\theta$, which is $f_\theta$ (this is the model you are learning during training). $f_\theta$ then produces $\mathbf{z}$, a vector that belongs to the simple density function you are trying to learn. For instance, if your density function has dimension 256, then $f_\theta$ will produce a vector of dimension 256 that corresponds to the image’s representation in the density function’s space.
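As a sketch of both directions (the shapes and the map $g_\theta$ below are made up for illustration, not from the text): a 16×16 image flattens to a 256-dimensional vector $\mathbf{x}$, $f_\theta$ maps it to a 256-dimensional $\mathbf{z}$ under the base density, and $g_\theta$ maps samples of $\mathbf{z}$ back to image space.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 256                                   # e.g. a flattened 16x16 image

# A toy invertible g_theta: a random linear map (almost surely invertible) plus a shift.
A = rng.standard_normal((dim, dim))
b = rng.standard_normal(dim)
A_inv = np.linalg.inv(A)

def g(z):                 # latent -> image space (sampling direction)
    return A @ z + b

def f(x):                 # image -> latent (training direction), f = g^{-1}
    return A_inv @ (x - b)

x = rng.random(dim)       # a fake flattened "image"
z = f(x)                  # its 256-dimensional representation in latent space
assert z.shape == (dim,)
assert np.allclose(g(z), x)

z_new = rng.standard_normal(dim)   # sample from the simple density N(0, I)
x_new = g(z_new)                   # a new point in image space
```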
We focus on functions where $f_\theta$ (and, likewise, $g_\theta$) is composed of a sequence of transformations: $f_\theta = f_1 \circ f_2 \circ \cdots \circ f_K$, such that the relationship between $\mathbf{x}$ and $\mathbf{z}$ can be written as:

$$\mathbf{x} \xrightarrow{f_1} \mathbf{h}_1 \xrightarrow{f_2} \mathbf{h}_2 \cdots \xrightarrow{f_K} \mathbf{z}$$

This just means that you compose a series of invertible functions in a sequence such that you can take $\mathbf{x}$ to $\mathbf{z}$ in a way that you can then perform the inverse computation.
Such a sequence of invertible transformations is also called a (normalizing) flow.
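A sketch of such a composition, assuming each step is a simple invertible function (these particular steps are arbitrary examples): the forward pass applies $f_1$ through $f_K$ in order, and the inverse applies the individual inverses in reverse order.

```python
import numpy as np

# Each "flow step" is a pair (forward, inverse) of elementwise invertible functions.
steps = [
    (lambda h: h * 2.0,     lambda h: h / 2.0),
    (lambda h: h + 1.0,     lambda h: h - 1.0),
    (lambda h: np.tanh(h),  lambda h: np.arctanh(h)),
]

def flow(x):
    """x -> h1 -> h2 -> ... -> z"""
    h = x
    for fwd, _ in steps:
        h = fwd(h)
    return h

def flow_inverse(z):
    """z -> ... -> h2 -> h1 -> x : apply the inverses in reverse order."""
    h = z
    for _, inv in reversed(steps):
        h = inv(h)
    return h

x = np.array([0.1, -0.4, 0.25])
assert np.allclose(flow_inverse(flow(x)), x)
```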
Under the change of variables formula, the probability density function (pdf) of the model given a datapoint $\mathbf{x}$ can be written as:

$$\log p_\theta(\mathbf{x}) = \log p_\theta(\mathbf{z}) + \log \left| \det \left( \frac{d\mathbf{z}}{d\mathbf{x}} \right) \right|$$

which means that the log probability of a datapoint $\mathbf{x}$ is equivalent to something you can represent in terms of the log probability of the simple distribution of your latent variable $\mathbf{z}$ (plus the log-determinant of the Jacobian of the transformation).
You can therefore represent the log-likelihood objective you are minimizing as:

$$\mathcal{L}(\mathcal{D}) = \frac{1}{N} \sum_{i=1}^{N} -\left[ \log p_\theta(\mathbf{z}^{(i)}) + \log \left| \det \left( \frac{d\mathbf{z}}{d\mathbf{x}} \right) \right| \right]$$
which is equivalent to:

$$\mathcal{L}(\mathcal{D}) = \frac{1}{N} \sum_{i=1}^{N} -\left[ \log p_\theta(\mathbf{z}^{(i)}) + \sum_{k=1}^{K} \log \left| \det \left( \frac{d\mathbf{h}_k}{d\mathbf{h}_{k-1}} \right) \right| \right]$$

where we define $\mathbf{h}_0 := \mathbf{x}$ and $\mathbf{h}_K := \mathbf{z}$. This is useful since you can find out the probability of a point under a normal distribution, while you couldn’t directly evaluate the probability of an image ($\mathbf{x}$), so you can now actually calculate your objective.
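Putting the pieces together, here is a sketch (with a made-up two-step flow) of how the intractable $\log p_\theta(\mathbf{x})$ becomes a sum of computable terms: the Gaussian log-density of $\mathbf{z}$ plus the log-determinant contributed by each transformation.

```python
import numpy as np

def standard_normal_logpdf(z):
    """log p(z) for a standard multivariate Gaussian base distribution."""
    return -0.5 * np.sum(z**2) - 0.5 * z.size * np.log(2 * np.pi)

# Two toy flow steps x -> h1 -> z, each returning (output, log|det Jacobian|).
def step1(h):                       # elementwise scaling
    scale = np.array([2.0, 0.5])
    return h * scale, np.sum(np.log(np.abs(scale)))

def step2(h):                       # shift: Jacobian is the identity, log|det| = 0
    return h + np.array([1.0, -3.0]), 0.0

def log_likelihood(x):
    """log p_theta(x) = log p_theta(z) + sum_k log|det(dh_k / dh_{k-1})|"""
    h, total_log_det = x, 0.0
    for step in (step1, step2):
        h, log_det = step(h)
        total_log_det += log_det
    z = h
    return standard_normal_logpdf(z) + total_log_det

x = np.array([0.3, -1.2])
print(log_likelihood(x))   # a number we can actually compute and maximize
```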