1 Flow Matching and Diffusion

This section explores flow matching and diffusion models based on the MIT lecture series. Also see lecture notes (Holderrieth and Erives 2025) and materials. More advanced, supplemental proofs are referenced from (Li 2025).

  1. Flow models parameterize the vector field (in the same space as data, for Euclidean data) which induces a specified probability path. Diffusion models generalize this by adding Brownian noise, ODE \(\to\) SDE.
  2. Crucial bridge between vector field and probability path: the continuity equation 1.1, resp. the Fokker-Planck equation 1.2 in the SDE case.
    • Given the probability velocity vector field \(u_t\) and \(p_{\mathrm{init}}\), we can directly compute \(\partial_{t} p_t\) via the continuity (resp. Fokker-Planck) equation. Conversely, given \(p_t\), we can always construct a (non-unique) vector field which matches the probability path.
  3. Conditioning on a single data point \(z\) with Dirac delta target \(\delta_z\), the conditional vector field has a simple analytic solution. Apply the marginalization trick 1.3 to generalize to the whole empirical data distribution.
    • Very important remark: fixing \(p_{\mathrm{init}}\), the mapping \(u_t\mapsto p_t\) is not linear. The vector field \(u_t\) specifies the probability velocity, while a mixture linearly mixes the probability flux \(u_tp_t\).
  4. The marginal vector field 1.3 and score generally involve intractable integrals, but crucially, minimizing the MSE against the conditional target yields the same gradients as minimizing it against the marginal target (theorem 1.4).
  5. Key leverage via deep learning : having solved the simple conditional case analytically, we’re relying on the model to:
    • Model marginal quantities by optimization equivalence, without resorting to intractable integrals (very smart theme!!)
    • Generalize / interpolate the vector field to assign mass to unseen regions of the data manifold.
  6. Under the Gaussian probability path, the network essentially learns to predict the noise which is used to corrupt data; the score and vector field targets are interconvertible (proposition 1.3).
  7. Classifier-free guidance 1.3 contrastively amplifies the difference between the guided and unguided vector fields. Under the Gaussian path, this has the nice interpretation of manually upweighting the score contribution from the classification log-likelihood. See the detailed training procedure 1.4.

Modeling

  • We presume an underlying, generally unknown, data distribution \(p_{\mathrm{data}}\) supported on \(\mathbb R^d\).
  • Dataset consists of finite samples \(\{z_1, \dots, z_N\}\sim p_{\mathrm{data}}\), effectively yielding an empirical distribution \(\hat p_{\mathrm{data}}\).
  • Unconditional generation consists of (1) estimating \(p_{\mathrm{data}}\) from \(\hat p_{\mathrm{data}}\), then (2) drawing samples from the estimate.
  • Assuming a joint distribution with some conditioning variable \(y\sim p_y\), conditional generation consists of sampling from the conditional distribution \(p_{\mathrm{data}}(\cdot \mid y)\).

In flow models, we model the data via a mapping \(\Phi_1:\mathbb R^d\to \mathbb R^d\) between the standard Gaussian initial distribution \(p_{\mathrm{init}}=\mathcal N(0, I_d)\) and the data distribution. This mapping is modeled as a flow whose time-derivative is parameterized by a deep model with parameters \(\theta\). Consider a vector field \[ (X_t, t)\mapsto u_t(X_t) \quad \text{of type}\quad u_{(\cdot)}(\cdot): \mathbb R^d\times \mathbb R\to \mathbb R^d \] Treating the vector field \(u_t\) as an integrable operator, the flow \(\Phi_t: \mathbb R^d\to \mathbb R^d\) is obtained by integrating the vector field from time \(0\) to time \(t\): \[ \Phi_t(x_0) = \exp \left[\int_0^t u_\tau\, d\tau\right]x_0 \, \iff \partial_{t} \Phi_t(x_0) = u_t\left[\Phi_t(x_0)\right] \] Here the exponential is formal shorthand for the solution map of the ODE on the right. Such a model has deterministic dynamics, i.e. once \(x_0\sim p_{\mathrm{init}}\) is fixed, the sample is fixed. We may instead consider stochastic dynamics: \[ dX_t = u_t(X_t)\, dt + \sigma_t\, dW_t \iff \Phi_t(x_0) = \exp \left[ \int_0^t u_\tau \, d\tau + \int_0^t \sigma_\tau \, dW_\tau \right] x_0 \] where \(W_t\) is standard Brownian motion. This concludes our specification of the generation procedure as a function of the model parameters \(\theta\) and \(\sigma_t\). To generate a new data sample from our model:

  1. Sample \(\epsilon\sim p_{\mathrm{init}}\).
  2. Compute \(x_1 = \Phi_1(x_0=\epsilon) = \exp \left[ \int_0^1 u_\tau \, d\tau + \int_0^1 \sigma_\tau \, dW_\tau \right] \epsilon\) using a numerical SDE solver (e.g. Euler-Maruyama).
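The two sampling steps above can be sketched with an Euler-Maruyama discretization. This is a minimal illustration rather than the lecture's reference implementation: the function name `euler_maruyama`, the callables `u(x, t)` and `sigma(t)`, and the toy field \(u_t(x)=-x\) are all assumptions made for the example, standing in for a trained \(u_t^\theta\).

```python
import numpy as np

def euler_maruyama(u, sigma, x0, n_steps=1000, rng=None):
    """Integrate dX_t = u(X_t, t) dt + sigma(t) dW_t from t = 0 to t = 1.

    u and sigma are hypothetical callables standing in for the learned
    vector field and the chosen diffusion coefficient.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    h = 1.0 / n_steps
    for k in range(n_steps):
        t = k * h
        # Brownian increment W_{t+h} - W_t ~ N(0, h I)
        dW = np.sqrt(h) * rng.standard_normal(x.shape)
        x = x + h * u(x, t) + sigma(t) * dW
    return x

# Toy deterministic check (sigma = 0): for u_t(x) = -x the exact flow is
# x_1 = exp(-1) * x_0, which the Euler scheme approximates.
x0 = np.ones(3)
x1 = euler_maruyama(lambda x, t: -x, lambda t: 0.0, x0)
```

Setting `sigma` to zero recovers plain Euler integration of the flow ODE, so the same routine covers both the flow and the diffusion case.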

Vector fields and distributions

The next order of business is to train \(\theta\) so that \(x_1=\Phi_1(x_0)\) with \(x_0\sim p_{\mathrm{init}}\) is approximately distributed as \(p_{\mathrm{data}}\). The main idea is to integrate tractable conditional solutions to obtain the full, marginalized solution. Given a (continuous) probability path \(p_{(\cdot)}(\cdot): \mathbb R\to \Delta(\mathbb R^d)\), we can relate it to a vector field \(u_{(\cdot)}:\mathbb R\times \mathbb R^d\to \mathbb R^d\) which realizes it as follows:

Theorem 1.1 (continuity equation) Under regularity conditions, define \(x_0\sim p_{\mathrm{init}}\) (note that \(x_0\) is a random variable while \(p_{\mathrm{init}}\) is a distribution) and \(x_t\) as the image of the flow \[ x_t = \exp \left[ \int_0^t u_\tau \, d\tau \right]\, x_0\iff \partial_{t} x_t = u_t(x_t) \] Let \(p_t\) denote the induced distribution of \(x_t\), then \(\partial_{t} \big|_\tau\, p_t = -\nabla \cdot (u_\tau p_\tau)\). The converse also holds, i.e. if \(p_0=p_{\mathrm{init}}\) and \(\partial_{t} \big|_\tau\, p_t = -\nabla \cdot (u_\tau p_\tau)\) for all \(\tau\), then the flow \(\partial_{t} x_t = u_t(x_t)\) satisfies \(x_t\sim p_t\).

Proof sketch: use a test function, the reverse product rule, and the divergence theorem. The main idea is to write \(\partial_{t} \mathbb{E}_{p_t}[f]\) as \(\mathbb{E}_{\partial_{t} p_t}[f]\) as well as an \(f\)-integral. Expand the \(f\)-integral using the chain rule and product rule, and apply the divergence theorem to get rid of one term. Consider an arbitrary, time-invariant smooth test function \(f\). First convince ourselves that we have \[ \int f(x_t) p_0(x_0)\, dx_0 = \int f(x_t) p_t(x_t)\, dx_t \] To see this rigorously, let \(x_t = \Phi_t(x_0)\), then by the change-of-variables formula we have \(dx_t = |\det (D\Phi_t)|\, dx_0\). On the other hand, by the pushforward measure we have \(p_t(\Phi_t(x_0)) = p_0(x_0)\, |\det (D\Phi_t)|^{-1}\). Proceeding: \[\begin{align} \partial_{t} \mathbb{E}_{p_t} [f] &= \partial_{t} \int f(x_t) p_0(x_0)\, dx_0 = \partial_{t} \int f\left[\exp \left( \int_0^t u_\tau\, d\tau \right) x_0 \right] p_0(x_0)\, dx_0 \\ &= \int \left[\nabla \big|_{x_t} f \right]\cdot u_t(x_t) \left[p_0(x_0)\, dx_0\right] = \int \left[\nabla \big|_{x_t} f \right]\cdot \left[u_t(x_t) p_t(x_t)\right]\, dx_t \\ &= \int \nabla \cdot (f u_t p_t)\big|_{x_t} - f(x_t) \nabla\cdot (u_tp_t)\big|_{x_t} \, dx_t = -\int f(x)\, \nabla \cdot (u_t p_t)\, dx \end{align}\] The first term integrates to zero by the divergence theorem, since \(p_t\) vanishes on far boundaries. On the other hand, we also have \[\begin{align} \partial_{t} \mathbb{E}_{p_t} [f] &= \partial_{t} \int f(x) p_t(x)\, dx = \int f(x)\partial_{t} p_t(x)\, dx \end{align}\] Since \(f\) is arbitrary, combining the two expressions yields \(\partial_{t} p_t = -\nabla \cdot (u_t p_t)\), as desired.

Salient points:

  1. Given \(u_t\), we can use it to compute \(\partial_{t} p_t\); given \(p_0\), this gives us the rest of \(p_t\).
  2. Given \(\{p_t\}\), we can obtain a (non-unique) vector field which generates it by computing \(\partial_{t} p_t\) and fitting it to the divergence. This is always possible under regularity conditions because vector fields have more degrees of freedom.
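As a sanity check of theorem 1.1, the continuity equation can be verified numerically on a case where everything is known in closed form. The example below is an assumption chosen for illustration, not taken from the lecture: in one dimension with \(u_t(x)=-x\) and \(x_0\sim\mathcal N(0,1)\), the flow is \(x_t=e^{-t}x_0\), so \(p_t=\mathcal N(0, e^{-2t})\). The residual \(\partial_t p_t + \partial_x(u_t p_t)\) is estimated with central finite differences.

```python
import numpy as np

def p(x, t):
    """Density of x_t = exp(-t) * x_0 with x_0 ~ N(0, 1), i.e. N(0, exp(-2t))."""
    var = np.exp(-2.0 * t)
    return np.exp(-x**2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def u(x, t):
    """Hypothetical contracting vector field used for this check."""
    return -x

def continuity_residual(x, t, eps=1e-4):
    """Finite-difference estimate of d/dt p_t + d/dx (u_t p_t)."""
    dp_dt = (p(x, t + eps) - p(x, t - eps)) / (2.0 * eps)
    flux = lambda y: u(y, t) * p(y, t)
    dflux_dx = (flux(x + eps) - flux(x - eps)) / (2.0 * eps)
    return dp_dt + dflux_dx

xs = np.linspace(-2.0, 2.0, 9)
residual = continuity_residual(xs, t=0.5)  # should vanish up to discretization error
```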

Theorem 1.2 (Fokker-Planck) Under regularity conditions, let \(x_0\sim p_{\mathrm{init}}\) and similarly define \[ x_t = \exp \left[ \int_0^t u_\tau \, d\tau + \int_0^t \sigma_\tau dW_\tau \right]\, x_0 \iff dx_t = u_t(x_t)\, dt + \sigma_t\, dW_t \] Define \(p_t\) such that \(x_t\sim p_t\) and define the Laplacian operator for a scalar function as \(\nabla^2 p \equiv \Delta p \equiv \nabla \cdot (\nabla p) = \mathrm{tr}(H_p)\). Then \[ \forall x, \tau: \partial_{t} \big|_\tau\, p_t = -\nabla \cdot (u_\tau p_\tau) + \dfrac{\sigma_\tau^2}{2} \nabla^2 p_\tau \] Again, note that \(u_t\) is a vector field (differentiation operator) and thus acts on scalar fields (or random variables) like \(p_t\) by application (so we need to apply the chain rule), while \(\sigma_t\) is simply a scalar, thus acting by multiplication.

Proof sketch: use a test function, the reverse product rule, and the divergence theorem. To first order, we obtain \[ x_{t+h} \approx x_t + h\, u_t(x_t) + \sigma_t (W_{t+h} - W_t) \] Here \(u_t(x_t)\) is the tangent random variable denoting the application of \(u_t\) to the realization of \(x_t\). Given a scalar function \(f\), a second-order Taylor expansion gives \[\begin{aligned} f(x_{t+h}) - f(x_t) &\approx (\nabla \big|_{x_t} f) \cdot \left[ h\, u_t(x_t) + \sigma_t(W_{t+h} - W_t)\right] \\ &\quad + \dfrac 1 2 \left[ h\, u_t(x_t) + \sigma_t(W_{t+h} - W_t)\right]^T \left(H_f \big|_{x_t}\right) \left[ h\, u_t(x_t) + \sigma_t(W_{t+h} - W_t)\right] \\ &= (\nabla f) \cdot \left[ h\, u_t(x_t) + \sigma_t(W_{t+h} - W_t)\right] + \dfrac{h^2} 2 u_t(x_t)^T \, H_f \, u_t(x_t) \\ &\quad + h \sigma_t \, u_t(x_t)^T H_f (W_{t+h} - W_t) + \dfrac{\sigma_t^2}{2} (W_{t+h} - W_t)^T H_f (W_{t+h} - W_t) \end{aligned}\] Taking expectation, note that \(\mathbb{E}[W_{t+h} - W_t]=0\) since \(W_{t+h} - W_t\sim \mathcal N(0, h I_d)\) is independent of \(x_t\). Also note that \(\mathbb{E}[\xi^T A\xi]=\mathbb{E}\,\mathrm{tr}(A \xi\xi^T)=\mathrm{tr}(A)\) for \(\xi\sim \mathcal N(0, I_d)\), so the last term contributes \(\dfrac{\sigma_t^2}{2}\, h\, \mathbb{E}[\mathrm{tr}(H_f)]\). Then \[\begin{aligned} \mathbb{E}[f(x_{t+h}) - f(x_t)] &= \mathbb{E}\left[ (\nabla f) \cdot u_t(x_t)\, h + \dfrac{\sigma_t^2}{2} \mathrm{tr}(H_f) \, h + O(h^2)\right] \\ \partial_{t} \mathbb{E}[f(x_t)] &= \mathbb{E}\left[(\nabla f)\cdot u_t(x_t) + \dfrac{\sigma_t^2}{2} \mathrm{tr}(H_f)\right] \end{aligned}\] Great! Next up, expand the integral and apply the reverse product rule piece by piece. The first term is familiar from the continuity equation 1.1: \[\begin{aligned} \mathbb{E}[(\nabla f) \cdot u_t(x_t)] &= \int p_t(x) (\nabla f) \cdot u_t(x)\, dx = -\int f(x) \nabla \cdot (p_t u_t)\, dx \end{aligned}\] By applying the reverse product rule twice, the second term is seen to be \[ \dfrac{\sigma_t^2}{2}\, \mathbb{E}[\mathrm{tr}(H_f)(x_t)] = \dfrac{\sigma_t^2}{2} \int (\Delta \big|_x f) p_t(x)\, dx = \dfrac{\sigma_t^2}{2} \int f(x) \Delta p_t\, dx \] Combining, we have proved that for any regular scalar function \(f\), we have \[\begin{aligned} \partial_{t} \mathbb{E}[f(x_t)] = \int f(x) \left(-\nabla \cdot (p_tu_t) + \dfrac{\sigma_t^2}{2} \Delta p_t\right)\, dx = \int f(x) \partial_{t} p_t(x)\, dx \end{aligned}\] Since \(f\) is arbitrary, this shows that \(\partial_{t} p_t= -\nabla \cdot (p_tu_t) + \dfrac{\sigma_t^2}{2} \Delta p_t\) almost everywhere; the converse direction follows from uniqueness of solutions.

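Remark (Langevin dynamics). Fix a density \(p\) (e.g. \(p=p_{\mathrm{init}}\)) and substitute \(u_t = \dfrac{\sigma_t^2}{2} \nabla \log p\) into the Fokker-Planck equation 1.2 with \(p_t=p\): since \(u_t p = \dfrac{\sigma_t^2}{2} \nabla p\), the divergence and Laplacian terms cancel and \(\partial_{t} p_t=0\). Hence the dynamics \[ dx_t = \dfrac{\sigma_t^2}{2} \nabla \log p(x_t)\, dt + \sigma_t\, dW_t \] has stationary distribution \(p\). This is called Langevin dynamics. In fact, under mild regularity conditions this SDE pulls any initial law \(p'\) to the fixed point \(p\).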
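The remark can be illustrated with a short simulation. The sketch below assumes a constant \(\sigma_t=\sqrt 2\), a one-dimensional standard normal target (score \(-x\)), and a deliberately misplaced initial law \(\mathcal N(3, 0.25)\); these choices and the name `langevin_sample` are illustrative assumptions. After enough Euler-Maruyama steps the particle cloud should have mean near 0 and variance near 1, up to discretization and Monte Carlo error.

```python
import numpy as np

def langevin_sample(score, x0, sigma=np.sqrt(2.0), h=0.01, n_steps=1000, rng=None):
    """Euler-Maruyama for dx = (sigma^2 / 2) * score(x) dt + sigma dW (constant sigma assumed)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        drift = 0.5 * sigma**2 * score(x)
        x = x + h * drift + sigma * np.sqrt(h) * rng.standard_normal(x.shape)
    return x

rng = np.random.default_rng(0)
# Initial law p' = N(3, 0.25), far from the target N(0, 1).
x_start = 3.0 + 0.5 * rng.standard_normal(5000)
# Score of the standard normal target: grad log p(x) = -x.
x_end = langevin_sample(lambda x: -x, x_start, rng=rng)
```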

Marginalization

The following theorem is the main tool for converting conditional models into marginal ones.

Theorem 1.3 (marginalization trick) Fix \(p_{\mathrm{data}}\) and conditional probability paths \(p_t(\cdot\mid z)\) interpolating between \(p_{\mathrm{init}}\) and \(\delta_z\) (e.g. definition 1.1). For each \(z\in \mathbb R^d\) let \(u^*_t(\cdot\mid z):\mathbb R^d\to \mathbb R^d\) denote a vector field which realizes the conditional probability path, i.e.  \[ \exp \left[ \int_0^t u^*_\tau(\cdot\mid z)\, d\tau \right]\, (\epsilon \sim p_{\mathrm{init}}) \sim p_t(\cdot \mid z) \] Then the marginal vector field defined as follows \[ u_t(x) \equiv \int u_t^*(x\mid z) p_t(z\mid x)\, dz, \quad p_t(z\mid x) = \dfrac{p_t(x\mid z) p_{\mathrm{data}}(z)}{p_t(x)} \] realizes the marginal probability path \[ \exp \left[ \int_0^t u_\tau\, d\tau \right]\, (\epsilon \sim p_{\mathrm{init}}) \sim p_t = \int p_t(\cdot\mid z)\, p_{\mathrm{data}}(z)\, dz \]

Per continuity equation 1.1, it suffices to check that the divergence of the proposed vector field matches the time-derivative of the marginal distribution \(p_t\): \[\begin{aligned} -\nabla \cdot (u_t p_t) \big|_x &= -\nabla \cdot \left[ p_t(x) \int u_t^*(x\mid z) p_t(z\mid x)\, dz \right] \\ &= -\int \nabla \cdot \left[u_t^*(x\mid z) p_t(x\mid z) p_{\mathrm{data}}(z) \right]\, dz \\ &= -\int p_{\mathrm{data}}(z)\, \nabla \cdot \left[u_t^*(x\mid z) p_t(x\mid z)\right]\, dz \\ &= \int p_{\mathrm{data}}(z) \partial_{t} p_t(x\mid z)\, dz = \partial_{t} p_t(x) \end{aligned}\]

Several important points:

  1. Fixing \(t\), we’re working with the joint distribution over \(z\sim p_{\mathrm{data}}\) and \(p_t(x_t\mid z)\). This underpins the simplification \(p_t(x) p_t(z\mid x) = p_t(x\mid z) p_{\mathrm{data}}(z)\).
  2. Divergence is taken w.r.t. the \(x\) variable! This is why \(p_{\mathrm{data}}(z)\) slides out of divergence as a constant.
  3. Proof sketch: compute the divergence by substituting ansatz for the marginal vector field \(u_t\), apply continuity equation to conditional vector field to convert divergence to time-derivative, then marginalize.
IMPORTANT: why \(u_t\neq \int u_t^*(\cdot \mid z) p_{\mathrm{data}}(z)\, dz\). Zeroth-order intuition seems to suggest that “since everything is linear”, let’s just substitute \(u_t(x) = \int u_t^*(x\mid z) p_{\mathrm{data}}(z)\, dz\). However, invoking the same machinery yields: \[\begin{aligned} -\nabla \cdot (u_t p_t) \big|_x &= -\nabla \cdot \left[ p_t(x) \int u_t^*(x\mid z) p_{\mathrm{data}}(z) \, dz \right] \\ &= -\int \nabla \cdot \left[u_t^*(x\mid z) p_t(x\mid z) \dfrac{p_t(x)p_{\mathrm{data}}(z)}{p_t(x\mid z)} \right]\, dz \end{aligned}\] Importantly, we don’t obtain the clean factoring-out of \(p_{\mathrm{data}}(z)\): the ratio \(p_t(x)/p_t(x\mid z)\) depends on \(x\) and stays inside the divergence. Linear superposition fails because the operator \(u_t\mapsto \partial_{t} p_t = -\nabla \cdot (u_tp_t)\), which maps \(u_t\) to the change in distribution it effects, has additional dependence upon \(p_t\), which depends nontrivially upon \(x\). Switching from conditional to marginal, we have \(p_t(x \mid z)\mapsto p_t(x)=\int p_t(x\mid z)p_{\mathrm{data}}(z)\, dz\); this introduces nontrivial \(x\)-dependence inside the divergence operator.

Alternative explanation: the marginal velocity at a point is the posterior average over all conditional velocities at that point, so \(u_t(x) = \mathbb{E}_{z}[u_t^*(x\mid z) \mid X_t=x]\). It’s critical to recognize here that the ensembling and conditioning are done w.r.t. the joint \(p_t(x, z)\).
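For a finite dataset the integral in theorem 1.3 becomes a sum, which makes the posterior weighting easy to see in code. The sketch below assumes the conditional Gaussian path of definition 1.1 with the linear schedule \(\alpha_t=t, \beta_t=1-t\), so that \(u_t^*(x\mid z)=(z-x)/(1-t)\) and \(p_t(z_i\mid x)\) is a softmax over \(-\|x-tz_i\|^2/(2(1-t)^2)\); the function name and the two-point dataset are hypothetical. It also evaluates the naive prior-weighted average to show how far the two can differ.

```python
import numpy as np

def marginal_vector_field(x, t, data):
    """Marginal field u_t(x) = sum_i p_t(z_i | x) u_t(x | z_i) for an empirical dataset.

    x: (d,), data: (N, d), 0 <= t < 1. Assumes alpha_t = t, beta_t = 1 - t.
    """
    alpha, beta = t, 1.0 - t
    sq_dist = np.sum((x[None, :] - alpha * data) ** 2, axis=1)
    logits = -sq_dist / (2.0 * beta**2)
    logits = logits - logits.max()          # stabilize the softmax
    weights = np.exp(logits)
    weights = weights / weights.sum()       # posterior p_t(z_i | x), uniform prior
    cond_fields = (data - x[None, :]) / beta
    return weights @ cond_fields, weights

data = np.array([[-1.0], [1.0]])            # hypothetical two-point dataset

# Symmetric point: both data points are equally likely, velocities cancel.
u_mid, w_mid = marginal_vector_field(np.array([0.0]), 0.5, data)

# Late time near z = +1: the posterior concentrates on z = +1.
x_late, t_late = np.array([0.9]), 0.9
u_late, w_late = marginal_vector_field(x_late, t_late, data)

# Naive (incorrect) prior-weighted average of conditional velocities.
u_naive = np.mean((data - x_late[None, :]) / (1.0 - t_late), axis=0)
```

At the late-time point the posterior-weighted field is close to \(+1\), while the naive average is dominated by the conditional velocity toward the far data point.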

Proposition 1.1 (marginalization of score function) Given a distribution \(p\), define its score function as \(\nabla \log p(x)\). The marginal score is connected to the conditional score by \[ \nabla \log p_t(x_t) = \int p_t(z\mid x_t) \nabla \log p_t(x_t\mid z)\, dz, \quad p_t(z\mid x_t) = \dfrac{p_t(x_t\mid z)p_{\mathrm{data}}(z)}{p_t(x_t)} \]

Optimization

Fixing \(p_{\mathrm{data}}\) and \(p_t^\mathrm{target}\), we can mathematically formulate \(u_t^\mathrm{target}\) using the continuity equation and marginalization trick. We parameterize \(u_t^\theta\) in hopes of approximating \(u_t^\mathrm{target}\). This can be done by minimizing the (marginal) flow matching loss \[ \mathcal L_\mathrm{FM}(\theta) = \mathbb{E}_{t\sim \mathrm{Unif}, z\sim p_{\mathrm{data}}, x\sim p_t(\cdot\mid z)} \left\|u_t^\theta(x) - u_t^\mathrm{target}(x)\right\|^2 \] The only problem is that \(u_t^\mathrm{target}\) is an intractable integral (a sum over the whole dataset for discrete datasets). A much more tractable quantity is the conditional flow matching loss which, upon drawing \(z\sim p_{\mathrm{data}}\), only asks the model to regress the conditional vector field for that \(z\): \[ \mathcal L_\mathrm{CFM}(\theta) = \mathbb{E}_{t\sim \mathrm{Unif}, z\sim p_{\mathrm{data}}, x\sim p_t(\cdot\mid z)} \left\|u_t^\theta(x) - u_t^\mathrm{target}(x\mid z)\right\|^2 \] Similarly define the score matching and conditional score matching losses: \[\begin{aligned} \mathcal L_{\mathrm{SM}}(\theta) &= \mathbb{E}_{t\sim \mathrm{Unif}, x\sim p_t} \left\|\nabla \log p_t^\theta(x) - \nabla \log p_t^{\mathrm{target}}(x)\right\|^2 \\ \mathcal L_{\mathrm{CSM}}(\theta) &= \mathbb{E}_{t\sim \mathrm{Unif}, z\sim p_{\mathrm{data}}, x\sim p_t(\cdot\mid z)} \left\|\nabla \log p_t^\theta(x) - \nabla \log p_t^{\mathrm{target}}(x\mid z)\right\|^2 \end{aligned}\]

Theorem 1.4 (optimization equivalence) \(\nabla_\theta \mathcal L_\mathrm{FM}(\theta) = \nabla_\theta \mathcal L_\mathrm{CFM}(\theta)\). Similarly, \(\nabla_\theta \mathcal L_\mathrm{SM}(\theta) = \nabla_\theta \mathcal L_\mathrm{CSM}(\theta)\).

Proof: Expanding the squared norm for both losses, it suffices to show that \[ \mathbb{E}[u_t^\theta(x)^T u_t^\mathrm{target}(x)] = \mathbb{E}[u_t^\theta(x)^T u_t^\mathrm{target}(x\mid z)] \] Expand integrals and apply the marginalization equation 1.3: \[\begin{aligned} \mathbb{E}_{t, x\sim p_t}[u_t^\theta(x)^T u_t^\mathrm{target}(x)] &= \iint p_t(x) u_t^\theta(x)^T \left[u_t^\mathrm{target}(x)\right] \, dx \, dt \\ &= \iint p_t(x) u_t^\theta(x)^T \int u_t^*(x\mid z) p_t(z\mid x)\, dz \, dx \, dt \\ &= \iiint u_t^\theta(x)^T u_t^*(x\mid z) p_t(x, z)\, dz\, dx\, dt \\ &= \mathbb{E}_{t, z\sim p_{\mathrm{data}}, x\sim p_t(\cdot\mid z)} [u_t^\theta(x)^T u_t^*(x\mid z)] \end{aligned}\]

Gaussian example

The next order of business is to work out a concrete example; this is almost the only example we’ll ever need to work out. We begin by specifying an almost trivial conditional probability path:

Definition 1.1 (conditional Gaussian probability path) Consider a single point \(z\in \mathbb R^d\). The following Gaussian probability path continuously collapses \(p_{\mathrm{init}}\) to \(\delta_z\): \[ p_t^\mathrm{target}(x\mid z) = \mathcal N(x; \alpha_t z, \beta_t^2 I_d) = \dfrac 1 {(2\pi \beta_t^2)^{d/2}} \exp \left(-\dfrac{\|x-\alpha_t z\|^2}{2\beta_t^2}\right) \] where \(\alpha_t, \beta_t\) are noise schedulers satisfying \(\alpha_0=\beta_1=0\) and \(\alpha_1=\beta_0=1\). Note that \(x, z\in \mathbb R^d\). Fixing \(z\), note that \(p_t^\mathrm{target}(\cdot\mid z)\) is the pushforward measure of \(p_{\mathrm{init}}\) under the flow \[ \psi_t^\mathrm{target}(x\mid z) = \alpha_t z + \beta_t x \]

Proposition 1.2 (Gaussian conditional vector field and score) Given \(\alpha_t, \beta_t\), the following vector field realizes the conditional Gaussian probability path: \[ u_t^\mathrm{target}(x\mid z) = \left(\dot \alpha_t - \dfrac{\dot \beta_t}{\beta_t} \alpha_t\right) z + \dfrac{\dot \beta_t}{\beta_t} x \] For the special linear-scheduler case \(\alpha_t=t, \beta_t=1-t\), we obtain \[ u_t(x\mid z) = \left(1 + \dfrac{t}{1-t}\right) z - \dfrac{1}{1-t} x = \dfrac{z-x} {1-t} \] The score function is \[ \nabla \log p_t^\mathrm{target}(x\mid z) = -\dfrac{x - \alpha_t z}{\beta_t^2} \]

IMPORTANT proof idea: instead of directly verifying the continuity equation, given \(p_t\) construct the vector field by (1) constructing a flow \(\psi_t(x_0)\) whose pushforward measure of \(p_0\) is \(p_t\), and (2) finding \(u_t\) such that \(\partial_{t} \psi_t(x_0) = u_t(\psi_t(x_0))\). Drop the \(\mathrm{target}\) superscript to reduce clutter. Instead of doing the hairy calculation for \(\partial_{t} p_t\), we recall that \(p_t\) is the pushforward of \(p_0=\mathcal N(0, I_d)\) under \(\psi_t\), so it suffices to show that \[ \partial_{t} \psi_t(x\mid z) = u_t(\psi_t(x\mid z)\mid z) \] Unpacking this a bit more: fixing \(z\) throughout, the map \(x\mapsto \psi_t(x\mid z)\) sends the initial \(x\sim p_{\mathrm{init}}\) to where it would be at time \(t\), such that \(x_t=\psi_t(x_0\mid z)\sim p_t\) for all \(t\). Also recall that given a flow \(x_0\mapsto \Phi_t(x_0)\), the vector field \(u_t\) generating the flow satisfies \[ \partial_{t} \Phi_t(x_0) = u_t(x_t) = u_t(\Phi_t(x_0)) \] Returning to our original equation, we have \[\begin{aligned} u_t(\psi_t(x\mid z)\mid z) &= \left( \dot \alpha_t - \dfrac{\dot \beta_t}{\beta_t} \alpha_t \right) z + \dfrac{\dot \beta_t}{\beta_t} (\alpha_t z + \beta_t x) \\ &= \dot \alpha_t z + \dot \beta_t x = \partial_{t} \psi_t \end{aligned}\]
  • Again, there are many vector fields which can realize the Gaussian conditional probability path. This gauge freedom is embedded in our choice of \(\psi_t\): there are many flow maps which realize the same specified pushforward measure.
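The identity \(\partial_t\psi_t(x\mid z)=u_t(\psi_t(x\mid z)\mid z)\) from proposition 1.2 is easy to confirm numerically. The sketch below uses a trigonometric schedule \(\alpha_t=\sin(\pi t/2), \beta_t=\cos(\pi t/2)\), which is an assumption chosen only because it satisfies the boundary conditions of definition 1.1 and is not linear; the time derivative of the flow is approximated by a central finite difference.

```python
import numpy as np

# Hypothetical schedule with alpha_0 = beta_1 = 0 and alpha_1 = beta_0 = 1.
def alpha(t):   return np.sin(0.5 * np.pi * t)
def beta(t):    return np.cos(0.5 * np.pi * t)
def d_alpha(t): return 0.5 * np.pi * np.cos(0.5 * np.pi * t)
def d_beta(t):  return -0.5 * np.pi * np.sin(0.5 * np.pi * t)

def psi(x0, z, t):
    """Conditional flow psi_t(x0 | z) = alpha_t z + beta_t x0."""
    return alpha(t) * z + beta(t) * x0

def u_cond(x, z, t):
    """Conditional vector field from proposition 1.2."""
    ratio = d_beta(t) / beta(t)
    return (d_alpha(t) - ratio * alpha(t)) * z + ratio * x

rng = np.random.default_rng(0)
x0, z = rng.standard_normal(4), rng.standard_normal(4)
t, eps = 0.3, 1e-5

lhs = (psi(x0, z, t + eps) - psi(x0, z, t - eps)) / (2.0 * eps)  # d/dt psi_t
rhs = u_cond(psi(x0, z, t), z, t)                                # u_t(psi_t)
```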

Proposition 1.3 (flow and score matching for Gaussian paths)

Applying 1.2, we obtain the losses which make the training protocol self-evident: \[\begin{aligned} \mathcal L_\mathrm{CFM}(\theta) &= \mathbb{E}_{t\sim \mathrm{Unif}, z\sim p_{\mathrm{data}}, x\sim \mathcal N(\alpha_tz, \beta_t^2I_d)} \left\| u_t^\theta(x) - \left(\dot \alpha_t - \dfrac{\dot \beta_t}{\beta_t} \alpha_t \right)z - \dfrac{\dot \beta_t}{\beta_t} x \right\|^2 \\ &= \mathbb{E}_{t\sim \mathrm{Unif}, z\sim p_{\mathrm{data}}, \epsilon \sim \mathcal N(0, I_d)} \left\| u_t^\theta(\alpha_t z + \beta_t \epsilon) - (\dot \alpha_t z + \dot \beta_t \epsilon) \right\|^2 \end{aligned}\] For simple linear interpolation \(\alpha_t=t, \beta_t=1-t\), writing \(\bar t\equiv 1-t\), this instantiates to \(\mathcal L_\mathrm{CFM}(\theta) = \mathbb{E}\|u_t^\theta(tz + \bar t\epsilon) - (z-\epsilon)\|^2\). In terms of the score, with \(s_t^\theta\) denoting the score network, we obtain \[\begin{aligned} \mathcal L_\mathrm{CSM}(\theta) &= \mathbb{E}_{t\sim \mathrm{Unif}, z\sim p_{\mathrm{data}}, x\sim p_t(\cdot\mid z)} \left\| s_t^\theta(x) + \dfrac{x - \alpha_tz}{\beta_t^2} \right\|^2 \\ &= \mathbb{E}_{t\sim \mathrm{Unif}, z\sim p_{\mathrm{data}}, \epsilon \sim \mathcal N(0, I_d)} \left\| s_t^\theta(\alpha_t z + \beta_t\epsilon) + \dfrac{\epsilon}{\beta_t} \right\|^2 \end{aligned}\]

The score network essentially tries to learn the noise \(\epsilon\) which is used to corrupt the data. To avoid instability as \(\beta_t\to 0\), Denoising Diffusion Probabilistic Models (DDPM) proposed to drop the \(1/\beta_t\) coefficient and predict the noise directly with a network \(\epsilon_t^\theta\). Some algebra further shows that predicting the score is equivalent to predicting the target vector field: \[ u_t^\mathrm{target}(x\mid z) = \left(\beta_t^2 \dfrac{\dot \alpha_t}{\alpha_t} - \dot \beta_t \beta_t\right) \nabla \log p_t(x\mid z) + \dfrac{\dot \alpha_t}{\alpha_t} x \] Since both quantities share the same marginalization formula, the same equivalence holds for the marginal targets.
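The linear-path loss \(\mathbb{E}\|u_t^\theta(tz + \bar t\epsilon) - (z-\epsilon)\|^2\) translates almost line by line into a Monte Carlo estimate. The following is a minimal NumPy sketch; `cfm_loss` and the `model(x, t)` callable are hypothetical names, and no gradient step is shown. As a check, when the dataset is a single point \(z_0\) the marginal and conditional fields coincide, so the analytic field \((z_0-x)/(1-t)\) should attain essentially zero loss.

```python
import numpy as np

def cfm_loss(model, z, rng):
    """Monte Carlo estimate of L_CFM for alpha_t = t, beta_t = 1 - t.

    model(x, t) is a hypothetical callable returning an array shaped like x;
    z has shape (n, d) and holds a batch of data points.
    """
    n = z.shape[0]
    t = rng.uniform(size=(n, 1))          # t ~ Unif[0, 1)
    eps = rng.standard_normal(z.shape)    # eps ~ N(0, I_d)
    x = t * z + (1.0 - t) * eps           # x ~ p_t(. | z)
    target = z - eps                      # conditional vector field at x
    residual = model(x, t) - target
    return np.mean(np.sum(residual**2, axis=1))

rng = np.random.default_rng(0)
z0 = np.array([1.0, -2.0])
batch = np.tile(z0, (256, 1))             # dataset consisting of one repeated point

oracle = lambda x, t: (z0 - x) / (1.0 - t)
loss_oracle = cfm_loss(oracle, batch, rng)
loss_zero = cfm_loss(lambda x, t: np.zeros_like(x), batch, rng)
```

In practice `model` would be a neural network and the loss would be differentiated with an autodiff framework; the sampling of \(t\), \(\epsilon\) and the regression target stay the same.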

Guidance

We would further like to be able to condition generation on prompts / features / past data \(y\in \mathcal Y\). The generation process becomes \[ Y\xrightarrow{p_{\mathrm{data}}(z\mid y)} Z\xrightarrow{p_t(x\mid z)} X \]

Definition 1.2 (guided diffusion model) A guided diffusion (flow) model consists of a guided vector field \(u_t^\theta(\cdot \mid y)\) which accepts \(t, y\in \mathcal Y\) to output change in \(X\). Essentially we’re trying to jointly model \(p_{\mathrm{data}}(z, y)\).

Fixing \(y\), the conditional flow matching loss reads \[ \mathcal L_{\mathrm{CFM}, y}(\theta) = \mathbb{E}_{t\sim \mathrm{Unif}, z\sim p_{\mathrm{data}}(\cdot\mid y), x\sim p_t(\cdot\mid z)} \left\|u_t^\theta(x\mid y) - u_t^\mathrm{target}(x\mid z)\right\|^2 \] Take the expectation over \(y\) to obtain the guided conditional loss \[ \mathcal L_\mathrm{CFM}^\mathrm{guided}(\theta) = \mathbb{E}_{t\sim \mathrm{Unif}, (z, y)\sim p_{\mathrm{data}}(z, y), x\sim p_t(\cdot\mid z)} \|u_t^\theta(x\mid y) - u_t^\mathrm{target}(x\mid z) \|^2 \] Empirically, people have observed that the loss above does not enforce guidance strongly enough. Recalling the conversion between score and vector field for Gaussian paths (proposition 1.3), we can rewrite \[ u_t^\mathrm{target}(x\mid y) = a_t x + b_t \nabla_x \log p_t(x\mid y), \quad (a_t, b_t) = \left(\dfrac{\dot \alpha_t}{\alpha_t}, \dfrac{\dot \alpha_t \beta_t^2 - \dot \beta_t \beta_t \alpha_t}{\alpha_t}\right) \] By Bayes’ rule, \(p_t(x\mid y) = p_t(y\mid x)\, p_t(x) / p(y)\), and \(p(y)\) does not depend on \(x\), so the guided score decomposes as \(\nabla \log p_t(x\mid y) = \nabla \log p_t(y\mid x) + \nabla \log p_t(x)\). In words: the guided score at noisy data \(x_t\) equals the unguided score at \(x_t\), plus the gradient of the log-likelihood of the guidance label \(y\) given \(x_t\).

To artificially amplify guidance, consider introducing the guidance scale \(w>1\) and rearrange \[\begin{aligned} \tilde u_t(x\mid y) &= a_t x + b_t \left( \nabla \log p_t(x) + w\, \nabla \log p_t(y\mid x) \right) \\ &= a_t x + b_t \left( \nabla \log p_t(x) + w\, (\nabla \log p_t(x\mid y) - \nabla \log p_t(x)) \right) \\ &= a_t x + b_t \left( (1-w)\, \nabla \log p_t(x) + w\, \nabla \log p_t(x\mid y) \right) \\ &= w \left( a_t x + b_t \nabla \log p_t(x\mid y) \right) + (1-w) \left( a_t x + b_t \nabla \log p_t(x) \right) \\ &= w\, u^\mathrm{target}_t(x\mid y) + \bar w\, u^\mathrm{target}_t(x) \end{aligned}\] where \(\bar w \equiv 1-w\).

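So for the Gaussian conditional probability path, up-scaling the classifier guidance is equivalent to taking an affine combination of \(u_t^\mathrm{target}(x\mid y)\) and \(u_t^\mathrm{target}(x)\) with weights \(w\) and \(1-w\). Since \(w>1\), the unguided weight is negative, i.e. we extrapolate from the unguided field past the guided one.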

Definition 1.3 (classifier-free guidance)

Per the discussion above, the classifier-free guided vector field and scores are defined, given guidance scale \(w>1\) and writing \(\bar w\equiv 1-w\), as \[\begin{aligned} \tilde u_t(x\mid y) &= w\, u^\mathrm{target}_t(x\mid y) + \bar w\, u^\mathrm{target}_t(x), \quad \tilde s_t(x\mid y) = w\, s^\mathrm{target}_t(x\mid y) + \bar w\, s^\mathrm{target}_t(x) \end{aligned}\]
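The equivalence between amplifying the classifier term in score space and combining the two vector fields can be checked directly. The sketch below assumes the linear schedule \(\alpha_t=t, \beta_t=1-t\), for which \(a_t = 1/t\) and \(b_t = (1-t)/t\); the helper names and the random stand-in scores are hypothetical.

```python
import numpy as np

def velocity_from_score(x, score, t):
    """u = a_t x + b_t * score for alpha_t = t, beta_t = 1 - t (0 < t < 1)."""
    a_t = 1.0 / t
    b_t = (1.0 - t) / t
    return a_t * x + b_t * score

def cfg_combine(guided, unguided, w):
    """Classifier-free guidance: w * guided + (1 - w) * unguided."""
    return w * guided + (1.0 - w) * unguided

rng = np.random.default_rng(0)
# Random stand-ins for x, the guided score and the unguided score.
x, s_guided, s_unguided = rng.standard_normal((3, 5))
t, w = 0.4, 3.0

# Amplify the classifier term grad log p(y|x) = s_guided - s_unguided by w.
amplified_score = s_unguided + w * (s_guided - s_unguided)
lhs = velocity_from_score(x, amplified_score, t)

# Combine the guided and unguided vector fields instead.
rhs = cfg_combine(velocity_from_score(x, s_guided, t),
                  velocity_from_score(x, s_unguided, t), w)
```

The same `cfg_combine` applies unchanged to scores, since the combination is affine with weights summing to one.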

Definition 1.4 (training protocol for linear Gaussian path) Given label-data pairs \((y_j, z_j)\), guidance scale \(w>1\), and probability \(\eta\in (0, 1)\), the training procedure for CFG-flow matching with linear Gaussian probability path 1.1 is as follows:

  1. Sample a batch \((y_j, z_j)\) from dataset.
  2. Given the batch, sample a boolean mask \((\xi_1, \dots, \xi_n)\) with probability \(\eta\) denoting whether we’re ignoring the guidance label on the sample. Replace \(y_j\mapsto \xi_j \emptyset + \bar \xi_j y_j\).
  3. Sample \(t\sim \mathrm{Unif}, \epsilon \sim \mathcal N(0, I_d)\) batch-wise.
  4. Minimize the flow matching loss: \[\begin{aligned} \mathcal L^{\mathrm{CFG}}_\mathrm{CFM} = \dfrac 1 n \sum_{j=1}^n \left\| u_{t_j}^\theta(t_jz_j + \bar t_j\epsilon_j\mid y_j) - (z_j-\epsilon_j) \right\|^2 \end{aligned}\]

This trains the model to match flow of \(u_t^\mathrm{target}(x\mid y)\) for conditional case, and \(u_t^\mathrm{target}(x\mid \emptyset)\) for unconditional case. Guided sampling given \(y\):

  1. Initialize \(x_0\sim p_{\mathrm{init}}\).
  2. Simulate ODE \(dx_t = \left[ w\, u^\theta_t(x_t\mid y) + \bar w\, u^\theta_t(x_t\mid \emptyset) \right]\, dt\)
  3. Output \(x_1\).

Empirically, we want at least O(1) unconditional samples per batch on average. The choice \(\eta=0.1\) is common.
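The training and sampling steps above can be sketched end to end in NumPy. Everything named here is an assumption for illustration: `NULL_LABEL = -1` as the encoding of \(\emptyset\), integer class labels, the `model(x, t, y)` signature, and `toy_model`, a hand-written stand-in (not a trained network) that moves points toward a per-class center. No optimizer step is shown; only batch construction, the loss, and Euler sampling.

```python
import numpy as np

NULL_LABEL = -1  # hypothetical encoding of the empty label

def cfg_training_batch(z, y, eta, rng):
    """Steps 1-3: drop labels with probability eta, then sample t and eps."""
    n = z.shape[0]
    drop = rng.uniform(size=n) < eta                 # xi_j = 1 means "ignore label"
    y_masked = np.where(drop, NULL_LABEL, y)
    t = rng.uniform(size=(n, 1))
    eps = rng.standard_normal(z.shape)
    x = t * z + (1.0 - t) * eps
    target = z - eps
    return x, t, y_masked, target

def cfg_loss(model, x, t, y_masked, target):
    """Step 4: batch-averaged squared error against z - eps."""
    return np.mean(np.sum((model(x, t, y_masked) - target) ** 2, axis=1))

def guided_sample(model, y, w, d, n_steps=100, rng=None):
    """Euler integration of dx = [w u(x|y) + (1 - w) u(x|null)] dt from t=0 to 1."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.standard_normal((1, d))
    y_arr, null = np.array([y]), np.array([NULL_LABEL])
    h = 1.0 / n_steps
    for k in range(n_steps):
        t = np.full((1, 1), k * h)
        v = w * model(x, t, y_arr) + (1.0 - w) * model(x, t, null)
        x = x + h * v
    return x[0]

# Hand-written stand-in for a trained network: class centers, null -> origin.
CENTERS = np.array([[2.0, 0.0], [-2.0, 0.0]])

def toy_model(x, t, y):
    is_null = (y == NULL_LABEL)[:, None]
    centers = np.where(is_null, 0.0, CENTERS[np.clip(y, 0, None)])
    return (centers - x) / (1.0 - t)

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
z = CENTERS[y] + 0.1 * rng.standard_normal((1000, 2))
x, t, y_masked, target = cfg_training_batch(z, y, eta=0.1, rng=rng)
loss = cfg_loss(toy_model, x, t, y_masked, target)
sample = guided_sample(toy_model, y=0, w=1.0, d=2, rng=rng)
```

With \(w=1\) the sampler follows the guided field alone; larger \(w\) pushes samples further along the direction separating the guided from the unguided field.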

Bibliography

Holderrieth, Peter, and Ezra Erives. 2025. “Introduction to Flow Matching and Diffusion Models.” https://diffusion.csail.mit.edu/docs/lecture-notes.pdf.
Li, Marvin. 2025. “In the Blink of an Eye: A Unified Theory for Feature Emergence in Generative Models.” Honors thesis, Cambridge, MA: Harvard College. https://marvinfli.github.io/files/thesis.pdf.