9 Statistical Decision Applications
Key takeaways:
- A statistical model is a family \(\{P_\theta\}\) of distributions indexed by parameter values \(\theta \in \Theta\).
- \(R_\theta(\hat \theta)\) is the expectation of loss over \(X\sim P_\theta\).
- There are two ways to turn \(R_\theta(\hat\theta)\) into a scalar:
- Eliminating \(\theta\) with \(\sup_\theta\), then taking \(\inf\) over all estimators leads to the minimax risk \(R^*\).
- Replacing \(\theta\) by averaging over a prior \(\pi\), taking \(\inf\) over \(\hat \theta\), then taking \(\sup_\pi\) adversarially over all priors \(\pi\) leads to the Bayes risk.
- The minimax and Bayes risks are related by duality.
- Under mild assumptions, \(R^* = R^*_{\mathrm{Bayes}}\) (theorem 9.2; proposition 9.1 proves the finite-\(\Theta\) case).
- The sample complexity (definition 9.3) is the minimum number of samples needed to achieve \(R^*_n\leq \epsilon\).
- Tensor product experiments (definition 9.4) have independent but not necessarily identically distributed observations; theorem 9.3 bounds their minimax risk.
- Given i.i.d. observations, an analogue of the Cramér-Rao bound (theorem 9.7) still holds even for biased estimators.
Minimax and Bayes risks
- A statistical model is a collection \(\mathcal P = \{P_\theta:\theta \in \Theta\}\) of distributions over some measurable space \((\mathcal X, \Sigma)\). Here \(\Theta\) is the parameter space.
- An estimand \(T(\theta)\) has signature \(T:\Theta\to \mathcal Y\); in the trivial case \(T\) is just the identity.
- An estimator (decision rule) has type \(\hat T:\mathcal X\to \hat {\mathcal Y}\).
The action space \(\hat {\mathcal Y}\) does not have to be
the estimand space \(\mathcal Y\) (e.g. when we’re estimating a confidence interval).
- \(\hat T\) can be deterministic or randomized.
- To evaluate the estimator, we fix a loss function \(l:\mathcal Y\times \hat{\mathcal Y}\to \mathbb R\); together with the model, it defines the risk of \(\hat T\) for estimating \(T\).
- The (expected) risk of the estimator \(\hat T\) for estimating \(T\) is a function of the true parameter \(\theta\) and the estimator \(\hat T\): \[ R_\theta(\hat T) = \mathbb E_{X\sim P_\theta}[l(T(\theta), \hat T(X))] = \int P_\theta(dx)\, P_{\hat T|X}(d\hat t|x)\, l(T(\theta), \hat t) \]
Two comments are in order:
- For the minimax risk, it is sometimes necessary to randomize. To minimize the average risk over a prior (Bayesian approach), though, it suffices to consider deterministic functions.
- The space of randomized estimators (as Markov kernels) is convex. This is necessary for minimax theorems.
- For \(\hat T\mapsto l(T, \hat T)\) convex, we can derandomize by considering \(\mathbb E[\hat T|X]\): by Jensen’s inequality, \[ R_\theta(\hat T) = \mathbb E_\theta l(T, \hat T) \geq \mathbb E_\theta l(T, \mathbb E[\hat T|X]) \]
Definition 9.1 (Bayes risk) Given a prior \(\pi\), the average risk w.r.t. \(\pi\) of an estimator is \[ R_\pi(\hat \theta) = \mathbb E_{\theta \sim \pi} R_\theta(\hat \theta) = \mathbb E_{\theta \sim \pi, X\sim P_\theta} l(\theta, \hat \theta(X)) \] The Bayes risk of a prior is its minimal average risk \(R_\pi^* = \inf_{\hat \theta} R_\pi(\hat \theta)\). The optimal \(\hat \theta\) is the Bayes estimator.
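As a concrete illustration (a sketch with assumed model choices, not from the text): in the Bernoulli model with \(n\) coin flips, quadratic loss, and the uniform \(\mathrm{Beta}(1,1)\) prior, the Bayes estimator is the posterior mean \((S+1)/(n+2)\), where \(S\) is the success count. A quick Monte Carlo confirms its average risk beats the sample mean’s:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 10, 200_000

# Model: S ~ Binomial(n, theta), quadratic loss, uniform Beta(1, 1) prior.
theta = rng.uniform(0.0, 1.0, trials)       # theta ~ pi
s = rng.binomial(n, theta)                  # observed success count

mle = s / n                                 # sample mean (MLE)
bayes = (s + 1) / (n + 2)                   # posterior mean under Beta(1, 1)

risk_mle = np.mean((mle - theta) ** 2)      # average risk R_pi of the MLE
risk_bayes = np.mean((bayes - theta) ** 2)  # average risk of the Bayes estimator

# The Bayes estimator minimizes the average risk R_pi, so:
assert risk_bayes < risk_mle
```

For these particular choices the exact values are \(R_\pi(\mathrm{MLE}) = \mathbb E[\theta(1-\theta)]/n = 1/60\) and \(R^*_\pi = \mathbb E[\mathrm{Var}(\theta|S)] = 1/72\).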
Definition 9.2 (minimax risk) Given a parameter family \(\{P_\theta\}_{\theta\in\Theta}\) and
loss function \(l\), the minimax risk is
\[
R^* = \inf_{\hat \theta} \sup_{\theta \in \Theta}
R_\theta(\hat \theta)
= \inf_{\hat \theta} \sup_{\theta \in \Theta}
\mathbb E_{X\sim P_\theta}\, l(\theta, \hat \theta(X))
\]
To prove that the minimax risk equals \(R^*\), one needs to establish, for arbitrary \(\epsilon>0\):
- (upper bound) an estimator \(\hat \theta^*\) satisfying \(\sup_\theta R_\theta(\hat \theta^*)\leq R^*+\epsilon\);
- (lower bound) for arbitrary \(\hat \theta\), \(\sup_{\theta} R_\theta(\hat \theta)\geq R^*-\epsilon\).
Theorem 9.1 (duality between minimax and Bayes risk) Let \(\mathcal P(\Theta)\) denote the set of probability distributions on \(\Theta\); then \[ R^* = \inf_{\hat \theta}\sup_{\theta \in \Theta} R_\theta(\hat \theta) \geq R^*_{\mathrm{Bayes}} = \sup_{\pi \in \mathcal P(\Theta)} \inf_{\hat \theta} R_\pi(\hat \theta) \]
Proof: \(R^* = \inf_{\hat \theta}\sup_{\theta \in \Theta} R_\theta(\hat \theta) = \inf_{\hat \theta}\sup_{\pi \in \mathcal P(\Theta)} R_\pi(\hat \theta) \geq R^*_{\mathrm{Bayes}}\) by \(\inf\sup \geq \sup\inf\).
Example 9.1 (strict inequality) Let \(\theta, \hat \theta \in \{1, 2, \cdots\}\) and \(l(\theta, \hat \theta) = 1_{\hat \theta < \theta}\), then \(R^* = 1\) while \(R^*_{\mathrm{Bayes}} = 0\).
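To see the collapse of the Bayes risk concretely, here is a small exact computation (illustrative; the geometric prior is an assumption, and as in the example the observation carries no information about \(\theta\)). For any fixed prior, a constant estimator at a high quantile drives the average risk toward zero, while every estimator still has \(\sup_\theta R_\theta = 1\):

```python
from fractions import Fraction

# Loss l(theta, theta_hat) = 1{theta_hat < theta} on Theta = {1, 2, ...}.
# Illustrative prior (an assumption, not from the text): pi(k) = 2^{-k}.
def avg_risk_of_constant(m: int) -> Fraction:
    # The constant estimator theta_hat = m errs iff theta > m, and
    # pi(theta > m) = sum_{k > m} 2^{-k} = 2^{-m}.
    return Fraction(1, 2 ** m)

risks = [avg_risk_of_constant(m) for m in range(1, 11)]
# The average risk 2^{-m} is arbitrarily small, so R_pi^* = 0 for any such
# prior, while any fixed estimator suffers sup_theta R_theta = 1 (take theta
# larger than anything it outputs): hence R* = 1 > 0 = R*_Bayes.
```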
A duality perspective
For simplicity, let \(\Theta\) be a finite set and \(l\) be convex; then \[ R^* = \min_{P_{\hat \theta|X}} \max_{\theta \in \Theta} \mathbb E_\theta \, l(\theta, \hat \theta) \] This is a convex optimization problem since \[ P_{\hat \theta|X}\mapsto \mathbb E_\theta \, l(\theta, \hat \theta) = \mathbb E_{X\sim P_\theta, \hat \theta\sim P_{\hat \theta|X}}[l(\theta, \hat \theta)] \] is linear and the pointwise supremum of convex functions is convex. To derive its dual, rewrite \[ R^* = \min_{P_{\hat \theta|X}, t} \, t\quad \text{s.t. } \mathbb E_\theta\, l(\theta, \hat \theta) \leq t, \quad \forall \theta \in \Theta \] Introduce a multiplier \(\pi_\theta\geq 0\) for each inequality constraint. The Lagrangian is \[ \mathcal L(P_{\hat \theta|X}, t, \pi) = t + \sum_{\theta \in \Theta} \pi_\theta\cdot \left( \mathbb E_\theta\, l(\theta, \hat \theta) - t \right) = \left( 1 - \sum_{\theta \in \Theta} \pi_\theta \right)t + \sum_{\theta \in \Theta} \pi_\theta \mathbb E_\theta\, l(\theta, \hat \theta) \] Minimizing over \(t\) forces \(\sum_\theta \pi_\theta = 1\) (otherwise the infimum is \(-\infty\)), so \(\pi\) must be a probability vector, yielding the dual problem \[ \max_\pi \min_{P_{\hat \theta|X}} \sum_{\theta \in \Theta} \pi_\theta \mathbb E_\theta\, l(\theta, \hat \theta) = \max_{\pi \in \mathcal P(\Theta)} R_\pi^* \]
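The finite-\(\Theta\) primal/dual pair above can be checked numerically on a toy binary experiment (the 0.2/0.8 numbers below are illustrative choices, not from the text): the smallest worst-case risk over randomized estimators matches the largest Bayes risk over priors.

```python
import numpy as np

# Toy binary experiment: Theta = {0, 1}, one observation X in {0, 1},
# P_0(X=1) = 0.2, P_1(X=1) = 0.8, zero-one loss (illustrative numbers).
P = np.array([[0.8, 0.2],   # row theta = 0: P_0(x) for x = 0, 1
              [0.2, 0.8]])  # row theta = 1: P_1(x)

# Primal: a randomized estimator is q_x = P(theta_hat = 1 | X = x).
# Grid-search the worst-case risk max(R_0, R_1) over (q_0, q_1).
qs = np.linspace(0.0, 1.0, 201)
Q0, Q1 = np.meshgrid(qs, qs, indexing="ij")
R0 = P[0, 0] * Q0 + P[0, 1] * Q1              # risk at theta = 0
R1 = P[1, 0] * (1 - Q0) + P[1, 1] * (1 - Q1)  # risk at theta = 1
minimax = np.min(np.maximum(R0, R1))

# Dual: under 0-1 loss, the Bayes risk of the prior (1 - p, p) is
# sum_x min((1 - p) P_0(x), p P_1(x)); maximize over priors.
ps = np.linspace(0.0, 1.0, 201)
bayes_risks = np.minimum((1 - ps)[:, None] * P[0], ps[:, None] * P[1]).sum(axis=1)
bayes = bayes_risks.max()

# Minimax equality R* = R*_Bayes for this finite model (both equal 0.2).
assert abs(minimax - bayes) < 1e-6
```

The least favorable prior here is uniform, and the minimax rule \(\hat\theta = X\) is an equalizer with risk 0.2 at both parameter values.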
Theorem 9.2 (general minimax equality) \(R^* = R^*_{\mathrm{Bayes}}\) if the following conditions hold:
- The experiment is dominated: \(P_\theta \ll \nu\) for all \(\theta\), for some measure \(\nu\).
- The action space \(\hat \Theta\) (codomain of the estimator) is a locally compact topological space with a countable basis.
- The loss function is level compact: for each \(\theta\in \Theta\), \(l(\theta, \cdot)\) is bounded from below and \(\{\hat \theta:l(\theta, \hat \theta)\leq a\}\) is compact for each \(a\).
We will prove the following special case for demonstration.
Proposition 9.1 (special minimax equality) \(R^* = R^*_{\mathrm{Bayes}}\) if \(\Theta\) is finite and \(l\) is bounded from below (e.g. quadratic).
Proof
- First consider the edge case \(R^*=\infty \iff R^*_{\mathrm{Bayes}} = \infty\). The \(\Leftarrow\) direction follows from theorem 9.1; for \(\Rightarrow\), take the uniform prior \(\pi\) on \(\Theta\): since \(l\) is bounded below, \(R_\pi(\hat \theta) \geq \frac{1}{|\Theta|}\sup_\theta R_\theta(\hat \theta) - O(1) = \infty\) for every \(\hat \theta\).
- Next consider \(R^*<\infty\). Given an estimator \(\hat \theta\), denote its risk vector by \(R(\hat \theta)\in \mathbb R^\Theta\) with components \(R(\hat \theta)_\theta = \mathbb E_{X\sim P_\theta}\, l(\theta, \hat \theta)\), indexed by \(\theta\).
- The average risk is given by the inner product \(\langle R(\hat \theta), \pi\rangle\).
- Define \(S\subset \mathbb R^\Theta\) to be the set of all risk vectors of randomized estimators; note that \(S\) is convex because linear interpolations of risk vectors are realized by mixtures of the corresponding randomized estimators.
- The set \(T=\{t\in \mathbb R^\Theta: t_\theta < R^* \ \forall \theta\}\) is convex, and \(S\cap T=\emptyset\) since every estimator has at least one risk component \(\geq R^*\).
- By the hyperplane separation theorem, there exist \(\pi \in \mathbb R^\Theta\) and \(c\in \mathbb R\) such that \(\inf_{s\in S}\langle \pi, s\rangle \geq c \geq \sup_{t\in T}\langle \pi, t\rangle\). Now \(\pi\) must be componentwise nonnegative, else \(\sup_{t\in T}\langle\pi, t\rangle= \infty\), so w.l.o.g. \(\pi\) is a probability vector.
- Thus we have established \(R^*_{\mathrm{Bayes}} \geq R_\pi^* = \inf_{\hat \theta}\langle R(\hat \theta), \pi\rangle \geq c \geq \sup_{t\in T}\langle \pi, t\rangle = R^*\); the reverse inequality is theorem 9.1.
Sample complexity, tensor products
Given an experiment \(\{P_\theta\}_{\theta\in \Theta}\), the independent sampling model refers to the experiment \[ \mathcal P_n = \{P_{\theta}^{\otimes n} : \theta \in \Theta\}, \quad n\geq 1 \] Define \(R_n^*(\Theta) = \inf_{\hat \theta} \sup_{\theta \in \Theta} \mathbb E_\theta\, l(\theta, \hat \theta)\) to be the minimax risk when \(\hat \theta\) consumes \(X=(X_1, \cdots, X_n)\) consisting of \(n\) independent observations. Note that:
- \(n\mapsto R_n^*(\Theta)\) is non-increasing since we can always discard extra observations.
- We typically expect \(R_n^*(\Theta)\to 0\) as \(n\to \infty\).
Definition 9.3 (sample complexity) The sample complexity is the minimum sample size required to obtain a prescribed error \(\epsilon\) in the worst case (of actual parameter): \[ n^*(\epsilon) = \min\{n\in \mathbb N: R^*_n(\Theta)\leq \epsilon\} \]
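As an illustration (assuming the standard fact, not derived here, that the Gaussian location model with quadratic loss has minimax risk \(R_n^* = \sigma^2/n\)), the sample complexity has a closed form:

```python
import math

def gaussian_sample_complexity(sigma2: float, eps: float) -> int:
    """n*(eps) for the Gaussian location model, where R_n* = sigma^2 / n."""
    # min{n : sigma^2 / n <= eps} = ceil(sigma^2 / eps)
    return math.ceil(sigma2 / eps)

# Halving the target error doubles the required sample size.
assert gaussian_sample_complexity(1.0, 0.01) == 100
assert gaussian_sample_complexity(1.0, 0.005) == 200
```

This also exhibits the monotonicity noted above: \(n^*(\epsilon)\) is non-increasing in \(\epsilon\) because \(n \mapsto R_n^*\) is non-increasing.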
Definition 9.4 (tensor product experiment) Given statistical experiments \(\mathcal P_j = \{P_{\theta_j} : \theta_j \in \Theta_j\}\) and losses \(l_j\) for each \(j\in [d]\), the tensor product experiment is \[ \mathcal P = \left\{\prod_j P_{\theta_j} : (\theta_j)\in \Theta = \prod_j \Theta_j\right\} \] with loss function \(l(\theta, \hat \theta) = \sum_j l_j(\theta_j, \hat \theta_j)\). In this model, the observations \((X_1, \cdots, X_d)\) are independent but not necessarily identically distributed.
Theorem 9.3 (minimax risk of tensor product) \[ \sum_{j=1}^d R^*_{\mathrm{Bayes}}(\mathcal P_j) \leq R^*(\mathcal P) \leq \sum_{j=1}^d R^*(\mathcal P_j) \] If the minimax equality \(R^*(\mathcal P_j) = R^*_{\mathrm{Bayes}}(\mathcal P_j)\) holds for each component experiment, then it holds for the product experiment, and the minimax risk decomposes additively.
Proof
For the right inequality, choose \(\hat \theta^* = (\hat \theta_j^*)\), where \(\hat \theta_j^*\) is the component minimax estimator, ignoring all other observations. For the left inequality, consider for each \((\pi_j)\) the product prior \(\pi = \prod_j \pi_j\); under this prior, both the \(\theta_j\)’s and the \(X_j\)’s are independent. For any component \(\hat \theta_j = \hat \theta_j(X, U_j)\) of any randomized estimator \(\hat \theta\), the non-\(X_j\) inputs \((U_j, X_{\bar j})\) can be viewed as pure randomization, so \(\hat \theta_j\) is a randomized estimator based on \(X_j\) alone (the other \(X_{\bar j}\)’s carry no information about \(\theta_j\) by independence of the \(\theta_j\)’s). Thus its average risk satisfies \(R_{\pi_j}(\hat \theta_j) \geq R^*_{\pi_j}\). Take the supremum over all \(\pi_j\) and sum over \(j\) to obtain the left inequality.
HCR lower bound
Theorem 9.4 (Hammersley-Chapman-Robbins (HCR) lower bound) The quadratic loss of any estimator \(\hat \theta\) at \(\theta\in \Theta\subset \mathbb R\) satisfies \[ R_\theta(\hat \theta) = \mathbb E_{X\sim P_\theta} [\hat \theta(X) - \theta]^2 \geq \mathrm{Var}_\theta(\hat \theta) \geq \sup_{\theta'\neq \theta} \dfrac{ (\mathbb E_\theta \hat \theta - \mathbb E_{\theta'} \hat \theta)^2 }{\chi^2(P_{\theta'} \| P_\theta)} \]
Proof
Recall the variational characterization of \(\chi^2\) (proposition 8.11): for all \(g:\mathcal X\to \mathbb R\), \[ \mathrm{Var}_Q[g] \geq \dfrac{(\mathbb E_P g - \mathbb E_Q g)^2}{\chi^2(P\|Q)} \] Given two different underlying parameters \(\theta, \theta'\), let \(P_X=P_{\theta'}, Q_X=P_{\theta}\); applying the data processor \(X\mapsto \hat \theta(X)\) and then the variational characterization (with \(g\) the identity), \[ \chi^2(P_X \| Q_X) \geq \chi^2(P_{\hat \theta} \| Q_{\hat \theta}) \geq \dfrac{(\mathbb E_{\theta'}\hat \theta - \mathbb E_{\theta} \hat \theta)^2} {\mathrm{Var}_\theta(\hat \theta)} \]
Corollary 9.1 (Cramér-Rao lower bound) For an unbiased estimator \(\hat \theta\), i.e. \(\mathbb E_{\theta}[\hat \theta] = \theta\) for all \(\theta\), we obtain \[ \mathrm{Var}_\theta(\hat \theta) \geq \mathcal J_F(\theta)^{-1} \]
Proof
Apply the unbiasedness condition to theorem 9.4, take \(\theta' = \theta + \xi\), and apply the local Fisher behavior of divergence (theorem 8.12) as \(\xi \to 0\): \[\begin{align} \mathrm{Var}_\theta(\hat \theta) &\geq \dfrac{ \xi^2 }{\chi^2(P_{\theta + \xi} \| P_\theta)} = \dfrac{ \xi^2 }{\mathcal J_F(\theta) \xi^2 + o(\xi^2)} \to \mathcal J_F(\theta)^{-1} \end{align}\]To generalize to \(\theta \in \Theta\subset \mathbb R^d\), assume \(\hat \theta\) is unbiased and that \[ \mathrm{Cov}_\theta(\hat \theta) = \mathbb E_\theta[(\hat \theta - \theta)(\hat \theta - \theta)^T] \] is positive definite; apply theorem 9.4 to each data processor \(X\mapsto \langle a, \hat \theta(X)\rangle, a\in \mathbb R^d\), to obtain \[ \chi^2(P_{\theta'} \| P_{\theta}) \geq \dfrac{\langle a, \theta - \theta'\rangle^2}{a^T \mathrm{Cov}_\theta(\hat \theta)a} \] Optimizing over \(a\) using, for \(\Sigma \succ 0\), \[ \sup_{x\neq 0} \dfrac{\langle x, y\rangle^2}{x^T\Sigma x} = y^T \Sigma^{-1} y, \quad \text{attained at } x = \Sigma^{-1} y \] yields \(\chi^2(P_{\theta'} \| P_{\theta}) \geq (\theta - \theta')^T \mathrm{Cov}_\theta(\hat \theta)^{-1} (\theta - \theta')\). Applying the multivariate local approximation again yields \(\mathcal J_F^{-1}(\theta) \preceq \mathrm{Cov}_\theta(\hat \theta)\).
Further recall that the Fisher information is additive under i.i.d. tensorization, so for unbiased estimators based on \(n\) i.i.d. observations, taking the trace on both sides of \(\frac 1 n \mathcal J_F^{-1}(\theta) \preceq \mathrm{Cov}_\theta(\hat \theta)\) gives \[ \mathbb E_\theta \|\hat \theta - \theta\|_2^2 \geq \dfrac 1 n \mathrm{tr}\,\mathcal J_F^{-1}(\theta) \]
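A quick Monte Carlo check (an illustrative model choice, not from the text): in the Bernoulli model, \(\mathcal J_F(\theta) = 1/(\theta(1-\theta))\), and the sample mean is unbiased with variance \(\theta(1-\theta)/n = 1/(n\mathcal J_F(\theta))\), so the Cramér-Rao bound is attained with equality.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, n, trials = 0.3, 50, 100_000

x = rng.binomial(1, theta, size=(trials, n))
theta_hat = x.mean(axis=1)                  # sample mean; unbiased for theta

var_hat = theta_hat.var()
fisher = 1.0 / (theta * (1 - theta))        # per-observation J_F(theta)
cr_bound = 1.0 / (n * fisher)               # = theta (1 - theta) / n

# The sample mean attains the Cramer-Rao bound with equality in this model.
assert abs(var_hat - cr_bound) / cr_bound < 0.05
```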
Bayesian perspective
In minimax settings, it is often wise to
trade bias for variance to achieve a smaller
overall risk. Fixing a prior \(\pi \in \mathcal P(\Theta)\),
consider the following joint distributions for
\((\theta, X)\):
- Under \(Q\), \(\theta\sim \pi\) and \(X\sim P_\theta\) after sampling \(\theta\).
- Under \(P\), \(\theta\sim T_\delta\pi\), where \(T_\delta\pi\) is the displaced prior \(T_\delta \pi(A) = \pi(A - \delta)\), and \(X\sim P_{\theta - \delta}\) conditionally on \(\theta\).
Theorem 9.5 (van Trees inequality) Fixing a differentiable prior \(\pi\) and using the setup for \(P_{X\theta}, Q_{X\theta}\) above, under regularity conditions \[ R^* \geq R^*_\pi = \inf_{\hat \theta} \mathbb E_\pi[(\hat \theta - \theta)^2] \geq \dfrac 1 {J(\pi) + \mathbb E_{\theta\sim \pi} \mathcal J_F(\theta)} \] Note that \(\mathbb E_\pi[(\hat \theta - \theta)^2]\) is in fact expectation over \(\theta\sim \pi\), then \(X\sim P_\theta\), and \(J(\pi)\) is the Fisher information of the location family \(\pi\). The regularity conditions are:
- \(\pi\) is differentiable, supported on \([\theta_0, \theta_1]\) and \(\pi(\theta_0) = \pi(\theta_1) = 0\).
- The location-family Fisher information of the prior is finite \[ J(\pi) = \int_{\theta_0}^{\theta_1} \dfrac{\pi'(\theta)^2}{\pi(\theta)}\, d\theta < \infty \]
- The family has a dominating measure \(\mu\): \(P_\theta = p_\theta\, \mu\), and \(p_\theta(x)\) is differentiable in \(\theta\) a.e.
- For \(\pi\)-almost every \(\theta\): \(\int \partial_{\theta }p_\theta(x)\, d\mu = 0\).
Proof by data-processing
Fix any (possibly randomized) estimator \(\hat \theta\). The marginals (but not the joints!) agree, \(P_X = Q_X\), so \(\mathbb E_P\hat \theta = \mathbb E_Q\hat \theta\), while \(\mathbb E_P \theta = \mathbb E_Q \theta + \delta\). Then \[\begin{align} \chi^2(P_{\theta X} \| Q_{\theta X}) &\geq \chi^2(P_{\theta \hat \theta} \| Q_{\theta\hat \theta}) \geq \chi^2(P_{\theta - \hat \theta} \| Q_{\theta - \hat \theta}) \\ &\geq \dfrac{(\mathbb E_P[\theta - \hat \theta] - \mathbb E_Q[\theta - \hat \theta])^2}{\mathrm{Var}_Q(\hat \theta - \theta)} = \dfrac{\delta^2}{\mathrm{Var}_Q(\hat \theta - \theta)} \geq \dfrac{\delta^2}{\mathbb E_Q[\hat \theta - \theta]^2} \end{align}\] It remains to upper-bound \(\chi^2(P_{\theta X} \| Q_{\theta X})\); first consider the local expansions:
- \(\chi^2(P_\theta \| Q_\theta) = \chi^2(T_\delta \pi \| \pi) \approx J(\pi) \delta^2\).
- \(\chi^2(P_{X|\theta} \| Q_{X|\theta}) \approx \mathcal J_F(\theta) \delta^2\). Next, apply the \(\chi^2\) chain rule (proposition 8.6) to obtain \[\begin{align} \chi^2(P_{X\theta} \| Q_{X\theta}) &= \chi^2(P_\theta \| Q_{\theta}) + \mathbb E_Q \left[ \chi^2(P_{X|\theta} \| Q_{X|\theta}) \left(\dfrac{dP_\theta}{dQ_\theta}\right)^2 \right] \\ R_\pi^* &\geq \dfrac{\delta^2}{J(\pi) \delta^2 + \mathbb E_Q \left[\mathcal J_F(\theta) \delta^2 (dP_\theta/dQ_\theta)^2\right]}\\ &\to \dfrac 1 {J(\pi) + \mathbb E_{\theta \sim \pi} \mathcal J_F(\theta)} \quad (\delta \to 0) \end{align}\] In the last step, \(dP_\theta/dQ_\theta\to 1\) by continuity of \(\pi\).
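A sanity check with conjugate Gaussians (illustrative; the Gaussian prior violates the compact-support condition stated above, but it is the classic case where the van Trees bound is tight): for \(\theta\sim N(0,\tau^2)\) and \(X_i\sim N(\theta,\sigma^2)\), we have \(J(\pi)=1/\tau^2\) and \(\mathbb E_\pi \mathcal J_F = n/\sigma^2\) for \(n\) observations, and the posterior mean has Bayes risk exactly \((1/\tau^2 + n/\sigma^2)^{-1}\).

```python
import numpy as np

rng = np.random.default_rng(2)
tau2, sigma2, n, trials = 1.0, 4.0, 10, 200_000

theta = rng.normal(0.0, np.sqrt(tau2), trials)             # theta ~ N(0, tau^2)
x = rng.normal(theta[:, None], np.sqrt(sigma2), (trials, n))

# Posterior mean under the conjugate Gaussian model (shrunk sample mean).
w = (n / sigma2) / (1 / tau2 + n / sigma2)
theta_hat = w * x.mean(axis=1)

mse = np.mean((theta_hat - theta) ** 2)
# van Trees RHS: 1 / (J(pi) + E_pi J_F) = 1 / (1/tau^2 + n/sigma^2).
bound = 1.0 / (1.0 / tau2 + n / sigma2)

assert mse >= 0.97 * bound      # the bound holds (tight here, up to MC error)
```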
Direct proof
For the Bayes setting, w.l.o.g. assume that the estimator is deterministic. For each \(x\), integrate by parts to obtain \[ \int_{\theta_0}^{\theta_1} (\hat \theta - \theta)\, \partial_{\theta}[p_\theta \pi(\theta)]\, d\theta = \int_{\theta_0}^{\theta_1} p_\theta \pi(\theta)\, d\theta \] Next integrate both sides over \(\mu(dx)\) to obtain \[ \mathbb E_{\theta, X}[(\hat \theta - \theta)V(\theta, X)] = 1, \quad V(\theta, x) = \partial_{\theta }\log[p_\theta(x) \pi(\theta)] \] Next apply Cauchy-Schwarz to obtain \(\mathbb E[(\hat \theta - \theta)^2]\,\mathbb E[V(\theta, X)^2]\geq 1\), where, since the cross term vanishes by the regularity conditions, \[ \mathbb E[V(\theta, X)^2] = \mathbb E[(\partial_{\theta }\log p_\theta(X))^2] + \mathbb E[(\partial_{\theta }\log \pi(\theta))^2] = \mathbb E_{\theta \sim \pi}\, \mathcal J_F(\theta) + J(\pi) \]
- We assume a prior density that vanishes at the boundary. A uniform prior yields the Chernoff-Rubin-Stein inequality.
- To obtain the tightest lower bound, one should use the regular prior with minimum Fisher information, which is known to be \(g(u) = \cos^2(\pi u/2)\) supported on \([-1, 1]\), with Fisher information \(J(g) = \pi^2\).
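The claim \(J(g) = \pi^2\) for the cosine prior can be verified by quadrature (a sketch using the identity \(g'^2/g = \pi^2\sin^2(\pi u/2)\), which keeps the integrand bounded at the endpoints):

```python
import numpy as np

# g(u) = cos^2(pi u / 2) on [-1, 1]; check J(g) = int g'^2 / g du = pi^2.
N = 1_000_000
u = -1.0 + (np.arange(N) + 0.5) * (2.0 / N)     # midpoint grid on (-1, 1)
g = np.cos(np.pi * u / 2) ** 2
g_prime = -(np.pi / 2) * np.sin(np.pi * u)      # derivative of g
integrand = g_prime ** 2 / g                    # equals pi^2 sin^2(pi u / 2)
J = integrand.sum() * (2.0 / N)                 # midpoint rule

assert abs(J - np.pi ** 2) < 1e-4
```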
We only state the multivariate version of the theorem above.
Theorem 9.6 (multivariate BCR) Given a product prior density \(\pi(\theta) = \prod_{j=1}^d \pi_j(\theta_j)\) such that each \(\pi_j\) is compactly supported and vanishes at the boundary, suppose that for \(\pi\)-almost every \(\theta\) \[ \int \mu(dx)\, \nabla_\theta p_\theta(x) = 0 \] Then, for \(J(\pi) = \mathrm{diag}(\{J(\pi_j)\})\), we have (the inverse is taken before the trace) \[ R_\pi^* = \inf_{\hat \theta} \mathbb E_\pi \|\theta - \hat \theta\|_2^2 \geq \mathrm{tr}\left( \mathbb E_{\theta \sim \pi} \mathcal J_F(\theta) + J(\pi) \right)^{-1} \]
Theorem 9.7 (lower bound on asymptotic minimax risk) Assume \(\theta \mapsto \mathcal J_F(\theta)\) is continuous; let \(X_1, \cdots, X_n\sim P_\theta\) i.i.d. and define the minimax risk \(R_n^* = \inf_{\hat \theta} \sup_{\theta \in \Theta} \mathbb E_\theta \|\hat \theta - \theta\|_2^2\); then \[ R_n^* \geq \dfrac{1+o(1)}{n} \sup_{\theta \in \Theta} \mathrm{tr}\,\mathcal J_F^{-1}(\theta) \]
Proof
Fix \(\theta\in \Theta\) and choose \(\delta\) small enough that \(\theta + [-\delta, \delta]^d \subset \Theta\), and let \(\pi_j(\theta_j)\) be the minimal-Fisher-information cosine prior rescaled to \([\theta_j - \delta, \theta_j + \delta]\). Then \[ J(\pi_j) = \dfrac 1 {\delta^2} J(g) = \dfrac{\pi^2}{\delta^2}\implies J(\pi) = \dfrac{\pi^2}{\delta^2}I_d \] Next, continuity of \(\theta \mapsto \mathcal J_F(\theta)\) guarantees the hypotheses of the multivariate BCR bound (theorem 9.6); applying it together with the additivity of Fisher information, we obtain \[ R_n^* \geq \mathrm{tr}\left( n \mathbb E_{\theta \sim \pi} \mathcal J_F(\theta) + J(\pi) \right)^{-1} = \dfrac 1 n \mathrm{tr}\left( \mathbb E_{\theta \sim \pi} \mathcal J_F(\theta) + \dfrac{\pi^2}{n\delta^2}I_d \right)^{-1} \] Choose \(\delta = n^{-1/4}\) and apply the continuity of \(\mathcal J_F\) at \(\theta\).
Miscellany: MLE
Consider \(X^n=(X_j)\sim P_{\theta_0}\) i.i.d.; the maximum likelihood estimator (MLE) is \[ \hat \theta_{\mathrm{MLE}} = \mathrm{arg}\max_{\theta\in \Theta} L_\theta(X^n), \quad L_\theta(X^n) = \sum_{j=1}^n \log p_\theta(X_j) \] For discrete distributions, we also have \(\hat \theta_{\mathrm{MLE}} \in \mathrm{arg}\min_{\theta \in \Theta} D(\hat P_n \| P_\theta)\), where \(\hat P_n\) is the empirical distribution.
- The expected log-likelihood is maximized at the true parameter value \(\theta_0\): supposing \(\theta \mapsto P_\theta\) is injective, for \(\theta \neq \theta_0\), \[ \mathbb E_{\theta_0}[L_\theta - L_{\theta_0}] = \mathbb E_{\theta_0} \left[ -\sum_j \log \dfrac{p_{\theta_0}(X_j)}{p_\theta(X_j)} \right] = -nD(P_{\theta_0} \| P_\theta) < 0 \] Assuming regularity conditions and combining with the law of large numbers, this ensures consistency: \(\hat \theta_n\to \theta_0\) as \(n\to\infty\).
- Assuming more regularity conditions, we can expand \[ L_\theta = L_{\theta_0} + (\theta - \theta_0)^T \sum_j V(\theta_0, X_j) + \dfrac 1 2 (\theta - \theta_0)^T \left( \sum_j H(\theta_0, X_j) \right)(\theta - \theta_0) + \cdots \] Under regularity conditions, \(\frac 1 n \sum_j H(\theta_0, X_j) \to -\mathcal J_F(\theta_0)\) by the law of large numbers, while the score term \(\sum_j V(\theta_0, X_j)\) has mean zero and is approximately \(\sqrt{n\mathcal J_F(\theta_0)}\, Z\) with \(Z \sim N(0, I)\) by the CLT. This yields the stochastic approximation \[ L_\theta \approx L_{\theta_0} + \langle\sqrt{n\mathcal J_F(\theta_0)}\, Z, \theta - \theta_0\rangle - \dfrac n 2 (\theta - \theta_0)^T \mathcal J_F(\theta_0) (\theta - \theta_0) \] Maximizing the RHS yields the asymptotic normality expression \[ \hat \theta_{\mathrm{MLE}} \approx \theta_0 + \dfrac 1 {\sqrt n} \mathcal J_F(\theta_0)^{-1/2} Z \]
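The asymptotic normality above can be checked by simulation (an illustrative model choice, not from the text): for \(X_i \sim \mathrm{Exp}(\lambda_0)\), the MLE of the rate is \(1/\bar X\), and \(\mathcal J_F(\lambda) = 1/\lambda^2\), so \(\sqrt n(\hat\lambda - \lambda_0)\) should be approximately \(N(0, \lambda_0^2)\).

```python
import numpy as np

rng = np.random.default_rng(3)
lam0, n, trials = 2.0, 2000, 200_000

# For X_i ~ Exp(rate lam0) i.i.d., the sample mean is Gamma(n, 1/(n lam0)),
# so we can sample it directly; the MLE of the rate is 1 / sample mean.
xbar = rng.gamma(shape=n, scale=1.0 / (n * lam0), size=trials)
lam_hat = 1.0 / xbar

# Asymptotic normality: sqrt(n) (lam_hat - lam0) ~ N(0, J_F(lam0)^{-1})
# with Fisher information J_F(lam) = 1 / lam^2.
z = np.sqrt(n) * (lam_hat - lam0)
assert abs(z.mean()) < 0.1                # asymptotically centered
assert abs(z.var() - lam0 ** 2) < 0.15    # variance close to J_F^{-1} = lam0^2
```

Note the MLE here is biased for finite \(n\) (by roughly \(\lambda_0/n\)), yet the \(1/\sqrt n\)-scale fluctuations still match \(\mathcal J_F^{-1}\), consistent with the biased-estimator discussion around theorem 9.7.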