9 Statistical Decision Applications
Key takeaways:
- A statistical model is a family \(\{P_\theta\}\) of distributions indexed by parameter values \(\theta \in \Theta\).
- \(R_\theta(\hat \theta)\) is the expectation of loss over \(X\sim P_\theta\).
- There are two ways to turn \(R_\theta(\hat\theta)\) into a scalar:
- Eliminating \(\theta\) with \(\sup_\theta\), then taking \(\inf\) over all estimators leads to the minimax risk \(R^*\).
- Replacing \(\theta\) by averaging over a prior \(\pi\), taking \(\inf\) over \(\hat \theta\), then taking \(\sup_\pi\) adversarially over all priors \(\pi\) leads to the Bayes risk.
- The minimax and Bayes risks are related by duality.
- Under mild assumptions, \(R^* = R^*_{\mathrm{Bayes}}\) (theorem 9.2; proposition 9.1 proves the finite-\(\Theta\) case).
- The sample complexity (definition 9.3) is the minimum number of samples needed to achieve \(R^*_n\leq \epsilon\).
- Tensor product experiments (definition 9.4) have independent but not necessarily identically distributed observations; theorem 9.3 bounds their minimax risk.
- Given i.i.d. observations, an analogue of the Cramér-Rao bound (theorem 9.7) still holds even for biased estimators.
Minimax and Bayes risks
- A statistical model is a collection \(\mathcal P = \{P_\theta:\theta \in \Theta\}\) of distributions over some measurable space \((\mathcal X, \Sigma)\). Here \(\Theta\) is the parameter space.
- An estimand \(T(\theta)\) has signature \(T:\Theta\to \mathcal Y\); in the trivial case \(T\) is just the identity.
- An estimator (decision rule) has type \(\hat T:\mathcal X\to \hat {\mathcal Y}\).
The action space \(\hat {\mathcal Y}\) does not have to be
the estimand space \(\mathcal Y\) (e.g. when we’re estimating a confidence interval).
- \(\hat T\) can be deterministic or randomized.
- To evaluate the estimator, we fix a loss function \(l:\mathcal Y\times \hat{\mathcal Y}\to \mathbb R\); together with the model, it defines the risk of \(\hat T\) for estimating \(T\).
- The (expected) risk of the estimator \(\hat T\) for estimating \(T\) is a function of the true parameter \(\theta\) and the estimator \(\hat T\): \[ R_\theta(\hat T) = \mathbb E_{X\sim P_\theta}[l(T(\theta), \hat T(X))] = \int P_\theta(dx)\, P_{\hat T|X}(d\hat t|x)\, l(T(\theta), \hat t) \]
Two comments are in order:
- For the minimax risk, it is sometimes necessary to randomize. To minimize the average risk over a prior (Bayesian approach), though, it suffices to consider deterministic functions.
- The space of randomized estimators (as Markov kernels) is convex. This is necessary for minimax theorems.
- For \(\hat T\mapsto l(T, \hat T)\) convex, we can derandomize by considering \(\mathbb E[\hat T|X]\): by Jensen’s inequality, \[ R_\theta(\hat T) = \mathbb E_\theta l(T, \hat T) \geq \mathbb E_\theta l(T, \mathbb E[\hat T|X]) \]
Definition 9.1 (Bayes risk) Given a prior \(\pi\), the average risk w.r.t. \(\pi\) of an estimator is \[ R_\pi(\hat \theta) = \mathbb E_{\theta \sim \pi} R_\theta(\hat \theta) = \mathbb E_{\theta \sim \pi, X\sim P_\theta} l(\theta, \hat \theta(X)) \] The Bayes risk of a prior is its minimal average risk \(R_\pi^* = \inf_{\hat \theta} R_\pi(\hat \theta)\). The optimal \(\hat \theta\) is the Bayes estimator.
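As a concrete illustration (a sketch with assumed model choices, not from the text): in the Bernoulli model with \(n\) coin flips, quadratic loss, and the uniform \(\mathrm{Beta}(1,1)\) prior, the Bayes estimator is the posterior mean \((S+1)/(n+2)\), where \(S\) is the success count. A quick Monte Carlo confirms its average risk beats the sample mean’s:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 10, 200_000

# Model: S ~ Binomial(n, theta), quadratic loss, uniform Beta(1, 1) prior.
theta = rng.uniform(0.0, 1.0, trials)       # theta ~ pi
s = rng.binomial(n, theta)                  # observed success count

mle = s / n                                 # sample mean (MLE)
bayes = (s + 1) / (n + 2)                   # posterior mean under Beta(1, 1)

risk_mle = np.mean((mle - theta) ** 2)      # average risk R_pi of the MLE
risk_bayes = np.mean((bayes - theta) ** 2)  # average risk of the Bayes estimator

# The Bayes estimator minimizes the average risk R_pi, so:
assert risk_bayes < risk_mle
```

For these particular choices the exact values are \(R_\pi(\mathrm{MLE}) = \mathbb E[\theta(1-\theta)]/n = 1/60\) and \(R^*_\pi = \mathbb E[\mathrm{Var}(\theta|S)] = 1/72\).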
Definition 9.2 (minimax risk) Given a parameter family \(\{P_\theta\}_{\theta\in\Theta}\) and
loss function \(l\), the minimax risk is
\[
R^* = \inf_{\hat \theta} \sup_{\theta \in \Theta}
R_\theta(\hat \theta)
= \inf_{\hat \theta} \sup_{\theta \in \Theta}
\mathbb E_{X\sim P_\theta}\, l(\theta, \hat \theta(X))
\]
To prove that the minimax risk equals \(R^*\), one needs to establish, for arbitrary \(\epsilon>0\):
- (upper bound) an estimator \(\hat \theta^*\) satisfying \(\sup_\theta R_\theta(\hat \theta^*)\leq R^*+\epsilon\);
- (lower bound) for arbitrary \(\hat \theta\), \(\sup_{\theta} R_\theta(\hat \theta)\geq R^*-\epsilon\).
Theorem 9.1 (duality between minimax and Bayes risk) Let \(\mathcal P(\Theta)\) denote the set of probability distributions on \(\Theta\); then \[ R^* = \inf_{\hat \theta}\sup_{\theta \in \Theta} R_\theta(\hat \theta) \geq R^*_{\mathrm{Bayes}} = \sup_{\pi \in \mathcal P(\Theta)} \inf_{\hat \theta} R_\pi(\hat \theta) \]
Proof: \(R^* = \inf_{\hat \theta}\sup_{\theta \in \Theta} R_\theta(\hat \theta) = \inf_{\hat \theta}\sup_{\pi \in \mathcal P(\Theta)} R_\pi(\hat \theta) \geq R^*_{\mathrm{Bayes}}\) by \(\inf\sup \geq \sup\inf\).
Example 9.1 (strict inequality) Let \(\theta, \hat \theta \in \{1, 2, \cdots\}\) and \(l(\theta, \hat \theta) = 1_{\hat \theta < \theta}\), then \(R^* = 1\) while \(R^*_{\mathrm{Bayes}} = 0\).
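To see the collapse of the Bayes risk concretely, here is a small exact computation (illustrative; the geometric prior is an assumption, and as in the example the observation carries no information about \(\theta\)). For any fixed prior, a constant estimator at a high quantile drives the average risk toward zero, while every estimator still has \(\sup_\theta R_\theta = 1\):

```python
from fractions import Fraction

# Loss l(theta, theta_hat) = 1{theta_hat < theta} on Theta = {1, 2, ...}.
# Illustrative prior (an assumption, not from the text): pi(k) = 2^{-k}.
def avg_risk_of_constant(m: int) -> Fraction:
    # The constant estimator theta_hat = m errs iff theta > m, and
    # pi(theta > m) = sum_{k > m} 2^{-k} = 2^{-m}.
    return Fraction(1, 2 ** m)

risks = [avg_risk_of_constant(m) for m in range(1, 11)]
# The average risk 2^{-m} is arbitrarily small, so R_pi^* = 0 for any such
# prior, while any fixed estimator suffers sup_theta R_theta = 1 (take theta
# larger than anything it outputs): hence R* = 1 > 0 = R*_Bayes.
```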
A duality perspective
For simplicity, let \(\Theta\) be a finite set and \(l\) be convex; then \[ R^* = \min_{P_{\hat \theta|X}} \max_{\theta \in \Theta} \mathbb E_\theta \, l(\theta, \hat \theta) \] This is a convex optimization problem since \[ P_{\hat \theta|X}\mapsto \mathbb E_\theta \, l(\theta, \hat \theta) = \mathbb E_{X\sim P_\theta, \hat \theta\sim P_{\hat \theta|X}}[l(\theta, \hat \theta)] \] is linear and the pointwise supremum of convex functions is convex. To derive its dual, rewrite \[ R^* = \min_{P_{\hat \theta|X}, t} \, t\quad \text{s.t. } \mathbb E_\theta\, l(\theta, \hat \theta) \leq t, \quad \forall \theta \in \Theta \] Introduce a multiplier \(\pi_\theta\geq 0\) for each inequality constraint. The Lagrangian is \[ \mathcal L(P_{\hat \theta|X}, t, \pi) = t + \sum_{\theta \in \Theta} \pi_\theta\cdot \left( \mathbb E_\theta\, l(\theta, \hat \theta) - t \right) = \left( 1 - \sum_{\theta \in \Theta} \pi_\theta \right)t + \sum_{\theta \in \Theta} \pi_\theta \mathbb E_\theta\, l(\theta, \hat \theta) \] Minimizing over \(t\) forces \(\sum_\theta \pi_\theta = 1\) (otherwise the infimum is \(-\infty\)), so \(\pi\) must be a probability vector, yielding the dual problem \[ \max_\pi \min_{P_{\hat \theta|X}} \sum_{\theta \in \Theta} \pi_\theta \mathbb E_\theta\, l(\theta, \hat \theta) = \max_{\pi \in \mathcal P(\Theta)} R_\pi^* \]
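The finite-\(\Theta\) primal/dual pair above can be checked numerically on a toy binary experiment (the 0.2/0.8 numbers below are illustrative choices, not from the text): the smallest worst-case risk over randomized estimators matches the largest Bayes risk over priors.

```python
import numpy as np

# Toy binary experiment: Theta = {0, 1}, one observation X in {0, 1},
# P_0(X=1) = 0.2, P_1(X=1) = 0.8, zero-one loss (illustrative numbers).
P = np.array([[0.8, 0.2],   # row theta = 0: P_0(x) for x = 0, 1
              [0.2, 0.8]])  # row theta = 1: P_1(x)

# Primal: a randomized estimator is q_x = P(theta_hat = 1 | X = x).
# Grid-search the worst-case risk max(R_0, R_1) over (q_0, q_1).
qs = np.linspace(0.0, 1.0, 201)
Q0, Q1 = np.meshgrid(qs, qs, indexing="ij")
R0 = P[0, 0] * Q0 + P[0, 1] * Q1              # risk at theta = 0
R1 = P[1, 0] * (1 - Q0) + P[1, 1] * (1 - Q1)  # risk at theta = 1
minimax = np.min(np.maximum(R0, R1))

# Dual: under 0-1 loss, the Bayes risk of the prior (1 - p, p) is
# sum_x min((1 - p) P_0(x), p P_1(x)); maximize over priors.
ps = np.linspace(0.0, 1.0, 201)
bayes_risks = np.minimum((1 - ps)[:, None] * P[0], ps[:, None] * P[1]).sum(axis=1)
bayes = bayes_risks.max()

# Minimax equality R* = R*_Bayes for this finite model (both equal 0.2).
assert abs(minimax - bayes) < 1e-6
```

The least favorable prior here is uniform, and the minimax rule \(\hat\theta = X\) is an equalizer with risk 0.2 at both parameter values.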
Theorem 9.2 (general minimax equality) \(R^* = R^*_{\mathrm{Bayes}}\) if the following conditions hold:
- The experiment is dominated: \(P_\theta \ll \nu\) for all \(\theta\), for some measure \(\nu\).
- The action space \(\hat \Theta\) (codomain of the estimator) is a locally compact topological space with a countable basis.
- The loss function is level compact: for each \(\theta\in \Theta\), \(l(\theta, \cdot)\) is bounded from below and \(\{\hat \theta:l(\theta, \hat \theta)\leq a\}\) is compact for each \(a\).
We will prove the following special case for demonstration.
Proposition 9.1 (special minimax equality) \(R^* = R^*_{\mathrm{Bayes}}\) if \(\Theta\) is finite and \(l\) is bounded from below (e.g. quadratic).
Proof
- First consider the edge case \(R^*=\infty \iff R^*_{\mathrm{Bayes}} = \infty\). The \(\Leftarrow\) direction follows from theorem 9.1; for \(\Rightarrow\), take the uniform prior \(\pi\) on \(\Theta\): since \(l\) is bounded below, \(R_\pi(\hat \theta) \geq \frac{1}{|\Theta|}\sup_\theta R_\theta(\hat \theta) - O(1) = \infty\) for every \(\hat \theta\).
- Next consider \(R^*<\infty\). Given an estimator \(\hat \theta\), denote its risk vector by \(R(\hat \theta)\in \mathbb R^\Theta\) with components \(R(\hat \theta)_\theta = \mathbb E_{X\sim P_\theta}\, l(\theta, \hat \theta)\), indexed by \(\theta\).
- The average risk is given by the inner product \(\langle R(\hat \theta), \pi\rangle\).
- Define \(S\subset \mathbb R^\Theta\) to be the set of all risk vectors of randomized estimators; note that \(S\) is convex because linear interpolations of risk vectors are realized by mixtures of the corresponding randomized estimators.
- The set \(T=\{t\in \mathbb R^\Theta: t_\theta < R^* \ \forall \theta\}\) is convex, and \(S\cap T=\emptyset\) since every estimator has at least one risk component \(\geq R^*\).
- By the hyperplane separation theorem, there exist \(\pi \in \mathbb R^\Theta\) and \(c\in \mathbb R\) such that \(\inf_{s\in S}\langle \pi, s\rangle \geq c \geq \sup_{t\in T}\langle \pi, t\rangle\). Now \(\pi\) must be componentwise nonnegative, else \(\sup_{t\in T}\langle\pi, t\rangle= \infty\), so w.l.o.g. \(\pi\) is a probability vector.
- Thus we have established \(R^*_{\mathrm{Bayes}} \geq R_\pi^* = \inf_{\hat \theta}\langle R(\hat \theta), \pi\rangle \geq c \geq \sup_{t\in T}\langle \pi, t\rangle = R^*\); the reverse inequality is theorem 9.1.
Sample complexity, tensor products
Given an experiment \(\{P_\theta\}_{\theta\in \Theta}\), the independent sampling model refers to the experiment \[ \mathcal P_n = \{P_{\theta}^{\otimes n} : \theta \in \Theta\}, \quad n\geq 1 \] Define \(R_n^*(\Theta) = \inf_{\hat \theta} \sup_{\theta \in \Theta} \mathbb E_\theta\, l(\theta, \hat \theta)\) to be the minimax risk when \(\hat \theta\) consumes \(X=(X_1, \cdots, X_n)\) consisting of \(n\) independent observations. Note that:
- \(n\mapsto R_n^*(\Theta)\) is non-increasing since we can always discard extra observations.
- We typically expect \(R_n^*(\Theta)\to 0\) as \(n\to \infty\).
Definition 9.3 (sample complexity) The sample complexity is the minimum sample size required to obtain a prescribed error \(\epsilon\) in the worst case (of actual parameter): \[ n^*(\epsilon) = \min\{n\in \mathbb N: R^*_n(\Theta)\leq \epsilon\} \]
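As an illustration (assuming the standard fact, not derived here, that the Gaussian location model with quadratic loss has minimax risk \(R_n^* = \sigma^2/n\)), the sample complexity has a closed form:

```python
import math

def gaussian_sample_complexity(sigma2: float, eps: float) -> int:
    """n*(eps) for the Gaussian location model, where R_n* = sigma^2 / n."""
    # min{n : sigma^2 / n <= eps} = ceil(sigma^2 / eps)
    return math.ceil(sigma2 / eps)

# Halving the target error doubles the required sample size.
assert gaussian_sample_complexity(1.0, 0.01) == 100
assert gaussian_sample_complexity(1.0, 0.005) == 200
```

This also exhibits the monotonicity noted above: \(n^*(\epsilon)\) is non-increasing in \(\epsilon\) because \(n \mapsto R_n^*\) is non-increasing.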
Definition 9.4 (tensor product experiment) Given statistical experiments \(\mathcal P_j = \{P_{\theta_j} : \theta_j \in \Theta_j\}\) and losses \(l_j\) for each \(j\in [d]\), the tensor product experiment is \[ \mathcal P = \left\{\prod_j P_{\theta_j} : (\theta_j)\in \Theta = \prod_j \Theta_j\right\} \] with loss function \(l(\theta, \hat \theta) = \sum_j l_j(\theta_j, \hat \theta_j)\). In this model, the observations \((X_1, \cdots, X_d)\) are independent but not necessarily identically distributed.
Theorem 9.3 (minimax risk of tensor product) \[ \sum_{j=1}^d R^*_{\mathrm{Bayes}}(\mathcal P_j) \leq R^*(\mathcal P) \leq \sum_{j=1}^d R^*(\mathcal P_j) \] If the minimax equality \(R^*(\mathcal P_j) = R^*_{\mathrm{Bayes}}(\mathcal P_j)\) holds for each component experiment, then it holds for the product experiment, and the minimax risk decomposes additively.
Proof
For the right inequality, choose \(\hat \theta^* = (\hat \theta_j^*)\), where \(\hat \theta_j^*\) is the component minimax estimator, ignoring all other observations. For the left inequality, consider for each \((\pi_j)\) the product prior \(\pi = \prod_j \pi_j\); under this prior, both the \(\theta_j\)’s and the \(X_j\)’s are independent. For any component \(\hat \theta_j = \hat \theta_j(X, U_j)\) of any randomized estimator \(\hat \theta\), the non-\(X_j\) inputs \((U_j, X_{\bar j})\) can be viewed as pure randomization, so \(\hat \theta_j\) is a randomized estimator based on \(X_j\) alone (the other \(X_{\bar j}\)’s carry no information about \(\theta_j\) by independence of the \(\theta_j\)’s). Thus its average risk satisfies \(R_{\pi_j}(\hat \theta_j) \geq R^*_{\pi_j}\). Take the supremum over all \(\pi_j\) and sum over \(j\) to obtain the left inequality.
HCR lower bound
Theorem 9.4 (Hammersley-Chapman-Robbins (HCR) lower bound) The quadratic loss of any estimator \(\hat \theta\) at \(\theta\in \Theta\subset \mathbb R\) satisfies \[ R_\theta(\hat \theta) = \mathbb E_{X\sim P_\theta} [\hat \theta(X) - \theta]^2 \geq \mathrm{Var}_\theta(\hat \theta) \geq \sup_{\theta'\neq \theta} \dfrac{ (\mathbb E_\theta \hat \theta - \mathbb E_{\theta'} \hat \theta)^2 }{\chi^2(P_{\theta'} \| P_\theta)} \]
Proof
Recall the variational characterization of \(\chi^2\) (proposition 8.11): for all \(g:\mathcal X\to \mathbb R\), \[ \mathrm{Var}_Q[g] \geq \dfrac{(\mathbb E_P g - \mathbb E_Q g)^2}{\chi^2(P\|Q)} \] Given two different underlying parameters \(\theta, \theta'\), let \(P_X=P_{\theta'}, Q_X=P_{\theta}\); applying the data processor \(X\mapsto \hat \theta(X)\) and then the variational characterization (with \(g\) the identity), \[ \chi^2(P_X \| Q_X) \geq \chi^2(P_{\hat \theta} \| Q_{\hat \theta}) \geq \dfrac{(\mathbb E_{\theta'}\hat \theta - \mathbb E_{\theta} \hat \theta)^2} {\mathrm{Var}_\theta(\hat \theta)} \]
Corollary 9.1 (Cramér-Rao lower bound) For an unbiased estimator \(\hat \theta\), i.e. \(\mathbb E_{\theta}[\hat \theta] = \theta\) for all \(\theta\), we obtain \[ \mathrm{Var}_\theta(\hat \theta) \geq \mathcal J_F(\theta)^{-1} \]
Proof
Apply the unbiasedness condition to theorem 9.4, take \(\theta' = \theta + \xi\), and apply the local Fisher behavior of divergence (theorem 8.12) as \(\xi \to 0\): \[\begin{align} \mathrm{Var}_\theta(\hat \theta) &\geq \dfrac{ \xi^2 }{\chi^2(P_{\theta + \xi} \| P_\theta)} = \dfrac{ \xi^2 }{\mathcal J_F(\theta) \xi^2 + o(\xi^2)} \to \mathcal J_F(\theta)^{-1} \end{align}\]To generalize to \(\theta \in \Theta\subset \mathbb R^d\), assume \(\hat \theta\) is unbiased and that \[ \mathrm{Cov}_\theta(\hat \theta) = \mathbb E_\theta[(\hat \theta - \theta)(\hat \theta - \theta)^T] \] is positive definite; apply theorem 9.4 to each data processor \(X\mapsto \langle a, \hat \theta(X)\rangle, a\in \mathbb R^d\), to obtain \[ \chi^2(P_{\theta'} \| P_{\theta}) \geq \dfrac{\langle a, \theta - \theta'\rangle^2}{a^T \mathrm{Cov}_\theta(\hat \theta)a} \] Optimizing over \(a\) using, for \(\Sigma \succ 0\), \[ \sup_{x\neq 0} \dfrac{\langle x, y\rangle^2}{x^T\Sigma x} = y^T \Sigma^{-1} y, \quad \text{attained at } x = \Sigma^{-1} y \] yields \(\chi^2(P_{\theta'} \| P_{\theta}) \geq (\theta - \theta')^T \mathrm{Cov}_\theta(\hat \theta)^{-1} (\theta - \theta')\). Applying the multivariate local approximation again yields \(\mathcal J_F^{-1}(\theta) \preceq \mathrm{Cov}_\theta(\hat \theta)\).
Further recall that the Fisher information is additive under i.i.d. tensorization, so for unbiased estimators based on \(n\) i.i.d. observations, taking the trace on both sides of \(\frac 1 n \mathcal J_F^{-1}(\theta) \preceq \mathrm{Cov}_\theta(\hat \theta)\) gives \[ \mathbb E_\theta \|\hat \theta - \theta\|_2^2 \geq \dfrac 1 n \mathrm{tr}\,\mathcal J_F^{-1}(\theta) \]
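A quick Monte Carlo check (an illustrative model choice, not from the text): in the Bernoulli model, \(\mathcal J_F(\theta) = 1/(\theta(1-\theta))\), and the sample mean is unbiased with variance \(\theta(1-\theta)/n = 1/(n\mathcal J_F(\theta))\), so the Cramér-Rao bound is attained with equality.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, n, trials = 0.3, 50, 100_000

x = rng.binomial(1, theta, size=(trials, n))
theta_hat = x.mean(axis=1)                  # sample mean; unbiased for theta

var_hat = theta_hat.var()
fisher = 1.0 / (theta * (1 - theta))        # per-observation J_F(theta)
cr_bound = 1.0 / (n * fisher)               # = theta (1 - theta) / n

# The sample mean attains the Cramer-Rao bound with equality in this model.
assert abs(var_hat - cr_bound) / cr_bound < 0.05
```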
Bayesian perspective
In minimax settings, it is often wise to
trade bias for variance to achieve a smaller
overall risk. Fixing a prior \(\pi \in \mathcal P(\Theta)\),
consider the following joint distributions for
\((\theta, X)\):
- Under \(Q\), \(\theta\sim \pi\) and \(X\sim P_\theta\) after sampling \(\theta\).
- Under \(P\), \(\theta\sim T_\delta\pi\), where \(T_\delta\pi\) is the displaced prior \(T_\delta \pi(A) = \pi(A - \delta)\), and \(X\sim P_{\theta - \delta}\) conditionally on \(\theta\).
Theorem 9.5 (van Trees inequality) Fixing a differentiable prior \(\pi\) and using the setup for \(P_{X\theta}, Q_{X\theta}\) above, under regularity conditions \[ R^* \geq R^*_\pi = \inf_{\hat \theta} \mathbb E_\pi[(\hat \theta - \theta)^2] \geq \dfrac 1 {J(\pi) + \mathbb E_{\theta\sim \pi} \mathcal J_F(\theta)} \] Note that \(\mathbb E_\pi[(\hat \theta - \theta)^2]\) is in fact expectation over \(\theta\sim \pi\), then \(X\sim P_\theta\), and \(J(\pi)\) is the Fisher information of the location family \(\pi\). The regularity conditions are:
- \(\pi\) is differentiable, supported on \([\theta_0, \theta_1]\) and \(\pi(\theta_0) = \pi(\theta_1) = 0\).
- The location-family Fisher information of the prior is finite \[ J(\pi) = \int_{\theta_0}^{\theta_1} \dfrac{\pi'(\theta)^2}{\pi(\theta)}\, d\theta < \infty \]
- The family has a dominating measure \(\mu\): \(P_\theta = p_\theta\, \mu\), and \(p_\theta(x)\) is differentiable in \(\theta\) a.e.
- For \(\pi\)-almost every \(\theta\): \(\int \partial_{\theta }p_\theta(x)\, d\mu = 0\).
Proof by data-processing
Fix any (possibly randomized) estimator \(\hat \theta\). The marginals (but not the joints!) agree, \(P_X = Q_X\), so \(\mathbb E_P\hat \theta = \mathbb E_Q\hat \theta\), while \(\mathbb E_P \theta = \mathbb E_Q \theta + \delta\). Then \[\begin{align} \chi^2(P_{\theta X} \| Q_{\theta X}) &\geq \chi^2(P_{\theta \hat \theta} \| Q_{\theta\hat \theta}) \geq \chi^2(P_{\theta - \hat \theta} \| Q_{\theta - \hat \theta}) \\ &\geq \dfrac{(\mathbb E_P[\theta - \hat \theta] - \mathbb E_Q[\theta - \hat \theta])^2}{\mathrm{Var}_Q(\hat \theta - \theta)} = \dfrac{\delta^2}{\mathrm{Var}_Q(\hat \theta - \theta)} \geq \dfrac{\delta^2}{\mathbb E_Q[\hat \theta - \theta]^2} \end{align}\] It remains to upper-bound \(\chi^2(P_{\theta X} \| Q_{\theta X})\); first consider the local expansions:
- \(\chi^2(P_\theta \| Q_\theta) = \chi^2(T_\delta \pi \| \pi) \approx J(\pi) \delta^2\).
- \(\chi^2(P_{X|\theta} \| Q_{X|\theta}) \approx \mathcal J_F(\theta) \delta^2\). Next, apply the \(\chi^2\) chain rule (proposition 8.6) to obtain \[\begin{align} \chi^2(P_{X\theta} \| Q_{X\theta}) &= \chi^2(P_\theta \| Q_{\theta}) + \mathbb E_Q \left[ \chi^2(P_{X|\theta} \| Q_{X|\theta}) \left(\dfrac{dP_\theta}{dQ_\theta}\right)^2 \right] \\ R_\pi^* &\geq \dfrac{\delta^2}{J(\pi) \delta^2 + \mathbb E_Q \left[\mathcal J_F(\theta) \delta^2 (dP_\theta/dQ_\theta)^2\right]}\\ &\to \dfrac 1 {J(\pi) + \mathbb E_{\theta \sim \pi} \mathcal J_F(\theta)} \quad (\delta \to 0) \end{align}\] In the last step, \(dP_\theta/dQ_\theta\to 1\) by continuity of \(\pi\).
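A sanity check with conjugate Gaussians (illustrative; the Gaussian prior violates the compact-support condition stated above, but it is the classic case where the van Trees bound is tight): for \(\theta\sim N(0,\tau^2)\) and \(X_i\sim N(\theta,\sigma^2)\), we have \(J(\pi)=1/\tau^2\) and \(\mathbb E_\pi \mathcal J_F = n/\sigma^2\) for \(n\) observations, and the posterior mean has Bayes risk exactly \((1/\tau^2 + n/\sigma^2)^{-1}\).

```python
import numpy as np

rng = np.random.default_rng(2)
tau2, sigma2, n, trials = 1.0, 4.0, 10, 200_000

theta = rng.normal(0.0, np.sqrt(tau2), trials)             # theta ~ N(0, tau^2)
x = rng.normal(theta[:, None], np.sqrt(sigma2), (trials, n))

# Posterior mean under the conjugate Gaussian model (shrunk sample mean).
w = (n / sigma2) / (1 / tau2 + n / sigma2)
theta_hat = w * x.mean(axis=1)

mse = np.mean((theta_hat - theta) ** 2)
# van Trees RHS: 1 / (J(pi) + E_pi J_F) = 1 / (1/tau^2 + n/sigma^2).
bound = 1.0 / (1.0 / tau2 + n / sigma2)

assert mse >= 0.97 * bound      # the bound holds (tight here, up to MC error)
```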
Direct proof
For the Bayes setting, w.l.o.g. assume that the estimator is deterministic. For each \(x\), integrate by parts to obtain \[ \int_{\theta_0}^{\theta_1} (\hat \theta - \theta)\, \partial_{\theta}[p_\theta \pi(\theta)]\, d\theta = \int_{\theta_0}^{\theta_1} p_\theta \pi(\theta)\, d\theta \] Next integrate both sides over \(\mu(dx)\) to obtain \[ \mathbb E_{\theta, X}[(\hat \theta - \theta)V(\theta, X)] = 1, \quad V(\theta, x) = \partial_{\theta }\log[p_\theta(x) \pi(\theta)] \] Next apply Cauchy-Schwarz to obtain \(\mathbb E[(\hat \theta - \theta)^2]\,\mathbb E[V(\theta, X)^2]\geq 1\), where, since the cross term vanishes by the regularity conditions, \[ \mathbb E[V(\theta, X)^2] = \mathbb E[(\partial_{\theta }\log p_\theta(X))^2] + \mathbb E[(\partial_{\theta }\log \pi(\theta))^2] = \mathbb E_{\theta \sim \pi}\, \mathcal J_F(\theta) + J(\pi) \]
- We assume a prior density that vanishes at the boundary. A uniform prior yields the Chernoff-Rubin-Stein inequality.
- To obtain the tightest lower bound, one should use the regular prior with minimum Fisher information, which is known to be \(g(u) = \cos^2(\pi u/2)\) supported on \([-1, 1]\), with Fisher information \(J(g) = \pi^2\).
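The claim \(J(g) = \pi^2\) for the cosine prior can be verified by quadrature (a sketch using the identity \(g'^2/g = \pi^2\sin^2(\pi u/2)\), which keeps the integrand bounded at the endpoints):

```python
import numpy as np

# g(u) = cos^2(pi u / 2) on [-1, 1]; check J(g) = int g'^2 / g du = pi^2.
N = 1_000_000
u = -1.0 + (np.arange(N) + 0.5) * (2.0 / N)     # midpoint grid on (-1, 1)
g = np.cos(np.pi * u / 2) ** 2
g_prime = -(np.pi / 2) * np.sin(np.pi * u)      # derivative of g
integrand = g_prime ** 2 / g                    # equals pi^2 sin^2(pi u / 2)
J = integrand.sum() * (2.0 / N)                 # midpoint rule

assert abs(J - np.pi ** 2) < 1e-4
```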
We only state the multivariate version of the theorem above.
Theorem 9.6 (multivariate BCR) Given a product prior density \(\pi(\theta) = \prod_{j=1}^d \pi_j(\theta_j)\) such that each \(\pi_j\) is compactly supported and vanishes at the boundary, suppose that for \(\pi\)-almost every \(\theta\) \[ \int \mu(dx)\, \nabla_\theta p_\theta(x) = 0 \] Then, for \(J(\pi) = \mathrm{diag}(\{J(\pi_j)\})\), we have (the inverse is taken before the trace) \[ R_\pi^* = \inf_{\hat \theta} \mathbb E_\pi \|\theta - \hat \theta\|_2^2 \geq \mathrm{tr}\left( \mathbb E_{\theta \sim \pi} \mathcal J_F(\theta) + J(\pi) \right)^{-1} \]
Theorem 9.7 (lower bound on asymptotic minimax risk) Assume \(\theta \mapsto \mathcal J_F(\theta)\) is continuous; let \(X_1, \cdots, X_n\sim P_\theta\) i.i.d. and define the minimax risk \(R_n^* = \inf_{\hat \theta} \sup_{\theta \in \Theta} \mathbb E_\theta \|\hat \theta - \theta\|_2^2\); then \[ R_n^* \geq \dfrac{1+o(1)}{n} \sup_{\theta \in \Theta} \mathrm{tr}\,\mathcal J_F^{-1}(\theta) \]
Proof
Fix \(\theta\in \Theta\) and choose \(\delta\) small enough that \(\theta + [-\delta, \delta]^d \subset \Theta\), and let \(\pi_j(\theta_j)\) be the minimal-Fisher-information cosine prior rescaled to \([\theta_j - \delta, \theta_j + \delta]\). Then \[ J(\pi_j) = \dfrac 1 {\delta^2} J(g) = \dfrac{\pi^2}{\delta^2}\implies J(\pi) = \dfrac{\pi^2}{\delta^2}I_d \] Next, continuity of \(\theta \mapsto \mathcal J_F(\theta)\) guarantees the hypotheses of the multivariate BCR bound (theorem 9.6); applying it together with the additivity of Fisher information, we obtain \[ R_n^* \geq \mathrm{tr}\left( n \mathbb E_{\theta \sim \pi} \mathcal J_F(\theta) + J(\pi) \right)^{-1} = \dfrac 1 n \mathrm{tr}\left( \mathbb E_{\theta \sim \pi} \mathcal J_F(\theta) + \dfrac{\pi^2}{n\delta^2}I_d \right)^{-1} \] Choose \(\delta = n^{-1/4}\) and apply the continuity of \(\mathcal J_F\) at \(\theta\).
Miscellany: MLE
Consider \(X^n=(X_j)\sim P_{\theta_0}\) i.i.d.; the maximum likelihood estimator (MLE) is \[ \hat \theta_{\mathrm{MLE}} = \mathrm{arg}\max_{\theta\in \Theta} L_\theta(X^n), \quad L_\theta(X^n) = \sum_{j=1}^n \log p_\theta(X_j) \] For discrete distributions, we also have \(\hat \theta_{\mathrm{MLE}} \in \mathrm{arg}\min_{\theta \in \Theta} D(\hat P_n \| P_\theta)\), where \(\hat P_n\) is the empirical distribution.
- The expected log-likelihood is maximized at the true parameter value \(\theta_0\): supposing \(\theta \mapsto P_\theta\) is injective, for \(\theta \neq \theta_0\), \[ \mathbb E_{\theta_0}[L_\theta - L_{\theta_0}] = \mathbb E_{\theta_0} \left[ -\sum_j \log \dfrac{p_{\theta_0}(X_j)}{p_\theta(X_j)} \right] = -nD(P_{\theta_0} \| P_\theta) < 0 \] Assuming regularity conditions and combining with the law of large numbers, this ensures consistency: \(\hat \theta_n\to \theta_0\) as \(n\to\infty\).
- Assuming more regularity conditions, we can expand \[ L_\theta = L_{\theta_0} + (\theta - \theta_0)^T \sum_j V(\theta_0, X_j) + \dfrac 1 2 (\theta - \theta_0)^T \left( \sum_j H(\theta_0, X_j) \right)(\theta - \theta_0) + \cdots \] Under regularity conditions, \(\frac 1 n \sum_j H(\theta_0, X_j) \to -\mathcal J_F(\theta_0)\) by the law of large numbers, while the score term \(\sum_j V(\theta_0, X_j)\) has mean zero and is approximately \(\sqrt{n\mathcal J_F(\theta_0)}\, Z\) with \(Z \sim N(0, I)\) by the CLT. This yields the stochastic approximation \[ L_\theta \approx L_{\theta_0} + \langle\sqrt{n\mathcal J_F(\theta_0)}\, Z, \theta - \theta_0\rangle - \dfrac n 2 (\theta - \theta_0)^T \mathcal J_F(\theta_0) (\theta - \theta_0) \] Maximizing the RHS yields the asymptotic normality expression \[ \hat \theta_{\mathrm{MLE}} \approx \theta_0 + \dfrac 1 {\sqrt n} \mathcal J_F(\theta_0)^{-1/2} Z \]
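The asymptotic normality above can be checked by simulation (an illustrative model choice, not from the text): for \(X_i \sim \mathrm{Exp}(\lambda_0)\), the MLE of the rate is \(1/\bar X\), and \(\mathcal J_F(\lambda) = 1/\lambda^2\), so \(\sqrt n(\hat\lambda - \lambda_0)\) should be approximately \(N(0, \lambda_0^2)\).

```python
import numpy as np

rng = np.random.default_rng(3)
lam0, n, trials = 2.0, 2000, 200_000

# For X_i ~ Exp(rate lam0) i.i.d., the sample mean is Gamma(n, 1/(n lam0)),
# so we can sample it directly; the MLE of the rate is 1 / sample mean.
xbar = rng.gamma(shape=n, scale=1.0 / (n * lam0), size=trials)
lam_hat = 1.0 / xbar

# Asymptotic normality: sqrt(n) (lam_hat - lam0) ~ N(0, J_F(lam0)^{-1})
# with Fisher information J_F(lam) = 1 / lam^2.
z = np.sqrt(n) * (lam_hat - lam0)
assert abs(z.mean()) < 0.1                # asymptotically centered
assert abs(z.var() - lam0 ** 2) < 0.15    # variance close to J_F^{-1} = lam0^2
```

Note the MLE here is biased for finite \(n\) (by roughly \(\lambda_0/n\)), yet the \(1/\sqrt n\)-scale fluctuations still match \(\mathcal J_F^{-1}\), consistent with the biased-estimator discussion around theorem 9.7.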