$\newcommand{\P}{\mathbb{P}}$ $\newcommand{\E}{\mathbb{E}}$ $\newcommand{\S}{\mathcal{S}}$ $\newcommand{\var}{\mathrm{Var}}$ $\newcommand{\btheta}{\boldsymbol{\theta}}$ $\newcommand{\bphi}{\boldsymbol{\phi}}$ $\newcommand{\bpi}{\boldsymbol{\pi}}$ $\newcommand{\bmu}{\boldsymbol{\mu}}$ $\newcommand{\bSigma}{\boldsymbol{\Sigma}}$ $\newcommand{\balpha}{\boldsymbol{\alpha}}$ $\newcommand{\indep}{\perp\!\!\!\perp}$

# Probabilistic modeling, inference and sampling¶

## 1 Background: conditioning; parametric families¶

Course: Math 535 - Mathematical Methods in Data Science (MMiDS)
Author: Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison
Updated: Nov 27, 2020

### 1.1 Conditioning¶

We start with events. Thoughout, we work on a fixed probability space $(\Omega, \mathcal{F}, \P)$, which we assume is finite.

Definition (Conditional Probability): Let $A$ and $B$ be two events with $\mathbb{P}[B] > 0$. The conditional probability of $A$ given $B$ is defined as

$$\P[A|B] = \frac{\P[A \cap B]}{\P[B]}.$$

$\lhd$

The intuitive interpretation goes something like this: knowing that event $B$ has occurred, the updated probability of observing $A$ is the probability of its restriction to $B$ properly normalized to reflect that outcomes outside $B$ have updated probability $0$. This is illustrated next. (Source)

Conditional probabilities behave like unconditional probabilities. For instance:

Exercise: Let $A_1, \ldots, A_m$ be disjoint events and let $B$ be such that $\P[B] > 0$. Show that

$$\P[A_1 \cup \cdots \cup A_m|B] = \P[A_1|B] + \cdots + \P[A_m|B].$$

$\lhd$

Lemma (Independence and Conditional Probability): Let $A$ and $B$ be two events of positive probability. Then $A$ and $B$ are independent, which we will denote as $A \indep B$, if and only if $\P[A|B] = \P[A]$ and $\P[B|A] = \P[B]$.

Proof: If $A$ and $B$ are independent, then c which implies

$$\P[A|B] = \frac{\P[A \cap B]}{\P[B]} = \frac{\P[A] \P[B]}{\P[B]} = \P[A].$$

In the other direction,

$$\P[A] = \P[A|B] = \frac{\P[A \cap B]}{\P[B]}$$

implies $\P[A|B] = \frac{\P[A \cap B]}{\P[B]}$ after rearranging. $\square$

• Product (or chain) rule: For any collection of events $A_1,\ldots,A_r$,
$$\P\left[\cap_{i=1}^r A_i\right] = \prod_{i=1}^r \P\left[A_i \,\middle|\, \cap_{j=1}^{i-1} A_j \right].$$
• Law of total probability: For any event $B$ and any partition $A_1,\ldots,A_r$ of $\Omega$,
$$\P[B] = \sum_{i=1}^r \P[B|A_i] \P[A_i].$$
• Bayes' rule: For any events $A$ and $B$ with positie probability,
$$\P[A|B] = \frac{\P[B|A]\P[A]}{\P[B]}.$$

We define the conditional probability mass function.

Definition (Conditional Probability Mass Function): Let $X$ and $Y$ be discrete random variables with joint probability mass function $p_{X, Y}$ and marginals $p_X$ and $p_Y$. The conditional probability mass function of $X$ given $Y$ is defined as

$$p_{X|Y}(x|y) := P[X=x|Y=y] = \frac{p_{X,Y}(x,y)}{p_Y(y)}$$

which is defined for all $x \in \S_X$ and $y \in \S_Y$. $\lhd$

The conditional expectation can then be defined in a natural way as the expectation over the conditional probability mass function.

Definition (Conditional Expectation): Let $X$ and $Y$ be discrete random variables where $X$ takes real values. The conditional expectation of $X$ given $Y = y$ is given by

$$\E[X|Y=y] = \sum_{x \in \S_X} x\, p_{X|Y}(x|y).$$

$\lhd$

Thinking of $\E[X|Y=y]$ as a function of $y$ leads to a fundamental characterization of the conditional expectation.

Theorem (Conditional Expectation and Least Squares): Let $X$ and $Y$ be discrete random variables where $X$ takes real values. Then the conditional expectation $h(y) = \E[X|Y=y]$ minimizes the least squares criterion

$$\min_{h} \E\left[(X - h(Y))^2\right]$$

where the minimum is over all real-valued functions of $y$.

Proof: Think of $h(y)$ as a vector $\mathbf{h} = (h_y)_{y \in \S_Y}$ indexed by $\S_Y$, where $h(y) = h_y \in \mathbb{R}$. Then

\begin{align*} \mathcal{L}(\mathbf{h}) &=\E\left[(X - h(Y))^2\right]\\ &= \sum_{x\in \S_X} \sum_{y \in \S_Y} (x - h_y)^2 p_{X,Y}(x,y). \end{align*}

Take partial derivatives with respect to $\mathbf{h}$:

\begin{align*} \frac{\partial}{\partial h_y} \mathcal{L} &= - 2 \sum_{x\in \S_X} (x - h_y) p_{X,Y}(x,y)\\ &= - 2 \sum_{x\in \S_X} x p_{X,Y}(x,y) + 2 h_y p_Y(y) \end{align*}

where we used that the marginal of $Y$ satisfies $p_Y(y) = \sum_{x \in \S_X} p_{X,Y}(x,y)$. The Hessian is the diagonal matrix with positive entries $2 p_Y(y)$, so it is positive definite and the problem is convex.

After rearranging, the stationary points satisfy

$$h_y = \sum_{x\in \S_X} x \frac{p_{X,Y}(x,y)}{p_Y(y)} = \sum_{x\in \S_X} x p_{X|Y}(x|y) = \E[X|Y=y]$$

as claimed. $\square$

### 1.2 Parametric families¶

Parametric families of probability distributions serve as basic building blocks for more complex models. By this, we mean a collection $\{p_\btheta:\btheta \in \Theta\}$, where $p_\btheta$ is a probability distribution over a set $\S_\btheta$ and $\btheta$ is a vector-valued parameter.

Example (Bernoulli): The random variable $X$ is Bernoulli with parameter $q \in [0,1]$, denoted $X \sim \mathrm{Ber}(q)$, if it takes values in $\S_X = \{0,1\}$ and $\P[X=1] = q$. Varying $p$ produces the family of Bernoulli distributions. $\lhd$

Here we focus on the exponential family, which includes many common distributions (including the Bernoulli distribution).

One particularly useful class of probability distributions in data science is the exponential family, which includes many well-known cases.

Definition (Exponential Family - Discrete Case): A parametric collection of probability distributions $\{p_\btheta:\btheta \in \Theta\}$ over a discrete space $\S$ is an exponential family if it can be written in the form

$$p_\btheta(\mathbf{x}) = \frac{1}{Z(\btheta)} h(\mathbf{x}) \exp\left(\btheta^T \bphi(\mathbf{x})\right)$$

where $\btheta \in \mathbb{R}^m$ are the canonical parameters, $\bphi : \S \to \mathbb{R}^m$ are the sufficient statistics and $Z(\btheta)$ is the partition function. It is often convenient to introduce the function $A(\btheta) = \log Z(\btheta)$ and re-write

$$p_\btheta(\mathbf{x}) = h(\mathbf{x}) \exp\left(\btheta^T \bphi(\mathbf{x}) - A(\btheta)\right).$$

$\lhd$

Example (Bernoulli, continued): For $x \in \{0,1\}$, the $\mathrm{Ber}(q)$ distribution for $0 < q < 1$ can be written as

\begin{align*} q^{x} (1-q)^{1-x} &= (1-q) \left(\frac{q}{1-q}\right)^x\\ &= (1-q) \exp\left[x \log \left(\frac{q}{1-q}\right)\right]\\ &= \frac{1}{Z(\theta)} h(x) \exp(\theta \,\phi(x)) \end{align*}

where we define $h(x) = 1$, $\phi(x) = x$, $\theta = \log \left(\frac{q}{1-q}\right)$ and, since $Z(\theta)$ serves as the normalization constant in $p_\theta$,

$$Z(\theta) = \sum_{x \in \{0,1\}} h(x) \exp(\theta \,\phi(x)) = 1 + e^\theta.$$

$\lhd$

Example (Categorical and Multinomial): A categorical variable $X$ takes $K \geq 2$ possible values. A standard choice is to use one-hot encoding $\S = \{\mathbf{e}_i : i=1,\ldots,K\}$ where $\mathbf{e}_i$ is the $i$-th canonical basis in $\mathbb{R}^K$. The distribution is specified by setting the probabilities $\bpi = (\pi_1,\ldots,\pi_K)$

$$\pi_i = \P[X = \mathbf{e}_i].$$

We denote this $X \sim \mathrm{Cat}(\bpi)$ and we assume $\pi_i > 0$ for all $i$.

To see that this is an exponential family, write the probability mass function at $\mathbf{x} = (x_1,\ldots,x_K)$ as

$$\prod_{i=1}^K \pi_i^{x_i} = \exp\left(\sum_{i=1}^K x_i \log \pi_i \right).$$

So we can take $h(\mathbf{x}) = 1$, $\btheta = (\log \pi_i)_{i=1}^K$, $\bphi(\mathbf{x}) = (x_i)_{i=1}^K$ and $Z(\btheta) = 1$.

The multinomial distribution arises as a sum of independent categorical variables. Let $n \geq 1$ be the number of trials (or samples) and let $Y_1,\ldots,Y_n$ be IID $\mathrm{Cat}(\bpi)$. Define $X = \sum_{i=1}^n Y_i$. The probability mass function of $X$ at

$$\mathbf{x} = (x_1,\ldots,x_K) \in \left\{ \mathbf{x} \in \{0,1,\ldots,n\}^K : \sum_{i=1}^K x_i = n \right\}=: \S$$

is

$$\frac{n!}{x_1!\cdots x_K!} \prod_{i=1}^K \pi_i^{x_i} = \frac{n!}{x_1!\cdots x_K!} \exp\left(\sum_{i=1}^K x_i \log \pi_i\right)$$

and we can take $h(\mathbf{x}) = \frac{n!}{x_1!\cdots x_K!}$, $\btheta = (\log \pi_i)_{i=1}^K$, $\bphi(\mathbf{x}) = (x_i)_{i=1}^K$ and $Z(\btheta) = 1$. This is an exponential family if we think of $n$ as fixed.

Example (Multivariate Gaussian): A multivariate Gaussian vector $\mathbf{X}$ on $\mathbb{R}^d$ with mean $\bmu \in \mathbb{R}^d$ and positive definite covariance matrix $\bSigma \in \mathbb{R}^{d \times d}$ has probability density function

$$f_{\bmu, \bSigma}(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} \,|\bSigma|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \bmu)^T \bSigma^{-1} (\mathbf{x} - \bmu)\right)$$

where $|A|$ is the determinant of $A$. We use the notation $\mathbf{X} \sim \mathcal{N}(\bmu, \bSigma)$. (Source)

Rewriting the density as

$$f_{\bmu, \bSigma}(\mathbf{x}) = \frac{e^{-(1/2) \bmu^T \bSigma^{-1} \bmu}}{(2\pi)^{d/2} \,|\bSigma|^{1/2}} \exp\left(- \mathbf{x}^T \bSigma^{-1}\bmu - \frac{1}{2} \mathrm{tr}\left(\mathbf{x} \mathbf{x}^T \bSigma^{-1}\right)\right)$$

where we used the symmetric of $\bSigma^{-1}$ in the first term of the exponential and trace identity of the previous exercise in the second term. The expression in parentheses is linear in the entries of $\mathbf{x}$ and $\mathbf{x} \mathbf{x}^T$, which can then be taken as sufficient statistics (formally, using vectorization)). This shows that the multivariate normal is an exponential family. $\lhd$

Example (Dirichlet): The Dirichlet distribution is a distribution over the $(K-1)$-simplex

$$\S = \Delta_{K} := \left\{ \mathbf{x} = (x_1, \ldots, x_K) : \mathbf{x} \geq \mathbf{0},\ \sum_{i=1}^K x_i = 1 \right\}.$$

Its parameters are $\balpha = (\alpha_1, \ldots, \alpha_K) \in \mathbb{R}$ and the density is

$$f_{\balpha}(\mathbf{x}) = \frac{1}{B(\balpha)} \prod_{i=1}^K x_i^{\alpha_i-1}, \quad \mathbf{x} \in \Delta_{K}$$

where the normalizing constant $B(\balpha)$ is the multivariate Beta function.

Rewriting the density as

$$\frac{1}{B(\balpha)} \prod_{i=1}^K x_i^{\alpha_i-1} = \frac{1}{B(\balpha)} \frac{1}{\prod_{i=1}^K x_i} \exp\left(\sum_{i=1}^K \alpha_i \log x_i\right)$$

shows that this is an exponential family with the canonical parameters $\balpha$ and sufficient statistics $(\log x_i)_{i=1}^K$. $\lhd$

In :
p = 0.1
d = Bernoulli(p)

Out:
Bernoulli{Float64}(p=0.1)
In :
N = 5 # number of samples
rand(d,N)

Out:
5-element Array{Bool,1}:
0
0
0
0
0
In :
p = [0.1, 0.2, 0.7]
n = 100
d = Multinomial(n,p)
rand(d, N)

Out:
3×5 Array{Int64,2}:
15   4  10  13   4
15  27  19  31  18
70  69  71  56  78
In :
mu = [0.1, -0.3]
sig = [2. 0.; 0. 3.]
d = MvNormal(mu, sig)
rand(d, N)

Out:
2×5 Array{Float64,2}:
3.78819  0.549552  -1.85312   0.672791  1.9125
2.16195  1.92556    2.63177  -1.92266   0.959448

When modeling data via a parametric family of distributions, the parameters must be determined from the data itself. In a typical setting, we assume that the data comprises $n$ independent samples $\mathbf{X}_1,\ldots,\mathbf{X}_n$ from a parametric family $p_\btheta$ with unknown $\btheta \in \Theta$. Many methods exist for estimating $\btheta$, depending on the context. Here we focus on perhaps the best-known approach, maximum likelihood estimation. It has many good theoretical properties which we will not describe here, as well as drawbacks.

The idea behind MLE is simple and intuitive: we choose the parameter that maximizes the probability of observing the data.

Definition (Maximum likelihood estimator): Assume that $\mathbf{X}_1,\ldots,\mathbf{X}_n$ are $n$ independent samples from a parametric family $p_{\btheta^*}$ with unknown $\btheta^* \in \Theta$. The maximum likelihood estimator of $\btheta$ is defined as

$$\hat\btheta_{\mathrm{MLE}} \in \arg\max\left\{ \prod_{i=1}^n p_{\btheta}(\mathbf{X}_i) \,:\, \btheta \in \Theta \right\}.$$

It is often advantageous to work with the minus log-likelihood

$$L_n(\btheta; \{\mathbf{X}_i\}_{i=1}^n) = - \sum_{i=1}^n \log p_{\btheta}(\mathbf{X}_i).$$

$\lhd$

For exponential families, maximum likelihood estimation takes a particularly natural form. We provide details in the discrete case.

Theorem (Maximum Likelihood Estimator for Exponential Families): Assume that $p_\btheta$ takes the exponential family form

$$p_\btheta(\mathbf{x}) = h(\mathbf{x}) \exp\left(\btheta^T \bphi(\mathbf{x}) - A(\btheta)\right)$$

and that the support $\S$ is finite. Let $\mathbf{X}_1,\ldots,\mathbf{X}_n$ be $n$ independent samples from a parametric family $p_{\btheta^*}$ with unknown $\btheta^* \in \Theta$. Then $L_n(\btheta; \{\mathbf{X}_i\}_{i=1}^n)$, as a function of $\btheta$, is convex and the maximum likelihood estimator of $\btheta$ - if it exists - solves the system of moment-matching equations

$$\E[\bphi(\mathbf{X})] = \frac{1}{n} \sum_{i=1}^n \bphi(\mathbf{X}_i),$$

where $\mathbf{X} \sim p_{\hat\btheta_{\mathrm{MLE}}}$.

We will need some auxiliary formulas and results.

Definition (Covariance matrix): Let $\mathbf{Z}$ be a random vector. The covariance matrix of $\mathbf{Z}$ is defined as

$$\mathrm{K}_{\mathbf{Z}, \mathbf{Z}} = \E[(\mathbf{Z} - \E[\mathbf{Z}])(\mathbf{Z} - \E[\mathbf{Z}])^T].$$

$\lhd$

Exercise: Show that

$$\mathrm{K}_{\mathbf{Z}, \mathbf{Z}} = \E[\mathbf{Z} \mathbf{Z}^T] - \E[\mathbf{Z}]\E[\mathbf{Z}]^T$$

$\lhd$

Lemma: Let $\mathbf{Z}$ be a random vector taking values in $\mathbb{R}^m$ whose components have finite variances. Then the covariance matrix of $\mathbf{Z}$, $\mathrm{K}_{\mathbf{Z}, \mathbf{Z}}$, is positive semidefinite.

Proof idea: We show that the quadratic form $\langle \mathbf{y}, \mathrm{K}_{\mathbf{Z}, \mathbf{Z}} \mathbf{y}\rangle$ can be written as a variance, and so is always non-negative.

Proof: For any $\mathbf{y} \in \mathbb{R}^m$, by the exercise above,

\begin{align*} \langle \mathbf{y}, \mathrm{K}_{\mathbf{Z}, \mathbf{Z}} \mathbf{y}\rangle &= \mathbf{y} ^T \left(\E[\mathbf{Z} \mathbf{Z}^T] - \E[\mathbf{Z}]\E[\mathbf{Z}]^T \right)\mathbf{y}\\ &= \E[\mathbf{y} ^T \mathbf{Z} \mathbf{Z}^T \mathbf{y}] - \E[\mathbf{y} ^T\mathbf{Z}]\E[\mathbf{Z}^T \mathbf{y}] \\ &= \E[(\mathbf{y} ^T \mathbf{Z})^2] - (\E[\mathbf{y} ^T \mathbf{Z}])^2\\ &= \var[\mathbf{y} ^T \mathbf{Z}]\\ &\geq 0. \end{align*}

Since this inequality holds for any $\mathbf{y}$, it follows by definition that $\mathrm{K}_{\mathbf{Z}, \mathbf{Z}} \succeq 0$. $\square$

The function $A$ has properties worth highlighting.

Lemma (Derivatives of $A$): Assume that $p_\btheta$ takes the exponential family form

$$p_\btheta(\mathbf{x}) = h(\mathbf{x}) \exp\left(\btheta^T \bphi(\mathbf{x}) - A(\btheta)\right)$$

and that the support $\S$ is finite. Then

$$\nabla A(\btheta) = \E[\phi(\mathbf{X})] \qquad \text{and} \qquad \mathbf{H}_A (\btheta) = \mathrm{K}_{\phi(\mathbf{X}), \phi(\mathbf{X})},$$

where $\mathbf{X} \sim p_\btheta$.

Proof idea: Follows from a direct calculation.

Proof: We observe first that

$$A(\btheta) = \log Z(\btheta) = \log\left(\sum_{\mathbf{x} \in \S} h(\mathbf{x}) \exp(\btheta^T \bphi(\mathbf{x}))\right),$$

where we used the fact that, by definition, $Z(\btheta)$ is the normalization constant of $p_\btheta$. In particular, as the logarithm of a finite, weighted sum of exponentials, the function $A(\btheta)$ is continuously differentiable. Hence so is $p_\theta(\mathbf{x})$ as a function of $\btheta$.

From the formula above and the basic rules of calculus,

\begin{align*} \frac{\partial}{\partial \theta_j} A(\btheta) &= \frac{\partial}{\partial \theta_j} \log\left(\sum_{\mathbf{x} \in \S} h(\mathbf{x}) \exp(\btheta^T \bphi(\mathbf{x}))\right)\\ &= \frac{\sum_{\mathbf{x} \in \S} h(\mathbf{x}) \,\phi_j(\mathbf{x}) \exp(\btheta^T \bphi(\mathbf{x}))}{\sum_{\mathbf{x} \in \S} h(\mathbf{x}) \exp(\btheta^T \bphi(\mathbf{x}))}\\ &= \sum_{\mathbf{x} \in \S} \phi_j(\mathbf{x}) \frac{1}{Z(\btheta)} h(\mathbf{x}) \exp(\btheta^T \bphi(\mathbf{x}))\\ &= \sum_{\mathbf{x} \in \S} \phi_j(\mathbf{x}) h(\mathbf{x}) \exp(\btheta^T \bphi(\mathbf{x}) - A(\btheta))\\ &= \E[\phi_j(\mathbf{X})], \end{align*}

where $\mathbf{X} \sim p_\btheta$.

Differentiating again, this time with respect to $\theta_i$, we obtain

\begin{align*} \frac{\partial^2}{\partial \theta_i \partial \theta_j} A(\btheta) &= \frac{\partial}{\partial \theta_i} \left\{\sum_{\mathbf{x} \in \S} \phi_j(\mathbf{x}) h(\mathbf{x}) \exp(\btheta^T \bphi(\mathbf{x}) - A(\btheta))\right\}\\ &= \sum_{\mathbf{x} \in \S} \phi_j(\mathbf{x}) h(\mathbf{x}) \exp(\btheta^T \bphi(\mathbf{x}) - A(\btheta)) \left\{\phi_i(\mathbf{x}) - \frac{\partial}{\partial \theta_i} A(\btheta) \right\}\\ &= \sum_{\mathbf{x} \in \S} \phi_i(\mathbf{x}) \phi_j(\mathbf{x}) h(\mathbf{x}) \exp(\btheta^T \bphi(\mathbf{x}) - A(\btheta))\\ & \qquad - \left(\sum_{\mathbf{x} \in \S} \phi_i(\mathbf{x}) h(\mathbf{x}) \exp(\btheta^T \bphi(\mathbf{x}) - A(\btheta)) \right)\\ & \qquad\qquad \times \left(\sum_{\mathbf{x} \in \S} \phi_j(\mathbf{x}) h(\mathbf{x}) \exp(\btheta^T \bphi(\mathbf{x}) - A(\btheta)) \right)\\ &= \E[\phi_i(\mathbf{X})\phi_j(\mathbf{X})] - \E[\phi_i(\mathbf{X})]\E[\phi_j(\mathbf{X})], \end{align*}

where again $\mathbf{X} \sim p_\btheta$. That concludes the proof. $\square$

Proof (Maximum Likelihood Estimator for Exponential Families): We begin by computing the stationary points of the log-likelihood, for which we need the gradient with respect to $\btheta \in \mathbb{R}^m$. We will also need the second-order derivatives to establish convexity. We have

\begin{align*} \frac{\partial}{\partial \theta_j} \{- \log p_\btheta(\mathbf{x})\} &= \frac{\partial}{\partial \theta_j} \left\{- \log h(\mathbf{x}) - \btheta^T \bphi(\mathbf{x}) + A(\btheta)\right\}\\ &= - \phi_j(\mathbf{x}) + \frac{\partial}{\partial \theta_j} A(\btheta). \end{align*}

and

\begin{align*} \frac{\partial^2}{\partial \theta_i \partial \theta_j} \{- \log p_\btheta(\mathbf{x})\} &= \frac{\partial}{\partial \theta_i} \left\{- \phi_j(\mathbf{x}) + \frac{\partial}{\partial \theta_j} A(\btheta)\right\}\\ &= \frac{\partial^2}{\partial \theta_i \partial \theta_j} A(\btheta). \end{align*}

We use the expressions for the derivatives of $A$ obtained above.

Plugging into the formula for the minus log-likelihood (as a function of $\btheta$), we get for the gradient with respect to $\btheta$

\begin{align*} \nabla L_n(\btheta; \{\mathbf{X}_i\}_{i=1}^n) &= \sum_{i=1}^n \{- \bphi(\mathbf{X}_i) + \nabla A(\btheta)\}\\ &= \sum_{i=1}^n \{- \bphi(\mathbf{X}_i) + \E[\bphi(\mathbf{X})]\}. \end{align*}

For the Hessian with respect to $\btheta$, we get

\begin{align*} \mathbf{H}_{L_n}(\btheta; \{\mathbf{X}_i\}_{i=1}^n) = \sum_{i=1}^n \mathbf{H}_A (\btheta) = n \,\mathrm{K}_{\bphi(\mathbf{X}), \bphi(\mathbf{X})}. \end{align*}

Since $\mathrm{K}_{\bphi(\mathbf{X}), \bphi(\mathbf{X})}$ is positive semidefinite, so is $\mathbf{H}_{L_n}(\btheta; \{\mathbf{X}_i\}_{i=1}^n)$.

Hence, a stationary point $\hat\btheta_{\mathrm{MLE}}$ must satisfy

$$\mathbf{0} = \nabla L_n(\btheta; \{\mathbf{X}_i\}_{i=1}^n) = \sum_{i=1}^n \{- \bphi(\mathbf{X}_i) + \E[\bphi(\mathbf{X})]\},$$

where $\mathbf{X} \sim p_{\hat\btheta_{\mathrm{MLE}}}$ or, after re-arranging,

$$\E[\bphi(\mathbf{X})] = \frac{1}{n} \sum_{i=1}^n \bphi(\mathbf{X}_i).$$

Because $L_n$ is convex, a stationary point - if it exists - is necessarily a global minimum by the First-Order Sufficient Condition for Convex Functions. $\square$

Example (Bernoulli, continued): For $x \in \{0,1\}$, recall that the $\mathrm{Ber}(q)$ distribution can be written as

\begin{align*} p_\theta(x) &= \frac{1}{Z(\theta)} h(x) \exp(\theta \,\phi(x)) \end{align*}

where we define $h(x) = 1$, $\phi(x) = x$, $\theta = \log \left(\frac{q}{1-q}\right)$ and $Z(\theta) = \sigma(\theta)$. Let $X_1,\ldots,X_n$ be independent samples from $p_{\theta^*}$.

For $X \sim p_{\hat\theta_{\mathrm{MLE}}}$, the moment-matching equations reduce to

$$\hat{q}_{\mathrm{MLE}} := \E[X] = \E[\phi(X)] = \frac{1}{n} \sum_{i=1}^n \phi(X_i) = \frac{1}{n} \sum_{i=1}^n X_i.$$

To compute the left-hand side in terms of $\hat\theta_{\mathrm{MLE}}$ we use the relationship $\theta = \log \left(\frac{q}{1-q}\right)$, that is,

$$\hat\theta_{\mathrm{MLE}} = \log \left(\frac{\frac{1}{n} \sum_{i=1}^n X_i}{1-\frac{1}{n} \sum_{i=1}^n X_i}\right).$$

Hence, $\hat\theta_{\mathrm{MLE}}$ is well-defined when $\frac{1}{n} \sum_{i=1}^n X_i \neq 0, 1$.

Define $q^*$ as the solution to

$$\theta^* = \log \left(\frac{q^*}{1-q^*}\right)$$

that is,

$$q^* = \frac{e^{\theta^*}}{1+e^{\theta^*}} = \frac{1}{1 + e^{-\theta*}} = \sigma(\theta^*),$$

where $\sigma$ is the sigmoid function.

By the law of large numbers, as $n \to +\infty$, we get the convergence

$$\frac{1}{n} \sum_{i=1}^n X_i \to q^*,$$

with probability one.

Because the function $\log \left(\frac{q}{1-q}\right)$ is continuous for $q \in (0,1)$, we have furthermore

$$\hat\theta_{\mathrm{MLE}} = \log \left(\frac{\frac{1}{n} \sum_{i=1}^n X_i}{1-\frac{1}{n} \sum_{i=1}^n X_i}\right) \to \log \left(\frac{q^*}{1-q^*}\right) = \theta^*.$$

In words, the maximum likelihood estimator $\hat\theta_{\mathrm{MLE}}$ is guaranteed to converge to the true parameter $\theta^*$ when the number of samples grows. This fundamental property is known as statistical consistency. $\lhd$