$\newcommand{\P}{\mathbb{P}}$ $\newcommand{\E}{\mathbb{E}}$ $\newcommand{\S}{\mathcal{S}}$ $\newcommand{\X}{\mathcal{X}}$ $\newcommand{\var}{\mathrm{Var}}$ $\newcommand{\btheta}{\boldsymbol{\theta}}$ $\newcommand{\bbeta}{\boldsymbol{\beta}}$ $\newcommand{\bphi}{\boldsymbol{\phi}}$ $\newcommand{\bpi}{\boldsymbol{\pi}}$ $\newcommand{\bmu}{\boldsymbol{\mu}}$ $\newcommand{\bSigma}{\boldsymbol{\Sigma}}$ $\newcommand{\balpha}{\boldsymbol{\alpha}}$ $\newcommand{\indep}{\perp\!\!\!\perp}$

*Course:* Math 535 - Mathematical Methods in Data Science (MMiDS)

*Author:* Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison

*Updated:* Nov 27, 2020

*Copyright:* © 2020 Sebastien Roch

We introduce here the two basic building blocks in constructing probabilistic models. We first review a fundamental probability concept: conditional probability. We then introduce a common family of distributions, the exponential family.

Throughout this topic, all formal proofs are done under the assumption of a discrete distribution with finite support to avoid unnecessary technicalities and focus rather on the concepts. But everything we discuss can be adapted to continuous or hybrid distributions.

We begin with the probabilistic notion of conditioning, which plays a key role in probabilistic modeling and reasoning.

We start with events. Thoughout, we work on a fixed probability space $(\Omega, \mathcal{F}, \P)$, which we assume is finite.

**Definition (Conditional Probability):** Let $A$ and $B$ be two events with $\mathbb{P}[B] > 0$. The conditional probability of $A$ given $B$ is defined as

$\lhd$

The intuitive interpretation goes something like this: knowing that event $B$ has occurred, the updated probability of observing $A$ is the probability of its restriction to $B$ properly normalized to reflect that outcomes outside $B$ have updated probability $0$. This is illustrated next.

(Source)

Conditional probabilities behave like unconditional probabilities. For instance:

*Exercise:* Let $A_1, \ldots, A_m$ be disjoint events and let $B$ be such that $\P[B] > 0$. Show that

$\lhd$

Independence can be characterized in terms of conditional probability. In words, $A$ and $B$ are independent if conditioning on one of them does not change the probability of observing the other.

**Lemma (Independence and Conditional Probability):** Let $A$ and $B$ be two events of positive probability. Then $A$ and $B$ are independent, which we will denote as $A \indep B$, if and only if $\P[A|B] = \P[A]$ and $\P[B|A] = \P[B]$.

*Proof:* If $A$ and $B$ are independent, then c which implies

In the other direction,

$$ \P[A] = \P[A|B] = \frac{\P[A \cap B]}{\P[B]} $$implies $\P[A|B] = \frac{\P[A \cap B]}{\P[B]}$ after rearranging. $\square$

The conditional probability is often used in three fundamental ways, which we recall next. Proofs can be found in most probability textbooks.

**Product (or chain) rule:**For any collection of events $A_1,\ldots,A_r$,

**Law of total probability:**For any event $B$ and any partition $A_1,\ldots,A_r$ of $\Omega$,

**Bayes' rule:**For any events $A$ and $B$ with positie probability,

Conditional probabilities extend naturally to random variables. If $X$ is a discrete random variable, we let $p_X$ be its probability mass function and $\S_X$ be its support, that is, the set of values where it has positive probability. Then we can for instance condition on the event $\{X = x\}$ for any $x \in \S_X$.

We define the conditional probability mass function.

**Definition (Conditional Probability Mass Function):** Let $X$ and $Y$ be discrete random variables with joint probability mass function $p_{X, Y}$ and marginals $p_X$ and $p_Y$. The conditional probability mass function of $X$ given $Y$ is defined as

which is defined for all $x \in \S_X$ and $y \in \S_Y$. $\lhd$

The conditional expectation can then be defined in a natural way as the expectation over the conditional probability mass function.

**Definition (Conditional Expectation):** Let $X$ and $Y$ be discrete random variables where $X$ takes real values. The conditional expectation of $X$ given $Y = y$ is given by

$\lhd$

Thinking of $\E[X|Y=y]$ as a function of $y$ leads to a fundamental characterization of the conditional expectation.

**Theorem (Conditional Expectation and Least Squares):** Let $X$ and $Y$ be discrete random variables where $X$ takes real values. Then the conditional expectation $h(y) = \E[X|Y=y]$ minimizes the least squares criterion

where the minimum is over all real-valued functions of $y$.

*Proof:* Think of $h(y)$ as a vector $\mathbf{h} = (h_y)_{y \in \S_Y}$ indexed by $\S_Y$, where $h(y) = h_y \in \mathbb{R}$. Then

Take partial derivatives with respect to $\mathbf{h}$:

$$ \begin{align*} \frac{\partial}{\partial h_y} \mathcal{L} &= - 2 \sum_{x\in \S_X} (x - h_y) p_{X,Y}(x,y)\\ &= - 2 \sum_{x\in \S_X} x p_{X,Y}(x,y) + 2 h_y p_Y(y) \end{align*} $$where we used that the marginal of $Y$ satisfies $p_Y(y) = \sum_{x \in \S_X} p_{X,Y}(x,y)$. The Hessian is the diagonal matrix with positive entries $2 p_Y(y)$, so it is positive definite and the problem is convex.

After rearranging, the stationary points satisfy

$$ h_y = \sum_{x\in \S_X} x \frac{p_{X,Y}(x,y)}{p_Y(y)} = \sum_{x\in \S_X} x p_{X|Y}(x|y) = \E[X|Y=y] $$as claimed. $\square$

Parametric families of probability distributions serve as basic building blocks for more complex models. By this, we mean a collection $\{p_\btheta:\btheta \in \Theta\}$, where $p_\btheta$ is a probability distribution over a set $\S_\btheta$ and $\btheta$ is a vector-valued parameter.

**Example (Bernoulli):** The random variable $X$ is Bernoulli with parameter $q \in [0,1]$, denoted $X \sim \mathrm{Ber}(q)$, if it takes values in $\S_X = \{0,1\}$ and $\P[X=1] = q$. Varying $q$ produces the family of Bernoulli distributions. $\lhd$

Here we focus on the exponential family, which includes many common distributions (including the Bernoulli distribution).

One particularly useful class of probability distributions in data science is the exponential family, which includes many well-known cases.

**Definition (Exponential Family - Discrete Case):** A parametric collection of probability distributions $\{p_\btheta:\btheta \in \Theta\}$ over a discrete space $\S$ is an exponential family if it can be written in the form

where $\btheta \in \mathbb{R}^m$ are the canonical parameters, $\bphi : \S \to \mathbb{R}^m$ are the sufficient statistics and $Z(\btheta)$ is the partition function. It is often convenient to introduce the function $A(\btheta) = \log Z(\btheta)$ and re-write

$$ p_\btheta(\mathbf{x}) = h(\mathbf{x}) \exp\left(\btheta^T \bphi(\mathbf{x}) - A(\btheta)\right). $$$\lhd$

**Example (Bernoulli, continued):** For $x \in \{0,1\}$, the $\mathrm{Ber}(q)$ distribution for $0 < q < 1$ can be written as

where we define $h(x) = 1$, $\phi(x) = x$, $\theta = \log \left(\frac{q}{1-q}\right)$ and, since $Z(\theta)$ serves as the normalization constant in $p_\theta$,

$$ Z(\theta) = \sum_{x \in \{0,1\}} h(x) \exp(\theta \,\phi(x)) = 1 + e^\theta. $$$\lhd$

The following is an important generalization.

**Example (Categorical and Multinomial):** A categorical variable $X$ takes $K \geq 2$ possible values. A standard choice is to use one-hot encoding $\S = \{\mathbf{e}_i : i=1,\ldots,K\}$ where $\mathbf{e}_i$ is the $i$-th canonical basis in $\mathbb{R}^K$. The distribution is specified by setting the probabilities $\bpi = (\pi_1,\ldots,\pi_K)$

We denote this $X \sim \mathrm{Cat}(\bpi)$ and we assume $\pi_i > 0$ for all $i$.

To see that this is an exponential family, write the probability mass function at $\mathbf{x} = (x_1,\ldots,x_K)$ as

$$ \prod_{i=1}^K \pi_i^{x_i} = \exp\left(\sum_{i=1}^K x_i \log \pi_i \right). $$So we can take $h(\mathbf{x}) = 1$, $\btheta = (\log \pi_i)_{i=1}^K$, $\bphi(\mathbf{x}) = (x_i)_{i=1}^K$ and $Z(\btheta) = 1$.

The multinomial distribution arises as a sum of independent categorical variables. Let $n \geq 1$ be the number of trials (or samples) and let $Y_1,\ldots,Y_n$ be IID $\mathrm{Cat}(\bpi)$. Define $X = \sum_{i=1}^n Y_i$. The probability mass function of $X$ at

$$ \mathbf{x} = (x_1,\ldots,x_K) \in \left\{ \mathbf{x} \in \{0,1,\ldots,n\}^K : \sum_{i=1}^K x_i = n \right\}=: \S $$is

$$ \frac{n!}{x_1!\cdots x_K!} \prod_{i=1}^K \pi_i^{x_i} = \frac{n!}{x_1!\cdots x_K!} \exp\left(\sum_{i=1}^K x_i \log \pi_i\right) $$and we can take $h(\mathbf{x}) = \frac{n!}{x_1!\cdots x_K!}$, $\btheta = (\log \pi_i)_{i=1}^K$, $\bphi(\mathbf{x}) = (x_i)_{i=1}^K$ and $Z(\btheta) = 1$. This is an exponential family if we think of $n$ as fixed.

We use the notation $X \sim \mathrm{Mult}(n, \bpi)$. $\lhd$

*Exercise (Moments of Multinomial):* Let $X \sim \mathrm{Mult}(n, \bpi)$ with $n \geq 1$ and $\bpi \in \Delta_K$. Establish the following formulas:

(a) $\E[X] = n \bpi$

(b) $\var[X] = n[\mathrm{Diag}(\bpi) - \bpi \bpi^T]$

$\lhd$

While we have focused so far on discrete distributions, one can adapt the definitions above by replacing mass functions with density functions. We give two important examples.

*Exercise:* Recall that the trace) of a square matrix $A$, denoted $\mathrm{tr}(A)$, is the sum of its diagonal entries.

(a) Show that, for any $A \in \mathbb{R}^{n \times m}$ and $B \in \mathbb{R}^{m \times n}$, it holds that $\mathrm{tr}(AB) = \mathrm{tr}(BA)$.

(b) Use (a) to show that more generally $\mathrm{tr}(ABC) = \mathrm{tr}(CAB) = \mathrm{tr}(BCA)$.

$\lhd$

**Example (Multivariate Gaussian):** A multivariate Gaussian vector $\mathbf{X}$ on $\mathbb{R}^d$ with mean $\bmu \in \mathbb{R}^d$ and positive definite covariance matrix $\bSigma \in \mathbb{R}^{d \times d}$ has probability density function

where $|A|$ is the determinant of $A$. We use the notation $\mathbf{X} \sim \mathcal{N}(\bmu, \bSigma)$.

(Source)

Rewriting the density as

$$ f_{\bmu, \bSigma}(\mathbf{x}) = \frac{e^{-(1/2) \bmu^T \bSigma^{-1} \bmu}}{(2\pi)^{d/2} \,|\bSigma|^{1/2}} \exp\left(- \mathbf{x}^T \bSigma^{-1}\bmu - \frac{1}{2} \mathrm{tr}\left(\mathbf{x} \mathbf{x}^T \bSigma^{-1}\right)\right) $$where we used the symmetric of $\bSigma^{-1}$ in the first term of the exponential and trace identity of the previous exercise in the second term. The expression in parentheses is linear in the entries of $\mathbf{x}$ and $\mathbf{x} \mathbf{x}^T$, which can then be taken as sufficient statistics (formally, using vectorization)). This shows that the multivariate normal is an exponential family. $\lhd$

The Dirichlet distribution, which we describe next, is a natural probability distribution over probability distributions. In particular, it is used in Bayesian data analysis as a prior on the parameters of categorical and multinomial distribution, largely because of a property known as conjuguacy.

**Example (Dirichlet):** The Dirichlet distribution is a distribution over the $(K-1)$-simplex

Its parameters are $\balpha = (\alpha_1, \ldots, \alpha_K) \in \mathbb{R}$ and the density is

$$ f_{\balpha}(\mathbf{x}) = \frac{1}{B(\balpha)} \prod_{i=1}^K x_i^{\alpha_i-1}, \quad \mathbf{x} \in \Delta_{K} $$where the normalizing constant $B(\balpha)$ is the multivariate Beta function.

Rewriting the density as

$$ \frac{1}{B(\balpha)} \prod_{i=1}^K x_i^{\alpha_i-1} = \frac{1}{B(\balpha)} \frac{1}{\prod_{i=1}^K x_i} \exp\left(\sum_{i=1}^K \alpha_i \log x_i\right) $$shows that this is an exponential family with the canonical parameters $\balpha$ and sufficient statistics $(\log x_i)_{i=1}^K$. $\lhd$

See here for many more examples. Observe, in particular, that the same distribution can have several representations as an exponential family.

**NUMERICAL CORNER** In Julia, the package Distributions.jl provides a way to sample from a variety of standard distributions. We first initialize the pseudorandom number generator with a random seed. In particular it allows the results to be reproducible: using the same seed produces the same results again.

In [1]:

```
# Julia version: 1.5.1
using Random, Distributions
```

In [2]:

```
Random.seed!(123); # Setting the seed
```

We then set the distribution and its parameters. Here's are lists of available continuous distributions, discrete distributions, and multivariate distributions.

In [3]:

```
p = 0.1
d = Bernoulli(p)
```

Out[3]:

We then use `rand`

to generate samples.

In [4]:

```
N = 5 # number of samples
rand(d,N)
```

Out[4]:

Here are a few other examples.

In [5]:

```
p = [0.1, 0.2, 0.7]
n = 100
d = Multinomial(n,p)
rand(d, N)
```

Out[5]:

In [6]:

```
mu = [0.1, -0.3]
sig = [2. 0.; 0. 3.]
d = MvNormal(mu, sig)
rand(d, N)
```

Out[6]:

When modeling data via a parametric family of distributions, the parameters must be determined from the data itself. In a typical setting, we assume that the data comprises $n$ independent samples $\mathbf{X}_1,\ldots,\mathbf{X}_n$ from a parametric family $p_\btheta$ with unknown $\btheta \in \Theta$. Many methods exist for estimating $\btheta$, depending on the context. Here we focus on perhaps the best-known approach, maximum likelihood estimation. It has many good theoretical properties which we will not describe here, as well as drawbacks.

The idea behind maximum likelihood estimation is simple and intuitive: we choose the parameter that maximizes the probability of observing the data.

**Definition (Maximum likelihood estimator):** Assume that $\mathbf{X}_1,\ldots,\mathbf{X}_n$ are $n$ independent samples from a parametric family $p_{\btheta^*}$ with unknown $\btheta^* \in \Theta$. The maximum likelihood estimator of $\btheta$ is defined as

It is often advantageous to work with the negative log-likelihood

$$ L_n(\btheta; \{\mathbf{X}_i\}_{i=1}^n) = - \sum_{i=1}^n \log p_{\btheta}(\mathbf{X}_i). $$$\lhd$

We can give an alternative perspective on the maximum likelihood estimator. Assume that $p_\btheta$ is supported on a fixed finite set $\X$ for all $\btheta \in \Theta$. Given samples $\mathbf{X}_1,\ldots,\mathbf{X}_n$, for each $\mathbf{x} \in \X$, let

$$ N_\mathbf{x} = \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{\{\mathbf{X}_i = \mathbf{x}\}} $$count the number of times $\mathbf{x}$ is observed in the data and let

$$ \hat\mu_n(\mathbf{x}) = \frac{N_\mathbf{x}}{n} $$be the empirical frequency of $\mathbf{x}$ in the sample. Observe that $\hat\mu_n$ is a probability distribution over $\X$.

The following theorem characterizes the maximum likelihood estimator in terms of the Kullback-Liebler divergence, which was introduced in a previous section.

**Theorem (MLE via KL):** Assume that, for all $\btheta \in \Theta$, $p_\btheta$ is supported on a fixed finite set $\X$ and that $p_\btheta(\mathbf{x}) > 0$ for all $\mathbf{x} \in \X$. Given samples $\mathbf{X}_1,\ldots,\mathbf{X}_n$ from $p_{\btheta^*}$, let $\{\hat\mu_n(\mathbf{x})\}_{\mathbf{x} \in \X}$ be the corresponding empirical frequencies. Then the maximum likelihood estimator $\hat\btheta_{\mathrm{MLE}}$ of $\btheta$ is also a solution to

*Proof idea:* Massage the negative log-likelihood to bring out its relationship to the Kullback-Liebler divergence.

*Proof:* We can re-write the negative log-likelihood as

To bring out the Kullback-Liebler divergence, we further transform the previous equation into

$$ \begin{align*} \frac{1}{n} L_n(\btheta; \{\mathbf{X}_i\}_{i=1}^n) &= - \frac{1}{n} \sum_{\mathbf{x} \in \X} N_{\mathbf{x}} \log p_\btheta(\mathbf{x})\\ &= \sum_{\mathbf{x} \in \X} (N_{\mathbf{x}}/n) \log \frac{N_{\mathbf{x}}/n}{p_\btheta(\mathbf{x})} - \sum_{\mathbf{x} \in \X} (N_{\mathbf{x}}/n) \log (N_{\mathbf{x}}/n)\\ &= \sum_{\mathbf{x} \in \X} \hat\mu_n(\mathbf{x}) \log \frac{\hat\mu_n(\mathbf{x})}{p_\btheta(\mathbf{x})} - \sum_{\mathbf{x} \in \X} \hat\mu_n(\mathbf{x}) \log \hat\mu_n(\mathbf{x})\\ &= \mathrm{KL}(\hat{\mu}_n \| p_{\btheta}) + \mathrm{H}(\hat\mu_n), \end{align*} $$where the second term is referred to as the entropy) of $\hat\mu_n$.

Because $\mathrm{H}(\hat\mu_n)$ does not depend on $\btheta$, minimizing $L_n(\btheta; \{\mathbf{X}_i\}_{i=1}^n)$ is equivalent to minimizing $\mathrm{KL}(\hat{\mu}_n \| p_{\btheta})$ as claimed. $\square$

In words, the maximum likelihood estimator chooses the parametric distribution that is closest to $\hat\mu_n$ in Kullback-Liebler divergence. One can think of this as projecting $\hat\mu_n$ onto the space $\{p_\btheta : \btheta \in \Theta\}$ under the Kullback-Liebler notion of distance.

**Example (Special case):** One special case is where $\X$ is finite, $\btheta = (\theta_\mathbf{x})_{\mathbf{x} \in \X}$ is a probability distribution over $\X$ and $p_\btheta = \btheta$. That is, we consider the class of all probability distributions over $\X$. Given samples $\mathbf{X}_1,\ldots,\mathbf{X}_n$ from $p_{\btheta^*}$, in this case

where recall that, by convention, if $\hat\mu_n(\mathbf{x}) = 0$ then $\hat\mu_n(\mathbf{x}) \log \frac{\hat\mu_n(\mathbf{x})}{\theta_\mathbf{x}} = 0$ for any $\theta_\mathbf{x}$. So, letting $\mathbb{X}_n = \{\mathbf{X}_1,\ldots,\mathbf{X}_n\}$ be the set of distinct values encountered in the sample (ignoring repetitions), we have

$$ \mathrm{KL}(\hat{\mu}_n \| p_{\btheta}) = \sum_{\mathbf{x} \in \mathbb{X}_n} \hat\mu_n(\mathbf{x}) \log \frac{\hat\mu_n(\mathbf{x})}{\theta_\mathbf{x}}. $$Note that $\sum_{\mathbf{x} \in \mathbb{X}_n} \hat\mu_n(\mathbf{x}) = 1$.

We have previously established *Gibbs' inequality* which says that: for any $\mathbf{p}, \mathbf{q} \in \Delta_K$ with $\mathbf{q} > \mathbf{0}$, it holds that $\mathrm{KL}(\mathbf{p} \| \mathbf{q}) \geq 0$. A quick look at the proof shows that the result holds as long as $\sum_{i=1}^K q_i \leq 1$ instead of $\sum_{i=1}^K q_i = 1$.

The minimum $\mathrm{KL}(\hat{\mu}_n \| p_{\btheta}) = 0$ can be achieved by setting $\btheta_\mathbf{x} = \hat\mu_n(\mathbf{x})$ for all $\mathbf{x} \in \mathbb{X}_n$ and $\btheta_\mathbf{x} = 0$ for all $\mathbf{x} \notin \mathbb{X}_n$. The condition

$$ \sum_{\mathbf{x} \in \X} \btheta_\mathbf{x} = \sum_{\mathbf{x} \in \mathbb{X}_n} \btheta_\mathbf{x} + \sum_{\mathbf{x} \notin \mathbb{X}_n} \btheta_\mathbf{x} = \sum_{\mathbf{x} \in \mathbb{X}_n} \hat\mu_n(\mathbf{x}) = 1, $$is then satisfied.

So in this case $\hat\mu_n$ is a maximum likelihood estimator. $\lhd$

For exponential families, maximum likelihood estimation takes a particularly natural form. We provide details in the discrete case.

**Theorem (Maximum Likelihood Estimator for Exponential Families):** Assume that $p_\btheta$ takes the exponential family form

and that the support $\S$ is finite. Let $\mathbf{X}_1,\ldots,\mathbf{X}_n$ be $n$ independent samples from a parametric family $p_{\btheta^*}$ with unknown $\btheta^* \in \Theta$. Then $L_n(\btheta; \{\mathbf{X}_i\}_{i=1}^n)$, as a function of $\btheta$, is convex and the maximum likelihood estimator of $\btheta$ - if it exists - solves the system of moment-matching equations

$$ \E[\bphi(\mathbf{X})] = \frac{1}{n} \sum_{i=1}^n \bphi(\mathbf{X}_i), $$where $\mathbf{X} \sim p_{\hat\btheta_{\mathrm{MLE}}}$.

We will need some auxiliary formulas and results.

**Definition (Covariance matrix):** Let $\mathbf{Z}$ be a random vector. The covariance matrix of $\mathbf{Z}$ is defined as

$\lhd$

*Exercise:* Show that

$\lhd$

**Lemma:** Let $\mathbf{Z}$ be a random vector taking values in $\mathbb{R}^m$ whose components have finite variances. Then the covariance matrix of $\mathbf{Z}$, $\mathrm{K}_{\mathbf{Z}, \mathbf{Z}}$, is positive semidefinite.

*Proof idea:* We show that the quadratic form $\langle \mathbf{y}, \mathrm{K}_{\mathbf{Z}, \mathbf{Z}} \mathbf{y}\rangle$ can be written as a variance, and so is always non-negative.

*Proof:* For any $\mathbf{y} \in \mathbb{R}^m$, by the exercise above,

Since this inequality holds for any $\mathbf{y}$, it follows by definition that $\mathrm{K}_{\mathbf{Z}, \mathbf{Z}} \succeq 0$. $\square$

The function $A$ has properties worth highlighting.

**Lemma (Derivatives of $A$):** Assume that $p_\btheta$ takes the exponential family form

and that the support $\S$ is finite. Then

$$ \nabla A(\btheta) = \E[\bphi(\mathbf{X})] \qquad \text{and} \qquad \mathbf{H}_A (\btheta) = \mathrm{K}_{\bphi(\mathbf{X}), \bphi(\mathbf{X})}, $$where $\mathbf{X} \sim p_\btheta$.

*Proof idea:* Follows from a direct calculation.

*Proof:* We observe first that

where we used the fact that, by definition, $Z(\btheta)$ is the normalization constant of $p_\btheta$. In particular, as the logarithm of a finite, weighted sum of exponentials, the function $A(\btheta)$ is continuously differentiable. Hence so is $p_\theta(\mathbf{x})$ as a function of $\btheta$.

From the formula above and the basic rules of calculus,

$$ \begin{align*} \frac{\partial}{\partial \theta_j} A(\btheta) &= \frac{\partial}{\partial \theta_j} \log\left(\sum_{\mathbf{x} \in \S} h(\mathbf{x}) \exp(\btheta^T \bphi(\mathbf{x}))\right)\\ &= \frac{\sum_{\mathbf{x} \in \S} h(\mathbf{x}) \,\phi_j(\mathbf{x}) \exp(\btheta^T \bphi(\mathbf{x}))}{\sum_{\mathbf{x} \in \S} h(\mathbf{x}) \exp(\btheta^T \bphi(\mathbf{x}))}\\ &= \sum_{\mathbf{x} \in \S} \phi_j(\mathbf{x}) \frac{1}{Z(\btheta)} h(\mathbf{x}) \exp(\btheta^T \bphi(\mathbf{x}))\\ &= \sum_{\mathbf{x} \in \S} \phi_j(\mathbf{x}) h(\mathbf{x}) \exp(\btheta^T \bphi(\mathbf{x}) - A(\btheta))\\ &= \E[\phi_j(\mathbf{X})], \end{align*} $$where $\mathbf{X} \sim p_\btheta$.

Differentiating again, this time with respect to $\theta_i$, we obtain

$$ \begin{align*} \frac{\partial^2}{\partial \theta_i \partial \theta_j} A(\btheta) &= \frac{\partial}{\partial \theta_i} \left\{\sum_{\mathbf{x} \in \S} \phi_j(\mathbf{x}) h(\mathbf{x}) \exp(\btheta^T \bphi(\mathbf{x}) - A(\btheta))\right\}\\ &= \sum_{\mathbf{x} \in \S} \phi_j(\mathbf{x}) h(\mathbf{x}) \exp(\btheta^T \bphi(\mathbf{x}) - A(\btheta)) \left\{\phi_i(\mathbf{x}) - \frac{\partial}{\partial \theta_i} A(\btheta) \right\}\\ &= \sum_{\mathbf{x} \in \S} \phi_i(\mathbf{x}) \phi_j(\mathbf{x}) h(\mathbf{x}) \exp(\btheta^T \bphi(\mathbf{x}) - A(\btheta))\\ & \qquad - \left(\sum_{\mathbf{x} \in \S} \phi_i(\mathbf{x}) h(\mathbf{x}) \exp(\btheta^T \bphi(\mathbf{x}) - A(\btheta)) \right)\\ & \qquad\qquad \times \left(\sum_{\mathbf{x} \in \S} \phi_j(\mathbf{x}) h(\mathbf{x}) \exp(\btheta^T \bphi(\mathbf{x}) - A(\btheta)) \right)\\ &= \E[\phi_i(\mathbf{X})\phi_j(\mathbf{X})] - \E[\phi_i(\mathbf{X})]\E[\phi_j(\mathbf{X})], \end{align*} $$where again $\mathbf{X} \sim p_\btheta$. That concludes the proof. $\square$

We are now ready to the prove the main theorem.

*Proof (Maximum Likelihood Estimator for Exponential Families):* We begin by computing the stationary points of the log-likelihood, for which we need the gradient with respect to $\btheta \in \mathbb{R}^m$. We will also need the second-order derivatives to establish convexity. We have

and

$$ \begin{align*} \frac{\partial^2}{\partial \theta_i \partial \theta_j} \{- \log p_\btheta(\mathbf{x})\} &= \frac{\partial}{\partial \theta_i} \left\{- \phi_j(\mathbf{x}) + \frac{\partial}{\partial \theta_j} A(\btheta)\right\}\\ &= \frac{\partial^2}{\partial \theta_i \partial \theta_j} A(\btheta). \end{align*} $$We use the expressions for the derivatives of $A$ obtained above.

Plugging into the formula for the minus log-likelihood (as a function of $\btheta$), we get for the gradient with respect to $\btheta$

$$ \begin{align*} \nabla L_n(\btheta; \{\mathbf{X}_i\}_{i=1}^n) &= \sum_{i=1}^n \{- \bphi(\mathbf{X}_i) + \nabla A(\btheta)\}\\ &= \sum_{i=1}^n \{- \bphi(\mathbf{X}_i) + \E[\bphi(\mathbf{X})]\}. \end{align*} $$This is also known in statistics as the score).

For the Hessian with respect to $\btheta$, we get

$$ \begin{align*} \mathbf{H}_{L_n}(\btheta; \{\mathbf{X}_i\}_{i=1}^n) = \sum_{i=1}^n \mathbf{H}_A (\btheta) = n \,\mathrm{K}_{\bphi(\mathbf{X}), \bphi(\mathbf{X})}. \end{align*} $$This is also known in statistics as the observed information. (In fact, in this case, it reduces to the Fisher information.) Since $\mathrm{K}_{\bphi(\mathbf{X}), \bphi(\mathbf{X})}$ is positive semidefinite, so is $\mathbf{H}_{L_n}(\btheta; \{\mathbf{X}_i\}_{i=1}^n)$.

Hence, a stationary point $\hat\btheta_{\mathrm{MLE}}$ must satisfy

$$ \mathbf{0} = \nabla L_n(\btheta; \{\mathbf{X}_i\}_{i=1}^n) = \sum_{i=1}^n \{- \bphi(\mathbf{X}_i) + \E[\bphi(\mathbf{X})]\}, $$where $\mathbf{X} \sim p_{\hat\btheta_{\mathrm{MLE}}}$ or, after re-arranging,

$$ \E[\bphi(\mathbf{X})] = \frac{1}{n} \sum_{i=1}^n \bphi(\mathbf{X}_i). $$Because $L_n$ is convex, a stationary point - if it exists - is necessarily a global minimum by the *First-Order Sufficient Condition for Convex Functions*. $\square$

**Example (Bernoulli, continued):** For $x \in \{0,1\}$, recall that the $\mathrm{Ber}(q)$ distribution can be written as

where we define $h(x) = 1$, $\phi(x) = x$, $\theta = \log \left(\frac{q}{1-q}\right)$ and $Z(\theta) = \sigma(\theta)$. Let $X_1,\ldots,X_n$ be independent samples from $p_{\theta^*}$.

For $X \sim p_{\hat\theta_{\mathrm{MLE}}}$, the moment-matching equations reduce to

$$ \hat{q}_{\mathrm{MLE}} := \E[X] = \E[\phi(X)] = \frac{1}{n} \sum_{i=1}^n \phi(X_i) = \frac{1}{n} \sum_{i=1}^n X_i. $$To compute the left-hand side in terms of $\hat\theta_{\mathrm{MLE}}$ we use the relationship $\theta = \log \left(\frac{q}{1-q}\right)$, that is,

$$ \hat\theta_{\mathrm{MLE}} = \log \left(\frac{\frac{1}{n} \sum_{i=1}^n X_i}{1-\frac{1}{n} \sum_{i=1}^n X_i}\right). $$Hence, $\hat\theta_{\mathrm{MLE}}$ is well-defined when $\frac{1}{n} \sum_{i=1}^n X_i \neq 0, 1$.

Define $q^*$ as the solution to

$$ \theta^* = \log \left(\frac{q^*}{1-q^*}\right) $$that is,

$$ q^* = \frac{e^{\theta^*}}{1+e^{\theta^*}} = \frac{1}{1 + e^{-\theta*}} = \sigma(\theta^*), $$where $\sigma$ is the sigmoid function.

By the law of large numbers, as $n \to +\infty$, we get the convergence

$$ \frac{1}{n} \sum_{i=1}^n X_i \to q^*, $$with probability one.

Because the function $\log \left(\frac{q}{1-q}\right)$ is continuous for $q \in (0,1)$, we have furthermore

$$ \hat\theta_{\mathrm{MLE}} = \log \left(\frac{\frac{1}{n} \sum_{i=1}^n X_i}{1-\frac{1}{n} \sum_{i=1}^n X_i}\right) \to \log \left(\frac{q^*}{1-q^*}\right) = \theta^*. $$In words, the maximum likelihood estimator $\hat\theta_{\mathrm{MLE}}$ is guaranteed to converge to the true parameter $\theta^*$ when the number of samples grows. This fundamental property is known as statistical consistency. $\lhd$

Statistical consistency holds more generally for the maximum likelihood estimator under exponential families (and parametric families more broadly), provided certain technical conditions hold. We will not provide further details here.

Unlike the previous example, one does not always have an explicit formula for the maximum likelihood estimator under exponential families. Instead, optimization methods, for instance Newton's method described in a previous assignment, are used in such cases.