$\newcommand{\P}{\mathbb{P}}$ $\newcommand{\E}{\mathbb{E}}$ $\newcommand{\S}{\mathcal{S}}$ $\newcommand{\var}{\mathrm{Var}}$ $\newcommand{\btheta}{\boldsymbol{\theta}}$ $\newcommand{\bpi}{\boldsymbol{\pi}}$ $\newcommand{\bmu}{\boldsymbol{\mu}}$ $\newcommand{\bSigma}{\boldsymbol{\Sigma}}$ $\newcommand{\indep}{\perp\!\!\!\perp}$ $\newcommand{\bx}{\mathbf{x}}$ $\newcommand{\bX}{\mathbf{X}}$ $\newcommand{\by}{\mathbf{y}}$ $\newcommand{\bY}{\mathbf{Y}}$ $\newcommand{\bz}{\mathbf{z}}$ $\newcommand{\bZ}{\mathbf{Z}}$ $\newcommand{\bw}{\mathbf{w}}$ $\newcommand{\bW}{\mathbf{W}}$ $\newcommand{\pa}{\mathrm{pa}}$

*Course:* Math 535 - Mathematical Methods in Data Science (MMiDS)

*Author:* Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison

*Updated:* Nov 30, 2020

*Copyright:* © 2020 Sebastien Roch

In this notebook, we discuss two standard techniques for constructing joint distributions from simpler building blocks: (1) marginalizing out an unobserved random variable and (2) imposing conditional independence relations. Combining them produces a large class of models known as probabilistic graphical models. As before, we make our derivations in the finite support case, but these can be adapted to continuous cases.

Our starting point for both techniques will be the product rule: if $\bX = (X_1,\ldots,X_r)$ is a discrete random vector, then for any $\bx = (x_1,\ldots,x_r) \in \S_{\mathbf{X}}$

$$ \P[\bX = \bx] = \P\left[\cap_{i=1}^r \{X_i = x_i\}\right] = \prod_{i=1}^r \P\left[X_i = x_i \,\middle|\, \cap_{j=1}^{i-1} \{X_j = x_j\} \right]. $$Hence, we can specify a joint distribution through a chain marginal and conditional distributions. Note that we can re-order the variables $X_1, \ldots, X_r$ arbitrarily, giving rise to different representations.

Mixtures are a natural way to define probability distributions. The basic idea is to consider a pair of random vectors $(\bX,\bY)$ and assume that $\bY$ is unobserved. The effect on the observed vector $\bX$ is that $\bY$ is marginalized out. Indeed, by the law of total probability, for any $\bx \in \S_\bX$

$$ \begin{align*} p_\bX(\bx) &= \P[\bX = \bx]\\ &= \sum_{\by \in \S_\bY} \P[\bX=\bx|\bY=\by] \,\P[\bY=\by]\\ &= \sum_{\by \in \S_\bY} p_{\bX|\bY}(\bx|\by) \,p_\bY(\by) \end{align*} $$where we used that the events $\{\bY=\by\}$, $\by \in \S_\bY$, form a partition of the probability space. We interpret this equation as defining $p_\bX(\bx)$ as a convex combination - a mixture - of the distributions $p_{\bX|\bY}(\bx|\by)$, $\by \in \S_\bY$, with mixing weights $p_\bY(\by)$. In general, we need to specify the full conditional probability distribution (CPD): $p_{\bX|\bY}(\bx|\by), \forall \bx \in \S_{\bX}, \by \in \S_\bY$. But assuming that the mixing weights and/or CPD come from parametric families can help reduce the complexity of the model.

In particular, in the parametric context, this gives rise to a powerful approach to expanding distribution families. Suppose $\{p_\btheta:\btheta \in \Theta\}$ is a parametric family of distributions. Let $K \geq 2$, $\btheta_1, \ldots, \btheta_K \in \Theta$ and $\bpi = (\pi_1,\ldots,\pi_K) \in \Delta_K$. Suppose $Y \sim \mathrm{Cat}(\bpi)$ and that the conditional distributions satisfy

$$ p_{\bX|Y}(\bx|i) = p_{\btheta_i}(\bx). $$We write this as $\bX|\{Y=i\} \sim p_{\btheta_i}$. Then we obtain the mixture model

$$ p_{\bX}(\bx) = \sum_{i=1}^K p_{\bX|Y}(\bx|i) \,p_Y(i) = \sum_{i=1}^K \pi_i p_{\btheta_i}(\bx). $$**Example (Mixture of Multinomials):** Let $n, m , K \geq 1$, $\bpi \in \Delta_K$ and, for $i=1,\ldots,K$, $\mathbf{p}_i = (p_{i1},\ldots,p_{im}) \in \Delta_m$.
Suppose that $Y \sim \mathrm{Cat}(\bpi)$ and that the conditional distributions are

Then $\bX$ is a mixture of multinomials. Its distribution is then

$$ p_\bX(\bx) = \sum_{i=1}^K \pi_i \frac{n!}{x_1!\cdots x_m!} \prod_{j=1}^m p_{ij}^{x_j}. $$$\lhd$

*Exercise (Moments of a Mixture):* For $i=1, \ldots, K$, let $p_i$ be a probability mass function over the set $\S_i \subseteq \mathbb{R}$ with mean $\mu_i$ and variance $\sigma_i^2$. Let $\bpi = (\pi_1,\ldots,\pi_K) \in \Delta_K$. Suppose $X$ is drawn from the mixture distribution

Establish the following formulas:

(a) $\E[X] = \sum_{i=1}^K \pi_i \mu_i$

(b) $\var[X] = \sum_{i=1}^K \pi_i (\sigma_i^2 + \mu_i^2 - \mu^2)$.

$\lhd$

Next is an important continuous example.

**Example (Gaussian mixture model):** For $i=1,\ldots,K$, let $\bmu_i$ and $\bSigma_i$ be the mean and covariance matrix of a multivariate normal distribution. Let $\bpi \in \Delta_K$. A Gaussian mixture model is obtained as follows: take $Y \sim \mathrm{Cat}(\bpi)$ and

It has a probability density function of the form

$$ f_\bX(\bx) = \sum_{i=1}^K \pi_i \frac{1}{(2\pi)^{d/2} \,|\bSigma_i|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \bmu_i)^T \bSigma_i^{-1} (\bx - \bmu_i)\right). $$This is illustrated below.

(Source)

We could also consider continuous mixing distributions. Several standard approaches, e.g. probabilistic PCA, factor analysis and independent components analysis, use such mixtures. See [Bis, Chapter 12]. $\lhd$

**NUMERICAL CORNER** In Julia, the package Distributions.jl also provides a way to sample from mixture models. This is better seen on an example.

In [1]:

```
# Julia version: 1.5.1
using Random, Distributions
```

In [2]:

```
Random.seed!(123); # Setting the seed
```

In [3]:

```
# constructs a mixture of three normal distributions,
# with prior probabilities [0.2, 0.5, 0.3]
d = MixtureModel(Normal[
Normal(-2.0, 1.2),
Normal(0.0, 1.0),
Normal(3.0, 2.5)],
[0.2, 0.5, 0.3])
```

Out[3]:

In [4]:

```
N = 5 # number of samples
rand(d,N)
```

Out[4]:

Another powerful approach is the use of conditional independence.

We begin with the definition. First an exercise: in words, the conditional probabilities satisfy the product rule.

*Exercise:* Let $A$, $B$ and $C$ be events. Use the product rule to show that

$\lhd$

**Definition (Conditional Independence):** Let $A, B, C$ be events such that $\P[C] > 0$. Then $A$ and $B$ are conditionally independent given $C$, denoted $A \indep B | C$, if

$\lhd$

In words, quoting [Wikipedia]:

$A$ and $B$ are conditionally independent given $C$ if and only if, given knowledge that $C$ occurs, knowledge of whether $A$ occurs provides no information on the likelihood of $B$ occurring, and knowledge of whether $B$ occurs provides no information on the likelihood of $A$ occurring.

In general, conditionally independent events are not (unconditionally) independent.

**Example:** Imagine I have two six-sided dice. Dice $a$ has faces $\{1,3,5,7,9,11\}$ and dice $b$ has faces $\{2, 4, 6, 8, 10, 12\}$. Suppose I perform the following experiment: I pick one of the two dice uniformly at random, and then I roll that die twice. Let $X_1$ and $X_2$ be the outcomes of the rolls. Consider the events $A = \{X_1 = 1\}$, $B = \{X_2 = 2\}$, and $C = \{\text{die $a$ is picked}\}$. The events $A$ and $B$ are clearly dependent: if $A$ occurs, then I know that die $a$ was picked, and hence $B$ cannot occur. Knowledge of one event provides information about the likelihood of the other event occurring. Formally, by the law of total probability,

Similarly $\P[B] = \frac{1}{12}$. Yet $\P[A \cap B] = 0 \neq \frac{1}{12} \frac{1}{12}$.

On the other hand, we claim that $A$ and $B$ are conditionally independent given $C$. Again this is intuitively clear: once I pick a die, the two rolls are independent. For a given die choice, knowledge of one roll provides no information about the likelihood of the other roll. Note that the phrase "for a given die choice" is critical in the last statement. Formally, by our experiment, we have $\P[A|C] = 1/6$, $\P[B|C] = 0$ and $\P[A \cap B|C] = 0$. So indeed

$$ \P[A \cap B| C] = \P[A|C] \,\P[B|C] $$as claimed. $\lhd$

The following exercises are further instances of the important principle that conditional probabilities satisfy the same rules as probabilities.

*Exercise:* Let $A, B ,C$ be events such that $\P[C] > 0$ and $A \indep B | C$. Show that $A \indep B^c | C$. $\lhd$

*Exercise:* Let $A, B, C$ be events such that $\P[B \cap C], \P[A \cap C] > 0$. Show that $A \indep B | C$ if and only if

$\lhd$

*Exercise (Adding Data):* Let $A, B, C$ be events such that $\P[B \cap C] > 0$.

(a) First show that

$$ \P[A|B\cap C] = \frac{\P[C|A\cap B] \,\P[A|B]}{\P[C|A\cap B] \,\P[A|B] + \P[C|A^c\cap B] \,\P[A^c|B]}. $$(b) Now suppose $B \indep C | A$. Show that

$$ \P[A|B\cap C] = \frac{\P[C|A] \,\P[A|B]}{\P[C|A] \,\P[A|B] + \P[C|A^c] \,\P[A^c|B]}. $$$\lhd$

Conditional independence is naturally extended to random vectors.

**Definition (Conditional Independence of Random Vectors):** Let $\bX, \bY, \bW$ be discrete random vectors. Then $\bX$ and $\bY$ are said to be conditionally independent given $\bW$, denoted $\bX \indep \bY | \bW$ if for all $\bx \in \S_\bX$, $\by \in \S_\bY$ and $\bw \in \S_\bW$

$\lhd$

An important consequence is that we can drop the conditioning by the independent variable.

**Lemma (Role of Independence):** Let $\bX, \bY, \bW$ be discrete random vectors such that $\bX \indep \bY | \bW$. For all $\bx \in \S_\bX$, $\by \in \S_\bY$ and $\bw \in \S_\bW$,

*Proof:* In a previous exercise, we showed that $A \indep B | C$ implies $\P[A | B\cap C] = \P[A | C]$. That implies the claim. $\square$

Here are a couple exercises to get used to the definition.

*Exercise:* Let $\bX, \bY, \bZ, \bW$ be discrete random vectors. Show that $\bX \indep (\bY, \bZ) | \bW$ implies that $\bX \indep \bY|\bW$ and $\bX \indep \bZ|\bW$. $\lhd$

*Exercise:* Let $\bX, \bY, \bZ$ be discrete random vectors. Suppose that $\bX \indep \bY | \bZ$ and $\bX \indep \bZ$. Show that $\bX \indep (\bY, \bZ)$. $\lhd$

The case of three random variables exemplifies key probabilistic relationships. By the product rule, we can write

$$ \P[X=x, Y=y, Z=z] = \P[X=x] \,\P[Y=y|X=x] \,\P[Z=z | X=x, Y=y]. $$This is conveniently represented through a digraph where the vertices are the variables. Recall that an arrow $(i,j)$, from $i$ to $j$, indicates that $i$ is a parent of $j$ and that $j$ is a child of $i$. Let $\pa(i)$ be the set of parents of $i$. The digraph $G = (V, E)$ below encodes the following sampling scheme, referred as ancestral sampling:

First we pick $X$ according to its marginal $\P[X=x]$. Note that $X$ has no parent in $G$.

Second we pick $Y$ according to the CPD $\P[Y=y|X=x]$. Note that $X$ is the only parent of $Y$.

Finally we pick $Z$ according to the CPD $\P[Z=z|X=x, Y=y]$. Note that the parents of $Z$ are $X$ and $Y$.

Note that the graph above is acyclic, that is, it has no directed cycle. In particular, the ordering of variables $X, Y, Z$ is a topological ordering, that is, an ordering of the vertices where all edges $(i,j)$ are such that $i$ comes before $j$.

The same joint distribution can be represented by a different digraph if the product rule is used in a different order. For instance,

$$ \P[X=x, Y=y, Z=z] = \P[Z=z] \,\P[Y=y|Z=z] \,\P[X=x | Z=z, Y=y] $$is represented by the following digraph.

**The fork:** Removing edges in the first graph above encodes conditional independence relations. For instance, removing the edge from $Y$ to $Z$ gives the following graph, known as a fork. We denote this configuration as $Y \leftarrow X \rightarrow Z$.

The joint distribution simplifies as follows:

$$ \P[X=x, Y=y, Z=z] = \P[X=x] \,\P[Y=y|X=x] \,\P[Z=z | X=x]. $$So, in this case, what has changed is that the CPD of $Z$ does not depend on the value of $Y$. That is, $Z \indep Y|X$. Indeed, we can check that claim directly

$$ \begin{align*} \P[Y= y, Z=z|X=x] &= \frac{\P[X=x, Y= y, Z=z]}{\P[X=x]}\\ &= \frac{\P[X=x] \,\P[Y=y|X=x] \,\P[Z=z | X=x]}{\P[X=x]}\\ &= \P[Y=y|X=x] \,\P[Z=z | X=x] \end{align*} $$as claimed.

**The chain:** Removing the edge from $X$ to $Z$ gives the following graph, known as a chain (or pipe). We denote this configuration as $X \rightarrow Y \rightarrow Z$.

The joint distribution simplifies as follows:

$$ \P[X=x, Y=y, Z=z] = \P[X=x] \,\P[Y=y|X=x] \,\P[Z=z | Y=y]. $$In this case, what has changed is that the CPD of $Z$ does not depend on the value of $X$. Compare that to the fork. The corresponding conditional independence relation is $Z \indep X|Y$. Indeed, we can check that claim directly

$$ \begin{align*} \P[X= x, Z=z|Y=y] &= \frac{\P[X=x, Y= y, Z=z]}{\P[Y=y]}\\ &= \frac{\P[X=x] \,\P[Y=y|X=x] \,\P[Z=z | Y=y]}{\P[Y=y]} \end{align*} $$Now we have to use Bayes' rule to get

$$ \begin{align*} &= \frac{\P[X=x] \,\P[Y=y|X=x] \,\P[Z=z | Y=y]}{\P[Y=y]}\\ &= \frac{\P[Y=y|X=x]\,\P[X=x]}{\P[Y=y]} \P[Z=z | Y=y]\\ &= \P[X=x|Y=y] \,\P[Z=z | Y=y] \end{align*} $$as claimed.

**The collider:** Removing the edge from $X$ to $Y$ gives the following graph, known as a collider. We denote this configuration as $X \rightarrow Z \leftarrow Y$.

The joint distribution simplifies as follows:

$$ \P[X=x, Y=y, Z=z] = \P[X=x] \,\P[Y=y] \,\P[Z=z | X=x, Y=y]. $$In this case, what has changed is that the CPD of $Y$ does not depend on the value of $X$. Compare that to the fork and the chain. This time we have $X \indep Y$. Indeed, we can check that claim directly

$$ \begin{align*} \P[X= x, Y=y] &= \sum_{z \in \S_z} \P[X=x, Y=y, Z=z]\\ &= \sum_{z \in \S_z} \P[X=x] \,\P[Y=y] \,\P[Z=z | X=x, Y=y]\\ &= \P[X=x] \,\P[Y=y] \end{align*} $$as claimed.

Perhaps counter-intuitively, conditioning on $Z$ makes $X$ and $Y$ dependent in general. This is known as explaining away or Berkson's Paradox. Quoting my colleague Jordan Ellenberg:

Suppose Alex will only date a man if his niceness plus his handsomeness exceeds some threshold. Then nicer men do not have to be as handsome to qualify for Alex's dating pool. So, among the men that Alex dates, Alex may observe that the nicer ones are less handsome on average (and vice versa), even if these traits are uncorrelated in the general population. Note that this does not mean that men in the dating pool compare unfavorably with men in the population. On the contrary, Alex's selection criterion means that Alex has high standards. The average nice man that Alex dates is actually more handsome than the average man in the population (since even among nice men, the ugliest portion of the population is skipped). Berkson's negative correlation is an effect that arises within the dating pool: the rude men that Alex dates must have been even more handsome to qualify.

We end this subsection with two key examples.

**Example (Naive Bayes):** This model is useful for document classification. We assume that a document has a single topic $C$ from a list $\mathcal{C} = \{c_1, \ldots, c_K\}$ with probability distribution $\pi_k = \P[C = c_k]$. There is a vocabulary of size $M$ and we record the presence or absence of a word $m$ in the document with a Bernoulli variable $X_m$, where $p_{km} = \P[X_m = 1|C = c_k]$. We denote by $\bX = (X_1, \ldots, X_M)$ the corresponding vector.

The conditional independence assumption comes next: we assume that, given a topic $C$, the word occurrences are independent. That is,

$$ \begin{align*} \P[\bX = \bx|C=c_k] &= \prod_{m=1}^M \P[X_m = x_m|C = c_k]\\ &= \prod_{m=1}^M p_{km}^{x_m} (1-p_{km})^{1-x_m}. \end{align*} $$Finally, the joint distribution is

$$ \begin{align*} \P[C = c_k, \bX = \bx] &= \P[\bX = \bx|C=c_k] \,\P[C=c_k]\\ &= \pi_k \prod_{m=1}^M p_{km}^{x_m} (1-p_{km})^{1-x_m}. \end{align*} $$$\lhd$

**Example (Markov Chain):** Markov chains are widely used to model sequential data. We restrict ourselves here to discrete-time, time-homogeneous, finite-space Markov chains. But the conditional independence is at the heart of Markov chains more generally. Let $\bX_0, \bX_1, \ldots, \bX_T$ be a sequence of random vectors taking values in a common space $\S$, where the index represents time. In words the basic assumption is the following: we assume that the past and future are conditionally independent given the present. Formally, for all $1 \leq t \leq T-1$,

The implication of these assumptions is that the joint distribution (also known referred to as the distribution of a sample path) is

$$ \begin{align*} &\P[\bX_0 = \bx_0, \bX_1 = \bx_1, \ldots, \bX_T =\bx_T]\\ &\quad = \P[\bX_0 = \bx_0]\prod_{t=1}^T \P[\bX_t =\bx_t\,|\,\bX_{t-1} = \bx_{t-1}, \ldots, \bX_0 = \bx_0]\\ &\quad = \P[\bX_0 = \bx_0] \prod_{t=1}^T \P[\bX_t =\bx_t\,|\,\bX_{t-1} = \bx_{t-1}] \end{align*} $$where we used that, by a previous exercise, $(\bX_0,\ldots,\bX_{t-2}) \indep (\bX_{t},\ldots,\bX_T) \,|\, \bX_{t-1}$ implies $(\bX_0,\ldots,\bX_{t-2}) \indep \bX_{t} \,|\, \bX_{t-1}$. That in turn implies the above simplfication by the Role of Independence.

Now we assume that the common state space $\S$ is finite, say $\{1,\ldots,m\}$ for concreteness. We also assume the process is time-homogeneous, which means that

$$ \P[\bX_t =\bx\,|\,\bX_{t-1} = \bx'] = \P[\bX_1 =\bx\,|\,\bX_{0} = \bx'] =: p_{\bx',\bx}, \quad \forall t=1,\ldots,T. $$We can think of the transition probabilities $P = (p_{\bx',\bx})_{\bx,\bx' \in \S}$ as an $m \times m$ nonnegative matrix. In fact, $P$ is a stochastic matrix, that is, its rows necessarily sum to one. Indeed,

$$ \sum_{\bx \in \S} p_{\bx',\bx} = \sum_{\bx \in \S} \P[\bX_1 =\bx\,|\,\bX_{0} = \bx'] = \P[\bX_1 \in \S \,|\,\bX_{0} = \bx'] = 1 $$by the properties of the conditional probability. We also let $\mu_{\bx} = \P[X_0 = \bx]$ and we think of $\bmu = (\mu_{\bx})_{\bx \in \S}$ as a row vector.

With this notation, the joint distribution becomes simply

$$ \P[\bX_0 = \bx_0, \bX_1 = \bx_1, \ldots, \bX_T =\bx_T] = \bmu_{\bx_0} \prod_{t=1}^T p_{\bx_{t-1},\bx_t}. $$This formula has a remarkable consequence. The marginal distribution of $\bx_s$ is a matrix power of $P$. Indeed, for any $1 \leq s \leq T$,

$$ \begin{align*} \P[\bX_s = \bx_s] &= \sum_{\bx_0, \ldots, \bx_{s-1} \in \S} \sum_{\bx_{s+1}, \ldots, \bx_{T} \in \S} \bmu_{\bx_0} \prod_{t=1}^T p_{\bx_{t-1},\bx_t}\\ &= \sum_{\bx_0, \ldots, \bx_{s-1} \in \S} \bmu_{\bx_0} \prod_{t=1}^{s} p_{\bx_{t-1},\bx_t} \sum_{\bx_{s+1}, \ldots, \bx_{T} \in \S} \prod_{t=s+1}^T p_{\bx_{t-1},\bx_t}. \end{align*} $$The most interior sum simplifies to $$ \begin{align*} \sum_{\bx_{s+1}, \ldots, \bx_{T} \in \S} \prod_{t=s+1}^T p_{\bx_{t-1},\bx_t} &= \sum_{\bx_{s+1}, \ldots, \bx_{T} \in \S} \prod_{t=s+1}^{T-1} p_{\bx_{t-1},\bx_t} \sum_{\bx_{T} \in \S} p_{\bx_{T-1},\bx_T}\\ & = \sum_{\bx_{s+1}, \ldots, \bx_{T-1} \in \S} \prod_{t=s+1}^{T-1} p_{\bx_{t-1},\bx_t}\\ & = \cdots\\ & = 1 \end{align*} $$

where we repeatedly used the fact that the rows of $P$ sum to $1$.

The remaining sum can be simplified in a similar fashion

$$ \begin{align*} &\sum_{\bx_0, \ldots, \bx_{s-1} \in \S} \bmu_{\bx_0} \prod_{t=1}^{s} p_{\bx_{t-1},\bx_t}\\ & \quad = \sum_{\bx_0 \in \S} \bmu_{\bx_0} \sum_{\bx_{1} \in \S} p_{\bx_{0},\bx_{1}} \cdots \sum_{\bx_{s-2} \in \S} p_{\bx_{s-3},\bx_{s-2}} \sum_{\bx_{s-1} \in \S} p_{\bx_{s-2},\bx_{s-1}} \,p_{\bx_{s-1},\bx_s}\\ & \quad = \sum_{\bx_0 \in \S} \bmu_{\bx_0} \sum_{\bx_{1} \in \S} p_{\bx_{0},\bx_{1}} \cdots \sum_{\bx_{s-2} \in \S} p_{\bx_{s-3},\bx_{s-2}} \, \left(P^2\right)_{\bx_{s-2}, \bx_s} \\ & \quad = \cdots \\ & \quad = \left(\bmu P^s\right)_{\bx_s}. \end{align*} $$Thus we have established that, for any $1 \leq s \leq T$,

$$ \P[\bX_s = \bx_s] = \left(\bmu P^s\right)_{\bx_s}. $$This type of recursive simplification will play a major role in the next section. $\lhd$

The time homogeneity in the model above is an example of parameter sharing, a standard way to reduce the complexity of a model.