# Optimality, convexity, and gradient descent¶

## 7 Neural networks¶

Course: Math 535 - Mathematical Methods in Data Science (MMiDS)
Author: Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison
Updated: Nov 12, 2020

Feedforward neural networks generalize the previous examples, such as logistic regression, by allowing an arbitrary number of layers.


### 7.1 Background¶

Each of the main layers has two components, an affine map and a nonlinear activation function. For the latter, we restrict ourselves here to the sigmoid function (there are many other popular choices of activation functions), that is, the function

$$\sigma(t) = \frac{1}{1 + e^{-t}} \qquad t \in \mathbb{R}.$$

Its derivative is easy to compute

$$\sigma'(t) = \frac{e^{-t}}{(1 + e^{-t})^2} = \frac{1}{1 + e^{-t}}\left(1 - \frac{1}{1 + e^{-t}}\right) = \sigma(t) (1 - \sigma(t)).$$

The latter expression shows that $\sigma$ satisfies the logistic differential equation, which arises in a variety of applications, including the modeling of population dynamics. When applied to a vector component-wise, we denote the sigmoid function by $\sigma_\odot$; similarly, $\sigma'_{\odot}$ denotes the component-wise derivative.
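As a quick sanity check, the identity $\sigma'(t) = \sigma(t)(1-\sigma(t))$ can be verified numerically; a minimal sketch in Julia (the names `σ` and `dσ` are illustrative, not from the notes):

```julia
σ(t) = 1 / (1 + exp(-t))      # sigmoid
dσ(t) = σ(t) * (1 - σ(t))     # claimed closed form of the derivative

# central finite difference approximation of σ' at t = 0.5
h = 1e-6
fd = (σ(0.5 + h) - σ(0.5 - h)) / (2h)

# the two expressions agree up to discretization error
err = abs(fd - dσ(0.5))
```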

We consider an arbitrary number of layers $L+1$. Layer $i$, for $i=1,\ldots,L$, is defined by a continuously differentiable function $g_i$ which takes two vector-valued inputs: a vector of parameters $\mathbf{w}_i \in \mathbb{R}^{r_i}$ specific to the $i$-th layer and a vector of inputs $\mathbf{z}_i \in \mathbb{R}^{n_i}$ which is fed from the $(i-1)$-th layer

$$g_i : \mathbb{R}^{r_i + n_i} \to \mathbb{R}^{n_{i+1}}.$$

The output of $g_i$ is a vector in $\mathbb{R}^{n_{i+1}}$ which is passed to the $(i+1)$-th layer as input. Here $r_i = n_{i+1}(n_i + 1)$ and $\mathbf{w}_i = (\mathbf{w}^{(1)}_i;\ldots;\mathbf{w}^{(n_{i+1})}_i)$ are the parameters with $\mathbf{w}^{(k)}_i \in \mathbb{R}^{n_i+1}$. Specifically, $g_i$ is given by

$$g_i(\mathbf{w}_i;\mathbf{z}_i) = \left[\sigma\left(\sum_{j=1}^{n_i} w^{(k)}_{i,j} z_{i,j} + w^{(k)}_{i,n_i+1}\right)\right]_{k=1}^{n_{i+1}} = \sigma_{\odot}\left(\mathcal{W}_i \mathbf{z}_i^{;1}\right)$$

where we define $\mathcal{W}_i \in \mathbb{R}^{n_{i+1} \times (n_i+1)}$ as the matrix with rows $(\mathbf{w}_i^{(1)})^T,\ldots,(\mathbf{w}_i^{(n_{i+1})})^T$.
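Concretely, a layer with $n_i = 3$ inputs and $n_{i+1} = 2$ neurons can be evaluated as follows (a minimal sketch; the matrix entries are arbitrary illustrative values):

```julia
σ(t) = 1 / (1 + exp(-t))

# layer with n_i = 3 inputs and n_{i+1} = 2 neurons;
# each row of W holds one neuron's weights followed by its bias term
W = [0.1 -0.2  0.3 0.0;
     0.5  0.4 -0.1 0.2]
z = [1.0, 2.0, 3.0]

# g_i(w_i; z_i) = σ_⊙(W_i z_i^{;1}), where z^{;1} = [z; 1]
g = σ.(W * vcat(z, 1.0))
```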

Each component of $g_i$ is referred to as a neuron.


The first layer takes as input $\mathbf{z}_1 = \mathbf{x} \in \mathbb{R}^d$ so that $n_1 = d$. As in logistic regression, layer $L$ has $K-1$ outputs so that $n_{L+1} = K-1$.

Also as in logistic regression, layer $L+1$ is the modified softmax function

$$g_{L+1}(\mathbf{z}_{L+1}) = \tilde\gamma(\mathbf{z}_{L+1}),$$

which has no parameter.

So the output of the classifier with parameters $\mathbf{w} =(\mathbf{w}_1;\ldots;\mathbf{w}_L)$ on input $\mathbf{x}$ is

$$h_{\mathbf{x}}(\mathbf{w}) = g_{L+1}(g_{L}(\mathbf{w}_{L}; \cdots g_2(\mathbf{w}_2;g_1(\mathbf{w}_1;\mathbf{x})) \cdots)).$$

We use the cross-entropy as the loss function. That is, we set

$$\ell_{\mathbf{y}}(\mathbf{z}) = H(\mathbf{y}, \mathbf{z}) = - \sum_{i=1}^K y_i \log z_i,$$

where $\mathbf{z} \in \mathbb{R}^K$ is the output of the network.
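For example, with $K = 3$ classes, one-hot label $\mathbf{y} = (0,1,0)$, and network output $(0.2, 0.7, 0.1)$, the cross-entropy reduces to $-\log 0.7 \approx 0.357$:

```julia
y = [0.0, 1.0, 0.0]           # one-hot label, K = 3
z = [0.2, 0.7, 0.1]           # network output (a probability vector)
H = -sum(y .* log.(z))        # cross-entropy; equals -log(0.7) here
```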

Finally,

$$f_{\mathbf{x},\mathbf{y}}(\mathbf{w}) = \ell_{\mathbf{y}}(h_{\mathbf{x}}(\mathbf{w})).$$

### 7.2 Computing the gradient¶

We now detail how to compute the gradient of $f_{\mathbf{x},\mathbf{y}}(\mathbf{w})$. In the forward loop, we first set $\mathbf{z}_1 := \mathbf{x}$ and then we compute for $i =1,\ldots,L$

\begin{align*} \mathbf{z}_{i+1} &:= g_i(\mathbf{w}_i;\mathbf{z}_i) = \left[\sigma\left(\sum_{j=1}^{n_i} w^{(k)}_{i,j} z_{i,j} + w^{(k)}_{i,n_i+1}\right)\right]_{k=1}^{n_{i+1}} = \sigma_{\odot}\left(\mathcal{W}_i \mathbf{z}_i^{;1}\right)\\ \begin{pmatrix} A_i & B_i \end{pmatrix} &:= J_{g_i}(\mathbf{w}_i;\mathbf{z}_i). \end{align*}

To compute the Jacobian of $g_i$, we use the Chain Rule on the composition $g_i(\mathbf{w}_i;\mathbf{z}_i) = \sigma_{\odot}(h_i(\mathbf{w}_i;\mathbf{z}_i))$ where we define $h_i(\mathbf{w}_i;\mathbf{z}_i) = \mathcal{W}_i \mathbf{z}_i^{;1}$. That is,

$$J_{g_i}(\mathbf{w}_i;\mathbf{z}_i) = J_{\sigma_{\odot}}\left(\mathcal{W}_i \mathbf{z}_i^{;1}\right) J_{h_i}(\mathbf{w}_i;\mathbf{z}_i).$$

In our analysis of logistic regression, we computed the Jacobian of $h_i$. We obtained

$$J_{h_i}(\mathbf{w}_i;\mathbf{z}_i) = \begin{pmatrix} \mathbb{A}_{n_i,n_{i+1}}[\mathbf{z}_i] & \mathbb{B}_{n_i,n_{i+1}}[\mathbf{w}_i] \end{pmatrix}.$$

Recall that

$$\mathbb{A}_{n_i,n_{i+1}}[\mathbf{z}_i] = \begin{pmatrix} \mathbf{e}_1 (\mathbf{z}_i^{;1})^T & \cdots & \mathbf{e}_{n_{i+1}} (\mathbf{z}_i^{;1})^T \end{pmatrix}$$

where here $\mathbf{e}_{j}$ is the $j$-th canonical basis vector in $\mathbb{R}^{n_{i+1}}$ and

$$\mathbb{B}_{n_i,n_{i+1}}[\mathbf{w}_i] = \begin{pmatrix} ((\mathbf{w}^{(1)}_i)^{:n_i})^T\\ \vdots\\ ((\mathbf{w}^{(n_{i+1})}_i)^{:n_i})^T \end{pmatrix}.$$

The Jacobian of $\sigma_{\odot}(\mathbf{t})$ can be computed from $\sigma'$. Indeed,

$$\frac{\partial}{\partial t_j} \sigma_{\odot}(\mathbf{t})_j = \frac{\partial}{\partial t_j} \sigma(t_j) = \sigma'(t_j) = \frac{e^{-t_j}}{(1 + e^{-t_j})^2},$$

while for $\ell \neq j$

$$\frac{\partial}{\partial t_j} \sigma_{\odot}(\mathbf{t})_\ell = \frac{\partial}{\partial t_j} \sigma(t_\ell) =0.$$

In other words, the $j$-th column of the Jacobian is $\sigma'(t_j) \,\mathbf{e}_j$ where again $\mathbf{e}_{j}$ is the $j$-th canonical basis vector in $\mathbb{R}^{n_{i+1}}$. So $J_{\sigma_{\odot}}(\mathbf{t})$ is the diagonal matrix with diagonal entries $\sigma'(t_j)$, $j=1, \ldots, n_{i+1}$, which we denote

$$J_{\sigma_{\odot}}(\mathbf{t}) = \mathrm{diag}(\sigma'_{\odot}(\mathbf{t})).$$
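The diagonal structure of $J_{\sigma_{\odot}}$ is easy to confirm numerically; a minimal sketch (the test point `t` is arbitrary):

```julia
using LinearAlgebra

σ(t) = 1 / (1 + exp(-t))
dσ(t) = σ(t) * (1 - σ(t))

t = [0.3, -1.2, 2.0]
h = 1e-6

# finite-difference Jacobian of t ↦ σ.(t): column j perturbs coordinate j
cols = [(σ.(t .+ h .* (1:3 .== j)) .- σ.(t .- h .* (1:3 .== j))) ./ (2h) for j in 1:3]
J = hcat(cols...)

# J matches diag(σ'_⊙(t)) up to discretization error
err = maximum(abs.(J - Diagonal(dσ.(t))))
```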

Combining the previous formulas, we get

\begin{align*} J_{g_i}(\mathbf{w}_i;\mathbf{z}_i) &= J_{\sigma_{\odot}}\left(\mathcal{W}_i \mathbf{z}_i^{;1}\right) J_{h_i}(\mathbf{w}_i;\mathbf{z}_i)\\ &= \mathrm{diag}\left(\sigma'_{\odot}\left(\mathcal{W}_i \mathbf{z}_i^{;1}\right)\right) \begin{pmatrix} \mathbb{A}_{n_i,n_{i+1}}[\mathbf{z}_i] & \mathbb{B}_{n_i,n_{i+1}}[\mathbf{w}_i] \end{pmatrix}\\ &=\begin{pmatrix} \widetilde{\mathbb{A}}_{n_i,n_{i+1}}[\mathbf{w}_i;\mathbf{z}_i] & \widetilde{\mathbb{B}}_{n_i,n_{i+1}}[\mathbf{w}_i;\mathbf{z}_i] \end{pmatrix} \end{align*}

where we define

$$\widetilde{\mathbb{A}}_{n_i,n_{i+1}}[\mathbf{w}_i;\mathbf{z}_i] = \begin{pmatrix} \sigma'\left((\mathbf{w}_i^{(1)})^T \,\mathbf{z}_i^{;1}\right)\mathbf{e}_1 (\mathbf{z}_i^{;1})^T & \cdots & \sigma'\left((\mathbf{w}_i^{(n_{i+1})})^T \,\mathbf{z}_i^{;1}\right) \mathbf{e}_{n_{i+1}} (\mathbf{z}_i^{;1})^T \end{pmatrix}$$

and

$$\widetilde{\mathbb{B}}_{n_i,n_{i+1}}[\mathbf{w}_i;\mathbf{z}_i] = \begin{pmatrix} \sigma'\left((\mathbf{w}_i^{(1)})^T \,\mathbf{z}_i^{;1}\right) ((\mathbf{w}^{(1)}_i)^{:n_i})^T\\ \vdots\\ \sigma'\left((\mathbf{w}_i^{(n_{i+1})})^T \,\mathbf{z}_i^{;1}\right)((\mathbf{w}^{(n_{i+1})}_i)^{:n_i})^T \end{pmatrix}.$$

For layer $L+1$, we have previously computed the Jacobian of the modified softmax function. We get

\begin{align*} \mathbf{z}_{L+2} &:= g_{L+1}(\mathbf{z}_{L+1}) = \tilde\gamma(\mathbf{z}_{L+1})\\ B_{L+1} &:= J_{g_{L+1}}(\mathbf{z}_{L+1}) = J_{\tilde{\gamma}}(\mathbf{z}_{L+1}) = \mathbb{C}_{K}[\mathbf{z}_{L+1}]. \end{align*}

Refer to the previous notebook for an explicit formula for $\mathbb{C}_{K}[\mathbf{z}_{L+1}]$.

Also, as in the multinomial logistic regression case, the loss and gradient of the loss are

\begin{align*} \mathbf{z}_{L+3} &:= \ell_{\mathbf{y}}(\mathbf{z}_{L+2}) = - \sum_{i=1}^K y_i \log z_{L+2,i}\\ \mathbf{q}_{L+2} &:= \nabla \ell_{\mathbf{y}}(\mathbf{z}_{L+2}) = \left[- \frac{y_i}{z_{L+2,i}}\right]_{i=1}^K. \end{align*}

In summary, the algorithm proceeds as follows.

Initialization: $\mathbf{z}_1 := \mathbf{x}$

Forward layer loop: For $i = 1,\ldots,L$:

\begin{align*} \mathbf{z}_{i+1} &:= g_i(\mathbf{w}_i;\mathbf{z}_i) = \sigma_{\odot}\left(\mathcal{W}_i \mathbf{z}_i^{;1}\right)\\ \begin{pmatrix} A_i & B_i \end{pmatrix} &:= J_{g_i}(\mathbf{w}_i;\mathbf{z}_i) =\begin{pmatrix} \widetilde{\mathbb{A}}_{n_i,n_{i+1}}[\mathbf{w}_i;\mathbf{z}_i] & \widetilde{\mathbb{B}}_{n_i,n_{i+1}}[\mathbf{w}_i;\mathbf{z}_i] \end{pmatrix} \end{align*}

and

\begin{align*} \mathbf{z}_{L+2} &:= g_{L+1}(\mathbf{z}_{L+1}) = \tilde\gamma(\mathbf{z}_{L+1})\\ B_{L+1} &:= J_{g_{L+1}}(\mathbf{z}_{L+1}) = \mathbb{C}_{K}[\mathbf{z}_{L+1}]. \end{align*}

Loss:

\begin{align*} \mathbf{z}_{L+3} &:= \ell_{\mathbf{y}}(\mathbf{z}_{L+2})\\ \mathbf{q}_{L+2} &:= \nabla {\ell_{\mathbf{y}}}(\mathbf{z}_{L+2}) = \left[- \frac{y_i}{z_{L+2,i}}\right]_{i=1}^K. \end{align*}

Backward layer loop:

$$\mathbf{q}_{L+1} := B_{L+1}^T \mathbf{q}_{L+2} = \mathbb{C}_{K}[\mathbf{z}_{L+1}]^T \mathbf{q}_{L+2}$$

and for $i = L,\ldots,1$:

\begin{align*} \mathbf{p}_{i} &:= A_i^T \mathbf{q}_{i+1} = \widetilde{\mathbb{A}}_{n_i,n_{i+1}}[\mathbf{w}_i;\mathbf{z}_i]^T \mathbf{q}_{i+1}\\ &= \left(\sigma'\left((\mathbf{w}_i^{(1)})^T \,\mathbf{z}_i^{;1}\right) q_{i+1,1}\, \mathbf{z}_i^{;1}; \cdots ;\sigma'\left((\mathbf{w}_i^{(n_{i+1})})^T \,\mathbf{z}_i^{;1}\right) q_{i+1,n_{i+1}}\, \mathbf{z}_i^{;1} \right)\\ \mathbf{q}_{i} &:= B_i^T \mathbf{q}_{i+1} = \widetilde{\mathbb{B}}_{n_i,n_{i+1}}[\mathbf{w}_i;\mathbf{z}_i]^T \mathbf{q}_{i+1} \end{align*}

Output:

$$\nabla f_{\mathbf{x},\mathbf{y}}(\mathbf{w}) = (\mathbf{p}_1;\ldots;\mathbf{p}_L),$$

matching the ordering of $\mathbf{w} = (\mathbf{w}_1;\ldots;\mathbf{w}_L)$.
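The forward and backward loops above can be checked on a toy example. The sketch below implements the layer map and the corresponding backward step, with a squared loss standing in for the cross-entropy/softmax top of the notes (the names `layer` and `backward` and all numerical values are illustrative), and compares one gradient entry against a finite difference:

```julia
σ(t) = 1 / (1 + exp(-t))
dσ(t) = σ(t) * (1 - σ(t))

# One sigmoid layer: z_{i+1} = σ_⊙(W [z; 1]) with W of size n_{i+1} × (n_i + 1).
layer(W, z) = σ.(W * vcat(z, 1.0))

# Backward step through one layer: given q_{i+1} = ∂loss/∂z_{i+1},
# return (∂loss/∂W, ∂loss/∂z_i), i.e. the p_i and q_i of the backward loop.
function backward(W, z, q)
    s = dσ.(W * vcat(z, 1.0))             # σ' at the pre-activations
    P = (s .* q) * vcat(z, 1.0)'          # ∂loss/∂W, an n_{i+1} × (n_i+1) matrix
    qprev = W[:, 1:end-1]' * (s .* q)     # ∂loss/∂z_i (bias column dropped)
    return P, qprev
end

# toy two-layer network with a squared loss on top
W1 = [0.1 -0.2 0.3; 0.4 0.1 -0.5]         # 2 × (2+1)
W2 = [0.2 -0.3 0.1]                       # 1 × (2+1)
x = [0.5, -1.0]
loss(W1, W2) = sum(abs2, layer(W2, layer(W1, x)))

# forward pass, then backward loop
z2 = layer(W1, x)
z3 = layer(W2, z2)
q3 = 2 .* z3                              # gradient of the squared loss
P2, q2 = backward(W2, z2, q3)
P1, _  = backward(W1, x, q2)

# compare P1[1,1] against a central finite difference in W1[1,1]
h = 1e-6
W1p = copy(W1); W1p[1,1] += h
W1m = copy(W1); W1m[1,1] -= h
fd = (loss(W1p, W2) - loss(W1m, W2)) / (2h)
```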

### 7.3 Implementation¶

In [2]:
using Flux, Flux.Data.MNIST, Statistics
using Flux.Data: DataLoader
using IterTools: ncycle
imgs = MNIST.images()
labels = MNIST.labels()
Xtrain = reduce(hcat, [reshape(Float32.(imgs[i]),:) for i = 1:length(imgs)]);
ytrain = onehotbatch(labels, 0:9);
In [3]:
test_imgs = MNIST.images(:test)
test_labels = MNIST.labels(:test)
Xtest = reduce(hcat,
    [reshape(Float32.(test_imgs[i]),:) for i = 1:length(test_imgs)])
ytest = onehotbatch(test_labels, 0:9);
In [4]:
m = Chain(
Dense(28^2, 32, σ),
Dense(32, 10),
softmax
)
Out[4]:
Chain(Dense(784, 32, σ), Dense(32, 10), softmax)
In [5]:
accuracy(x, y) = mean(onecold(m(x), 0:9) .== onecold(y, 0:9));
In [6]:
accuracy(Xtest, ytest)
Out[6]:
0.1247
In [7]:
accuracy(x, y) = mean(onecold(m(x), 0:9) .== onecold(y, 0:9))
loss(x, y) = crossentropy(m(x), y)
ps = params(m)
evalcb = () -> @show(accuracy(Xtest,ytest));
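The training cell below uses `loader` and `opt`, neither of which is defined in this excerpt. A plausible setup, assuming the earlier `using Flux` import, Flux's `DataLoader` (the batch size of 128 is an arbitrary choice), and the `ADAM` optimizer with default hyperparameters:

```julia
using Flux.Data: DataLoader

# mini-batches over the training set; batch size is an assumption
loader = DataLoader((Xtrain, ytrain), batchsize=128, shuffle=true)
opt = ADAM()
```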
In [8]:
@time train!(loss, ps, ncycle(loader, Int(2e1)), opt, cb = throttle(evalcb, 2))
accuracy(Xtest, ytest) = 0.1265
accuracy(Xtest, ytest) = 0.9416
accuracy(Xtest, ytest) = 0.9529
accuracy(Xtest, ytest) = 0.9605
25.973718 seconds (52.72 M allocations: 13.555 GiB, 2.89% gc time)
In [9]:
accuracy(Xtest, ytest)
Out[9]:
0.9613
In [10]:
onecold(m(Xtest[:,1]), 0:9)
Out[10]:
7
In [11]:
onecold(ytest[:,1], 0:9)
Out[11]:
7