TOPIC 3

Optimality, convexity, and gradient descent

6 Automatic differentiation: examples


Course: Math 535 - Mathematical Methods in Data Science (MMiDS)
Author: Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison
Updated: Nov 12, 2020
Copyright: © 2020 Sebastien Roch


6.1 Example 1: linear regression

Here $y$ is a real-valued outcome variable. We revisit the case of linear regression where the loss function is

$$ \ell_y(z_2) = (z_2 - y)^2 $$

and the regression function has a single layer (that is, $L=1$) with

$$ h_{\mathbf{x}}(\mathbf{w}) = g_1(\mathbf{w}_1;\mathbf{z}_1) = \sum_{j=1}^d w_{1,j} z_{1,j} + w_{1,d+1} $$

where $\mathbf{w}_1 = \mathbf{w} \in \mathbb{R}^{d+1}$ are the parameters and $\mathbf{z}_1 = \mathbf{x} \in \mathbb{R}^d$ is the input. Hence,

$$ f_{\mathbf{x},y}(\mathbf{w}) = \ell_y(g_1(\mathbf{w}_1;\mathbf{x})) = \left(\sum_{j=1}^d w_{1,j} x_{j} + w_{1,d+1} - y\right)^2. $$

It will be convenient to introduce the following notation. For a vector $\mathbf{z} \in \mathbb{R}^d$, the vector $\mathbf{z}^{;1} \in \mathbb{R}^{d+1}$ concatenates a $1$ at the end of $\mathbf{z}$, that is, $\mathbf{z}^{;1} = (\mathbf{z};1)$. For a vector $\mathbf{w} \in \mathbb{R}^{d+1}$, the vector $\mathbf{w}^{:d} \in \mathbb{R}^{d}$ drops the last entry of $\mathbf{w}$, that is, $\mathbf{w}^{:d} = (w_1,\ldots,w_d)$.
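In code, this notation amounts to appending a $1$ or dropping the last entry of a vector. A minimal Julia sketch (the helper names are ours, not part of the notebook):

append_one(z) = [z; 1.0]      # z^{;1}: append a 1 at the end of z
drop_last(w)  = w[1:end-1]    # w^{:d}: drop the last entry of w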

The forward pass in this case is:

Initialization:

$$\mathbf{z}_1 := \mathbf{x}$$

Forward layer loop:

$$ \begin{align*} z_2 &:= g_1(\mathbf{w}_1;\mathbf{z}_1) = \mathbf{w}_1^T \mathbf{z}_1^{;1}\\ \begin{pmatrix} A_1 & B_1 \end{pmatrix} &:= J_{g_1}(\mathbf{w}_1;\mathbf{z}_1) = (\mathbf{z}_1^{;1};\mathbf{w}_1^{:d})^T \end{align*} $$

Loss:

$$ \begin{align*} z_3 &:= \ell_{y}(z_2) = (z_2 - y)^2 = (\mathbf{w}^T \mathbf{x}^{;1} - y)^2\\ q_2 &:= \frac{\mathrm{d}}{\mathrm{d} z_2} {\ell_{y}}(z_2) = 2 (z_2 - y). \end{align*} $$

The backward pass is:

Backward layer loop:

$$ \begin{align*} \mathbf{p}_1 &:= A_1^T q_2 = 2 (z_2 - y) \,\mathbf{z}_1^{;1} \end{align*} $$

Output:

$$ \nabla f_{\mathbf{x}, y}(\mathbf{w}) = \mathbf{p}_1 = 2 \left(\mathbf{w}^T \mathbf{x}^{;1} - y\right) \,\mathbf{x}^{;1}. $$
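The forward and backward passes above can be implemented directly. Here is a minimal sketch for a single sample, with helper names of our own choosing, checked against the closed-form gradient just derived:

using LinearAlgebra

function linreg_grad(w::AbstractVector, x::AbstractVector, y::Real)
    z1aug = [x; 1.0]             # z_1^{;1}
    z2 = dot(w, z1aug)           # forward layer: g_1(w_1; z_1) = w_1^T z_1^{;1}
    q2 = 2 * (z2 - y)            # derivative of the loss: 2 (z_2 - y)
    return q2 .* z1aug           # backward layer: p_1 = A_1^T q_2
end

# sanity check against 2 (w^T x^{;1} - y) x^{;1}
w0, x0, y0 = randn(4), randn(3), 0.5
linreg_grad(w0, x0, y0) ≈ 2 * (dot(w0, [x0; 1.0]) - y0) * [x0; 1.0]    # returns true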
In [7]:
df = DataFrame(CSV.File("advertising.csv"))
first(df,5)
Out[7]:

5 rows × 5 columns

 Row │ Column1  TV       radio    newspaper  sales
     │ Int64    Float64  Float64  Float64    Float64
─────┼───────────────────────────────────────────────
   1 │       1    230.1     37.8       69.2     22.1
   2 │       2     44.5     39.3       45.1     10.4
   3 │       3     17.2     45.9       69.3      9.3
   4 │       4    151.5     41.3       58.5     18.5
   5 │       5    180.8     10.8       58.4     12.9

We first compute the solution using the least-squares approach we detailed previously.

In [9]:
X = reduce(hcat, [df[:,:TV], df[:,:radio], df[:,:newspaper]])
n = size(df, 1)    # number of samples
Xaug = hcat(ones(n), X)
y = df[:,:sales];
In [10]:
@time q = Xaug\y
  0.873739 seconds (2.89 M allocations: 141.946 MiB, 2.85% gc time)
Out[10]:
4-element Array{Float64,1}:
  2.938889369459415
  0.04576464545539759
  0.18853001691820445
 -0.0010374930424763011
In [11]:
mean((Xaug*q .- y).^2)
Out[11]:
2.7841263145109356
In [12]:
Xtrain = X'
ytrain = reshape(y, (1,length(y)))
loader = DataLoader(Xtrain, ytrain; batchsize=64, shuffle=true);
In [13]:
first(loader)[1]
Out[13]:
3×64 Array{Float64,2}:
  7.8  238.2   4.1  31.5  225.8  273.7  …  232.1  206.9  265.2  193.7  74.7
 38.9   34.3  11.6  24.6    8.2   28.9       8.6    8.4    2.9   35.4  49.4
 50.6    5.3   5.7   2.2   56.5   59.7       8.7   26.4   43.0   75.6  45.7
In [14]:
m = Dense(3, 1)
Out[14]:
Dense(3, 1)
In [15]:
loss(x,y) = mse(m(x),y)
Out[15]:
loss (generic function with 1 method)
In [16]:
ps = params(m)
opt = Descent(1e-5)
evalcb = () -> @show(loss(Xtrain, ytrain));
In [17]:
@time train!(loss, ps, ncycle(loader, Int(1e4)), opt, cb = throttle(evalcb, 2))
loss(Xtrain, ytrain) = 2160.6028562802303
loss(Xtrain, ytrain) = 3.8571577593201845
 18.668849 seconds (75.40 M allocations: 3.075 GiB, 4.78% gc time)
In [18]:
m.b
Out[18]:
1-element Array{Float32,1}:
 0.3245267
In [19]:
m.W
Out[19]:
1×3 Array{Float32,2}:
 0.0509482  0.218272  0.0144676
In [20]:
loss(Xtrain,ytrain)
Out[20]:
3.9033906662010316

6.2 Example 2: multinomial logistic regression

This time, we have $K$ functions that output a score associated to each label. We then transform these scores into a probability distribution over the $K$ labels. There are many ways of doing this. A standard approach is the softmax function: for $\mathbf{z} \in \mathbb{R}^K$

$$ \gamma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}, \quad i=1,\ldots,K. $$

To explain the name, observe that larger inputs are mapped to larger probabilities; in particular, the largest entry of $\mathbf{z}$ receives the largest probability, so $\gamma$ acts as a smooth surrogate for the maximum.

In fact, since a probability distribution must sum to $1$, it is determined by the probabilities assigned to the first $K-1$ labels. In other words, we can drop the score associated to the last label. Formally, we use the modified softmax function

$$ \tilde\gamma(\mathbf{z})_i = \begin{cases} \frac{e^{z_i}}{1 + \sum_{j=1}^{K-1} e^{z_j}} & i=1,\ldots,K-1\\ \frac{1}{1 + \sum_{j=1}^{K-1} e^{z_j}} & i=K \end{cases} $$

where this time $\mathbf{z} \in \mathbb{R}^{K-1}$.
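As a quick illustration, the softmax and modified softmax can be written as plain Julia functions (these are sketches of our own, not Flux's built-in softmax; a numerically careful implementation would subtract maximum(z) before exponentiating):

softmax_full(z) = exp.(z) ./ sum(exp.(z))          # γ, for z ∈ R^K

function softmax_modified(z)                       # γ̃, for z ∈ R^{K-1}
    e = exp.(z)
    s = 1 + sum(e)
    return [e ./ s; 1 / s]                         # length-K probability vector
end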

Hence we now have two layers, that is, $L = 2$. The layers are defined by the functions

$$ z_{2,k} = g_1(\mathbf{w}_1;\mathbf{z}_1)_k = \sum_{j=1}^d w^{(k)}_{1,j} z_{1,j} + w^{(k)}_{1,d+1}, \quad k=1,\ldots,K-1 $$

where $\mathbf{w}_1 = (\mathbf{w}^{(1)}_1;\ldots;\mathbf{w}^{(K-1)}_1)$ are the parameters with $\mathbf{w}^{(k)}_1 \in \mathbb{R}^{d+1}$ and $\mathbf{z}_1 = \mathbf{x} \in \mathbb{R}^d$ is the input, and

$$ \mathbf{z}_3 = g_2(\mathbf{z}_2) = \tilde\gamma(\mathbf{z}_2), $$

where $\tilde\gamma$ is the modified softmax function. Note that the latter has no associated parameter.

So the output of the classifier with parameters $\mathbf{w} = \mathbf{w}_1 =(\mathbf{w}^{(1)}_1;\ldots;\mathbf{w}^{(K-1)}_1)$ on input $\mathbf{x}$ is

$$ \begin{align*} h_{\mathbf{x}}(\mathbf{w})_i &= g_2(g_1(\mathbf{w}_1;\mathbf{z}_1))_i\\ &= \tilde\gamma\left(\left[\sum_{j=1}^d w^{(k)}_{1,j} x_{j} + w^{(k)}_{1,d+1}\right]_{k=1}^{K-1}\right)_i, \end{align*} $$

for $i=1,\ldots,K$.

It remains to define a loss function. To quantify the fit, it is natural to use a notion of distance between probability measures, here between the output $h_{\mathbf{x}}(\mathbf{w}) \in \Delta_K$ and the correct label $\mathbf{y} \in \{\mathbf{e}_1,\ldots,\mathbf{e}_{K}\} \subseteq \Delta_K$. There are many such measures. In multinomial logistic regression, we use the Kullback-Leibler divergence. For two probability distributions $\mathbf{p}, \mathbf{q} \in \Delta_K$, it is defined as

$$ \mathrm{KL}(\mathbf{p} \| \mathbf{q}) = \sum_{i=1}^K p_i \log \frac{p_i}{q_i} $$

where it will suffice to restrict ourselves to the case $\mathbf{q} > \mathbf{0}$ and where we use the convention $0 \log 0 = 0$ (so that terms with $p_i = 0$ contribute $0$ to the sum).
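As a small worked example, here is a direct Julia sketch of the KL divergence with the convention $0 \log 0 = 0$ and assuming $\mathbf{q} > \mathbf{0}$ (the function name is ours):

function kl_div(p::AbstractVector, q::AbstractVector)
    s = 0.0
    for i in eachindex(p)
        if p[i] > 0                  # convention: terms with p_i = 0 contribute 0
            s += p[i] * log(p[i] / q[i])
        end
    end
    return s
end

kl_div([1.0, 0.0], [0.7, 0.3])       # ≈ -log(0.7) ≈ 0.357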

Going back to the loss function, we use the identity $\log\frac{\alpha}{\beta} = \log \alpha - \log \beta$ to re-write

$$ \begin{align*} \mathrm{KL}(\mathbf{y} \| h_{\mathbf{x}}(\mathbf{w})) &= \sum_{i=1}^K y_i \log \frac{y_i}{h_{\mathbf{x}}(\mathbf{w})_i}\\ &= \sum_{i=1}^K y_i \log y_i - \sum_{i=1}^K y_i \log h_{\mathbf{x}}(\mathbf{w})_i. \end{align*} $$

Notice that the first term on the right-hand side does not depend on $\mathbf{w}$. Hence we can ignore it when optimizing $\mathrm{KL}(\mathbf{y} \| h_{\mathbf{x}}(\mathbf{w}))$. The remaining term

$$ H(\mathbf{y}, h_{\mathbf{x}}(\mathbf{w})) = - \sum_{i=1}^K y_i \log h_{\mathbf{x}}(\mathbf{w})_i $$

is known as the cross-entropy. We use it to define our loss function. That is, we set

$$ \ell_{\mathbf{y}}(\mathbf{z}_3) = H(\mathbf{y}, \mathbf{z}_3) = - \sum_{i=1}^K y_i \log z_{3,i}. $$

Finally,

$$ f_{\mathbf{x},\mathbf{y}}(\mathbf{w}) = \ell_{\mathbf{y}}(h_{\mathbf{x}}(\mathbf{w})) = - \sum_{i=1}^K y_i \log\tilde\gamma\left(\left[\sum_{j=1}^d w^{(k)}_{1,j} x_{j} + w^{(k)}_{1,d+1}\right]_{k=1}^{K-1}\right)_i. $$
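In code, the cross-entropy of a one-hot label $\mathbf{y}$ against a probability vector is a one-liner (a sketch of ours; Flux's crossentropy, used below, computes essentially the same quantity averaged over the columns of a batch):

cross_entropy(y, z3) = -sum(y .* log.(z3))     # assumes z3 > 0 entrywise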

The forward layer loop has two steps. First we compute

$$ \begin{align*} \mathbf{z}_{2} &:= g_1(\mathbf{w}_1;\mathbf{z}_1) = \left[\sum_{j=1}^d w^{(k)}_{1,j} z_{1,j} + w^{(k)}_{1,d+1}\right]_{k=1}^{K-1} = \mathcal{W}_1 \mathbf{z}_1^{;1}\\ \begin{pmatrix} A_1 & B_1 \end{pmatrix} &:= J_{g_1}(\mathbf{w}_1;\mathbf{z}_1) \end{align*} $$

where we define $\mathcal{W} = \mathcal{W}_1 \in \mathbb{R}^{(K-1)\times (d+1)}$ as the matrix with rows $(\mathbf{w}_1^{(1)})^T,\ldots,(\mathbf{w}_1^{(K-1)})^T$. To compute the Jacobian, let us look at the columns corresponding to the variables in $\mathbf{w}_1^{(k)}$, that is, columns $\alpha_k = (k-1) (d+1) + 1$ to $\beta_k = k(d+1)$. Note that only component $k$ of $g_1$ depends on $\mathbf{w}_1^{(k)}$, so the rows $\neq k$ of $J_{g_1}$ are $0$ for those columns. Row $k$ on the other hand is $(\mathbf{z}_1^{;1})^T$. Hence one way to write the columns $\alpha_k$ to $\beta_k$ of $J_{g_1}$ is $\mathbf{e}_k (\mathbf{z}_1^{;1})^T$, where here $\mathbf{e}_k \in \mathbb{R}^{K-1}$ is the $k$-th canonical basis vector of $\mathbb{R}^{K-1}$ (in a slight abuse of notation). So $A_1$ can be written in block form as

$$ A_1 = \begin{pmatrix} \mathbf{e}_1 (\mathbf{z}_1^{;1})^T & \cdots & \mathbf{e}_{K-1} (\mathbf{z}_1^{;1})^T \end{pmatrix} =: \mathbb{A}_{d,K-1}[\mathbf{z}_1], $$

where the last equality is a definition.

As for the columns corresponding to the variables in $\mathbf{z}_1$, that is, columns $(K-1) (d+1) + 1$ to $(K-1) (d+1) + d$, each row takes the same form. Indeed row $k$ is $((\mathbf{w}^{(k)}_1)^{:d})^T$. So $B_1$ can be written as

$$ B_1 = \begin{pmatrix} ((\mathbf{w}^{(1)}_1)^{:d})^T\\ \vdots\\ ((\mathbf{w}^{(K-1)}_1)^{:d})^T \end{pmatrix} =: \mathbb{B}_{d,K-1}[\mathbf{w}_1]. $$

In fact, we will not need $B_1$ here, but we will need it in a later section.
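For concreteness, these block matrices can be assembled with a Kronecker product; a sketch under the notation above (helper names are ours):

using LinearAlgebra

# A_1 = 𝔸_{d,K-1}[z_1]: column block k equals e_k (z_1^{;1})^T
A_block(z1, K) = kron(Matrix{Float64}(I, K - 1, K - 1), [z1; 1.0]')

# B_1 = 𝔹_{d,K-1}[w_1]: the matrix 𝒲 with its last column dropped
B_block(W) = W[:, 1:end-1]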

In the second step of the forward layer loop, we compute

$$ \begin{align*} \mathbf{z}_3 &:= g_2(\mathbf{z}_2) = \tilde\gamma(\mathbf{z}_2)\\ B_2 &:= J_{g_2}(\mathbf{z}_2) = J_{\tilde{\gamma}}(\mathbf{z}_2). \end{align*} $$

So we need to compute the Jacobian of $\tilde\gamma$. We divide the computation into several cases.

When $1 \leq i = j \leq K-1$,

$$ \begin{align*} (B_2)_{ii} &= \frac{\partial}{\partial z_{2,i}} \left[ \tilde\gamma(\mathbf{z}_2)_i \right]\\ &= \frac{\partial}{\partial z_{2,i}} \left[ \frac{e^{z_{2,i}}}{1 + \sum_{k=1}^{K-1} e^{z_{2,k}}} \right]\\ &= \frac{e^{z_{2,i}}\left(1 + \sum_{k=1}^{K-1} e^{z_{2,k}}\right) - e^{z_{2,i}}\left(e^{z_{2,i}}\right)} {\left(1 + \sum_{k=1}^{K-1} e^{z_{2,k}}\right)^2}\\ &= \frac{e^{z_{2,i}}\left(1 + \sum_{k = 1, k \neq i}^{K-1} e^{z_{2,k}}\right)} {\left(1 + \sum_{k=1}^{K-1} e^{z_{2,k}}\right)^2} =:\mathbb{C}_{K}[\mathbf{z}_2]_{ii}, \end{align*} $$

by the quotient rule.

When $1 \leq i, j \leq K-1$ with $i \neq j$,

$$ \begin{align*} (B_2)_{ij} &= \frac{\partial}{\partial z_{2,j}} \left[ \tilde\gamma(\mathbf{z}_2)_i \right]\\ &= \frac{\partial}{\partial z_{2,j}} \left[ \frac{e^{z_{2,i}}}{1 + \sum_{k=1}^{K-1} e^{z_{2,k}}} \right]\\ &= \frac{- e^{z_{2,i}}\left(e^{z_{2,j}}\right)} {\left(1 + \sum_{k=1}^{K-1} e^{z_{2,k}}\right)^2} =:\mathbb{C}_{K}[\mathbf{z}_2]_{ij}. \end{align*} $$

When $i = K$ and $1 \leq j \leq K-1$,

$$ \begin{align*} (B_2)_{ij} &= \frac{\partial}{\partial z_{2,j}} \left[ \tilde\gamma(\mathbf{z}_2)_i \right]\\ &= \frac{\partial}{\partial z_{2,j}} \left[ \frac{1}{1 + \sum_{k=1}^{K-1} e^{z_{2,k}}} \right]\\ &= \frac{- \left(e^{z_{2,j}}\right)} {\left(1 + \sum_{k=1}^{K-1} e^{z_{2,k}}\right)^2} =:\mathbb{C}_{K}[\mathbf{z}_2]_{ij}. \end{align*} $$
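Dividing numerators and denominators by the normalizer $1 + \sum_{k=1}^{K-1} e^{z_{2,k}}$, the three cases can be summarized entrywise as $(B_2)_{ij} = \tilde\gamma(\mathbf{z}_2)_i \left(\mathbf{1}_{\{i=j\}} - \tilde\gamma(\mathbf{z}_2)_j\right)$ for $j=1,\ldots,K-1$. A short Julia sketch of $\mathbb{C}_{K}[\mathbf{z}_2]$ along these lines (the function name is ours):

function jac_modified_softmax(z2::AbstractVector)
    K = length(z2) + 1
    e = exp.(z2)
    s = 1 + sum(e)
    p = [e ./ s; 1 / s]                                        # γ̃(z_2) ∈ R^K
    return [p[i] * ((i == j) - p[j]) for i in 1:K, j in 1:K-1]  # K × (K-1)
end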

The gradient of the loss function is

$$ \mathbf{q}_3 = \nabla \ell_{\mathbf{y}}(\mathbf{z}_3) = \left[- \frac{y_i}{z_{3,i}}\right]_{i=1}^K. $$

The backward layer loop also has two steps. Because $g_2$ does not have parameters, we only need to compute $\mathbf{q}_2$ and $\mathbf{p}_1$. We get for $1 \leq j \leq K-1$

$$ \begin{align*} q_{2,j} = [B_2^T \mathbf{q}_{3}]_j &= \sum_{i=1}^K \mathbb{C}_{K}[\mathbf{z}_2]_{ij} \left(- \frac{y_i}{\tilde\gamma(\mathbf{z}_2)_i}\right)\\ &= - \frac{y_j \left(1 + \sum_{k = 1, k \neq j}^{K-1} e^{z_{2,k}}\right)} {1 + \sum_{k=1}^{K-1} e^{z_{2,k}}} + \sum_{k=1, k \neq j}^{K} \frac{y_k e^{z_{2,j}}}{1 + \sum_{k=1}^{K-1} e^{z_{2,k}}}\\ &= - \frac{y_j \left(1 + \sum_{k = 1, k \neq j}^{K-1} e^{z_{2,k}}\right)} {1 + \sum_{k=1}^{K-1} e^{z_{2,k}}} + \sum_{k=1, k \neq j}^{K} \frac{y_k e^{z_{2,j}}}{1 + \sum_{k=1}^{K-1} e^{z_{2,k}}} + \frac{y_j e^{z_{2,j}} - y_j e^{z_{2,j}}}{1 + \sum_{k=1}^{K-1} e^{z_{2,k}}}\\ &= - y_j + \sum_{k=1}^{K} \frac{y_k e^{z_{2,j}}}{1 + \sum_{k=1}^{K-1} e^{z_{2,k}}}\\ &= - y_j + \frac{e^{z_{2,j}}}{1 + \sum_{k=1}^{K-1} e^{z_{2,k}}}, \end{align*} $$

where we used that $\sum_{k=1}^{K} y_k = 1$. That is, $\mathbf{q}_2 = \tilde\gamma(\mathbf{z}_2)^{:K-1} - \mathbf{y}^{:K-1}$, where both vectors have their last entry dropped.

It remains to compute $\mathbf{p}_1$. We have

$$ \begin{align*} \mathbf{p}_{1} &= A_1^T \mathbf{q}_{2}\\ &= \mathbb{A}_{d,K-1}[\mathbf{z}_1]^T (\tilde\gamma(\mathbf{z}_2)^{:K-1} - \mathbf{y}^{:K-1})\\ &= \begin{pmatrix} \mathbf{e}_1 (\mathbf{z}_1^{;1})^T & \cdots & \mathbf{e}_{K-1} (\mathbf{z}_1^{;1})^T \end{pmatrix}^T (\tilde\gamma(\mathbf{z}_2)^{:K-1} - \mathbf{y}^{:K-1})\\ &= \begin{pmatrix} \mathbf{z}_1^{;1} \mathbf{e}_1^T\\ \vdots\\ \mathbf{z}_1^{;1} \mathbf{e}_{K-1}^T \end{pmatrix} (\tilde\gamma(\mathbf{z}_2)^{:K-1} - \mathbf{y}^{:K-1})\\ &= \begin{pmatrix} (\tilde\gamma(\mathbf{z}_2)_1 - y_1)\, \mathbf{z}_1^{;1}\\ \vdots\\ (\tilde\gamma(\mathbf{z}_2)_{K-1} - y_{K-1}) \, \mathbf{z}_1^{;1} \end{pmatrix}. \end{align*} $$

In summary, the forward pass is:

Initialization:

$$\mathbf{z}_1 := \mathbf{x}$$

Forward layer loop:

$$ \begin{align*} \mathbf{z}_{2} &:= g_1(\mathbf{w}_1;\mathbf{z}_1) = \mathcal{W}_1 \mathbf{z}_1^{;1}\\ \begin{pmatrix} A_1 & B_1 \end{pmatrix} &:= J_{g_1}(\mathbf{w}_1;\mathbf{z}_1) = \begin{pmatrix} \mathbb{A}_{d,K-1}[\mathbf{z}_1] & \mathbb{B}_{d,K-1}[\mathbf{w}_1] \end{pmatrix}\end{align*} $$

$$ \begin{align*} \mathbf{z}_3 &:= g_2(\mathbf{z}_2) = \tilde\gamma(\mathbf{z}_2)\\ B_2 &:= J_{g_2}(\mathbf{z}_2) = \mathbb{C}_{K}[\mathbf{z}_2] \end{align*} $$

Loss:

$$ \begin{align*} z_4 &:= \ell_{\mathbf{y}}(\mathbf{z}_3) = - \sum_{i=1}^K y_i \log z_{3,i} = - \sum_{i=1}^K y_i \log \tilde\gamma\left(\mathcal{W}_1 \mathbf{x}^{;1}\right)_{i}\\ \mathbf{q}_3 &:= \nabla {\ell_{\mathbf{y}}}(\mathbf{z}_3) = \left[- \frac{y_i}{z_{3,i}}\right]_{i=1}^K. \end{align*} $$

The backward pass is:

Backward layer loop:

$$ \begin{align*} \mathbf{q}_{2} &:= B_2^T \mathbf{q}_{3} = \mathbb{C}_{K}[\mathbf{z}_2]^T \mathbf{q}_{3} \end{align*} $$

$$ \begin{align*} \mathbf{p}_{1} &:= A_1^T \mathbf{q}_{2} = \mathbb{A}_{d,K-1}[\mathbf{z}_1]^T \mathbf{q}_{2} \end{align*} $$

Output:

$$ \nabla f_{\mathbf{x},\mathbf{y}}(\mathbf{w}) = \mathbf{p}_1 = \begin{pmatrix} \left(\tilde\gamma\left(\mathcal{W} \,\mathbf{x}^{;1}\right)_1 - y_1\right)\, \mathbf{x}^{;1}\\ \vdots\\ \left(\tilde\gamma\left(\mathcal{W} \,\mathbf{x}^{;1}\right)_{K-1} - y_{K-1}\right) \, \mathbf{x}^{;1} \end{pmatrix}. $$
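Putting everything together, the gradient above can be computed in a few lines of Julia for a single sample (a sketch with our own helper names; W plays the role of $\mathcal{W}$ and y is the one-hot label):

using LinearAlgebra

function mlogreg_grad(W::AbstractMatrix, x::AbstractVector, y::AbstractVector)
    K = size(W, 1) + 1
    xaug = [x; 1.0]                   # x^{;1}
    z2 = W * xaug                     # forward layer 1: z_2 = 𝒲 x^{;1}
    e = exp.(z2)
    s = 1 + sum(e)
    z3 = [e ./ s; 1 / s]              # forward layer 2: z_3 = γ̃(z_2)
    q2 = z3[1:K-1] .- y[1:K-1]        # backward: q_2 = γ̃(z_2)^{:K-1} - y^{:K-1}
    return kron(q2, xaug)             # p_1: stacked blocks (γ̃(𝒲 x^{;1})_k - y_k) x^{;1}
end

mlogreg_grad(randn(2, 4), randn(3), [0.0, 1.0, 0.0])    # example with K = 3, d = 3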

We will use the MNIST dataset. Quoting Wikipedia:

The MNIST database (Modified National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems. The database is also widely used for training and testing in the field of machine learning. It was created by "re-mixing" the samples from NIST's original datasets. The creators felt that since NIST's training dataset was taken from American Census Bureau employees, while the testing dataset was taken from American high school students, it was not well-suited for machine learning experiments. Furthermore, the black and white images from NIST were normalized to fit into a 28x28 pixel bounding box and anti-aliased, which introduced grayscale levels. The MNIST database contains 60,000 training images and 10,000 testing images. Half of the training set and half of the test set were taken from NIST's training dataset, while the other half of the training set and the other half of the test set were taken from NIST's testing dataset.

Here is a sample of the images:

MNIST sample images

(Source)

In [21]:
imgs = MNIST.images()
labels = MNIST.labels()
length(imgs)
Out[21]:
60000
In [22]:
imgs[1]
Out[22]:
[28×28 grayscale image of the first training digit (a handwritten 5)]
In [23]:
labels[1]
Out[23]:
5
In [25]:
reshape(Float32.(imgs[1]),:)
Out[25]:
784-element Array{Float32,1}:
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 ⋮
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
In [26]:
Xtrain = reduce(hcat, [reshape(Float32.(imgs[i]),:) for i = 1:length(imgs)]);

We also convert the labels into vectors. We use one-hot encoding, that is, we convert the label 0 to the standard basis vector $\mathbf{e}_1 \in \mathbb{R}^{10}$, the label 1 to $\mathbf{e}_2 \in \mathbb{R}^{10}$, and so on. The functions onehot and onehotbatch perform this transformation, while onecold undoes it.

In [30]:
onehot(labels[1], 0:9)
Out[30]:
10-element Flux.OneHotVector:
 0
 0
 0
 0
 0
 1
 0
 0
 0
 0
In [31]:
onecold(ans, 0:9)
Out[31]:
5
In [32]:
ytrain = onehotbatch(labels, 0:9);
In [33]:
test_imgs = MNIST.images(:test)
test_labels = MNIST.labels(:test)
length(test_labels)
Out[33]:
10000
In [34]:
Xtest = reduce(hcat, 
    [reshape(Float32.(test_imgs[i]),:) for i = 1:length(test_imgs)])
ytest = onehotbatch(test_labels, 0:9);

We appeal to multinomial logistic regression to learn a classifier for the MNIST data. Here the model takes the form of an affine map from $\mathbb{R}^{784}$ to $\mathbb{R}^{10}$ (where $784 = 28^2$ is the size of the images in vector form and $10$ is the dimension of the one-hot encoding of the labels) composed with the softmax function which returns a probability distribution over the $10$ labels. The loss function is the cross-entropy, as we explained above.

In [36]:
m = Chain(
    Dense(28^2, 10), 
    softmax
)
Out[36]:
Chain(Dense(784, 10), softmax)
In [37]:
m(Xtrain[:,1])
Out[37]:
10-element Array{Float32,1}:
 0.11419689
 0.08392158
 0.1407485
 0.032359276
 0.11348817
 0.10049408
 0.1500063
 0.053701017
 0.12868376
 0.08240051
In [38]:
accuracy(x, y) = mean(onecold(m(x), 0:9) .== onecold(y, 0:9));
In [39]:
accuracy(Xtest, ytest)
Out[39]:
0.052
In [40]:
loader = DataLoader(Xtrain, ytrain; batchsize=128, shuffle=true);
In [41]:
loss(x, y) = crossentropy(m(x), y)
ps = params(m)
opt = ADAM()
evalcb = () -> @show(accuracy(Xtest,ytest));
In [42]:
@time train!(loss, ps, ncycle(loader, Int(1e1)), opt, cb = throttle(evalcb, 2))
accuracy(Xtest, ytest) = 0.0608
accuracy(Xtest, ytest) = 0.9254
  8.873735 seconds (18.12 M allocations: 5.078 GiB, 4.46% gc time)
In [43]:
accuracy(Xtest, ytest)
Out[43]:
0.9257
In [44]:
m(Xtest[:,1])
Out[44]:
10-element Array{Float32,1}:
 1.4131784f-5
 2.004657f-10
 2.7680435f-5
 0.00446238
 8.006697f-7
 2.7219783f-5
 1.4060293f-9
 0.9950832
 2.7037542f-5
 0.00035753092
In [45]:
onecold(ans, 0:9)
Out[45]:
7
In [46]:
onecold(ytest[:,1], 0:9)
Out[46]:
7