*Course:* Math 535 - Mathematical Methods in Data Science (MMiDS)

*Author:* Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison

*Updated:* Nov 12, 2020

*Copyright:* © 2020 Sebastien Roch

Today's paper shows that it is possible to implement John Von Neumann's claim: "With 4 parameters I can fit an elephant, and with 5 I can make him wiggle his trunk"

— Fermat's Library (@fermatslibrary) February 20, 2018

Paper here: https://t.co/SvVrLuRFNy pic.twitter.com/VG37439vE7

In this optional notebook, we illustrate the use of automatic differentiation on multiclass classification with convolutional neural networks. We will not expand on the concepts required here. Review [Wri, Section 2.11] first, and then see the following references for background:

*Convolutional neural networks:* See [Bis, Sections 5.1-2, 5.3.1-2, 5.5.6-7] and this module from Stanford's CS231n.

*Flux.jl:* See the documentation for the Flux.jl package.

We have already used automatic differentiation and Flux.jl in previous notebooks. We introduce more advanced features here.

We will use the MNIST dataset. Quoting Wikipedia:

The MNIST database (Modified National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems. The database is also widely used for training and testing in the field of machine learning. It was created by "re-mixing" the samples from NIST's original datasets. The creators felt that since NIST's training dataset was taken from American Census Bureau employees, while the testing dataset was taken from American high school students, it was not well-suited for machine learning experiments. Furthermore, the black and white images from NIST were normalized to fit into a 28x28 pixel bounding box and anti-aliased, which introduced grayscale levels. The MNIST database contains 60,000 training images and 10,000 testing images. Half of the training set and half of the test set were taken from NIST's training dataset, while the other half of the training set and the other half of the test set were taken from NIST's testing dataset.

Here is a sample of the images:

(Source)

We first load the data and convert it to an appropriate matrix representation. The data can be accessed with `Flux.Data.MNIST`.

In [1]:

```
#Julia version: 1.5.1
ENV["JULIA_CUDA_SILENT"] = true # silences warning about GPUs
using CSV, DataFrames, GLM, Statistics, Images, QuartzImageIO
using Flux, Flux.Data.MNIST, Flux.Data.FashionMNIST
using Flux: mse, train!, Data.DataLoader, throttle
using Flux: onehot, onehotbatch, onecold, crossentropy
using IterTools: ncycle
```

In [2]:

```
imgs = MNIST.images()
labels = MNIST.labels()
length(imgs)
```

Out[2]:

For example, the first image and its label are:

In [3]:

```
imgs[1]
```

Out[3]:

In [4]:

```
labels[1]
```

Out[4]:

We first transform the images into vectors using `reshape`.

In [5]:

```
?reshape
```

Out[5]:

In [6]:

```
reshape(Float32.(imgs[1]),:)
```

Out[6]:

Using a list comprehension and `reduce` with `hcat`, we do this for every image in the dataset.

In [7]:

```
Xtrain = reduce(hcat, [reshape(Float32.(imgs[i]),:) for i = 1:length(imgs)]);
```

We can get back the original images by using `reshape` again.

In [8]:

```
Gray.(reshape(Xtrain[:,1],(28,28)))
```

Out[8]:
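
The flatten-and-stack step, and its inverse, can be tried on a toy example without loading MNIST. The sketch below uses made-up $2 \times 3$ arrays (the names `toy_imgs`, `X` and `restored` are illustrative and do not appear in the notebook) and relies only on base Julia's column-major `reshape`.

```julia
# Toy stand-ins for images: three 2x3 matrices (hypothetical data).
toy_imgs = [Float32.(reshape(1:6, 2, 3)) .+ 10f0 * k for k in 0:2]

# Flatten each matrix into a length-6 vector (column-major order),
# then stack the vectors as the columns of a 6x3 matrix.
X = reduce(hcat, [reshape(img, :) for img in toy_imgs])

# Reshaping a column back to 2x3 recovers the original matrix.
restored = reshape(X[:, 2], 2, 3)

println(size(X))                  # (6, 3)
println(restored == toy_imgs[2])  # true
```

With the real data the same pattern gives one column of length $28 \times 28 = 784$ per image.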

We also convert the labels into vectors. We use one-hot encoding, that is, we convert the label `0` to the standard basis vector $\mathbf{e}_1 \in \mathbb{R}^{10}$, the label `1` to $\mathbf{e}_2 \in \mathbb{R}^{10}$, and so on. The functions `onehot` and `onehotbatch` perform this transformation, while `onecold` undoes it.

In [9]:

```
?onehotbatch
```

Out[9]:

In [10]:

```
?onecold
```

Out[10]:

For example, on the first label we get:

In [11]:

```
onehot(labels[1], 0:9)
```

Out[11]:

In [12]:

```
onecold(ans, 0:9)
```

Out[12]:

We do this for all labels simultaneously.

In [13]:

```
ytrain = onehotbatch(labels, 0:9);
```
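
To make the encoding concrete, here is a hand-rolled version in base Julia. The helpers `my_onehot` and `my_onecold` are hypothetical names introduced for illustration; they are not how Flux implements `onehot` and `onecold`, which return specialized array types.

```julia
# Hand-rolled one-hot encoding (illustrative only): a Bool vector with a
# single `true` at the position of `label` within `classes`.
my_onehot(label, classes) = [label == c for c in classes]

# Inverse map: return the class at the position of the largest entry.
my_onecold(v, classes) = classes[argmax(v)]

classes = 0:9
e = my_onehot(5, classes)         # label 5 maps to the 6th basis vector
println(findfirst(e))             # 6
println(my_onecold(e, classes))   # 5

# Batch version: one column per label.
Y = reduce(hcat, [my_onehot(l, classes) for l in [5, 0, 4]])
println(size(Y))                  # (10, 3)
```

Because `my_onecold` takes an `argmax`, it can also be applied to a vector of predicted probabilities to read off the most likely label.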

We will also use a test dataset provided in MNIST to assess the accuracy of our classifiers. We perform the same transformation.

In [14]:

```
test_imgs = MNIST.images(:test)
test_labels = MNIST.labels(:test)
length(test_labels)
```

Out[14]:

In [15]:

```
Xtest = reduce(hcat,
    [reshape(Float32.(test_imgs[i]),:) for i = 1:length(test_imgs)])
ytest = onehotbatch(test_labels, 0:9);
```

Finally, we consider a class of neural networks tailored for image processing: convolutional neural networks (CNNs). From Wikipedia:

In deep learning, a convolutional neural network (CNN, or ConvNet) is a class of deep neural networks, most commonly applied to analyzing visual imagery. They are also known as shift invariant or space invariant artificial neural networks (SIANN), based on their shared-weights architecture and translation invariance characteristics.

More background can be found in this excellent module from Stanford's CS231n.

We will use CNNs on the MNIST dataset. What follows is based on Flux's model zoo. Our CNN will be a composition of convolutional layers and pooling layers.

In [16]:

```
?Conv
```

Out[16]:

In [17]:

```
?MaxPool
```

Out[17]:
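
As a rough illustration of what a $2 \times 2$ max-pooling layer computes on a single channel, here is a plain-Julia sketch. The helper `maxpool2x2` is a hypothetical name; it ignores channels, batching and Flux's implementation details, and it assumes non-overlapping $2 \times 2$ windows with any trailing odd row or column dropped.

```julia
# Max pooling over non-overlapping 2x2 windows of a matrix.
# A trailing odd row or column is dropped (integer division by 2).
function maxpool2x2(A::AbstractMatrix)
    h, w = size(A) .÷ 2
    return [maximum(A[2i-1:2i, 2j-1:2j]) for i in 1:h, j in 1:w]
end

A = reshape(collect(1:16), 4, 4)   # columns are 1:4, 5:8, 9:12, 13:16
println(maxpool2x2(A))             # [6 14; 8 16]

# A 7x7 input is reduced to 3x3 under this convention.
println(size(maxpool2x2(rand(7, 7))))   # (3, 3)
```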

In [18]:

```
m = Chain(
    # First convolution, operating upon a 28x28 image
    Conv((3, 3), 1=>16, pad=(1,1), relu),
    MaxPool((2,2)),
    # Second convolution, operating upon a 14x14 image
    Conv((3, 3), 16=>32, pad=(1,1), relu),
    MaxPool((2,2)),
    # Third convolution, operating upon a 7x7 image
    Conv((3, 3), 32=>32, pad=(1,1), relu),
    MaxPool((2,2)),
    # Reshape the 4d tensor into a 2d one; at this point it should be (3, 3, 32, N),
    # which is where we get the 288 = 3*3*32 in the `Dense` layer below:
    x -> reshape(x, :, size(x, 4)),
    Dense(288, 10),
    # Finally, softmax to get nice probabilities
    softmax,
);
```
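
The image sizes quoted in the comments of this model, and the 288 inputs of the `Dense` layer, can be checked with a little arithmetic. The sketch below uses the standard output-size formula $\lfloor (n + 2p - k)/s \rfloor + 1$ for an input of length $n$, window $k$, padding $p$ and stride $s$. The helper names `out_size` and `trace_shapes` are hypothetical, and the sketch assumes that the pooling stride equals the window size.

```julia
# Output length along one spatial dimension:
# floor((n + 2*pad - k) / stride) + 1.
out_size(n, k; pad=0, stride=1) = fld(n + 2pad - k, stride) + 1

# Follow one spatial dimension through three conv + pool stages.
function trace_shapes(n)
    sizes = [n]
    for _ in 1:3
        n = out_size(n, 3; pad=1)      # 3x3 convolution with pad 1 keeps n
        n = out_size(n, 2; stride=2)   # 2x2 max pooling halves n (rounding down)
        push!(sizes, n)
    end
    return sizes
end

sizes = trace_shapes(28)
println(sizes)               # [28, 14, 7, 3]
println(sizes[end]^2 * 32)   # 288
```

The last line multiplies the final $3 \times 3$ spatial size by the 32 output channels of the third convolution.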

One complication is that the convolutional layers take as input a tensor, that is, a multidimensional array. So the first step is to convert the images in the dataset into $4d$-arrays in WHCN order (width, height, #channels, batch size). Here the number of channels is $1$ for grayscale and the batch size is $1$ for a single image. We will use `DataLoader` as before to create larger mini-batches.

We use `reshape` to make a $4d$-array.

In [19]:

```
reshape(Float32.(imgs[1]), 28, 28, 1, 1)
```

Out[19]:

Then applying our model outputs a probability distribution over $10$ labels as before.

In [20]:

```
m(ans)
```

Out[20]:
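
The output is a probability distribution because the last layer of the model is a softmax. A minimal sketch of that map in base Julia follows; `my_softmax` is a hypothetical helper written for illustration, not Flux's implementation.

```julia
# Numerically stable softmax of a vector (illustrative sketch).
function my_softmax(z)
    w = exp.(z .- maximum(z))   # subtracting the max avoids overflow
    return w ./ sum(w)
end

p = my_softmax([2.0, 1.0, 0.1])
println(sum(p) ≈ 1.0)   # true
println(argmax(p))      # 1
```

The entries are positive and sum to one, and the largest input receives the largest probability.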

We concatenate the images into a large $4d$ tensor where the last dimension is for the samples. Here we cannot use `hcat`, as we are concatenating tensors rather than vectors. Instead we pre-allocate the tensor and then assign the images as we scan the last dimension.

In [21]:

```
train_tensor_imgs = zeros(Float32, 28, 28, 1, length(labels))
for i in 1:length(labels)
    train_tensor_imgs[:, :, :, i] = reshape(Float32.(imgs[i]), 28, 28, 1, 1)
end
train_onehot_labels = ytrain;
```
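
The pre-allocation pattern can be checked on a toy example. The sketch below uses three made-up $4 \times 4$ arrays (the names `toy_imgs`, `T` and `T_cat` are illustrative) and compares the result with base Julia's `cat` along the fourth dimension, which is one possible alternative to the loop.

```julia
# Three toy 4x4 "images" (hypothetical data standing in for MNIST digits).
toy_imgs = [fill(Float32(k), 4, 4) for k in 1:3]

# Pre-allocate a WHCN tensor and fill it one sample at a time.
T = zeros(Float32, 4, 4, 1, length(toy_imgs))
for (i, img) in enumerate(toy_imgs)
    T[:, :, 1, i] = img
end

# Alternative: concatenate along the 4th dimension with `cat`.
T_cat = cat(toy_imgs...; dims=4)

println(size(T))      # (4, 4, 1, 3)
println(T == T_cat)   # true
```

Splatting a very long list of arrays into `cat` may be slow, which is one reason to prefer pre-allocation on a dataset of this size.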

For example, the first image is encoded as:

In [22]:

```
train_tensor_imgs[:,:,:,1:1]
```

Out[22]:

We do the same transformation on the test dataset.

In [23]:

```
test_tensor_imgs = zeros(Float32, 28, 28, 1, length(test_labels))
for i in 1:length(test_labels)
    test_tensor_imgs[:, :, :, i] = reshape(Float32.(test_imgs[i]), 28, 28, 1, 1)
end
test_onehot_labels = ytest;
```

We now use `DataLoader` to create mini-batches and set the parameters for the optimizer.

In [24]:

```
loader = DataLoader(train_tensor_imgs, train_onehot_labels;
    batchsize=128, shuffle=true);
```
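
Conceptually, shuffled mini-batching amounts to permuting the sample indices and cutting them into chunks. The sketch below shows this in base Julia for 60,000 samples and a batch size of 128. It is not the `DataLoader` implementation, the names `perm` and `batches` are illustrative, and it assumes the final, smaller batch is kept.

```julia
using Random

n, batchsize = 60_000, 128

# Shuffle the sample indices, then cut them into consecutive chunks.
perm = randperm(MersenneTwister(0), n)
batches = collect(Iterators.partition(perm, batchsize))

println(length(batches))        # 469
println(length(batches[end]))   # 96
```

Given an index chunk `idx`, the corresponding mini-batch would be selected along the last dimension, for instance `train_tensor_imgs[:, :, :, idx]` for the images.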

In [25]:

```
accuracy(x, y) = mean(onecold(m(x), 0:9) .== onecold(y, 0:9))
loss(x, y) = crossentropy(m(x), y)
ps = params(m)
opt = ADAM()
evalcb = () -> @show(accuracy(test_tensor_imgs, test_onehot_labels));
```
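
To see what the loss and accuracy measure, here is a small worked example in base Julia with 3 classes and 2 samples. The helpers `my_crossentropy` and `my_accuracy` are hypothetical stand-ins written for illustration; they assume the loss is averaged over the samples (columns) of the batch.

```julia
using Statistics

# Columns are samples, rows are classes.
p = [0.7 0.2;
     0.2 0.5;
     0.1 0.3]   # predicted probabilities (each column sums to 1)
y = [1 0;
     0 0;
     0 1]       # one-hot true labels: class 1, then class 3

# Average over samples of -sum_k y_k * log(p_k).
my_crossentropy(p, y) = -sum(y .* log.(p)) / size(y, 2)

# Fraction of columns where predicted and true argmax agree.
my_accuracy(p, y) =
    mean([argmax(a) == argmax(b) for (a, b) in zip(eachcol(p), eachcol(y))])

println(my_crossentropy(p, y) ≈ -(log(0.7) + log(0.3)) / 2)   # true
println(my_accuracy(p, y))                                    # 0.5
```

Only the probability assigned to the true class enters the loss. A uniform prediction over $10$ classes would give a loss of $\log 10 \approx 2.30$.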

As before, the initial accuracy of the network, with random weights, is close to $10\%$.

In [26]:

```
accuracy(test_tensor_imgs,test_onehot_labels)
```

Out[26]:

We train for $10$ epochs, split into a first run of $1$ epoch and a second run of $9$ epochs. On my computer it takes about 10 minutes.

In [27]:

```
@time train!(loss, ps, ncycle(loader, 1), opt, cb = throttle(evalcb, 60))
```

In [28]:

```
@time train!(loss, ps, ncycle(loader, 9), opt, cb = throttle(evalcb, 60))
```
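
For intuition about the optimizer, the ADAM update can be written out by hand and applied to a toy quadratic. This is only a sketch: `adam_step!` is a hypothetical helper, the objective is made up, and the step size `eta = 0.05` is chosen for this toy problem (it is not the default used by `ADAM()`). In training, the same kind of update would be applied to the network parameters once per mini-batch, with the gradient supplied by automatic differentiation.

```julia
# One ADAM update, written out by hand (illustrative sketch).
function adam_step!(theta, g, m, v, t; eta=0.05, b1=0.9, b2=0.999, eps=1e-8)
    @. m = b1 * m + (1 - b1) * g      # running mean of gradients
    @. v = b2 * v + (1 - b2) * g^2    # running mean of squared gradients
    mhat = m ./ (1 - b1^t)            # bias corrections
    vhat = v ./ (1 - b2^t)
    @. theta -= eta * mhat / (sqrt(vhat) + eps)
    return theta
end

# Toy objective f(theta) = ||theta - target||^2, with gradient 2(theta - target).
target = [1.0, -2.0]
theta = zeros(2)
m, v = zeros(2), zeros(2)
for t in 1:2000
    g = 2 .* (theta .- target)
    adam_step!(theta, g, m, v, t)
end

println(maximum(abs.(theta .- target)) < 1e-3)   # true
```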

The final accuracy is significantly higher:

In [29]:

```
@time accuracy(test_tensor_imgs, test_onehot_labels)
```

Out[29]:

In [30]:

```
new_tensor_img = reshape(Float32.(test_imgs[1]), 28, 28, 1, 1)
onecold(m(new_tensor_img), 0:9)
```

Out[30]:

In [31]:

```
onecold(ytest[:,1], 0:9)
```

Out[31]:

Finally, we test CNNs on the Fashion MNIST dataset. Quoting Kaggle:

Fashion-MNIST is a dataset of Zalando's article images—consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. [...] Each training and test example is assigned to one of the following labels: 0. T-shirt/top, 1. Trouser, 2. Pullover, 3. Dress, 4. Coat, 5. Sandal, 6. Shirt, 7. Sneaker, 8. Bag, 9. Ankle boot.
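
Since the labels are the integers $0$ to $9$, a small lookup table makes predictions easier to read. The sketch below is an illustration in base Julia; the names `class_names` and `label_name` are hypothetical, and the class names are copied from the description above.

```julia
# Class names from the Kaggle description, indexed by label + 1
# because Julia arrays are 1-based while the labels run from 0 to 9.
class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

label_name(label::Integer) = class_names[label + 1]

println(label_name(0))   # T-shirt/top
println(label_name(9))   # Ankle boot
```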

Here are sample images:

(Source)

The data can be accessed from `Flux.Data.FashionMNIST`. As we did before, we load the training and test data, convert it to tensor form and create mini-batches. We also use the same CNN architecture and run ADAM on it.

In [32]:

```
imgs = FashionMNIST.images()
labels = FashionMNIST.labels()
length(imgs)
```

Out[32]:

In [33]:

```
train_tensor_imgs = zeros(Float32, 28, 28, 1, length(labels))
for i in 1:length(labels)
    train_tensor_imgs[:, :, :, i] = reshape(Float32.(imgs[i]), 28, 28, 1, 1)
end
train_onehot_labels = onehotbatch(labels, 0:9);
```

In [34]:

```
test_imgs = FashionMNIST.images(:test)
test_labels = FashionMNIST.labels(:test)
length(test_imgs)
```

Out[34]:

In [35]:

```
test_tensor_imgs = zeros(Float32, 28, 28, 1, length(test_labels))
for i in 1:length(test_labels)
    test_tensor_imgs[:, :, :, i] = reshape(Float32.(test_imgs[i]), 28, 28, 1, 1)
end
test_onehot_labels = onehotbatch(test_labels, 0:9);
```

In [36]:

```
m = Chain(
    # First convolution, operating upon a 28x28 image
    Conv((3, 3), 1=>16, pad=(1,1), relu),
    MaxPool((2,2)),
    # Second convolution, operating upon a 14x14 image
    Conv((3, 3), 16=>32, pad=(1,1), relu),
    MaxPool((2,2)),
    # Third convolution, operating upon a 7x7 image
    Conv((3, 3), 32=>32, pad=(1,1), relu),
    MaxPool((2,2)),
    # Reshape the 4d tensor into a 2d one; at this point it should be (3, 3, 32, N),
    # which is where we get the 288 = 3*3*32 in the `Dense` layer below:
    x -> reshape(x, :, size(x, 4)),
    Dense(288, 10),
    # Finally, softmax to get nice probabilities
    softmax,
);
```

In [37]:

```
accuracy(x, y) = mean(onecold(m(x), 0:9) .== onecold(y, 0:9))
loss(x, y) = crossentropy(m(x), y)
ps = params(m)
opt = ADAM()
evalcb = () -> @show(accuracy(test_tensor_imgs, test_onehot_labels));
```

In [38]:

```
accuracy(test_tensor_imgs,test_onehot_labels)
```

Out[38]:

In [39]:

```
loader = DataLoader(train_tensor_imgs, train_onehot_labels;
    batchsize=128, shuffle=true);
```

In [40]:

```
@time train!(loss, ps, ncycle(loader, 1), opt, cb = throttle(evalcb, 60))
```

In [41]:

```
@time train!(loss, ps, ncycle(loader, 9), opt, cb = throttle(evalcb, 60))
```

The final accuracy is:

In [42]:

```
@time accuracy(test_tensor_imgs, test_onehot_labels)
```

Out[42]: