TUTORIAL 3b

Convolutional neural networks [optional]


Course: Math 535 - Mathematical Methods in Data Science (MMiDS)
Author: Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison
Updated: Nov 12, 2020
Copyright: © 2020 Sebastien Roch


Recap and further background

In this optional notebook, we illustrate the use of automatic differentiation for multiclass classification with convolutional neural networks. We do not develop the underlying concepts in detail here. Review [Wri, Section 2.11] first, and then see the following references for background:

  1. Convolutional neural networks: See [Bis, Sections 5.1-2, 5.3.1-2, 5.5.6-7] and this module from Stanford's CS231n.

  2. Flux.jl: See the documentation for the Flux.jl package.

We have already used automatic differentiation and Flux.jl in previous notebooks. We introduce more advanced features here.

1 MNIST dataset

We will use the MNIST dataset. Quoting Wikipedia:

The MNIST database (Modified National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems. The database is also widely used for training and testing in the field of machine learning. It was created by "re-mixing" the samples from NIST's original datasets. The creators felt that since NIST's training dataset was taken from American Census Bureau employees, while the testing dataset was taken from American high school students, it was not well-suited for machine learning experiments. Furthermore, the black and white images from NIST were normalized to fit into a 28x28 pixel bounding box and anti-aliased, which introduced grayscale levels. The MNIST database contains 60,000 training images and 10,000 testing images. Half of the training set and half of the test set were taken from NIST's training dataset, while the other half of the training set and the other half of the test set were taken from NIST's testing dataset.

Here is a sample of the images:

MNIST sample images

(Source)

We first load the data and convert it to an appropriate matrix representation. The data can be accessed with Flux.Data.MNIST.

In [1]:
#Julia version: 1.5.1
ENV["JULIA_CUDA_SILENT"] = true # silences warning about GPUs

using CSV, DataFrames, GLM, Statistics, Images, QuartzImageIO
using Flux, Flux.Data.MNIST, Flux.Data.FashionMNIST
using Flux: mse, train!, Data.DataLoader, throttle
using Flux: onehot, onehotbatch, onecold, crossentropy
using IterTools: ncycle
In [2]:
imgs = MNIST.images()
labels = MNIST.labels()
length(imgs)
Out[2]:
60000

For example, the first image and its label are:

In [3]:
imgs[1]
Out[3]:
In [4]:
labels[1]
Out[4]:
5

We first transform the images into vectors using reshape.

In [5]:
?reshape
search: reshape promote_shape

Out[5]:
reshape(A, dims...) -> AbstractArray
reshape(A, dims) -> AbstractArray

Return an array with the same data as A, but with different dimension sizes or number of dimensions. The two arrays share the same underlying data, so that the result is mutable if and only if A is mutable, and setting elements of one alters the values of the other.

The new dimensions may be specified either as a list of arguments or as a shape tuple. At most one dimension may be specified with a :, in which case its length is computed such that its product with all the specified dimensions is equal to the length of the original array A. The total number of elements must not change.

Examples

julia> A = Vector(1:16)
16-element Array{Int64,1}:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16

julia> reshape(A, (4, 4))
4×4 Array{Int64,2}:
 1  5   9  13
 2  6  10  14
 3  7  11  15
 4  8  12  16

julia> reshape(A, 2, :)
2×8 Array{Int64,2}:
 1  3  5  7   9  11  13  15
 2  4  6  8  10  12  14  16

julia> reshape(1:6, 2, 3)
2×3 reshape(::UnitRange{Int64}, 2, 3) with eltype Int64:
 1  3  5
 2  4  6
In [6]:
reshape(Float32.(imgs[1]),:)
Out[6]:
784-element Array{Float32,1}:
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 ⋮
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0

Using a list comprehension and reduce with hcat, we do this for every image in the dataset.

In [7]:
Xtrain = reduce(hcat, [reshape(Float32.(imgs[i]),:) for i = 1:length(imgs)]);

We can get back the original images by using reshape again.

In [8]:
Gray.(reshape(Xtrain[:,1],(28,28)))
Out[8]:

We also convert the labels into vectors. We use one-hot encoding, that is, we convert the label 0 to the standard basis vector $\mathbf{e}_1 \in \mathbb{R}^{10}$, the label 1 to $\mathbf{e}_2 \in \mathbb{R}^{10}$, and so on. The functions onehot and onehotbatch perform this transformation, while onecold undoes it.

In [9]:
?onehotbatch
search:

Out[9]:
onehotbatch(ls, labels[, unk...])

Return a OneHotMatrix where the kth column of the matrix is onehot(ls[k], labels).

If one of the input labels ls is not found in labels and unk is given, return onehot(unk, labels); otherwise the function will raise an error.

Examples

julia> Flux.onehotbatch([:b, :a, :b], [:a, :b, :c])
3×3 Flux.OneHotMatrix{Array{Flux.OneHotVector,1}}:
 0  1  0
 1  0  1
 0  0  0
In [10]:
?onecold
search: SkipConnection ExponentialBackOff component_centroids

Out[10]:
onecold(y[, labels = 1:length(y)])

Inverse operations of onehot.

Examples

julia> Flux.onecold([true, false, false], [:a, :b, :c])
:a

julia> Flux.onecold([0.3, 0.2, 0.5], [:a, :b, :c])
:c

For example, on the first label we get:

In [11]:
onehot(labels[1], 0:9)
Out[11]:
10-element Flux.OneHotVector:
 0
 0
 0
 0
 0
 1
 0
 0
 0
 0
In [12]:
onecold(ans, 0:9)
Out[12]:
5

We do this for all labels simultaneously.

In [13]:
ytrain = onehotbatch(labels, 0:9);

We will also use a test dataset provided in MNIST to assess the accuracy of our classifiers. We perform the same transformation.

In [14]:
test_imgs = MNIST.images(:test)
test_labels = MNIST.labels(:test)
length(test_labels)
Out[14]:
10000
In [15]:
Xtest = reduce(hcat,
    [reshape(Float32.(test_imgs[i]),:) for i = 1:length(test_imgs)])
ytest = onehotbatch(test_labels, 0:9);
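
As a quick sanity check (a sketch, not part of the original notebook), we can confirm the dimensions of the flattened matrices: each image becomes a column of length $28 \times 28 = 784$ and each label a one-hot column of length $10$.

# Expected: (784, 60000), (10, 60000), (784, 10000), (10, 10000)
size(Xtrain), size(ytrain), size(Xtest), size(ytest)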

2 Convolutional neural networks

We now consider a class of neural networks tailored for image processing, convolutional neural networks (CNNs). From Wikipedia:

In deep learning, a convolutional neural network (CNN, or ConvNet) is a class of deep neural networks, most commonly applied to analyzing visual imagery. They are also known as shift invariant or space invariant artificial neural networks (SIANN), based on their shared-weights architecture and translation invariance characteristics.

More background can be found in this excellent module from Stanford's CS231n.

We will use CNNs on the MNIST dataset. What follows is based on Flux's model zoo. Our CNN will be a composition of convolutional layers and pooling layers.

In [16]:
?Conv
search: Conv conv conv! convert ConvDims convexhull ConvTranspose conv_bias_act

Out[16]:
Conv(filter, in => out, σ = identity; init = glorot_uniform,
     stride = 1, pad = 0, dilation = 1)

filter = (2,2)
in = 1
out = 16
Conv((2, 2), 1=>16, relu)

Standard convolutional layer. filter should be a tuple like (2, 2). in and out specify the number of input and output channels respectively.

Data should be stored in WHCN order (width, height, # channels, batch size). In other words, a 100×100 RGB image would be a 100×100×3×1 array, and a batch of 50 would be a 100×100×3×50 array.

Accepts keyword arguments weight and bias to set the corresponding fields. Setting bias to Flux.Zeros() will switch bias off for the layer.

Takes the keyword arguments pad, stride and dilation. For input dimension N, pad should be a single Integer indicating equal padding value for each spatial dimension, a tuple of length (N-2) to apply symmetric padding or a tuple of length 2*(N-2) indicating padding values for each spatial dimension at both the ends. stride and dilation should be a single Integer or a tuple with N-2 parameters. Use pad=SamePad() to apply padding so that outputsize == inputsize / stride.

Examples

Apply a Conv layer to a 1-channel input using a 2×2 window filter size, giving us a 16-channel output. Output is activated with ReLU.

filter = (2,2)
in = 1
out = 16
Conv(filter, in => out, relu)

Conv(weight::AbstractArray, bias::AbstractArray)
Conv(weight::AbstractArray, bias::AbstractArray, activation)

Constructs the convolutional layer with user defined weight and bias arrays.

Setting bias to Flux.Zeros() would switch bias off for the layer.

Takes the keyword arguments pad, stride and dilation. For input dimension N, pad should be a single Integer indicating equal padding value for each spatial dimension, a tuple of length (N-2) to apply symmetric padding or a tuple of length 2*(N-2) indicating padding values for each spatial dimension at both the ends. stride and dilation should be a single Integer or a tuple with N-2 parameters.

There is also a keyword-only constructor available for all convolutional layers.

weight = rand(Float32, 3, 3, 5)
bias = zeros(Float32, 5)
Conv(weight = weight,
    bias = bias,
    σ = sigmoid)
In [17]:
?MaxPool
search: MaxPool maxpool maxpool! ∇maxpool ∇maxpool! GlobalMaxPool

Out[17]:
MaxPool(k; pad = 0, stride = k)

Max pooling layer. k is the size of the window for each dimension of the input.

Use pad=SamePad() to apply padding so that outputsize == inputsize / stride.

In [18]:
m = Chain(
    # First convolution, operating upon a 28x28 image
    Conv((3, 3), 1=>16, pad=(1,1), relu),
    MaxPool((2,2)),

    # Second convolution, operating upon a 14x14 image
    Conv((3, 3), 16=>32, pad=(1,1), relu),
    MaxPool((2,2)),

    # Third convolution, operating upon a 7x7 image
    Conv((3, 3), 32=>32, pad=(1,1), relu),
    MaxPool((2,2)),

    # Reshape 3d tensor into a 2d one, at this point it should be (3, 3, 32, N)
    # which is where we get the 288 in the `Dense` layer below:
    x -> reshape(x, :, size(x, 4)),
    Dense(288, 10),

    # Finally, softmax to get nice probabilities
    softmax,
);
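
The 288 in the Dense layer is the number of entries in the $(3, 3, 32)$ output of the last pooling layer: $3 \times 3 \times 32 = 288$. As a quick sanity check (a sketch, not part of the original notebook, using Chain slicing m[1:6] to keep only the convolution and pooling layers), we can push a dummy image through:

# Dummy 28x28 grayscale image in WHCN order (width, height, # channels, batch size)
dummy = zeros(Float32, 28, 28, 1, 1)
# Shape reaching the Dense layer; expected (3, 3, 32, 1), i.e. 3*3*32 = 288 entries
size(m[1:6](dummy))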

One complication is that the convolutional layers take as input a tensor, that is, a multidimensional array. So the first step is to convert the images in the dataset into $4d$-arrays in WHCN order (width, height, # channels, batch size). Here the number of channels is $1$ for grayscale and the batch size is $1$ for a single image. We will use DataLoader as before to create larger mini-batches.

We use reshape to make a $4d$-array.

In [19]:
reshape(Float32.(imgs[1]), 28, 28, 1, 1)
Out[19]:
28×28×1×1 Array{Float32,4}:
[:, :, 1, 1] =
 0.0  0.0  0.0  0.0  0.0       0.0       …  0.0       0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0       0.0          0.0       0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0       0.0          0.0       0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0       0.0          0.0       0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0       0.0          0.0       0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0       0.0       …  0.498039  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0       0.0          0.25098   0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0       0.0          0.0       0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0       0.0          0.0       0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0       0.0          0.0       0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0       0.0       …  0.0       0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0       0.0          0.0       0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0       0.0          0.0       0.0  0.0  0.0  0.0
 ⋮                             ⋮         ⋱                 ⋮
 0.0  0.0  0.0  0.0  0.0       0.0          0.0       0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0       0.0          0.0       0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0       0.0          0.0       0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0       0.0          0.0       0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0       0.0       …  0.0       0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0       0.0          0.0       0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0       0.0          0.0       0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.215686  0.67451      0.0       0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.533333  0.992157     0.0       0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0       0.0       …  0.0       0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0       0.0          0.0       0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0       0.0          0.0       0.0  0.0  0.0  0.0

Applying our model then outputs a probability distribution over the $10$ labels, as before.

In [20]:
m(ans)
Out[20]:
10×1 Array{Float32,2}:
 0.09954709
 0.094213165
 0.08629797
 0.10479064
 0.12571757
 0.089546174
 0.09167668
 0.12285014
 0.10734907
 0.07801155

We concatenate the images into a large $4d$ tensor where the last dimension is for the samples. Here we cannot use hcat, as we are concatenating tensors rather than vectors. Instead we pre-allocate the tensor and then assign the images as we scan the last dimension.

In [21]:
train_tensor_imgs = zeros(Float32, 28, 28, 1, length(labels))
for i in 1:length(labels)
    train_tensor_imgs[:, :, :, i] = reshape(Float32.(imgs[i]), 28, 28, 1, 1)
end
train_onehot_labels = ytrain;
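
Equivalently (a sketch, assuming the flattened matrix Xtrain from Section 1 is still in scope), the same tensor can be built with a single reshape, since both layouts store the pixels of each image in the same column-major order.

# Reshape the 784x60000 matrix into WHCN order; this should match the loop above entry for entry
alt_tensor_imgs = reshape(Xtrain, 28, 28, 1, :)
alt_tensor_imgs == train_tensor_imgs   # expected: true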

For example, the first image is encoded as:

In [22]:
train_tensor_imgs[:,:,:,1:1]
Out[22]:
28×28×1×1 Array{Float32,4}:
[:, :, 1, 1] =
 0.0  0.0  0.0  0.0  0.0       0.0       …  0.0       0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0       0.0          0.0       0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0       0.0          0.0       0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0       0.0          0.0       0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0       0.0          0.0       0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0       0.0       …  0.498039  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0       0.0          0.25098   0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0       0.0          0.0       0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0       0.0          0.0       0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0       0.0          0.0       0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0       0.0       …  0.0       0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0       0.0          0.0       0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0       0.0          0.0       0.0  0.0  0.0  0.0
 ⋮                             ⋮         ⋱                 ⋮
 0.0  0.0  0.0  0.0  0.0       0.0          0.0       0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0       0.0          0.0       0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0       0.0          0.0       0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0       0.0          0.0       0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0       0.0       …  0.0       0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0       0.0          0.0       0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0       0.0          0.0       0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.215686  0.67451      0.0       0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.533333  0.992157     0.0       0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0       0.0       …  0.0       0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0       0.0          0.0       0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0       0.0          0.0       0.0  0.0  0.0  0.0

We do the same transformation on the test dataset.

In [23]:
test_tensor_imgs = zeros(Float32, 28, 28, 1, length(test_labels))
for i in 1:length(test_labels)
    test_tensor_imgs[:, :, :, i] = reshape(Float32.(test_imgs[i]), 28, 28, 1, 1)
end
test_onehot_labels = ytest;

We now use DataLoader to create mini-batches and set the parameters for the optimizer.

In [24]:
loader = DataLoader(train_tensor_imgs, train_onehot_labels;
    batchsize=128, shuffle=true);
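
To see what the loader produces, we can (as a sketch, not part of the original notebook) look at the first mini-batch; iterating over a DataLoader built from two arrays yields (input, label) tuples.

# First mini-batch: expected a 28x28x1x128 image tensor and a 10x128 one-hot label matrix
(xb, yb) = first(loader)
size(xb), size(yb)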
In [25]:
accuracy(x, y) = mean(onecold(m(x), 0:9) .== onecold(y, 0:9)) # fraction of correct predictions
loss(x, y) = crossentropy(m(x), y) # cross-entropy between predicted probabilities and one-hot labels
ps = params(m) # trainable parameters of the model
opt = ADAM() # ADAM optimizer with default settings
evalcb = () -> @show(accuracy(test_tensor_imgs, test_onehot_labels)); # callback reporting test accuracy

As before, the initial accuracy of the network, with random weights, is close to $10\%$.

In [26]:
accuracy(test_tensor_imgs,test_onehot_labels)
Out[26]:
0.1227

We train for $10$ epochs in total: ncycle(loader, k) repeats the mini-batch loader $k$ times, so the two calls to train! below run $1$ epoch and then $9$ more, while throttle(evalcb, 60) calls the accuracy callback at most once per minute. On my computer, training takes about 10 minutes.

In [27]:
@time train!(loss, ps, ncycle(loader, 1), opt, cb = throttle(evalcb, 60))
accuracy(test_tensor_imgs, test_onehot_labels) = 0.1469
 68.820500 seconds (65.93 M allocations: 30.727 GiB, 6.90% gc time)
In [28]:
@time train!(loss, ps, ncycle(loader, 9), opt, cb = throttle(evalcb, 60))
accuracy(test_tensor_imgs, test_onehot_labels) = 0.9755
accuracy(test_tensor_imgs, test_onehot_labels) = 0.9866
accuracy(test_tensor_imgs, test_onehot_labels) = 0.9886
accuracy(test_tensor_imgs, test_onehot_labels) = 0.9892
accuracy(test_tensor_imgs, test_onehot_labels) = 0.9875
accuracy(test_tensor_imgs, test_onehot_labels) = 0.9897
accuracy(test_tensor_imgs, test_onehot_labels) = 0.9905
373.053069 seconds (18.16 M allocations: 244.312 GiB, 10.95% gc time)

The final accuracy is significantly higher:

In [29]:
@time accuracy(test_tensor_imgs, test_onehot_labels)
  1.818649 seconds (158.78 k allocations: 1.711 GiB, 12.15% gc time)
Out[29]:
0.9911
In [30]:
new_tensor_img = reshape(Float32.(test_imgs[1]), 28, 28, 1, 1)
onecold(m(new_tensor_img), 0:9)
Out[30]:
1-element Array{Int64,1}:
 7
In [31]:
onecold(ytest[:,1], 0:9)
Out[31]:
7

3 Fashion MNIST dataset

Finally, we test CNNs on the Fashion MNIST dataset. Quoting Kaggle:

Fashion-MNIST is a dataset of Zalando's article images—consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. [...] Each training and test example is assigned to one of the following labels: 0. T-shirt/top, 1. Trouser, 2. Pullover, 3. Dress, 4. Coat, 5. Sandal, 6. Shirt, 7. Sneaker, 8. Bag, 9. Ankle boot.

Here are sample images:

Fashion MNIST sample

(Source)

The data can be accessed from Flux.Data.FashionMNIST. As before, we load the training and test data, convert them to tensor form, and create mini-batches. We also use the same CNN architecture and run ADAM on it.

In [32]:
imgs = FashionMNIST.images()
labels = FashionMNIST.labels()
length(imgs)
Out[32]:
60000
In [33]:
train_tensor_imgs = zeros(Float32, 28, 28, 1, length(labels))
for i in 1:length(labels)
    train_tensor_imgs[:, :, :, i] = reshape(Float32.(imgs[i]), 28, 28, 1, 1)
end
train_onehot_labels = onehotbatch(labels, 0:9);
In [34]:
test_imgs = FashionMNIST.images(:test)
test_labels = FashionMNIST.labels(:test)
length(test_imgs)
Out[34]:
10000
In [35]:
test_tensor_imgs = zeros(Float32, 28, 28, 1, length(test_labels))
for i in 1:length(test_labels)
    test_tensor_imgs[:, :, :, i] = reshape(Float32.(test_imgs[i]), 28, 28, 1, 1)
end
test_onehot_labels = onehotbatch(test_labels, 0:9);
In [36]:
m = Chain(
    # First convolution, operating upon a 28x28 image
    Conv((3, 3), 1=>16, pad=(1,1), relu),
    MaxPool((2,2)),

    # Second convolution, operating upon a 14x14 image
    Conv((3, 3), 16=>32, pad=(1,1), relu),
    MaxPool((2,2)),

    # Third convolution, operating upon a 7x7 image
    Conv((3, 3), 32=>32, pad=(1,1), relu),
    MaxPool((2,2)),

    # Reshape 3d tensor into a 2d one, at this point it should be (3, 3, 32, N)
    # which is where we get the 288 in the `Dense` layer below:
    x -> reshape(x, :, size(x, 4)),
    Dense(288, 10),

    # Finally, softmax to get nice probabilities
    softmax,
);
In [37]:
accuracy(x, y) = mean(onecold(m(x), 0:9) .== onecold(y, 0:9))
loss(x, y) = crossentropy(m(x), y)
ps = params(m)
opt = ADAM()
evalcb = () -> @show(accuracy(test_tensor_imgs, test_onehot_labels));
In [38]:
accuracy(test_tensor_imgs,test_onehot_labels)
Out[38]:
0.1
In [39]:
loader = DataLoader(train_tensor_imgs, train_onehot_labels;
    batchsize=128, shuffle=true);
In [40]:
@time train!(loss, ps, ncycle(loader, 1), opt, cb = throttle(evalcb, 60))
accuracy(test_tensor_imgs, test_onehot_labels) = 0.1001
 42.813316 seconds (2.92 M allocations: 27.569 GiB, 9.23% gc time)
In [41]:
@time train!(loss, ps, ncycle(loader, 9), opt, cb = throttle(evalcb, 60))
accuracy(test_tensor_imgs, test_onehot_labels) = 0.8313
accuracy(test_tensor_imgs, test_onehot_labels) = 0.8631
accuracy(test_tensor_imgs, test_onehot_labels) = 0.8788
accuracy(test_tensor_imgs, test_onehot_labels) = 0.8904
accuracy(test_tensor_imgs, test_onehot_labels) = 0.8788
accuracy(test_tensor_imgs, test_onehot_labels) = 0.8931
368.517308 seconds (18.00 M allocations: 242.601 GiB, 7.61% gc time)

The final accuracy is:

In [42]:
@time accuracy(test_tensor_imgs, test_onehot_labels)
  1.699265 seconds (158.78 k allocations: 1.711 GiB, 7.04% gc time)
Out[42]:
0.8984
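
As with MNIST, we can check the model on a single test image; here we also map the numeric label back to its class name using the list from the Kaggle description above (the fashion_names vector below is our own helper, not part of the dataset API). A sketch:

# Class names in label order 0-9, taken from the description quoted above
fashion_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
                 "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

new_tensor_img = reshape(Float32.(test_imgs[1]), 28, 28, 1, 1)
pred = onecold(m(new_tensor_img), 0:9)[1]   # predicted label in 0:9
println("predicted: ", fashion_names[pred + 1],
        ", true: ", fashion_names[test_labels[1] + 1])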