{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# TOPIC 3 \n", "\n", "# Optimality, convexity, and gradient descent\n", "\n", "## 6 Automatic differentiation: examples\n", "\n", "***\n", "*Course:* [Math 535](http://www.math.wisc.edu/~roch/mmids/) - Mathematical Methods in Data Science (MMiDS) \n", "*Author:* [Sebastien Roch](http://www.math.wisc.edu/~roch/), Department of Mathematics, University of Wisconsin-Madison \n", "*Updated:* Nov 12, 2020 \n", "*Copyright:* © 2020 Sebastien Roch\n", "***" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "We give several concrete examples of progressive functions and of the application of the algorithmic differentiation method from the previous section." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### 6.1 Example 1: linear regression" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "We begin with linear regression." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "#### 6.1.1 Computing the gradient\n", "\n", "While we have motivated the framework introduced the previous section from the point of view of classification, it also immediately applies to the regression setting. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Here $y$ is a real-valued outcome variable. 
We revisit the case of linear regression where the loss function is\n", "\n", "$$\n", "\\ell_y(z_2) = (z_2 - y)^2 \n", "$$\n", "\n", "and the regression function has a single layer (that is, $L=1$) with\n", "\n", "$$\n", "h_{\mathbf{x}}(\\mathbf{w})\n", "= g_1(\\mathbf{w}_1;\\mathbf{z}_1)\n", "= \\sum_{j=1}^d w_{1,j} z_{1,j} + w_{1,d+1} \n", "$$\n", "\n", "where $\\mathbf{w}_1 = \\mathbf{w} \\in \\mathbb{R}^{d+1}$ are the parameters and $\\mathbf{z}_1 = \\mathbf{x} \\in \\mathbb{R}^d$ is the input. Hence,\n", "\n", "$$\n", "f_{\\mathbf{x},y}(\\mathbf{w})\n", "= \\ell_y(g_1(\\mathbf{w}_1;\\mathbf{x}))\n", "= \\left(\\sum_{j=1}^d w_{1,j} x_{j} + w_{1,d+1} - y\\right)^2. \n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "It will be convenient to introduce the following notation. For a vector $\\mathbf{z} \\in \\mathbb{R}^d$, the vector $\\mathbf{z}^{;1} \\in \\mathbb{R}^{d+1}$ concatenates a $1$ at the end of $\\mathbf{z}$, that is, $\\mathbf{z}^{;1} = (\\mathbf{z};1)$. For a vector $\\mathbf{w} \\in \\mathbb{R}^{d+1}$, the vector $\\mathbf{w}^{:d} \\in \\mathbb{R}^{d}$ drops the last entry of $\\mathbf{w}$, that is, $\\mathbf{w}^{:d} = (w_1,\\ldots,w_d)$. 
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "The forward pass in this case is:\n", "\n", "*Initialization:* \n", "\n", "$$\\mathbf{z}_1 := \\mathbf{x}$$\n", "\n", "*Forward layer loop:* \n", " \n", "\n", "\\begin{align*}\n", "z_2 &:= g_1(\\mathbf{w}_1;\\mathbf{z}_1) = \\mathbf{w}_1^T \\mathbf{z}_1^{;1}\\\\\n", "\\begin{pmatrix}\n", "A_1 & B_1\n", "\\end{pmatrix}\n", "&:= J_{g_1}(\\mathbf{w}_1;\\mathbf{z}_1) = (\\mathbf{z}_1^{;1};\\mathbf{w}_1^{:d})^T\n", "\\end{align*}\n", "\n", "\n", "*Loss:* \n", "\n", "\n", "\\begin{align*}\n", "z_3\n", "&:= \\ell_{y}(z_2) = (z_2 - y)^2 = (\\mathbf{w}^T \\mathbf{x}^{;1} - y)^2\\\\\n", "q_2\n", "&:= \\frac{\\mathrm{d}}{\\mathrm{d} z_2} {\\ell_{y}}(z_2)\n", "= 2 (z_2 - y).\n", "\\end{align*}\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "The backward pass is:\n", "\n", "*Backward layer loop:*\n", " \n", "\n", "\\begin{align*}\n", "\\mathbf{p}_1 &:= A_1^T q_2 = 2 (z_2 - y) \\,\\mathbf{z}_1^{;1}\n", "\\end{align*}\n", "\n", "\n", "*Output:* \n", "\n", "$$\n", "\\nabla f_{\\mathbf{x}, y}(\\mathbf{w})\n", "= \\mathbf{p}_1\n", "= 2 \\left(\\mathbf{w}^T \\mathbf{x}^{;1} - y\\right) \\,\\mathbf{x}^{;1}.\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "There is in fact no need to compute $B_1$ (and $\\mathbf{q}_1$)." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "When applied to a mini-batch of samples $B \\subseteq [n]$, we compute the average of the gradients over the samples in $B$, that is,\n", "\n", "$$\n", "\\frac{1}{|B|}\n", "\\sum_{i\\in B}\n", "\\nabla f_{\\mathbf{x}_i, y_i}(\\mathbf{w})\n", "= \n", "\\frac{2}{|B|}\n", "\\sum_{i\\in B}\n", "\\left(\\mathbf{w}^T \\mathbf{x}_i^{;1} - y_i\\right) \\,\\mathbf{x}_i^{;1}.\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "#### 6.1.2 Flux\n", "\n", "We will be using [Flux.jl](https://fluxml.ai) to implement the previous method. The Dense function constructs an affine map from in predictor variables to out response variables. It is defined a special way that its parameters, the matrix W and the intercept vector b, can be accessed by m.W and m.b if m is the name given to the function. \n", "\n", "We will also use some other utility functions. The function mse computes the mean squared error (MSE). It will be our loss function in this section. The functions DataLoader and ncycle allow us to construct mini-batches in a straightforward way. Finally thottle is used to print progress messages.\n", "\n", "The help rubric of each of these is below." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "#Julia version: 1.5.1\n", "ENV[\"JULIA_CUDA_SILENT\"] = true # silences warning about GPUs\n", "\n", "using CSV, DataFrames, GLM, Statistics, Images, QuartzImageIO\n", "using Flux, Flux.Data.MNIST, Flux.Data.FashionMNIST\n", "using Flux: mse, train!, Data.DataLoader, throttle\n", "using Flux: onehot, onehotbatch, onecold, crossentropy\n", "using IterTools: ncycle" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "search: \u001b[0m\u001b[1mD\u001b[22m\u001b[0m\u001b[1me\u001b[22m\u001b[0m\u001b[1mn\u001b[22m\u001b[0m\u001b[1ms\u001b[22m\u001b[0m\u001b[1me\u001b[22m \u001b[0m\u001b[1mD\u001b[22m\u001b[0m\u001b[1me\u001b[22m\u001b[0m\u001b[1mn\u001b[22m\u001b[0m\u001b[1ms\u001b[22m\u001b[0m\u001b[1me\u001b[22mArray \u001b[0m\u001b[1mD\u001b[22m\u001b[0m\u001b[1me\u001b[22m\u001b[0m\u001b[1mn\u001b[22m\u001b[0m\u001b[1ms\u001b[22m\u001b[0m\u001b[1me\u001b[22mVector \u001b[0m\u001b[1mD\u001b[22m\u001b[0m\u001b[1me\u001b[22m\u001b[0m\u001b[1mn\u001b[22m\u001b[0m\u001b[1ms\u001b[22m\u001b[0m\u001b[1me\u001b[22mMatrix \u001b[0m\u001b[1mD\u001b[22m\u001b[0m\u001b[1me\u001b[22m\u001b[0m\u001b[1mn\u001b[22m\u001b[0m\u001b[1ms\u001b[22m\u001b[0m\u001b[1me\u001b[22mVecOrMat \u001b[0m\u001b[1mD\u001b[22m\u001b[0m\u001b[1me\u001b[22m\u001b[0m\u001b[1mn\u001b[22m\u001b[0m\u001b[1ms\u001b[22m\u001b[0m\u001b[1me\u001b[22mConvDims\n", "\n" ] }, { "data": { "text/latex": [ "\\begin{verbatim}\n", "Dense(in::Integer, out::Integer, σ = identity)\n", "\\end{verbatim}\n", "Create a traditional \\texttt{Dense} layer with parameters \\texttt{W} and \\texttt{b}.\n", "\n", "\\begin{verbatim}\n", "y = σ.(W * x .+ b)\n", "\\end{verbatim}\n", "The input \\texttt{x} must be a vector of length \\texttt{in}, or a batch of vectors represented as an 
\\texttt{in × N} matrix. The out \\texttt{y} will be a vector or batch of length \\texttt{out}.\n", "\n", "\\section{Example}\n", "\\begin{verbatim}\n", "julia> d = Dense(5, 2)\n", "Dense(5, 2)\n", "\n", "julia> d(rand(5))\n", "2-element Array{Float32,1}:\n", " -0.16210233\n", " 0.123119034\n", "\\end{verbatim}\n" ], "text/markdown": [ "\n", "Dense(in::Integer, out::Integer, σ = identity)\n", "\n", "\n", "Create a traditional Dense layer with parameters W and b.\n", "\n", "\n", "y = σ.(W * x .+ b)\n", "\n", "\n", "The input x must be a vector of length in, or a batch of vectors represented as an in × N matrix. The out y will be a vector or batch of length out.\n", "\n", "# Example\n", "\n", "\n", "julia> d = Dense(5, 2)\n", "Dense(5, 2)\n", "\n", "julia> d(rand(5))\n", "2-element Array{Float32,1}:\n", " -0.16210233\n", " 0.123119034\n", "\n" ], "text/plain": [ "\u001b[36m Dense(in::Integer, out::Integer, σ = identity)\u001b[39m\n", "\n", " Create a traditional \u001b[36mDense\u001b[39m layer with parameters \u001b[36mW\u001b[39m and \u001b[36mb\u001b[39m.\n", "\n", "\u001b[36m y = σ.(W * x .+ b)\u001b[39m\n", "\n", " The input \u001b[36mx\u001b[39m must be a vector of length \u001b[36min\u001b[39m, or a batch of vectors represented\n", " as an \u001b[36min × N\u001b[39m matrix. 
The out \u001b[36my\u001b[39m will be a vector or batch of length \u001b[36mout\u001b[39m.\n", "\n", "\u001b[1m Example\u001b[22m\n", "\u001b[1m ≡≡≡≡≡≡≡≡≡\u001b[22m\n", "\n", "\u001b[36m julia> d = Dense(5, 2)\u001b[39m\n", "\u001b[36m Dense(5, 2)\u001b[39m\n", "\u001b[36m \u001b[39m\n", "\u001b[36m julia> d(rand(5))\u001b[39m\n", "\u001b[36m 2-element Array{Float32,1}:\u001b[39m\n", "\u001b[36m -0.16210233\u001b[39m\n", "\u001b[36m 0.123119034\u001b[39m" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "?Dense" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "search: \u001b[0m\u001b[1mm\u001b[22m\u001b[0m\u001b[1ms\u001b[22m\u001b[0m\u001b[1me\u001b[22m r\u001b[0m\u001b[1mm\u001b[22m\u001b[0m\u001b[1ms\u001b[22m\u001b[0m\u001b[1me\u001b[22m i\u001b[0m\u001b[1mm\u001b[22m\u001b[0m\u001b[1ms\u001b[22mtr\u001b[0m\u001b[1me\u001b[22mtch i\u001b[0m\u001b[1mm\u001b[22mre\u001b[0m\u001b[1ms\u001b[22miz\u001b[0m\u001b[1me\u001b[22m Su\u001b[0m\u001b[1mm\u001b[22m\u001b[0m\u001b[1mS\u001b[22mquar\u001b[0m\u001b[1me\u001b[22mdDifference Co\u001b[0m\u001b[1mm\u001b[22mpo\u001b[0m\u001b[1ms\u001b[22mit\u001b[0m\u001b[1me\u001b[22mException\n", "\n" ] }, { "data": { "text/latex": [ "\\begin{verbatim}\n", "mse(ŷ, y; agg=mean)\n", "\\end{verbatim}\n", "Return the loss corresponding to mean square error: \n", "\n", "\\begin{verbatim}\n", "agg((ŷ .- y).^2)\n", "\\end{verbatim}\n" ], "text/markdown": [ "\n", "mse(ŷ, y; agg=mean)\n", "\n", "\n", "Return the loss corresponding to mean square error: \n", "\n", "\n", "agg((ŷ .- y).^2)\n", "\n" ], "text/plain": [ "\u001b[36m mse(ŷ, y; agg=mean)\u001b[39m\n", "\n", " Return the loss corresponding to mean square error: \n", "\n", "\u001b[36m agg((ŷ .- y).^2)\u001b[39m" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "?mse" ] }, { 
"cell_type": "code", "execution_count": 4, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "search:\n", "\n" ] }, { "data": { "text/latex": [ "\\begin{verbatim}\n", "DataLoader(data; batchsize=1, shuffle=false, partial=true)\n", "\\end{verbatim}\n", "An object that iterates over mini-batches of \\texttt{data}, each mini-batch containing \\texttt{batchsize} observations (except possibly the last one). \n", "\n", "Takes as input a single data tensor, or a tuple (or a named tuple) of tensors. The last dimension in each tensor is considered to be the observation dimension.\n", "\n", "If \\texttt{shuffle=true}, shuffles the observations each time iterations are re-started. If \\texttt{partial=false}, drops the last mini-batch if it is smaller than the batchsize.\n", "\n", "The original data is preserved in the \\texttt{data} field of the DataLoader. \n", "\n", "Usage example:\n", "\n", "\\begin{verbatim}\n", "Xtrain = rand(10, 100)\n", "train_loader = DataLoader(Xtrain, batchsize=2) \n", "# iterate over 50 mini-batches of size 2\n", "for x in train_loader\n", " @assert size(x) == (10, 2)\n", " ...\n", "end\n", "\n", "train_loader.data # original dataset\n", "\n", "# similar, but yielding tuples\n", "train_loader = DataLoader((Xtrain,), batchsize=2) \n", "for (x,) in train_loader\n", " @assert size(x) == (10, 2)\n", " ...\n", "end\n", "\n", "Xtrain = rand(10, 100)\n", "Ytrain = rand(100)\n", "train_loader = DataLoader((Xtrain, Ytrain), batchsize=2, shuffle=true) \n", "for epoch in 1:100\n", " for (x, y) in train_loader\n", " @assert size(x) == (10, 2)\n", " @assert size(y) == (2,)\n", " ...\n", " end\n", "end\n", "\n", "# train for 10 epochs\n", "using IterTools: ncycle \n", "Flux.train!(loss, ps, ncycle(train_loader, 10), opt)\n", "\n", "# can use NamedTuple to name tensors\n", "train_loader = DataLoader((images=Xtrain, labels=Ytrain), batchsize=2, shuffle=true)\n", "for datum in train_loader\n", 
" @assert size(datum.images) == (10, 2)\n", " @assert size(datum.labels) == (2,)\n", "end\n", "\\end{verbatim}\n" ], "text/markdown": [ "\n", "DataLoader(data; batchsize=1, shuffle=false, partial=true)\n", "\n", "\n", "An object that iterates over mini-batches of data, each mini-batch containing batchsize observations (except possibly the last one). \n", "\n", "Takes as input a single data tensor, or a tuple (or a named tuple) of tensors. The last dimension in each tensor is considered to be the observation dimension.\n", "\n", "If shuffle=true, shuffles the observations each time iterations are re-started. If partial=false, drops the last mini-batch if it is smaller than the batchsize.\n", "\n", "The original data is preserved in the data field of the DataLoader. \n", "\n", "Usage example:\n", "\n", "\n", "Xtrain = rand(10, 100)\n", "train_loader = DataLoader(Xtrain, batchsize=2) \n", "# iterate over 50 mini-batches of size 2\n", "for x in train_loader\n", " @assert size(x) == (10, 2)\n", " ...\n", "end\n", "\n", "train_loader.data # original dataset\n", "\n", "# similar, but yielding tuples\n", "train_loader = DataLoader((Xtrain,), batchsize=2) \n", "for (x,) in train_loader\n", " @assert size(x) == (10, 2)\n", " ...\n", "end\n", "\n", "Xtrain = rand(10, 100)\n", "Ytrain = rand(100)\n", "train_loader = DataLoader((Xtrain, Ytrain), batchsize=2, shuffle=true) \n", "for epoch in 1:100\n", " for (x, y) in train_loader\n", " @assert size(x) == (10, 2)\n", " @assert size(y) == (2,)\n", " ...\n", " end\n", "end\n", "\n", "# train for 10 epochs\n", "using IterTools: ncycle \n", "Flux.train!(loss, ps, ncycle(train_loader, 10), opt)\n", "\n", "# can use NamedTuple to name tensors\n", "train_loader = DataLoader((images=Xtrain, labels=Ytrain), batchsize=2, shuffle=true)\n", "for datum in train_loader\n", " @assert size(datum.images) == (10, 2)\n", " @assert size(datum.labels) == (2,)\n", "end\n", "\n" ], "text/plain": [ "\u001b[36m DataLoader(data; batchsize=1, 
shuffle=false, partial=true)\u001b[39m\n", "\n", " An object that iterates over mini-batches of \u001b[36mdata\u001b[39m, each mini-batch\n", " containing \u001b[36mbatchsize\u001b[39m observations (except possibly the last one). \n", "\n", " Takes as input a single data tensor, or a tuple (or a named tuple) of\n", " tensors. The last dimension in each tensor is considered to be the\n", " observation dimension.\n", "\n", " If \u001b[36mshuffle=true\u001b[39m, shuffles the observations each time iterations are\n", " re-started. If \u001b[36mpartial=false\u001b[39m, drops the last mini-batch if it is smaller\n", " than the batchsize.\n", "\n", " The original data is preserved in the \u001b[36mdata\u001b[39m field of the DataLoader. \n", "\n", " Usage example:\n", "\n", "\u001b[36m Xtrain = rand(10, 100)\u001b[39m\n", "\u001b[36m train_loader = DataLoader(Xtrain, batchsize=2) \u001b[39m\n", "\u001b[36m # iterate over 50 mini-batches of size 2\u001b[39m\n", "\u001b[36m for x in train_loader\u001b[39m\n", "\u001b[36m @assert size(x) == (10, 2)\u001b[39m\n", "\u001b[36m ...\u001b[39m\n", "\u001b[36m end\u001b[39m\n", "\u001b[36m \u001b[39m\n", "\u001b[36m train_loader.data # original dataset\u001b[39m\n", "\u001b[36m \u001b[39m\n", "\u001b[36m # similar, but yielding tuples\u001b[39m\n", "\u001b[36m train_loader = DataLoader((Xtrain,), batchsize=2) \u001b[39m\n", "\u001b[36m for (x,) in train_loader\u001b[39m\n", "\u001b[36m @assert size(x) == (10, 2)\u001b[39m\n", "\u001b[36m ...\u001b[39m\n", "\u001b[36m end\u001b[39m\n", "\u001b[36m \u001b[39m\n", "\u001b[36m Xtrain = rand(10, 100)\u001b[39m\n", "\u001b[36m Ytrain = rand(100)\u001b[39m\n", "\u001b[36m train_loader = DataLoader((Xtrain, Ytrain), batchsize=2, shuffle=true) \u001b[39m\n", "\u001b[36m for epoch in 1:100\u001b[39m\n", "\u001b[36m for (x, y) in train_loader\u001b[39m\n", "\u001b[36m @assert size(x) == (10, 2)\u001b[39m\n", "\u001b[36m @assert size(y) == (2,)\u001b[39m\n", "\u001b[36m ...\u001b[39m\n", 
"\u001b[36m end\u001b[39m\n", "\u001b[36m end\u001b[39m\n", "\u001b[36m \u001b[39m\n", "\u001b[36m # train for 10 epochs\u001b[39m\n", "\u001b[36m using IterTools: ncycle \u001b[39m\n", "\u001b[36m Flux.train!(loss, ps, ncycle(train_loader, 10), opt)\u001b[39m\n", "\u001b[36m \u001b[39m\n", "\u001b[36m # can use NamedTuple to name tensors\u001b[39m\n", "\u001b[36m train_loader = DataLoader((images=Xtrain, labels=Ytrain), batchsize=2, shuffle=true)\u001b[39m\n", "\u001b[36m for datum in train_loader\u001b[39m\n", "\u001b[36m @assert size(datum.images) == (10, 2)\u001b[39m\n", "\u001b[36m @assert size(datum.labels) == (2,)\u001b[39m\n", "\u001b[36m end\u001b[39m" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "?DataLoader" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "search:\n", "\n" ] }, { "data": { "text/latex": [ "\\begin{verbatim}\n", "ncycle(iter, n)\n", "\\end{verbatim}\n", "Cycle through \\texttt{iter} \\texttt{n} times.\n", "\n", "\\begin{verbatim}\n", "julia> for i in ncycle(1:3, 2)\n", " @show i\n", " end\n", "i = 1\n", "i = 2\n", "i = 3\n", "i = 1\n", "i = 2\n", "i = 3\n", "\\end{verbatim}\n" ], "text/markdown": [ "\n", "ncycle(iter, n)\n", "\n", "\n", "Cycle through iter n times.\n", "\n", "jldoctest\n", "julia> for i in ncycle(1:3, 2)\n", " @show i\n", " end\n", "i = 1\n", "i = 2\n", "i = 3\n", "i = 1\n", "i = 2\n", "i = 3\n", "\n" ], "text/plain": [ "\u001b[36m ncycle(iter, n)\u001b[39m\n", "\n", " Cycle through \u001b[36miter\u001b[39m \u001b[36mn\u001b[39m times.\n", "\n", "\u001b[36m julia> for i in ncycle(1:3, 2)\u001b[39m\n", "\u001b[36m @show i\u001b[39m\n", "\u001b[36m end\u001b[39m\n", "\u001b[36m i = 1\u001b[39m\n", "\u001b[36m i = 2\u001b[39m\n", "\u001b[36m i = 3\u001b[39m\n", "\u001b[36m i = 1\u001b[39m\n", "\u001b[36m i = 2\u001b[39m\n", "\u001b[36m i = 3\u001b[39m" ] 
}, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "?ncycle" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "search:\n", "\n" ] }, { "data": { "text/latex": [ "\\begin{verbatim}\n", "throttle(f, timeout; leading=true, trailing=false)\n", "\\end{verbatim}\n", "Return a function that when invoked, will only be triggered at most once during \\texttt{timeout} seconds.\n", "\n", "Normally, the throttled function will run as much as it can, without ever going more than once per \\texttt{wait} duration; but if you'd like to disable the execution on the leading edge, pass \\texttt{leading=false}. To enable execution on the trailing edge, pass \\texttt{trailing=true}.\n", "\n" ], "text/markdown": [ "\n", "throttle(f, timeout; leading=true, trailing=false)\n", "\n", "\n", "Return a function that when invoked, will only be triggered at most once during timeout seconds.\n", "\n", "Normally, the throttled function will run as much as it can, without ever going more than once per wait duration; but if you'd like to disable the execution on the leading edge, pass leading=false. To enable execution on the trailing edge, pass trailing=true.\n" ], "text/plain": [ "\u001b[36m throttle(f, timeout; leading=true, trailing=false)\u001b[39m\n", "\n", " Return a function that when invoked, will only be triggered at most once\n", " during \u001b[36mtimeout\u001b[39m seconds.\n", "\n", " Normally, the throttled function will run as much as it can, without ever\n", " going more than once per \u001b[36mwait\u001b[39m duration; but if you'd like to disable the\n", " execution on the leading edge, pass \u001b[36mleading=false\u001b[39m. To enable execution on\n", " the trailing edge, pass \u001b[36mtrailing=true\u001b[39m." 
] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "?throttle" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "#### 6.1.3 The Advertising dataset and the least-squares solution\n", "\n", "We return to the Advertising dataset." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
5 rows × 5 columns
" ], "text/latex": [ "\\begin{tabular}{r|ccccc}\n", "\t& Column1 & TV & radio & newspaper & sales\\\\\n", "\t\\hline\n", "\t& Int64 & Float64 & Float64 & Float64 & Float64\\\\\n", "\t\\hline\n", "\t1 & 1 & 230.1 & 37.8 & 69.2 & 22.1 \\\\\n", "\t2 & 2 & 44.5 & 39.3 & 45.1 & 10.4 \\\\\n", "\t3 & 3 & 17.2 & 45.9 & 69.3 & 9.3 \\\\\n", "\t4 & 4 & 151.5 & 41.3 & 58.5 & 18.5 \\\\\n", "\t5 & 5 & 180.8 & 10.8 & 58.4 & 12.9 \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "5×5 DataFrame\n", "│ Row │ Column1 │ TV │ radio │ newspaper │ sales │\n", "│ │ \u001b[90mInt64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │\n", "├─────┼─────────┼─────────┼─────────┼───────────┼─────────┤\n", "│ 1 │ 1 │ 230.1 │ 37.8 │ 69.2 │ 22.1 │\n", "│ 2 │ 2 │ 44.5 │ 39.3 │ 45.1 │ 10.4 │\n", "│ 3 │ 3 │ 17.2 │ 45.9 │ 69.3 │ 9.3 │\n", "│ 4 │ 4 │ 151.5 │ 41.3 │ 58.5 │ 18.5 │\n", "│ 5 │ 5 │ 180.8 │ 10.8 │ 58.4 │ 12.9 │" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = DataFrame(CSV.File(\"advertising.csv\"))\n", "first(df,5)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [ { "data": { "text/plain": [ "200" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "n = nrow(df)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We first compute the solution using the least-squares approach we detailed previously." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "X = reduce(hcat, [df[:,:TV], df[:,:radio], df[:,:newspaper]])\n", "Xaug = hcat(ones(n), X)\n", "y = df[:,:sales];" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.873739 seconds (2.89 M allocations: 141.946 MiB, 2.85% gc time)\n" ] }, { "data": { "text/plain": [ "4-element Array{Float64,1}:\n", " 2.938889369459415\n", " 0.04576464545539759\n", " 0.18853001691820445\n", " -0.0010374930424763011" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@time q = Xaug\\y" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "The MSE is:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "2.7841263145109356" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mean((Xaug*q .- y).^2)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "#### 6.1.4 Solving the problem using Flux\n", "\n", "We use DataLoader to set up the data for Flux. Note that it takes the transpose of what we have been using, that is, the columns of the data matrix correspond to the samples. Here we take mini-batches of size batchsize=20 and the option shuffle=true indicates that we apply a random permutation of the samples on every pass through the data. 
" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "Xtrain = X'\n", "ytrain = reshape(y, (1,length(y)))\n", "loader = DataLoader(Xtrain, ytrain; batchsize=64, shuffle=true);" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "For example, the first component of the first item is the features for the first 64 samples (after random permutation)." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "3×64 Array{Float64,2}:\n", " 7.8 238.2 4.1 31.5 225.8 273.7 … 232.1 206.9 265.2 193.7 74.7\n", " 38.9 34.3 11.6 24.6 8.2 28.9 8.6 8.4 2.9 35.4 49.4\n", " 50.6 5.3 5.7 2.2 56.5 59.7 8.7 26.4 43.0 75.6 45.7" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "first(loader)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Now we construct our model. It is simply an affine map from $\\mathbb{R}^3$ to $\\mathbb{R}$." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "Dense(3, 1)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m = Dense(3, 1)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "The loss function is the MSE." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "loss (generic function with 1 method)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "loss(x,y) = mse(m(x),y) " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Finally, the function [train!](https://fluxml.ai/Flux.jl/stable/training/training/#Flux.Optimise.train!) runs an optimization method of our choice on the loss function. The ! in the function name indicates that it modifies the parameters we pass to it, in this case m.W and m.b. There are many [optimizers](https://fluxml.ai/Flux.jl/stable/training/optimisers/#Optimiser-Reference-1) available . Stochastic gradient descent can chosen with [Descent](https://fluxml.ai/Flux.jl/stable/training/optimisers/#Flux.Optimise.Descent). (But it is slow. Instead we will use the popular [ADAM](https://fluxml.ai/Flux.jl/stable/training/optimisers/#Flux.Optimise.ADAM). See this [post](https://hackernoon.com/demystifying-different-variants-of-gradient-descent-optimization-algorithm-19ae9ba2e9bc) for a brief explanation of many common optimizers.)\n", "\n", "We also pass the parameters to train! using params and a callback function evalcb() that prints progress. \n", "\n", "Choosing the right number of passes (i.e. epochs) through the data requires some experimenting. Here $10^4$ suffices." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "ps = params(m)\n", "opt = Descent(1e-5)\n", "evalcb =() -> @show(loss(Xtrain,ytrain));" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "loss(Xtrain, ytrain) = 2160.6028562802303\n", "loss(Xtrain, ytrain) = 3.8571577593201845\n", " 18.668849 seconds (75.40 M allocations: 3.075 GiB, 4.78% gc time)\n" ] } ], "source": [ "@time train!(loss, ps, ncycle(loader, Int(1e4)), opt, cb = throttle(evalcb, 2))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "The final parameters and loss are:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "scrolled": true, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "1-element Array{Float32,1}:\n", " 0.3245267" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m.b" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "1×3 Array{Float32,2}:\n", " 0.0509482 0.218272 0.0144676" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m.W" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "3.9033906662010316" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "loss(Xtrain,ytrain)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### 6.2 Example 2: multinomial logistic regression" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "We return to classification. 
We first appeal to [multinomial logistic regression](https://en.wikipedia.org/wiki/Multinomial_logistic_regression) to learn a classifier over $K$ labels. Recall that we encode label $i$ as the $K$-dimensional vector $\\mathbf{e}_i$ and that we allow the output of the classifier to be a probability distribution over the labels $\\{1,\\ldots,K\\}$. Observe that $\\mathbf{e}_i$ can itself be thought of as a probability distribution, one that assigns probability one to $i$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "#### 6.2.1 Background\n", "\n", "In multinomial logistic regression, we once again use an affine function of the input data. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "This time, we have $K$ functions that output a score associated to each label. We then transform these scores into a probability distribution over the $K$ labels. There are many ways of doing this. A standard approach is the [softmax function](https://en.wikipedia.org/wiki/Softmax_function): for $\\mathbf{z} \\in \\mathbb{R}^K$\n", "\n", "$$\n", "\\gamma(\\mathbf{z})_i\n", "= \\frac{e^{z_i}}{\\sum_{j=1}^K e^{z_j}},\n", "\\quad i=1,\\ldots,K.\n", "$$\n", "\n", "To explain the name, observe that the larger inputs are mapped to larger probabilities. \n", "\n", "In fact, since a probability distribution must sum to $1$, it is determined by the probabilities assigned to the first $K-1$ labels. In other words, we can drop the score associated to the last label. Formally, we use the modified softmax function\n", "\n", "$$\n", "\\tilde\\gamma(\\mathbf{z})_i\n", "= \n", "\\begin{cases}\n", "\\frac{e^{z_i}}{1 + \\sum_{j=1}^{K-1} e^{z_j}}\n", "& i=1,\\ldots,K-1\\\\\n", "\\frac{1}{1 + \\sum_{j=1}^{K-1} e^{z_j}}\n", "& i=K\n", "\\end{cases}\n", "$$\n", "\n", "where this time $\\mathbf{z} \\in \\mathbb{R}^{K-1}$." 
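Both versions of the softmax can be checked numerically. The Python/NumPy sketch below (illustrative; the function names are ours) verifies that the modified softmax outputs a probability distribution and that it agrees with the ordinary softmax applied to (z; 0), i.e., with a score of 0 fixed for the last label:

```python
import numpy as np

def softmax(z):
    """gamma(z)_i = exp(z_i) / sum_j exp(z_j), computed stably."""
    e = np.exp(z - np.max(z))   # shifting by max(z) avoids overflow
    return e / e.sum()

def softmax_mod(z):
    """Modified softmax: z in R^{K-1}, output in the simplex Delta_K."""
    e = np.exp(z)
    return np.append(e, 1.0) / (1.0 + e.sum())

z = np.array([1.0, -0.5, 2.0])
p, q = softmax(np.append(z, 0.0)), softmax_mod(z)
assert np.isclose(q.sum(), 1.0) and np.all(q > 0)
# the modified softmax is the softmax with a 0 score for the last label
assert np.allclose(p, q)
```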
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Hence we now have two layers, that is, $L = 2$. The layers are defined by the functions\n", "\n", "$$\n", "z_{2,k}\n", "= g_1(\\mathbf{w}_1;\\mathbf{z}_1)_k\n", "= \\sum_{j=1}^d w^{(k)}_{1,j} z_{1,j} + w^{(k)}_{1,d+1},\n", "\\quad k=1,\\ldots,K-1\n", "$$\n", "\n", "where $\\mathbf{w}_1 = (\\mathbf{w}^{(1)}_1;\\ldots;\\mathbf{w}^{(K-1)}_1)$ are the parameters with $\\mathbf{w}^{(k)}_1 \\in \\mathbb{R}^{d+1}$ and $\\mathbf{z}_1 = \\mathbf{x} \\in \\mathbb{R}^d$ is the input, and\n", "\n", "$$\n", "\\mathbf{z}_3\n", "= g_2(\\mathbf{z}_2) \n", "= \\tilde\\gamma(\\mathbf{z}_2),\n", "$$\n", "\n", "where $\\tilde\\gamma$ is the modified softmax function. Note that the latter has no associated parameter." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "So the output of the classifier with parameters $\\mathbf{w} = \\mathbf{w}_1 =(\\mathbf{w}^{(1)}_1;\\ldots;\\mathbf{w}^{(K-1)}_1)$ on input $\\mathbf{x}$ is\n", "\n", "\n", "\\begin{align*}\n", "h_{\\mathbf{x}}(\\mathbf{w})_i\n", "&= g_2(g_1(\\mathbf{w}_1;\\mathbf{z}_1))_i\\\\\n", "&= \\tilde\\gamma\\left(\\left[\\sum_{j=1}^d w^{(k)}_{1,j} x_{j} + w^{(k)}_{1,d+1}\\right]_{k=1}^{K-1}\\right)_i,\n", "\\end{align*}\n", "\n", "for $i=1,\\ldots,K$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "It remains to define a loss function. To quantify the fit, it is natural to use a notion of distance between probability measures, here between the output $h_{\\mathbf{x}}(\\mathbf{w}) \\in \\Delta_K$ and the correct label $\\mathbf{y} \\in \\{\\mathbf{e}_1,\\ldots,\\mathbf{e}_{K}\\} \\subseteq \\Delta_K$. There are many such measures. In multinomial logistic regression, we use the [Kullback-Leibler divergence](https://en.wikipedia.org/wiki/Kullback–Leibler_divergence). 
For two probability distributions $\\mathbf{p}, \\mathbf{q} \\in \\Delta_K$, it is defined as\n", "\n", "$$\n", "\\mathrm{KL}(\\mathbf{p} \\| \\mathbf{q})\n", "= \\sum_{i=1}^K p_i \\log \\frac{p_i}{q_i}\n", "$$\n", "\n", "where it will suffice to restrict ourselves to the case $\\mathbf{q} > \\mathbf{0}$ and where we use the convention $0 \\log 0 = 0$ (so that terms with $p_i = 0$ contribute $0$ to the sum)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "*Exercise:* Show that $\\log x \\leq x - 1$ for all $x > 0$. [Hint: Compute the derivative of $s(x) = x - 1 - \\log x$ and the value $s(1)$.] $\\lhd$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Notice that $\\mathbf{p} = \\mathbf{q}$ implies $\\mathrm{KL}(\\mathbf{p} \\| \\mathbf{q}) = 0$. We show that $\\mathrm{KL}(\\mathbf{p} \\| \\mathbf{q}) \\geq 0$, a result known as *Gibbs' inequality*.\n", "\n", "***\n", "\n", "**Theorem (Gibbs)**: For any $\\mathbf{p}, \\mathbf{q} \\in \\Delta_K$ with $\\mathbf{q} > \\mathbf{0}$,\n", "\n", "$$\n", "\\mathrm{KL}(\\mathbf{p} \\| \\mathbf{q}) \\geq 0.\n", "$$\n", "\n", "***" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "*Proof:* Let $I$ be the set of indices $i$ such that $p_i > 0$. 
Hence\n", "\n", "$$\n", "\\mathrm{KL}(\\mathbf{p} \\| \\mathbf{q}) \n", "= \\sum_{i \\in I} p_i \\log \\frac{p_i}{q_i}.\n", "$$\n", "\n", "Using the exercise above, $\\log x \\leq x - 1$ for all $x > 0$, so that\n", "\n", "\n", "\\begin{align*}\n", "\\mathrm{KL}(\\mathbf{p} \\| \\mathbf{q}) \n", "&= - \\sum_{i \\in I} p_i \\log \\frac{q_i}{p_i}\\\\\n", "&\\geq - \\sum_{i \\in I} p_i \\left(\\frac{q_i}{p_i} - 1\\right)\\\\\n", "&= - \\sum_{i \\in I} q_i + \\sum_{i \\in I} p_i\\\\\n", "&= - \\sum_{i \\in I} q_i + 1\\\\\n", "&\\geq 0\n", "\\end{align*}\n", "\n", "\n", "where we used that $\\log z^{-1} = - \\log z$ on the first line, the fact that $p_i = 0$ for all $i \\notin I$ (and hence $\\sum_{i \\in I} p_i = 1$) on the fourth line, and the fact that $\\sum_{i \\in I} q_i \\leq \\sum_{i=1}^K q_i = 1$ on the last line. $\\square$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Going back to the loss function, we use the identity $\\log\\frac{\\alpha}{\\beta} = \\log \\alpha - \\log \\beta$ to re-write \n", "\n", "\n", "\\begin{align*}\n", "\\mathrm{KL}(\\mathbf{y} \\| h_{\\mathbf{x}}(\\mathbf{w}))\n", "&= \\sum_{i=1}^K y_i \\log \\frac{y_i}{h_{\\mathbf{x}}(\\mathbf{w})_i}\\\\\n", "&= \\sum_{i=1}^K y_i \\log y_i\n", "- \\sum_{i=1}^K y_i \\log h_{\\mathbf{x}}(\\mathbf{w})_i.\n", "\\end{align*}\n", "\n", "\n", "Notice that the first term on the right-hand side does not depend on $\\mathbf{w}$. Hence we can ignore it when optimizing $\\mathrm{KL}(\\mathbf{y} \\| h_{\\mathbf{x}}(\\mathbf{w}))$. The remaining term\n", "\n", "$$\n", "H(\\mathbf{y}, h_{\\mathbf{x}}(\\mathbf{w}))\n", "= - \\sum_{i=1}^K y_i \\log h_{\\mathbf{x}}(\\mathbf{w})_i\n", "$$\n", "\n", "is known as the [cross-entropy](https://en.wikipedia.org/wiki/Cross_entropy). We use it to define our loss function. 
That is, we set\n", "\n", "$$\n", "\\ell_{\\mathbf{y}}(\\mathbf{z}_3)\n", "= H(\\mathbf{y}, \\mathbf{z}_3)\n", "= - \\sum_{i=1}^K y_i \\log z_{3,i}.\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Finally, \n", "\n", "$$\n", "f_{\\mathbf{x},\\mathbf{y}}(\\mathbf{w})\n", "= \\ell_{\\mathbf{y}}(h_{\\mathbf{x}}(\\mathbf{w}))\n", "= - \\sum_{i=1}^K y_i \\log\\tilde\\gamma\\left(\\left[\\sum_{j=1}^d w^{(k)}_{1,j} x_{j} + w^{(k)}_{1,d+1}\\right]_{k=1}^{K-1}\\right)_i.\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "#### 6.2.2 Computing the gradient\n", "\n", "The forward pass starts with the initialization $\\mathbf{z}_1 := \\mathbf{x}$. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "The forward layer loop has two steps. First we compute \n", "\n", "\n", "\\begin{align*}\n", "\\mathbf{z}_{2} \n", "&:= g_1(\\mathbf{w}_1;\\mathbf{z}_1)\n", "= \\left[\\sum_{j=1}^d w^{(k)}_{1,j} z_{1,j} + w^{(k)}_{1,d+1}\\right]_{k=1}^{K-1}\n", "= \\mathcal{W}_1 \\mathbf{z}_1^{;1}\\\\\n", "\\begin{pmatrix}\n", "A_1 & B_1\n", "\\end{pmatrix}\n", "&:= J_{g_1}(\\mathbf{w}_1;\\mathbf{z}_1)\n", "\\end{align*}\n", "\n", "\n", "where we define $\\mathcal{W} = \\mathcal{W}_1 \\in \\mathbb{R}^{(K-1)\\times (d+1)}$ as the matrix with rows $(\\mathbf{w}_1^{(1)})^T,\\ldots,(\\mathbf{w}_1^{(K-1)})^T$.\n", "To compute the Jacobian, let us look at the columns corresponding to the variables in $\\mathbf{w}_1^{(k)}$, that is, columns $\\alpha_k = (k-1) (d+1) + 1$ to $\\beta_k = k(d+1)$. Note that only component $k$ of $g_1$ depends on $\\mathbf{w}_1^{(k)}$, so the rows $\\neq k$ of $J_{g_1}$ are $0$ for those columns. Row $k$ on the other hand is $(\\mathbf{z}_1^{;1})^T$. 
Hence one way to write the columns $\\alpha_k$ to $\\beta_k$ of $J_{g_1}$ is $\\mathbf{e}_k (\\mathbf{z}_1^{;1})^T$, where $\\mathbf{e}_k \\in \\mathbb{R}^{K-1}$ is the $k$-th vector of the canonical basis of $\\mathbb{R}^{K-1}$ (in a slight abuse of notation). So $A_1$ can be written in block form as\n", "\n", "$$\n", "A_1 \n", "= \\begin{pmatrix}\n", "\\mathbf{e}_1 (\\mathbf{z}_1^{;1})^T\n", "& \\cdots & \\mathbf{e}_{K-1} (\\mathbf{z}_1^{;1})^T\n", "\\end{pmatrix}\n", "=: \\mathbb{A}_{d,K-1}[\\mathbf{z}_1],\n", "$$\n", "\n", "where the last equality is a definition." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "As for the columns corresponding to the variables in $\\mathbf{z}_1$, that is, columns $(K-1) (d+1) + 1$ to $(K-1) (d+1) + d$, each row takes the same form. Indeed row $k$ is $((\\mathbf{w}^{(k)}_1)^{:d})^T$. So $B_1$ can be written as\n", "\n", "$$\n", "B_1\n", "=\n", "\\begin{pmatrix}\n", "((\\mathbf{w}^{(1)}_1)^{:d})^T\\\\\n", "\\vdots\\\\\n", "((\\mathbf{w}^{(K-1)}_1)^{:d})^T\n", "\\end{pmatrix}\n", "=: \\mathbb{B}_{d,K-1}[\\mathbf{w}_1].\n", "$$\n", "\n", "In fact, we will not need $B_1$ here, but we will need it in a later section." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "In the second step of the forward layer loop, we compute\n", "\n", "\n", "\\begin{align*}\n", "\\mathbf{z}_3 \n", "&:= g_2(\\mathbf{z}_2)\n", "= \\tilde\\gamma(\\mathbf{z}_2)\\\\\n", "B_2\n", "&:= J_{g_2}(\\mathbf{z}_2)\n", "= J_{\\tilde{\\gamma}}(\\mathbf{z}_2).\n", "\\end{align*}\n", "\n", "\n", "So we need to compute the Jacobian of $\\tilde\\gamma$. We divide the computation into several cases. 
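\n", "\n", "(The entries derived below can be verified numerically. In compact form, they amount to $J_{\\tilde\\gamma}(\\mathbf{z}_2)_{ij} = \\tilde\\gamma(\\mathbf{z}_2)_i \\left(\\mathbf{1}_{\\{i = j\\}} - \\tilde\\gamma(\\mathbf{z}_2)_j\\right)$. Here is a finite-difference check, sketched in Python with NumPy and independent of the Julia code used elsewhere in this notebook; the helper name `softmax_mod` is ours.)\n", "\n", "```python\n", "import numpy as np\n", "\n", "def softmax_mod(z):\n", "    # modified softmax: maps R^(K-1) to a distribution over K labels\n", "    ez = np.exp(z)\n", "    return np.append(ez, 1.0) / (1.0 + ez.sum())\n", "\n", "K = 4\n", "z = np.array([0.5, -1.0, 2.0])\n", "g = softmax_mod(z)\n", "# closed-form Jacobian entries g_i (1{i=j} - g_j), a K x (K-1) matrix\n", "J = g[:, None] * (np.eye(K)[:, :K-1] - g[None, :K-1])\n", "# compare against central finite differences\n", "eps = 1e-6\n", "J_num = np.zeros((K, K-1))\n", "for j in range(K-1):\n", "    dz = np.zeros(K-1); dz[j] = eps\n", "    J_num[:, j] = (softmax_mod(z + dz) - softmax_mod(z - dz)) / (2 * eps)\n", "\n", "assert np.allclose(J, J_num, atol=1e-8)\n", "```\n", "\n", "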
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "When $1 \\leq i = j \\leq K-1$,\n", "\n", "\n", "\\begin{align*}\n", "(B_2)_{ii}\n", "&= \\frac{\\partial}{\\partial z_{2,i}} \\left[ \\tilde\\gamma(\\mathbf{z}_2)_i \\right]\\\\\n", "&= \\frac{\\partial}{\\partial z_{2,i}} \\left[ \\frac{e^{z_{2,i}}}{1 + \\sum_{k=1}^{K-1} e^{z_{2,k}}} \\right]\\\\\n", "&= \\frac{e^{z_{2,i}}\\left(1 + \\sum_{k=1}^{K-1} e^{z_{2,k}}\\right) - e^{z_{2,i}}\\left(e^{z_{2,i}}\\right)}\n", "{\\left(1 + \\sum_{k=1}^{K-1} e^{z_{2,k}}\\right)^2}\\\\\n", "&= \\frac{e^{z_{2,i}}\\left(1 + \\sum_{k = 1, k \\neq i}^{K-1} e^{z_{2,k}}\\right)}\n", "{\\left(1 + \\sum_{k=1}^{K-1} e^{z_{2,k}}\\right)^2}\n", "=:\\mathbb{C}_{K}[\\mathbf{z}_2]_{ii},\n", "\\end{align*}\n", "\n", "\n", "by the [quotient rule](https://en.wikipedia.org/wiki/Quotient_rule)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "When $1 \\leq i, j \\leq K-1$ with $i \\neq j$,\n", "\n", "\n", "\\begin{align*}\n", "(B_2)_{ij}\n", "&= \\frac{\\partial}{\\partial z_{2,j}} \\left[ \\tilde\\gamma(\\mathbf{z}_2)_i \\right]\\\\\n", "&= \\frac{\\partial}{\\partial z_{2,j}} \\left[ \\frac{e^{z_{2,i}}}{1 + \\sum_{k=1}^{K-1} e^{z_{2,k}}} \\right]\\\\\n", "&= \\frac{- e^{z_{2,i}}\\left(e^{z_{2,j}}\\right)}\n", "{\\left(1 + \\sum_{k=1}^{K-1} e^{z_{2,k}}\\right)^2}\n", "=:\\mathbb{C}_{K}[\\mathbf{z}_2]_{ij}.\n", "\\end{align*}\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "When $i = K$ and $1 \\leq j \\leq K-1$,\n", "\n", "\n", "\\begin{align*}\n", "(B_2)_{ij}\n", "&= \\frac{\\partial}{\\partial z_{2,j}} \\left[ \\tilde\\gamma(\\mathbf{z}_2)_i \\right]\\\\\n", "&= \\frac{\\partial}{\\partial z_{2,j}} \\left[ \\frac{1}{1 + \\sum_{k=1}^{K-1} e^{z_{2,k}}} \\right]\\\\\n", "&= \\frac{- \\left(e^{z_{2,j}}\\right)}\n", "{\\left(1 + \\sum_{k=1}^{K-1} e^{z_{2,k}}\\right)^2}\n", 
"=:\\mathbb{C}_{K}[\\mathbf{z}_2]_{ij}.\n", "\\end{align*}\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "The gradient of the loss function is\n", "\n", "$$\n", "\\mathbf{q}_3\n", "= \\nabla \\ell_{\\mathbf{y}} (\\mathbf{z}_3)\n", "= \\left[- \\frac{y_i}{z_{3,i}}\\right]_{i=1}^K.\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "The backward layer loop also has two steps. Because $g_2$ does not have parameters, we only need to compute $\\mathbf{q}_2$ and $\\mathbf{p}_1$. We get for $1 \\leq j \\leq K-1$\n", "\n", "\n", "\\begin{align*}\n", "q_{2,j} \n", "= [B_2^T \\mathbf{q}_{3}]_j\n", "&= \\sum_{i=1}^K \\mathbb{C}_{K}[\\mathbf{z}_2]_{ij} \\left(- \\frac{y_i}{\\tilde\\gamma(\\mathbf{z}_2)_i}\\right)\\\\\n", "&= - \\frac{y_j \\left(1 + \\sum_{k = 1, k \\neq j}^{K-1} e^{z_{2,k}}\\right)}\n", "{1 + \\sum_{k=1}^{K-1} e^{z_{2,k}}}\n", "+ \\sum_{i=1, i \\neq j}^{K} \\frac{y_i e^{z_{2,j}}}{1 + \\sum_{k=1}^{K-1} e^{z_{2,k}}}\\\\\n", "&= - \\frac{y_j \\left(1 + \\sum_{k = 1, k \\neq j}^{K-1} e^{z_{2,k}}\\right)}\n", "{1 + \\sum_{k=1}^{K-1} e^{z_{2,k}}}\n", "+ \\sum_{i=1, i \\neq j}^{K} \\frac{y_i e^{z_{2,j}}}{1 + \\sum_{k=1}^{K-1} e^{z_{2,k}}}\n", "+ \\frac{y_j e^{z_{2,j}} - y_j e^{z_{2,j}}}{1 + \\sum_{k=1}^{K-1} e^{z_{2,k}}}\\\\\n", "&= - y_j\n", "+ \\sum_{i=1}^{K} \\frac{y_i e^{z_{2,j}}}{1 + \\sum_{k=1}^{K-1} e^{z_{2,k}}}\\\\\n", "&= - y_j\n", "+ \\frac{e^{z_{2,j}}}{1 + \\sum_{k=1}^{K-1} e^{z_{2,k}}},\n", "\\end{align*}\n", "\n", "\n", "where we used that $\\sum_{i=1}^{K} y_i = 1$. That is, $\\mathbf{q}_2 = \\tilde\\gamma(\\mathbf{z}_2)^{:K-1} - \\mathbf{y}^{:K-1}$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "It remains to compute $\\mathbf{p}_1$. 
We have\n", "\n", "\n", "\\begin{align*}\n", "\\mathbf{p}_{1} \n", "&= A_1^T \\mathbf{q}_{2}\\\\\n", "&= \\mathbb{A}_{d,K-1}[\\mathbf{z}_1]^T (\\tilde\\gamma(\\mathbf{z}_2)^{:K-1} - \\mathbf{y}^{:K-1})\\\\\n", "&= \\begin{pmatrix}\n", "\\mathbf{e}_1 (\\mathbf{z}_1^{;1})^T\n", "& \\cdots & \\mathbf{e}_{K-1} (\\mathbf{z}_1^{;1})^T\n", "\\end{pmatrix}^T (\\tilde\\gamma(\\mathbf{z}_2)^{:K-1} - \\mathbf{y}^{:K-1})\\\\\n", "&= \\begin{pmatrix}\n", "\\mathbf{z}_1^{;1} \\mathbf{e}_1^T\\\\ \n", "\\vdots\\\\ \n", "\\mathbf{z}_1^{;1} \\mathbf{e}_{K-1}^T\n", "\\end{pmatrix} (\\tilde\\gamma(\\mathbf{z}_2)^{:K-1} - \\mathbf{y}^{:K-1})\\\\\n", "&= \\begin{pmatrix}\n", "(\\tilde\\gamma(\\mathbf{z}_2)_1 - y_1)\\, \\mathbf{z}_1^{;1}\\\\ \n", "\\vdots\\\\ \n", "(\\tilde\\gamma(\\mathbf{z}_2)_{K-1} - y_{K-1}) \\, \\mathbf{z}_1^{;1}\n", "\\end{pmatrix}.\n", "\\end{align*}\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "We summarize the whole procedure next." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "*Initialization:* \n", "\n", "$$\\mathbf{z}_1 := \\mathbf{x}$$\n", "\n", "*Forward layer loop:*\n", " \n", "\n", "\\begin{align*}\n", "\\mathbf{z}_{2} \n", "&:= g_1(\\mathbf{w}_1;\\mathbf{z}_1)\n", "= \\mathcal{W}_1 \\mathbf{z}_1^{;1}\\\\\n", "\\begin{pmatrix}\n", "A_1 & B_1\n", "\\end{pmatrix}\n", "&:= J_{g_1}(\\mathbf{w}_1;\\mathbf{z}_1)\n", "= \n", "\\begin{pmatrix}\n", "\\mathbb{A}_{d,K-1}[\\mathbf{z}_1] & \\mathbb{B}_{d,K-1}[\\mathbf{w}_1]\n", "\\end{pmatrix}\\end{align*}\n", "\n", "\n", "\n", "\\begin{align*}\n", "\\mathbf{z}_3 &:= g_2(\\mathbf{z}_2) = \\tilde\\gamma(\\mathbf{z}_2)\\\\\n", "B_2\n", "&:= J_{g_2}(\\mathbf{z}_2)\n", "= \\mathbb{C}_{K}[\\mathbf{z}_2]\n", "\\end{align*}\n", "\n", "\n", "\n", "\n", "*Loss:* \n", "\n", "\n", "\\begin{align*}\n", "\\mathbf{z}_4\n", "&:= \\ell_{\\mathbf{y}}(\\mathbf{z}_3) = - \\sum_{i=1}^K y_i \\log z_{3,i}\n", "= - \\sum_{i=1}^K y_i \\log 
\\tilde\\gamma\\left(\\mathcal{W}_1 \\mathbf{x}^{;1}\\right)_{i}\\\\\n", "\\mathbf{q}_3\n", "&:= \\nabla {\\ell_{\\mathbf{y}}}(\\mathbf{z}_3)\n", "= \\left[- \\frac{y_i}{z_{3,i}}\\right]_{i=1}^K.\n", "\\end{align*}\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "*Backward layer loop:*\n", "\n", "\n", "\\begin{align*}\n", "\\mathbf{q}_{2} &:= B_2^T \\mathbf{q}_{3}\n", "= \\mathbb{C}_{K}[\\mathbf{z}_2]^T \\mathbf{q}_{3}\n", "\\end{align*}\n", "\n", "\n", "\n", "\\begin{align*}\n", "\\mathbf{p}_{1} &:= A_1^T \\mathbf{q}_{2}\n", "= \\mathbb{A}_{d,K-1}[\\mathbf{z}_1]^T \\mathbf{q}_{2}\n", "\\end{align*}\n", "\n", "\n", "*Output:* \n", "\n", "$$\n", "\\nabla f_{\\mathbf{x},\\mathbf{y}}(\\mathbf{w})\n", "= \\mathbf{p}_1 = \n", "\\begin{pmatrix}\n", "\\left(\\tilde\\gamma\\left(\\mathcal{W} \\,\\mathbf{x}^{;1}\\right)_1 - y_1\\right)\\, \\mathbf{x}^{;1}\\\\ \n", "\\vdots\\\\ \n", "\\left(\\tilde\\gamma\\left(\\mathcal{W} \\,\\mathbf{x}^{;1}\\right)_{K-1} - y_{K-1}\\right) \\, \\mathbf{x}^{;1}\n", "\\end{pmatrix}.\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "#### 6.2.3 MNIST dataset" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We will use the [MNIST](https://en.wikipedia.org/wiki/MNIST_database) dataset. Quoting [Wikipedia](https://en.wikipedia.org/wiki/MNIST_database):\n", "\n", "> The MNIST database (Modified National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems. The database is also widely used for training and testing in the field of machine learning. It was created by \"re-mixing\" the samples from NIST's original datasets. 
The creators felt that since NIST's training dataset was taken from American Census Bureau employees, while the testing dataset was taken from American high school students, it was not well-suited for machine learning experiments. Furthermore, the black and white images from NIST were normalized to fit into a 28x28 pixel bounding box and anti-aliased, which introduced grayscale levels. The MNIST database contains 60,000 training images and 10,000 testing images. Half of the training set and half of the test set were taken from NIST's training dataset, while the other half of the training set and the other half of the test set were taken from NIST's testing dataset." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Here is a sample of the images:\n", "\n", "![MNIST sample images](https://upload.wikimedia.org/wikipedia/commons/2/27/MnistExamples.png)\n", "\n", "([Source](https://commons.wikimedia.org/wiki/File:MnistExamples.png))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "We first load the data and convert it to an appropriate matrix representation. The data can be accessed with `Flux.Data.MNIST` (assuming the `Flux` package has been loaded)." 
] }, { "cell_type": "code", "execution_count": 21, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "60000" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "imgs = MNIST.images()\n", "labels = MNIST.labels()\n", "length(imgs)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "For example, the first image and its label are:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAHAAAABwCAAAAADji6uXAAAESmlDQ1BrQ0dDb2xvclNwYWNlR2VuZXJpY0dyYXkAADiNjVVbaBxVGP535+wGJA4+aBtaaAcvbSlpmESricXa7Wa7SRM362ZTmyrKZHY2O93ZmXFmdpuEPpWCb1oQpK+C+hgLIlgv2LzYl4rFkko1DwoRWowgKH1S8DtnJpvZDV5mOOd857+d//wXDlHPH5rrWkmFqGEHXr6UmT09e0bpuUlJkqmX8Gm672aKxUmObcc2aNt3/zYl+HrrELe1nf+vX6pi+DrWaxhOxdcbRAmVKF3VXS8g6rkM+vC5wOX4JvDD9XIpC7wOLEe6/Hskb9iGZ+pK3tMWlaLnVE0r7ut/8f/X17Cam+ftxej169MTWA/C54uGPTMNfAB4WddyHPcD326ZpwohTibd4HgplE8ONOszmYh+uuqdmInoF2vNMY4HgJeXauWXgB8CXrPnClOR/EbdmeB2+oikPt3PngF+HFitGeM8Twpw2XNKUxE9qBijOeBngS+bwXg5tC9967emcyFmtFTLFsKz2MBZ7WQReAfwUcPKl0I7rOwGRW5zGHjBtgqToc/siuHnoruz74NaeSyUTyUDr8x1HwXeVzVPjIf+p8Zq3lgp9CcVuJaoraeBl71mid99H/C65uXyoc30AxVtlMf5KeAhOpXQyCCH5jDrZNNfuK9PJrUEcskDr4q9RXlI2Bgedjp4eSCNFoGKMSkDOy4T7hSqYKfQvNDyBeJW7kZWsnvepyaoNdoAtQb0Av0oKAv0EzWwZkFtgjffZTeL1aYleKBEnt2LbDpsJ1PZkxhH2CR7jg2zEVLY8+wYO8pGQR1hR2Lex33n3t1rW3od58Z9X4FEAB0LntnQ8UWkluhP8OtCMhatS7uaB1z3nTcveK+Z+jdv/dYRPR/yod2fYdER9Jju9fOf98Xju8o+eeVW7/XzNBXPkshbpTtLqfXU3dQq5juptbiN1A+pNfx3tt2X+7OZlc3cZsCzBK2BYQqO37bWBA4wV4XOoQ6Lcey07c9jONtOcf4xJhxropZiN6val3a57qsf8GgabxTuF+hCv3pF3VDfU79Tf1VX1XeBfpHelj6WvpCuSp9KN0iRrkkr0pfSV9KH0mfYfQTqinS1q5LmO6unXbN6VGGcG4h8Z2JR4dTN+50Fb8tTQ8Sh84TO6m+fJR+Xd8uPyaPyXvkJeVI+KB+Wj8k75SGMQXlM3g/O7naUrCgDZlfHmTQrYhXmyRbdpIHfwKzF/AplYzFPPIg4m11dvtn9pujGsDod7DWaATLpnND1RX5s0f3d2kvidCfxMo8g28MG2XjUgxl2GF040dGPw7xL07n0aDpDSvpgei
Q9mD7J8VbtpveDO4I5F/PeaEd2q4fmRJ3WRYxaQsLHTIGxEPBHJuu4i545XwuUIVV9RsngeTWUcVsf6Fc0y1IEy1c8wze8llEZIP52h8/T7y+KNzmx44be9FrRm5VIfE30N7ePkzQTJdzgAAAAOGVYSWZNTQAqAAAACAABh2kABAAAAAEAAAAaAAAAAAACoAIABAAAAAEAAABwoAMABAAAAAEAAABwAAAAAP1Kc4sAAAKASURBVGgF7Zo9aBRBGIbjDxZKotgoBESSIoIosVBBAkGCiJAUQRuFNGqnIVUaOwtFUAsTUqQKpJC0aqXgTywEQTRpFPuonb+IJiTq8+oNrBtvdvcOPvBjXnjYmb3bG773Ze5mdq+lJSk5kBxIDiQHkgPJgeTA/+/AmqolrOOCzZmLztPeCF1wDq7BSfgOV+AiZLU227Fo+x9wfZGNO3jDBjgEPbAFjkNeC5wYg0H4AvMwC3n5t9S8wug83EcA9yE77/KZqP8DTsNXddBb+ACv1cnJvEL/A0Yz3Ir/T6Ejl4O6Ov8RDsMSFOXMW37Lv6XmFUa/S9/j+ij0wwvQd6U0B0dA8243jEBZmVfof8DoPAy5tNHQb9wknIEhuAmNyL+l5hVG52HI6HOt8al2PMtxBvQ7WFXmFfofsNQ8DDltonEHeuEY3IOq8m+peYWVMlRenfActJ55CM9gAn5CGZlX6H/AyhkqJ+0Bp6BVHXQBpuGdOgXyb6l5hQ1lqJj2wHXoUwdpvXMJ3qgTkXmF/gdsOEPFpHs2A6A5qQ96ANpzxOTfUvMKm8owZLVIQwvcZTgKj6CezCv0P2CpvcW/8tjLyROwH8KHvKT9GGLyb6l5hcH+mO1/vabnE8Ogdc32zCsrtLWmKdozmlfof8DS36XK6xTo+dJOyEr7C61nbmdP1mn7t9S8wsIMt5GF7omOw65cLrrvfRVuQdH8C5eaV+h/wLoZ6pmF9gvd0BECqB2fcNS+4i58q50re/BvqXmFqzI8SBh6VnEA2nPBKK8bcBn0zKIRmVfof8BVaxqtVUTQKxq6R6o1i/5noXtszci/peYVNhNHujY5kBxIDiQH/jjwCwxNTGDrgl7+AAAAAElFTkSuQmCC", "text/plain": [ "28×28 Array{Gray{N0f8},2} with eltype Gray{Normed{UInt8,8}}:\n", " Gray{N0f8}(0.0) Gray{N0f8}(0.0) … Gray{N0f8}(0.0) Gray{N0f8}(0.0)\n", " Gray{N0f8}(0.0) Gray{N0f8}(0.0) Gray{N0f8}(0.0) Gray{N0f8}(0.0)\n", " Gray{N0f8}(0.0) Gray{N0f8}(0.0) Gray{N0f8}(0.0) Gray{N0f8}(0.0)\n", " Gray{N0f8}(0.0) Gray{N0f8}(0.0) Gray{N0f8}(0.0) Gray{N0f8}(0.0)\n", " Gray{N0f8}(0.0) Gray{N0f8}(0.0) Gray{N0f8}(0.0) Gray{N0f8}(0.0)\n", " Gray{N0f8}(0.0) Gray{N0f8}(0.0) … Gray{N0f8}(0.0) Gray{N0f8}(0.0)\n", " Gray{N0f8}(0.0) Gray{N0f8}(0.0) Gray{N0f8}(0.0) Gray{N0f8}(0.0)\n", " Gray{N0f8}(0.0) Gray{N0f8}(0.0) Gray{N0f8}(0.0) Gray{N0f8}(0.0)\n", " Gray{N0f8}(0.0) Gray{N0f8}(0.0) Gray{N0f8}(0.0) Gray{N0f8}(0.0)\n", " Gray{N0f8}(0.0) Gray{N0f8}(0.0) Gray{N0f8}(0.0) Gray{N0f8}(0.0)\n", " Gray{N0f8}(0.0) Gray{N0f8}(0.0) … Gray{N0f8}(0.0) Gray{N0f8}(0.0)\n", " Gray{N0f8}(0.0) 
Gray{N0f8}(0.0) Gray{N0f8}(0.0) Gray{N0f8}(0.0)\n", " Gray{N0f8}(0.0) Gray{N0f8}(0.0) Gray{N0f8}(0.0) Gray{N0f8}(0.0)\n", " ⋮ ⋱ \n", " Gray{N0f8}(0.0) Gray{N0f8}(0.0) Gray{N0f8}(0.0) Gray{N0f8}(0.0)\n", " Gray{N0f8}(0.0) Gray{N0f8}(0.0) Gray{N0f8}(0.0) Gray{N0f8}(0.0)\n", " Gray{N0f8}(0.0) Gray{N0f8}(0.0) Gray{N0f8}(0.0) Gray{N0f8}(0.0)\n", " Gray{N0f8}(0.0) Gray{N0f8}(0.0) Gray{N0f8}(0.0) Gray{N0f8}(0.0)\n", " Gray{N0f8}(0.0) Gray{N0f8}(0.0) … Gray{N0f8}(0.0) Gray{N0f8}(0.0)\n", " Gray{N0f8}(0.0) Gray{N0f8}(0.0) Gray{N0f8}(0.0) Gray{N0f8}(0.0)\n", " Gray{N0f8}(0.0) Gray{N0f8}(0.0) Gray{N0f8}(0.0) Gray{N0f8}(0.0)\n", " Gray{N0f8}(0.0) Gray{N0f8}(0.0) Gray{N0f8}(0.0) Gray{N0f8}(0.0)\n", " Gray{N0f8}(0.0) Gray{N0f8}(0.0) Gray{N0f8}(0.0) Gray{N0f8}(0.0)\n", " Gray{N0f8}(0.0) Gray{N0f8}(0.0) … Gray{N0f8}(0.0) Gray{N0f8}(0.0)\n", " Gray{N0f8}(0.0) Gray{N0f8}(0.0) Gray{N0f8}(0.0) Gray{N0f8}(0.0)\n", " Gray{N0f8}(0.0) Gray{N0f8}(0.0) Gray{N0f8}(0.0) Gray{N0f8}(0.0)" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "imgs[1]" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "5" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "labels[1]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "We first transform the images into vectors using `reshape`." 
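, "\n", "\n", "Incidentally, Julia's `reshape` flattens arrays in column-major order; in NumPy terms (a sketch in Python, independent of the Julia code here) the analogue is `order=\"F\"`, and flattening followed by reshaping back is a lossless round trip:\n", "\n", "```python\n", "import numpy as np\n", "\n", "img = np.arange(6).reshape(2, 3)\n", "v = img.reshape(-1, order=\"F\")        # column-major, like Julia's reshape(img, :)\n", "back = v.reshape((2, 3), order=\"F\")   # recover the original array\n", "assert (back == img).all()\n", "```"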
] }, { "cell_type": "code", "execution_count": 24, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "search: \u001b[0m\u001b[1mr\u001b[22m\u001b[0m\u001b[1me\u001b[22m\u001b[0m\u001b[1ms\u001b[22m\u001b[0m\u001b[1mh\u001b[22m\u001b[0m\u001b[1ma\u001b[22m\u001b[0m\u001b[1mp\u001b[22m\u001b[0m\u001b[1me\u001b[22m p\u001b[0m\u001b[1mr\u001b[22momot\u001b[0m\u001b[1me\u001b[22m_\u001b[0m\u001b[1ms\u001b[22m\u001b[0m\u001b[1mh\u001b[22m\u001b[0m\u001b[1ma\u001b[22m\u001b[0m\u001b[1mp\u001b[22m\u001b[0m\u001b[1me\u001b[22m\n", "\n" ] }, { "data": { "text/latex": [ "\\begin{verbatim}\n", "reshape(A, dims...) -> AbstractArray\n", "reshape(A, dims) -> AbstractArray\n", "\\end{verbatim}\n", "Return an array with the same data as \\texttt{A}, but with different dimension sizes or number of dimensions. The two arrays share the same underlying data, so that the result is mutable if and only if \\texttt{A} is mutable, and setting elements of one alters the values of the other.\n", "\n", "The new dimensions may be specified either as a list of arguments or as a shape tuple. At most one dimension may be specified with a \\texttt{:}, in which case its length is computed such that its product with all the specified dimensions is equal to the length of the original array \\texttt{A}. 
The total number of elements must not change.\n", "\n", "\\section{Examples}\n", "\\begin{verbatim}\n", "julia> A = Vector(1:16)\n", "16-element Array{Int64,1}:\n", " 1\n", " 2\n", " 3\n", " 4\n", " 5\n", " 6\n", " 7\n", " 8\n", " 9\n", " 10\n", " 11\n", " 12\n", " 13\n", " 14\n", " 15\n", " 16\n", "\n", "julia> reshape(A, (4, 4))\n", "4×4 Array{Int64,2}:\n", " 1 5 9 13\n", " 2 6 10 14\n", " 3 7 11 15\n", " 4 8 12 16\n", "\n", "julia> reshape(A, 2, :)\n", "2×8 Array{Int64,2}:\n", " 1 3 5 7 9 11 13 15\n", " 2 4 6 8 10 12 14 16\n", "\n", "julia> reshape(1:6, 2, 3)\n", "2×3 reshape(::UnitRange{Int64}, 2, 3) with eltype Int64:\n", " 1 3 5\n", " 2 4 6\n", "\\end{verbatim}\n" ], "text/markdown": [ "\n", "reshape(A, dims...) -> AbstractArray\n", "reshape(A, dims) -> AbstractArray\n", "\n", "\n", "Return an array with the same data as A, but with different dimension sizes or number of dimensions. The two arrays share the same underlying data, so that the result is mutable if and only if A is mutable, and setting elements of one alters the values of the other.\n", "\n", "The new dimensions may be specified either as a list of arguments or as a shape tuple. At most one dimension may be specified with a :, in which case its length is computed such that its product with all the specified dimensions is equal to the length of the original array A. 
The total number of elements must not change.\n", "\n", "# Examples\n", "\n", "jldoctest\n", "julia> A = Vector(1:16)\n", "16-element Array{Int64,1}:\n", " 1\n", " 2\n", " 3\n", " 4\n", " 5\n", " 6\n", " 7\n", " 8\n", " 9\n", " 10\n", " 11\n", " 12\n", " 13\n", " 14\n", " 15\n", " 16\n", "\n", "julia> reshape(A, (4, 4))\n", "4×4 Array{Int64,2}:\n", " 1 5 9 13\n", " 2 6 10 14\n", " 3 7 11 15\n", " 4 8 12 16\n", "\n", "julia> reshape(A, 2, :)\n", "2×8 Array{Int64,2}:\n", " 1 3 5 7 9 11 13 15\n", " 2 4 6 8 10 12 14 16\n", "\n", "julia> reshape(1:6, 2, 3)\n", "2×3 reshape(::UnitRange{Int64}, 2, 3) with eltype Int64:\n", " 1 3 5\n", " 2 4 6\n", "\n" ], "text/plain": [ "\u001b[36m reshape(A, dims...) -> AbstractArray\u001b[39m\n", "\u001b[36m reshape(A, dims) -> AbstractArray\u001b[39m\n", "\n", " Return an array with the same data as \u001b[36mA\u001b[39m, but with different dimension sizes\n", " or number of dimensions. The two arrays share the same underlying data, so\n", " that the result is mutable if and only if \u001b[36mA\u001b[39m is mutable, and setting elements\n", " of one alters the values of the other.\n", "\n", " The new dimensions may be specified either as a list of arguments or as a\n", " shape tuple. At most one dimension may be specified with a \u001b[36m:\u001b[39m, in which case\n", " its length is computed such that its product with all the specified\n", " dimensions is equal to the length of the original array \u001b[36mA\u001b[39m. 
The total number\n", " of elements must not change.\n", "\n", "\u001b[1m Examples\u001b[22m\n", "\u001b[1m ≡≡≡≡≡≡≡≡≡≡\u001b[22m\n", "\n", "\u001b[36m julia> A = Vector(1:16)\u001b[39m\n", "\u001b[36m 16-element Array{Int64,1}:\u001b[39m\n", "\u001b[36m 1\u001b[39m\n", "\u001b[36m 2\u001b[39m\n", "\u001b[36m 3\u001b[39m\n", "\u001b[36m 4\u001b[39m\n", "\u001b[36m 5\u001b[39m\n", "\u001b[36m 6\u001b[39m\n", "\u001b[36m 7\u001b[39m\n", "\u001b[36m 8\u001b[39m\n", "\u001b[36m 9\u001b[39m\n", "\u001b[36m 10\u001b[39m\n", "\u001b[36m 11\u001b[39m\n", "\u001b[36m 12\u001b[39m\n", "\u001b[36m 13\u001b[39m\n", "\u001b[36m 14\u001b[39m\n", "\u001b[36m 15\u001b[39m\n", "\u001b[36m 16\u001b[39m\n", "\u001b[36m \u001b[39m\n", "\u001b[36m julia> reshape(A, (4, 4))\u001b[39m\n", "\u001b[36m 4×4 Array{Int64,2}:\u001b[39m\n", "\u001b[36m 1 5 9 13\u001b[39m\n", "\u001b[36m 2 6 10 14\u001b[39m\n", "\u001b[36m 3 7 11 15\u001b[39m\n", "\u001b[36m 4 8 12 16\u001b[39m\n", "\u001b[36m \u001b[39m\n", "\u001b[36m julia> reshape(A, 2, :)\u001b[39m\n", "\u001b[36m 2×8 Array{Int64,2}:\u001b[39m\n", "\u001b[36m 1 3 5 7 9 11 13 15\u001b[39m\n", "\u001b[36m 2 4 6 8 10 12 14 16\u001b[39m\n", "\u001b[36m \u001b[39m\n", "\u001b[36m julia> reshape(1:6, 2, 3)\u001b[39m\n", "\u001b[36m 2×3 reshape(::UnitRange{Int64}, 2, 3) with eltype Int64:\u001b[39m\n", "\u001b[36m 1 3 5\u001b[39m\n", "\u001b[36m 2 4 6\u001b[39m" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "?reshape" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "784-element Array{Float32,1}:\n", " 0.0\n", " 0.0\n", " 0.0\n", " 0.0\n", " 0.0\n", " 0.0\n", " 0.0\n", " 0.0\n", " 0.0\n", " 0.0\n", " 0.0\n", " 0.0\n", " 0.0\n", " ⋮\n", " 0.0\n", " 0.0\n", " 0.0\n", " 0.0\n", " 0.0\n", " 0.0\n", " 0.0\n", " 0.0\n", " 0.0\n", " 0.0\n", " 0.0\n", " 0.0" ] }, "execution_count": 25, "metadata": {}, 
"output_type": "execute_result" } ], "source": [ "reshape(Float32.(imgs[1]),:)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Using a list comprehension and `reduce` with `hcat`, we do this for every image in the dataset." ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "Xtrain = reduce(hcat, [reshape(Float32.(imgs[i]),:) for i = 1:length(imgs)]);" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "We can get back the original images by using `reshape` again." ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAHAAAABwCAAAAADji6uXAAAESmlDQ1BrQ0dDb2xvclNwYWNlR2VuZXJpY0dyYXkAADiNjVVbaBxVGP535+wGJA4+aBtaaAcvbSlpmESricXa7Wa7SRM362ZTmyrKZHY2O93ZmXFmdpuEPpWCb1oQpK+C+hgLIlgv2LzYl4rFkko1DwoRWowgKH1S8DtnJpvZDV5mOOd857+d//wXDlHPH5rrWkmFqGEHXr6UmT09e0bpuUlJkqmX8Gm672aKxUmObcc2aNt3/zYl+HrrELe1nf+vX6pi+DrWaxhOxdcbRAmVKF3VXS8g6rkM+vC5wOX4JvDD9XIpC7wOLEe6/Hskb9iGZ+pK3tMWlaLnVE0r7ut/8f/X17Cam+ftxej169MTWA/C54uGPTMNfAB4WddyHPcD326ZpwohTibd4HgplE8ONOszmYh+uuqdmInoF2vNMY4HgJeXauWXgB8CXrPnClOR/EbdmeB2+oikPt3PngF+HFitGeM8Twpw2XNKUxE9qBijOeBngS+bwXg5tC9967emcyFmtFTLFsKz2MBZ7WQReAfwUcPKl0I7rOwGRW5zGHjBtgqToc/siuHnoruz74NaeSyUTyUDr8x1HwXeVzVPjIf+p8Zq3lgp9CcVuJaoraeBl71mid99H/C65uXyoc30AxVtlMf5KeAhOpXQyCCH5jDrZNNfuK9PJrUEcskDr4q9RXlI2Bgedjp4eSCNFoGKMSkDOy4T7hSqYKfQvNDyBeJW7kZWsnvepyaoNdoAtQb0Av0oKAv0EzWwZkFtgjffZTeL1aYleKBEnt2LbDpsJ1PZkxhH2CR7jg2zEVLY8+wYO8pGQR1hR2Lex33n3t1rW3od58Z9X4FEAB0LntnQ8UWkluhP8OtCMhatS7uaB1z3nTcveK+Z+jdv/dYRPR/yod2fYdER9Jju9fOf98Xju8o+eeVW7/XzNBXPkshbpTtLqfXU3dQq5juptbiN1A+pNfx3tt2X+7OZlc3cZsCzBK2BYQqO37bWBA4wV4XOoQ6Lcey07c9jONtOcf4xJhxropZiN6val3a57qsf8GgabxTuF+hCv3pF3VDfU79Tf1VX1XeBfpHelj6WvpCuSp9KN0iRrkkr0pfSV9KH0mfYfQTqinS1q5LmO6unXbN6VGGcG4h8Z2JR4dTN+50Fb8tTQ8Sh84TO6m
+fJR+Xd8uPyaPyXvkJeVI+KB+Wj8k75SGMQXlM3g/O7naUrCgDZlfHmTQrYhXmyRbdpIHfwKzF/AplYzFPPIg4m11dvtn9pujGsDod7DWaATLpnND1RX5s0f3d2kvidCfxMo8g28MG2XjUgxl2GF040dGPw7xL07n0aDpDSvpgeiQ9mD7J8VbtpveDO4I5F/PeaEd2q4fmRJ3WRYxaQsLHTIGxEPBHJuu4i545XwuUIVV9RsngeTWUcVsf6Fc0y1IEy1c8wze8llEZIP52h8/T7y+KNzmx44be9FrRm5VIfE30N7ePkzQTJdzgAAAAOGVYSWZNTQAqAAAACAABh2kABAAAAAEAAAAaAAAAAAACoAIABAAAAAEAAABwoAMABAAAAAEAAABwAAAAAP1Kc4sAAAKASURBVGgF7Zo9aBRBGIbjDxZKotgoBESSIoIosVBBAkGCiJAUQRuFNGqnIVUaOwtFUAsTUqQKpJC0aqXgTywEQTRpFPuonb+IJiTq8+oNrBtvdvcOPvBjXnjYmb3bG773Ze5mdq+lJSk5kBxIDiQHkgPJgeTA/+/AmqolrOOCzZmLztPeCF1wDq7BSfgOV+AiZLU227Fo+x9wfZGNO3jDBjgEPbAFjkNeC5wYg0H4AvMwC3n5t9S8wug83EcA9yE77/KZqP8DTsNXddBb+ACv1cnJvEL/A0Yz3Ir/T6Ejl4O6Ov8RDsMSFOXMW37Lv6XmFUa/S9/j+ij0wwvQd6U0B0dA8243jEBZmVfof8DoPAy5tNHQb9wknIEhuAmNyL+l5hVG52HI6HOt8al2PMtxBvQ7WFXmFfofsNQ8DDltonEHeuEY3IOq8m+peYWVMlRenfActJ55CM9gAn5CGZlX6H/AyhkqJ+0Bp6BVHXQBpuGdOgXyb6l5hQ1lqJj2wHXoUwdpvXMJ3qgTkXmF/gdsOEPFpHs2A6A5qQ96ANpzxOTfUvMKm8owZLVIQwvcZTgKj6CezCv0P2CpvcW/8tjLyROwH8KHvKT9GGLyb6l5hcH+mO1/vabnE8Ogdc32zCsrtLWmKdozmlfof8DS36XK6xTo+dJOyEr7C61nbmdP1mn7t9S8wsIMt5GF7omOw65cLrrvfRVuQdH8C5eaV+h/wLoZ6pmF9gvd0BECqB2fcNS+4i58q50re/BvqXmFqzI8SBh6VnEA2nPBKK8bcBn0zKIRmVfof8BVaxqtVUTQKxq6R6o1i/5noXtszci/peYVNhNHujY5kBxIDiQH/jjwCwxNTGDrgl7+AAAAAElFTkSuQmCC", "text/plain": [ "28×28 Array{Gray{Float32},2} with eltype Gray{Float32}:\n", " Gray{Float32}(0.0) Gray{Float32}(0.0) … Gray{Float32}(0.0)\n", " Gray{Float32}(0.0) Gray{Float32}(0.0) Gray{Float32}(0.0)\n", " Gray{Float32}(0.0) Gray{Float32}(0.0) Gray{Float32}(0.0)\n", " Gray{Float32}(0.0) Gray{Float32}(0.0) Gray{Float32}(0.0)\n", " Gray{Float32}(0.0) Gray{Float32}(0.0) Gray{Float32}(0.0)\n", " Gray{Float32}(0.0) Gray{Float32}(0.0) … Gray{Float32}(0.0)\n", " Gray{Float32}(0.0) Gray{Float32}(0.0) Gray{Float32}(0.0)\n", " Gray{Float32}(0.0) Gray{Float32}(0.0) Gray{Float32}(0.0)\n", " Gray{Float32}(0.0) Gray{Float32}(0.0) Gray{Float32}(0.0)\n", " Gray{Float32}(0.0) Gray{Float32}(0.0) 
Gray{Float32}(0.0)\n", " Gray{Float32}(0.0) Gray{Float32}(0.0) … Gray{Float32}(0.0)\n", " Gray{Float32}(0.0) Gray{Float32}(0.0) Gray{Float32}(0.0)\n", " Gray{Float32}(0.0) Gray{Float32}(0.0) Gray{Float32}(0.0)\n", " ⋮ ⋱ \n", " Gray{Float32}(0.0) Gray{Float32}(0.0) Gray{Float32}(0.0)\n", " Gray{Float32}(0.0) Gray{Float32}(0.0) Gray{Float32}(0.0)\n", " Gray{Float32}(0.0) Gray{Float32}(0.0) Gray{Float32}(0.0)\n", " Gray{Float32}(0.0) Gray{Float32}(0.0) Gray{Float32}(0.0)\n", " Gray{Float32}(0.0) Gray{Float32}(0.0) … Gray{Float32}(0.0)\n", " Gray{Float32}(0.0) Gray{Float32}(0.0) Gray{Float32}(0.0)\n", " Gray{Float32}(0.0) Gray{Float32}(0.0) Gray{Float32}(0.0)\n", " Gray{Float32}(0.0) Gray{Float32}(0.0) Gray{Float32}(0.0)\n", " Gray{Float32}(0.0) Gray{Float32}(0.0) Gray{Float32}(0.0)\n", " Gray{Float32}(0.0) Gray{Float32}(0.0) … Gray{Float32}(0.0)\n", " Gray{Float32}(0.0) Gray{Float32}(0.0) Gray{Float32}(0.0)\n", " Gray{Float32}(0.0) Gray{Float32}(0.0) Gray{Float32}(0.0)" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Gray.(reshape(Xtrain[:,1],(28,28)))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We also convert the labels into vectors. We use [one-hot encoding](https://fluxml.ai/Flux.jl/stable/data/onehot/), that is, we convert the label 0 to the standard basis vector $\\mathbf{e}_1 \\in \\mathbb{R}^{10}$, the label 1 to $\\mathbf{e}_2 \\in \\mathbb{R}^{10}$, and so on. The functions `onehot` and `onehotbatch` perform this transformation, while `onecold` undoes it." 
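, "\n", "\n", "For instance, in a small sketch in Python with NumPy (mirroring what `onehotbatch` and `onecold` do, separate from the Julia code in this notebook):\n", "\n", "```python\n", "import numpy as np\n", "\n", "labels = [5, 0, 4]                        # first few MNIST digits\n", "Y = np.zeros((10, len(labels)))           # one column per sample\n", "Y[labels, np.arange(len(labels))] = 1.0   # digit d -> basis vector e_(d+1)\n", "assert list(Y.argmax(axis=0)) == labels   # argmax per column undoes the encoding\n", "```"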
] }, { "cell_type": "code", "execution_count": 28, "metadata": { "scrolled": true, "slideshow": { "slide_type": "skip" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "search:\n", "\n" ] }, { "data": { "text/latex": [ "\\begin{verbatim}\n", "onehotbatch(ls, labels[, unk...])\n", "\\end{verbatim}\n", "Return a \\texttt{OneHotMatrix} where \\texttt{k}th column of the matrix is \\texttt{onehot(ls[k], labels)}.\n", "\n", "If one of the input labels \\texttt{ls} is not found in \\texttt{labels} and \\texttt{unk} is given, return \\href{@ref}{\\texttt{onehot(unk, labels)}} ; otherwise the function will raise an error.\n", "\n", "\\section{Examples}\n", "\\begin{verbatim}\n", "julia> Flux.onehotbatch([:b, :a, :b], [:a, :b, :c])\n", "3×3 Flux.OneHotMatrix{Array{Flux.OneHotVector,1}}:\n", " 0 1 0\n", " 1 0 1\n", " 0 0 0\n", "\\end{verbatim}\n" ], "text/markdown": [ "\n", "onehotbatch(ls, labels[, unk...])\n", "\n", "\n", "Return a OneHotMatrix where kth column of the matrix is onehot(ls[k], labels).\n", "\n", "If one of the input labels ls is not found in labels and unk is given, return [onehot(unk, labels)](@ref) ; otherwise the function will raise an error.\n", "\n", "# Examples\n", "\n", "jldoctest\n", "julia> Flux.onehotbatch([:b, :a, :b], [:a, :b, :c])\n", "3×3 Flux.OneHotMatrix{Array{Flux.OneHotVector,1}}:\n", " 0 1 0\n", " 1 0 1\n", " 0 0 0\n", "\n" ], "text/plain": [ "\u001b[36m onehotbatch(ls, labels[, unk...])\u001b[39m\n", "\n", " Return a \u001b[36mOneHotMatrix\u001b[39m where \u001b[36mk\u001b[39mth column of the matrix is \u001b[36monehot(ls[k],\n", " labels)\u001b[39m.\n", "\n", " If one of the input labels \u001b[36mls\u001b[39m is not found in \u001b[36mlabels\u001b[39m and \u001b[36munk\u001b[39m is given,\n", " return \u001b[36monehot(unk, labels)\u001b[39m ; otherwise the function will raise an error.\n", "\n", "\u001b[1m Examples\u001b[22m\n", "\u001b[1m ≡≡≡≡≡≡≡≡≡≡\u001b[22m\n", "\n", "\u001b[36m julia> Flux.onehotbatch([:b, :a, 
:b], [:a, :b, :c])\u001b[39m\n", "\u001b[36m 3×3 Flux.OneHotMatrix{Array{Flux.OneHotVector,1}}:\u001b[39m\n", "\u001b[36m 0 1 0\u001b[39m\n", "\u001b[36m 1 0 1\u001b[39m\n", "\u001b[36m 0 0 0\u001b[39m" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "?onehotbatch" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "search: SkipC\u001b[0m\u001b[1mo\u001b[22m\u001b[0m\u001b[1mn\u001b[22mn\u001b[0m\u001b[1me\u001b[22m\u001b[0m\u001b[1mc\u001b[22mti\u001b[0m\u001b[1mo\u001b[22mn Exp\u001b[0m\u001b[1mo\u001b[22m\u001b[0m\u001b[1mn\u001b[22m\u001b[0m\u001b[1me\u001b[22mntialBa\u001b[0m\u001b[1mc\u001b[22mk\u001b[0m\u001b[1mO\u001b[22mff c\u001b[0m\u001b[1mo\u001b[22mmpo\u001b[0m\u001b[1mn\u001b[22m\u001b[0m\u001b[1me\u001b[22mnt_\u001b[0m\u001b[1mc\u001b[22mentr\u001b[0m\u001b[1mo\u001b[22mids\n", "\n" ] }, { "data": { "text/latex": [ "\\begin{verbatim}\n", "onecold(y[, labels = 1:length(y)])\n", "\\end{verbatim}\n", "Inverse operations of \\href{@ref}{\\texttt{onehot}}.\n", "\n", "\\section{Examples}\n", "\\begin{verbatim}\n", "julia> Flux.onecold([true, false, false], [:a, :b, :c])\n", ":a\n", "\n", "julia> Flux.onecold([0.3, 0.2, 0.5], [:a, :b, :c])\n", ":c\n", "\\end{verbatim}\n" ], "text/markdown": [ "\n", "onecold(y[, labels = 1:length(y)])\n", "\n", "\n", "Inverse operations of [onehot](@ref).\n", "\n", "# Examples\n", "\n", "jldoctest\n", "julia> Flux.onecold([true, false, false], [:a, :b, :c])\n", ":a\n", "\n", "julia> Flux.onecold([0.3, 0.2, 0.5], [:a, :b, :c])\n", ":c\n", "\n" ], "text/plain": [ "\u001b[36m onecold(y[, labels = 1:length(y)])\u001b[39m\n", "\n", " Inverse operations of \u001b[36monehot\u001b[39m.\n", "\n", "\u001b[1m Examples\u001b[22m\n", "\u001b[1m ≡≡≡≡≡≡≡≡≡≡\u001b[22m\n", "\n", "\u001b[36m julia> Flux.onecold([true, false, false], [:a, :b, :c])\u001b[39m\n", "\u001b[36m 
:a\u001b[39m\n", "\u001b[36m \u001b[39m\n", "\u001b[36m julia> Flux.onecold([0.3, 0.2, 0.5], [:a, :b, :c])\u001b[39m\n", "\u001b[36m :c\u001b[39m" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "?onecold" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "For example, on the first label we get:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "10-element Flux.OneHotVector:\n", " 0\n", " 0\n", " 0\n", " 0\n", " 0\n", " 1\n", " 0\n", " 0\n", " 0\n", " 0" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "onehot(labels[1], 0:9)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "5" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "onecold(ans, 0:9)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "We do this for all labels simultaneously." ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "ytrain = onehotbatch(labels, 0:9);" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "We will also use a test dataset provided in MNIST to assess the accuracy of our classifiers. We perform the same transformation."
] }, { "cell_type": "code", "execution_count": 33, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "10000" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_imgs = MNIST.images(:test)\n", "test_labels = MNIST.labels(:test)\n", "length(test_labels)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "Xtest = reduce(hcat, \n", " [reshape(Float32.(test_imgs[i]),:) for i = 1:length(test_imgs)])\n", "ytest = onehotbatch(test_labels, 0:9);" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "#### 6.2.4 Implementation" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We appeal to [multinomial logistic regression](https://en.wikipedia.org/wiki/Multinomial_logistic_regression) to learn a classifier for the MNIST data. Here the model takes the form of an affine map from $\\mathbb{R}^{784}$ to $\\mathbb{R}^{10}$ (where $784 = 28^2$ is the size of the images in vector form and $10$ is the dimension of the one-hot encoding of the labels) composed with the [softmax](https://en.wikipedia.org/wiki/Softmax_function) function which returns a probability distribution over the $10$ labels. The loss function is the [cross-entropy](https://en.wikipedia.org/wiki/Cross_entropy#Cross-entropy_loss_function_and_logistic_regression), as we explained above." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "In Flux, composition of functions can be achieved with Chain." 
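] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Conceptually, Chain(f, g)(x) computes g(f(x)), that is, the functions are composed and applied left to right. A minimal sketch of the idea (the helper chain_apply is ours, for illustration only):\n", "\n", "```julia\n", "chain_apply(x, fs...) = foldl((z, f) -> f(z), fs; init=x)\n", "chain_apply(5, x -> x^2, x -> x + 1)  # 26\n", "```"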
] }, { "cell_type": "code", "execution_count": 35, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "search: \u001b[0m\u001b[1mC\u001b[22m\u001b[0m\u001b[1mh\u001b[22m\u001b[0m\u001b[1ma\u001b[22m\u001b[0m\u001b[1mi\u001b[22m\u001b[0m\u001b[1mn\u001b[22m bat\u001b[0m\u001b[1mc\u001b[22m\u001b[0m\u001b[1mh\u001b[22med_\u001b[0m\u001b[1ma\u001b[22mdjo\u001b[0m\u001b[1mi\u001b[22m\u001b[0m\u001b[1mn\u001b[22mt \u001b[0m\u001b[1mc\u001b[22m\u001b[0m\u001b[1mh\u001b[22m\u001b[0m\u001b[1ma\u001b[22mnnelv\u001b[0m\u001b[1mi\u001b[22mew\n", "\n" ] }, { "data": { "text/latex": [ "\\begin{verbatim}\n", "Chain(layers...)\n", "\\end{verbatim}\n", "Chain multiple layers / functions together, so that they are called in sequence on a given input.\n", "\n", "\\texttt{Chain} also supports indexing and slicing, e.g. \\texttt{m} or \\texttt{m[1:end-1]}. \\texttt{m[1:3](x)} will calculate the output of the first three layers.\n", "\n", "\\section{Examples}\n", "\\begin{verbatim}\n", "julia> m = Chain(x -> x^2, x -> x+1);\n", "\n", "julia> m(5) == 26\n", "true\n", "\n", "julia> m = Chain(Dense(10, 5), Dense(5, 2));\n", "\n", "julia> x = rand(10);\n", "\n", "julia> m(x) == m(m(x))\n", "true\n", "\\end{verbatim}\n" ], "text/markdown": [ "\n", "Chain(layers...)\n", "\n", "\n", "Chain multiple layers / functions together, so that they are called in sequence on a given input.\n", "\n", "Chain also supports indexing and slicing, e.g. m or m[1:end-1]. 
m[1:3](x) will calculate the output of the first three layers.\n", "\n", "# Examples\n", "\n", "jldoctest\n", "julia> m = Chain(x -> x^2, x -> x+1);\n", "\n", "julia> m(5) == 26\n", "true\n", "\n", "julia> m = Chain(Dense(10, 5), Dense(5, 2));\n", "\n", "julia> x = rand(10);\n", "\n", "julia> m(x) == m(m(x))\n", "true\n", "\n" ], "text/plain": [ "\u001b[36m Chain(layers...)\u001b[39m\n", "\n", " Chain multiple layers / functions together, so that they are called in\n", " sequence on a given input.\n", "\n", " \u001b[36mChain\u001b[39m also supports indexing and slicing, e.g. \u001b[36mm\u001b[39m or \u001b[36mm[1:end-1]\u001b[39m. \u001b[36mm[1:3](x)\u001b[39m\n", " will calculate the output of the first three layers.\n", "\n", "\u001b[1m Examples\u001b[22m\n", "\u001b[1m ≡≡≡≡≡≡≡≡≡≡\u001b[22m\n", "\n", "\u001b[36m julia> m = Chain(x -> x^2, x -> x+1);\u001b[39m\n", "\u001b[36m \u001b[39m\n", "\u001b[36m julia> m(5) == 26\u001b[39m\n", "\u001b[36m true\u001b[39m\n", "\u001b[36m \u001b[39m\n", "\u001b[36m julia> m = Chain(Dense(10, 5), Dense(5, 2));\u001b[39m\n", "\u001b[36m \u001b[39m\n", "\u001b[36m julia> x = rand(10);\u001b[39m\n", "\u001b[36m \u001b[39m\n", "\u001b[36m julia> m(x) == m(m(x))\u001b[39m\n", "\u001b[36m true\u001b[39m" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "?Chain" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Hence our model is:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "Chain(Dense(784, 10), softmax)" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m = Chain(\n", " Dense(28^2, 10), \n", " softmax\n", ")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "At initialization, the parameters are set randomly.\n", "\n", "For example, on the 
first sample, we get the following probability distribution over labels:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "10-element Array{Float32,1}:\n", " 0.11419689\n", " 0.08392158\n", " 0.1407485\n", " 0.032359276\n", " 0.11348817\n", " 0.10049408\n", " 0.1500063\n", " 0.053701017\n", " 0.12868376\n", " 0.08240051" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m(Xtrain[:,1])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "We also define a function which computes the accuracy of the predictions." ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "accuracy(x, y) = mean(onecold(m(x), 0:9) .== onecold(y, 0:9));" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "With random initialization, the current accuracy on the test dataset is poor: about $5\%$ here, comparable to a purely random guess among $10$ choices (which achieves $10\%$ accuracy in expectation)." ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "0.052" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "accuracy(Xtest, ytest)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "We are now ready to make mini-batches and set the parameters of the optimizer. 
" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "loader = DataLoader(Xtrain, ytrain; batchsize=128, shuffle=true);" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "loss(x, y) = crossentropy(m(x), y)\n", "ps = params(m)\n", "opt = ADAM()\n", "evalcb = () -> @show(accuracy(Xtest,ytest));" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "We run ADAM for $10$ epochs. You can check for yourself that running it much longer does not lead to much improvement." ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "accuracy(Xtest, ytest) = 0.0608\n", "accuracy(Xtest, ytest) = 0.9254\n", " 8.873735 seconds (18.12 M allocations: 5.078 GiB, 4.46% gc time)\n" ] } ], "source": [ "@time train!(loss, ps, ncycle(loader, Int(1e1)), opt, cb = throttle(evalcb, 2))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "The final accuracy is:" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "0.9257" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "accuracy(Xtest, ytest)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "To make a prediction, we use m(x) which returns a probability distribution over the $10$ labels. The function onecold then returns the label with highest probability." 
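] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "The two steps can be bundled into a small helper (the name predict is ours, not part of Flux):\n", "\n", "```julia\n", "predict(x) = onecold(m(x), 0:9)\n", "predict(Xtest[:,1])  # 7 for the model trained above\n", "```"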
] }, { "cell_type": "code", "execution_count": 44, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "10-element Array{Float32,1}:\n", " 1.4131784f-5\n", " 2.004657f-10\n", " 2.7680435f-5\n", " 0.00446238\n", " 8.006697f-7\n", " 2.7219783f-5\n", " 1.4060293f-9\n", " 0.9950832\n", " 2.7037542f-5\n", " 0.00035753092" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m(Xtest[:,1])" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "7" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "onecold(ans, 0:9)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "The true label in that case was:" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "7" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "onecold(ytest[:,1], 0:9)" ] } ], "metadata": { "kernelspec": { "display_name": "Julia 1.5.1", "language": "julia", "name": "julia-1.5" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "1.5.1" } }, "nbformat": 4, "nbformat_minor": 2 }