{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# TOPIC 0 \n", "\n", "# Introduction\n", "\n", "## 3 Further observations\n", "\n", "***\n", "*Course:* [Math 535](http://www.math.wisc.edu/~roch/mmids/) - Mathematical Methods in Data Science (MMiDS) \n", "*Author:* [Sebastien Roch](http://www.math.wisc.edu/~roch/), Department of Mathematics, University of Wisconsin-Madison \n", "*Updated:* Sep 1, 2020 \n", "*Copyright:* © 2020 Sebastien Roch\n", "***" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "We make a few more observations that will hint at things to come in subsequent topics." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### 3.1 Matrix form of $k$-means clustering" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The $k$-means clustering objective can be written in matrix form. We will need a notion of matrix norm. A natural way to define a norm for matrices is to notice that an $n \\times m$ matrix $A$ can be thought of as a vector with $nm$ entries, one for each entry of $A$. Indeed, addition and scalar multiplication work in exactly the same way. Hence, we can define the $2$-norm of a matrix in terms of the sum of its squared entries.\n", "\n", "**Definition (Frobenius Norm):** The Frobenius norm of an $n \\times m$ matrix $A \\in \\mathbb{R}^{n \\times m}$ is defined as\n", "\n", "$$\n", "\\|A\\|_F\n", "= \\sqrt{\\sum_{i=1}^n \\sum_{j=1}^m a_{ij}^2}.\n", "$$\n", "\n", "$\\lhd$\n", "\n", "We will encounter other matrix norms later in the course." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "As we indicated before, for a collection of $n$ data vectors $\\mathbf{x}_1, \\ldots, \\mathbf{x}_n$ in $\\mathbb{R}^d$, it is often convenient to stack them up into a matrix\n", "\n", "$$\n", "X =\n", "\\begin{bmatrix}\n", "\\mathbf{x}_1^T \\\\\n", "\\mathbf{x}_2^T \\\\\n", "\\vdots \\\\\n", "\\mathbf{x}_n^T \\\\\n", "\\end{bmatrix}\n", "=\n", "\\begin{bmatrix}\n", "x_{11} & x_{12} & \\cdots & x_{1d} \\\\\n", "x_{21} & x_{22} & \\cdots & x_{2d} \\\\\n", "\\vdots & \\vdots & \\ddots & \\vdots \\\\\n", "x_{n1} & x_{n2} & \\cdots & x_{nd} \\\\\n", "\\end{bmatrix}.\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We can do the same with cluster representatives. Given $\\boldsymbol{\\mu}_1,\\ldots,\\boldsymbol{\\mu}_k$ also in $\\mathbb{R}^d$, we form the matrix \n", "\n", "$$\n", "U =\n", "\\begin{bmatrix}\n", "\\boldsymbol{\\mu}_1^T \\\\\n", "\\boldsymbol{\\mu}_2^T \\\\\n", "\\vdots \\\\\n", "\\boldsymbol{\\mu}_k^T \\\\\n", "\\end{bmatrix}\n", "=\n", "\\begin{bmatrix}\n", "\\mu_{11} & \\mu_{12} & \\cdots & \\mu_{1d} \\\\\n", "\\mu_{21} & \\mu_{22} & \\cdots & \\mu_{2d} \\\\\n", "\\vdots & \\vdots & \\ddots & \\vdots \\\\\n", "\\mu_{k1} & \\mu_{k2} & \\cdots & \\mu_{kd} \\\\\n", "\\end{bmatrix}.\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Perhaps less obviously, cluster assignments can also be encoded in matrix form. Recall that, given a partition $C_1,\\ldots,C_k$ of $[n]$, we define $c(i) = j$ if $i \\in C_j$. For $i=1,\\ldots,n$ and $j=1,\\ldots,k$, set $z_{ij} = 1$ if $c(i) = j$ and $0$ otherwise, and let $Z$ be the $n \\times k$ matrix with entries $z_{ij}$. 
That is, row $i$ has exactly one entry with value $1$, corresponding to the assigned cluster $c(i)$ of data point $\\mathbf{x}_i$, and all other entries $0$.\n", "\n", "With this notation, the representative of the cluster assigned to data point $\\mathbf{x}_i$ is obtained through the matrix product\n", "\n", "$$\n", "\\boldsymbol{\\mu}_{c(i)}^T \n", "= \\sum_{j = 1}^k z_{ij} \\boldsymbol{\\mu}_{j}^T\n", "= \\left(Z U\\right)_{i,\\cdot}.\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "So\n", "\n", "\n", "\\begin{align*}\n", "G(C_1,\\ldots,C_k; \\boldsymbol{\\mu}_1, \\ldots, \\boldsymbol{\\mu}_k)\n", "&= \\sum_{i=1}^n \\|\\mathbf{x}_i - \\boldsymbol{\\mu}_{c(i)}\\|^2\\\\\n", "&= \\sum_{i=1}^n \\sum_{\\ell=1}^d \\left(x_{i,\\ell} - (Z U)_{i,\\ell}\\right)^2\\\\\n", "&= \\|X - Z U \\|^2_F,\n", "\\end{align*}\n", "\n", "\n", "where we used the definition of the Frobenius norm." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "In other words, minimizing the $k$-means objective is equivalent to finding a matrix factorization of the form $ZU$ that is a good fit to the data matrix $X$ in Frobenius norm. This formulation expresses more compactly the idea of representing $X$ as a combination of a small number of representatives. Matrix factorization will come back repeatedly in this course. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "**NUMERICAL CORNER** In Julia, the Frobenius norm of a matrix can be computed using the function [norm](https://docs.julialang.org/en/v1/stdlib/LinearAlgebra/#LinearAlgebra.norm)." 
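, "\n", "\n", "The matrix form of the $k$-means objective can also be checked numerically. Below is a minimal sketch with made-up data (the small matrices `X`, `U`, `Z` are hypothetical, chosen only for illustration): the objective computed point by point agrees with the squared Frobenius norm of $X - ZU$.\n", "\n", "```julia\n", "using LinearAlgebra\n", "\n", "# made-up data: n=3 points in d=2, k=2 clusters\n", "X = [0. 0.; 1. 0.; 10. 10.]   # data matrix (one point per row)\n", "U = [0.5 0.; 10. 10.]         # cluster representatives (one per row)\n", "Z = [1 0; 1 0; 0 1]           # assignment matrix: z_ij = 1 iff c(i) = j\n", "c = [1, 1, 2]                 # cluster assigned to each point\n", "\n", "# objective point by point vs. in matrix form\n", "G = sum(norm(X[i,:] - U[c[i],:])^2 for i=1:3)\n", "G ≈ norm(X - Z*U)^2           # the two agree\n", "```"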
] }, { "cell_type": "code", "execution_count": 1, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "# Julia version: 1.5.1\n", "using Plots, LinearAlgebra, Statistics" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [ { "data": { "text/plain": [ "3×2 Array{Float64,2}:\n", " 1.0 0.0\n", " 0.0 1.0\n", " 0.0 0.0" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A = [1. 0.; 0. 1.; 0. 0.]" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [ { "data": { "text/plain": [ "1.4142135623730951" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "norm(A)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### 3.2 High-dimensional space\n", "\n", "
> *a high-dimensional space is a lonely place*\n", "\n", "> — Bernhard Schölkopf (@bschoelkopf), August 24, 2014
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Applying Chebyshev's inequality to sums of independent random variables has useful statistical implications: it shows that, with a large enough number of samples $n$, the sample mean is close to the population mean. Hence it allows us to infer properties of a population from samples. Interestingly, one can apply a similar argument to a different asymptotic regime: the limit of large dimension $d$. But as we will see in this section, the statistical implications are quite different." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### 3.2.1 High-dimensional cube" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "To start explaining the quote above, we consider a simple experiment. Let $\\mathcal{C} = [-1/2,1/2]^d$ be the $d$-cube with side length $1$ centered at the origin and let $\\mathcal{B} = \\{\\mathbf{x} \\in \\mathbb{R}^d : \\|\\mathbf{x}\\|\\leq 1/2\\}$ be the inscribed $d$-ball. In $d=2$ dimensions:\n", "\n", "![In the plane](https://media.geeksforgeeks.org/wp-content/uploads/square-3.png)\n", "([Source](https://www.geeksforgeeks.org/program-to-calculate-area-of-an-circle-inscribed-in-a-square/))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Now pick a point $\\mathbf{X}$ uniformly at random in $\\mathcal{C}$. What is the probability that it falls in $\\mathcal{B}$? \n", "\n", "To generate $\\mathbf{X}$, we pick $d$ independent random variables $X_1, \\ldots, X_d \\sim \\mathrm{U}[-1/2, 1/2]$, and form the vector $\\mathbf{X} = (X_1, \\ldots, X_d)$. Indeed, the PDF of $\\mathbf{X}$ is then $f_{\\mathbf{X}}(\\mathbf{x})= 1^d = 1$ if $\\mathbf{x} \\in \\mathcal{C}$ and $0$ otherwise.\n", "\n", "The event we are interested in is $A = \\left\\{\\|\\mathbf{X}\\| \\leq 1/2\\right\\}$. 
The uniform distribution over the set $\\mathcal{C}$ has the property that $\\mathbb{P}[A]$ is the volume of $A$ divided by the volume of $\\mathcal{C}$. In this case, the volume of $\\mathcal{C}$ is $1^d = 1$ and the volume of $A$ has an [explicit formula](https://en.wikipedia.org/wiki/Volume_of_an_n-ball)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "This leads to the following surprising fact:\n", "\n", "***\n", "**Theorem (High-dimensional Cube)** Let \n", "$\\mathcal{B} = \\{\\mathbf{x} \\in \\mathbb{R}^d \\,:\\, \\|\\mathbf{x}\\|\\leq 1/2\\}$ and\n", "$\\mathcal{C} = [-1/2,1/2]^d$. Pick $\\mathbf{X} \\sim \\mathrm{U}[\\mathcal{C}]$. Then, as $d \\to +\\infty$,\n", "\n", "$$\n", "\\mathbb{P}[\\mathbf{X} \\in \\mathcal{B}]\n", "\\to 0.\n", "$$\n", "***\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "In words, in high dimension if one picks a point at random from the cube, it is unlikely to be close to the origin. Instead it is likely to be in the corners. A geometric interpretation is that a high-dimensional cube is a bit like a spiky ball. A visualization of this theorem:\n", "![Curse of dimensionality](http://www.visiondummy.com/wp-content/uploads/2014/04/sparseness.png)\n", "([Source](https://www.visiondummy.com/2014/04/curse-dimensionality-affect-classification/))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We give a proof based on Chebyshev's inequality. It has the advantage of providing some insight into this counter-intuitive phenomenon by linking it to the concentration of sums of independent random variables, in this case the squared norm of $\\mathbf{X}$.\n", "\n", "*Proof idea:* We think of $\\|\\mathbf{X}\\|^2$ as a sum of independent random variables and apply Chebyshev's inequality. 
It implies that the squared norm of $\\mathbf{X}$ is concentrated around its mean, which grows linearly with $d$. Hence the norm itself is typically of order $\\sqrt{d}$, which is larger than $1/2$ for $d$ large." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "*Proof:* Write out $\\|\\mathbf{X}\\|^2\n", "= \\sum_{i=1}^d X_i^2$. Using linearity of expectation and the fact that the $X_i$'s are independent, we get\n", "\n", "$$\n", "\\mathbb{E}\\left[\n", "\\|\\mathbf{X}\\|^2\n", "\\right]\n", "= \\sum_{i=1}^d \\mathbb{E}[X_i^2]\n", "= d \\,\\mathbb{E}[X_1^2]\n", "$$\n", "\n", "and\n", "\n", "$$\n", "\\mathrm{Var}\\left[\n", "\\|\\mathbf{X}\\|^2\n", "\\right]\n", "= \\sum_{i=1}^d \\mathrm{Var}[X_i^2]\n", "= d \\,\\mathrm{Var}[X_1^2].\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We bound the probability of interest as follows. We first square the norm and center around the mean:\n", "\n", "\n", "\\begin{align}\n", "\\mathbb{P}\n", "\\left[\n", "\\|\\mathbf{X}\\| \\leq 1/2\n", "\\right]\n", "&= \n", "\\mathbb{P}\n", "\\left[\n", "\\|\\mathbf{X}\\|^2 \\leq 1/4\n", "\\right]\\\\\n", "&=\n", "\\mathbb{P}\n", "\\left[\n", "\\|\\mathbf{X}\\|^2 \n", "- \\mathbb{E}[\\|\\mathbf{X}\\|^2]\\leq 1/4 - d \\,\\mathbb{E}[X_1^2]\n", "\\right].\n", "\\end{align}\n", "\n", "\n", "Now notice that $\\mathbb{E}[X_1^2] > 0$ does not depend on $d$. Take $d$ large enough\n", "that $d \\,\\mathbb{E}[X_1^2] > 1/4$. We then use the following fact: if $\\alpha = d \\,\\mathbb{E}[X_1^2] - 1/4 > 0$ and $Z = \\|\\mathbf{X}\\|^2 - \\mathbb{E}[\\|\\mathbf{X}\\|^2]$, we can write by monotonicity and the definition of the absolute value \n", "\n", "$$\n", "\\mathbb{P}[Z \\leq -\\alpha] \\leq \\mathbb{P}[Z \\leq -\\alpha\\text{ or }Z \\geq \\alpha] = \\mathbb{P}[|Z| \\geq \\alpha]. 
\n", "$$\n", "\n", "We arrive at\n", "\n", "$$\n", "\\mathbb{P}\n", "\\left[\n", "\\|\\mathbf{X}\\| \\leq 1/2\n", "\\right]\n", "\\leq \n", "\\mathbb{P}\n", "\\left[\n", "\\left|\n", "\\|\\mathbf{X}\\|^2 \n", "- \\mathbb{E}[\\|\\mathbf{X}\\|^2]\n", "\\right|\n", "\\geq d \\,\\mathbb{E}[X_1^2] - 1/4\n", "\\right].\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We can now apply Chebyshev's inequality to the right-hand side, which gives\n", "\n", "\n", "\\begin{align}\n", "\\mathbb{P}\n", "\\left[\n", "\\|\\mathbf{X}\\| \\leq 1/2\n", "\\right]\n", "&\\leq\n", "\\frac{\\mathrm{Var}\\left[\n", "\\|\\mathbf{X}\\|^2\n", "\\right]}\n", "{(d \\,\\mathbb{E}[X_1^2] - 1/4)^2}\\\\\n", "&=\n", "\\frac{d \\,\\mathrm{Var}[X_1^2]}\n", "{(d \\,\\mathbb{E}[X_1^2] - 1/4)^2}\\\\\n", "&=\n", "\\frac{1}{d} \\cdot \\frac{\\mathrm{Var}[X_1^2]}\n", "{(\\mathbb{E}[X_1^2] - 1/(4d))^2}\n", ".\n", "\\end{align}\n", "\n", "\n", "Again, $\\mathrm{Var}[X_1^2]$ does not depend on $d$. So the right-hand side goes to $0$ as $d \\to +\\infty$, as claimed.$\\square$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "We will see later in the course that this high-dimensional phenomenon has implications for data science problems. It is behind what is referred to as the [Curse of Dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "**NUMERICAL CORNER** We can check the theorem in a simulation. Here we pick $n$ points uniformly at random in the $d$-cube $\\mathcal{C}$, for a range of dimensions $[d_\\min, d_\\max]$. We then plot the frequency of landing in the inscribed $d$-ball $\\mathcal{B}$ and see that it rapidly converges to $0$. Alternatively, we could just plot the formula for the volume of $\\mathcal{B}$. 
But knowing how to do simulations is useful in situations where explicit formulas are unavailable or intractable." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "scrolled": false, "slideshow": { "slide_type": "skip" } }, "outputs": [ { "data": { "text/plain": [ "highdim_cube (generic function with 1 method)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "function highdim_cube(dmax, n)\n", " in_ball = zeros(Float64, dmax) # in-ball freq\n", " for d=1:dmax # for each dimension \n", " in_ball[d] = mean([(norm(rand(d) .- 1/2) < 1/2) for i=1:n])\n", " end\n", " plot(1:dmax, in_ball, \n", " legend=false, markershape=:diamond, xlabel=\"dim\", ylabel=\"in-ball freq\") \n", "end" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "scrolled": false, "slideshow": { "slide_type": "skip" } }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "highdim_cube(10, 1000)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "#### 3.2.2 Gaussians in high dimension [optional]\n", "\n", "In this optional section, we turn our attention to the [Gaussian (or Normal) distribution](https://en.wikipedia.org/wiki/Normal_distribution) and its behavior in high dimension. \n", "Using Chebyshev's inequality, we show that a standard Normal vector has the following counter-intuitive property in high dimension: a typical draw has $2$-norm that is highly likely to be around $\\sqrt{d}$. 
Visually, when $d$ is large, the joint PDF looks something like this:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "![Spherical shell](https://upload.wikimedia.org/wikipedia/commons/thumb/0/07/Kugelschale.svg/640px-Kugelschale.svg.png)\n", "([Source](https://en.wikipedia.org/wiki/Spherical_shell))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "This is unexpected because the joint PDF is maximized at $\\mathbf{x} = \\mathbf{0}$ for all $d$ (including $d=1$ as can be seen in the figure above). But the rough intuition is the following: (1) there is only \"one way\" to obtain $\\|\\mathbf{X}\\|^2 = 0$ -- every coordinate must be $0$ by the point-separating property of the $2$-norm; (2) on the other hand, there are \"many ways\" to obtain $\\|\\mathbf{X}\\| = \\sqrt{d}$ -- and that compensates for the lower density." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "***\n", "**Theorem (High-dimensional Gaussians)** Let $\\mathbf{X}$ be a standard Normal $d$-vector. Then, for any $\\varepsilon > 0$,\n", "\n", "$$\n", "\\mathbb{P}\n", "\\left[\\,\n", "\\|\\mathbf{X}\\| \\notin (\\sqrt{d(1-\\varepsilon)}, \\sqrt{d(1+\\varepsilon)})\n", "\\,\\right]\n", "\\to 0\n", "$$\n", "\n", "as $d \\to +\\infty$.\n", "\n", "***\n", "\n", "*Proof idea:* We apply Chebyshev's inequality to the squared norm, which is a sum of independent random variables." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "*Proof:* Let $Z = \\|\\mathbf{X}\\|^2 = \\sum_{i=1}^d X_i^2$ and notice that, by definition, it is a sum of independent random variables. 
Appealing to the expectation and variance formulas from the previous sections:\n", "\n", "$$\n", "\\mathbb{E}[\\|\\mathbf{X}\\|^2] = d \\,\\mathbb{E}[X_1^2] = d \\,\\mathrm{Var}[X_1] = d\n", "$$\n", "\n", "and\n", "\n", "$$\n", "\\mathrm{Var}[\\|\\mathbf{X}\\|^2] = d \\,\\mathrm{Var}[X_1^2]\n", "$$\n", "\n", "where $\\mathrm{Var}[X_1^2]$ does not depend on $d$. By Chebyshev's inequality\n", "\n", "$$\n", "\\mathbb{P}\\left[\n", "\\|\\mathbf{X}\\|^2 \\notin \n", "\\left(\n", "d(1-\\varepsilon),d(1+\\varepsilon)\n", "\\right)\n", "\\right]\n", "=\n", "\\mathbb{P}[|\\|\\mathbf{X}\\|^2 - d| \\geq \\varepsilon d]\n", "\\leq \\frac{d \\,\\mathrm{Var}[X_1^2]}{\\varepsilon^2 d^2}\n", "= \\frac{\\mathrm{Var}[X_1^2]}{d \\varepsilon^2}.\n", "$$\n", "\n", "Taking a square root inside the probability on the leftmost side and taking a limit as $d \\to +\\infty$ on the rightmost side gives the claim.$\\square$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "**NUMERICAL CORNER** We check our claim in a simulation. We generate standard Normal $d$-vectors using the [randn](https://docs.julialang.org/en/v1/stdlib/Random/#Base.randn) function and plot the histogram of their $2$-norm." 
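, "\n", "\n", "We can also check the concentration claim directly, without a plot: for large $d$, the norms should cluster tightly around $\\sqrt{d}$. A quick sketch (the values of `d` and `n` below are arbitrary choices for illustration):\n", "\n", "```julia\n", "using LinearAlgebra, Statistics\n", "\n", "d, n = 1000, 1000\n", "norms = [norm(randn(d)) for i=1:n]  # n sampled 2-norms\n", "mean(norms) / sqrt(d)               # close to 1\n", "std(norms)                          # stays bounded as d grows\n", "```"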
] }, { "cell_type": "code", "execution_count": 6, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [ { "data": { "text/plain": [ "normal_shell (generic function with 1 method)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "function normal_shell(d, n)\n", " one_sample_norm = [norm(randn(d)) for i=1:n]\n", " histogram(one_sample_norm, \n", " legend=false, xlims=(0,maximum(one_sample_norm)), nbin=20)\n", "end" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normal_shell(1, 10000)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "In higher dimension:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "scrolled": false, "slideshow": { "slide_type": "skip" } }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normal_shell(100, 10000)" ] } ], "metadata": { "kernelspec": { "display_name": "Julia 1.5.1", "language": "julia", "name": "julia-1.5" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "1.5.1" } }, "nbformat": 4, "nbformat_minor": 2 }