{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# TOPIC 2 \n", "\n", "# Spectral and singular value decompositions\n", "\n", "## 2 Matrix norms and low-rank approximations\n", "\n", "***\n", "*Course:* [Math 535](http://www.math.wisc.edu/~roch/mmids/) - Mathematical Methods in Data Science (MMiDS) \n", "*Author:* [Sebastien Roch](http://www.math.wisc.edu/~roch/), Department of Mathematics, University of Wisconsin-Madison \n", "*Updated:* Sep 28, 2020 \n", "*Copyright:* © 2020 Sebastien Roch\n", "***" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "In this section, we discuss low-rank approximations of matrices. We first introduce matrix norms which allow us in particular to talk about the distance between two matrices. Throughout this section, $\\|\\mathbf{x}\\|$ refers to the $2$-norm of $\\mathbf{x}$ (although the concepts we derive below can be extended to [other vector norms](https://en.wikipedia.org/wiki/Matrix_norm#Matrix_norms_induced_by_vector_norms))." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### 2.1 Matrix norms\n", "\n", "Recall that the Frobenius norm of an $n \\times m$ matrix $A \\in \\mathbb{R}^{n \\times m}$ is defined as\n", "\n", "$$\n", "\\|A\\|_F\n", "= \\sqrt{\\sum_{i=1}^n \\sum_{j=1}^m a_{ij}^2}.\n", "$$\n", "\n", "The Frobenius norm does not directly relate to $A$ as a representation of a [linear map](https://en.wikipedia.org/wiki/Linear_map). In particular, it is desirable in many contexts to quantify how two matrices differ in terms of how they act on vectors." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "For instance, one is often interested in bounding quantities of the following form. Let $B, B' \\in \\mathbb{R}^{n \\times m}$ and let $\\mathbf{x} \\in \\mathbb{R}^m$ be of unit norm. 
What can be said about $\\|B \\mathbf{x} - B' \\mathbf{x}\\|$? Intuitively, what we would like is this: if the norm of $B - B'$ is small then $B$ is close to $B'$ as a linear map, that is, the vector norm $\\|B \\mathbf{x} - B' \\mathbf{x}\\|$ is small for any unit vector $\\mathbf{x}$. The following definition provides us with such a notion. Define $\\mathbb{S}^{m-1} = \\{\\mathbf{x} \\in \\mathbb{R}^m\\,:\\,\\|\\mathbf{x}\\| = 1\\}$.\n", "\n", "**Definition (Induced Norm):** The $2$-norm of a matrix $A \\in \\mathbb{R}^{n \\times m}$ is\n", "\n", "$$\n", "\\|A\\|_2\n", "= \\max_{\\mathbf{0} \\neq \\mathbf{x} \\in \\mathbb{R}^m} \\frac{\\|A \\mathbf{x}\\|}{\\|\\mathbf{x}\\|} = \\max_{\\mathbf{x} \\in \\mathbb{S}^{m-1}} \\|A \\mathbf{x}\\|.\n", "$$\n", "\n", "$\\lhd$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "The equality in the definition uses the homogeneity of the vector norm. Also the definition implicitly uses the *Extreme Value Theorem*. In this case, we use the fact that the function $f(\\mathbf{x}) = \\|A \\mathbf{x}\\|$ is continuous and the set $\\mathbb{S}^{m-1}$ is closed and bounded to conclude that there exists $\\mathbf{x}^* \\in \\mathbb{S}^{m-1}$ such that $f(\\mathbf{x}^*) \\geq f(\\mathbf{x})$ for all $\\mathbf{x} \\in \\mathbb{S}^{m-1}$.\n", "\n", "*Exercise:* Let $A \\in \\mathbb{R}^{n \\times m}$. Use Cauchy-Schwarz to show that\n", "\n", "$$\n", "\\|A\\|_2 \n", "= \\max \\left\\{\n", "\\mathbf{x}^T A \\mathbf{y}\n", "\\,:\\,\n", "\\|\\mathbf{x}\\| = \\|\\mathbf{y}\\| = 1\n", "\\right\\}.\n", "$$\n", "\n", "$\\lhd$\n", "\n", "*Exercise:* Let $A \\in \\mathbb{R}^{n \\times m}$. \n", "\n", "(a) Show that $\\|A\\|_F^2 = \\sum_{j=1}^m \\|A \\mathbf{e}_j\\|^2$.\n", "\n", "(b) Use (a) and Cauchy-Schwarz to show that $\\|A\\|_2 \\leq \\|A\\|_F$. \n", "\n", "(c) Give an example such that $\\|A\\|_F = \\sqrt{n} \\|A\\|_2$. 
$\\lhd$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "The $2$-norm of a matrix has many other useful properties.\n", "\n", "***\n", "**Lemma (Properties of the Induced Norm):** Let $A, B \\in \\mathbb{R}^{n \\times m}$ and $\\alpha \\in \\mathbb{R}$; in (f), assume instead that $B \\in \\mathbb{R}^{m \\times p}$ so that the product $AB$ is defined. The following hold:\n", "\n", "(a) $\\|A \\mathbf{x}\\| \\leq \\|A\\|_2 \\|\\mathbf{x}\\|$, $\\forall \\mathbf{0} \\neq \\mathbf{x} \\in \\mathbb{R}^m$\n", "\n", "(b) $\\|A\\|_2 \\geq 0$\n", "\n", "(c) $\\|A\\|_2 = 0$ if and only if $A = 0$\n", "\n", "(d) $\\|\\alpha A\\|_2 = |\\alpha| \\|A\\|_2$\n", "\n", "(e) $\\|A + B \\|_2 \\leq \\|A\\|_2 + \\|B\\|_2$\n", "\n", "(f) $\\|A B \\|_2 \\leq \\|A\\|_2 \\|B\\|_2$.\n", "***" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "*Proof:* These properties all follow from the definition of the induced norm and the corresponding properties for the vector norm:\n", "\n", "- Claims (a) and (b) are immediate from the definition.\n", "\n", "- For (c), note that $\\|A\\|_2 = 0$ implies $\\|A \\mathbf{x}\\| = 0, \\forall \\mathbf{x} \\in \\mathbb{S}^{m-1}$, so that $A \\mathbf{x} = \\mathbf{0}, \\forall \\mathbf{x} \\in \\mathbb{S}^{m-1}$. 
In particular, $a_{ij} = \\mathbf{e}_i^T A \\mathbf{e}_j = 0, \\forall i,j$.\n", "\n", "- For (d) and (e), observe that for all $\\mathbf{x} \\in \\mathbb{S}^{m-1}$\n", "\n", "$$\n", "\\|\\alpha A \\mathbf{x}\\| = |\\alpha| \\|A \\mathbf{x}\\|\n", "$$\n", "\n", "$$\\|(A+B)\\mathbf{x}\\|\n", "= \\|A\\mathbf{x} + B\\mathbf{x}\\| \\leq \\|A\\mathbf{x}\\| + \\|B\\mathbf{x}\\|\n", "\\leq \\|A\\|_2 + \\|B\\|_2\n", "$$\n", "\n", "- For (f), with $B \\in \\mathbb{R}^{m \\times p}$, observe that for all $\\mathbf{x} \\in \\mathbb{S}^{p-1}$\n", "\n", "$$\n", "\\|(AB)\\mathbf{x}\\|\n", "= \\|A(B\\mathbf{x})\\| \\leq \\|A\\|_2 \\|B\\mathbf{x}\\|\n", "\\leq \\|A\\|_2 \\|B\\|_2.$$\n", "$\\square$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "*Exercise:* Use Cauchy-Schwarz to show that for any $A, B$ for which the product $AB$ is defined, it holds that\n", "\n", "$$\n", "\\|A B \\|_F \\leq \\|A\\|_F \\|B\\|_F.\n", "$$\n", "\n", "$\\lhd$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "**NUMERICAL CORNER** In Julia, the Frobenius norm of a matrix can be computed using the function [norm](https://docs.julialang.org/en/v1/stdlib/LinearAlgebra/#LinearAlgebra.norm) while the induced norm can be computed using the function [opnorm](https://docs.julialang.org/en/v1/stdlib/LinearAlgebra/#LinearAlgebra.opnorm)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "#Julia version: 1.5.1\n", "using LinearAlgebra" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "3×2 Array{Float64,2}:\n", " 1.0 0.0\n", " 0.0 1.0\n", " 0.0 0.0" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A = [1. 0.; 0. 1.; 0. 
0.]" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "1.4142135623730951" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "norm(A)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "1.0" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "opnorm(A)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### 2.2 Rank-$k$ approximation" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Now that we have defined a notion of distance between matrices, we will consider the problem of finding a good approximation to a matrix $A$ among all matrices of rank at most $k$. We will start with the Frobenius norm, which is easier to work with, and we will show later on that the solution is the same under the induced norm. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "From the proof of the *Row Rank Equals Column Rank Lemma*, it follows that a rank-$r$ matrix $A$ can be written as a sum of $r$ rank-$1$ matrices\n", "\n", "$$\n", "A = \n", "\\sum_{i=1}^r \\mathbf{b}_i \\mathbf{c}_i^T.\n", "$$\n", "\n", "We will now consider the problem of finding a \"simpler\" approximation to $A$\n", "\n", "$$\n", "A \\approx \\sum_{i=1}^k \\mathbf{b}'_i (\\mathbf{c}'_i)^T\n", "$$\n", "\n", "where $k < r$. Here we measure the quality of this approximation using a matrix norm. We will see in the next subsection that this problem has a natural interpretation in a data analysis context." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "But, first, we are ready to state our key observation. 
In words, the best rank-$k$ approximation to $A$ in Frobenius norm is obtained by projecting the rows of $A$ onto a linear subspace of dimension $k$. We will come back to how one finds the best such subspace below. \n", "\n", "***\n", "**Lemma (Projection and Rank-$k$ Approximation):** Let $A \\in \\mathbb{R}^{n \\times m}$. For any matrix $B \\in \\mathbb{R}^{n \\times m}$ of rank $k \\leq \\min\\{n,m\\}$,\n", "\n", "$$\n", "\\|A - B_{\\perp}\\|_F \\leq \\|A - B\\|_F\n", "$$\n", "\n", "where $B_{\\perp} \\in \\mathbb{R}^{n \\times m}$ is the matrix of rank at most $k$ obtained as follows. Denote row $i$ of $A$, $B$ and $B_{\\perp}$ respectively by $\\boldsymbol{\\alpha}_i^T$, $\\mathbf{b}_{i}^T$ and $\\mathbf{b}_{\\perp,i}^T$, $i=1,\\ldots, n$. Set $\\mathbf{b}_{\\perp,i}$ to be the orthogonal projection of $\\boldsymbol{\\alpha}_i$ onto $\\mathcal{Z} = \\mathrm{span}(\\mathbf{b}_{i}, i=1,\\ldots,n)$.\n", "\n", "***\n", "\n", "*Proof idea:* The square of the Frobenius norm decomposes as a sum of squared row norms. Each term in the sum is minimized by the orthogonal projection. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "*Proof:* By definition of the Frobenius norm, we note that\n", "\n", "$$\n", "\\|A - B\\|_F^2\n", "= \\sum_{i=1}^n \\sum_{j=1}^m (a_{i,j} - b_{i,j})^2 \n", "= \\sum_{i=1}^n \\|\\boldsymbol{\\alpha}_i - \\mathbf{b}_{i}\\|^2\n", "$$\n", "\n", "and similarly for $\\|A - B_{\\perp}\\|_F$. 
We make two key observations:\n", "\n", "(1) Because the orthogonal projection of $\\boldsymbol{\\alpha}_i$ onto $\\mathcal{Z}$ minimizes the distance to $\\mathcal{Z}$, it follows that term by term $\\|\\boldsymbol{\\alpha}_i - \\mathbf{b}_{\\perp,i}\\| \\leq \\|\\boldsymbol{\\alpha}_i - \\mathbf{b}_{i}\\|$ so that\n", "\n", "$$\n", "\\|A - B_\\perp\\|_F^2 =\n", "\\sum_{i=1}^n \\|\\boldsymbol{\\alpha}_i - \\mathbf{b}_{\\perp,i}\\|^2\n", "\\leq \\sum_{i=1}^n \\|\\boldsymbol{\\alpha}_i - \\mathbf{b}_{i}\\|^2\n", "= \\|A - B\\|_F^2.\n", "$$\n", "\n", "(2) Moreover, because the projections satisfy $\\mathbf{b}_{\\perp,i} \\in \\mathcal{Z}$ for all $i$, $\\mathrm{row}(B_\\perp) \\subseteq \\mathrm{row}(B)$ and, hence, the rank of $B_\\perp$ is at most the rank of $B$.\n", "\n", "That concludes the proof. $\\square$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### 2.3 Approximating subspaces\n", "\n", "Think of the rows $\\boldsymbol{\\alpha}_i^T$ of $A \\in \\mathbb{R}^{n \\times m}$ as a collection of $n$ data points in $\\mathbb{R}^m$. A natural way to identify a low-dimensional structure in this dataset is to find a low-dimensional linear subspace $\\mathcal{Z}$ of $\\mathbb{R}^m$ such that the $\\boldsymbol{\\alpha}_i$'s are \"close to it.\"" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "![PCA](https://i.stack.imgur.com/DAX3R.png)\n", "\n", "([Source](https://i.stack.imgur.com/DAX3R.png))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Again, the squared $2$-norm turns out to be convenient computationally. So we are looking for a linear subspace $\\mathcal{Z}$ that minimizes\n", "\n", "$$\n", "\\sum_{i=1}^n \\|\\boldsymbol{\\alpha}_i - \\mathrm{proj}_\\mathcal{Z}(\\boldsymbol{\\alpha}_i)\\|^2 \n", "$$\n", "\n", "over all linear subspaces of $\\mathbb{R}^m$ of dimension at most $k$. 
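
To make this concrete, here is a small numerical sketch (the matrix and the subspace below are random, chosen purely for illustration): if the columns of a matrix `V` form an orthonormal basis of the subspace, the projection of each row is `V * (V' * row)`, and the objective above equals the squared Frobenius norm of `A - A * V * V'`, a matrix of rank at most `k`.

```julia
using LinearAlgebra, Random

Random.seed!(42)
n, m, k = 10, 5, 2
A = randn(n, m)                 # rows are the data points
V = Matrix(qr(randn(m, k)).Q)   # orthonormal basis of a random k-dim subspace

# sum of squared distances from each row to its projection onto the subspace
obj = sum(norm(A[i, :] - V * (V' * A[i, :]))^2 for i in 1:n)

# the same quantity, written as a squared Frobenius norm of a low-rank residual
obj ≈ norm(A - A * V * V')^2
```

Here `qr` is used only as a convenient way to generate an orthonormal basis.
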
By the *Projection and Rank-$k$ Approximation Lemma*, this problem is equivalent to finding a matrix $B$ that minimizes \n", "\n", "$$\n", "\\|A - B\\|_F\n", "$$\n", "\n", "among all matrices in $\\mathbb{R}^{n \\times m}$ of rank at most $k$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "The following observation gives a related, useful characterization.\n", "\n", "***\n", "**Lemma:** Let $\\boldsymbol{\\alpha}_i$, $i=1,\\ldots,n$, be vectors in $\\mathbb{R}^m$. A linear subspace $\\mathcal{Z}$ of $\\mathbb{R}^m$ that minimizes\n", "\n", "$$\n", "\\sum_{i=1}^n \\|\\boldsymbol{\\alpha}_i - \\mathrm{proj}_\\mathcal{Z}(\\boldsymbol{\\alpha}_i)\\|^2 \n", "$$\n", "\n", "over all subspaces of dimension at most $k$ also maximizes\n", "\n", "$$\n", "\\sum_{i=1}^n \\|\\mathrm{proj}_\\mathcal{Z}(\\boldsymbol{\\alpha}_i)\\|^2 \n", "$$\n", "\n", "over the same subspaces.\n", "***\n", "\n", "*Proof idea:* This is a straightforward application of the Pythagorean theorem." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "![Orthogonal Projection](https://upload.wikimedia.org/wikipedia/commons/thumb/9/97/Gram–Schmidt_process.svg/640px-Gram–Schmidt_process.svg.png)\n", "\n", "([Source](https://commons.wikimedia.org/wiki/File:Gram–Schmidt_process.svg))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "*Proof:* By Pythagoras, \n", "\n", "$$\n", "\\|\\boldsymbol{\\alpha}_i - \\mathrm{proj}_{\\mathcal{Z}}(\\boldsymbol{\\alpha}_i)\\|^2 + \\|\\mathrm{proj}_{\\mathcal{Z}}(\\boldsymbol{\\alpha}_i)\\|^2 \n", "= \\|\\boldsymbol{\\alpha}_i\\|^2\n", "$$\n", "\n", "since, by the *Orthogonal Decomposition Lemma*, $\\mathrm{proj}_{\\mathcal{Z}}(\\boldsymbol{\\alpha}_i)$ is orthogonal to $\\boldsymbol{\\alpha}_i - \\mathrm{proj}_{\\mathcal{Z}}(\\boldsymbol{\\alpha}_i)$. 
Rearranging, \n", "\n", "$$\n", "\\|\\boldsymbol{\\alpha}_i - \\mathrm{proj}_{\\mathcal{Z}}(\\boldsymbol{\\alpha}_i)\\|^2\n", "= \\|\\boldsymbol{\\alpha}_i\\|^2 - \\|\\mathrm{proj}_{\\mathcal{Z}}(\\boldsymbol{\\alpha}_i)\\|^2.\n", "$$\n", "\n", "The result follows from the fact that the first term on the right-hand side does not depend on the choice of $\\mathcal{Z}$. $\\square$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "The lemma immediately gives a characterization of the solution to the simplest version of the problem, the rank-$1$ case. Indeed, a one-dimensional subspace $\\mathcal{Z}$ is determined by a unit vector $\\mathbf{v}$. The projection of $\\boldsymbol{\\alpha}_i$ onto $\\mathbf{v}$ is given by the inner product formula $\\langle \\boldsymbol{\\alpha}_i, \\mathbf{v} \\rangle \\,\\mathbf{v}$. So \n", "\n", "$$\n", "\\sum_{i=1}^n \\|\\mathrm{proj}_{\\mathcal{Z}}(\\boldsymbol{\\alpha}_i)\\|^2\n", "= \\sum_{i=1}^n \\langle \\boldsymbol{\\alpha}_i, \\mathbf{v} \\rangle ^2\n", "= \\|A \\mathbf{v}\\|^2\n", "$$\n", "\n", "where, again, $A$ is the matrix with rows $\\boldsymbol{\\alpha}_i ^T$, $i=1,\\ldots, n$. So the solution is\n", "\n", "$$\n", "\\mathbf{v}_1 \\in \\arg\\max\\{\\|A \\mathbf{v}\\|:\\|\\mathbf{v}\\| = 1\\}.\n", "$$\n", "\n", "Here $\\arg\\max$ means that $\\mathbf{v}_1$ is a $\\mathbf{v}$ that achieves the maximum. Note that there could be more than one such $\\mathbf{v}$, in which case we pick an arbitrary one.\n", "\n", "*Exercise:* Construct a matrix $A \\in \\mathbb{R}^{n \\times n}$ for which there exist multiple solutions to the maximization problem above. 
$\\lhd$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "For general $k$, we are looking for an orthonormal set $\\mathbf{v}_1,\\ldots,\\mathbf{v}_k$ of size $k$ that maximizes\n", "\n", "\n", "\\begin{align}\n", "\\sum_{i=1}^n \\|\\mathrm{proj}_{\\mathcal{Z}}(\\boldsymbol{\\alpha}_i)\\|^2\n", "&=\\sum_{i=1}^n \\left\\|\\sum_{j=1}^k \\langle \\boldsymbol{\\alpha}_i, \\mathbf{v}_j \\rangle \\,\\mathbf{v}_j\\right\\|^2\\\\\n", "&= \\sum_{i=1}^n \\sum_{j=1}^k \\langle \\boldsymbol{\\alpha}_i, \\mathbf{v}_j \\rangle ^2\\\\\n", "&= \\sum_{j=1}^k \\left(\\sum_{i=1}^n \\langle \\boldsymbol{\\alpha}_i, \\mathbf{v}_j \\rangle ^2\\right)\\\\\n", "&= \\sum_{j=1}^k \\|A \\mathbf{v}_j\\|^2\n", "\\end{align}\n", "\n", "\n", "by orthonormality, where $\\mathcal{Z}$ is the subspace spanned by $\\mathbf{v}_1,\\ldots,\\mathbf{v}_k$. We show next that a simple algorithm solves this problem." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### 2.4 A greedy algorithm\n", "\n", "Remarkably, the following sequence of subproblems leads to the solution. We first compute\n", "\n", "$$\n", "\\mathbf{v}_1 \\in \\arg\\max\\{\\|A \\mathbf{v}\\|:\\|\\mathbf{v}\\| = 1\\}.\n", "$$\n", "\n", "By the *Extreme Value Theorem*, we know such a $\\mathbf{v}_1$ exists (but may not be unique). \n", "\n", "Then we consider all unit vectors orthogonal to $\\mathbf{v}_1$ and compute\n", "\n", "$$\n", "\\mathbf{v}_2\\in \\arg\\max\n", "\\{\\|A \\mathbf{v}\\|:\\|\\mathbf{v}\\| = 1,\\ \\langle \\mathbf{v}, \\mathbf{v}_1 \\rangle = 0\\}.\n", "$$\n", "\n", "Again, such a $\\mathbf{v}_2$ exists by the *Extreme Value Theorem*. 
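
As an aside, these one-dimensional subproblems can be tackled numerically. The sketch below is only a heuristic stand-in for the maximization: it runs power iteration on the matrix `A' * A`, restricted to the orthogonal complement of the directions already found, none of which is justified here. It illustrates the first two greedy steps on a small random matrix.

```julia
using LinearAlgebra, Random

# Heuristic: approximately maximize norm(A * v) over unit vectors v
# orthogonal to the columns of Vprev, via projected power iteration
function greedy_direction(A, Vprev; iters=500)
    m = size(A, 2)
    P = I - Vprev * Vprev'         # projector onto the orthogonal complement
    v = normalize(P * randn(m))    # random feasible starting point
    for _ in 1:iters
        v = normalize(P * (A' * (A * v)))
    end
    return v
end

Random.seed!(0)
A = randn(8, 4)
v1 = greedy_direction(A, zeros(4, 0))          # approximates v_1
v2 = greedy_direction(A, reshape(v1, :, 1))    # approximates v_2
norm(A * v1), norm(A * v2), dot(v1, v2)
```

Nothing about this sketch is needed for the argument that follows.
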
Then proceeding by induction\n", "\n", "$$\n", "\\mathbf{v}_i\\in \\arg\\max\n", "\\{\\|A \\mathbf{v}\\|:\\|\\mathbf{v}\\| = 1, \\ \\langle \\mathbf{v}, \\mathbf{v}_j \\rangle = 0, \\forall j \\leq i-1\\}.\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "In words, the claim -- which requires a proof -- is that the best $k$-dimensional approximating subspace is obtained by finding the best $1$-dimensional subspace, then the best $1$-dimensional subspace orthogonal to the first one, and so on. That is, we can proceed [greedily](https://en.wikipedia.org/wiki/Greedy_algorithm). While it is clear that, after $k$ steps, this procedure constructs an orthonormal set of size $k$, it is far from obvious that it maximizes $\\sum_{j=1}^k \\|A \\mathbf{v}_j\\|^2$ over all such sets. We establish this claim next." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "First, we will need the following basic linear algebra fact, which is left as an exercise.\n", "\n", "*Exercise:* Establish the following facts:\n", "\n", "(a) Let $\\mathcal{Z} \\subseteq \\mathcal{W}$ be linear subspaces such that $\\mathrm{dim}(\\mathcal{Z}) < \\mathrm{dim}(\\mathcal{W})$. Show that there exists a unit vector $\\mathbf{w} \\in \\mathcal{W}$ that is orthogonal to $\\mathcal{Z}$.\n", "\n", "(b) Let $\\mathcal{W} = \\mathrm{span}(\\mathbf{w}_1,\\ldots,\\mathbf{w}_\\ell)$ and $\\mathbf{z} \\in \\mathcal{W}$ of unit norm. Use the *Linear Dependence Lemma* and Gram-Schmidt to show that there exists an orthonormal basis of $\\mathcal{W}$ that includes $\\mathbf{z}$. $\\lhd$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "***\n", "**Theorem (Greedy Finds Best Fit):** Let $A \\in \\mathbb{R}^{n \\times m}$ be a matrix with rows $\\boldsymbol{\\alpha}_i^T$, $i=1,\\ldots,n$. 
For any $k \\leq \\mathrm{rk}(A)$, let $\\mathbf{v}_1,\\ldots,\\mathbf{v}_k$ be the greedy sequence constructed above. Then $\\mathcal{Z}^* = \\mathrm{span}(\\mathbf{v}_1,\\ldots,\\mathbf{v}_k)$ is a solution to the minimization problem\n", "\n", "$$\n", "\\min \\left\\{\n", "\\sum_{i=1}^n \\|\\boldsymbol{\\alpha}_i - \\mathrm{proj}_{\\mathcal{Z}}(\\boldsymbol{\\alpha}_i)\\|^2\\ :\\ \\text{$\\mathcal{Z}$ is a linear subspace of dimension $\\leq k$} \n", "\\right\\}.\n", "$$\n", "***\n", "\n", "*Proof idea:* We proceed by induction. For an arbitrary orthonormal set $\\mathbf{w}_1,\\ldots,\\mathbf{w}_k$, we use the exercise above to find an orthonormal basis of their span containing an element orthogonal to $\\mathbf{v}_1,\\ldots,\\mathbf{v}_{k-1}$. Then we use the definition of $\\mathbf{v}_k$ to conclude." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "*Proof:* As we explained in the previous subsection, we reformulate the problem as the maximization\n", "\n", "$$\n", "\\max \\left\\{\n", "\\sum_{j=1}^k \\|A \\mathbf{w}_j\\|^2\\ :\\ \\text{$\\{\\mathbf{w}_1,\\ldots,\\mathbf{w}_k\\}$ is an orthonormal set} \n", "\\right\\}.\n", "$$\n", "\n", "We proceed by induction. For $k=1$, we define $\\mathbf{v}_1$ precisely as a solution of the above maximization. For $\\ell < k$, assume that for any orthonormal set $\\{\\mathbf{w}'_1,\\ldots,\\mathbf{w}'_\\ell\\}$, we have\n", "\n", "$$\n", "\\sum_{j=1}^\\ell \\|A \\mathbf{w}'_j\\|^2 \\leq \\sum_{j=1}^\\ell \\|A \\mathbf{v}_j\\|^2.\n", "$$\n", "\n", "Now consider an orthonormal set $\\{\\mathbf{w}_1,\\ldots,\\mathbf{w}_k\\}$ and let its span be $\\mathcal{W} = \\mathrm{span}(\\mathbf{w}_1,\\ldots,\\mathbf{w}_k)$."
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "***Step 1.*** For $j=1,\\ldots,k-1$, let $\\mathbf{v}_j'$ be the orthogonal projection of\n", "$\\mathbf{v}_j$ onto $\\mathcal{W}$ and let $\\mathcal{V}' = \\mathrm{span}(\\mathbf{v}'_1,\\ldots,\\mathbf{v}'_{k-1})$. Because $\\mathcal{V}' \\subseteq \\mathcal{W}$ has dimension at most $k-1$ while $\\mathcal{W}$ itself has dimension $k$, by the exercise above we can find an orthonormal basis $\\mathbf{w}'_1,\\ldots,\\mathbf{w}'_{k}$ of $\\mathcal{W}$ such that $\\mathbf{w}'_k$ is orthogonal to $\\mathcal{V}'$. Then, for any $j=1,\\ldots,k-1$, we have the decomposition $\\mathbf{v}_j = \\mathbf{v}'_j + (\\mathbf{v}_j - \\mathbf{v}'_j)$ where $\\mathbf{v}'_j \\in \\mathcal{V}'$ is orthogonal to $\\mathbf{w}'_k$ and $\\mathbf{v}_j - \\mathbf{v}'_j$ is also orthogonal to $\\mathbf{w}'_k \\in \\mathcal{W}$ by the properties of the orthogonal projection. Hence \n", "\n", "$$\n", "\\left\\langle \\sum_{j=1}^{k-1}\\beta_j \\mathbf{v}_j, \\mathbf{w}'_k \\right\\rangle = 0\n", "$$\n", "\n", "for any $\\beta_j$'s. That is, $\\mathbf{w}'_k$ is orthogonal to $\\mathrm{span}(\\mathbf{v}_1,\\ldots,\\mathbf{v}_{k-1})$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "***Step 2.*** By the induction hypothesis\n", "\n", "$$\n", "\\sum_{j=1}^{k-1} \\|A \\mathbf{w}'_j\\|^2 \\leq \\sum_{j=1}^{k-1} \\|A \\mathbf{v}_j\\|^2. \n", "$$\n", "\n", "Moreover, recalling that the $\\boldsymbol{\\alpha}_i^T$'s are the rows of $A$,\n", "\n", "$$\n", "\\sum_{j=1}^k \\|A \\mathbf{w}_j\\|^2\n", "= \\sum_{i=1}^n \\|\\mathrm{proj}_{\\mathcal{W}}(\\boldsymbol{\\alpha}_i)\\|^2\n", "= \\sum_{j=1}^k \\|A \\mathbf{w}_j'\\|^2\n", "$$\n", "\n", "since the $\\mathbf{w}_j$'s and $\\mathbf{w}'_j$'s form an orthonormal basis of the same subspace $\\mathcal{W}$. 
Since $\\mathbf{w}'_k$ is orthogonal to $\\mathrm{span}(\\mathbf{v}_1,\\ldots,\\mathbf{v}_{k-1})$, by definition of $\\mathbf{v}_k$\n", "\n", "$$\n", "\\|A \\mathbf{w}'_k\\|^2 \\leq \\|A \\mathbf{v}_k\\|^2.\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "***Step 3.*** Putting everything together \n", "\n", "$$\n", "\\sum_{j=1}^k \\|A \\mathbf{w}_j\\|^2\n", "=\n", "\\sum_{j=1}^{k-1} \\|A \\mathbf{w}'_j\\|^2 \n", "+ \\|A \\mathbf{w}'_k\\|^2 \\leq \\sum_{j=1}^{k} \\|A \\mathbf{v}_j\\|^2\n", "$$\n", "\n", "which proves the claim. $\\square$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Note that we have not entirely solved the best approximating subspace problem from a computational point of view, as we have not given an efficient computational procedure to find a solution to the $1$-dimensional subproblems. We've only shown that the solutions exist and have the right properties. We will take care of this in the next section." ] } ], "metadata": { "kernelspec": { "display_name": "Julia 1.5.1", "language": "julia", "name": "julia-1.5" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "1.5.1" } }, "nbformat": 4, "nbformat_minor": 2 }