{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# TOPIC 3 \n", "\n", "# Optimality, convexity, and gradient descent\n", "\n", "## 2 Optimality conditions\n", "\n", "***\n", "*Course:* [Math 535](http://www.math.wisc.edu/~roch/mmids/) - Mathematical Methods in Data Science (MMiDS) \n", "*Author:* [Sebastien Roch](http://www.math.wisc.edu/~roch/), Department of Mathematics, University of Wisconsin-Madison \n", "*Updated:* Oct 10, 2020 \n", "*Copyright:* © 2020 Sebastien Roch\n", "***" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "In this section, we derive optimality conditions for unconstrained continuous optimization problems." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### 2.1 Multivariate version of Taylor's theorem" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "We will make use of *Taylor's Theorem*, a powerful generalization of the *Mean Value Theorem* that provides polynomial approximations to a function around a point. We restrict ourselves to the case of a linear approximation with a second-order error term, which will suffice for our purposes. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "#### 2.1.1 Single-variable case\n", "\n", "We begin by reviewing the single-variable case, which we will use to prove the general version." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "***\n", "\n", "**Theorem (Taylor):** Let $f: D \\to \\mathbb{R}$ where $D \\subseteq \\mathbb{R}$. Suppose $f$ has a continuous derivative on $[a,b]$ and that its second derivative exists on $(a,b)$. 
Then for any $x \\in [a, b]$\n", "\n", "$$\n", "f(x)\n", "= f(a) + (x-a) f'(a) + \\frac{1}{2} (x-a)^2 f''(\\xi)\n", "$$\n", "\n", "for some $a < \\xi < x$.\n", "\n", "***" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "*Proof idea:* The *Mean Value Theorem* implies that there is $a < \\xi < x$ such that $f(x) = f(a) + (x - a)f'(\\xi)$. One way to think of the proof of that result is the following: construct an affine function that agrees with $f$ at $a$ and $x$, then use *Rolle* to express the coefficient of the linear term using $f'$. Here we do the same with a polynomial of degree $2$. But we now have an extra degree of freedom in choosing this polynomial. Because we are looking for a good approximation close to $a$, we choose to make the first derivative at $a$ agree as well. Applying *Rolle* twice gives the claim." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "*Proof:* Let\n", "\n", "$$\n", "P(t) = \\alpha_0 + \\alpha_1 (t-a) + \\alpha_2 (t-a)^2.\n", "$$\n", "\n", "We choose the $\\alpha_i$'s so that $P(a) = f(a)$, $P'(a) = f'(a)$, and $P(x) = f(x)$. The first two lead to the conditions\n", "\n", "$$\n", "\\alpha_0 = f(a), \\quad \\alpha_1 = f'(a). \n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Let $\\phi(t) = f(t) - P(t)$. By construction $\\phi(a) = \\phi(x) = 0$. By *Rolle*, there is a $\\xi' \\in (a, x)$ such that $\\phi'(\\xi') = 0$. Moreover, $\\phi'(a) = 0$. Hence we can apply *Rolle* again - this time to $\\phi'$ on $[a, \\xi']$. It implies that there is $\\xi \\in (a, \\xi')$ such that $\\phi''(\\xi) = 0$.\n", "\n", "The second derivative of $\\phi$ at $\\xi$ is\n", "\n", "$$\n", "0 = \\phi''(\\xi) \n", "= f''(\\xi) - P''(\\xi) \n", "= f''(\\xi) - 2 \\alpha_2\n", "$$\n", "\n", "so $\\alpha_2 = f''(\\xi)/2$. Plugging into $P$ and using $\\phi(x) = 0$ gives the claim. 
$\\square$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "The third term on the right-hand side of *Taylor's Theorem* is called the remainder. It can be seen as an error term between $f(x)$ and the linear approximation $f(a) + (x-a) f'(a)$. There are [other forms](https://en.wikipedia.org/wiki/Taylor%27s_theorem#Explicit_formulas_for_the_remainder) for the remainder. The form we stated here is useful when one has information about the second derivative. Here is an example." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "**Example:** Consider $f(x) = e^x$. Then $f'(x) = f''(x) = e^x$. Suppose we are interested in approximating $f$ in the interval $[0,1]$. We take $a=0$ and $b=1$ in Taylor's Theorem. The linear term is \n", "\n", "$$\n", "f(a) + (x-a) f'(a) = 1 + x e^0 = 1 + x.\n", "$$\n", "\n", "Then for any $x \\in [0,1]$\n", "\n", "$$\n", "f(x) = 1 + x + \\frac{1}{2}x^2 e^{\\xi_x}\n", "$$\n", "\n", "where $\\xi_x \\in (0,1)$ depends on $x$. 
We get a uniform bound on the error over $[0,1]$ by replacing $\\xi_x$ with its worst possible value over $[0,1]$ \n", "\n", "$$\n", "|f(x) - (1+x)| \\leq \\frac{1}{2}x^2 e^{\\xi_x} \\leq \\frac{e}{2} x^2.\n", "$$" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "# Julia version: 1.5.1\n", "using Plots" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x = LinRange(0,1,100)\n", "y = exp.(x) # function f\n", "taylor = 1 .+ x # linear approximation\n", "err = (exp(1)/2)*x.^2 # our error bound\n", "plot(x, y, lw=2, label=\"f\", legend=:topleft)\n", "plot!(x, taylor, lw=2, label=\"taylor\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "If we plot the upper and lower bounds, we see that $f$ indeed falls within them." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "plot!(x, taylor-err, linestyle=:dot, linecolor=:green, label=\"lower\")\n", "plot!(x, taylor+err, linestyle=:dash, linecolor=:green, label=\"upper\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "#### 2.1.2 General case\n", "\n", "In the case of several variables, we again restrict ourselves to the second order. We start with an important special case: a *Multivariate Mean Value Theorem*." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "***\n", "\n", "**Theorem (Multivariate Mean Value):** Let $f : D \\to \\mathbb{R}$ where $D \\subseteq \\mathbb{R}^d$. 
Let $\\mathbf{x}_0 \\in D$ and $\\delta > 0$ be such that $B_\\delta(\\mathbf{x}_0) \\subseteq D$. If $f$ is continuously differentiable on $B_\\delta(\\mathbf{x}_0)$, then for any $\\mathbf{x} \\in B_\\delta(\\mathbf{x}_0)$\n", "\n", "$$\n", "f(\\mathbf{x})\n", "= f(\\mathbf{x}_0) + \\nabla f(\\mathbf{x}_0 + \\xi \\mathbf{p})^T \\mathbf{p} \n", "$$\n", "\n", "for some $\\xi \\in (0,1)$, where $\\mathbf{p} = \\mathbf{x} - \\mathbf{x}_0$.\n", "\n", "***" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "*Proof idea:* We apply the single-variable result and the *Chain Rule*." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "*Proof:* Let $\\phi(t) = f(\\boldsymbol{\\alpha}(t))$ where $\\boldsymbol{\\alpha}(t) = \\mathbf{x}_0 + t \\mathbf{p}$. Observe that $\\phi(0) = f(\\mathbf{x}_0)$ and $\\phi(1) = f(\\mathbf{x})$. By the *Chain Rule*, \n", "\n", "$$\n", "\\phi'(t)\n", "= \\mathbf{J}_f(\\boldsymbol{\\alpha}(t)) \\,\\mathbf{J}_{\\boldsymbol{\\alpha}}(t)\n", "= \\nabla f(\\boldsymbol{\\alpha}(t))^T \\mathbf{p}\n", "= \\nabla f(\\mathbf{x}_0 + t \\mathbf{p})^T \\mathbf{p}.\n", "$$\n", "\n", "In particular, $\\phi$ has a continuous first derivative on $[0,1]$. By the *Mean Value Theorem* in the single-variable case\n", "\n", "$$\n", "\\phi(t)\n", "= \\phi(0) + t \\phi'(\\xi)\n", "$$\n", "\n", "for some $\\xi \\in (0,t)$. Plugging in the expressions for $\\phi(0)$ and $\\phi'(\\xi)$ and taking $t=1$ gives the claim. $\\square$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "We move on to the second-order result. For the more general version, see e.g. [Wikipedia](https://en.wikipedia.org/wiki/Taylor's_theorem#Taylor's_theorem_for_multivariate_functions)." 
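, "\n", "\n", "First, a quick sanity check of the *Multivariate Mean Value Theorem* just proved: take $f(\\mathbf{x}) = \\|\\mathbf{x}\\|^2$ with $\\mathbf{x}_0 = \\mathbf{0}$, so that $\\mathbf{p} = \\mathbf{x}$ and $\\nabla f(\\mathbf{x}) = 2\\mathbf{x}$. The theorem asserts that\n", "\n", "$$\n", "f(\\mathbf{x})\n", "= f(\\mathbf{0}) + \\nabla f(\\xi \\mathbf{x})^T \\mathbf{x}\n", "= 2 \\xi \\|\\mathbf{x}\\|^2\n", "$$\n", "\n", "for some $\\xi \\in (0,1)$. Since $f(\\mathbf{x}) = \\|\\mathbf{x}\\|^2$, the conclusion indeed holds with $\\xi = 1/2$.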
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "***\n", "\n", "**Theorem (Multivariate Taylor):** Let $f : D \\to \\mathbb{R}$ where $D \\subseteq \\mathbb{R}^d$. Let $\\mathbf{x}_0 \\in D$ and $\\delta > 0$ be such that $B_\\delta(\\mathbf{x}_0) \\subseteq D$. If $f$ is twice continuously differentiable on $B_\\delta(\\mathbf{x}_0)$, then for any $\\mathbf{x} \\in B_\\delta(\\mathbf{x}_0)$\n", "\n", "$$\n", "f(\\mathbf{x})\n", "= f(\\mathbf{x}_0) + \\nabla f(\\mathbf{x}_0)^T \\mathbf{p} \n", "+ \\frac{1}{2} \\mathbf{p}^T \n", "\\,\\mathbf{H}_f(\\mathbf{x}_0 + \\xi \\mathbf{p})\n", "\\,\\mathbf{p}\n", "$$\n", "\n", "for some $\\xi \\in (0,1)$, where $\\mathbf{p} = \\mathbf{x} - \\mathbf{x}_0$.\n", "\n", "***" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "*Proof idea:* We apply the single-variable result and the *Chain Rule*." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "*Proof:* Let $\\phi(t) = f(\\boldsymbol{\\alpha}(t))$ where $\\boldsymbol{\\alpha}(t) = \\mathbf{x}_0 + t \\mathbf{p}$. Observe that $\\phi(0) = f(\\mathbf{x}_0)$ and $\\phi(1) = f(\\mathbf{x})$. As observed in the proof of the *Multivariate Mean Value Theorem*, $\\phi'(t) = \\nabla f(\\boldsymbol{\\alpha}(t))^T \\mathbf{p}$. By the *Chain Rule*,\n", "\n", "$$\n", "\\phi''(t)\n", "= \\frac{\\mathrm{d}}{\\mathrm{d} t}\n", "\\left[\\sum_{i=1}^d \\frac{\\partial f(\\boldsymbol{\\alpha}(t))}{\\partial x_i} p_i \\right]\n", "= \\sum_{i=1}^d \\sum_{j=1}^d \\frac{\\partial^2 f(\\boldsymbol{\\alpha}(t))}{\\partial x_j \\partial x_i} p_j p_i\n", "= \\mathbf{p}^T \n", "\\,\\mathbf{H}_f(\\mathbf{x}_0 + t \\mathbf{p})\n", "\\,\\mathbf{p}.\n", "$$\n", "\n", "In particular, $\\phi$ has continuous first and second derivatives on $[0,1]$. 
By *Taylor's Theorem* in the single-variable case\n", "\n", "$$\n", "\\phi(t)\n", "= \\phi(0) + t \\phi'(0) + \\frac{1}{2} t^2 \\phi''(\\xi)\n", "$$\n", "\n", "for some $\\xi \\in (0,t)$. Plugging in the expressions for $\\phi(0)$, $\\phi'(0)$ and $\\phi''(\\xi)$ and taking $t=1$ gives the claim. $\\square$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "**Example:** Consider the function $f(x_1, x_2) = x_1 x_2 + x_1^2 + e^{x_1} \\cos x_2$. We apply *Taylor's Theorem* with $\\mathbf{x}_0 = (0, 0)$ and $\\mathbf{x} = (x_1, x_2)$. The gradient is\n", "\n", "$$\n", "\\nabla f(x_1, x_2) = (x_2 + 2 x_1 + e^{x_1} \\cos x_2, x_1 - e^{x_1} \\sin x_2 )^T\n", "$$\n", "\n", "and the Hessian is\n", "\n", "$$\n", "\\mathbf{H}_f(x_1, x_2)\n", "= \\begin{pmatrix}\n", "2 + e^{x_1} \\cos x_2 & 1 - e^{x_1} \\sin x_2\\\\\n", "1 - e^{x_1} \\sin x_2 & - e^{x_1} \\cos x_2 \n", "\\end{pmatrix}.\n", "$$\n", "\n", "So $f(0,0) = 1$ and $\\nabla f(0,0) = (1, 0)^T$. Thus, \n", "by the *Multivariate Taylor's Theorem*, there is $\\xi \\in (0,1)$ such that\n", "\n", "$$\n", "f(x_1, x_2)\n", "= 1 + x_1 \n", "+ \\frac{1}{2}[2 x_1^2 + 2 x_1 x_2 \n", "+ (x_1^2 - x_2^2) \\,e^{\\xi x_1} \\cos(\\xi x_2)\n", "- 2 x_1 x_2 e^{\\xi x_1} \\sin(\\xi x_2)].\n", "$$\n", "\n", "$\\lhd$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### 2.2 Unconstrained optimization" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We will be interested in unconstrained optimization of the form:\n", "\n", "$$\n", "\\min_{\\mathbf{x} \\in \\mathbb{R}^d} f(\\mathbf{x})\n", "$$\n", "\n", "where $f : \\mathbb{R}^d \\to \\mathbb{R}$. In this subsection, we define several notions of solution and derive characterizations." 
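, "\n", "\n", "**Example:** A prototypical instance is least squares: given $A \\in \\mathbb{R}^{n \\times d}$ and $\\mathbf{b} \\in \\mathbb{R}^n$, we seek to solve\n", "\n", "$$\n", "\\min_{\\mathbf{x} \\in \\mathbb{R}^d} \\|A \\mathbf{x} - \\mathbf{b}\\|^2\n", "$$\n", "\n", "which is of the form above with $f(\\mathbf{x}) = \\|A \\mathbf{x} - \\mathbf{b}\\|^2$, a twice continuously differentiable function of $\\mathbf{x}$. $\\lhd$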
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "#### 2.2.1 Definitions\n", "\n", "Ideally, we would like to find a global minimizer of the optimization problem above. Recall the definition of a global minimizer." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "**Definition (Global minimizer):** Let $f : \\mathbb{R}^d \\to \\mathbb{R}$. The point $\\mathbf{x}^* \\in \\mathbb{R}^d$ is a global minimizer of $f$ over $\\mathbb{R}^d$ if \n", "\n", "$$\n", "f(\\mathbf{x}) \n", "\\geq f(\\mathbf{x}^*), \\quad \\forall \\mathbf{x} \\in \\mathbb{R}^d. \n", "$$\n", "\n", "$\\lhd$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We have observed before that, in general, finding a global minimizer and certifying that one has been found can be difficult unless some special structure is present. Therefore weaker notions of solution are needed. Recall the notion of a local minimizer." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "**Definition (Local minimizer):** Let $f : \\mathbb{R}^d \\to \\mathbb{R}$. The point $\\mathbf{x}^* \\in \\mathbb{R}^d$ is a local minimizer of $f$ over $\\mathbb{R}^d$ if there is $\\delta > 0$ such that \n", "\n", "$$\n", "f(\\mathbf{x}) \n", "\\geq f(\\mathbf{x}^*), \\quad \\forall \\mathbf{x} \\in B_{\\delta}(\\mathbf{x}^*) \\setminus \\{\\mathbf{x}^*\\}. \n", "$$\n", "\n", "If the inequality is strict, we say that $\\mathbf{x}^*$ is a strict local minimizer. $\\lhd$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "In words, $\\mathbf{x}^*$ is a local minimizer if there is an open ball around $\\mathbf{x}^*$ where it attains the minimum value. 
The difference between global and local minimizers is illustrated in the next figure.\n", "\n", "![Local minimizer](https://upload.wikimedia.org/wikipedia/commons/thumb/6/68/Extrema_example_original.svg/300px-Extrema_example_original.svg.png)\n", "\n", "([Source](https://commons.wikimedia.org/wiki/File:Extrema_example_original.svg))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "#### 2.2.2 Necessary conditions\n", "\n", "Local minimizers can be characterized in terms of the gradient and Hessian of the function." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "We first generalize the *Descent Direction Lemma* to the multivariate case. To do so, we need to define what a descent direction is. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "**Definition (Descent Direction):** Let $f : \\mathbb{R}^d \\to \\mathbb{R}$. A vector $\\mathbf{v}$ is a descent direction for $f$ at $\\mathbf{x}_0$ if there is $\\alpha^* > 0$ such that\n", "\n", "$$\n", "f(\\mathbf{x}_0 + \\alpha \\mathbf{v})\n", "< f(\\mathbf{x}_0), \\quad \\forall \\alpha \\in (0,\\alpha^*).\n", "$$\n", "\n", "$\\lhd$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "In the continuously differentiable case, the directional derivative gives a criterion for descent directions." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "***\n", "\n", "**Lemma (Descent Direction and Directional Derivative):** Let $f : \\mathbb{R}^d \\to \\mathbb{R}$ be continuously differentiable at $\\mathbf{x}_0$. 
A vector $\\mathbf{v}$ is a descent direction for $f$ at $\\mathbf{x}_0$ if\n", "\n", "$$\n", "\\frac{\\partial f (\\mathbf{x}_0)}{\\partial \\mathbf{v}} \n", "= \\nabla f(\\mathbf{x}_0)^T \\mathbf{v}\n", "< 0\n", "$$\n", "\n", "that is, if the directional derivative of $f$ at $\\mathbf{x}_0$ in the direction $\\mathbf{v}$ is negative.\n", "\n", "***" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "*Proof idea:* We use the *Multivariate Mean Value Theorem* to show that $f$ takes smaller values in direction $\\mathbf{v}$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "*Proof:* Suppose there is $\\mathbf{v} \\in \\mathbb{R}^d$ such that $\\nabla f(\\mathbf{x}_0)^T \\mathbf{v} = -\\eta < 0$. For $\\alpha > 0$, the *Multivariate Mean Value Theorem* implies that there is $\\xi_\\alpha \\in (0,1)$ such that\n", "\n", "$$\n", "f(\\mathbf{x}_0 + \\alpha \\mathbf{v})\n", "= f(\\mathbf{x}_0) + \\nabla f(\\mathbf{x}_0 + \\xi_\\alpha \\alpha \\mathbf{v})^T(\\alpha \\mathbf{v})\n", "= f(\\mathbf{x}_0) + \\alpha \\nabla f(\\mathbf{x}_0 + \\xi_\\alpha \\alpha \\mathbf{v})^T \\mathbf{v}.\n", "$$\n", "\n", "We cannot immediately apply our condition on $\\mathbf{v}$ as the gradient in the previous equation is taken at $\\mathbf{x}_0 + \\xi_\\alpha \\alpha \\mathbf{v}$, not $\\mathbf{x}_0$. But the gradient is continuous, so there is $\\delta > 0$ such that\n", "\n", "$$\n", "\\left|\\nabla f(\\mathbf{x})^T \\mathbf{v}\n", "- \\nabla f(\\mathbf{x}_0)^T \\mathbf{v}\\right| < \\eta/2\n", "$$\n", "\n", "for all $\\mathbf{x} \\in B_\\delta(\\mathbf{x}_0)$. 
Hence there is $\\alpha^* > 0$ small enough - for instance $\\alpha^* = \\delta/\\|\\mathbf{v}\\|$, which ensures $\\mathbf{x}_0 + \\xi_\\alpha \\alpha \\mathbf{v} \\in B_\\delta(\\mathbf{x}_0)$ since $\\xi_\\alpha \\in (0,1)$ - such that\n", "\n", "$$\n", "\\nabla f(\\mathbf{x}_0 + \\xi_\\alpha \\alpha \\mathbf{v})^T \\mathbf{v} < - \\eta/2 < 0,\n", "\\quad \\forall \\alpha \\in (0,\\alpha^*).\n", "$$\n", "\n", "That implies\n", "\n", "$$\n", "f(\\mathbf{x}_0 + \\alpha \\mathbf{v})\n", "< f(\\mathbf{x}_0) - \\alpha \\eta/2 < f(\\mathbf{x}_0),\n", "\\quad \\forall \\alpha \\in (0,\\alpha^*)\n", "$$\n", "\n", "which proves the claim. $\\square$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "***\n", "\n", "**Lemma (Descent Direction: Multivariate Version):** Let $f : \\mathbb{R}^d \\to \\mathbb{R}$ be continuously differentiable at $\\mathbf{x}_0$ and assume that $\\nabla f(\\mathbf{x}_0) \\neq 0$. Then $f$ has a descent direction at $\\mathbf{x}_0$.\n", "\n", "***" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "*Proof:* Take $\\mathbf{v} = - \\nabla f(\\mathbf{x}_0)$. Then $\\nabla f(\\mathbf{x}_0)^T \\mathbf{v} = - \\|\\nabla f(\\mathbf{x}_0)\\|^2 < 0$ since $\\nabla f(\\mathbf{x}_0) \\neq 0$. $\\square$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "This leads to the following fundamental result." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "***\n", "\n", "**Theorem (First-Order Necessary Condition):** Let $f : \\mathbb{R}^d \\to \\mathbb{R}$ be continuously differentiable on $\\mathbb{R}^d$. If $\\mathbf{x}_0$ is a local minimizer, then $\\nabla f(\\mathbf{x}_0) = 0$.\n", "\n", "***" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "*Proof idea:* In a descent direction, $f$ decreases, so there cannot be one at a local minimizer. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "*Proof:* We argue by contradiction. 
Suppose that $\\nabla f(\\mathbf{x}_0) \\neq 0$. By the *Descent Direction Lemma*, there is a descent direction $\\mathbf{v} \\in \\mathbb{R}^d$ at $\\mathbf{x}_0$. That implies\n", "\n", "$$\n", "f(\\mathbf{x}_0 + \\alpha \\mathbf{v})\n", "< f(\\mathbf{x}_0), \\quad \\forall \\alpha \\in (0, \\alpha^*) \n", "$$\n", "\n", "for some $\\alpha^* > 0$. So every open ball around $\\mathbf{x}_0$ has a point achieving a smaller value than $f(\\mathbf{x}_0)$. Thus $\\mathbf{x}_0$ is not a local minimizer, a contradiction. So it must be that $\\nabla f(\\mathbf{x}_0) = 0$. $\\square$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "When $f$ is twice continuously differentiable, we also get a condition on the Hessian." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "***\n", "\n", "**Theorem (Second-Order Necessary Condition):** Let $f : \\mathbb{R}^d \\to \\mathbb{R}$ be twice continuously differentiable on $\\mathbb{R}^d$. If $\\mathbf{x}_0$ is a local minimizer, then $\\mathbf{H}_f(\\mathbf{x}_0)$ is positive semidefinite.\n", "\n", "***" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "*Proof idea:* The proof goes along the same lines as the argument leading to the *First-Order Necessary Condition*, but we use the *Multivariate Taylor's Theorem* to the second order instead." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "*Proof:* We argue by contradiction. Suppose that $\\mathbf{H}_f(\\mathbf{x}_0)$ is not positive semidefinite. From the *Symmetry of the Hessian Theorem* and the *Spectral Theorem*, $\\mathbf{H}_f(\\mathbf{x}_0)$ has a spectral decomposition. From the *Characterization of Positive Semidefiniteness*, however, it follows that $\\mathbf{H}_f(\\mathbf{x}_0)$ must have at least one negative eigenvalue $- \\eta < 0$. 
Let $\\mathbf{v}$ be a corresponding unit eigenvector. \n", "\n", "We have that $\\langle \\mathbf{v}, \\mathbf{H}_f(\\mathbf{x}_0) \\,\\mathbf{v} \\rangle = - \\eta < 0$. For $\\alpha > 0$, the *Multivariate Taylor's Theorem* implies that there is $\\xi_\\alpha \\in (0,1)$ such that\n", "\n", "\n", "\\begin{align*}\n", "f(\\mathbf{x}_0 + \\alpha \\mathbf{v})\n", "&= f(\\mathbf{x}_0) \n", "+ \\nabla f(\\mathbf{x}_0)^T(\\alpha \\mathbf{v})\n", "+ \\frac{1}{2} (\\alpha \\mathbf{v})^T \\mathbf{H}_f(\\mathbf{x}_0 + \\xi_\\alpha \\alpha \\mathbf{v}) (\\alpha \\mathbf{v})\\\\\n", "&= f(\\mathbf{x}_0) \n", "+ \\frac{\\alpha^2}{2} \\mathbf{v}^T \\mathbf{H}_f(\\mathbf{x}_0 + \\xi_\\alpha \\alpha \\mathbf{v})\\,\\mathbf{v}\n", "\\end{align*}\n", "\n", "\n", "where we used $\\nabla f(\\mathbf{x}_0) = 0$ by the *First-Order Necessary Condition*." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Since the Hessian is continuous, there is $\\delta > 0$ such that\n", "\n", "$$\n", "\\left| \\mathbf{v}^T \\mathbf{H}_f(\\mathbf{x}) \\,\\mathbf{v}\n", "- \\mathbf{v}^T \\mathbf{H}_f(\\mathbf{x}_0) \\,\\mathbf{v} \\right| < \\eta/2\n", "$$\n", "\n", "for all $\\mathbf{x} \\in B_\\delta(\\mathbf{x}_0)$. So taking $\\alpha$ small enough gives\n", "\n", "$$\n", "\\mathbf{v}^T \\mathbf{H}_f(\\mathbf{x}_0 + \\xi_\\alpha \\alpha \\mathbf{v}) \\,\\mathbf{v} < - \\eta/2 < 0.\n", "$$\n", "\n", "That implies\n", "\n", "$$\n", "f(\\mathbf{x}_0 + \\alpha \\mathbf{v})\n", "< f(\\mathbf{x}_0) - \\alpha^2 \\eta/4 < f(\\mathbf{x}_0). \n", "$$\n", "\n", "Since this holds for all sufficiently small $\\alpha$, every open ball around $\\mathbf{x}_0$ has a point achieving a lower value than $f(\\mathbf{x}_0)$. Thus $\\mathbf{x}_0$ is not a local minimizer, a contradiction. So it must be that $\\mathbf{H}_f(\\mathbf{x}_0) \\succeq 0$. 
$\\square$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "#### 2.2.3 Sufficient conditions" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "The necessary conditions in the previous subsection are not in general sufficient, as the following example shows." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "**Definition (Stationary Point):** Let $f : \\mathbb{R}^d \\to \\mathbb{R}$ be continuously differentiable on $\\mathbb{R}^d$. If $\\nabla f(\\mathbf{x}_0) = 0$, we say that $\\mathbf{x}_0$ is a stationary point of $f$. $\\lhd$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "**Example:** Let $f(x) = x^3$. Then $f'(x) = 3 x^2$ and $f''(x) = 6 x$ so that $f'(0) = 0$ and $f''(0) \\geq 0$. Hence $x=0$ is a stationary point. But $x=0$ is not a local minimizer. Indeed $f(0) = 0$ but, for any $\\delta > 0$, $f(-\\delta) < 0$." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f(x) = x^3\n", "x = LinRange(-2,2, 100)\n", "y = f.(x)\n", "plot(x, y, lw=2, legend=false, ylim = (-5,5))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "$\\lhd$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "We give sufficient conditions for a local minimizer. We will need the following lemma, whose proof relies on the following exercises." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "*Exercise:* Prove the following claim, which is known as the *Subspace Intersection Lemma*. 
Let $\\mathcal{S}_1$ and $\\mathcal{S}_2$ be linear subspaces of $\\mathbb{R}^d$ and let\n", "\n", "$$\n", "\\mathcal{S}_1 + \\mathcal{S}_2\n", "= \\{\\mathbf{x}_1 + \\mathbf{x}_2 \\,:\\, \\mathbf{x}_1 \\in \\mathcal{S}_1, \\mathbf{x}_2 \\in \\mathcal{S}_2\\}.\n", "$$\n", "\n", "Then it holds that\n", "\n", "$$\n", "\\mathrm{dim}(\\mathcal{S}_1 + \\mathcal{S}_2)\n", "= \\mathrm{dim}(\\mathcal{S}_1) + \\mathrm{dim}(\\mathcal{S}_2) \n", "- \\mathrm{dim}(\\mathcal{S}_1 \\cap \\mathcal{S}_2).\n", "$$\n", "\n", "[Hint: Consider a basis of $\\mathcal{S}_1 \\cap \\mathcal{S}_2$ and complete it into bases of $\\mathcal{S}_1$ and $\\mathcal{S}_2$. Show that the resulting list of vectors is linearly independent.] $\\lhd$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "*Exercise:* Show that, for any linear subspaces $\\mathcal{S}_1, \\ldots, \\mathcal{S}_m$ of $\\mathcal{V} = \\mathbb{R}^d$, it holds that\n", "\n", "$$\n", "\\mathrm{dim}\\left(\\bigcap_{k=1}^m \\mathcal{S}_k\\right)\n", "\\geq \\sum_{k=1}^m \\mathrm{dim}\\left(\\mathcal{S}_k\\right)\n", "- (m-1) \\,\\mathrm{dim}(\\mathcal{V}).\n", "$$\n", "\n", "[Hint: Use the Subspace Intersection Lemma and induction.] $\\lhd$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "For a symmetric matrix $C \\in \\mathbb{R}^{d \\times d}$, we let $\\lambda_j(C)$, $j=1, \\ldots, d$, be the eigenvalues of $C$ in non-increasing order with corresponding orthonormal eigenvectors $\\mathbf{v}_j$, $j=1, \\ldots, d$. 
Define the subspaces\n", "\n", "$$\n", "\\mathcal{V}_k(C) = \\mathrm{span}(\\mathbf{v}_1, \\ldots, \\mathbf{v}_k)\n", "\\quad\\text{and}\\quad\n", "\\mathcal{W}_{d-k+1}(C) = \\mathrm{span}(\\mathbf{v}_k, \\ldots, \\mathbf{v}_d).\n", "$$\n", "\n", "The following lemma is one version of what is known as Weyl's Inequality.\n", "\n", "***\n", "\n", "**Lemma (Weyl's Inequality):** Let $A \\in \\mathbb{R}^{d \\times d}$ and $B \\in \\mathbb{R}^{d \\times d}$ be symmetric matrices. Then\n", "\n", "$$\n", "\\max_{j \\in [d]} \\left|\\lambda_j(B) - \\lambda_j(A)\\right|\n", "\\leq \\|B - A\\|_2\n", "$$\n", "\n", "where $\\|C\\|_2$ is the induced $2$-norm of $C$.\n", "\n", "***" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "*Proof idea:* We use the extremal characterization of the eigenvalues together with a dimension argument. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "*Proof:* Let $H = B - A$. We prove only the upper bound $\\lambda_j(B) - \\lambda_j(A) \\leq \\|H\\|_2$. The other direction follows from interchanging the roles of $A$ and $B$. Because\n", "\n", "$$\n", "\\mathrm{dim}(\\mathcal{V}_j(B)) \n", "+ \\mathrm{dim}(\\mathcal{W}_{d-j+1}(A))\n", "= j + (d-j+1) \n", "= d+1\n", "$$\n", "\n", "it follows from the exercise above that\n", "\n", "$$\n", "\\mathrm{dim}\\left(\\mathcal{V}_j(B) \\cap \\mathcal{W}_{d-j+1}(A)\\right)\n", "\\geq (d+1) - d = 1.\n", "$$\n", "\n", "Hence $\\mathcal{V}_j(B) \\cap \\mathcal{W}_{d-j+1}(A)$ contains a nonzero vector. Let $\\mathbf{v}$ be a unit vector in that intersection. 
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "By *Courant-Fischer*,\n", "\n", "$$\n", "\\lambda_j(B)\n", "\\leq \\langle \\mathbf{v}, (A+H) \\mathbf{v}\\rangle\n", "= \\langle \\mathbf{v}, A \\mathbf{v}\\rangle + \\langle \\mathbf{v}, H \\mathbf{v}\\rangle\n", "\\leq \\lambda_j(A) + \\langle \\mathbf{v}, H \\mathbf{v}\\rangle.\n", "$$\n", "\n", "Moreover, by *Cauchy-Schwarz*, since $\\|\\mathbf{v}\\|=1$\n", "\n", "$$\n", "\\langle \\mathbf{v}, H \\mathbf{v}\\rangle \\leq \\|\\mathbf{v}\\| \\|H\\mathbf{v}\\| \\leq \\|H\\|_2\n", "$$\n", "\n", "which proves the claim. $\\square$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "***\n", "\n", "**Theorem (Second-Order Sufficient Condition):** Let $f : \\mathbb{R}^d \\to \\mathbb{R}$ be twice continuously differentiable on $\\mathbb{R}^d$. If $\\nabla f(\\mathbf{x}_0) = \\mathbf{0}$ and $\\mathbf{H}_f(\\mathbf{x}_0)$ is positive definite, then $\\mathbf{x}_0$ is a strict local minimizer.\n", "\n", "***" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "*Proof idea:* We use the *Multivariate Taylor's Theorem* again. This time we use the positive definiteness of the Hessian to bound the value of the function from below. We use *Weyl's Inequality* to show that the eigenvalues of the Hessian depend continuously on the point, which implies that the Hessian remains positive definite in an open ball around $\\mathbf{x}_0$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "*Proof:* We first claim that there is $\\rho > 0$ such that $\\mathbf{H}_f(\\mathbf{x})$ is positive definite for all $\\mathbf{x} \\in B_{\\rho}(\\mathbf{x}_0)$. That is the case at $\\mathbf{x} = \\mathbf{x}_0$ itself; let $\\mu_1 > 0$ be the smallest eigenvalue of $\\mathbf{H}_f(\\mathbf{x}_0)$. 
We use *Weyl's Inequality* to bound the eigenvalues of the Hessian from below around $\\mathbf{x}_0$. For any vector $\\mathbf{v} \\in \\mathbb{R}^d$, we have for all $j =1, \\ldots, d$\n", "\n", "$$\n", "\\lambda_j(\\mathbf{H}_f(\\mathbf{x}_0 + \\mathbf{v}))\n", "\\geq \\lambda_j(\\mathbf{H}_f(\\mathbf{x}_0)) - \\|\\mathbf{H}_f(\\mathbf{x}_0 + \\mathbf{v}) - \\mathbf{H}_f(\\mathbf{x}_0)\\|_2\n", "$$\n", "\n", "where we used the notation introduced above. Since the eigenvalues are in non-increasing order, we bound $\\lambda_j(\\mathbf{H}_f(\\mathbf{x}_0))\n", "\\geq \\lambda_d(\\mathbf{H}_f(\\mathbf{x}_0)) = \\mu_1$ and\n", "\n", "$$\n", "\\|\\mathbf{H}_f(\\mathbf{x}_0 + \\mathbf{v}) - \\mathbf{H}_f(\\mathbf{x}_0)\\|_2\n", "\\leq \\|\\mathbf{H}_f(\\mathbf{x}_0 + \\mathbf{v}) - \\mathbf{H}_f(\\mathbf{x}_0)\\|_F.\n", "$$\n", "\n", "The Frobenius norm above is continuous in $\\mathbf{v}$ as a composition of continuous functions. Moreover, at $\\mathbf{v} = \\mathbf{0}$ this Frobenius norm is $0$. Hence, by definition of continuity, there is $\\rho > 0$ such that for all $\\mathbf{v} \\in B_{\\rho}(\\mathbf{0})$ we have $\\|\\mathbf{H}_f(\\mathbf{x}_0 + \\mathbf{v}) - \\mathbf{H}_f(\\mathbf{x}_0)\\|_F < \\mu_1/2$. Plugging back above finally gives for such $\\mathbf{v}$'s and all $j=1,\\ldots,d$\n", "\n", "$$\n", "\\lambda_j(\\mathbf{H}_f(\\mathbf{x}_0 + \\mathbf{v})) > \\mu_1/2 > 0.\n", "$$\n", "\n", "In particular, $\\mathbf{H}_f(\\mathbf{x}_0 + \\mathbf{v})$ is positive definite and $\\langle \\mathbf{u}, \\mathbf{H}_f(\\mathbf{x}_0 + \\mathbf{v}) \\,\\mathbf{u} \\rangle > \\frac{\\mu_1}{2} \\|\\mathbf{u}\\|^2$ for any $\\mathbf{u} \\neq \\mathbf{0}$ by *Courant-Fischer*."
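, "\n", "\n", "To spell out the last step: expanding $\\mathbf{u}$ in an orthonormal basis of eigenvectors $\\mathbf{u}_1, \\ldots, \\mathbf{u}_d$ of $\\mathbf{H}_f(\\mathbf{x}_0 + \\mathbf{v})$ as $\\mathbf{u} = \\sum_{i=1}^d c_i \\mathbf{u}_i$ gives\n", "\n", "$$\n", "\\langle \\mathbf{u}, \\mathbf{H}_f(\\mathbf{x}_0 + \\mathbf{v}) \\,\\mathbf{u} \\rangle\n", "= \\sum_{i=1}^d \\lambda_i(\\mathbf{H}_f(\\mathbf{x}_0 + \\mathbf{v})) \\, c_i^2\n", "> \\frac{\\mu_1}{2} \\sum_{i=1}^d c_i^2\n", "= \\frac{\\mu_1}{2} \\|\\mathbf{u}\\|^2\n", "$$\n", "\n", "whenever $\\mathbf{u} \\neq \\mathbf{0}$."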
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "By the *Multivariate Taylor's Theorem*, $\\forall \\mathbf{v} \\in B_{\\rho}(\\mathbf{0}) \\setminus \\{\\mathbf{0}\\}$ there is $\\xi \\in (0,1)$ such that\n", "\n", "\n", "\\begin{align*}\n", "f(\\mathbf{x}_0 + \\mathbf{v})\n", "&= f(\\mathbf{x}_0) \n", "+ \\nabla f(\\mathbf{x}_0)^T \\mathbf{v}\n", "+ \\frac{1}{2} \\mathbf{v}^T \\mathbf{H}_f(\\mathbf{x}_0 + \\xi \\mathbf{v}) \\,\\mathbf{v}\\\\\n", "&> f(\\mathbf{x}_0) \n", "+ \\frac{\\mu_1}{4} \\|\\mathbf{v}\\|^2\\\\\n", "&> f(\\mathbf{x}_0) \n", "\\end{align*}\n", "\n", "\n", "where we used that $\\nabla f(\\mathbf{x}_0) = \\mathbf{0}$ and that $\\|\\xi \\mathbf{v}\\| = \\xi \\|\\mathbf{v}\\| < \\rho$. Therefore $\\mathbf{x}_0$ is a strict local minimizer. $\\square$" ] } ], "metadata": { "kernelspec": { "display_name": "Julia 1.5.1", "language": "julia", "name": "julia-1.5" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "1.5.1" } }, "nbformat": 4, "nbformat_minor": 2 }