TOPIC 3

Optimality, convexity, and gradient descent

4 Gradient descent


Course: Math 535 - Mathematical Methods in Data Science (MMiDS)
Author: Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison
Updated: Oct 17, 2020
Copyright: © 2020 Sebastien Roch


Let $f : \mathbb{R}^d \to \mathbb{R}$ be continuously differentiable. We restrict ourselves to unconstrained minimization problems of the form

$$ \min_{\mathbf{x} \in \mathbb{R}^d} f(\mathbf{x}). $$

Ideally one would like to identify a global minimizer of $f$. A naive approach might be to evaluate $f$ at a large number of points $\mathbf{x}$, say on a dense grid. However, even if we were satisfied with an approximate solution and limited ourselves to a bounded subset of the domain of $f$, this type of exhaustive search is wasteful and impractical in high dimension $d$, as the number of grid points needed grows exponentially with $d$.

A less naive approach might be to find all stationary points of $f$, that is, those $\mathbf{x}$'s such that $\nabla f(\mathbf{x}) = \mathbf{0}$, and then choose among them the one that produces the smallest value of $f(\mathbf{x})$. This indeed works in many problems, like the following example we have encountered previously.

Example: Consider the least-squares problem

$$ \min_{\mathbf{x} \in \mathbb{R}^d} \|A \mathbf{x} - \mathbf{b}\|^2 $$

where $A \in \mathbb{R}^{n \times d}$ has full column rank. In particular, $d \leq n$. The objective function is a quadratic function

$$ f(\mathbf{x}) = \|A \mathbf{x} - \mathbf{b}\|^2 = (A \mathbf{x} - \mathbf{b})^T(A \mathbf{x} - \mathbf{b}) = \mathbf{x}^T A^T A \mathbf{x} - 2 \mathbf{b}^T A \mathbf{x} + \mathbf{b}^T \mathbf{b}. $$

By a previous example,

$$ \nabla f(\mathbf{x}) = 2 A^T A \mathbf{x} - 2 A^T \mathbf{b} $$

where we used that $A^T A$ is symmetric.

So the stationary points satisfy

$$ A^T A \mathbf{x} = A^T \mathbf{b} $$

which you may recognize as the normal equations for the least-squares problem. We have previously shown that there is a unique solution to this system when $A$ has full column rank. Moreover, this optimization problem is convex. Indeed, by our previous example, the Hessian of $f$ is

$$ \mathbf{H}_f(\mathbf{x}) = 2 A^T A. $$

This Hessian is positive semidefinite since, for any $\mathbf{z} \in \mathbb{R}^d$,

$$ \langle \mathbf{z}, 2 A^T A \mathbf{z}\rangle = 2 (A \mathbf{z})^T (A \mathbf{z}) = 2 \|A \mathbf{z}\|^2 \geq 0. $$

Hence any local minimizer, which is necessarily a stationary point, is also a global minimizer, and we have found all global minimizers. $\lhd$
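
As a quick numerical sanity check (a sketch on made-up data, not part of the original example), we can verify in Julia that solving the normal equations recovers the same minimizer as the built-in least-squares solve:

using LinearAlgebra, Random
Random.seed!(535)                 # arbitrary seed for reproducibility
n, d = 100, 5
A = randn(n, d)                   # random A; has full column rank with probability 1
b = randn(n)
x_normal = (A' * A) \ (A' * b)    # solve the normal equations A'A x = A'b
x_ls = A \ b                      # built-in least-squares solve
norm(x_normal - x_ls)             # should be numerically negligible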

4.1 Steepest descent

In steepest descent, we attempt to find smaller and smaller values of $f$ by successively following directions in which $f$ decreases. As we have seen in the proof of the First-Order Necessary Condition, $- \nabla f$ provides such a direction. In fact, it is the direction of steepest descent in the following sense.

Recall from the Descent Direction and Directional Derivative Lemma that $\mathbf{v}$ is a descent direction at $\mathbf{x}_0$ if the directional derivative of $f$ at $\mathbf{x}_0$ in the direction $\mathbf{v}$ is negative.


Lemma (Steepest Descent): Let $f : \mathbb{R}^d \to \mathbb{R}$ be continuously differentiable at $\mathbf{x}_0$. For any unit vector $\mathbf{v} \in \mathbb{R}^d$,

$$ \frac{\partial f (\mathbf{x}_0)}{\partial \mathbf{v}} \geq \frac{\partial f (\mathbf{x}_0)}{\partial \mathbf{v}^*} $$

where

$$ \mathbf{v}^* = - \frac{\nabla f(\mathbf{x}_0)}{\|\nabla f(\mathbf{x}_0)\|}. $$

Proof idea: This is an immediate application of the Chain Rule and Cauchy-Schwarz.

Proof: By the Chain Rule and Cauchy-Schwarz,

$$ \begin{align*} \frac{\partial f (\mathbf{x}_0)}{\partial \mathbf{v}} &= \nabla f(\mathbf{x}_0)^T \mathbf{v}\\ &\geq - \|\nabla f(\mathbf{x}_0)\| \|\mathbf{v}\|\\ &= - \|\nabla f(\mathbf{x}_0)\|\\ &= \nabla f(\mathbf{x}_0)^T \left(- \frac{\nabla f(\mathbf{x}_0)}{\|\nabla f(\mathbf{x}_0)\|}\right)\\ &= \frac{\partial f (\mathbf{x}_0)}{\partial \mathbf{v}^*}, \end{align*} $$

where the inequality uses Cauchy-Schwarz and the subsequent equality uses $\|\mathbf{v}\| = 1$.

$\square$
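
To make the lemma concrete, here is a small numerical check (on a made-up function, not from the original notes) comparing the directional derivative along $\mathbf{v}^*$ with the directional derivatives along random unit directions:

using LinearAlgebra, Random
Random.seed!(535)
g(x) = x[1]^2 + 3*x[2]^2                      # made-up smooth function
grad_g(x) = [2*x[1], 6*x[2]]                  # its gradient
x0 = [1.0, -2.0]
vstar = -grad_g(x0)/norm(grad_g(x0))          # normalized negative gradient
dirderiv(v) = dot(grad_g(x0), v)              # directional derivative of g at x0 along v
vs = [normalize(randn(2)) for _ in 1:1000]    # random unit directions
minimum(dirderiv.(vs)) >= dirderiv(vstar)     # should return true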

At each iteration of steepest descent, we take a step in the direction of the negative of the gradient, that is,

$$ \mathbf{x}^{k+1} = \mathbf{x}^k - \alpha_k \nabla f(\mathbf{x}^k), \quad k=0,1,2,\ldots $$

for a sequence of steplengths $\alpha_k > 0$.

In general, we will not be able to guarantee that a global minimizer is reached in the limit, even if one exists. Our goal for now is more modest: to find a point where the gradient of $f$ approximately vanishes.

In [2]:
function desc_update(grad_f, x, β)
    # take one descent step of length β in the direction of the negative gradient
    return x .- β*grad_f(x)
end
Out[2]:
desc_update (generic function with 1 method)
In [3]:
function mmids_gd(f, grad_f, x0; β=1e-3, niters=1e6)
    # gradient descent with constant step size β for niters iterations
    xk = x0 # initialization
    for _ = 1:niters
        xk = desc_update(grad_f, xk, β) # gradient step
    end
    
    return xk, f(xk)
end
Out[3]:
mmids_gd (generic function with 1 method)
In [4]:
f(x) = (x-1)^2+10
xgrid = LinRange(-5,5,100)
plot(xgrid, f.(xgrid), lw=2, legend=false)
Out[4]:
In [5]:
grad_f(x) = 2*(x-1)
Out[5]:
grad_f (generic function with 1 method)
In [6]:
mmids_gd(f, grad_f, 0)
Out[6]:
(0.9999999999999722, 10.0)
In [7]:
f(x) = 4*(x-1)^2*(x+1)^2 - 2*(x-1)
xgrid = LinRange(-2,2,100)
plot(xgrid, f.(xgrid), lw=2, label="f", ylim=(-1,10), legend=:bottomleft)
Out[7]:
In [8]:
grad_f(x) = 8*(x-1)*(x+1)^2 + 8*(x-1)^2*(x+1) - 2
Out[8]:
grad_f (generic function with 1 method)
In [9]:
plot!(xgrid, grad_f.(xgrid), lw=2, label="grad", ylim=(-10,10))
Out[9]:
In [10]:
mmids_gd(f, grad_f, 0)
Out[10]:
(1.057453770738375, -0.0590145651028224)
In [11]:
mmids_gd(f, grad_f, -2)
Out[11]:
(-0.9304029265558538, 3.933005966859003)
In [12]:
f(x) = x^3
xgrid = LinRange(-2,2,100)
plot(xgrid, f.(xgrid), lw=2, label="f", ylim=(-10,10), legend=:bottomright)
Out[12]:
In [13]:
grad_f(x) = 3*x^2
plot!(xgrid, grad_f.(xgrid), lw=2, label="grad", ylim=(-10,10))
Out[13]:
In [14]:
mmids_gd(f, grad_f, 2)
Out[14]:
(0.00033327488712690107, 3.701755838398568e-11)
In [15]:
mmids_gd(f, grad_f, -2)
Out[15]:
(-Inf, -Inf)

4.2 Convergence analysis

We will use the following notation. Let $A, B \in \mathbb{R}^{d \times d}$ be symmetric matrices. Recall that $A \succeq 0$ means that $A$ is PSD. We write $A \preceq B$ (respectively $A \succeq B$) to indicate that $B - A \succeq 0$ (respectively $A - B \succeq 0$). We will need the following statement, whose proof is left as an exercise.

Exercise: Let $A \in \mathbb{R}^{d \times d}$ be a symmetric matrix. Show that $A \preceq M I_{d \times d}$ if and only if the eigenvalues of $A$ are at most $M$. Similarly, $m I_{d \times d} \preceq A$ if and only if the eigenvalues of $A$ are at least $m$. [Hint: Observe that the eigenvectors of $A$ are also eigenvectors of the identity matrix $I_{d \times d}$.] $\lhd$
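
Here is a small numerical illustration of the exercise (a sketch, not a proof; the matrix below is made up): for a random symmetric matrix $A$, checking that $M I_{d \times d} - A$ is PSD agrees with checking that the largest eigenvalue of $A$ is at most $M$.

using LinearAlgebra, Random
Random.seed!(535)
B = randn(4, 4)
A = (B + B')/2                                       # random symmetric matrix
M = 5.0                                              # arbitrary bound to test
psd_check = all(eigvals(Symmetric(M*I - A)) .>= 0)   # is M*I - A PSD?
eig_check = maximum(eigvals(Symmetric(A))) <= M      # are all eigenvalues of A at most M?
psd_check == eig_check                               # should return true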

Definition (Smooth Function): Let $f : \mathbb{R}^d \to \mathbb{R}$ be twice continuously differentiable. We say that $f$ is $L$-smooth if

$$ - L I_{d \times d} \preceq \mathbf{H}_f(\mathbf{x}) \preceq L I_{d \times d}, \quad \forall \mathbf{x} \in \mathbb{R}^d. $$

$\lhd$


Lemma (Quadratic Bound for Smooth Functions): Let $f : \mathbb{R}^d \to \mathbb{R}$ be twice continuously differentiable. Then $f$ is $L$-smooth if and only if for all $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$ it holds that

$$ \left|f(\mathbf{y}) - \{f(\mathbf{x}) + \nabla f(\mathbf{x})^T(\mathbf{y} - \mathbf{x})\}\right| \leq \frac{L}{2} \|\mathbf{y} - \mathbf{x}\|^2. $$

Proof idea: We apply the Multivariate Taylor's Theorem, then use the extremal characterization of the eigenvalues to bound the second-order term.

Proof: By the Multivariate Taylor's Theorem, for any $\alpha > 0$ there is $\xi_\alpha \in (0,1)$ such that

$$ f(\mathbf{x} + \alpha \mathbf{p}) = f(\mathbf{x}) + \alpha \nabla f(\mathbf{x})^T \mathbf{p} + \frac{1}{2} \alpha^2 \mathbf{p}^T \,\mathbf{H}_f(\mathbf{x} + \xi_\alpha \alpha \mathbf{p}) \,\mathbf{p} $$

where $\mathbf{p} = \mathbf{y} - \mathbf{x}$.

If $f$ is $L$-smooth, then, taking $\alpha = 1$, Courant-Fischer gives

$$ - L \|\mathbf{p}\|^2 \leq \mathbf{p}^T \,\mathbf{H}_f(\mathbf{x} + \xi_1 \mathbf{p}) \,\mathbf{p} \leq L \|\mathbf{p}\|^2. $$

Substituting this into the expansion above and rearranging gives the inequality in the statement.

On the other hand, if that inequality holds, by combining with the Taylor expansion above we get

$$ \left|\,\frac{1}{2} \alpha^2 \mathbf{p}^T \,\mathbf{H}_f(\mathbf{x} + \xi_\alpha \alpha \mathbf{p}) \,\mathbf{p}\,\right| \leq \frac{L}{2} \alpha^2 \|\mathbf{p}\|^2 $$

where we used that $\|\alpha \mathbf{p}\| = \alpha \|\mathbf{p}\|$ by homogeneity of the norm. Dividing by $\alpha^2/2$, then taking $\alpha \to 0$ and using the continuity of the Hessian gives

$$ \left|\, \mathbf{p}^T \,\mathbf{H}_f(\mathbf{x}) \,\mathbf{p} \,\right| \leq L \|\mathbf{p}\|^2. $$

By Courant-Fischer again, that implies that $f$ is $L$-smooth. $\square$
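
To see the bound in action on a simple example (made up for illustration): $f(x) = \sin(x)$ satisfies $|f''(x)| = |\sin(x)| \leq 1$, so it is $1$-smooth, and the quadratic bound can be checked numerically at random pairs of points.

using Random
Random.seed!(535)
fsin(x) = sin(x)                  # made-up test function; 1-smooth since |sin''| ≤ 1
grad_fsin(x) = cos(x)
L = 1.0
xs, ys = 10*randn(1000), 10*randn(1000)
gap(x, y) = abs(fsin(y) - fsin(x) - grad_fsin(x)*(y - x))     # deviation from the linear approximation
all(gap.(xs, ys) .<= (L/2)*(ys .- xs).^2 .+ 1e-12)            # should return true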


Theorem (Convergence of Steepest Descent in Smooth Case): Suppose that $f : \mathbb{R}^d \to \mathbb{R}$ is $L$-smooth and bounded from below, that is, there is $\bar{f} > - \infty$ such that $f(\mathbf{x}) \geq \bar{f}$, $\forall \mathbf{x} \in \mathbb{R}^d$. Then steepest descent with $\alpha_k = 1/L$ started from any $\mathbf{x}^0$ produces a sequence $\mathbf{x}^t$, $t=1,2,\ldots$, such that

$$ f(\mathbf{x}^{t+1}) \leq f(\mathbf{x}^t), \quad \forall t $$

and

$$ \lim_{t \to +\infty} \|\nabla f(\mathbf{x}^t)\| = 0. $$

Moreover, after $H$ steps, there is a $t$ in $\{0,\ldots,H-1\}$ such that

$$ \|\nabla f(\mathbf{x}^t)\| \leq \sqrt{\frac{2 L \left[f(\mathbf{x}^0) - \bar{f}\right]}{H}}. $$

Lemma (Descent Guarantee in the Smooth Case): Suppose that $f : \mathbb{R}^d \to \mathbb{R}$ is $L$-smooth. For any $\mathbf{x} \in \mathbb{R}^d$,

$$ f(\mathbf{x} - (1/L)\nabla f(\mathbf{x})) \leq f(\mathbf{x}) - \frac{1}{2 L} \|\nabla f(\mathbf{x})\|^2. $$

Proof (Descent Guarantee in the Smooth Case): By the Quadratic Bound for Smooth Functions, letting $\mathbf{p} = - \nabla f(\mathbf{x})$

$$ \begin{align*} f(\mathbf{x} + \alpha \mathbf{p}) &\leq f(\mathbf{x}) + \nabla f(\mathbf{x})^T (\alpha \mathbf{p}) + \frac{L}{2} \|\alpha \mathbf{p}\|^2\\ &= f(\mathbf{x}) - \alpha \|\nabla f(\mathbf{x})\|^2 + \alpha^2 \frac{L}{2} \|\nabla f(\mathbf{x})\|^2\\ &= f(\mathbf{x}) + \left( - \alpha + \alpha^2 \frac{L}{2} \right) \|\nabla f(\mathbf{x})\|^2. \end{align*} $$

The quadratic function in parentheses is convex and minimized at the stationary point $\alpha$ satisfying

$$ \frac{\mathrm{d}}{\mathrm{d} \alpha}\left( - \alpha + \alpha^2 \frac{L}{2} \right) = -1 + \alpha L = 0. $$

Taking $\alpha = 1/L$ and replacing in the inequality above gives

$$ f(\mathbf{x} -(1/L) \nabla f(\mathbf{x})) \leq f(\mathbf{x}) - \frac{1}{2L}\|\nabla f(\mathbf{x})\|^2 $$

as claimed. $\square$
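
A quick numerical check of this guarantee (on a made-up function, not from the original notebook): the softplus function $f(x) = \log(1 + e^x)$ has $f''(x) = \sigma(x)(1 - \sigma(x)) \leq 1/4$, so it is $L$-smooth with $L = 1/4$, and one gradient step with step size $1/L$ should decrease $f$ by at least $\|\nabla f(x)\|^2/(2L)$.

softplus(x) = log(1 + exp(x))               # made-up test function; L-smooth with L = 1/4
grad_softplus(x) = 1/(1 + exp(-x))          # its derivative is the sigmoid
L = 1/4
x = 1.0                                      # arbitrary point
xnew = x - (1/L)*grad_softplus(x)            # one steepest descent step with α = 1/L
softplus(xnew) <= softplus(x) - grad_softplus(x)^2/(2*L)   # should return true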

Exercise: For $a \in [-1,1]$ and $b \in \{0,1\}$, let $\hat{f}(x, a) = \sigma(x a)$ be a classifier parametrized by $x \in \mathbb{R}$, where

$$ \sigma(t) = \frac{1}{1 + e^{-t}} $$

is the sigmoid function. For a dataset $a_i \in [-1,1]$ and $b_i \in \{0,1\}$, $i=1,\ldots,n$, let the cross-entropy loss be

$$ \mathcal{L}(x, \{(a_i, b_i)\}_{i=1}^n) = \frac{1}{n} \sum_{i=1}^n \ell(x, a_i, b_i) $$

where

$$ \ell(x, a, b) = - b \log(\hat{f}(x, a)) - (1-b) \log(1 - \hat{f}(x, a)). $$
In [16]:
n = 10000
a = 2*rand(n) .- 1
b = rand(0:1,n)
fhat(x) = 1 ./ ( 1 .+ exp.(-x.*a))
loss(x) = mean(-b.*log.(fhat(x)) - (1 .- b).*log.(1 .- fhat(x)))
grad(x) = -mean((b .- fhat(x)).*a)
x = LinRange(-1,1,100)
plot(x, loss.(x), lw=2, label="loss", legend=:bottomright)
Out[16]:
In [17]:
x0 = -0.3
x = LinRange(x0-0.05,x0+0.05,100)
upper = loss(x0) .+ (x .- x0)*grad(x0) .+ (1/2)*(x .- x0).^2 # quadratic upper bound (here with L = 1)
lower = loss(x0) .+ (x .- x0)*grad(x0) .- (1/2)*(x .- x0).^2 # quadratic lower bound (here with L = 1)
plot(x, loss.(x), lw=2, label="loss")
plot!(x, upper, lw=2, label="upper")
plot!(x, lower, lw=2, label="lower")
Out[17]:
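
Since loss and grad above are exactly in the form expected by mmids_gd, we can also run gradient descent on this cross-entropy loss directly (a quick illustration with the default step size and number of iterations; the data a, b are random, so the result will vary from run to run):

mmids_gd(loss, grad, 0.0)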

Proof (Convergence of Steepest Descent in Smooth Case): By the Descent Guarantee in the Smooth Case,

$$ f(\mathbf{x}^{t+1}) \leq f(\mathbf{x}^t) - \frac{1}{2 L}\|\nabla f(\mathbf{x}^t)\|^2 \leq f(\mathbf{x}^t), \quad \forall t. $$

Furthermore, using a telescoping sum, we get

$$ \begin{align*} f(\mathbf{x}^H) &= f(\mathbf{x}^0) + \sum_{t=0}^{H-1} [f(\mathbf{x}^{t+1}) - f(\mathbf{x}^t)]\\ &\leq f(\mathbf{x}^0) - \frac{1}{2 L}\sum_{t=0}^{H-1} \|\nabla f(\mathbf{x}^t)\|^2. \end{align*} $$

Rearranging and using $f(\mathbf{x}^H) \geq \bar{f}$ leads to

$$ \sum_{t=0}^{H-1} \|\nabla f(\mathbf{x}^t)\|^2 \leq 2L [f(\mathbf{x}^0) - \bar{f}]. $$

Since the left-hand side is bounded uniformly in $H$, the series $\sum_{t \geq 0} \|\nabla f(\mathbf{x}^t)\|^2$ converges and its terms must vanish, that is, $\|\nabla f(\mathbf{x}^t)\|^2 \to 0$ as $t \to +\infty$. We also get the more quantitative bound

$$ \min_{t=0,\ldots, H-1} \|\nabla f(\mathbf{x}^t)\|^2 \leq \frac{2L [f(\mathbf{x}^0) - \bar{f}]}{H} $$

since the minimum is necessarily less than or equal to the average. That proves the last claim. $\square$
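
As an empirical check of the last bound (a sketch using the quadratic $f(x) = (x-1)^2 + 10$ from the earlier cells, for which $L = 2$ and $\bar{f} = 10$), we record the first $H$ iterates with a small helper (gd_path, not in the original notebook) and compare the smallest gradient norm to the theoretical bound:

function gd_path(grad_f, x0, β, H)
    # record the iterates of steepest descent with constant step size β
    xs = [x0]
    for _ = 1:H
        push!(xs, xs[end] - β*grad_f(xs[end]))
    end
    return xs
end
f(x) = (x-1)^2 + 10
grad_f(x) = 2*(x-1)
L, fbar = 2.0, 10.0                      # f''(x) = 2 and f(x) ≥ 10 for all x
x0, H = 10.0, 100
xs = gd_path(grad_f, x0, 1/L, H)         # iterates x^0, x^1, ..., x^H
minimum(abs.(grad_f.(xs[1:H]))) <= sqrt(2*L*(f(x0) - fbar)/H)   # should return true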

Definition (Strongly Convex Function): Let $f : \mathbb{R}^d \to \mathbb{R}$ be twice continuously differentiable and let $m > 0$. We say that $f$ is $m$-strongly convex if

$$ \mathbf{H}_f(\mathbf{x}) \succeq m I_{d \times d}, \quad \forall \mathbf{x} \in \mathbb{R}^d. $$

$\lhd$


Lemma (Quadratic Bound for Strongly Convex Functions): Let $f : \mathbb{R}^d \to \mathbb{R}$ be twice continuously differentiable. Then $f$ is $m$-strongly convex if and only if for all $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$ it holds that

$$ f(\mathbf{y}) \geq f(\mathbf{x}) + \nabla f(\mathbf{x})^T(\mathbf{y} - \mathbf{x}) + \frac{m}{2} \|\mathbf{y} - \mathbf{x}\|^2. $$

Theorem (Global Minimizer of Strongly Convex Functions): Let $f : \mathbb{R}^d \to \mathbb{R}$ be twice continuously differentiable and $m$-strongly convex with $m>0$. If $\nabla f(\mathbf{x}^*) = \mathbf{0}$, then $\mathbf{x}^*$ is the unique global minimizer of $f$.


Proof: If $\nabla f(\mathbf{x}^*) = \mathbf{0}$, by the Quadratic Bound for Strongly Convex Functions,

$$ f(\mathbf{y}) \geq f(\mathbf{x}^*) + \nabla f(\mathbf{x}^*)^T(\mathbf{y} - \mathbf{x}^*) + \frac{m}{2} \|\mathbf{y} - \mathbf{x}^*\|^2 > f(\mathbf{x}^*) $$

for all $\mathbf{y} \neq \mathbf{x}^*$, which proves the claim. $\square$

If $f$ is $m$-strongly convex and has a global minimizer $\mathbf{x}^*$, then the global minimizer is unique and characterized by $\nabla f(\mathbf{x}^*) = \mathbf{0}$. Strong convexity allows us to relate the value of the function at a point $\mathbf{x}$ and the gradient of $f$ at that point. This is proved in the following lemma, which is key to our convergence results.


Lemma (Relating $f$ and $\nabla f$): Let $f : \mathbb{R}^d \to \mathbb{R}$ be twice continuously differentiable, $m$-strongly convex with a global minimizer at $\mathbf{x}^*$. Then for any $\mathbf{x} \in \mathbb{R}^d$

$$ f(\mathbf{x}) - f(\mathbf{x}^*) \leq \frac{\|\nabla f(\mathbf{x})\|^2}{2 m}. $$

Proof: By the Quadratic Bound for Strongly Convex Functions,

$$ \begin{align*} f(\mathbf{x}^*) &\geq f(\mathbf{x}) + \nabla f(\mathbf{x})^T (\mathbf{x}^* - \mathbf{x}) + \frac{m}{2} \|\mathbf{x}^* - \mathbf{x}\|^2\\ &= f(\mathbf{x}) + \nabla f(\mathbf{x})^T \mathbf{w} + \frac{1}{2} \mathbf{w}^T (m I_{d \times d}) \mathbf{w} \end{align*} $$

where on the second line we defined $\mathbf{w} = \mathbf{x}^* - \mathbf{x}$. The right-hand side is a quadratic function in $\mathbf{w}$ (for $\mathbf{x}$ fixed). So the inequality is still valid if we replace $\mathbf{w}$ with the global minimizer $\mathbf{w}^*$ of that quadratic function.

The matrix $m I_{d \times d}$ is positive definite since its eigenvalues are all equal to $m > 0$. By a previous example, we know that

$$ \mathbf{w}^* = - (m I_{d \times d})^{-1} \nabla f(\mathbf{x}) = - \frac{1}{m} \nabla f(\mathbf{x}). $$

So, replacing $\mathbf{w}$ with $\mathbf{w}^*$, we have the inequality

$$ \begin{align*} f(\mathbf{x}^*) & \geq f(\mathbf{x}) + \nabla f(\mathbf{x})^T \left\{- \frac{1}{m} \nabla f(\mathbf{x})\right\} + \frac{1}{2} \left\{- \frac{1}{m} \nabla f(\mathbf{x})\right\}^T (m I_{d \times d}) \left\{- \frac{1}{m} \nabla f(\mathbf{x})\right\}\\ & = f(\mathbf{x}) - \frac{1}{2m} \|\nabla f(\mathbf{x})\|^2. \end{align*} $$

Rearranging gives the claim. $\square$
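
As a concrete check of the lemma (a made-up one-dimensional example): $f(x) = x^2 + x^4$ is $2$-strongly convex since $f''(x) = 2 + 12x^2 \geq 2$, its global minimizer is $x^* = 0$, and the inequality can be verified at random points.

using Random
Random.seed!(535)
fsc(x) = x^2 + x^4                        # made-up function; 2-strongly convex
grad_fsc(x) = 2*x + 4*x^3
m = 2.0
pts = 5*randn(1000)                       # random evaluation points
all(fsc.(pts) .- fsc(0) .<= grad_fsc.(pts).^2 ./ (2*m))   # should return true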


Theorem (Convergence of Steepest Descent in Strongly Convex Case): Suppose that $f : \mathbb{R}^d \to \mathbb{R}$ is $L$-smooth and $m$-strongly convex with a global minimizer at $\mathbf{x}^*$. Then steepest descent with $\alpha_k = 1/L$ started from any $\mathbf{x}^0$ produces a sequence $\mathbf{x}^t$, $t=1,2,\ldots$, such that

$$ \lim_{t \to +\infty} f(\mathbf{x}^t) = f(\mathbf{x}^*). $$

Moreover, after $H$ steps, we have

$$ f(\mathbf{x}^H) - f(\mathbf{x}^*) \leq \left(1 - \frac{m}{L}\right)^H [f(\mathbf{x}^0) - f(\mathbf{x}^*)]. $$

Observe that $f(\mathbf{x}^H) - f(\mathbf{x}^*)$ decreases exponentially fast in $H$.

Proof (Convergence of Steepest Descent in Strongly Convex Case): By the Descent Guarantee in the Smooth Case together with the lemma above, we have for all $t$

$$ f(\mathbf{x}^{t+1}) \leq f(\mathbf{x}^t) - \frac{1}{2L} \|\nabla f(\mathbf{x}^t)\|^2 \leq f(\mathbf{x}^t) - \frac{m}{L} [f(\mathbf{x}^t) - f(\mathbf{x}^*)]. $$

Subtracting $f(\mathbf{x}^*)$ on both sides gives

$$ f(\mathbf{x}^{t+1}) - f(\mathbf{x}^*) \leq \left(1 - \frac{m}{L}\right)[f(\mathbf{x}^t) - f(\mathbf{x}^*)]. $$

Recursing gives the claim. $\square$
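
To illustrate the geometric rate (a sketch on made-up data): for the least-squares objective $f(\mathbf{x}) = \|A\mathbf{x} - \mathbf{b}\|^2$, the Hessian is the constant matrix $2 A^T A$, so we can take $L$ and $m$ to be its largest and smallest eigenvalues, run steepest descent via mmids_gd with β = 1/L, and check the bound after $H$ steps.

using LinearAlgebra, Random
Random.seed!(535)
n, d = 100, 5
A, b = randn(n, d), randn(n)                  # made-up data
fls(x) = norm(A*x - b)^2
grad_fls(x) = 2*A'*(A*x - b)
evals = eigvals(Symmetric(2*A'*A))            # the Hessian is constant, equal to 2AᵀA
L, m = maximum(evals), minimum(evals)
xstar = A \ b                                 # unique global minimizer (least-squares solution)
x0, H = zeros(d), 20
xH, fH = mmids_gd(fls, grad_fls, x0; β=1/L, niters=H)
fH - fls(xstar) <= (1 - m/L)^H * (fls(x0) - fls(xstar))   # should return true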