TOPIC 3

Optimality, convexity, and gradient descent

3 Convexity


Course: Math 535 - Mathematical Methods in Data Science (MMiDS)
Author: Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison
Updated: Oct 10, 2020
Copyright: © 2020 Sebastien Roch


3.1 Convex sets

We start with convex sets.

Definition (Convex Set): A set $D \subseteq \mathbb{R}^d$ is convex if for all $\mathbf{x}, \mathbf{y} \in D$ and all $\alpha \in [0,1]$

$$ (1-\alpha) \mathbf{x} + \alpha \mathbf{y} \in D. $$

$\lhd$

The following is a convex set:

[Figure: a convex set]

The following is not a convex set:

[Figure: a set that is not convex]

Example: Any open ball in $\mathbb{R}^d$ is convex. Indeed, let $\delta > 0$ and $\mathbf{x}_0 \in \mathbb{R}^d$. For any $\mathbf{x}, \mathbf{y} \in B_{\delta}(\mathbf{x}_0)$ and any $\alpha \in [0,1]$, we have

$$ \begin{align*} \|[(1-\alpha) \mathbf{x} + \alpha \mathbf{y}] - \mathbf{x}_0\| &= \|(1-\alpha) (\mathbf{x} - \mathbf{x}_0) + \alpha (\mathbf{y} - \mathbf{x}_0)\|\\ &\leq \|(1-\alpha) (\mathbf{x} - \mathbf{x}_0)\| + \|\alpha (\mathbf{y} - \mathbf{x}_0)\|\\ &= (1-\alpha) \|\mathbf{x} - \mathbf{x}_0\| + \alpha \|\mathbf{y} - \mathbf{x}_0\|\\ &\leq (1-\alpha) \delta + \alpha \delta\\ &= \delta \end{align*} $$

where we used the triangle inequality on the second line. Hence we have established that $(1-\alpha) \mathbf{x} + \alpha \mathbf{y} \in B_{\delta}(\mathbf{x}_0)$. $\lhd$
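This calculation can also be checked numerically. Below is a minimal sketch in Python, assuming NumPy is available; the dimension, radius, and center are arbitrary choices, not part of the notes above. It samples points strictly inside an open ball and verifies that their convex combinations stay in the ball.

```python
import numpy as np

rng = np.random.default_rng(0)
d, delta = 3, 1.0
x0 = rng.standard_normal(d)  # center of the ball B_delta(x0)

def random_point_in_ball():
    # random direction, random radius strictly less than delta
    v = rng.standard_normal(d)
    r = delta * rng.uniform() ** (1 / d)
    return x0 + r * v / np.linalg.norm(v)

for _ in range(1000):
    x, y = random_point_in_ball(), random_point_in_ball()
    alpha = rng.uniform()
    z = (1 - alpha) * x + alpha * y
    # the convex combination must remain in the (open) ball
    assert np.linalg.norm(z - x0) < delta
print("all convex combinations remained in the ball")
```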

A number of operations preserve convexity.


Lemma (Operations that Preserve Convexity): Let $S_1$ and $S_2$ be convex sets in $\mathbb{R}^d$. Let $\beta \in \mathbb{R}$ and $\mathbf{b} \in \mathbb{R}^d$. The following sets are also convex:

(a) $\beta S_1 = \{\beta \mathbf{x}\,:\, \mathbf{x} \in S_1\}$

(b) $S_1 + \mathbf{b} = \{\mathbf{x} + \mathbf{b}\,:\, \mathbf{x} \in S_1\}$

(c) $S_1 + S_2 = \{\mathbf{x}_1 + \mathbf{x}_2\,:\, \mathbf{x}_1 \in S_1 \text{ and } \mathbf{x}_2 \in S_2\}$

(d) $S_1 \times S_2 = \{(\mathbf{x}_1, \mathbf{x}_2) \,:\, \mathbf{x}_1 \in S_1 \text{ and } \mathbf{x}_2 \in S_2\}$

(e) $T = \{\mathbf{x}_1\,:\, (\mathbf{x}_1, \mathbf{x}_2) \in S_1 \times \mathbb{R}^d\}$

(f) $S_1 \cap S_2 = \{\mathbf{x}\,:\, \mathbf{x} \in S_1 \text{ and } \mathbf{x} \in S_2\}$


Proof: We only prove (f). The other statements are left as an exercise. Suppose $\mathbf{x}, \mathbf{y} \in S_1 \cap S_2$ and $\alpha \in [0,1]$. Then, by the convexity of $S_1$, $(1-\alpha) \mathbf{x} + \alpha \mathbf{y} \in S_1$ and, by the convexity of $S_2$, $(1-\alpha) \mathbf{x} + \alpha \mathbf{y} \in S_2$. Hence

$$ (1-\alpha) \mathbf{x} + \alpha \mathbf{y} \in S_1 \cap S_2. $$

$\square$
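As a quick numerical illustration of (f), the following sketch (Python with NumPy; the two overlapping discs and the sample sizes are arbitrary choices) samples points in the intersection of two discs and checks that their convex combinations remain in the intersection.

```python
import numpy as np

rng = np.random.default_rng(1)
# two overlapping discs in R^2 (centers and radii are arbitrary choices)
c1, r1 = np.array([0.0, 0.0]), 1.0
c2, r2 = np.array([1.0, 0.0]), 1.0

def in_intersection(z):
    return np.linalg.norm(z - c1) <= r1 and np.linalg.norm(z - c2) <= r2

# rejection-sample points in S1 ∩ S2, then test convex combinations
pts = []
while len(pts) < 200:
    z = rng.uniform(-1, 2, size=2)
    if in_intersection(z):
        pts.append(z)

for _ in range(1000):
    i, j = rng.integers(len(pts), size=2)
    alpha = rng.uniform()
    w = (1 - alpha) * pts[i] + alpha * pts[j]
    assert in_intersection(w)  # stays in both discs
print("all convex combinations stayed in the intersection")
```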

3.2 Convex functions

Our main interest is in convex functions.

Definition (Convex Function): A function $f : \mathbb{R}^d \to \mathbb{R}$ is convex if, for all $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$ and all $\alpha \in [0,1]$

$$ f((1-\alpha) \mathbf{x} + \alpha \mathbf{y}) \leq (1-\alpha) f(\mathbf{x}) + \alpha f(\mathbf{y}). $$

More generally, a function $f : D \to \mathbb{R}$ over a convex domain $D \subseteq \mathbb{R}^d$ is convex if the inequality above holds for all $\mathbf{x}, \mathbf{y} \in D$. $\lhd$

[Figure: a convex function]


Lemma (Affine Functions are Convex): Let $\mathbf{w} \in \mathbb{R}^d$ and $b \in \mathbb{R}$. The function $f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b$ is convex.


Proof: For any $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$ and $\alpha \in [0,1]$,

$$ f((1-\alpha) \mathbf{x} + \alpha \mathbf{y}) = \mathbf{w}^T [(1-\alpha) \mathbf{x} + \alpha \mathbf{y}] + b = (1-\alpha)[\mathbf{w}^T \mathbf{x} + b] + \alpha [\mathbf{w}^T \mathbf{y} + b] $$

which proves the claim. $\square$
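The computation above can be checked on random samples. The sketch below (Python with NumPy; the particular $\mathbf{w}$, $b$, and the comparison function $\|\mathbf{x}\|^2$ are arbitrary choices) verifies that the defining inequality holds with equality for an affine function, and holds as an inequality for the convex function $\|\mathbf{x}\|^2$.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
w, b = rng.standard_normal(d), rng.standard_normal()

f_affine = lambda x: w @ x + b   # affine: definition holds with equality
f_quad = lambda x: x @ x         # ||x||^2: convex, inequality can be strict

for _ in range(1000):
    x, y = rng.standard_normal(d), rng.standard_normal(d)
    alpha = rng.uniform()
    z = (1 - alpha) * x + alpha * y
    assert np.isclose(f_affine(z), (1 - alpha) * f_affine(x) + alpha * f_affine(y))
    assert f_quad(z) <= (1 - alpha) * f_quad(x) + alpha * f_quad(y) + 1e-12
print("convexity inequalities verified on random samples")
```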


Lemma (First-Order Convexity Condition): Let $f : \mathbb{R}^d \to \mathbb{R}$ be continuously differentiable. Then $f$ is convex if and only if for all $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$

$$ f(\mathbf{y}) \geq f(\mathbf{x}) + \nabla f(\mathbf{x})^T (\mathbf{y}-\mathbf{x}). $$

The condition is illustrated below.

[Figure: illustration of the first-order convexity condition]

Proof (First-Order Convexity Condition): Suppose first that $f(\mathbf{z}_2) \geq f(\mathbf{z}_1) + \nabla f(\mathbf{z}_1)^T (\mathbf{z}_2-\mathbf{z}_1)$ for all $\mathbf{z}_1, \mathbf{z}_2$. For any $\mathbf{x}, \mathbf{y}$ and $\alpha \in [0,1]$, let $\mathbf{w} = (1-\alpha) \mathbf{x} + \alpha \mathbf{y}$. Then taking $\mathbf{z}_1 = \mathbf{w}$ and $\mathbf{z}_2 = \mathbf{x}$ gives

$$ f(\mathbf{x}) \geq f(\mathbf{w}) + \nabla f(\mathbf{w})^T (\mathbf{x}-\mathbf{w}) $$

and taking $\mathbf{z}_1 = \mathbf{w}$ and $\mathbf{z}_2 = \mathbf{y}$ gives

$$ f(\mathbf{y}) \geq f(\mathbf{w}) + \nabla f(\mathbf{w})^T (\mathbf{y}-\mathbf{w}). $$

Multiplying the first inequality by $(1-\alpha)$ and the second one by $\alpha$, and adding them up gives

$$ (1-\alpha) f(\mathbf{x}) + \alpha f(\mathbf{y}) \geq f(\mathbf{w}) + \nabla f(\mathbf{w})^T ([(1-\alpha) \mathbf{x} + \alpha \mathbf{y}] - \mathbf{w}) = f(\mathbf{w}) $$

proving convexity.

For the other direction, assume that $f$ is convex. For any $\mathbf{x}, \mathbf{y}$ and $\alpha \in (0,1)$, let again $\mathbf{w} = (1-\alpha) \mathbf{x} + \alpha \mathbf{y} = \mathbf{x} + \alpha (\mathbf{y} - \mathbf{x})$. By the Multivariate Mean Value Theorem, for some $\xi \in (0,1)$ it holds that

$$ f(\mathbf{w}) = f(\mathbf{x} + \alpha (\mathbf{y} - \mathbf{x})) = f(\mathbf{x}) + \alpha (\mathbf{y} - \mathbf{x})^T \nabla f (\mathbf{x} + \xi \alpha (\mathbf{y} - \mathbf{x})) $$

while convexity implies

$$ f(\mathbf{w}) \leq (1-\alpha) f(\mathbf{x}) + \alpha f(\mathbf{y}). $$

Combining, rearranging and dividing by $\alpha$ gives

$$ (\mathbf{y} - \mathbf{x})^T \nabla f (\mathbf{x} + \xi \alpha (\mathbf{y} - \mathbf{x})) \leq f(\mathbf{y}) - f(\mathbf{x}). $$

Taking $\alpha \to 0$ and using the continuity of $\nabla f$ gives the claim. $\square$
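The inequality in the First-Order Convexity Condition can also be tested numerically on a convex function. The sketch below (Python with NumPy; the quadratic $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T P \mathbf{x} + \mathbf{q}^T \mathbf{x}$ with $P = A^T A$ is an arbitrary choice) checks the condition at random pairs of points.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 3
A = rng.standard_normal((d, d))
P = A.T @ A                      # positive semidefinite, so f below is convex
q = rng.standard_normal(d)

f = lambda x: 0.5 * x @ P @ x + q @ x
grad = lambda x: P @ x + q       # gradient of f (P is symmetric here)

for _ in range(1000):
    x, y = rng.standard_normal(d), rng.standard_normal(d)
    # first-order condition: f(y) >= f(x) + grad(x)^T (y - x)
    assert f(y) >= f(x) + grad(x) @ (y - x) - 1e-9
print("first-order convexity condition held on all samples")
```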


Lemma (Second-Order Convexity Condition): Let $f : \mathbb{R}^d \to \mathbb{R}$ be twice continuously differentiable. Then $f$ is convex if and only if, for all $\mathbf{x} \in \mathbb{R}^d$, $\mathbf{H}_f(\mathbf{x})$ is positive semidefinite.


Example: Consider the quadratic function

$$ f(\mathbf{x}) = \frac{1}{2} \mathbf{x}^T P \mathbf{x} + \mathbf{q}^T \mathbf{x} + r. $$

We showed previously that the Hessian is

$$ \mathbf{H}_f(\mathbf{x}) = \frac{1}{2}[P + P^T]. $$

So $f$ is convex if and only if the matrix $\frac{1}{2}[P + P^T]$ has only nonnegative eigenvalues. $\lhd$
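This criterion is easy to apply numerically: form $\frac{1}{2}[P + P^T]$ and inspect its eigenvalues. Below is a minimal sketch (Python with NumPy; the two matrices are arbitrary examples, one whose symmetric part is positive semidefinite and one whose symmetric part is not).

```python
import numpy as np

# a non-symmetric P whose symmetric part is PSD, and a symmetric one that is not
P_convex = np.array([[2.0, 1.0],
                     [-1.0, 2.0]])   # (P + P^T)/2 = 2*I  -> convex
P_nonconvex = np.array([[1.0, 3.0],
                        [3.0, 1.0]]) # eigenvalues 4 and -2 -> not convex

for P in (P_convex, P_nonconvex):
    H = 0.5 * (P + P.T)
    eigs = np.linalg.eigvalsh(H)      # eigenvalues of the symmetric part
    print(eigs, "convex" if eigs.min() >= 0 else "not convex")
```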

3.3 Convexity and unconstrained optimization

A key property of convex functions is the following.


Theorem (Global Minimizers of Convex Functions): Let $f : \mathbb{R}^d \to \mathbb{R}$ be a convex function. Then any local minimizer of $f$ is also a global minimizer.


Proof: By contradiction, suppose $\mathbf{x}_0$ is a local minimizer, but not a global minimizer. Then there is $\mathbf{y}$ such that

$$ f(\mathbf{y}) < f(\mathbf{x}_0). $$

By convexity, for any $\alpha \in (0,1)$

$$ f(\mathbf{x}_0 + \alpha (\mathbf{y} - \mathbf{x}_0)) \leq (1-\alpha) f(\mathbf{x}_0) + \alpha f(\mathbf{y}) < f(\mathbf{x}_0). $$

But, since $\mathbf{x}_0 + \alpha (\mathbf{y} - \mathbf{x}_0) \to \mathbf{x}_0$ as $\alpha \to 0^+$, that implies that every open ball around $\mathbf{x}_0$ contains a point taking a smaller value than $f(\mathbf{x}_0)$, contradicting the assumption that $\mathbf{x}_0$ is a local minimizer. $\square$


Theorem (First-Order Sufficient Condition for Convex Functions): Let $f : \mathbb{R}^d \to \mathbb{R}$ be a continuously differentiable, convex function. If $\nabla f(\mathbf{x}_0) = \mathbf{0}$ then $\mathbf{x}_0$ is a global minimizer.


Proof: Assume $\nabla f(\mathbf{x}_0) = \mathbf{0}$. By the First-Order Convexity Condition, for any $\mathbf{y}$

$$ f(\mathbf{y}) - f(\mathbf{x}_0) \geq \nabla f(\mathbf{x}_0)^T (\mathbf{y} - \mathbf{x}_0) = 0. $$

That proves the claim. $\square$

Example: Consider the quadratic function

$$ f(\mathbf{x}) = \frac{1}{2} \mathbf{x}^T P \mathbf{x} + \mathbf{q}^T \mathbf{x} + r $$

where $P$ is symmetric and positive definite. The Hessian is then

$$ \mathbf{H}_f(\mathbf{x}) = \frac{1}{2}[P + P^T] = P $$

for any $\mathbf{x}$. So $f$ is convex. Further the gradient is

$$ \nabla f(\mathbf{x}) = P\mathbf{x} + \mathbf{q} $$

for all $\mathbf{x}$.

Any $\mathbf{x}$ satisfying

$$ P\mathbf{x} + \mathbf{q} = \mathbf{0} $$

is a global minimizer. If $P = Q \Lambda Q^T$ is a spectral decomposition of $P$, where all diagonal entries of $\Lambda$ are strictly positive, then $P^{-1} = Q \Lambda^{-1} Q^T$, where the diagonal entries of $\Lambda^{-1}$ are the inverses of those of $\Lambda$. That can be seen by checking that

$$ Q \Lambda Q^T Q \Lambda^{-1} Q^T = Q \Lambda \Lambda^{-1} Q^T = Q I_{d\times d} Q^T = I_{d \times d}, $$

where we used $Q^T Q = I_{d \times d}$.

So the following is a global minimizer

$$ \mathbf{x}^* = - Q \Lambda^{-1} Q^T \mathbf{q}. $$

$\lhd$
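The formula for $\mathbf{x}^*$ can be verified numerically. The sketch below (Python with NumPy; the particular $P$ and $\mathbf{q}$ are arbitrary choices) computes $\mathbf{x}^*$ from a spectral decomposition of $P$, checks that the gradient vanishes there, and compares against a direct linear solve.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 4
A = rng.standard_normal((d, d))
P = A.T @ A + np.eye(d)            # symmetric positive definite
q = rng.standard_normal(d)

lam, Q = np.linalg.eigh(P)         # spectral decomposition P = Q diag(lam) Q^T
x_star = -Q @ ((Q.T @ q) / lam)    # x* = -Q Lambda^{-1} Q^T q

# sanity checks: the gradient P x* + q vanishes, and x* matches a direct solve
print(np.allclose(P @ x_star + q, 0))
print(np.allclose(x_star, -np.linalg.solve(P, q)))
```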