Course: Math 888 - High-Dimensional Probability and Statistics (HDPS)
Author: Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison
Updated: Sep 22, 2021
Copyright: © 2021 Sebastien Roch
Throughout this section, $\|\mathbf{v}\|$ is the Euclidean norm of $\mathbf{v} \in\mathbb{R}^d$. The Euclidean norm naturally generalizes to matrices. Indeed recall that the Frobenius norm of an $n \times m$ matrix $A \in \mathbb{R}^{n \times m}$ is defined as
$$ \|A\|_F = \sqrt{\sum_{i=1}^n \sum_{j=1}^m a_{ij}^2}. $$Notice that the Frobenius norm does not directly relate to $A$ as a representation of a linear map. It is desirable in many contexts to quantify how two matrices differ in terms of how they act on vectors.
For instance, one is often interested in bounding quantities of the following form. Let $B, B' \in \mathbb{R}^{n \times m}$ and let $\mathbf{x} \in \mathbb{R}^m$ be of unit norm. What can be said about $\|B \mathbf{x} - B' \mathbf{x}\|$? Intuitively, what we would like is this: if the norm of $B - B'$ is small then $B$ is close to $B'$ as a linear map, that is, the vector norm $\|B \mathbf{x} - B' \mathbf{x}\|$ is small for any unit vector $\mathbf{x}$. The following definition provides us with such a notion. Define $\mathbb{S}^{m-1} = \{\mathbf{x} \in \mathbb{R}^m\,:\,\|\mathbf{x}\| = 1\}$.
Definition (Induced Norm): The $2$-norm of a matrix $A \in \mathbb{R}^{n \times m}$ is
$$ \|A\|_2 = \max_{\mathbf{0} \neq \mathbf{x} \in \mathbb{R}^m} \frac{\|A \mathbf{x}\|}{\|\mathbf{x}\|} = \max_{\mathbf{x} \in \mathbb{S}^{m-1}} \|A \mathbf{x}\|. $$$\lhd$
The second equality in the definition uses the homogeneity of the vector norm. The definition also implicitly relies on the Extreme Value Theorem: the function $f(\mathbf{x}) = \|A \mathbf{x}\|$ is continuous and the set $\mathbb{S}^{m-1}$ is closed and bounded, so there exists $\mathbf{x}^* \in \mathbb{S}^{m-1}$ such that $f(\mathbf{x}^*) \geq f(\mathbf{x})$ for all $\mathbf{x} \in \mathbb{S}^{m-1}$; that is, the maximum is attained.
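To see the definition in action, here is a minimal numerical sketch (assuming Python with NumPy; the matrix and sample size are arbitrary choices for illustration). It compares the induced $2$-norm computed by `numpy.linalg.norm`, a crude Monte Carlo approximation of the maximum over unit vectors, and the Frobenius norm.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))          # an arbitrary 5 x 3 matrix

# Induced 2-norm (largest singular value) and Frobenius norm via NumPy
two_norm = np.linalg.norm(A, 2)
fro_norm = np.linalg.norm(A, 'fro')

# Crude Monte Carlo approximation of max over unit vectors of ||A x||
X = rng.standard_normal((3, 100_000))
X /= np.linalg.norm(X, axis=0)           # normalize columns to unit norm
mc_max = np.max(np.linalg.norm(A @ X, axis=0))

print(mc_max, two_norm, fro_norm)        # mc_max <= two_norm <= fro_norm
```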
Exercise: Let $A \in \mathbb{R}^{n \times m}$. Use Cauchy-Schwarz to show that
$$ \|A\|_2 = \max \left\{ \mathbf{x}^T A \mathbf{y} \,:\, \|\mathbf{x}\| = \|\mathbf{y}\| = 1 \right\}. $$$\lhd$
Exercise: Let $A \in \mathbb{R}^{n \times m}$.
(a) Show that $\|A\|_F^2 = \sum_{j=1}^m \|A \mathbf{e}_j\|^2$.
(b) Use (a) and Cauchy-Schwarz to show that $\|A\|_2 \leq \|A\|_F$.
(c) Give an example such that $\|A\|_F = \sqrt{n} \|A\|_2$. $\lhd$
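Parts (a) and (b) are easy to check numerically before proving them. A small sketch, assuming NumPy (the matrix is an arbitrary example):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))

# (a): ||A||_F^2 equals the sum of the squared column norms ||A e_j||^2
col_sq = sum(np.linalg.norm(A[:, j])**2 for j in range(A.shape[1]))
print(np.isclose(np.linalg.norm(A, 'fro')**2, col_sq))        # True

# (b): the induced 2-norm is dominated by the Frobenius norm
print(np.linalg.norm(A, 2) <= np.linalg.norm(A, 'fro'))       # True
```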
The $2$-norm of a matrix has many other useful properties.
Lemma (Properties of the Induced Norm): Let $A, B \in \mathbb{R}^{n \times m}$ and $\alpha \in \mathbb{R}$. The following hold:
(a) $\|A \mathbf{x}\| \leq \|A\|_2 \|\mathbf{x}\|$, $\forall \mathbf{0} \neq \mathbf{x} \in \mathbb{R}^m$
(b) $\|A\|_2 \geq 0$
(c) $\|A\|_2 = 0$ if and only if $A = 0$
(d) $\|\alpha A\|_2 = |\alpha| \|A\|_2$
(e) $\|A + B \|_2 \leq \|A\|_2 + \|B\|_2$
(f) $\|A C \|_2 \leq \|A\|_2 \|C\|_2$ for any $C \in \mathbb{R}^{m \times p}$.
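These properties are straightforward to test on random matrices. The sketch below assumes NumPy; note that in (f) the second factor must have compatible dimensions for the product to be defined.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((4, 3))
C = rng.standard_normal((3, 2))          # compatible shape so that A @ C is defined

norm2 = lambda M: np.linalg.norm(M, 2)   # induced 2-norm

# (e) triangle inequality and (f) submultiplicativity
print(norm2(A + B) <= norm2(A) + norm2(B))   # True
print(norm2(A @ C) <= norm2(A) * norm2(C))   # True
```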
Theorem (Spectral Theorem): Let $A \in \mathbb{R}^{d \times d}$ be a symmetric matrix, that is, $A^T = A$. Then $A$ has $d$ orthonormal eigenvectors $\mathbf{q}_1, \ldots, \mathbf{q}_d$ with corresponding (not necessarily distinct) real eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d$. In matrix form, this is written as the matrix factorization
$$ A = Q \Lambda Q^T = \sum_{i=1}^d \lambda_i \mathbf{q}_i \mathbf{q}_i^T $$where $Q$ has columns $\mathbf{q}_1, \ldots, \mathbf{q}_d$ and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$. We refer to this factorization as a spectral decomposition of $A$.
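Numerically, a spectral decomposition can be computed with NumPy's `eigh` routine for symmetric matrices; it returns the eigenvalues in non-decreasing order, so they are flipped below to match the convention $\lambda_1 \geq \cdots \geq \lambda_d$. A minimal sketch (the matrix is an arbitrary symmetric example):

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.standard_normal((4, 4))
A = (M + M.T) / 2                        # a symmetric matrix

evals, evecs = np.linalg.eigh(A)         # eigenvalues in non-decreasing order
lam = evals[::-1]                        # lambda_1 >= ... >= lambda_d
Q = evecs[:, ::-1]                       # matching orthonormal eigenvectors as columns

# Check A = Q diag(lam) Q^T and that Q is orthogonal
print(np.allclose(A, Q @ np.diag(lam) @ Q.T))   # True
print(np.allclose(Q.T @ Q, np.eye(4)))          # True
```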
Proof idea (Spectral Theorem): Use a greedy sequence maximizing the quadratic form $\langle \mathbf{v}, A \mathbf{v}\rangle$. How is this quadratic form related to eigenvalues? Note that, for a unit eigenvector $\mathbf{v}$ with eigenvalue $\lambda$, we have $\langle \mathbf{v}, A \mathbf{v}\rangle = \langle \mathbf{v}, \lambda \mathbf{v}\rangle = \lambda$.
Exercise: Consider the block matrices
$$ \begin{pmatrix} \mathbf{y}\\ \mathbf{z} \end{pmatrix} \quad \text{and} \quad \begin{pmatrix} A & B\\ C & D \end{pmatrix} $$where $\mathbf{y}\in\mathbb{R}^{d_1}$, $\mathbf{z}\in\mathbb{R}^{d_2}$, $A \in \mathbb{R}^{d_1 \times d_1}$, $B \in \mathbb{R}^{d_1 \times d_2}$, $C \in \mathbb{R}^{d_2 \times d_1}$, and $D \in \mathbb{R}^{d_2 \times d_2}$. Show that
$$ \begin{pmatrix} \mathbf{y}\\ \mathbf{z} \end{pmatrix}^T \begin{pmatrix} A & B\\ C & D \end{pmatrix} \begin{pmatrix} \mathbf{y}\\ \mathbf{z} \end{pmatrix} = \mathbf{y}^T A \mathbf{y} + \mathbf{y}^T B \mathbf{z} + \mathbf{z}^T C \mathbf{y} + \mathbf{z}^T D \mathbf{z}. $$$\lhd$
Proof (Spectral Theorem): We proceed by induction.
A first eigenvector: Let $A_1 = A$ and
$$ \mathbf{v}_1 \in \arg\max\{\langle \mathbf{v}, A_1 \mathbf{v}\rangle:\|\mathbf{v}\| = 1\} $$and
$$ \lambda_1 = \max\{\langle \mathbf{v}, A_1 \mathbf{v}\rangle:\|\mathbf{v}\| = 1\}. $$Complete $\mathbf{v}_1$ into an orthonormal basis of $\mathbb{R}^d$, $\mathbf{v}_1$, $\hat{\mathbf{v}}_2, \ldots, \hat{\mathbf{v}}_d$, and form the block matrix
$$ \hat{W}_1 = \begin{pmatrix} \mathbf{v}_1 & \hat{V}_1 \end{pmatrix} $$where the columns of $\hat{V}_1$ are $\hat{\mathbf{v}}_2, \ldots, \hat{\mathbf{v}}_d$. Note that $\hat{W}_1$ is orthogonal by construction.
Getting one step closer to diagonalization: We show next that $\hat{W}_1$ gets us one step closer to a diagonal matrix by a similarity transformation. Note first that
$$ \hat{W}_1^T A_1 \hat{W}_1 = \begin{pmatrix} \lambda_1 & \mathbf{w}_1^T \\ \mathbf{w}_1 & A_2 \end{pmatrix} $$where $\mathbf{w}_1 = \hat{V}_1^T A_1 \mathbf{v}_1$ and $A_2 = \hat{V}_1^T A_1 \hat{V}_1$. The key claim is that $\mathbf{w}_1 = \mathbf{0}$. This follows from a contradiction argument.
Indeed, suppose $\mathbf{w}_1 \neq \mathbf{0}$ and consider the unit vector
$$ \mathbf{z} = \frac{1}{\sqrt{1 + \delta^2 \|\mathbf{w}_1\|^2}} \, \hat{W}_1 \begin{pmatrix} 1 \\ \delta \mathbf{w}_1 \end{pmatrix} $$which, by the exercise above, achieves objective value
$$ \begin{align*} \mathbf{z}^T A_1 \mathbf{z} &= \frac{1}{1 + \delta^2 \|\mathbf{w}_1\|^2} \begin{pmatrix} 1 \\ \delta \mathbf{w}_1 \end{pmatrix}^T \begin{pmatrix} \lambda_1 & \mathbf{w}_1^T \\ \mathbf{w}_1 & A_2 \end{pmatrix} \begin{pmatrix} 1 \\ \delta \mathbf{w}_1 \end{pmatrix}\\ &= \frac{1}{1 + \delta^2 \|\mathbf{w}_1\|^2} \left( \lambda_1 + 2 \delta \|\mathbf{w}_1\|^2 + \delta^2 \mathbf{w}_1^T A_2 \mathbf{w}_1 \right). \end{align*} $$By a Taylor expansion, for $\epsilon \in (0,1)$ small,
$$ \frac{1}{1 + \epsilon^2} \approx 1 - \epsilon^2. $$So, for $\delta$ small,
$$ \begin{align*} \mathbf{z}^T A_1 \mathbf{z} &\approx (\lambda_1 + 2 \delta \|\mathbf{w}_1\|^2 + \delta^2 \mathbf{w}_1^T A_2 \mathbf{w}_1) (1 - \delta^2 \|\mathbf{w}_1\|^2)\\ &\approx \lambda_1 + 2 \delta \|\mathbf{w}_1\|^2 + C \delta^2\\ &> \lambda_1 \end{align*} $$where $C \in \mathbb{R}$ depends on $\lambda_1$, $\mathbf{w}_1$, and $A_2$, and the last inequality holds for $\delta > 0$ small enough. That contradicts the maximality of $\lambda_1$ and establishes the claim that $\mathbf{w}_1 = \mathbf{0}$.
So, letting $W_1 = \hat{W}_1$,
$$ W_1^T A_1 W_1 = \begin{pmatrix} \lambda_1 & \mathbf{0} \\ \mathbf{0} & A_2 \end{pmatrix}. $$Finally note that $A_2 = \hat{V}_1^T A_1 \hat{V}_1$ is symmetric:
$$ A_2^T = (\hat{V}_1^T A_1 \hat{V}_1)^T = \hat{V}_1^T A_1^T \hat{V}_1 = \hat{V}_1^T A_1 \hat{V}_1 = A_2 $$by the symmetry of $A_1$ itself.
Next step of the induction: Apply the same argument to the symmetric matrix $A_2 \in \mathbb{R}^{(d-1)\times (d-1)}$, let $\hat{W}_2 \in \mathbb{R}^{(d-1)\times (d-1)}$ be the corresponding orthogonal matrix, and define $\lambda_2$ and $A_3$ through the equation
$$ \hat{W}_2^T A_2 \hat{W}_2 = \begin{pmatrix} \lambda_2 & \mathbf{0} \\ \mathbf{0} & A_3 \end{pmatrix}. $$Define the block matrix
$$ W_2 = \begin{pmatrix} 1 & \mathbf{0}\\ \mathbf{0} & \hat{W}_2 \end{pmatrix} $$and observe that
$$ W_2^T W_1^T A_1 W_1 W_2 = W_2^T \begin{pmatrix} \lambda_1 & \mathbf{0} \\ \mathbf{0} & A_2 \end{pmatrix} W_2 = \begin{pmatrix} \lambda_1 & \mathbf{0}\\ \mathbf{0} & \hat{W}_2^T A_2 \hat{W}_2 \end{pmatrix} =\begin{pmatrix} \lambda_1 & 0 & \mathbf{0} \\ 0 & \lambda_2 & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & A_3 \end{pmatrix}. $$Proceeding similarly by induction gives the claim. $\square$
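The first step of the proof can be mimicked numerically: take a unit eigenvector $\mathbf{v}_1$ attaining the maximum of the quadratic form (computed below with `eigh` rather than by explicit optimization), complete it into an orthonormal basis via a QR decomposition, and check that $\hat{W}_1^T A \hat{W}_1$ has the claimed block structure with $\mathbf{w}_1 = \mathbf{0}$. This is only an illustrative sketch, assuming NumPy.

```python
import numpy as np

rng = np.random.default_rng(4)
M = rng.standard_normal((5, 5))
A = (M + M.T) / 2

evals, evecs = np.linalg.eigh(A)
lam1 = evals[-1]                         # largest eigenvalue
v1 = evecs[:, -1]                        # maximizer of <v, Av> over unit vectors

# Complete v1 into an orthonormal basis: QR of [v1 | random columns]
B = np.column_stack([v1, rng.standard_normal((5, 4))])
W1, _ = np.linalg.qr(B)                  # orthogonal; first column is +/- v1

T = W1.T @ A @ W1
print(np.isclose(T[0, 0], lam1))         # top-left entry is lambda_1
print(np.allclose(T[1:, 0], 0))          # the off-diagonal block w_1 vanishes
```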
Definition (Rayleigh Quotient): Let $A \in \mathbb{R}^{d \times d}$ be a symmetric matrix. The Rayleigh quotient is defined as
$$ \mathcal{R}_A(\mathbf{u}) = \frac{\langle \mathbf{u}, A \mathbf{u} \rangle}{\langle \mathbf{u}, \mathbf{u} \rangle} $$for any $\mathbf{u} \neq \mathbf{0}$ in $\mathbb{R}^{d}$. $\lhd$
Exercise: Let $A \in \mathbb{R}^{d \times d}$ be a symmetric matrix. Let $\mathbf{v}$ be a (not necessarily unit) eigenvector of $A$ with eigenvalue $\lambda$. Show that $\mathcal{R}_A(\mathbf{v}) = \lambda$. $\lhd$
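A numerical sanity check of the last exercise, as a sketch assuming NumPy (the helper `rayleigh` is introduced here for illustration only):

```python
import numpy as np

def rayleigh(A, u):
    """Rayleigh quotient <u, Au> / <u, u> for a nonzero vector u."""
    return (u @ A @ u) / (u @ u)

rng = np.random.default_rng(5)
M = rng.standard_normal((6, 6))
A = (M + M.T) / 2

evals, evecs = np.linalg.eigh(A)
v = 3.7 * evecs[:, 2]                    # a non-unit eigenvector with eigenvalue evals[2]
print(np.isclose(rayleigh(A, v), evals[2]))   # True
```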
Exercise: Let $\mathcal{U}, \mathcal{V} \subseteq \mathbb{R}^d$ be subspaces such that $\mathrm{dim}(\mathcal{U}) + \mathrm{dim}(\mathcal{V}) > d$. Show there exists a non-zero vector in the intersection $\mathcal{U} \cap \mathcal{V}$. [Hint: Take the union of a basis for $\mathcal{U}$ and a basis for $\mathcal{V}$, then use linear dependence.] $\lhd$
Theorem (Courant-Fischer): Let $A \in \mathbb{R}^{d \times d}$ be a symmetric matrix with spectral decomposition $A = \sum_{i=1}^d \lambda_i \mathbf{v}_i \mathbf{v}_i^T$ where $\lambda_1 \geq \cdots \geq \lambda_d$. For each $k = 1,\ldots,d$, define the subspace
$$ \mathcal{V}_k = \mathrm{span}(\mathbf{v}_1, \ldots, \mathbf{v}_k) \quad\text{and}\quad \mathcal{W}_{d-k+1} = \mathrm{span}(\mathbf{v}_k, \ldots, \mathbf{v}_d). $$Then, for all $k = 1,\ldots,d$,
$$ \lambda_k = \min_{\mathbf{u} \in \mathcal{V}_k} \mathcal{R}_A(\mathbf{u}) = \max_{\mathbf{u} \in \mathcal{W}_{d-k+1}} \mathcal{R}_A(\mathbf{u}). $$Furthermore we have the following min-max formulas, which do not depend on the choice of spectral decomposition, for all $k = 1,\ldots,d$
$$ \lambda_k = \max_{\mathrm{dim}(\mathcal{V}) = k} \min_{\mathbf{u} \in \mathcal{V}} \mathcal{R}_A(\mathbf{u}) = \min_{\mathrm{dim}(\mathcal{W}) = d-k+1} \max_{\mathbf{u} \in \mathcal{W}} \mathcal{R}_A(\mathbf{u}). $$Proof idea: For the local formulas, we expand a vector in $\mathcal{V}_k$ into the basis $\mathbf{v}_1,\ldots,\mathbf{v}_k$ and use the fact that $\mathcal{R}_A(\mathbf{v}_i) = \lambda_i$ and that the eigenvalues are in non-increasing order. The global formulas then follow from a dimension argument.
Proof: We first prove the local formulas, that is, the ones involving a specific spectral decomposition.
Local formulas: Since $\mathbf{v}_1, \ldots, \mathbf{v}_k$ form an orthonormal basis of $\mathcal{V}_k$, any nonzero vector $\mathbf{u} \in \mathcal{V}_k$ can be written as $\mathbf{u} = \sum_{i=1}^k \langle \mathbf{u}, \mathbf{v}_i \rangle \mathbf{v}_i$ and it follows that
$$ \langle \mathbf{u}, \mathbf{u} \rangle = \sum_{i=1}^k \langle \mathbf{u}, \mathbf{v}_i \rangle^2 $$$$ \langle \mathbf{u}, A \mathbf{u} \rangle = \left\langle \mathbf{u}, \sum_{i=1}^k \langle \mathbf{u}, \mathbf{v}_i \rangle \lambda_i \mathbf{v}_i \right\rangle = \sum_{i=1}^k \lambda_i \langle \mathbf{u}, \mathbf{v}_i \rangle^2. $$Thus,
$$ \mathcal{R}_A(\mathbf{u}) = \frac{\langle \mathbf{u}, A \mathbf{u} \rangle}{\langle \mathbf{u}, \mathbf{u} \rangle} = \frac{\sum_{i=1}^k \lambda_i \langle \mathbf{u}, \mathbf{v}_i \rangle^2}{\sum_{i=1}^k \langle \mathbf{u}, \mathbf{v}_i \rangle^2} \geq \lambda_k \frac{\sum_{i=1}^k \langle \mathbf{u}, \mathbf{v}_i \rangle^2}{\sum_{i=1}^k \langle \mathbf{u}, \mathbf{v}_i \rangle^2} = \lambda_k $$where we used $\lambda_1 \geq \cdots \geq \lambda_k$ and the fact that $\langle \mathbf{u}, \mathbf{v}_i \rangle^2 \geq 0$. Moreover $\mathcal{R}_A(\mathbf{v}_k) = \lambda_k$. So we have established
$$ \lambda_k = \min_{\mathbf{u} \in \mathcal{V}_k} \mathcal{R}_A(\mathbf{u}). $$The expression in terms of $\mathcal{W}_{d-k+1}$ is proved similarly.
Global formulas: Since $\mathcal{V}_k$ has dimension $k$, it follows from the local formula that
$$ \lambda_k = \min_{\mathbf{u} \in \mathcal{V}_k} \mathcal{R}_A(\mathbf{u}) \leq \max_{\mathrm{dim}(\mathcal{V}) = k} \min_{\mathbf{u} \in \mathcal{V}} \mathcal{R}_A(\mathbf{u}). $$Let $\mathcal{V}$ be any subspace with dimension $k$. Because $\mathcal{W}_{d-k+1}$ has dimension $d - k + 1$, we have that $\dim(\mathcal{V}) + \mathrm{dim}(\mathcal{W}_{d-k+1}) > d$ and there must be a non-zero vector $\mathbf{u}_0$ in the intersection $\mathcal{V} \cap \mathcal{W}_{d-k+1}$ by the exercise above.
We then have by the other local formula that
$$ \lambda_k = \max_{\mathbf{u} \in \mathcal{W}_{d-k+1}} \mathcal{R}_A(\mathbf{u}) \geq \mathcal{R}_A(\mathbf{u}_0) \geq \min_{\mathbf{u} \in \mathcal{V}} \mathcal{R}_A(\mathbf{u}). $$Since this inequality holds for any subspace of dimension $k$, we have
$$ \lambda_k \geq \max_{\mathrm{dim}(\mathcal{V}) = k} \min_{\mathbf{u} \in \mathcal{V}} \mathcal{R}_A(\mathbf{u}). $$Combining with the inequality in the other direction above gives the claim. The other global formula is proved similarly. $\square$
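The local formulas can be checked numerically: sample many nonzero vectors in $\mathcal{V}_k$ and confirm that the Rayleigh quotient never drops below $\lambda_k$, with the value $\lambda_k$ attained at $\mathbf{v}_k$. A sketch assuming NumPy (the matrix and $k$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(6)
M = rng.standard_normal((6, 6))
A = (M + M.T) / 2

evals, evecs = np.linalg.eigh(A)
lam = evals[::-1]                        # lambda_1 >= ... >= lambda_d
V = evecs[:, ::-1]                       # columns v_1, ..., v_d

k = 3
Vk = V[:, :k]                            # orthonormal basis of V_k

# Random nonzero vectors in V_k and their Rayleigh quotients
C = rng.standard_normal((k, 10_000))
U = Vk @ C                               # columns lie in V_k
R = np.sum(U * (A @ U), axis=0) / np.sum(U * U, axis=0)

print(R.min() >= lam[k - 1] - 1e-10)                           # min over V_k is lambda_k
print(np.isclose(V[:, k - 1] @ A @ V[:, k - 1], lam[k - 1]))   # attained at v_k
```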
For a symmetric matrix $C \in \mathbb{R}^{d \times d}$, we let $\lambda_j(C)$, $j=1, \ldots, d$, be the eigenvalues of $C$ in non-increasing order with corresponding orthonormal eigenvectors $\mathbf{v}_j(C)$, $j=1, \ldots, d$. Define the subspaces
$$ \mathcal{V}_k(C) = \mathrm{span}(\mathbf{v}_1(C), \ldots, \mathbf{v}_k(C)) \quad\text{and}\quad \mathcal{W}_{d-k+1}(C) = \mathrm{span}(\mathbf{v}_k(C), \ldots, \mathbf{v}_d(C)). $$The following lemma is one version of what is known as Weyl's Inequality.
Lemma (Weyl): Let $A \in \mathbb{R}^{d \times d}$ and $B \in \mathbb{R}^{d \times d}$ be symmetric matrices. Then
$$ \max_{j \in [d]} \left|\lambda_j(B) - \lambda_j(A)\right| \leq \|B- A\|_2 $$where $\|C\|_2$ is the induced $2$-norm of $C$.
Proof idea: We use the extremal characterization of the eigenvalues together with a dimension argument.
Exercise: Prove the following claim, which is known as the Subspace Intersection Lemma. Let $\mathcal{S}_1$ and $\mathcal{S}_2$ be linear subspaces of $\mathbb{R}^d$ and let
$$ \mathcal{S}_1 + \mathcal{S}_2 = \{\mathbf{x}_1 + \mathbf{x}_2 \,:\, \mathbf{x}_1 \in \mathcal{S}_1,\ \mathbf{x}_2 \in \mathcal{S}_2\}. $$Then it holds that
$$ \mathrm{dim}(\mathcal{S}_1 + \mathcal{S}_2) = \mathrm{dim}(\mathcal{S}_1) + \mathrm{dim}(\mathcal{S}_2) - \mathrm{dim}(\mathcal{S}_1 \cap \mathcal{S}_2). $$[Hint: Consider a basis of $\mathcal{S}_1 \cap \mathcal{S}_2$ and complete it into bases of $\mathcal{S}_1$ and $\mathcal{S}_2$. Show that the resulting list of vectors is linearly independent.] $\lhd$
Exercise: Show that, for any linear subspaces $\mathcal{S}_1, \ldots, \mathcal{S}_m$ of $\mathcal{V} = \mathbb{R}^d$, it holds that
$$ \mathrm{dim}\left(\bigcap_{k=1}^m \mathcal{S}_k\right) \geq \sum_{k=1}^m \mathrm{dim}\left(\mathcal{S}_k\right) - (m-1) \,\mathrm{dim}(\mathcal{V}). $$[Hint: Use the Subspace Intersection Lemma and induction.] $\lhd$
Proof (Weyl): Let $H = B - A$. We prove only the upper bound. The other direction follows from interchanging the roles of $A$ and $B$. Because
$$ \mathrm{dim}(\mathcal{V}_j(B)) + \mathrm{dim}(\mathcal{W}_{d-j+1}(A)) = j + (d-j+1) = d+1 $$it follows from the exercise above that
$$ \mathrm{dim}\left(\mathcal{V}_j(B) \cap \mathcal{W}_{d-j+1}(A)\right) \geq d+1 - d = 1. $$Hence $\mathcal{V}_j(B) \cap \mathcal{W}_{d-j+1}(A)$ contains a non-zero vector. Let $\mathbf{v}$ be a unit vector in that intersection.
By Courant-Fischer,
$$ \lambda_j(B) \leq \langle \mathbf{v}, (A+H) \mathbf{v}\rangle = \langle \mathbf{v}, A \mathbf{v}\rangle + \langle \mathbf{v}, H \mathbf{v}\rangle \leq \lambda_j(A) + \langle \mathbf{v}, H \mathbf{v}\rangle. $$Moreover, by Cauchy-Schwarz, since $\|\mathbf{v}\|=1$
$$ \langle \mathbf{v}, H \mathbf{v}\rangle \leq \|\mathbf{v}\| \|H\mathbf{v}\| \leq \|H\|_2 $$which proves the claim. $\square$
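Weyl's Inequality is easy to test numerically: perturb a symmetric matrix by a small symmetric matrix and compare the eigenvalue shifts with the induced $2$-norm of the perturbation. A sketch assuming NumPy (the matrices are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(7)
M = rng.standard_normal((8, 8))
A = (M + M.T) / 2

E = rng.standard_normal((8, 8))
H = 0.01 * (E + E.T) / 2                 # a small symmetric perturbation
B = A + H

lam_A = np.linalg.eigvalsh(A)[::-1]      # eigenvalues in non-increasing order
lam_B = np.linalg.eigvalsh(B)[::-1]

print(np.max(np.abs(lam_B - lam_A)) <= np.linalg.norm(H, 2))   # True
```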
While Weyl's Inequality shows that the eigenvalues of $A$ and $B$ are close when $\|A - B\|_2$ is small, it says nothing about the eigenvectors. The following theorem gives a related bound for eigenvectors. It is usually stated in terms of the angle between the eigenvectors; here we give a version that is more suited to our applications, and we do not optimize the constants. We use the same notation as above.
Theorem (Davis-Kahan): Let $A \in \mathbb{R}^{d \times d}$ and $B \in \mathbb{R}^{d \times d}$ be symmetric matrices. For an $i \in \{1,\ldots,d\}$, assume that
$$ \delta := \min_{j \neq i} |\lambda_i(A) - \lambda_j(A)| > 0. $$Then
$$ \min_{s \in \{+1,-1\}} \|\mathbf{v}_i(A) - s \mathbf{v}_i(B)\|^2 \leq \frac{8 \|A - B\|_2^2}{\delta^2}. $$Proof: Expand $\mathbf{v}_i(B)$ in the basis formed by the eigenvectors of $A$, that is, $ \mathbf{v}_i(B) = \sum_{j=1}^d \langle \mathbf{v}_i(B), \mathbf{v}_j(A) \rangle \,\mathbf{v}_j(A), $ where we used the orthonormality of the $\mathbf{v}_j(A)$'s. On the one hand,
$$ \begin{align*} \|(A - \lambda_i(A) I) \,\mathbf{v}_i(B)\|^2 &= \left\|\sum_{j=1}^d \langle \mathbf{v}_i(B), \mathbf{v}_j(A) \rangle (A - \lambda_i(A) I)\,\mathbf{v}_j(A)\right\|^2\\ &= \left\|\sum_{j=1, j \neq i}^d \langle \mathbf{v}_i(B), \mathbf{v}_j(A) \rangle (\lambda_j(A) - \lambda_i(A))\,\mathbf{v}_j(A)\right\|^2\\ &= \sum_{j=1, j \neq i}^d \langle \mathbf{v}_i(B), \mathbf{v}_j(A) \rangle^2 (\lambda_j(A) - \lambda_i(A))^2\\ &\geq \delta^2 (1 - \langle \mathbf{v}_i(B), \mathbf{v}_i(A) \rangle^2), \end{align*} $$where, on the last two lines, we used the orthonormality of the $\mathbf{v}_j(A)$'s and $\mathbf{v}_j(B)$'s.
On the other hand, letting $E = A - B$, by the triangle inequality
$$ \begin{align*} \|(A - \lambda_i(A) I) \,\mathbf{v}_i(B)\| &= \|(B + E - \lambda_i(A) I) \,\mathbf{v}_i(B)\|\\ &\leq \|(B - \lambda_i(A) I) \,\mathbf{v}_i(B)\| + \|E \,\mathbf{v}_i(B)\|\\ &\leq |\lambda_i(B) - \lambda_i(A)| \|\mathbf{v}_i(B)\| + \|E\|_2 \|\mathbf{v}_i(B)\|\\ &\leq |\lambda_i(B) - \lambda_i(A)| + \|E\|_2, \end{align*} $$where, on the last line, we used the orthonormality of $\mathbf{v}_i(B)$ and the definition of $\|E\|_2$. By Weyl, we also have $|\lambda_i(B) - \lambda_i(A)| \leq \|E\|_2$.
Combining the last two inequalities gives
$$ (1 - \langle \mathbf{v}_i(B), \mathbf{v}_i(A) \rangle^2) \leq \frac{4 \|E\|_2^2}{\delta^2}. $$The result follows by noting that, since $|\langle \mathbf{v}_i(B), \mathbf{v}_i(A) \rangle| \leq 1$ by Cauchy-Schwarz,
$$ \begin{align*} \min_{s \in \{+1,-1\}} \|\mathbf{v}_i(A) - s \mathbf{v}_i(B)\|^2 &= 2 - 2 |\langle \mathbf{v}_i(B), \mathbf{v}_i(A) \rangle|\\ &\leq 2 (1 - \langle \mathbf{v}_i(B), \mathbf{v}_i(A) \rangle^2)\\ &\leq \frac{8 \|E\|_2^2}{\delta^2}. \end{align*} $$$\square$
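The Davis-Kahan bound can likewise be checked numerically, keeping in mind that it requires the eigenvalue gap $\delta$ to be positive and that eigenvectors are only determined up to sign. A sketch assuming NumPy (the matrices, the index $i$, and the perturbation size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(8)
d, i = 6, 2                              # i is a 0-based index into the sorted eigenvalues

M = rng.standard_normal((d, d))
A = (M + M.T) / 2
E = rng.standard_normal((d, d))
B = A + 0.001 * (E + E.T) / 2            # a small symmetric perturbation of A

def eig_desc(C):
    """Eigenvalues (non-increasing) and matching orthonormal eigenvectors."""
    evals, evecs = np.linalg.eigh(C)
    return evals[::-1], evecs[:, ::-1]

lam_A, V_A = eig_desc(A)
lam_B, V_B = eig_desc(B)

delta = np.min(np.abs(lam_A[i] - np.delete(lam_A, i)))       # eigenvalue gap at i
lhs = min(np.linalg.norm(V_A[:, i] - s * V_B[:, i])**2 for s in (1, -1))
rhs = 8 * np.linalg.norm(A - B, 2)**2 / delta**2

print(lhs <= rhs)                        # True
```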