Course: Math 535 - Mathematical Methods in Data Science (MMiDS)
Author: Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison
Updated: Sep 24, 2020
Copyright: © 2020 Sebastien Roch
The setup We are given a data matrix $A \in \mathbb{R}^{n \times d}$ and we want seek to project this data to a lower-dimensional space by computing a low-rank approximation of $A$. Truncating the SVD provides a solution to this problem. The SVD of $A$ is a decomposition of the form:
$$ A = U \Sigma V^T = \sum_{j=1}^r \sigma_j \mathbf{u}_j \mathbf{v}_j^T $$where the columns of $U \in \mathbb{R}^{n \times r}$ and those of $V \in \mathbb{R}^{m \times r}$ are orthonormal, and $\Sigma \in \mathbb{R}^{r \times r}$ is a diagonal matrix. Here the $\mathbf{u}_j$'s are the columns of $U$ and are referred to as left singular vectors. Similarly the $\mathbf{v}_j$'s are the columns of $V$ and are referred to as right singular vectors. The $\sigma_j$'s, which are non-negative and in decreasing order
$$ \sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0 $$are the diagonal elements of $\Sigma$ and are referred to as singular values.
The theory Indeed, we established the following. First, recall the following definitions. Define $\mathbb{S}^{m-1} = \{\mathbf{x} \in \mathbb{R}^m\,:\,\|\mathbf{x}\| = 1\}$.
Definition (Frobenius Norm): The Frobenius norm of an $n \times m$ matrix $A \in \mathbb{R}^{n \times m}$ is defined as
$$ \|A\|_F = \sqrt{\sum_{i=1}^n \sum_{j=1}^m a_{ij}^2}. $$$\lhd$
Definition (Induced Norm): The $2$-norm of a matrix $A \in \mathbb{R}^{n \times m}$ is
$$ \|A\|_2 = \max_{\mathbf{0} \neq \mathbf{x} \in \mathbb{R}^m} \frac{\|A \mathbf{x}\|}{\|\mathbf{x}\|} = \max_{\mathbf{x} \in \mathbb{S}^{m-1}} \|A \mathbf{x}\|. $$$\lhd$
Theorem (Low-Rank Approximation in the Induced Norm): Let $A \in \mathbb{R}^{n \times m}$ be a matrix with SVD
$$ A = \sum_{j=1}^r \sigma_j \mathbf{u}_j \mathbf{v}_j^T $$and let $A_k$ be the truncation defined above with $k < r$. For any matrix $B \in \mathbb{R}^{n \times m}$ of rank at most $k$,
$$ \|A - A_k\|_2 \leq \|A - B\|_2. $$Finally, we justified the use of this projection by the following inequality.
Theorem (Why Project): Let $A \in \mathbb{A}^{n \times d}$ be a matrix and let $A_k$ be the truncation above. For any matrix $C \in \mathbb{R}^{n \times d}$ of rank $\leq k$,
$$ \|A_k - C\|_F^2 \leq 8 k \|A - C\|_2^2. $$The algorithm We used the Power Iteration to compute the top singular vectors.
Lemma (Power Iteration): Let $A \in \mathbb{R}^{n\times m}$ be a matrix with $m \leq n$. Let $U \Sigma V^T$ be an SVD of $A$ such that $\sigma_1 > \sigma_2$. Define $B = A^T A$ and assume that $\mathbf{x} \in \mathbb{R}^m$ is a vector satisfying $\langle \mathbf{v}_1, \mathbf{x} \rangle > 0$. Then
$$ \frac{B^{2k} \mathbf{x}}{\|B^{2k} \mathbf{x}\|} \to \mathbf{v}_1 $$as $k \to +\infty$.
The algorithm is below.
function mmids_svd(A, r; maxiter=1000)
V = randn(size(A,2),r) # random initialization
for _ = 1:maxiter
W = A * V
Z = A' * W
V, R = mmids_gramschmidt(Z)
end
W = A * V
S = [norm(W[:, i]) for i=1:size(W,2)] # singular values
U = reduce(hcat,[W[:,i]/S[i] for i=1:size(W,2)]) # left singular vectors
return U, S, V
end
mmids_svd (generic function with 1 method)
We consider an application of dimensionality reduction in biology. We will look at SNP data from viruses. A little background first. From Wikipedia:
A single-nucleotide polymorphism (SNP; /snɪp/; plural /snɪps/) is a substitution of a single nucleotide that occurs at a specific position in the genome, where each variation is present at a level of more than 1% in the population. For example, at a specific base position in the human genome, the C nucleotide may appear in most individuals, but in a minority of individuals, the position is occupied by an A. This means that there is a SNP at this specific position, and the two possible nucleotide variations – C or A – are said to be the alleles for this specific position.
Quoting Jombart et al., BMC Genetics (2010), we analyze:
the population structure of seasonal influenza A/H3N2 viruses using hemagglutinin (HA) sequences. Changes in the HA gene are largely responsible for immune escape of the virus (antigenic shift), and allow seasonal influenza to persist by mounting yearly epidemics peaking in winter. These genetic changes also force influenza vaccines to be updated on a yearly basis. [...] Assessing the genetic evolution of a pathogen through successive epidemics is of considerable epidemiological interest. In the case of seasonal influenza, we would like to ascertain how genetic changes accumulate among strains from one winter epidemic to the next.
Some details about the Jombart et al. dataset:
For this purpose, we retrieved all sequences of H3N2 hemagglutinin (HA) collected between 2001 and 2007 available from Genbank. Only sequences for which a location (country) and a date (year and month) were available were retained, which allowed us to classify strains into yearly winter epidemics. Because of the temporal lag between influenza epidemics in the two hemispheres, and given the fact that most available sequences were sampled in the northern hemisphere, we restricted our analysis to strains from the northern hemisphere (latitudes above 23.4°north). The final dataset included 1903 strains characterized by 125 SNPs which resulted in a total of 334 alleles. All strains from 2001 to 2007 were classified into six winter epidemics (2001-2006). This was done by assigning all strains from the second half of the year with those from the first half of the following year. For example, the 2005 winter epidemic comprises all strains collected between the 1st of July 2005 and the 30th of June 2006.
dfsnp = CSV.read("h3n2-snp.csv")
first(dfsnp, 5)
strain | s6a | s6c | s6g | s17a | s17g | s17t | s39a | s39c | |
---|---|---|---|---|---|---|---|---|---|
String | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | AB434107 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | AB434108 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | CY000113 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | CY000209 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 | CY000217 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
nrow(dfsnp)
1642
A = reduce(hcat, [dfsnp[:,col] for col in names(dfsnp)[2:end]])
1642×317 Array{Float64,2}: 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 ⋮ ⋮ ⋱ ⋮ ⋮ 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 … 0.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 … 0.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0
U, S, V = mmids_svd(A, 2);
scatter(U[:,1].*S[1], U[:,2].*S[2], legend = false)
dfoth = CSV.read("h3n2-other.csv")
first(dfoth, 5)
strain | length | country | year | lon | lat | date | |
---|---|---|---|---|---|---|---|
String | Int64 | String | Int64 | Float64 | Float64 | String | |
1 | AB434107 | 1701 | Japan | 2002 | 137.215 | 35.5842 | 2002/02/25 |
2 | AB434108 | 1701 | Japan | 2002 | 137.215 | 35.5842 | 2002/03/01 |
3 | CY000113 | 1762 | USA | 2002 | -73.94 | 40.67 | 2002/01/29 |
4 | CY000209 | 1760 | USA | 2002 | -73.94 | 40.67 | 2002/01/17 |
5 | CY000217 | 1760 | USA | 2002 | -73.94 | 40.67 | 2002/02/26 |
year = dfoth[:,:year];
scatter(U[:,1].*S[1], U[:,2].*S[2], marker_z=year, legend=false,
colorbar=:right, clims=(2001,2006), seriescolor=:lightrainbow)