TUTORIAL 2a

Dimensionality reduction


Course: Math 535 - Mathematical Methods in Data Science (MMiDS)
Author: Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison
Updated: Sep 24, 2020
Copyright: © 2020 Sebastien Roch


Recap from the lectures

The setup We are given a data matrix $A \in \mathbb{R}^{n \times d}$ and we seek to project the data onto a lower-dimensional space by computing a low-rank approximation of $A$. Truncating the singular value decomposition (SVD) provides a solution to this problem. The SVD of $A$ is a decomposition of the form:

$$ A = U \Sigma V^T = \sum_{j=1}^r \sigma_j \mathbf{u}_j \mathbf{v}_j^T $$

where the columns of $U \in \mathbb{R}^{n \times r}$ and those of $V \in \mathbb{R}^{d \times r}$ are orthonormal, and $\Sigma \in \mathbb{R}^{r \times r}$ is a diagonal matrix. Here the $\mathbf{u}_j$'s are the columns of $U$ and are referred to as left singular vectors. Similarly, the $\mathbf{v}_j$'s are the columns of $V$ and are referred to as right singular vectors. The $\sigma_j$'s, which are positive and in nonincreasing order

$$ \sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0 $$

are the diagonal elements of $\Sigma$ and are referred to as singular values. For $k \leq r$, the rank-$k$ truncation of the SVD, obtained by keeping only the $k$ largest singular values, is $A_k = \sum_{j=1}^k \sigma_j \mathbf{u}_j \mathbf{v}_j^T$.
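
As a quick numerical illustration (not part of the original notebook), the decomposition and its truncation can be checked on a small random matrix using the built-in svd from LinearAlgebra:

using LinearAlgebra

A = randn(6, 4)
F = svd(A)   # F.U, F.S, F.V with A ≈ F.U * Diagonal(F.S) * F.V'
@show A ≈ sum(F.S[j] * F.U[:,j] * F.V[:,j]' for j in 1:length(F.S))
k = 2
Ak = F.U[:,1:k] * Diagonal(F.S[1:k]) * F.V[:,1:k]'   # rank-k truncation A_k
@show rank(Ak) == k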

The theory We established the following results. First, recall the following definitions, where $\mathbb{S}^{m-1} = \{\mathbf{x} \in \mathbb{R}^m\,:\,\|\mathbf{x}\| = 1\}$ denotes the unit sphere.

Definition (Frobenius Norm): The Frobenius norm of an $n \times m$ matrix $A \in \mathbb{R}^{n \times m}$ is defined as

$$ \|A\|_F = \sqrt{\sum_{i=1}^n \sum_{j=1}^m a_{ij}^2}. $$

$\lhd$

Definition (Induced Norm): The $2$-norm of a matrix $A \in \mathbb{R}^{n \times m}$ is

$$ \|A\|_2 = \max_{\mathbf{0} \neq \mathbf{x} \in \mathbb{R}^m} \frac{\|A \mathbf{x}\|}{\|\mathbf{x}\|} = \max_{\mathbf{x} \in \mathbb{S}^{m-1}} \|A \mathbf{x}\|. $$

$\lhd$
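
For a concrete check of these two definitions (an illustration, not part of the original notebook), they can be compared with the corresponding LinearAlgebra functions:

using LinearAlgebra

A = [1.0 2.0; 3.0 4.0]
@show sqrt(sum(A.^2)) ≈ norm(A)           # Frobenius norm; norm(A) on a matrix is the Frobenius norm
@show opnorm(A, 2) ≈ maximum(svdvals(A))  # induced 2-norm equals the largest singular value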


Theorem (Low-Rank Approximation in the Induced Norm): Let $A \in \mathbb{R}^{n \times m}$ be a matrix with SVD

$$ A = \sum_{j=1}^r \sigma_j \mathbf{u}_j \mathbf{v}_j^T $$

and let $A_k$ be the truncation defined above with $k < r$. For any matrix $B \in \mathbb{R}^{n \times m}$ of rank at most $k$,

$$ \|A - A_k\|_2 \leq \|A - B\|_2. $$
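
The theorem can be probed numerically on a small example (a sketch using the built-in SVD rather than the tutorial's own routine; it also exhibits the fact that $\|A - A_k\|_2 = \sigma_{k+1}$):

using LinearAlgebra

A = randn(6, 4)
F = svd(A)
k = 2
Ak = F.U[:,1:k] * Diagonal(F.S[1:k]) * F.V[:,1:k]'
B = randn(6, k) * randn(k, 4)        # an arbitrary matrix of rank at most k
@show opnorm(A - Ak) ≈ F.S[k+1]      # error of the truncation is σ_{k+1}
@show opnorm(A - Ak) <= opnorm(A - B)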

Finally, we justified the use of this projection by the following inequality.


Theorem (Why Project): Let $A \in \mathbb{R}^{n \times d}$ be a matrix and let $A_k$ be the truncation above. For any matrix $C \in \mathbb{R}^{n \times d}$ of rank $\leq k$,

$$ \|A_k - C\|_F^2 \leq 8 k \|A - C\|_2^2. $$
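
This inequality can likewise be checked numerically (a sketch; C below is an arbitrary matrix of rank at most k):

using LinearAlgebra

A = randn(8, 5)
k = 2
F = svd(A)
Ak = F.U[:,1:k] * Diagonal(F.S[1:k]) * F.V[:,1:k]'
C = randn(8, k) * randn(k, 5)        # an arbitrary matrix of rank at most k
@show norm(Ak - C)^2 <= 8 * k * opnorm(A - C)^2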

The algorithm We used the Power Iteration to compute the top singular vectors.


Lemma (Power Iteration): Let $A \in \mathbb{R}^{n\times m}$ be a matrix with $m \leq n$. Let $U \Sigma V^T$ be an SVD of $A$ such that $\sigma_1 > \sigma_2$. Define $B = A^T A$ and assume that $\mathbf{x} \in \mathbb{R}^m$ is a vector satisfying $\langle \mathbf{v}_1, \mathbf{x} \rangle > 0$. Then

$$ \frac{B^{2k} \mathbf{x}}{\|B^{2k} \mathbf{x}\|} \to \mathbf{v}_1 $$

as $k \to +\infty$.
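
A minimal sketch of the lemma in code (single-vector power iteration on $B = A^T A$; power_iteration is a hypothetical helper name used only for this illustration, and the random start has nonzero overlap with $\mathbf{v}_1$ almost surely):

using LinearAlgebra

# Single-vector power iteration on B = A'A
function power_iteration(A; iters=100)
    B = A' * A
    x = randn(size(A, 2))        # random start
    for _ in 1:iters
        x = B * x
        x = x / norm(x)          # renormalize to avoid overflow
    end
    return x
end

A = randn(10, 4)
x = power_iteration(A)
@show abs(dot(x, svd(A).V[:, 1]))  # should be very close to 1: x aligns with ±v₁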


The algorithm is below.

In [2]:
function mmids_svd(A, r; maxiter=1000)
    # Orthogonal (block power) iteration on A'A to compute the top r singular triplets.
    # Requires `norm` from LinearAlgebra and `mmids_gramschmidt` from a previous tutorial.
    V = randn(size(A,2),r) # random initialization
    for _ = 1:maxiter
        W = A * V
        Z = A' * W # Z = (A'A) V
        V, R = mmids_gramschmidt(Z) # re-orthonormalize the columns
    end
    W = A * V # columns are approximately σ_i u_i
    S = [norm(W[:, i]) for i=1:size(W,2)] # singular values
    U = reduce(hcat,[W[:,i]/S[i] for i=1:size(W,2)]) # left singular vectors
    return U, S, V
end
Out[2]:
mmids_svd (generic function with 1 method)
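
The routine above relies on mmids_gramschmidt from an earlier tutorial, which returns a matrix with orthonormal columns together with the corresponding upper-triangular factor. A minimal sketch of such a routine (an assumption about its exact implementation) is:

using LinearAlgebra

# Classical Gram-Schmidt QR: returns Q with orthonormal columns and upper-triangular R
# (assumes A has full column rank)
function mmids_gramschmidt(A)
    n, m = size(A)
    Q = zeros(n, m)
    R = zeros(m, m)
    for j = 1:m
        v = A[:, j]
        for i = 1:j-1
            R[i, j] = dot(Q[:, i], A[:, j])
            v -= R[i, j] * Q[:, i]
        end
        R[j, j] = norm(v)
        Q[:, j] = v / R[j, j]
    end
    return Q, R
end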

1 Exploring viral evolution using genetic data

We consider an application of dimensionality reduction in biology. We will look at SNP data from viruses. A little background first. From Wikipedia:

A single-nucleotide polymorphism (SNP; /snɪp/; plural /snɪps/) is a substitution of a single nucleotide that occurs at a specific position in the genome, where each variation is present at a level of more than 1% in the population. For example, at a specific base position in the human genome, the C nucleotide may appear in most individuals, but in a minority of individuals, the position is occupied by an A. This means that there is a SNP at this specific position, and the two possible nucleotide variations – C or A – are said to be the alleles for this specific position.

Quoting Jombart et al., BMC Genetics (2010), we analyze:

the population structure of seasonal influenza A/H3N2 viruses using hemagglutinin (HA) sequences. Changes in the HA gene are largely responsible for immune escape of the virus (antigenic shift), and allow seasonal influenza to persist by mounting yearly epidemics peaking in winter. These genetic changes also force influenza vaccines to be updated on a yearly basis. [...] Assessing the genetic evolution of a pathogen through successive epidemics is of considerable epidemiological interest. In the case of seasonal influenza, we would like to ascertain how genetic changes accumulate among strains from one winter epidemic to the next.

Some details about the Jombart et al. dataset:

For this purpose, we retrieved all sequences of H3N2 hemagglutinin (HA) collected between 2001 and 2007 available from Genbank. Only sequences for which a location (country) and a date (year and month) were available were retained, which allowed us to classify strains into yearly winter epidemics. Because of the temporal lag between influenza epidemics in the two hemispheres, and given the fact that most available sequences were sampled in the northern hemisphere, we restricted our analysis to strains from the northern hemisphere (latitudes above 23.4°north). The final dataset included 1903 strains characterized by 125 SNPs which resulted in a total of 334 alleles. All strains from 2001 to 2007 were classified into six winter epidemics (2001-2006). This was done by assigning all strains from the second half of the year with those from the first half of the following year. For example, the 2005 winter epidemic comprises all strains collected between the 1st of July 2005 and the 30th of June 2006.

In [3]:
dfsnp = CSV.read("h3n2-snp.csv")
first(dfsnp, 5)
Out[3]:

5 rows × 318 columns (omitted printing of 309 columns)

   strain    s6a     s6c     s6g     s17a    s17g    s17t    s39a    s39c
   String    Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64
1  AB434107  1.0     0.0     0.0     1.0     0.0     0.0     0.0     0.0
2  AB434108  1.0     0.0     0.0     1.0     0.0     0.0     0.0     0.0
3  CY000113  1.0     0.0     0.0     1.0     0.0     0.0     0.0     0.0
4  CY000209  1.0     0.0     0.0     1.0     0.0     0.0     0.0     0.0
5  CY000217  1.0     0.0     0.0     1.0     0.0     0.0     0.0     0.0
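
Note that the call above relies on CSV.jl returning a DataFrame by default; on more recent versions of CSV.jl the sink must be passed explicitly (this assumes CSV and DataFrames are loaded):

using CSV, DataFrames
dfsnp = CSV.read("h3n2-snp.csv", DataFrame)
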
In [4]:
nrow(dfsnp)
Out[4]:
1642
In [5]:
A = reduce(hcat, [dfsnp[:,col] for col in names(dfsnp)[2:end]])
Out[5]:
1642×317 Array{Float64,2}:
 1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  1.0  0.0  1.0  0.0
 1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  1.0  0.0  1.0  0.0
 1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  1.0  0.0  1.0  0.0
 1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  1.0  0.0  1.0  0.0
 1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  1.0  0.0  1.0  0.0
 1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  1.0  0.0  1.0  0.0
 1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  1.0  0.0  1.0  0.0
 1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  1.0  0.0  1.0  0.0
 1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  1.0  0.0  1.0  0.0
 1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  1.0  0.0  1.0  0.0
 1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  1.0  0.0  1.0  0.0
 1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  1.0  0.0  1.0  0.0
 1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  1.0  0.0  1.0  0.0
 ⋮                        ⋮              ⋱  ⋮                        ⋮    
 1.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  …  0.0  0.0  0.0  1.0  0.0  1.0  0.0
 1.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0     0.0  0.0  0.0  1.0  0.0  1.0  0.0
 1.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0     0.0  0.0  0.0  1.0  0.0  1.0  0.0
 1.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0     0.0  0.0  0.0  1.0  0.0  1.0  0.0
 1.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0     0.0  0.0  0.0  1.0  0.0  1.0  0.0
 1.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  …  0.0  0.0  0.0  1.0  0.0  1.0  0.0
 1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  1.0  0.0  1.0  0.0
 1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  1.0  0.0  1.0  0.0
 1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  1.0  0.0  1.0  0.0
 1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  1.0  0.0  1.0  0.0
 1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  1.0  0.0  1.0  0.0
 1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  1.0  0.0  1.0  0.0
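
An equivalent and more direct way to build the same matrix (using the standard DataFrames Matrix constructor; not part of the original notebook) is:

A = Matrix(dfsnp[:, 2:end])
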
In [6]:
U, S, V = mmids_svd(A, 2);
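
As a sanity check (not in the original notebook), the values in S returned by mmids_svd can be compared against the top two singular values from LinearAlgebra's exact SVD:

using LinearAlgebra
@show S
@show svdvals(A)[1:2]   # should closely match S
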
In [7]:
scatter(U[:,1].*S[1], U[:,2].*S[2], legend = false)
Out[7]:
[Scatter plot of the strains projected onto the top two singular directions.]
In [8]:
dfoth = CSV.read("h3n2-other.csv")
first(dfoth, 5)
Out[8]:

5 rows × 7 columns

   strain    length  country  year  lon      lat      date
   String    Int64   String   Int64 Float64  Float64  String
1  AB434107  1701    Japan    2002  137.215  35.5842  2002/02/25
2  AB434108  1701    Japan    2002  137.215  35.5842  2002/03/01
3  CY000113  1762    USA      2002  -73.94   40.67    2002/01/29
4  CY000209  1760    USA      2002  -73.94   40.67    2002/01/17
5  CY000217  1760    USA      2002  -73.94   40.67    2002/02/26
In [9]:
year = dfoth[:,:year];
In [10]:
scatter(U[:,1].*S[1], U[:,2].*S[2], marker_z=year, legend=false,
    colorbar=:right, clims=(2001,2006), seriescolor=:lightrainbow)
Out[10]:
[Same scatter plot, with points colored by year of collection (2001-2006).]