Course: Math 535 - Mathematical Methods in Data Science (MMiDS)
Author: Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison
Updated: Sep 24, 2020
Copyright: © 2020 Sebastien Roch
The setup

We are given a data matrix $A \in \mathbb{R}^{n \times m}$ and we seek to project the data to a lower-dimensional space by computing a low-rank approximation of $A$. Truncating the singular value decomposition (SVD) provides a solution to this problem. The SVD of $A$ is a decomposition of the form:
$$ A = U \Sigma V^T = \sum_{j=1}^r \sigma_j \mathbf{u}_j \mathbf{v}_j^T $$

where the columns of $U \in \mathbb{R}^{n \times r}$ and those of $V \in \mathbb{R}^{m \times r}$ are orthonormal, and $\Sigma \in \mathbb{R}^{r \times r}$ is a diagonal matrix. Here the $\mathbf{u}_j$'s are the columns of $U$ and are referred to as left singular vectors. Similarly, the $\mathbf{v}_j$'s are the columns of $V$ and are referred to as right singular vectors. The $\sigma_j$'s, which are positive and in nonincreasing order

$$ \sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0 $$

are the diagonal elements of $\Sigma$ and are referred to as singular values. For $k < r$, the rank-$k$ truncation of the SVD is obtained by keeping only the $k$ largest singular values:

$$ A_k = \sum_{j=1}^k \sigma_j \mathbf{u}_j \mathbf{v}_j^T. $$
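As a quick illustration of these objects, here is a sketch using Julia's built-in `svd` from `LinearAlgebra` on a small made-up matrix: the columns of $U$ are orthonormal, the factors reconstruct $A$, and keeping the top $k$ singular values gives the truncation.

```julia
using LinearAlgebra

# Small hypothetical matrix for illustration
A = [2.0 0.0 1.0; 0.0 1.0 0.0; 1.0 0.0 2.0; 0.0 0.0 0.0]

F = svd(A)   # F.U, F.S (nonincreasing), F.V

# Columns of U are orthonormal, and the factors reconstruct A
r = length(F.S)
F.U' * F.U ≈ Matrix(I, r, r)     # true
F.U * Diagonal(F.S) * F.V' ≈ A   # true

# Rank-k truncation: keep only the k largest singular values
k = 1
Ak = F.U[:, 1:k] * Diagonal(F.S[1:k]) * F.V[:, 1:k]'
```

Note that `svd` returns the thin SVD, which is what the sum above corresponds to.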
The theory

We established the following results. First, recall some definitions. Define $\mathbb{S}^{m-1} = \{\mathbf{x} \in \mathbb{R}^m\,:\,\|\mathbf{x}\| = 1\}$.
Definition (Frobenius Norm): The Frobenius norm of a matrix $A \in \mathbb{R}^{n \times m}$ is defined as
$$ \|A\|_F = \sqrt{\sum_{i=1}^n \sum_{j=1}^m a_{ij}^2}. $$

$\lhd$
Definition (Induced Norm): The $2$-norm of a matrix $A \in \mathbb{R}^{n \times m}$ is
$$ \|A\|_2 = \max_{\mathbf{0} \neq \mathbf{x} \in \mathbb{R}^m} \frac{\|A \mathbf{x}\|}{\|\mathbf{x}\|} = \max_{\mathbf{x} \in \mathbb{S}^{m-1}} \|A \mathbf{x}\|. $$

$\lhd$
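In Julia, these two norms correspond to `norm` (Frobenius, when applied to a matrix) and `opnorm` (induced $2$-norm, which equals the largest singular value). A quick check on a hypothetical diagonal matrix:

```julia
using LinearAlgebra

A = [1.0 0.0; 0.0 2.0]   # hypothetical example; singular values are 2 and 1

norm(A)        # Frobenius norm: sqrt(1^2 + 2^2) = sqrt(5)
opnorm(A, 2)   # induced 2-norm: the largest singular value, i.e., 2
```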
Theorem (Low-Rank Approximation in the Induced Norm): Let $A \in \mathbb{R}^{n \times m}$ be a matrix with SVD
$$ A = \sum_{j=1}^r \sigma_j \mathbf{u}_j \mathbf{v}_j^T $$

and let $A_k$ be the truncation defined above with $k < r$. For any matrix $B \in \mathbb{R}^{n \times m}$ of rank at most $k$,
$$ \|A - A_k\|_2 \leq \|A - B\|_2. $$

Finally, we justified the use of this projection by the following inequality.
Theorem (Why Project): Let $A \in \mathbb{R}^{n \times d}$ be a matrix and let $A_k$ be the truncation above. For any matrix $C \in \mathbb{R}^{n \times d}$ of rank $\leq k$,
$$ \|A_k - C\|_F^2 \leq 8 k \|A - C\|_2^2. $$

The algorithm

We used the Power Iteration to compute the top singular vectors.
Lemma (Power Iteration): Let $A \in \mathbb{R}^{n\times m}$ be a matrix with $m \leq n$. Let $U \Sigma V^T$ be an SVD of $A$ such that $\sigma_1 > \sigma_2$. Define $B = A^T A$ and assume that $\mathbf{x} \in \mathbb{R}^m$ is a vector satisfying $\langle \mathbf{v}_1, \mathbf{x} \rangle > 0$. Then
$$ \frac{B^{2k} \mathbf{x}}{\|B^{2k} \mathbf{x}\|} \to \mathbf{v}_1 $$

as $k \to +\infty$.
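A minimal numerical check of the lemma, on a hypothetical matrix chosen so that $\sigma_1 > \sigma_2$: repeatedly applying $B = A^T A$ and normalizing drives a suitable starting vector toward $\mathbf{v}_1$. Normalizing at every step, as below, gives the same limit as normalizing $B^{2k}\mathbf{x}$ once, while avoiding overflow.

```julia
using LinearAlgebra

# Power iteration on B = A'A; normalize at each step for numerical stability
function top_right_singular_vector(A, x; iters=100)
    B = A' * A
    for _ in 1:iters
        x = B * x
        x /= norm(x)
    end
    return x
end

A = [3.0 0.0; 0.0 1.0; 1.0 1.0]   # hypothetical matrix with σ1 > σ2
x = top_right_singular_vector(A, [1.0, 0.5])

v1 = svd(A).V[:, 1]
abs(dot(x, v1)) ≈ 1.0   # x aligns with v1 (up to sign)
```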
The algorithm is below.
# Julia version: 1.5.1
using DataFrames, CSV, Plots, LinearAlgebra, SparseArrays, Statistics, Distances
# Classical Gram-Schmidt: returns Q with orthonormal columns and
# upper-triangular R such that A = QR
function mmids_gramschmidt(A)
    n, m = size(A)
    Q = zeros(Float64, n, m)
    R = zeros(Float64, m, m)
    for j = 1:m
        v = A[:,j]
        for k = 1:j-1
            R[k,j] = dot(Q[:,k], A[:,j])   # projection coefficient
            v -= R[k,j]*Q[:,k]             # remove component along Q[:,k]
        end
        R[j,j] = norm(v)
        Q[:,j] = v/R[j,j]                  # normalize
    end
    return Q, R
end
# Top-r SVD by orthogonal iteration on B = A'A
function mmids_svd(A, r; maxiter=1000)
    V = randn(size(A,2), r)  # random initialization
    for _ = 1:maxiter
        W = A * V
        Z = A' * W                      # one application of B = A'A
        V, R = mmids_gramschmidt(Z)     # re-orthonormalize the columns
    end
    W = A * V
    S = [norm(W[:,i]) for i=1:size(W,2)]               # singular values
    U = reduce(hcat, [W[:,i]/S[i] for i=1:size(W,2)])  # left singular vectors
    return U, S, V
end
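As a sanity check, the same orthogonal-iteration scheme can be condensed using the built-in QR factorization for the orthonormalization step, and run on a matrix with known singular values. This is a sketch for verification purposes; `topk_svd` is not part of the course code.

```julia
using LinearAlgebra

# Orthogonal iteration on B = A'A, with qr replacing Gram-Schmidt
function topk_svd(A, r; maxiter=1000)
    V = randn(size(A, 2), r)      # random initialization
    for _ in 1:maxiter
        Z = A' * (A * V)          # one application of B = A'A
        V = Matrix(qr(Z).Q)       # re-orthonormalize the columns
    end
    W = A * V
    S = [norm(W[:, i]) for i in 1:r]
    U = reduce(hcat, [W[:, i] / S[i] for i in 1:r])
    return U, S, V
end

# Matrix with known singular values 5, 3, 1
A = [5.0 0.0 0.0; 0.0 3.0 0.0; 0.0 0.0 1.0; 0.0 0.0 0.0]
U, S, V = topk_svd(A, 2)
S   # ≈ [5.0, 3.0]
```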
In this notebook, we give two concrete examples of high-dimensional datasets where projecting to the first few singular vectors appears to preserve much of the structure.
We consider an application of dimensionality reduction in biology. We will look at SNP data from viruses. A little background first. From Wikipedia:
A single-nucleotide polymorphism (SNP; /snɪp/; plural /snɪps/) is a substitution of a single nucleotide that occurs at a specific position in the genome, where each variation is present at a level of more than 1% in the population. For example, at a specific base position in the human genome, the C nucleotide may appear in most individuals, but in a minority of individuals, the position is occupied by an A. This means that there is a SNP at this specific position, and the two possible nucleotide variations – C or A – are said to be the alleles for this specific position.
Quoting Jombart et al., BMC Genetics (2010), we analyze:
the population structure of seasonal influenza A/H3N2 viruses using hemagglutinin (HA) sequences. Changes in the HA gene are largely responsible for immune escape of the virus (antigenic shift), and allow seasonal influenza to persist by mounting yearly epidemics peaking in winter. These genetic changes also force influenza vaccines to be updated on a yearly basis. [...] Assessing the genetic evolution of a pathogen through successive epidemics is of considerable epidemiological interest. In the case of seasonal influenza, we would like to ascertain how genetic changes accumulate among strains from one winter epidemic to the next.
Some details about the Jombart et al. dataset:
For this purpose, we retrieved all sequences of H3N2 hemagglutinin (HA) collected between 2001 and 2007 available from Genbank. Only sequences for which a location (country) and a date (year and month) were available were retained, which allowed us to classify strains into yearly winter epidemics. Because of the temporal lag between influenza epidemics in the two hemispheres, and given the fact that most available sequences were sampled in the northern hemisphere, we restricted our analysis to strains from the northern hemisphere (latitudes above 23.4°north). The final dataset included 1903 strains characterized by 125 SNPs which resulted in a total of 334 alleles. All strains from 2001 to 2007 were classified into six winter epidemics (2001-2006). This was done by assigning all strains from the second half of the year with those from the first half of the following year. For example, the 2005 winter epidemic comprises all strains collected between the 1st of July 2005 and the 30th of June 2006.
We first load the data. It contains a subset of $1642$ strains from the dataset above.
dfsnp = CSV.read("h3n2-snp.csv", DataFrame)
first(dfsnp, 5)
nrow(dfsnp)
We extract a data matrix, run our SVD algorithm with $k=2$, and plot the data in the projected subspace of the first two singular vectors.
A = reduce(hcat, [dfsnp[:,col] for col in names(dfsnp)[2:end]])
U, S, V = mmids_svd(A, 2);
scatter(U[:,1].*S[1], U[:,2].*S[2], legend = false)
There seem to be some reasonably well-defined clusters in this projection. To further reveal the structure, we color the data points by year. That information is in a separate file.
dfoth = CSV.read("h3n2-other.csv", DataFrame)
first(dfoth, 5)
year = dfoth[:,:year];
scatter(U[:,1].*S[1], U[:,2].*S[2], marker_z=year, legend=false,
colorbar=:right, clims=(2001,2006), seriescolor=:lightrainbow)
To some extent, one can "see" the virus evolving from year to year.
One last note: in this case better results can be obtained using other related dimensionality reduction methods.
Next, we return to the MovieLens dataset from the first notebook of this topic.
dfm = CSV.read("movielens-small-movies.csv", DataFrame)
first(dfm, 5)
dfr = CSV.read("movielens-small-ratings.csv", DataFrame)
first(dfr, 5)
Here's a summary of the data. There are $9742$ movies and $100836$ ratings made by $610$ users.
nrow(dfr)
describe(dfm)
describe(dfr)
We will convert the ratings data into a matrix.
We construct a matrix whose rows are the movies and whose columns are the users. Because this matrix contains many zeros, which we use to encode the absence of a rating, we first define it as a sparse array (that is, we only list the non-zero entries).
A = sparse(dfr[:,:movieId],dfr[:,:userId],dfr[:,:rating])
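The call above builds the matrix from (movieId, userId, rating) triples. A toy illustration of how `sparse` behaves, with made-up numbers:

```julia
using SparseArrays

# Made-up (row, column, value) triples, mimicking (movieId, userId, rating)
movie  = [1, 2, 2, 3]
user   = [1, 1, 2, 2]
rating = [4.5, 3.0, 5.0, 2.0]

R = sparse(movie, user, rating)   # 3x2 sparse matrix
nnz(R)                            # 4: only the listed entries are stored
R[3, 1]                           # 0.0: absent ratings read as zero
```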
Next, we extract the movie titles.
movie_titles = dfm[:,:title]
Before running our SVD algorithm, we preprocess our data. We first remove all movies without ratings. We also replace all missing ratings with the average rating of the corresponding movie. Finally, we center the ratings by subtracting the mean rating of each user.
## PRE-PROCESSING of A
## OUTPUT: Z, movie_titles_Z, mZ
# restrict to rows with at least one rating
nnz_movies = findall(s -> s > 0, vec(sum(A,dims=2)))
Z = copy(Array{Float64,2}(A[nnz_movies,:]))
# for each missing rating, replace with mean rating of corresponding movie
for i = 1:size(Z,1)
    nnz_ratings_for_i = findall(r -> r > 0, Z[i,:])
    z_ratings_for_i = findall(r -> r == 0, Z[i,:])
    Z[i,z_ratings_for_i] .= mean(Z[i,nnz_ratings_for_i])
end
# for each user, center ratings
mZ = zeros(Float64, size(Z,2))
for j = 1:size(Z,2)
    mZ[j] = mean(Z[:,j])
    Z[:,j] .-= mZ[j]
end
# update movie titles
movie_titles_Z = movie_titles[nnz_movies];
We use our SVD algorithm to compute the top 10 singular vectors and plot the corresponding singular values.
U, S, V = mmids_svd(Z, 10);
plot(log.(S), legend=false)
We notice that the singular values drop quickly, suggesting that the data can be well approximated by a low-dimensional projection.
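One way to quantify this drop is the fraction of squared singular-value mass captured by the first $k$ directions. A sketch with hypothetical singular values exhibiting a similar fast decay (the notebook's actual values come from the data above):

```julia
# Hypothetical fast-decaying singular values, for illustration only
svals = [100.0, 40.0, 15.0, 8.0, 5.0, 3.0, 2.0, 1.5, 1.0, 0.8]

# Cumulative fraction of squared singular-value mass
energy = cumsum(svals .^ 2) ./ sum(svals .^ 2)
energy[2]   # already above 0.95 here
```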
We construct predicted ratings for each user and each movie based on the projected data. To do so, we add back the average rating for each user.
k = 10
Apred = U[:,1:k]*Diagonal(S[1:k])*V[:,1:k]'
# for each user, de-center the ratings
for j = 1:size(Z,2)
    Apred[:,j] .+= mZ[j]
end
Apred
Compare to the original dataset.
Array(A[nnz_movies,:])