# Dimensionality reduction¶

Course: Math 535 - Mathematical Methods in Data Science (MMiDS)
Author: Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison
Updated: Sep 24, 2020

## Recap from the lectures¶

The setup We are given a data matrix $A \in \mathbb{R}^{n \times d}$ and we want seek to project this data to a lower-dimensional space by computing a low-rank approximation of $A$. Truncating the SVD provides a solution to this problem. The SVD of $A$ is a decomposition of the form:

$$A = U \Sigma V^T = \sum_{j=1}^r \sigma_j \mathbf{u}_j \mathbf{v}_j^T$$

where the columns of $U \in \mathbb{R}^{n \times r}$ and those of $V \in \mathbb{R}^{m \times r}$ are orthonormal, and $\Sigma \in \mathbb{R}^{r \times r}$ is a diagonal matrix. Here the $\mathbf{u}_j$'s are the columns of $U$ and are referred to as left singular vectors. Similarly the $\mathbf{v}_j$'s are the columns of $V$ and are referred to as right singular vectors. The $\sigma_j$'s, which are non-negative and in decreasing order

$$\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0$$

are the diagonal elements of $\Sigma$ and are referred to as singular values.

The theory Indeed, we established the following. First, recall the following definitions. Define $\mathbb{S}^{m-1} = \{\mathbf{x} \in \mathbb{R}^m\,:\,\|\mathbf{x}\| = 1\}$.

Definition (Frobenius Norm): The Frobenius norm of an $n \times m$ matrix $A \in \mathbb{R}^{n \times m}$ is defined as

$$\|A\|_F = \sqrt{\sum_{i=1}^n \sum_{j=1}^m a_{ij}^2}.$$

$\lhd$

Definition (Induced Norm): The $2$-norm of a matrix $A \in \mathbb{R}^{n \times m}$ is

$$\|A\|_2 = \max_{\mathbf{0} \neq \mathbf{x} \in \mathbb{R}^m} \frac{\|A \mathbf{x}\|}{\|\mathbf{x}\|} = \max_{\mathbf{x} \in \mathbb{S}^{m-1}} \|A \mathbf{x}\|.$$

$\lhd$

Theorem (Low-Rank Approximation in the Induced Norm): Let $A \in \mathbb{R}^{n \times m}$ be a matrix with SVD

$$A = \sum_{j=1}^r \sigma_j \mathbf{u}_j \mathbf{v}_j^T$$

and let $A_k$ be the truncation defined above with $k < r$. For any matrix $B \in \mathbb{R}^{n \times m}$ of rank at most $k$,

$$\|A - A_k\|_2 \leq \|A - B\|_2.$$

Finally, we justified the use of this projection by the following inequality.

Theorem (Why Project): Let $A \in \mathbb{A}^{n \times d}$ be a matrix and let $A_k$ be the truncation above. For any matrix $C \in \mathbb{R}^{n \times d}$ of rank $\leq k$,

$$\|A_k - C\|_F^2 \leq 8 k \|A - C\|_2^2.$$

The algorithm We used the Power Iteration to compute the top singular vectors.

Lemma (Power Iteration): Let $A \in \mathbb{R}^{n\times m}$ be a matrix with $m \leq n$. Let $U \Sigma V^T$ be an SVD of $A$ such that $\sigma_1 > \sigma_2$. Define $B = A^T A$ and assume that $\mathbf{x} \in \mathbb{R}^m$ is a vector satisfying $\langle \mathbf{v}_1, \mathbf{x} \rangle > 0$. Then

$$\frac{B^{2k} \mathbf{x}}{\|B^{2k} \mathbf{x}\|} \to \mathbf{v}_1$$

as $k \to +\infty$.

The algorithm is below.

In [1]:
# Julia version: 1.5.1
using DataFrames, CSV, Plots, LinearAlgebra, SparseArrays, Statistics, Distances

function mmids_gramschmidt(A)
n, m = size(A)
Q = zeros(Float64, n, m)
R = zeros(Float64, m, m)
for j = 1:m
v = A[:,j]
for k = 1:j-1
R[k,j] = dot(Q[:,k],A[:,j])
v -= R[k,j]*Q[:,k]
end
R[j,j] = norm(v)
Q[:,j] = v/R[j,j]
end
return Q, R
end

Out[1]:
mmids_gramschmidt (generic function with 1 method)
In [2]:
function mmids_svd(A, r; maxiter=1000)
V = randn(size(A,2),r) # random initialization
for _ = 1:maxiter
W = A * V
Z = A' * W
V, R = mmids_gramschmidt(Z)
end
W = A * V
S = [norm(W[:, i]) for i=1:size(W,2)] # singular values
U = reduce(hcat,[W[:,i]/S[i] for i=1:size(W,2)]) # left singular vectors
return U, S, V
end

Out[2]:
mmids_svd (generic function with 1 method)

In this notebook, we give two concrete examples of high-dimensional datasets where projecting to the first few singular vectors appears to preserve much of the structure.

## 1 Exploring viral evolution using genetic data¶

We consider an application of dimensionality reduction in biology. We will look at SNP data from viruses. A little background first. From Wikipedia:

A single-nucleotide polymorphism (SNP; /snɪp/; plural /snɪps/) is a substitution of a single nucleotide that occurs at a specific position in the genome, where each variation is present at a level of more than 1% in the population. For example, at a specific base position in the human genome, the C nucleotide may appear in most individuals, but in a minority of individuals, the position is occupied by an A. This means that there is a SNP at this specific position, and the two possible nucleotide variations – C or A – are said to be the alleles for this specific position.

Quoting Jombart et al., BMC Genetics (2010), we analyze:

the population structure of seasonal influenza A/H3N2 viruses using hemagglutinin (HA) sequences. Changes in the HA gene are largely responsible for immune escape of the virus (antigenic shift), and allow seasonal influenza to persist by mounting yearly epidemics peaking in winter. These genetic changes also force influenza vaccines to be updated on a yearly basis. [...] Assessing the genetic evolution of a pathogen through successive epidemics is of considerable epidemiological interest. In the case of seasonal influenza, we would like to ascertain how genetic changes accumulate among strains from one winter epidemic to the next.

Some details about the Jombart et al. dataset:

For this purpose, we retrieved all sequences of H3N2 hemagglutinin (HA) collected between 2001 and 2007 available from Genbank. Only sequences for which a location (country) and a date (year and month) were available were retained, which allowed us to classify strains into yearly winter epidemics. Because of the temporal lag between influenza epidemics in the two hemispheres, and given the fact that most available sequences were sampled in the northern hemisphere, we restricted our analysis to strains from the northern hemisphere (latitudes above 23.4°north). The final dataset included 1903 strains characterized by 125 SNPs which resulted in a total of 334 alleles. All strains from 2001 to 2007 were classified into six winter epidemics (2001-2006). This was done by assigning all strains from the second half of the year with those from the first half of the following year. For example, the 2005 winter epidemic comprises all strains collected between the 1st of July 2005 and the 30th of June 2006.

We first load the data. It contains a subset of $1642$ strains from the dataset above.

In [3]:
dfsnp = CSV.read("h3n2-snp.csv")
first(dfsnp, 5)

Out[3]:

5 rows × 318 columns (omitted printing of 309 columns)

strains6as6cs6gs17as17gs17ts39as39c
StringFloat64Float64Float64Float64Float64Float64Float64Float64
1AB4341071.00.00.01.00.00.00.00.0
2AB4341081.00.00.01.00.00.00.00.0
3CY0001131.00.00.01.00.00.00.00.0
4CY0002091.00.00.01.00.00.00.00.0
5CY0002171.00.00.01.00.00.00.00.0
In [4]:
nrow(dfsnp)

Out[4]:
1642

We extract a data matrix, run our SVD algorithm with $k=2$, and plot the data in the projected subspace of the first two singular vectors.

In [5]:
A = reduce(hcat, [dfsnp[:,col] for col in names(dfsnp)[2:end]])

Out[5]:
1642×317 Array{Float64,2}:
1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  1.0  0.0  1.0  0.0
1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  1.0  0.0  1.0  0.0
1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  1.0  0.0  1.0  0.0
1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  1.0  0.0  1.0  0.0
1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  1.0  0.0  1.0  0.0
1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  1.0  0.0  1.0  0.0
1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  1.0  0.0  1.0  0.0
1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  1.0  0.0  1.0  0.0
1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  1.0  0.0  1.0  0.0
1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  1.0  0.0  1.0  0.0
1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  1.0  0.0  1.0  0.0
1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  1.0  0.0  1.0  0.0
1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  1.0  0.0  1.0  0.0
⋮                        ⋮              ⋱  ⋮                        ⋮
1.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  …  0.0  0.0  0.0  1.0  0.0  1.0  0.0
1.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0     0.0  0.0  0.0  1.0  0.0  1.0  0.0
1.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0     0.0  0.0  0.0  1.0  0.0  1.0  0.0
1.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0     0.0  0.0  0.0  1.0  0.0  1.0  0.0
1.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0     0.0  0.0  0.0  1.0  0.0  1.0  0.0
1.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  …  0.0  0.0  0.0  1.0  0.0  1.0  0.0
1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  1.0  0.0  1.0  0.0
1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  1.0  0.0  1.0  0.0
1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  1.0  0.0  1.0  0.0
1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  1.0  0.0  1.0  0.0
1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  1.0  0.0  1.0  0.0
1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  1.0  0.0  1.0  0.0
In [6]:
U, S, V = mmids_svd(A, 2);

In [7]:
scatter(U[:,1].*S[1], U[:,2].*S[2], legend = false)

Out[7]:

There seems to be some reasonably well-defined clusters in this projection. To further reveal the structure, we color the data points by year. That information is in a separate file.

In [8]:
dfoth = CSV.read("h3n2-other.csv")
first(dfoth, 5)

Out[8]:

5 rows × 7 columns

strainlengthcountryyearlonlatdate
StringInt64StringInt64Float64Float64String
1AB4341071701Japan2002137.21535.58422002/02/25
2AB4341081701Japan2002137.21535.58422002/03/01
3CY0001131762USA2002-73.9440.672002/01/29
4CY0002091760USA2002-73.9440.672002/01/17
5CY0002171760USA2002-73.9440.672002/02/26
In [9]:
year = dfoth[:,:year];

In [10]:
scatter(U[:,1].*S[1], U[:,2].*S[2], marker_z=year, legend=false,
colorbar=:right, clims=(2001,2006), seriescolor=:lightrainbow)

Out[10]:

To some extent, one can "see" the virus evolving from year to year.

One last note: in this case better results can be obtained using other related dimensionality reduction methods.

## 2 Movie recommendations [optional]¶

Next, we return to the MovieLens dataset from the first notebook of this topic.

In [11]:
dfm = CSV.read("movielens-small-movies.csv")
first(dfm, 5)

Out[11]:

5 rows × 2 columns

movieIdtitle
Int64String
11Toy Story (1995)
22Jumanji (1995)
33Grumpier Old Men (1995)
44Waiting to Exhale (1995)
55Father of the Bride Part II (1995)
In [12]:
dfr = CSV.read("movielens-small-ratings.csv")
first(dfr, 5)

Out[12]:

5 rows × 3 columns

userIdmovieIdrating
Int64Int64Float64
1114.0
2134.0
3164.0
41445.0
51475.0

Here's a summary of the data. There are $9742$ movies and $100836$ ratings made by $610$ users.

In [13]:
nrow(dfr)

Out[13]:
100836
In [14]:
describe(dfm)

Out[14]:

2 rows × 8 columns (omitted printing of 3 columns)

variablemeanminmedianmax
SymbolUnion…AnyUnion…Any
1movieId4871.514871.59742
2title'71 (2014)À nous la liberté (Freedom for Us) (1931)
In [15]:
describe(dfr)

Out[15]:

3 rows × 8 columns

variablemeanminmedianmaxnuniquenmissingeltype
SymbolFloat64RealFloat64RealNothingNothingDataType
1userId326.1281325.0610Int64
2movieId3108.3812255.09742Int64
3rating3.501560.53.55.0Float64

We will convert the ratings data into a matrix.

We construct a matrix whose rows are the movies and whose columns are the users. Because this matrix contains many zeros, which we use to encode the absence of a rating, we first define it using a sparse array (that is, we only list the non-zero entries).

In [16]:
A = sparse(dfr[:,:movieId],dfr[:,:userId],dfr[:,:rating])

Out[16]:
9742×610 SparseMatrixCSC{Float64,Int64} with 100836 stored entries:
[1   ,   1]  =  4.0
[3   ,   1]  =  4.0
[6   ,   1]  =  4.0
[44  ,   1]  =  5.0
[47  ,   1]  =  5.0
[63  ,   1]  =  3.0
[90  ,   1]  =  5.0
[98  ,   1]  =  4.0
[125 ,   1]  =  5.0
[131 ,   1]  =  5.0
[137 ,   1]  =  5.0
[185 ,   1]  =  5.0
⋮
[9343, 610]  =  3.0
[9358, 610]  =  4.0
[9360, 610]  =  4.0
[9367, 610]  =  3.5
[9390, 610]  =  3.5
[9391, 610]  =  3.5
[9393, 610]  =  5.0
[9434, 610]  =  4.0
[9435, 610]  =  4.0
[9462, 610]  =  5.0
[9463, 610]  =  5.0
[9464, 610]  =  5.0
[9504, 610]  =  3.0

Next, we extract the movie titles.

In [17]:
movie_titles = dfm[:,:title]

Out[17]:
9742-element Array{String,1}:
"Toy Story (1995)"
"Jumanji (1995)"
"Grumpier Old Men (1995)"
"Waiting to Exhale (1995)"
"Father of the Bride Part II (1995)"
"Heat (1995)"
"Sabrina (1995)"
"Tom and Huck (1995)"
"Sudden Death (1995)"
"GoldenEye (1995)"
"American President, The (1995)"
"Dracula: Dead and Loving It (1995)"
"Balto (1995)"
⋮
"Hommage à Zgougou (et salut à Sabine Mamou) (2002)"
"Gintama (2017)"
"Gintama: The Movie (2010)"
"anohana: The Flower We Saw That Day - The Movie (2013)"
"Silver Spoon (2014)"
"Love Live! The School Idol Movie (2015)"
"Jon Stewart Has Left the Building (2015)"
"Black Butler: Book of the Atlantic (2017)"
"No Game No Life: Zero (2017)"
"Flint (2017)"
"Bungo Stray Dogs: Dead Apple (2018)"
"Andrew Dice Clay: Dice Rules (1991)"

Before running our SVD algorithm, we preprocess our data. We first remove all movies without ratings. We also replace all missing ratings with the average rating of the corresponding movie. Finally, we center the ratings by subtracting the mean rating of each user.

In [18]:
## PRE-PROCESSING of A
## OUTPUT: Z, movie_titles_Z, mZ

# restrict to rows with at least one rating
nnz_movies = findall(s -> s > 0, vec(sum(A,dims=2)))
Z = copy(Array{Float64,2}(A[nnz_movies,:]))

# for each missing rating, replace with mean rating of corresponding movie
for i=1:size(Z,1)
nnz_ratings_for_i = findall(r -> r > 0, Z[i,:])
z_ratings_for_i = findall(r -> r == 0, Z[i,:])
Z[i,z_ratings_for_i] .= mean(Z[i,nnz_ratings_for_i])
end

# for each user, center ratings
mZ = zeros(Float64, size(Z,2))
for j=1:size(Z,2)
mZ[j] = mean(Z[:,j])
Z[:,j] .-= mZ[j]
end

# update movie titles
movie_titles_Z = movie_titles[nnz_movies];


We use our SVD algorithm to compute the top 10 singular vectors and plot the corresponding singular values.

In [19]:
U, S, V = mmids_svd(Z, 10);

In [20]:
plot(log.(S), legend=false)

Out[20]:

We notice that the singular values drop quickly, suggesting that the data can be well approximated by a low-dimensional projection.

We construct predicted ratings for each user and each movie based on the projected data. To do so, we add back the average rating for each user.

In [21]:
k = 10
Apred = U[:,1:k]*Diagonal(S[1:k])*V[:,1:k]'

# for each user, de-center the ratings
for j=1:size(Z,2)
Apred[:,j] .+= mZ[j]
end

Apred

Out[21]:
9724×610 Array{Float64,2}:
4.26165  3.91618   3.84328   …  4.0369   2.84007   3.86721  4.34904
3.70509  3.43685   3.40225      3.51824  2.80578   3.39466  3.7748
3.5377   3.25767   3.19452      3.39921  2.39665   3.2241   3.47571
2.36919  2.35648   2.34125      2.38415  2.27326   2.35112  2.33453
3.21588  3.0778    3.04875      3.15795  2.59782   3.04819  3.07079
4.13084  3.95377   3.99495   …  3.96292  3.81733   3.93173  4.23344
3.27396  3.18811   3.18621      3.23586  2.94751   3.17811  3.27278
2.89489  2.87598   2.8774       2.88675  2.82013   2.87419  2.8784
3.13811  3.12598   3.12126      3.17532  3.06104   3.12708  3.17527
3.71353  3.50268   3.51715      3.54102  3.33009   3.48468  3.69038
3.84006  3.67362   3.70846   …  3.69894  3.52891   3.67782  3.76047
2.49204  2.42525   2.45175      2.45259  2.37653   2.42259  2.58979
3.1537   3.12513   3.11888      3.13561  3.06168   3.12295  3.12655
⋮                            ⋱
1.00386  0.999757  0.997808     1.0036   0.990278  1.00168  1.00098
4.50888  4.50039   4.49873      4.49828  4.50016   4.50027  4.50686
3.50745  3.50021   3.49847      3.4998   3.49733   3.50067  3.50518
3.00673  3.00012   2.99834   …  3.00056  2.99592   3.00087  3.00434
4.00817  4.0003    3.9986       3.99904  3.99875   4.00047  4.00602
4.00817  4.0003    3.9986       3.99904  3.99875   4.00047  4.00602
3.50745  3.50021   3.49847      3.4998   3.49733   3.50067  3.50518
4.00817  4.0003    3.9986       3.99904  3.99875   4.00047  4.00602
3.50745  3.50021   3.49847   …  3.4998   3.49733   3.50067  3.50518
3.50745  3.50021   3.49847      3.4998   3.49733   3.50067  3.50518
3.50745  3.50021   3.49847      3.4998   3.49733   3.50067  3.50518
4.00817  4.0003    3.9986       3.99904  3.99875   4.00047  4.00602

Compare to the original dataset.

In [22]:
Array(A[nnz_movies,:])

Out[22]:
9724×610 Array{Float64,2}:
4.0  0.0  0.0  0.0  4.0  0.0  4.5  0.0  …  3.0  4.0  2.5  4.0  2.5  3.0  5.0
0.0  0.0  0.0  0.0  0.0  4.0  0.0  4.0     5.0  3.5  0.0  0.0  2.0  0.0  0.0
4.0  0.0  0.0  0.0  0.0  5.0  0.0  0.0     0.0  0.0  0.0  0.0  2.0  0.0  0.0
0.0  0.0  0.0  0.0  0.0  3.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
0.0  0.0  0.0  0.0  0.0  5.0  0.0  0.0     3.0  0.0  0.0  0.0  0.0  0.0  0.0
4.0  0.0  0.0  0.0  0.0  4.0  0.0  0.0  …  3.0  0.0  0.0  0.0  0.0  0.0  5.0
0.0  0.0  0.0  0.0  0.0  4.0  0.0  0.0     0.0  0.0  2.5  0.0  0.0  0.0  0.0
0.0  0.0  0.0  0.0  0.0  3.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
0.0  0.0  0.0  0.0  0.0  3.0  0.0  2.0     0.0  0.0  0.0  0.0  4.0  4.0  0.0
0.0  0.0  0.0  0.0  0.0  4.0  0.0  4.0  …  0.0  0.0  2.5  3.0  0.0  0.0  0.0
0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
0.0  0.0  0.0  0.0  0.0  3.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
⋮                        ⋮              ⋱            ⋮
0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0

For background on using SVD to make movie recommendations, see e.g. here. For a discussion of other methods, see e.g. here.