TOPIC 2

Spectral and singular value decompositions


Course: Math 535 - Mathematical Methods in Data Science (MMiDS)
Author: Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison
Updated: Oct 10, 2020
Copyright: © 2020 Sebastien Roch


ml-cheat-sheet

(Source)

Motivating example 1: movie recommendations

We would like to make movie recommmendations based on prior user ratings. For this purpose, we will use the MovieLens dataset. There are two files. The first one contains a list of movieId's and title's. The second one is a list of user ratings: userId is the user providing the rating, movieId is the movie being rated, and rating is the rating on a $1$ to $5$ scale (allowing half-points).

In [2]:
dfm = CSV.read("movielens-small-movies.csv")
first(dfm, 10)
Out[2]:

10 rows × 2 columns

movieIdtitle
Int64String
11Toy Story (1995)
22Jumanji (1995)
33Grumpier Old Men (1995)
44Waiting to Exhale (1995)
55Father of the Bride Part II (1995)
66Heat (1995)
77Sabrina (1995)
88Tom and Huck (1995)
99Sudden Death (1995)
1010GoldenEye (1995)
In [3]:
dfr = CSV.read("movielens-small-ratings.csv")
first(dfr, 10)
Out[3]:

10 rows × 3 columns

userIdmovieIdrating
Int64Int64Float64
1114.0
2134.0
3164.0
41445.0
51475.0
61633.0
71905.0
81984.0
911255.0
1011315.0
In [4]:
describe(dfm)
Out[4]:

2 rows × 8 columns (omitted printing of 3 columns)

variablemeanminmedianmax
SymbolUnion…AnyUnion…Any
1movieId4871.514871.59742
2title'71 (2014)À nous la liberté (Freedom for Us) (1931)
In [5]:
describe(dfr)
Out[5]:

3 rows × 8 columns

variablemeanminmedianmaxnuniquenmissingeltype
SymbolFloat64RealFloat64RealNothingNothingDataType
1userId326.1281325.0610Int64
2movieId3108.3812255.09742Int64
3rating3.501560.53.55.0Float64
In [6]:
nrow(dfr)
Out[6]:
100836

For instance, rating $1000$ was made by user:

In [7]:
dfr[1000,:userId]
Out[7]:
7

who watched the movie:

In [8]:
dfm[dfr[1000,:movieId],:title]
Out[8]:
"Phantom of the Opera, The (2004)"

and gave it rating:

In [9]:
dfr[1000,:rating]
Out[9]:
2.0

He or she was not a fan...

On the other hand, that same user really enjoyed:

In [10]:
[dfm[dfr[i,:movieId],:title] for i=1:nrow(dfr)
        if (dfr[i,:userId] .== dfr[1000,:userId]) & (dfr[i,:rating] .== 5)]
Out[10]:
13-element Array{String,1}:
 "Star Wars: Episode IV - A New Hope (1977)"
 "Forrest Gump (1994)"
 "Hot Shots! Part Deux (1993)"
 "Jurassic Park (1993)"
 "Silence of the Lambs, The (1991)"
 "Psycho (1960)"
 "Terminator, The (1984)"
 "Back to the Future (1985)"
 "Contact (1997)"
 "Seven Samurai (Shichinin no samurai) (1954)"
 "Planet of the Apes (1968)"
 "Naked Gun 2 1/2: The Smell of Fear, The (1991)"
 "Spirited Away (Sen to Chihiro no kamikakushi) (2001)"

We will convert the ratings data into a matrix. Each movies is rated by only a handful of users. More specifically, out of the

In [11]:
length(unique(dfr[:,:userId])) * length(unique(dfr[:,:movieId]))
Out[11]:
5931640

possible ratings, we only have

In [12]:
nrow(dfr)
Out[12]:
100836

or, in percentage,

In [13]:
nrow(dfr) / (length(unique(dfr[:,:userId])) * length(unique(dfr[:,:movieId])))
Out[13]:
0.016999683055613623

We construct a matrix whose rows are the movies and whose columns are the users. Because this matrix will contain a lot of zeros, which we use to encode the absence of a rating, we first define it using a sparse array (that is, we only list the non-zero entries), and then convert it to a dense matrix:

In [14]:
A = sparse(dfr[:,:movieId],dfr[:,:userId],dfr[:,:rating]);

The sparsity of this dataset is an issue. Two users may have very similar tastes, but may have rewieved different movies making it difficult to identify the similarity between them. One way to deal with the high-dimensionality of this data is to assume that it has a low-dimensional structure. Here it is natural to assume that there are a small number of "idealized" movies -- think action, comedy, horror, etc. -- and a small number of "idealized" moviegoers -- think people who like only action, people who like only comedy, people who like only horror, etc. -- and that every movie or moviegoer is a linear combination of these idealized entities. Mathemiatically, we'll see that we are looking for a low-rank approximation of our matrix $A$.

Motivating example 2: community detection

We will also look at the Karate Club dataset.

From Wikipedia:

A social network of a karate club was studied by Wayne W. Zachary for a period of three years from 1970 to 1972. The network captures 34 members of a karate club, documenting links between pairs of members who interacted outside the club. During the study a conflict arose between the administrator "John A" and instructor "Mr. Hi" (pseudonyms), which led to the split of the club into two. Half of the members formed a new club around Mr. Hi; members from the other part found a new instructor or gave up karate. Based on collected data Zachary correctly assigned all but one member of the club to the groups they actually joined after the split.

In [15]:
g = smallgraph(:karate)
gplot(g, nodelabel=1:nv(g))
Out[15]:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34

Our goal:

identify natural sub-communities in the network

That is, we want to find groups of nodes that have many links between them, but relatively few with the other nodes.

It will turn out that the eigenvectors of the Laplacian matrix, a matrix naturally associated to the graph, contain useful information about such communities.