*Course:* Math 535 - Mathematical Methods in Data Science (MMiDS)

*Author:* Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison

*Updated:* Oct 10, 2020

*Copyright:* © 2020 Sebastien Roch

(Source)

We would like to make movie recommmendations based on prior user ratings. For this purpose, we will use the MovieLens dataset. There are two files. The first one contains a list of `movieId`

's and `title`

's. The second one is a list of user ratings: `userId`

is the user providing the rating, `movieId`

is the movie being rated, and `rating`

is the rating on a $1$ to $5$ scale (allowing half-points).

In [2]:

```
dfm = CSV.read("movielens-small-movies.csv")
first(dfm, 10)
```

Out[2]:

10 rows Ã— 2 columns

movieId | title | |
---|---|---|

Int64 | String | |

1 | 1 | Toy Story (1995) |

2 | 2 | Jumanji (1995) |

3 | 3 | Grumpier Old Men (1995) |

4 | 4 | Waiting to Exhale (1995) |

5 | 5 | Father of the Bride Part II (1995) |

6 | 6 | Heat (1995) |

7 | 7 | Sabrina (1995) |

8 | 8 | Tom and Huck (1995) |

9 | 9 | Sudden Death (1995) |

10 | 10 | GoldenEye (1995) |

In [3]:

```
dfr = CSV.read("movielens-small-ratings.csv")
first(dfr, 10)
```

Out[3]:

10 rows Ã— 3 columns

userId | movieId | rating | |
---|---|---|---|

Int64 | Int64 | Float64 | |

1 | 1 | 1 | 4.0 |

2 | 1 | 3 | 4.0 |

3 | 1 | 6 | 4.0 |

4 | 1 | 44 | 5.0 |

5 | 1 | 47 | 5.0 |

6 | 1 | 63 | 3.0 |

7 | 1 | 90 | 5.0 |

8 | 1 | 98 | 4.0 |

9 | 1 | 125 | 5.0 |

10 | 1 | 131 | 5.0 |

In [4]:

```
describe(dfm)
```

Out[4]:

2 rows Ã— 8 columns (omitted printing of 3 columns)

variable | mean | min | median | max | |
---|---|---|---|---|---|

Symbol | Unionâ€¦ | Any | Unionâ€¦ | Any | |

1 | movieId | 4871.5 | 1 | 4871.5 | 9742 |

2 | title | '71 (2014) | Ã€ nous la libertÃ© (Freedom for Us) (1931) |

In [5]:

```
describe(dfr)
```

Out[5]:

3 rows Ã— 8 columns

variable | mean | min | median | max | nunique | nmissing | eltype | |
---|---|---|---|---|---|---|---|---|

Symbol | Float64 | Real | Float64 | Real | Nothing | Nothing | DataType | |

1 | userId | 326.128 | 1 | 325.0 | 610 | Int64 | ||

2 | movieId | 3108.38 | 1 | 2255.0 | 9742 | Int64 | ||

3 | rating | 3.50156 | 0.5 | 3.5 | 5.0 | Float64 |

In [6]:

```
nrow(dfr)
```

Out[6]:

100836

For instance, rating $1000$ was made by user:

In [7]:

```
dfr[1000,:userId]
```

Out[7]:

7

who watched the movie:

In [8]:

```
dfm[dfr[1000,:movieId],:title]
```

Out[8]:

"Phantom of the Opera, The (2004)"

and gave it rating:

In [9]:

```
dfr[1000,:rating]
```

Out[9]:

2.0

He or she was not a fan...

On the other hand, that same user really enjoyed:

In [10]:

```
[dfm[dfr[i,:movieId],:title] for i=1:nrow(dfr)
if (dfr[i,:userId] .== dfr[1000,:userId]) & (dfr[i,:rating] .== 5)]
```

Out[10]:

13-element Array{String,1}: "Star Wars: Episode IV - A New Hope (1977)" "Forrest Gump (1994)" "Hot Shots! Part Deux (1993)" "Jurassic Park (1993)" "Silence of the Lambs, The (1991)" "Psycho (1960)" "Terminator, The (1984)" "Back to the Future (1985)" "Contact (1997)" "Seven Samurai (Shichinin no samurai) (1954)" "Planet of the Apes (1968)" "Naked Gun 2 1/2: The Smell of Fear, The (1991)" "Spirited Away (Sen to Chihiro no kamikakushi) (2001)"

We will convert the ratings data into a matrix. Each movies is rated by only a handful of users. More specifically, out of the

In [11]:

```
length(unique(dfr[:,:userId])) * length(unique(dfr[:,:movieId]))
```

Out[11]:

5931640

possible ratings, we only have

In [12]:

```
nrow(dfr)
```

Out[12]:

100836

or, in percentage,

In [13]:

```
nrow(dfr) / (length(unique(dfr[:,:userId])) * length(unique(dfr[:,:movieId])))
```

Out[13]:

0.016999683055613623

We construct a matrix whose rows are the movies and whose columns are the users. Because this matrix will contain a lot of zeros, which we use to encode the absence of a rating, we first define it using a `sparse`

array (that is, we only list the non-zero entries), and then convert it to a dense matrix:

In [14]:

```
A = sparse(dfr[:,:movieId],dfr[:,:userId],dfr[:,:rating]);
```

We will also look at the Karate Club dataset.

From Wikipedia:

A social network of a karate club was studied by Wayne W. Zachary for a period of three years from 1970 to 1972. The network captures 34 members of a karate club, documenting links between pairs of members who interacted outside the club. During the study a conflict arose between the administrator "John A" and instructor "Mr. Hi" (pseudonyms), which led to the split of the club into two. Half of the members formed a new club around Mr. Hi; members from the other part found a new instructor or gave up karate. Based on collected data Zachary correctly assigned all but one member of the club to the groups they actually joined after the split.

In [15]:

```
g = smallgraph(:karate)
gplot(g, nodelabel=1:nv(g))
```

Out[15]:

Our goal:

identify natural sub-communities in the network

That is, we want to find groups of nodes that have many links between them, but relatively few with the other nodes.

It will turn out that the eigenvectors of the Laplacian matrix, a matrix naturally associated to the graph, contain useful information about such communities.