TOPIC 4

Probabilistic modeling, inference and sampling

Motivating example: Twitter sentiment analysis


Course: Math 535 - Mathematical Methods in Data Science (MMiDS)
Author: Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison
Updated: Nov 28, 2020
Copyright: © 2020 Sebastien Roch


As motivation, we consider the task of sentiment analysis, which is a classification problem. We use a dataset from Crowdflower. The full dataset is available here. Quoting Crowdflower:

A sentiment analysis job about the problems of each major U.S. airline. Twitter data was scraped from February of 2015 and contributors were asked to first classify positive, negative, and neutral tweets, followed by categorizing negative reasons (such as "late flight" or "rude service").

We first load the data and look at its summary.

In [2]:
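using CSV, DataFrames
df = CSV.read("twitter-sentiment.csv", DataFrame)  # recent CSV.jl versions require the sink type
first(df, 5)  # show the first five rows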
Out[2]:

5 rows × 4 columns

Row | time          | user       | sentiment | text
    | String        | String     | String    | String
1   | 2/24/15 11:35 | cairdin    | neutral   | @VirginAmerica What @dhepburn said.
2   | 2/24/15 11:15 | jnardino   | positive  | @VirginAmerica plus you've added commercials to the experience... tacky.
3   | 2/24/15 11:15 | yvonnalynn | neutral   | @VirginAmerica I didn't today... Must mean I need to take another trip!
4   | 2/24/15 11:15 | jnardino   | negative  | @VirginAmerica it's really aggressive to blast obnoxious "entertainment" in your guests' faces & they have little recourse
5   | 2/24/15 11:14 | jnardino   | negative  | @VirginAmerica and it's a really big bad thing about it

The number of tweets in this dataset is:

In [3]:
n = nrow(df)
Out[3]:
14640

Our goal is to classify tweets about airlines as positive, negative or neutral. A natural starting point is the expectation that tweets with different sentiments tend to use different words. We take this into account by modeling the probability distribution of the words used separately for positive, negative and neutral tweets. Standard probabilistic reasoning then leads to a natural classifier.
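To make the "standard probabilistic reasoning" concrete, here is a minimal sketch of one way such a classifier can arise. The notation is introduced here for illustration only and is not taken from the notes: $C$ is the sentiment of a tweet, $w_1,\ldots,w_m$ are its words, $\pi_c$ is the probability of sentiment $c$, and $p_c(w)$ is the probability of word $w$ in a tweet of sentiment $c$. The sketch also assumes that words are conditionally independent given the sentiment (the so-called naive Bayes assumption), which is a simplification rather than something stated above.

```latex
% Bayes' rule under the (assumed) conditional independence of words given the sentiment
\mathbb{P}[C = c \mid w_1, \ldots, w_m]
  = \frac{\pi_c \prod_{i=1}^{m} p_c(w_i)}
         {\sum_{c'} \pi_{c'} \prod_{i=1}^{m} p_{c'}(w_i)},
\qquad c \in \{\text{positive}, \text{negative}, \text{neutral}\}.

% The denominator does not depend on c, so the most probable sentiment is
\hat{c}
  = \arg\max_{c} \; \pi_c \prod_{i=1}^{m} p_c(w_i)
  = \arg\max_{c} \left[ \log \pi_c + \sum_{i=1}^{m} \log p_c(w_i) \right].
```

Under these assumptions, the posterior probabilities in the first display also give a measure of how confident the classification is, and working with logarithms avoids numerical underflow when a tweet contains many words.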

More broadly, in this topic we study various methods based on the probabilistic modeling of data. Along the way, we will revisit several problems encountered previously. An advantage of these approaches is the ability to quantify the uncertainty in our inferences.