*Course:* Math 535 - Mathematical Methods in Data Science (MMiDS)

*Author:* Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison

*Updated:* Nov 28, 2020

*Copyright:* © 2020 Sebastien Roch

As motivation, we consider the task of sentiment analysis, which is a classification problem. We use a dataset from Crowdflower. The full dataset is available here. Quoting Crowdflower:

A sentiment analysis job about the problems of each major U.S. airline. Twitter data was scraped from February of 2015 and contributors were asked to first classify positive, negative, and neutral tweets, followed by categorizing negative reasons (such as "late flight" or "rude service").

We first load the data and look at its summary.

In [1]:

```
# Julia version: 1.5.1
using CSV, DataFrames
```

In [2]:

```
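# Load the tweets into a DataFrame; recent CSV.jl versions require the sink argument
df = CSV.read("twitter-sentiment.csv", DataFrame)
# Display the first five rows
first(df, 5)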
```

The number of tweets in this dataset is:

In [3]:

```
n = nrow(df)
```

Our goal is to classify tweets about airlines as positive, negative or neutral. A natural assumption is that tweets with different sentiments tend to use different words. We take this into account by separately modeling the probability distribution of the words used in positive, negative and neutral tweets. Standard probabilistic reasoning then leads to a natural classifier.
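To make the idea concrete, here is a minimal sketch of one common way to build such a classifier: a naive Bayes model with add-one smoothing, where the score of a class is its log prior plus the sum of the log probabilities of the words in the tweet. This is an assumption for illustration and may differ from the method developed later in this topic. The toy tweets and the names `train`, `log_score` and `classify` are hypothetical and do not come from the dataset.

```julia
# Toy training set (made up for illustration, not taken from the dataset)
train = [("great flight friendly crew", "positive"),
         ("late flight rude service", "negative"),
         ("lost bag rude crew", "negative"),
         ("flight to chicago", "neutral")]

# Count tweets per class and word occurrences per class
word_counts = Dict{String,Dict{String,Int}}()
class_counts = Dict{String,Int}()
vocab = Set{String}()
for (text, label) in train
    class_counts[label] = get(class_counts, label, 0) + 1
    counts = get!(word_counts, label, Dict{String,Int}())
    for w in split(text)
        word = String(w)
        counts[word] = get(counts, word, 0) + 1
        push!(vocab, word)
    end
end

# Log prior of the class plus log-likelihood of the words,
# with add-one smoothing so unseen words do not give probability zero
function log_score(text, label)
    counts = word_counts[label]
    total = sum(values(counts))
    s = log(class_counts[label] / length(train))
    for w in split(text)
        s += log((get(counts, String(w), 0) + 1) / (total + length(vocab)))
    end
    return s
end

# Predict the class with the highest score
function classify(text)
    labels = collect(keys(class_counts))
    scores = [log_score(text, c) for c in labels]
    return labels[argmax(scores)]
end

println(classify("rude service"))
println(classify("great friendly crew"))
```

On this toy data, the word "rude" appears only in negative tweets, so a tweet containing it is pushed toward the negative class. The scores are unnormalized log probabilities. Exponentiating and normalizing them would give class probabilities, which is one way such models quantify uncertainty.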

More broadly, in this topic we study various methods based on the probabilistic modeling of data. Along the way, we will revisit several problems we have encountered previously. An advantage of these approaches is the ability to quantify the uncertainty in our inferences.

Parts of this topic's notebooks are based on the following references.

[Bis] C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.

[Khu] A. I. Khuri, Advanced Calculus with Applications in Statistics (2nd ed.), Wiley, 2003.

[Mur] K. P. Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, 2012.