$\newcommand{\P}{\mathbb{P}}$ $\newcommand{\E}{\mathbb{E}}$ $\newcommand{\S}{\mathcal{S}}$ $\newcommand{\var}{\mathrm{Var}}$ $\newcommand{\btheta}{\boldsymbol{\theta}}$ $\newcommand{\bpi}{\boldsymbol{\pi}}$ $\newcommand{\indep}{\perp\!\!\!\perp}$ $\newcommand{\bp}{\mathbf{p}}$ $\newcommand{\bx}{\mathbf{x}}$ $\newcommand{\bX}{\mathbf{X}}$ $\newcommand{\by}{\mathbf{y}}$ $\newcommand{\bY}{\mathbf{Y}}$ $\newcommand{\bz}{\mathbf{z}}$ $\newcommand{\bZ}{\mathbf{Z}}$ $\newcommand{\bw}{\mathbf{w}}$ $\newcommand{\bW}{\mathbf{W}}$
Course: Math 535 - Mathematical Methods in Data Science (MMiDS)
Author: Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison
Updated: Nov 28, 2020
Copyright: © 2020 Sebastien Roch
We have encountered the multivariate Bernoulli naive Bayes model in a previous lecture. We will use it here for a document classification-type task. We first recall the model and discuss its fitting from training data.
Model: This model is useful for document classification. We assume that a document has a single topic $C$ from a list $\mathcal{C} = \{c_1, \ldots, c_K\}$ with probability distribution $\pi_k = \P[C = c_k]$. There is a vocabulary of size $M$ and we record the presence or absence of a word $m$ in the document with a Bernoulli variable $X_m$, where $p_{km} = \P[X_m = 1|C = c_k]$. We denote by $\bX = (X_1, \ldots, X_M)$ the corresponding vector.
The conditional independence assumption comes next: we assume that, given a topic $C$, the word occurrences are independent. That is,
$$ \begin{align*} \P[\bX = \bx|C=c_k] &= \prod_{m=1}^M \P[X_m = x_m|C = c_k]\\ &= \prod_{m=1}^M p_{km}^{x_m} (1-p_{km})^{1-x_m}. \end{align*} $$Finally, the joint distribution is
$$ \begin{align*} \P[C = c_k, \bX = \bx] &= \P[\bX = \bx|C=c_k] \,\P[C=c_k]\\ &= \pi_k \prod_{m=1}^M p_{km}^{x_m} (1-p_{km})^{1-x_m}. \end{align*} $$Prediction: To predict the class of a new document, it is natural to maximize over $k$ the probability of $\{C=c_k\}$ given $\{\bX = \bx\}$. By Bayes' rule,
$$ \begin{align*} \P[C=c_k | \bX = \bx] &= \frac{\P[C = c_k, \bX = \bx]}{\P[\bX = \bx]}\\ &= \frac{\pi_k \prod_{m=1}^M p_{km}^{x_m} (1-p_{km})^{1-x_m}} {\sum_{k'=1}^K \pi_{k'} \prod_{m=1}^M p_{k'm}^{x_m} (1-p_{k'm})^{1-x_m}}. \end{align*} $$As the denominator does not in fact depend on $k$, maximizing $\P[C=c_k | \bX = \bx]$ boils down to maximizing the numerator $\pi_k \prod_{m=1}^M p_{km}^{x_m} (1-p_{km})^{1-x_m}$, which is straightforward to compute.
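To make the rule concrete, here is a minimal sketch of this prediction step in Julia. It is an illustration only: the helper nb_predict and the parameter values (two topics, three words) are made up and are not part of the original notes. Taking logarithms does not change the maximizer and avoids numerical underflow when $M$ is large.
# toy parameters: K = 2 topics, M = 3 words (made-up values for illustration)
prior = [0.6, 0.4]                # prior[k] = P[C = c_k]
pkm = [0.9 0.2 0.5; 0.1 0.8 0.5]  # pkm[k,m] = P[X_m = 1 | C = c_k]
# score each topic by the log of the numerator above and return the best index
function nb_predict(prior, pkm, x)
    scores = zeros(length(prior))
    for k in 1:length(prior)
        scores[k] = log(prior[k]) +
            sum(x .* log.(pkm[k,:]) .+ (1 .- x) .* log.(1 .- pkm[k,:]))
    end
    return argmax(scores)
end
nb_predict(prior, pkm, [1, 0, 1]) # predicted topic index for x = (1, 0, 1)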
Model fitting: Before using the prediction scheme above, one must first fit the model from training data $\{\bX_i, C_i\}_{i=1}^n$. In this case, it means estimating the unknown parameters $\bpi$ and $\{\bp_k\}_{k=1}^K$, where $\bp_k = (p_{k1},\ldots, p_{kM})$. For each $k, m$ let
$$ N_{km} = \sum_{i=1}^n \mathbf{1}_{\{C_i = c_k\}}X_{i,m}, \quad N_{k} = \sum_{i=1}^n \mathbf{1}_{\{C_i = c_k\}}. $$A standard approach is maximum likelihood estimation, which involves finding the parameters that maximize the probability of observing the training data
$$ \mathcal{L}\left(\bpi, \{\bp_k\}; \{\bX_i, C_i\}\right) = \prod_{i=1}^n \pi_{C_i} \prod_{m=1}^M p_{C_i, m}^{X_{i,m}} (1-p_{C_i, m})^{1-X_{i,m}}, $$where we assumed that the samples are independent and identically distributed. Taking a logarithm to turn the products into sums and simplifying gives
$$ \begin{align*} &-\ln \mathcal{L}\left(\bpi, \{\bp_k\}; \{\bX_i, C_i\}\right)\\ &\quad = - \sum_{i=1}^n \ln \pi_{C_i} - \sum_{i=1}^n \sum_{m=1}^M [X_{i,m} \ln p_{C_{i}, m} + (1-X_{i,m}) \ln (1-p_{C_i, m})]\\ &\quad = - \sum_{k=1}^K N_k \ln \pi_k - \sum_{k=1}^K \sum_{m=1}^M [N_{km} \ln p_{km} + (N_k-N_{km}) \ln (1-p_{km})]. \end{align*} $$Assuming $N_k > 0$ for all $k$, it can be shown using calculus (by setting the partial derivatives to zero, with a Lagrange multiplier enforcing $\sum_k \pi_k = 1$) that the optimum is reached at
$$ \hat{\pi}_k = \frac{N_k}{n}, \quad \hat{p}_{km} = \frac{N_{km}}{N_k} $$for all $k, m$. While maximum likelihood estimation has desirable theoretical properties, it does suffer from overfitting. In this case, if for instance a particular word does not occur in any training document, then the probability of observing a new document containing that word is estimated to be $0$ for any class. One approach to deal with this is Laplace smoothing
$$ \bar{\pi}_k = \frac{N_k + \alpha}{n + K \alpha}, \quad \bar{p}_{km} = \frac{N_{km} + \beta}{N_k + 2 \beta} $$where $\alpha, \beta > 0$, which can be justified using a Bayesian perspective.
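For a concrete sense of what smoothing does, here is a tiny numerical illustration in Julia; the counts below are made up and are not from the dataset. The third word never occurs in the class, yet its smoothed estimate is strictly positive.
# toy counts for a single class (made up for illustration)
N_k = 10                              # number of documents in the class
N_km = [7, 3, 0]                      # documents containing each of 3 words
β = 1
p_mle = N_km ./ N_k                   # maximum likelihood: 0.7, 0.3, 0.0
p_smooth = (N_km .+ β) ./ (N_k + 2β)  # Laplace-smoothed: 8/12, 4/12, 1/12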
We consider the task of sentiment analysis, which is a classification problem. We use a dataset from Crowdflower. The full dataset is available here. Quoting Crowdflower:
A sentiment analysis job about the problems of each major U.S. airline. Twitter data was scraped from February of 2015 and contributors were asked to first classify positive, negative, and neutral tweets, followed by categorizing negative reasons (such as "late flight" or "rude service").
We first load a cleaned-up version of the data and look at its first few rows.
# Julia version: 1.5.1
using CSV, DataFrames, TextAnalysis, Random, Plots, LaTeXStrings
using TextAnalysis: text
df = CSV.read("./twitter-sentiment.csv", DataFrame) # load the cleaned-up dataset
first(df,5) # peek at the first five rows
n = nrow(df) # number of tweets
We use the package TextAnalysis.jl to extract and process text information in this dataset. The Corpus function creates a collection of text documents (one for each tweet) from a DataFrame. The function text shows the text itself.
crps = Corpus(StringDocument.(df[:,:text]))
text.(crps)
We now preprocess the data. In particular, we lower-case all the words and remove punctuation. A more careful pre-processing would also include stemming, although we do not do this here. Regarding the latter, quoting Wikipedia:
In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form. [...] A computer program or subroutine that stems word may be called a stemming program, stemming algorithm, or stemmer. [...] A stemmer for English operating on the stem cat should identify such strings as cats, catlike, and catty. A stemming algorithm might also reduce the words fishing, fished, and fisher to the stem fish. The stem need not be a word, for example the Porter algorithm reduces, argue, argued, argues, arguing, and argus to the stem argu.
Finally, update_lexicon! creates a lexicon from the dataset. Quoting TextAnalysis.jl:
The lexicon of a corpus consists of all the terms that occur in any document in the corpus. The lexical frequency of a term tells us how often a term occurs across all of the documents. Often the most interesting words in a document are those words whose frequency within a document is higher than their frequency in the corpus as a whole.
The preprocessed text is shown below.
# preprocessing: lower-case, strip punctuation, rebuild the lexicon
function text_preproc!(crps)
    remove_corrupt_utf8!(crps)        # drop invalid UTF-8 characters
    remove_case!(crps)                # lower-case all words
    prepare!(crps, strip_punctuation) # remove punctuation
    #stem!(crps)                      # stemming (not used here)
    update_lexicon!(crps)             # rebuild the lexicon of the corpus
end
text_preproc!(crps)
text.(crps)
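As a quick illustration of the lexicon just built (not part of the original analysis), one can look at its size and at the frequency of a specific term; lexicon and lexical_frequency are accessors provided by TextAnalysis.jl, and "flight" is just an example term that presumably occurs in airline tweets.
# a peek at the lexicon built by update_lexicon! (illustrative)
lex = lexicon(crps)               # dictionary of terms and their counts
length(lex)                       # number of distinct terms in the corpus
lexical_frequency(crps, "flight") # frequency of the term "flight" across the corpus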
Next, we convert our dataset into a matrix by creating a document-term matrix using the DocumentTermMatrix function. Quoting Wikipedia:
A document-term matrix or term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.
m = DocumentTermMatrix(crps) # document-term matrix of the corpus
t = m.terms                  # list of terms (one per column of the matrix)
Because we use the multivariate Bernoulli naive Bayes model, it will be more convenient to work with a variant of the document-term matrix that only records whether each word is present or absent. Note that, for tweets, which are very short documents with likely little word repetition, this makes little difference in practice.
dt = m.dtm .> 0 # presence/absence of each term in each document
We also extract the labels (neutral, positive, negative) from the dataset.
labels = df[:,:sentiment]
We split the data into a training set and a test set.
function split_data(m, n, ρ)
    τ = randperm(n) # permutation of the rows
    train_size = Int(floor(ρ*n)) # ρ should be between 0 and 1
    train_set = τ[1:train_size]
    test_set = τ[train_size+1:end]
    return train_set, test_set
end
train_set, test_set = split_data(m, n, 0.9)
dt_train = dt[train_set,:]
labels_train = labels[train_set]
train_size = length(labels_train)
dt_test = dt[test_set,:]
labels_test = labels[test_set]
test_size = length(labels_test);
We implement the Naive Bayes method. We use Laplace smoothing to avoid overfitting issues.
function mmids_nb_fit(c, labels, dt, size; α=1, β=1)
    # rows corresponding to class c
    c_rows = findall(labels.==c)
    # class prior, Laplace-smoothed (K = 3 classes here)
    σ_c = (length(c_rows)+α)/(size+3α)
    # smoothed word presence probabilities: (N_km + β)/(N_k + 2β)
    N_j_c = sum(dt[c_rows,:], dims=1) # N_km: documents of class c containing word m
    N_c = length(c_rows)              # N_k: number of documents of class c
    θ_c = (N_j_c.+β)./(N_c+2β)
    return σ_c, θ_c
end
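As a quick illustrative call (not in the original notes), fitting a single class shows the shape of the output: a scalar prior estimate and a 1×M row of smoothed word probabilities.
# illustrative single-class fit: prior estimate and word probabilities for "negative"
σ_neg, θ_neg = mmids_nb_fit("negative", labels_train, dt_train, train_size)
σ_neg, size(θ_neg)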
We are ready to train on the dataset.
# training
label_set = ["positive", "negative", "neutral"]
σ_c = zeros(3)
θ_c = zeros(3, length(t))
for i=1:length(label_set)
    σ_c[i], θ_c[i,:] = mmids_nb_fit(
        label_set[i], labels_train, dt_train, train_size)
end
σ_c
θ_c
plot([θ_c[1,:], θ_c[2,:], θ_c[3,:]], layout=(3,1), legend=false,
ylabel=[L"$\theta_{pos}$" L"$\theta_{neg}$" L"$\theta_{ntr}$"])