{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "$\\newcommand{\\P}{\\mathbb{P}}$\n", "$\\newcommand{\\E}{\\mathbb{E}}$\n", "$\\newcommand{\\S}{\\mathcal{S}}$\n", "$\\newcommand{\\var}{\\mathrm{Var}}$\n", "$\\newcommand{\\btheta}{\\boldsymbol{\\theta}}$\n", "$\\newcommand{\\bpi}{\\boldsymbol{\\pi}}$\n", "$\\newcommand{\\indep}{\\perp\\!\\!\\!\\perp}$\n", "$\\newcommand{\\bp}{\\mathbf{p}}$\n", "$\\newcommand{\\bx}{\\mathbf{x}}$\n", "$\\newcommand{\\bX}{\\mathbf{X}}$\n", "$\\newcommand{\\by}{\\mathbf{y}}$\n", "$\\newcommand{\\bY}{\\mathbf{Y}}$\n", "$\\newcommand{\\bz}{\\mathbf{z}}$\n", "$\\newcommand{\\bZ}{\\mathbf{Z}}$\n", "$\\newcommand{\\bw}{\\mathbf{w}}$\n", "$\\newcommand{\\bW}{\\mathbf{W}}$\n", "\n", "# TUTORIAL 4a \n", "\n", "\n", "# Twitter sentiment analysis\n", "\n", "***\n", "*Course:* [Math 535](http://www.math.wisc.edu/~roch/mmids/) - Mathematical Methods in Data Science (MMiDS) \n", "*Author:* [Sebastien Roch](http://www.math.wisc.edu/~roch/), Department of Mathematics, University of Wisconsin-Madison \n", "*Updated:* Nov 28, 2020 \n", "*Copyright:* © 2020 Sebastien Roch\n", "***" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 1 Background: Naive Bayes" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "We have encountered the multivariate Bernoulli naive Bayes model in a previous lecture. We will use it here for a document classification-type task. We first recall the model and discuss its fitting from training data." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "**Model:** This model is useful for document classification. We assume that a document has a single topic $C$ from a list $\\mathcal{C} = \\{c_1, \\ldots, c_K\\}$ with probability distribution $\\pi_k = \\P[C = c_k]$. There is a vocabulary of size $M$ and we record the presence or absence of a word $m$ in the document with a Bernoulli variable $X_m$, where $p_{km} = \\P[X_m = 1|C = c_k]$. We denote by $\\bX = (X_1, \\ldots, X_M)$ the corresponding vector." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "The conditional independence assumption comes next: we assume that, given a topic $C$, the word occurrences are independent. That is, \n", "\n", "\n", "\\begin{align*}\n", "\\P[\\bX = \\bx|C=c_k]\n", "&= \\prod_{m=1}^M \\P[X_m = x_m|C = c_k]\\\\\n", "&= \\prod_{m=1}^M p_{km}^{x_m} (1-p_{km})^{1-x_m}.\n", "\\end{align*}\n", "\n", "\n", "Finally, the joint distribution is\n", "\n", "\n", "\\begin{align*}\n", "\\P[C = c_k, \\bX = \\bx]\n", "&= \\P[\\bX = \\bx|C=c_k] \\,\\P[C=c_k]\\\\\n", "&= \\pi_k \\prod_{m=1}^M p_{km}^{x_m} (1-p_{km})^{1-x_m}.\n", "\\end{align*}\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "**Prediction:** To predict the class of a new document, it is natural to maximize over $k$ the probability of $\\{C=c_k\\}$ given $\\{\\bX = \\bx\\}$. By Bayes' rule,\n", "\n", "\n", "\\begin{align*}\n", "\\P[C=c_k | \\bX = \\bx]\n", "&= \\frac{\\P[C = c_k, \\bX = \\bx]}{\\P[\\bX = \\bx]}\\\\\n", "&= \\frac{\\pi_k \\prod_{m=1}^M p_{km}^{x_m} (1-p_{km})^{1-x_m}}\n", "{\\sum_{k'=1}^K \\pi_{k'} \\prod_{m=1}^M p_{k'm}^{x_m} (1-p_{k'm})^{1-x_m}}.\n", "\\end{align*}\n", "\n", "\n", "As the denominator does not in fact depend on $k$, maximizing $\\P[C=c_k | \\bX = \\bx]$ boils down to maximizing the numerator $\\pi_k \\prod_{m=1}^M p_{km}^{x_m} (1-p_{km})^{1-x_m}$, which is straighforward to compute." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "**Model fitting:** Before using the prediction scheme above, one must first fit the model from training data $\\{\\bX_i, C_i\\}_{i=1}^n$. In this case, it means estimating the unknown parameters $\\bpi$ and $\\{\\bp_k\\}_{k=1}^K$, where $\\bp_k = (p_{k1},\\ldots, p_{kM})$. For each $k, m$ let \n", "\n", "$$\n", "N_{km} = \\sum_{i=1}^n \\mathbf{1}_{\\{C_i = c_k\\}}X_{i,m},\n", "\\quad \n", "N_{k} = \\sum_{i=1}^n \\mathbf{1}_{\\{C_i = c_k\\}}.\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "A standard approach is [maximum likelihood estimation](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation), which involves finding the parameters that maximize the probability of observing the training data\n", "\n", "$$\n", "\\mathcal{L}\\left(\\bpi, \\{\\bp_k\\}; \\{\\bX_i, C_i\\}\\right)\n", "= \\prod_{i=1}^n \\pi_{C_i} \\prod_{m=1}^M p_{C_i, m}^{X_{i,m}} (1-p_{C_i, m})^{1-X_{i,m}}.\n", "$$\n", "\n", "where we assumed that the samples are independent and identically distributed. Taking a logarithm to turn the products into sums and simplifying gives\n", "\n", "\n", "\\begin{align*}\n", "&-\\ln \\mathcal{L}\\left(\\bpi, \\{\\bp_k\\}; \\{\\bX_i, C_i\\}\\right)\\\\\n", "&\\quad = - \\sum_{i=1}^n \\ln \\pi_{C_i} - \\sum_{i=1}^n \\sum_{m=1}^M [X_{i,m} \\ln p_{C_{i}, m} + (1-X_{i,m}) \\ln (1-p_{C_i, m})]\\\\\n", "&\\quad = - \\sum_{k=1}^K N_k \\ln \\pi_k - \\sum_{k=1}^K \\sum_{m=1}^M [N_{km} \\ln p_{km} + (N-N_{km}) \\ln (1-p_{km})].\n", "\\end{align*}\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Assuming $N_k > 0$ for all $k$, it can be shown using calculus that the optimum is reached at\n", "\n", "$$\n", "\\hat{\\pi}_k = \\frac{N_k}{N},\n", "\\quad \\hat{p}_{km} = \\frac{N_{km}}{N_k}\n", "$$\n", "\n", "for all $k, m$. While maximum likehood estimation has [desirable theoretical properties](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation#Properties), it does suffer from [overfitting](https://towardsdatascience.com/parameter-inference-maximum-aposteriori-estimate-49f3cd98267a). In this case, if for instance a particular word does not occur in any training document then the probability of observing a new document containing that word is $0$ for any class. One approach to deal with this is [Laplace smoothing](https://en.wikipedia.org/wiki/Additive_smoothing)\n", "\n", "$$\n", "\\bar{\\pi}_k = \\frac{N_k + \\alpha}{N + K \\alpha},\n", "\\quad \\bar{p}_{km} = \\frac{N_{km} + \\beta}{N_k + 2 \\beta}\n", "$$\n", "\n", "where $\\alpha, \\beta > 0$, which can be justified using a Bayesian perspective." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 2 Twitter US Airline Sentiment dataset" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We consider the task of sentiment analysis, which is a classification problem. We use a dataset from [Crowdflower](https://data.world/crowdflower/airline-twitter-sentiment). The full datatset is available [here](https://www.kaggle.com/crowdflower/twitter-airline-sentiment). Quoting [Crowdflower](https://data.world/crowdflower/airline-twitter-sentiment):\n", "\n", "> A sentiment analysis job about the problems of each major U.S. airline. Twitter data was scraped from February of 2015 and contributors were asked to first classify positive, negative, and neutral tweets, followed by categorizing negative reasons (such as \"late flight\" or \"rude service\")." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "We first load a cleaned-up version of the data and look at its summary." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "# Julia version: 1.5.1\n", "using CSV, DataFrames, TextAnalysis, Random, Plots, LaTeXStrings\n", "using TextAnalysis: text" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "

5 rows × 4 columns

timeusersentimenttext
StringStringStringString
12/24/15 11:35cairdinneutral@VirginAmerica What @dhepburn said.
22/24/15 11:15jnardinopositive@VirginAmerica plus you've added commercials to the experience... tacky.
32/24/15 11:15yvonnalynnneutral@VirginAmerica I didn't today... Must mean I need to take another trip!
42/24/15 11:15jnardinonegative@VirginAmerica it's really aggressive to blast obnoxious \"entertainment\" in your guests' faces &amp; they have little recourse
" ], "text/latex": [ "\\begin{tabular}{r|cccc}\n", "\t& time & user & sentiment & text\\\\\n", "\t\\hline\n", "\t& String & String & String & String\\\\\n", "\t\\hline\n", "\t1 & 2/24/15 11:35 & cairdin & neutral & @VirginAmerica What @dhepburn said. \\\\\n", "\t2 & 2/24/15 11:15 & jnardino & positive & @VirginAmerica plus you've added commercials to the experience... tacky. \\\\\n", "\t3 & 2/24/15 11:15 & yvonnalynn & neutral & @VirginAmerica I didn't today... Must mean I need to take another trip! \\\\\n", "\t4 & 2/24/15 11:15 & jnardino & negative & @VirginAmerica it's really aggressive to blast obnoxious \"entertainment\" in your guests' faces \\& they have little recourse \\\\\n", "\t5 & 2/24/15 11:14 & jnardino & negative & @VirginAmerica and it's a really big bad thing about it \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "5×4 DataFrame. Omitted printing of 1 columns\n", "│ Row │ time │ user │ sentiment │\n", "│ │ \u001b[90mString\u001b[39m │ \u001b[90mString\u001b[39m │ \u001b[90mString\u001b[39m │\n", "├─────┼───────────────┼────────────┼───────────┤\n", "│ 1 │ 2/24/15 11:35 │ cairdin │ neutral │\n", "│ 2 │ 2/24/15 11:15 │ jnardino │ positive │\n", "│ 3 │ 2/24/15 11:15 │ yvonnalynn │ neutral │\n", "│ 4 │ 2/24/15 11:15 │ jnardino │ negative │\n", "│ 5 │ 2/24/15 11:14 │ jnardino │ negative │" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = CSV.read(\"./twitter-sentiment.csv\")\n", "first(df,5)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "14640" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "n = nrow(df)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "We use the package [TextAnalysis.jl](https://github.com/JuliaText/TextAnalysis.jl) to extract and process text information in this dataset. The Corpus function creates a collection of text documents (one for each tweet) from a DataFrame. The function text shows the text itself." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "14640-element Array{String,1}:\n", " \"@VirginAmerica What @dhepburn said.\"\n", " \"@VirginAmerica plus you've added commercials to the experience... tacky.\"\n", " \"@VirginAmerica I didn't today... Must mean I need to take another trip!\"\n", " \"@VirginAmerica it's really aggressive to blast obnoxious \\\"entertainment\\\" in your guests' faces & they have little recourse\"\n", " \"@VirginAmerica and it's a really big bad thing about it\"\n", " \"@VirginAmerica seriously would pay \\$30 a flight for seats that didn't have this playing.\\nit's really the only bad thing about flying VA\"\n", " \"@VirginAmerica yes, nearly every time I fly VX this \\x89\\xdb\\xcfear worm\\x89\\u6dd won\\x89۪t go away :)\"\n", " \"@VirginAmerica Really missed a prime opportunity for Men Without Hats parody, there. https://t.co/mWpG7grEZP\"\n", " \"@virginamerica Well, I didn't\\x89\\xdb_but NOW I DO! :-D\"\n", " \"@VirginAmerica it was amazing, and arrived an hour early. You're too good to me.\"\n", " \"@VirginAmerica did you know that suicide is the second leading cause of death among teens 10-24\"\n", " \"@VirginAmerica I <3 pretty graphics. so much better than minimal iconography. :D\"\n", " \"@VirginAmerica This is such a great deal! Already thinking about my 2nd trip to @Australia & I haven't even gone on my 1st trip yet! ;p\"\n", " ⋮\n", " \"Thank you. \\x89\\xdb\\xcf@AmericanAir: @jlhalldc Customer Relations will review your concerns and contact you back directly, John.\\x89\\u6dd\"\n", " \"@AmericanAir How do I change my flight if the phone system keeps telling me that the representatives are busy?\"\n", " \"@AmericanAir Thanks! He is.\"\n", " \"@AmericanAir thx for nothing on getting us out of the country and back to US. Broken plane? Come on. Get another one.\"\n", " \"\\x89\\xdb\\xcf@AmericanAir: @TilleyMonsta George, that doesn't look good. Please follow this link to start the refund process: http://t.co/4gr39s91Dl\\x89\\u6dd_\\xd9\\xf7\\xe2\"\n", " \"@AmericanAir my flight was Cancelled Flightled, leaving tomorrow morning. Auto rebooked for a Tuesday night flight but need to arrive Monday.\"\n", " \"@AmericanAir right on cue with the delays_\\xd9\\xd4\\xce\"\n", " \"@AmericanAir thank you we got on a different flight to Chicago.\"\n", " \"@AmericanAir leaving over 20 minutes Late Flight. No warnings or communication until we were 15 minutes Late Flight. That's called shitty customer svc\"\n", " \"@AmericanAir Please bring American Airlines to #BlackBerry10\"\n", " \"@AmericanAir you have my money, you change my flight, and don't answer your phones! Any other suggestions so I can make my commitment??\"\n", " \"@AmericanAir we have 8 ppl so we need 2 know how many seats are on the next flight. Plz put us on standby for 4 people on the next flight?\"" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "crps = Corpus(StringDocument.(df[:,:text]))\n", "text.(crps)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We first [preprocess](https://juliatext.github.io/TextAnalysis.jl/latest/documents/#Preprocessing-Documents-1) the data. In particular, we lower-case all the words and remove punctuation. A more careful pre-procsseing would also include stemming, although we do not do this here. Regarding the latter, quoting [Wikipedia](https://en.wikipedia.org/wiki/Stemming):\n", "\n", "> In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form. [...] A computer program or subroutine that stems word may be called a stemming program, stemming algorithm, or stemmer. [...] A stemmer for English operating on the stem cat should identify such strings as cats, catlike, and catty. A stemming algorithm might also reduce the words fishing, fished, and fisher to the stem fish. The stem need not be a word, for example the Porter algorithm reduces, argue, argued, argues, arguing, and argus to the stem argu." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Finally, update_lexicon! creates a lexicon from the dataset. Quoting [TextAnalysis.jl](https://juliatext.github.io/TextAnalysis.jl/latest/corpus/#Corpus-Statistics-1):\n", "\n", "> The lexicon of a corpus consists of all the terms that occur in any document in the corpus. The lexical frequency of a term tells us how often a term occurs across all of the documents. Often the most interesting words in a document are those words whose frequency within a document is higher than their frequency in the corpus as a whole.\n", "\n", "The preprocessed text is shown below." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "text_preproc! (generic function with 1 method)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# preprocessing\n", "function text_preproc!(crps)\n", " remove_corrupt_utf8!(crps)\n", " remove_case!(crps)\n", " prepare!(crps, strip_punctuation)\n", " #stem!(crps)\n", " update_lexicon!(crps)\n", "end" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "14640-element Array{String,1}:\n", " \"virginamerica what dhepburn said\"\n", " \"virginamerica plus youve added commercials to the experience tacky\"\n", " \"virginamerica i didnt today must mean i need to take another trip\"\n", " \"virginamerica its really aggressive to blast obnoxious entertainment in your guests faces amp they have little recourse\"\n", " \"virginamerica and its a really big bad thing about it\"\n", " \"virginamerica seriously would pay 30 a flight for seats that didnt have this playing\\nits really the only bad thing about flying va\"\n", " \"virginamerica yes nearly every time i fly vx this ear worm \\u6dd won ۪t go away \"\n", " \"virginamerica really missed a prime opportunity for men without hats parody there httpstcomwpg7grezp\"\n", " \"virginamerica well i didnt but now i do d\"\n", " \"virginamerica it was amazing and arrived an hour early youre too good to me\"\n", " \"virginamerica did you know that suicide is the second leading cause of death among teens 1024\"\n", " \"virginamerica i lt3 pretty graphics so much better than minimal iconography d\"\n", " \"virginamerica this is such a great deal already thinking about my 2nd trip to australia amp i havent even gone on my 1st trip yet p\"\n", " ⋮\n", " \"thank you americanair jlhalldc customer relations will review your concerns and contact you back directly john \\u6dd\"\n", " \"americanair how do i change my flight if the phone system keeps telling me that the representatives are busy\"\n", " \"americanair thanks he is\"\n", " \"americanair thx for nothing on getting us out of the country and back to us broken plane come on get another one\"\n", " \" americanair tilleymonsta george that doesnt look good please follow this link to start the refund process httptco4gr39s91dl \\u6dd \"\n", " \"americanair my flight was cancelled flightled leaving tomorrow morning auto rebooked for a tuesday night flight but need to arrive monday\"\n", " \"americanair right on cue with the delays \"\n", " \"americanair thank you we got on a different flight to chicago\"\n", " \"americanair leaving over 20 minutes late flight no warnings or communication until we were 15 minutes late flight thats called shitty customer svc\"\n", " \"americanair please bring american airlines to blackberry10\"\n", " \"americanair you have my money you change my flight and dont answer your phones any other suggestions so i can make my commitment\"\n", " \"americanair we have 8 ppl so we need 2 know how many seats are on the next flight plz put us on standby for 4 people on the next flight\"" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text_preproc!(crps)\n", "text.(crps)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Next, we convert our dataset into a matrix by creating a document-term matrix using the DocumentTermMatrix function. Quoting [Wikipedia](https://en.wikipedia.org/wiki/Document-term_matrix):\n", "\n", "> A document-term matrix or term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. " ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "16671-element Array{String,1}:\n", " \"0\"\n", " \"00\"\n", " \"001\"\n", " \"0011\"\n", " \"0016\"\n", " \"006\"\n", " \"009\"\n", " \"01\"\n", " \"015\"\n", " \"0162389030167\"\n", " \"0162424965446\"\n", " \"0162431184663\"\n", " \"0167560070877\"\n", " ⋮\n", " \"\\u6ddunfortunately\"\n", " \"\\u6ddw\"\n", " \"۪\"\n", " \"۪all\"\n", " \"۪d\"\n", " \"۪l\"\n", " \"۪ll\"\n", " \"۪m\"\n", " \"۪re\"\n", " \"۪s\"\n", " \"۪t\"\n", " \"۪ve\"" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m = DocumentTermMatrix(crps)\n", "t = m.terms" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Because of our use of the multivariate Bernoulli naive Bayes model, it will be more convenient to work with a variant of the document-term matrix where each word is either present or absent. Note that, in the context of tweet data which are very short documents with likely little word repetition, there is probably not much difference." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "14640×16671 SparseArrays.SparseMatrixCSC{Bool,Int64} with 246999 stored entries:\n", " [36 , 1] = 1\n", " [41 , 1] = 1\n", " [74 , 1] = 1\n", " [217 , 1] = 1\n", " [223 , 1] = 1\n", " [258 , 1] = 1\n", " [338 , 1] = 1\n", " [353 , 1] = 1\n", " [403 , 1] = 1\n", " [436 , 1] = 1\n", " [526 , 1] = 1\n", " [646 , 1] = 1\n", " ⋮\n", " [2701 , 16671] = 1\n", " [3445 , 16671] = 1\n", " [3460 , 16671] = 1\n", " [8042 , 16671] = 1\n", " [10200, 16671] = 1\n", " [10872, 16671] = 1\n", " [11063, 16671] = 1\n", " [11646, 16671] = 1\n", " [11871, 16671] = 1\n", " [12293, 16671] = 1\n", " [13752, 16671] = 1\n", " [14049, 16671] = 1\n", " [14137, 16671] = 1" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dt = m.dtm .> 0" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We also extract the labels (neutral, postive, negative) from the dataset. " ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "14640-element PooledArrays.PooledArray{String,UInt32,1,Array{UInt32,1}}:\n", " \"neutral\"\n", " \"positive\"\n", " \"neutral\"\n", " \"negative\"\n", " \"negative\"\n", " \"negative\"\n", " \"positive\"\n", " \"neutral\"\n", " \"positive\"\n", " \"positive\"\n", " \"neutral\"\n", " \"positive\"\n", " \"positive\"\n", " ⋮\n", " \"positive\"\n", " \"negative\"\n", " \"positive\"\n", " \"negative\"\n", " \"neutral\"\n", " \"negative\"\n", " \"negative\"\n", " \"positive\"\n", " \"negative\"\n", " \"neutral\"\n", " \"negative\"\n", " \"neutral\"" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "labels = df[:,:sentiment]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "We split the data into a training set and a test set. " ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "split_data (generic function with 1 method)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "function split_data(m, n, ρ)\n", " \n", " τ = randperm(n) # permutation of the rows\n", " train_size = Int(floor(ρ*n)) # ρ should be between 0 and 1\n", " train_set = τ[1:train_size]\n", " test_set = τ[train_size+1:end]\n", " \n", " return train_set, test_set\n", "end" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "train_set, test_set = split_data(m , n, 0.9)\n", "dt_train = dt[train_set,:]\n", "labels_train = labels[train_set]\n", "train_size = length(labels_train)\n", "dt_test = dt[test_set,:]\n", "labels_test = labels[test_set]\n", "test_size = length(labels_test);" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "We implement the Naive Bayes method. We use [Laplace smoothing](https://en.wikipedia.org/wiki/Additive_smoothing) to avoid overfitting issues." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "mmids_nb_fit (generic function with 1 method)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "function mmids_nb_fit(c, labels, dt, size; α=1, β=1)\n", " \n", " # rows corresponding to class c\n", " c_rows = findall(labels.==c)\n", " \n", " # class prior\n", " σ_c = (length(c_rows)+α)/(size+3α)\n", " \n", " # term mle\n", " N_j_c = sum(dt[c_rows,:].>=1,dims=1)\n", " N_c = sum(N_j_c)\n", " θ_c = (N_j_c.+β)./(N_c+2β)\n", "\n", " return σ_c, θ_c\n", "end" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "We are ready to train on the dataset." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# training\n", "label_set = [\"positive\", \"negative\", \"neutral\"]\n", "σ_c = zeros(3)\n", "θ_c = zeros(3, length(t))\n", "for i=1:length(label_set)\n", " σ_c[i], θ_c[i,:] = mmids_nb_fit(\n", " label_set[i], labels_train, dt_train, train_size)\n", "end" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "3-element Array{Float64,1}:\n", " 0.16169663859169892\n", " 0.6275134683966918\n", " 0.21078989301160939" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "σ_c" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "3×16671 Array{Float64,2}:\n", " 0.000720783 6.8646e-5 3.4323e-5 3.4323e-5 … 0.000137292 0.000137292\n", " 0.000336648 1.2948e-5 6.474e-6 6.474e-6 0.000207168 5.1792e-5\n", " 0.00150472 2.55037e-5 5.10074e-5 5.10074e-5 0.000153022 5.10074e-5" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "θ_c" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "plot([θ_c[1,:], θ_c[2,:], θ_c[3,:]], layout=(3,1), legend=false, \n", " ylabel=[L\"$\\theta_{pos}$\" L\"$\\theta_{neg}$\" L\"$\\theta_{ntr}\$\"])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "We compute a prediction on each test tweet using the findmax function." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "search: \u001b[0m\u001b[1mf\u001b[22m\u001b[0m\u001b[1mi\u001b[22m\u001b[0m\u001b[1mn\u001b[22m\u001b[0m\u001b[1md\u001b[22m\u001b[0m\u001b[1mm\u001b[22m\u001b[0m\u001b[1ma\u001b[22m\u001b[0m\u001b[1mx\u001b[22m \u001b[0m\u001b[1mf\u001b[22m\u001b[0m\u001b[1mi\u001b[22m\u001b[0m\u001b[1mn\u001b[22m\u001b[0m\u001b[1md\u001b[22m\u001b[0m\u001b[1mm\u001b[22m\u001b[0m\u001b[1ma\u001b[22m\u001b[0m\u001b[1mx\u001b[22m! \u001b[0m\u001b[1mf\u001b[22m\u001b[0m\u001b[1mi\u001b[22m\u001b[0m\u001b[1mn\u001b[22m\u001b[0m\u001b[1md\u001b[22m\u001b[0m\u001b[1mm\u001b[22min \u001b[0m\u001b[1mf\u001b[22m\u001b[0m\u001b[1mi\u001b[22m\u001b[0m\u001b[1mn\u001b[22m\u001b[0m\u001b[1md\u001b[22m\u001b[0m\u001b[1mm\u001b[22min!\n", "\n" ] }, { "data": { "text/latex": [ "\\begin{verbatim}\n", "findmax(itr) -> (x, index)\n", "\\end{verbatim}\n", "Return the maximum element of the collection \\texttt{itr} and its index. If there are multiple maximal elements, then the first one will be returned. If any data element is \\texttt{NaN}, this element is returned. The result is in line with \\texttt{max}.\n", "\n", "The collection must not be empty.\n", "\n", "\\section{Examples}\n", "\\begin{verbatim}\n", "julia> findmax([8,0.1,-9,pi])\n", "(8.0, 1)\n", "\n", "julia> findmax([1,7,7,6])\n", "(7, 2)\n", "\n", "julia> findmax([1,7,7,NaN])\n", "(NaN, 4)\n", "\\end{verbatim}\n", "\\rule{\\textwidth}{1pt}\n", "\\begin{verbatim}\n", "findmax(A; dims) -> (maxval, index)\n", "\\end{verbatim}\n", "For an array input, returns the value and index of the maximum over the given dimensions. \\texttt{NaN} is treated as greater than all other values.\n", "\n", "\\section{Examples}\n", "\\begin{verbatim}\n", "julia> A = [1.0 2; 3 4]\n", "2×2 Array{Float64,2}:\n", " 1.0 2.0\n", " 3.0 4.0\n", "\n", "julia> findmax(A, dims=1)\n", "([3.0 4.0], CartesianIndex{2}[CartesianIndex(2, 1) CartesianIndex(2, 2)])\n", "\n", "julia> findmax(A, dims=2)\n", "([2.0; 4.0], CartesianIndex{2}[CartesianIndex(1, 2); CartesianIndex(2, 2)])\n", "\\end{verbatim}\n" ], "text/markdown": [ "\n", "findmax(itr) -> (x, index)\n", "\n", "\n", "Return the maximum element of the collection itr and its index. If there are multiple maximal elements, then the first one will be returned. If any data element is NaN, this element is returned. The result is in line with max.\n", "\n", "The collection must not be empty.\n", "\n", "# Examples\n", "\n", "jldoctest\n", "julia> findmax([8,0.1,-9,pi])\n", "(8.0, 1)\n", "\n", "julia> findmax([1,7,7,6])\n", "(7, 2)\n", "\n", "julia> findmax([1,7,7,NaN])\n", "(NaN, 4)\n", "\n", "\n", "---\n", "\n", "\n", "findmax(A; dims) -> (maxval, index)\n", "\n", "\n", "For an array input, returns the value and index of the maximum over the given dimensions. NaN is treated as greater than all other values.\n", "\n", "# Examples\n", "\n", "jldoctest\n", "julia> A = [1.0 2; 3 4]\n", "2×2 Array{Float64,2}:\n", " 1.0 2.0\n", " 3.0 4.0\n", "\n", "julia> findmax(A, dims=1)\n", "([3.0 4.0], CartesianIndex{2}[CartesianIndex(2, 1) CartesianIndex(2, 2)])\n", "\n", "julia> findmax(A, dims=2)\n", "([2.0; 4.0], CartesianIndex{2}[CartesianIndex(1, 2); CartesianIndex(2, 2)])\n", "\n" ], "text/plain": [ "\u001b[36m findmax(itr) -> (x, index)\u001b[39m\n", "\n", " Return the maximum element of the collection \u001b[36mitr\u001b[39m and its index. If there are\n", " multiple maximal elements, then the first one will be returned. If any data\n", " element is \u001b[36mNaN\u001b[39m, this element is returned. The result is in line with \u001b[36mmax\u001b[39m.\n", "\n", " The collection must not be empty.\n", "\n", "\u001b[1m Examples\u001b[22m\n", "\u001b[1m ≡≡≡≡≡≡≡≡≡≡\u001b[22m\n", "\n", "\u001b[36m julia> findmax([8,0.1,-9,pi])\u001b[39m\n", "\u001b[36m (8.0, 1)\u001b[39m\n", "\u001b[36m \u001b[39m\n", "\u001b[36m julia> findmax([1,7,7,6])\u001b[39m\n", "\u001b[36m (7, 2)\u001b[39m\n", "\u001b[36m \u001b[39m\n", "\u001b[36m julia> findmax([1,7,7,NaN])\u001b[39m\n", "\u001b[36m (NaN, 4)\u001b[39m\n", "\n", " ────────────────────────────────────────────────────────────────────────────\n", "\n", "\u001b[36m findmax(A; dims) -> (maxval, index)\u001b[39m\n", "\n", " For an array input, returns the value and index of the maximum over the\n", " given dimensions. \u001b[36mNaN\u001b[39m is treated as greater than all other values.\n", "\n", "\u001b[1m Examples\u001b[22m\n", "\u001b[1m ≡≡≡≡≡≡≡≡≡≡\u001b[22m\n", "\n", "\u001b[36m julia> A = [1.0 2; 3 4]\u001b[39m\n", "\u001b[36m 2×2 Array{Float64,2}:\u001b[39m\n", "\u001b[36m 1.0 2.0\u001b[39m\n", "\u001b[36m 3.0 4.0\u001b[39m\n", "\u001b[36m \u001b[39m\n", "\u001b[36m julia> findmax(A, dims=1)\u001b[39m\n", "\u001b[36m ([3.0 4.0], CartesianIndex{2}[CartesianIndex(2, 1) CartesianIndex(2, 2)])\u001b[39m\n", "\u001b[36m \u001b[39m\n", "\u001b[36m julia> findmax(A, dims=2)\u001b[39m\n", "\u001b[36m ([2.0; 4.0], CartesianIndex{2}[CartesianIndex(1, 2); CartesianIndex(2, 2)])\u001b[39m" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "?findmax" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# testing\n", "score_c = log.(θ_c) * dt_test' \n", "score_c .+= log.(1 .- θ_c) * (1 .- dt_test')\n", "score_c .+= log.(σ_c)\n", "(maxscr, argmax) = findmax(score_c, dims=1);" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "For example, for the 5th test tweet:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "3-element Array{Float64,1}:\n", " -75.75143663488974\n", " -75.18163974055193\n", " -74.01534684364829" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "score_c[:,5]" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "-74.01534684364829" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "maxscr" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "CartesianIndex(3, 5)" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "argmax" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "The following computes the overall accuracy over the test data." ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "0.7137978142076503" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# accuracy\n", "acc = 0\n", "for i in 1:length(test_set)\n", " if label_set[argmax[i]]==labels_test[i]\n", " acc += 1\n", " end\n", "end\n", "acc/(length(labels_test))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "To get a better understanding of the differences uncovered by Naive Bayes between the different labels, we identify words that are particularly common in one label, but on the other. Recall that label 1 corresponds to positive while label 2 corresponds to negative." ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "18-element Array{String,1}:\n", " \"1\"\n", " \"airline\"\n", " \"amazing\"\n", " \"awesome\"\n", " \"best\"\n", " \"crew\"\n", " \"flying\"\n", " \"good\"\n", " \"got\"\n", " \"great\"\n", " \"guys\"\n", " \"love\"\n", " \"much\"\n", " \"thank\"\n", " \"thanks\"\n", " \"today\"\n", " \"very\"\n", " \"virginamerica\"" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t[findall((θ_c[1,:].>0.002).&(θ_c[2,:].<0.002))]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "One notices that many (stemmed) positive words do appear in this list: awesome, best, great, love, thanks." ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "35-element Array{String,1}:\n", " \"3\"\n", " \"about\"\n", " \"after\"\n", " \"am\"\n", " \"bag\"\n", " \"been\"\n", " \"call\"\n", " \"cancelled\"\n", " \"cant\"\n", " \"delayed\"\n", " \"do\"\n", " \"dont\"\n", " \"flightled\"\n", " ⋮\n", " \"need\"\n", " \"no\"\n", " \"one\"\n", " \"phone\"\n", " \"plane\"\n", " \"still\"\n", " \"there\"\n", " \"they\"\n", " \"what\"\n", " \"when\"\n", " \"why\"\n", " \"would\"" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t[findall((θ_c[2,:].>0.002).&(θ_c[1,:].<0.002))]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "This time, we notice: bag, cancelled, cant, delayed, dont, no, phone." ] } ], "metadata": { "kernelspec": { "display_name": "Julia 1.5.1", "language": "julia", "name": "julia-1.5" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "1.5.1" } }, "nbformat": 4, "nbformat_minor": 2 }