Describing and comparing texts

This exercise covers the material from Day 2 on describing texts with quanteda. We will also use the quantedaData package, which contains additional corpora not included in the base quanteda package.

To install the quantedaData package, use:

devtools::install_github("kbenoit/quantedaData")

This requires that you have first installed the devtools package.
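The two installation steps above can be combined so that devtools is installed only when it is missing. The following is a minimal sketch: `needs_install` is a hypothetical helper name introduced here for illustration, and the `interactive()` guard is an assumption added so that nothing is downloaded when the script is sourced non-interactively.

```r
# Hypothetical helper: TRUE when a package is not yet installed (base R only).
needs_install <- function(pkg) {
  !requireNamespace(pkg, quietly = TRUE)
}

# Only attempt installation in an interactive session.
if (interactive()) {
  if (needs_install("devtools")) {
    install.packages("devtools")
  }
  if (needs_install("quantedaData")) {
    devtools::install_github("kbenoit/quantedaData")
  }
}
```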

  1. Preparing and pre-processing texts

    1. “Cleaning” texts

      It is common to “clean” texts before processing, usually by removing punctuation and digits and by converting to lower case.

      “Cleaning” in quanteda takes place through decisions made at the tokenization stage. In order to count word frequencies, we first need to split the text into words through a process known as tokenization. Look at the documentation for quanteda’s tokenize command using the built-in help function (? before any object/command). Use the tokenize command on exampleString (a built-in data object in the quanteda package), and examine the results.
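To make the idea of tokenization concrete, here is a rough base-R sketch of splitting a sentence into word tokens while dropping punctuation. It is not how quanteda's tokenize is implemented: `simple_tokenize` and its arguments `remove_hyphens` and `lower` are hypothetical names used only for illustration, and the sketch handles plain ASCII text only (it would, for example, strip apostrophes and currency symbols that a real tokenizer may treat differently).

```r
# Illustrative only: a naive whitespace tokenizer written in base R.
simple_tokenize <- function(x, remove_hyphens = FALSE, lower = FALSE) {
  # Optionally convert to lower case before splitting
  if (lower) x <- tolower(x)
  # Optionally break hyphenated words apart by turning hyphens into spaces
  if (remove_hyphens) x <- gsub("-", " ", x, fixed = TRUE)
  # Drop every character that is not a letter, digit, whitespace, or hyphen
  x <- gsub("[^[:alnum:][:space:]-]", "", x)
  # Split on runs of whitespace
  strsplit(trimws(x), "[[:space:]]+")[[1]]
}

txt <- "Instead we have a Fine Gael-Labour Party Government, coming into power."

simple_tokenize(txt)                         # keeps "Gael-Labour" as one token
simple_tokenize(txt, remove_hyphens = TRUE)  # splits it into "Gael" and "Labour"
simple_tokenize(txt, lower = TRUE)           # lower-cases before tokenizing
```

Each cleaning decision (punctuation, hyphens, case) changes which strings count as the same token, which in turn changes any word frequencies computed later.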

      Tokenize this text:

      1. with punctuation removed
      require(quanteda, warn.conflicts = FALSE, quietly = TRUE)
      tokenize(exampleString, removePunct = TRUE)
      ## tokenizedText object from 1 document.
      ## Component 1 :
      ##   [1] "Instead"       "we"            "have"          "a"            
      ##   [5] "Fine"          "Gael-Labour"   "Party"         "Government"   
      ##   [9] "coming"        "into"          "power"         "promising"    
      ##  [13] "real"          "change"        "but"           "slavishly"    
      ##  [17] "following"     "the"           "previous"      "Government’s" 
      ##  [21] "policy"        "That"          "policy"        "has"          
      ##  [25] "been"          "dictated"      "by"            "the"          
      ##  [29] "IMF"           "the"           "EU"            "and"          
      ##  [33] "the"           "ECB"           "not"           "to"           
      ##  [37] "bail"          "out"           "the"           "Irish"        
      ##  [41] "people"        "but"           "to"            "salvage"      
      ##  [45] "German"        "French"        "British"       "banks"        
      ##  [49] "and"           "those"         "of"            "other"        
      ##  [53] "countries"     "from"          "their"         "disastrous"   
      ##  [57] "and"           "frenzied"      "embrace"       "of"           
      ##  [61] "Irish"         "bankers"       "and"           "speculators"  
      ##  [65] "in"            "the"           "Irish"         "property"     
      ##  [69] "market"        "bubble"        "It"            "is"           
      ##  [73] "criminal"      "that"          "an"            "Irish"        
      ##  [77] "Government"    "would"         "ever"          "slavishly"    
      ##  [81] "agree"         "to"            "make"          "a"            
      ##  [85] "vassal"        "State"         "of"            "the"          
      ##  [89] "Republic"      "of"            "Ireland"       "and"          
      ##  [93] "its"           "people"        "to"            "squeeze"      
      ##  [97] "tribute"       "from"          "our"           "people"       
      ## [101] "to"            "save"          "the"           "capitalist"   
      ## [105] "banks"         "of"            "Europe"        "and"          
      ## [109] "in"            "doing"         "so"            "destroy"      
      ## [113] "the"           "lives"         "of"            "hundreds"     
      ## [117] "of"            "thousands"     "of"            "our"          
      ## [121] "people"        "now"           "plunged"       "into"         
      ## [125] "unemployment"  "financial"     "hardship"      "and"          
      ## [129] "social"        "dislocation"   "In"            "this"         
      ## [133] "budget"        "and"           "the"           "past"         
      ## [137] "four"          "years"         "the"           "austerity"    
      ## [141] "policy"        "means"         "25"            "billion"      
      ## [145] "has"           "been"          "reefed"        "out"          
      ## [149] "of"            "the"           "Irish"         "economy"      
      ## [153] "in"            "pursuit"       "of"            "a"            
      ## [157] "policy"        "of"            "cringing"      "acceptance"   
      ## [161] "of"            "the"           "diktats"       "of"           
      ## [165] "the"           "financial"     "markets"       "Can"          
      ## [169] "this"          "Government"    "not"           "see"          
      ## [173] "that"          "not"           "only"          "is"           
      ## [177] "this"          "immoral"       "and"           "unjust"       
      ## [181] "in"            "the"           "extreme"       "it"           
      ## [185] "is"            "decimating"    "the"           "domestic"     
      ## [189] "economy"       "As"            "we"            "are"          
      ## [193] "tired"         "of"            "pointing"      "out"          
      ## [197] "if"            "we"            "savage"        "the"          
      ## [201] "ability"       "of"            "the"           "majority"     
      ## [205] "of"            "our"           "people"        "to"           
      ## [209] "purchase"      "goods"         "and"           "utilise"      
      ## [213] "services"      "then"          "tens"          "of"           
      ## [217] "thousands"     "of"            "workers"       "depending"    
      ## [221] "on"            "this"          "demand"        "for"          
      ## [225] "their"         "jobs"          "will"          "be"           
      ## [229] "thrown"        "on"            "the"           "scrapheap"    
      ## [233] "of"            "unemployment"  "and"           "tragically"   
      ## [237] "that"          "is"            "what"          "is"           
      ## [241] "happening"     "All"           "the"           "key"          
      ## [245] "indicators"    "in"            "the"           "domestic"     
      ## [249] "economy"       "show"          "the"           "abject"       
      ## [253] "failure"       "of"            "austerity"     "Private"      
      ## [257] "investment"    "has"           "collapsed"     "VAT"          
      ## [261] "receipts"      "are"           "more"          "than"         
      ## [265] "400"           "million"       "behind"        "and"          
      ## [269] "unemployment"  "has"           "risen"         "by"           
      ## [273] "thousands"     "since"         "this"          "Government"   
      ## [277] "entered"       "office"        "Much"          "is"           
      ## [281] "made"          "of"            "the"           "growth"       
      ## [285] "in"            "exports"       "Every"         "job"          
      ## [289] "in"            "the"           "export"        "sector"       
      ## [293] "is"            "vital"         "and"           "we"           
      ## [297] "defend"        "it"            "but"           "the"          
      ## [301] "type"          "of"            "investment"    "and"          
      ## [305] "the"           "capital"       "intensive"     "nature"       
      ## [309] "of"            "the"           "investment"    "that"         
      ## [313] "goes"          "into"          "exports"       "means"        
      ## [317] "that"          "it"            "is"            "not"          
      ## [321] "where"         "the"           "hundreds"      "of"           
      ## [325] "thousands"     "of"            "jobs"          "we"           
      ## [329] "need"          "will"          "be"            "created"      
      ## [333] "in"            "the"           "next"          "period"       
      ## [337] "of"            "years"         "On"            "the"          
      ## [341] "other"         "hand"          "taking"        "670"          
      ## [345] "million"       "from"          "the"           "pockets"      
      ## [349] "of"            "ordinary"      "people"        "through"      
      ## [353] "the"           "VAT"           "increases"     "and"          
      ## [357] "taking"        "other"         "money"         "in"           
      ## [361] "the"           "cuts"          "in"            "child"        
      ## [365] "benefit"       "and"           "elsewhere"     "will"         
      ## [369] "further"       "add"           "to"            "the"          
      ## [373] "downward"      "spiral"        "of"            "austerity"    
      ## [377] "The"           "interest"      "relief"        "for"          
      ## [381] "householders"  "who"           "are"           "trapped"      
      ## [385] "in"            "the"           "nightmare"     "of"           
      ## [389] "negative"      "equity"        "and"           "extortionate" 
      ## [393] "monthly"       "mortgage"      "payments"      "will"         
      ## [397] "be"            "welcomed"      "to"            "a"            
      ## [401] "degree"        "but"           "is"            "dismally"     
      ## [405] "inadequate"    "That"          "generation"    "of"           
      ## [409] "workers"       "who"           "are"           "trapped"      
      ## [413] "in"            "this"          "nightmare"     "are"          
      ## [417] "victims"       "of"            "the"           "extortion"    
      ## [421] "perpetrated"   "on"            "them"          "in"           
      ## [425] "the"           "housing"       "market"        "by"           
      ## [429] "the"           "immoral"       "speculators"   "legislated"   
      ## [433] "for"           "by"            "Fianna"        "Fáil"         
      ## [437] "and"           "the"           "PDs"           "The"          
      ## [441] "huge"          "proportion"    "of"            "their"        
      ## [445] "incomes"       "that"          "continues"     "to"           
      ## [449] "go"            "to"            "the"           "banks"        
      ## [453] "massively"     "dislocates"    "the"           "economy"      
      ## [457] "Otherwise"     "those"         "funds"         "would"        
      ## [461] "be"            "going"         "to"            "the"          
      ## [465] "purchase"      "of"            "goods"         "and"          
      ## [469] "services"      "in"            "the"           "domestic"     
      ## [473] "economy"       "stimulating"   "demand"        "and"          
      ## [477] "sustaining"    "tens"          "of"            "thousands"    
      ## [481] "of"            "jobs"          "for"           "tens"         
      ## [485] "of"            "thousands"     "of"            "people"       
      ## [489] "who"           "are"           "now"           "on"           
      ## [493] "the"           "dole"          "unfortunately"
      2. with hyphenated words broken up and hyphens removed
      tokenize(exampleString, removeHyphens = TRUE, removePunct = TRUE)
      ## tokenizedText object from 1 document.
      ## Component 1 :
      ##   [1] "Instead"       "we"            "have"          "a"            
      ##   [5] "Fine"          "Gael"          "Labour"        "Party"        
      ##   [9] "Government"    "coming"        "into"          "power"        
      ##  [13] "promising"     "real"          "change"        "but"          
      ##  [17] "slavishly"     "following"     "the"           "previous"     
      ##  [21] "Government’s"  "policy"        "That"          "policy"       
      ##  [25] "has"           "been"          "dictated"      "by"           
      ##  [29] "the"           "IMF"           "the"           "EU"           
      ##  [33] "and"           "the"           "ECB"           "not"          
      ##  [37] "to"            "bail"          "out"           "the"          
      ##  [41] "Irish"         "people"        "but"           "to"           
      ##  [45] "salvage"       "German"        "French"        "British"      
      ##  [49] "banks"         "and"           "those"         "of"           
      ##  [53] "other"         "countries"     "from"          "their"        
      ##  [57] "disastrous"    "and"           "frenzied"      "embrace"      
      ##  [61] "of"            "Irish"         "bankers"       "and"          
      ##  [65] "speculators"   "in"            "the"           "Irish"        
      ##  [69] "property"      "market"        "bubble"        "It"           
      ##  [73] "is"            "criminal"      "that"          "an"           
      ##  [77] "Irish"         "Government"    "would"         "ever"         
      ##  [81] "slavishly"     "agree"         "to"            "make"         
      ##  [85] "a"             "vassal"        "State"         "of"           
      ##  [89] "the"           "Republic"      "of"            "Ireland"      
      ##  [93] "and"           "its"           "people"        "to"           
      ##  [97] "squeeze"       "tribute"       "from"          "our"          
      ## [101] "people"        "to"            "save"          "the"          
      ## [105] "capitalist"    "banks"         "of"            "Europe"       
      ## [109] "and"           "in"            "doing"         "so"           
      ## [113] "destroy"       "the"           "lives"         "of"           
      ## [117] "hundreds"      "of"            "thousands"     "of"           
      ## [121] "our"           "people"        "now"           "plunged"      
      ## [125] "into"          "unemployment"  "financial"     "hardship"     
      ## [129] "and"           "social"        "dislocation"   "In"           
      ## [133] "this"          "budget"        "and"           "the"          
      ## [137] "past"          "four"          "years"         "the"          
      ## [141] "austerity"     "policy"        "means"         "25"           
      ## [145] "billion"       "has"           "been"          "reefed"       
      ## [149] "out"           "of"            "the"           "Irish"        
      ## [153] "economy"       "in"            "pursuit"       "of"           
      ## [157] "a"             "policy"        "of"            "cringing"     
      ## [161] "acceptance"    "of"            "the"           "diktats"      
      ## [165] "of"            "the"           "financial"     "markets"      
      ## [169] "Can"           "this"          "Government"    "not"          
      ## [173] "see"           "that"          "not"           "only"         
      ## [177] "is"            "this"          "immoral"       "and"          
      ## [181] "unjust"        "in"            "the"           "extreme"      
      ## [185] "it"            "is"            "decimating"    "the"          
      ## [189] "domestic"      "economy"       "As"            "we"           
      ## [193] "are"           "tired"         "of"            "pointing"     
      ## [197] "out"           "if"            "we"            "savage"       
      ## [201] "the"           "ability"       "of"            "the"          
      ## [205] "majority"      "of"            "our"           "people"       
      ## [209] "to"            "purchase"      "goods"         "and"          
      ## [213] "utilise"       "services"      "then"          "tens"         
      ## [217] "of"            "thousands"     "of"            "workers"      
      ## [221] "depending"     "on"            "this"          "demand"       
      ## [225] "for"           "their"         "jobs"          "will"         
      ## [229] "be"            "thrown"        "on"            "the"          
      ## [233] "scrapheap"     "of"            "unemployment"  "and"          
      ## [237] "tragically"    "that"          "is"            "what"         
      ## [241] "is"            "happening"     "All"           "the"          
      ## [245] "key"           "indicators"    "in"            "the"          
      ## [249] "domestic"      "economy"       "show"          "the"          
      ## [253] "abject"        "failure"       "of"            "austerity"    
      ## [257] "Private"       "investment"    "has"           "collapsed"    
      ## [261] "VAT"           "receipts"      "are"           "more"         
      ## [265] "than"          "400"           "million"       "behind"       
      ## [269] "and"           "unemployment"  "has"           "risen"        
      ## [273] "by"            "thousands"     "since"         "this"         
      ## [277] "Government"    "entered"       "office"        "Much"         
      ## [281] "is"            "made"          "of"            "the"          
      ## [285] "growth"        "in"            "exports"       "Every"        
      ## [289] "job"           "in"            "the"           "export"       
      ## [293] "sector"        "is"            "vital"         "and"          
      ## [297] "we"            "defend"        "it"            "but"          
      ## [301] "the"           "type"          "of"            "investment"   
      ## [305] "and"           "the"           "capital"       "intensive"    
      ## [309] "nature"        "of"            "the"           "investment"   
      ## [313] "that"          "goes"          "into"          "exports"      
      ## [317] "means"         "that"          "it"            "is"           
      ## [321] "not"           "where"         "the"           "hundreds"     
      ## [325] "of"            "thousands"     "of"            "jobs"         
      ## [329] "we"            "need"          "will"          "be"           
      ## [333] "created"       "in"            "the"           "next"         
      ## [337] "period"        "of"            "years"         "On"           
      ## [341] "the"           "other"         "hand"          "taking"       
      ## [345] "670"           "million"       "from"          "the"          
      ## [349] "pockets"       "of"            "ordinary"      "people"       
      ## [353] "through"       "the"           "VAT"           "increases"    
      ## [357] "and"           "taking"        "other"         "money"        
      ## [361] "in"            "the"           "cuts"          "in"           
      ## [365] "child"         "benefit"       "and"           "elsewhere"    
      ## [369] "will"          "further"       "add"           "to"           
      ## [373] "the"           "downward"      "spiral"        "of"           
      ## [377] "austerity"     "The"           "interest"      "relief"       
      ## [381] "for"           "householders"  "who"           "are"          
      ## [385] "trapped"       "in"            "the"           "nightmare"    
      ## [389] "of"            "negative"      "equity"        "and"          
      ## [393] "extortionate"  "monthly"       "mortgage"      "payments"     
      ## [397] "will"          "be"            "welcomed"      "to"           
      ## [401] "a"             "degree"        "but"           "is"           
      ## [405] "dismally"      "inadequate"    "That"          "generation"   
      ## [409] "of"            "workers"       "who"           "are"          
      ## [413] "trapped"       "in"            "this"          "nightmare"    
      ## [417] "are"           "victims"       "of"            "the"          
      ## [421] "extortion"     "perpetrated"   "on"            "them"         
      ## [425] "in"            "the"           "housing"       "market"       
      ## [429] "by"            "the"           "immoral"       "speculators"  
      ## [433] "legislated"    "for"           "by"            "Fianna"       
      ## [437] "Fáil"          "and"           "the"           "PDs"          
      ## [441] "The"           "huge"          "proportion"    "of"           
      ## [445] "their"         "incomes"       "that"          "continues"    
      ## [449] "to"            "go"            "to"            "the"          
      ## [453] "banks"         "massively"     "dislocates"    "the"          
      ## [457] "economy"       "Otherwise"     "those"         "funds"        
      ## [461] "would"         "be"            "going"         "to"           
      ## [465] "the"           "purchase"      "of"            "goods"        
      ## [469] "and"           "services"      "in"            "the"          
      ## [473] "domestic"      "economy"       "stimulating"   "demand"       
      ## [477] "and"           "sustaining"    "tens"          "of"           
      ## [481] "thousands"     "of"            "jobs"          "for"          
      ## [485] "tens"          "of"            "thousands"     "of"           
      ## [489] "people"        "who"           "are"           "now"          
      ## [493] "on"            "the"           "dole"          "unfortunately"
      3. with punctuation removed and converted to lowercase

      The text must be converted to lower case before being tokenized.

      tokenize(toLower(exampleString), removePunct = TRUE)
      ## tokenizedText object from 1 document.
      ## Component 1 :
      ##   [1] "instead"       "we"            "have"          "a"            
      ##   [5] "fine"          "gael-labour"   "party"         "government"   
      ##   [9] "coming"        "into"          "power"         "promising"    
      ##  [13] "real"          "change"        "but"           "slavishly"    
      ##  [17] "following"     "the"           "previous"      "government’s" 
      ##  [21] "policy"        "that"          "policy"        "has"          
      ##  [25] "been"          "dictated"      "by"            "the"          
      ##  [29] "imf"           "the"           "eu"            "and"          
      ##  [33] "the"           "ecb"           "not"           "to"           
      ##  [37] "bail"          "out"           "the"           "irish"        
      ##  [41] "people"        "but"           "to"            "salvage"      
      ##  [45] "german"        "french"        "british"       "banks"        
      ##  [49] "and"           "those"         "of"            "other"        
      ##  [53] "countries"     "from"          "their"         "disastrous"   
      ##  [57] "and"           "frenzied"      "embrace"       "of"           
      ##  [61] "irish"         "bankers"       "and"           "speculators"  
      ##  [65] "in"            "the"           "irish"         "property"     
      ##  [69] "market"        "bubble"        "it"            "is"           
      ##  [73] "criminal"      "that"          "an"            "irish"        
      ##  [77] "government"    "would"         "ever"          "slavishly"    
      ##  [81] "agree"         "to"            "make"          "a"            
      ##  [85] "vassal"        "state"         "of"            "the"          
      ##  [89] "republic"      "of"            "ireland"       "and"          
      ##  [93] "its"           "people"        "to"            "squeeze"      
      ##  [97] "tribute"       "from"          "our"           "people"       
      ## [101] "to"            "save"          "the"           "capitalist"   
      ## [105] "banks"         "of"            "europe"        "and"          
      ## [109] "in"            "doing"         "so"            "destroy"      
      ## [113] "the"           "lives"         "of"            "hundreds"     
      ## [117] "of"            "thousands"     "of"            "our"          
      ## [121] "people"        "now"           "plunged"       "into"         
      ## [125] "unemployment"  "financial"     "hardship"      "and"          
      ## [129] "social"        "dislocation"   "in"            "this"         
      ## [133] "budget"        "and"           "the"           "past"         
      ## [137] "four"          "years"         "the"           "austerity"    
      ## [141] "policy"        "means"         "25"            "billion"      
      ## [145] "has"           "been"          "reefed"        "out"          
      ## [149] "of"            "the"           "irish"         "economy"      
      ## [153] "in"            "pursuit"       "of"            "a"            
      ## [157] "policy"        "of"            "cringing"      "acceptance"   
      ## [161] "of"            "the"           "diktats"       "of"           
      ## [165] "the"           "financial"     "markets"       "can"          
      ## [169] "this"          "government"    "not"           "see"          
      ## [173] "that"          "not"           "only"          "is"           
      ## [177] "this"          "immoral"       "and"           "unjust"       
      ## [181] "in"            "the"           "extreme"       "it"           
      ## [185] "is"            "decimating"    "the"           "domestic"     
      ## [189] "economy"       "as"            "we"            "are"          
      ## [193] "tired"         "of"            "pointing"      "out"          
      ## [197] "if"            "we"            "savage"        "the"          
      ## [201] "ability"       "of"            "the"           "majority"     
      ## [205] "of"            "our"           "people"        "to"           
      ## [209] "purchase"      "goods"         "and"           "utilise"      
      ## [213] "services"      "then"          "tens"          "of"           
      ## [217] "thousands"     "of"            "workers"       "depending"    
      ## [221] "on"            "this"          "demand"        "for"          
      ## [225] "their"         "jobs"          "will"          "be"           
      ## [229] "thrown"        "on"            "the"           "scrapheap"    
      ## [233] "of"            "unemployment"  "and"           "tragically"   
      ## [237] "that"          "is"            "what"          "is"           
      ## [241] "happening"     "all"           "the"           "key"          
      ## [245] "indicators"    "in"            "the"           "domestic"     
      ## [249] "economy"       "show"          "the"           "abject"       
      ## [253] "failure"       "of"            "austerity"     "private"      
      ## [257] "investment"    "has"           "collapsed"     "vat"          
      ## [261] "receipts"      "are"           "more"          "than"         
      ## [265] "400"           "million"       "behind"        "and"          
      ## [269] "unemployment"  "has"           "risen"         "by"           
      ## [273] "thousands"     "since"         "this"          "government"   
      ## [277] "entered"       "office"        "much"          "is"           
      ## [281] "made"          "of"            "the"           "growth"       
      ## [285] "in"            "exports"       "every"         "job"          
      ## [289] "in"            "the"           "export"        "sector"       
      ## [293] "is"            "vital"         "and"           "we"           
      ## [297] "defend"        "it"            "but"           "the"          
      ## [301] "type"          "of"            "investment"    "and"          
      ## [305] "the"           "capital"       "intensive"     "nature"       
      ## [309] "of"            "the"           "investment"    "that"         
      ## [313] "goes"          "into"          "exports"       "means"        
      ## [317] "that"          "it"            "is"            "not"          
      ## [321] "where"         "the"           "hundreds"      "of"           
      ## [325] "thousands"     "of"            "jobs"          "we"           
      ## [329] "need"          "will"          "be"            "created"      
      ## [333] "in"            "the"           "next"          "period"       
      ## [337] "of"            "years"         "on"            "the"          
      ## [341] "other"         "hand"          "taking"        "670"          
      ## [345] "million"       "from"          "the"           "pockets"      
      ## [349] "of"            "ordinary"      "people"        "through"      
      ## [353] "the"           "vat"           "increases"     "and"          
      ## [357] "taking"        "other"         "money"         "in"           
      ## [361] "the"           "cuts"          "in"            "child"        
      ## [365] "benefit"       "and"           "elsewhere"     "will"         
      ## [369] "further"       "add"           "to"            "the"          
      ## [373] "downward"      "spiral"        "of"            "austerity"    
      ## [377] "the"           "interest"      "relief"        "for"          
      ## [381] "householders"  "who"           "are"           "trapped"      
      ## [385] "in"            "the"           "nightmare"     "of"           
      ## [389] "negative"      "equity"        "and"           "extortionate" 
      ## [393] "monthly"       "mortgage"      "payments"      "will"         
      ## [397] "be"            "welcomed"      "to"            "a"            
      ## [401] "degree"        "but"           "is"            "dismally"     
      ## [405] "inadequate"    "that"          "generation"    "of"           
      ## [409] "workers"       "who"           "are"           "trapped"      
      ## [413] "in"            "this"          "nightmare"     "are"          
      ## [417] "victims"       "of"            "the"           "extortion"    
      ## [421] "perpetrated"   "on"            "them"          "in"           
      ## [425] "the"           "housing"       "market"        "by"           
      ## [429] "the"           "immoral"       "speculators"   "legislated"   
      ## [433] "for"           "by"            "fianna"        "fáil"         
      ## [437] "and"           "the"           "pds"           "the"          
      ## [441] "huge"          "proportion"    "of"            "their"        
      ## [445] "incomes"       "that"          "continues"     "to"           
      ## [449] "go"            "to"            "the"           "banks"        
      ## [453] "massively"     "dislocates"    "the"           "economy"      
      ## [457] "otherwise"     "those"         "funds"         "would"        
      ## [461] "be"            "going"         "to"            "the"          
      ## [465] "purchase"      "of"            "goods"         "and"          
      ## [469] "services"      "in"            "the"           "domestic"     
      ## [473] "economy"       "stimulating"   "demand"        "and"          
      ## [477] "sustaining"    "tens"          "of"            "thousands"    
      ## [481] "of"            "jobs"          "for"           "tens"         
      ## [485] "of"            "thousands"     "of"            "people"       
      ## [489] "who"           "are"           "now"           "on"           
      ## [493] "the"           "dole"          "unfortunately"
      4. segmented into sentences without other cleaning applied
      tokenize(exampleString, what = "sentence")
      ## tokenizedText object from 1 document.
      ## Component 1 :
      ##  [1] "Instead we have a Fine Gael-Labour Party Government, coming into power promising real change but slavishly following the previous Government’s policy."                                                                                                                                                                                                           
      ##  [2] "That policy has been dictated by the IMF, the EU and the ECB not to bail out the Irish people but to salvage German, French, British banks and those of other countries from their disastrous and frenzied embrace of Irish bankers and speculators in the Irish property market bubble."                                                                         
      ##  [3] "It is criminal that an Irish Government would ever slavishly agree to make a vassal State of the Republic of Ireland and its people, to squeeze tribute from our people to save the capitalist banks of Europe and in doing so destroy the lives of hundreds of thousands of our people now plunged into unemployment, financial hardship and social dislocation."
      ##  [4] "In this budget and the past four years the austerity policy means €25 billion has been reefed out of the Irish economy in pursuit of a policy of cringing acceptance of the diktats of the financial markets."                                                                                                                                                    
      ##  [5] "Can this Government not see that, not only is this immoral and unjust in the extreme, it is decimating the domestic economy?"                                                                                                                                                                                                                                     
      ##  [6] "As we are tired of pointing out, if we savage the ability of the majority of our people to purchase goods and utilise services, then tens of thousands of workers depending on this demand for their jobs will be thrown on the scrapheap of unemployment and, tragically, that is what is happening."                                                            
      ##  [7] "All the key indicators in the domestic economy show the abject failure of austerity."                                                                                                                                                                                                                                                                             
      ##  [8] "Private investment has collapsed, VAT receipts are more than €400 million behind and unemployment has risen by thousands since this Government entered office."                                                                                                                                                                                                   
      ##  [9] "Much is made of the growth in exports."                                                                                                                                                                                                                                                                                                                           
      ## [10] "Every job in the export sector is vital and we defend it but the type of investment and the capital intensive nature of the investment that goes into exports means that it is not where the hundreds of thousands of jobs we need will be created in the next period of years."                                                                                  
      ## [11] "On the other hand, taking €670 million from the pockets of ordinary people through the VAT increases and taking other money in the cuts in child benefit and elsewhere will further add to the downward spiral of austerity."                                                                                                                                     
      ## [12] "The interest relief for householders who are trapped in the nightmare of negative equity and extortionate monthly mortgage payments will be welcomed to a degree but is dismally inadequate."                                                                                                                                                                     
      ## [13] "That generation of workers who are trapped in this nightmare are victims of the extortion perpetrated on them in the housing market by the immoral speculators legislated for by Fianna Fáil and the PDs."                                                                                                                                                        
      ## [14] "The huge proportion of their incomes that continues to go to the banks massively dislocates the economy."                                                                                                                                                                                                                                                         
      ## [15] "Otherwise those funds would be going to the purchase of goods and services in the domestic economy, stimulating demand and sustaining tens of thousands of jobs for tens of thousands of people who are now on the dole, unfortunately."
    2. Stemming.

      Stemming removes suffixes using the Porter stemmer, found in the SnowballC package. The quanteda function to invoke the stemmer is wordstem(). Apply stemming to the exampleString and examine the results. Why does it not work, and what do you need to do to make it work? How would you apply this to the sentence-segmented vector?

      It does not work because wordstem() only works on tokenized texts. To apply it to the sentence-segmented vector, you first need to tokenize each sentence, as follows:

      sents <- tokenize(exampleString, what = "sentence")
      # unlist it first
      (tokenizedSents <- tokenize(unlist(sents)))
      ## tokenizedText object from 15 documents.
      ## Component 1 :
      ##  [1] "Instead"      "we"           "have"         "a"           
      ##  [5] "Fine"         "Gael"         "-"            "Labour"      
      ##  [9] "Party"        "Government"   ","            "coming"      
      ## [13] "into"         "power"        "promising"    "real"        
      ## [17] "change"       "but"          "slavishly"    "following"   
      ## [21] "the"          "previous"     "Government’s" "policy"      
      ## [25] "."           
      ## 
      ## Component 2 :
      ##  [1] "That"        "policy"      "has"         "been"        "dictated"   
      ##  [6] "by"          "the"         "IMF"         ","           "the"        
      ## [11] "EU"          "and"         "the"         "ECB"         "not"        
      ## [16] "to"          "bail"        "out"         "the"         "Irish"      
      ## [21] "people"      "but"         "to"          "salvage"     "German"     
      ## [26] ","           "French"      ","           "British"     "banks"      
      ## [31] "and"         "those"       "of"          "other"       "countries"  
      ## [36] "from"        "their"       "disastrous"  "and"         "frenzied"   
      ## [41] "embrace"     "of"          "Irish"       "bankers"     "and"        
      ## [46] "speculators" "in"          "the"         "Irish"       "property"   
      ## [51] "market"      "bubble"      "."          
      ## 
      ## Component 3 :
      ##  [1] "It"           "is"           "criminal"     "that"        
      ##  [5] "an"           "Irish"        "Government"   "would"       
      ##  [9] "ever"         "slavishly"    "agree"        "to"          
      ## [13] "make"         "a"            "vassal"       "State"       
      ## [17] "of"           "the"          "Republic"     "of"          
      ## [21] "Ireland"      "and"          "its"          "people"      
      ## [25] ","            "to"           "squeeze"      "tribute"     
      ## [29] "from"         "our"          "people"       "to"          
      ## [33] "save"         "the"          "capitalist"   "banks"       
      ## [37] "of"           "Europe"       "and"          "in"          
      ## [41] "doing"        "so"           "destroy"      "the"         
      ## [45] "lives"        "of"           "hundreds"     "of"          
      ## [49] "thousands"    "of"           "our"          "people"      
      ## [53] "now"          "plunged"      "into"         "unemployment"
      ## [57] ","            "financial"    "hardship"     "and"         
      ## [61] "social"       "dislocation"  "."           
      ## 
      ## Component 4 :
      ##  [1] "In"         "this"       "budget"     "and"        "the"       
      ##  [6] "past"       "four"       "years"      "the"        "austerity" 
      ## [11] "policy"     "means"      "€"          "25"         "billion"   
      ## [16] "has"        "been"       "reefed"     "out"        "of"        
      ## [21] "the"        "Irish"      "economy"    "in"         "pursuit"   
      ## [26] "of"         "a"          "policy"     "of"         "cringing"  
      ## [31] "acceptance" "of"         "the"        "diktats"    "of"        
      ## [36] "the"        "financial"  "markets"    "."         
      ## 
      ## Component 5 :
      ##  [1] "Can"        "this"       "Government" "not"        "see"       
      ##  [6] "that"       ","          "not"        "only"       "is"        
      ## [11] "this"       "immoral"    "and"        "unjust"     "in"        
      ## [16] "the"        "extreme"    ","          "it"         "is"        
      ## [21] "decimating" "the"        "domestic"   "economy"    "?"         
      ## 
      ## Component 6 :
      ##  [1] "As"           "we"           "are"          "tired"       
      ##  [5] "of"           "pointing"     "out"          ","           
      ##  [9] "if"           "we"           "savage"       "the"         
      ## [13] "ability"      "of"           "the"          "majority"    
      ## [17] "of"           "our"          "people"       "to"          
      ## [21] "purchase"     "goods"        "and"          "utilise"     
      ## [25] "services"     ","            "then"         "tens"        
      ## [29] "of"           "thousands"    "of"           "workers"     
      ## [33] "depending"    "on"           "this"         "demand"      
      ## [37] "for"          "their"        "jobs"         "will"        
      ## [41] "be"           "thrown"       "on"           "the"         
      ## [45] "scrapheap"    "of"           "unemployment" "and"         
      ## [49] ","            "tragically"   ","            "that"        
      ## [53] "is"           "what"         "is"           "happening"   
      ## [57] "."           
      ## 
      ## Component 7 :
      ##  [1] "All"        "the"        "key"        "indicators" "in"        
      ##  [6] "the"        "domestic"   "economy"    "show"       "the"       
      ## [11] "abject"     "failure"    "of"         "austerity"  "."         
      ## 
      ## Component 8 :
      ##  [1] "Private"      "investment"   "has"          "collapsed"   
      ##  [5] ","            "VAT"          "receipts"     "are"         
      ##  [9] "more"         "than"         "€"            "400"         
      ## [13] "million"      "behind"       "and"          "unemployment"
      ## [17] "has"          "risen"        "by"           "thousands"   
      ## [21] "since"        "this"         "Government"   "entered"     
      ## [25] "office"       "."           
      ## 
      ## Component 9 :
      ## [1] "Much"    "is"      "made"    "of"      "the"     "growth"  "in"     
      ## [8] "exports" "."      
      ## 
      ## Component 10 :
      ##  [1] "Every"      "job"        "in"         "the"        "export"    
      ##  [6] "sector"     "is"         "vital"      "and"        "we"        
      ## [11] "defend"     "it"         "but"        "the"        "type"      
      ## [16] "of"         "investment" "and"        "the"        "capital"   
      ## [21] "intensive"  "nature"     "of"         "the"        "investment"
      ## [26] "that"       "goes"       "into"       "exports"    "means"     
      ## [31] "that"       "it"         "is"         "not"        "where"     
      ## [36] "the"        "hundreds"   "of"         "thousands"  "of"        
      ## [41] "jobs"       "we"         "need"       "will"       "be"        
      ## [46] "created"    "in"         "the"        "next"       "period"    
      ## [51] "of"         "years"      "."         
      ## 
      ## Component 11 :
      ##  [1] "On"        "the"       "other"     "hand"      ","        
      ##  [6] "taking"    "€"         "670"       "million"   "from"     
      ## [11] "the"       "pockets"   "of"        "ordinary"  "people"   
      ## [16] "through"   "the"       "VAT"       "increases" "and"      
      ## [21] "taking"    "other"     "money"     "in"        "the"      
      ## [26] "cuts"      "in"        "child"     "benefit"   "and"      
      ## [31] "elsewhere" "will"      "further"   "add"       "to"       
      ## [36] "the"       "downward"  "spiral"    "of"        "austerity"
      ## [41] "."        
      ## 
      ## Component 12 :
      ##  [1] "The"          "interest"     "relief"       "for"         
      ##  [5] "householders" "who"          "are"          "trapped"     
      ##  [9] "in"           "the"          "nightmare"    "of"          
      ## [13] "negative"     "equity"       "and"          "extortionate"
      ## [17] "monthly"      "mortgage"     "payments"     "will"        
      ## [21] "be"           "welcomed"     "to"           "a"           
      ## [25] "degree"       "but"          "is"           "dismally"    
      ## [29] "inadequate"   "."           
      ## 
      ## Component 13 :
      ##  [1] "That"        "generation"  "of"          "workers"     "who"        
      ##  [6] "are"         "trapped"     "in"          "this"        "nightmare"  
      ## [11] "are"         "victims"     "of"          "the"         "extortion"  
      ## [16] "perpetrated" "on"          "them"        "in"          "the"        
      ## [21] "housing"     "market"      "by"          "the"         "immoral"    
      ## [26] "speculators" "legislated"  "for"         "by"          "Fianna"     
      ## [31] "Fáil"        "and"         "the"         "PDs"         "."          
      ## 
      ## Component 14 :
      ##  [1] "The"        "huge"       "proportion" "of"         "their"     
      ##  [6] "incomes"    "that"       "continues"  "to"         "go"        
      ## [11] "to"         "the"        "banks"      "massively"  "dislocates"
      ## [16] "the"        "economy"    "."         
      ## 
      ## Component 15 :
      ##  [1] "Otherwise"     "those"         "funds"         "would"        
      ##  [5] "be"            "going"         "to"            "the"          
      ##  [9] "purchase"      "of"            "goods"         "and"          
      ## [13] "services"      "in"            "the"           "domestic"     
      ## [17] "economy"       ","             "stimulating"   "demand"       
      ## [21] "and"           "sustaining"    "tens"          "of"           
      ## [25] "thousands"     "of"            "jobs"          "for"          
      ## [29] "tens"          "of"            "thousands"     "of"           
      ## [33] "people"        "who"           "are"           "now"          
      ## [37] "on"            "the"           "dole"          ","            
      ## [41] "unfortunately" "."
      wordstem(tokenizedSents)
      ## tokenizedText object from 15 documents.
      ## Component 1 :
      ##  [1] "Instead"     "we"          "have"        "a"           "Fine"       
      ##  [6] "Gael"        "-"           "Labour"      "Parti"       "Govern"     
      ## [11] ","           "come"        "into"        "power"       "promis"     
      ## [16] "real"        "chang"       "but"         "slavishli"   "follow"     
      ## [21] "the"         "previou"     "Government’" "polici"      "."          
      ## 
      ## Component 2 :
      ##  [1] "That"     "polici"   "ha"       "been"     "dictat"   "by"      
      ##  [7] "the"      "IMF"      ","        "the"      "EU"       "and"     
      ## [13] "the"      "ECB"      "not"      "to"       "bail"     "out"     
      ## [19] "the"      "Irish"    "peopl"    "but"      "to"       "salvag"  
      ## [25] "German"   ","        "French"   ","        "British"  "bank"    
      ## [31] "and"      "those"    "of"       "other"    "countri"  "from"    
      ## [37] "their"    "disastr"  "and"      "frenzi"   "embrac"   "of"      
      ## [43] "Irish"    "banker"   "and"      "specul"   "in"       "the"     
      ## [49] "Irish"    "properti" "market"   "bubbl"    "."       
      ## 
      ## Component 3 :
      ##  [1] "It"         "i"          "crimin"     "that"       "an"        
      ##  [6] "Irish"      "Govern"     "would"      "ever"       "slavishli" 
      ## [11] "agre"       "to"         "make"       "a"          "vassal"    
      ## [16] "State"      "of"         "the"        "Republ"     "of"        
      ## [21] "Ireland"    "and"        "it"         "peopl"      ","         
      ## [26] "to"         "squeez"     "tribut"     "from"       "our"       
      ## [31] "peopl"      "to"         "save"       "the"        "capitalist"
      ## [36] "bank"       "of"         "Europ"      "and"        "in"        
      ## [41] "do"         "so"         "destroi"    "the"        "live"      
      ## [46] "of"         "hundr"      "of"         "thousand"   "of"        
      ## [51] "our"        "peopl"      "now"        "plung"      "into"      
      ## [56] "unemploy"   ","          "financi"    "hardship"   "and"       
      ## [61] "social"     "disloc"     "."         
      ## 
      ## Component 4 :
      ##  [1] "In"      "thi"     "budget"  "and"     "the"     "past"    "four"   
      ##  [8] "year"    "the"     "auster"  "polici"  "mean"    "€"       "25"     
      ## [15] "billion" "ha"      "been"    "reef"    "out"     "of"      "the"    
      ## [22] "Irish"   "economi" "in"      "pursuit" "of"      "a"       "polici" 
      ## [29] "of"      "cring"   "accept"  "of"      "the"     "diktat"  "of"     
      ## [36] "the"     "financi" "market"  "."      
      ## 
      ## Component 5 :
      ##  [1] "Can"     "thi"     "Govern"  "not"     "see"     "that"    ","      
      ##  [8] "not"     "onli"    "i"       "thi"     "immor"   "and"     "unjust" 
      ## [15] "in"      "the"     "extrem"  ","       "it"      "i"       "decim"  
      ## [22] "the"     "domest"  "economi" "?"      
      ## 
      ## Component 6 :
      ##  [1] "A"         "we"        "ar"        "tire"      "of"       
      ##  [6] "point"     "out"       ","         "if"        "we"       
      ## [11] "savag"     "the"       "abil"      "of"        "the"      
      ## [16] "major"     "of"        "our"       "peopl"     "to"       
      ## [21] "purchas"   "good"      "and"       "utilis"    "servic"   
      ## [26] ","         "then"      "ten"       "of"        "thousand" 
      ## [31] "of"        "worker"    "depend"    "on"        "thi"      
      ## [36] "demand"    "for"       "their"     "job"       "will"     
      ## [41] "be"        "thrown"    "on"        "the"       "scrapheap"
      ## [46] "of"        "unemploy"  "and"       ","         "tragic"   
      ## [51] ","         "that"      "i"         "what"      "i"        
      ## [56] "happen"    "."        
      ## 
      ## Component 7 :
      ##  [1] "All"     "the"     "kei"     "indic"   "in"      "the"     "domest" 
      ##  [8] "economi" "show"    "the"     "abject"  "failur"  "of"      "auster" 
      ## [15] "."      
      ## 
      ## Component 8 :
      ##  [1] "Privat"   "invest"   "ha"       "collaps"  ","        "VAT"     
      ##  [7] "receipt"  "ar"       "more"     "than"     "€"        "400"     
      ## [13] "million"  "behind"   "and"      "unemploy" "ha"       "risen"   
      ## [19] "by"       "thousand" "sinc"     "thi"      "Govern"   "enter"   
      ## [25] "offic"    "."       
      ## 
      ## Component 9 :
      ## [1] "Much"   "i"      "made"   "of"     "the"    "growth" "in"     "export"
      ## [9] "."     
      ## 
      ## Component 10 :
      ##  [1] "Everi"    "job"      "in"       "the"      "export"   "sector"  
      ##  [7] "i"        "vital"    "and"      "we"       "defend"   "it"      
      ## [13] "but"      "the"      "type"     "of"       "invest"   "and"     
      ## [19] "the"      "capit"    "intens"   "natur"    "of"       "the"     
      ## [25] "invest"   "that"     "goe"      "into"     "export"   "mean"    
      ## [31] "that"     "it"       "i"        "not"      "where"    "the"     
      ## [37] "hundr"    "of"       "thousand" "of"       "job"      "we"      
      ## [43] "need"     "will"     "be"       "creat"    "in"       "the"     
      ## [49] "next"     "period"   "of"       "year"     "."       
      ## 
      ## Component 11 :
      ##  [1] "On"       "the"      "other"    "hand"     ","        "take"    
      ##  [7] "€"        "670"      "million"  "from"     "the"      "pocket"  
      ## [13] "of"       "ordinari" "peopl"    "through"  "the"      "VAT"     
      ## [19] "increas"  "and"      "take"     "other"    "monei"    "in"      
      ## [25] "the"      "cut"      "in"       "child"    "benefit"  "and"     
      ## [31] "elsewher" "will"     "further"  "add"      "to"       "the"     
      ## [37] "downward" "spiral"   "of"       "auster"   "."       
      ## 
      ## Component 12 :
      ##  [1] "The"       "interest"  "relief"    "for"       "household"
      ##  [6] "who"       "ar"        "trap"      "in"        "the"      
      ## [11] "nightmar"  "of"        "neg"       "equiti"    "and"      
      ## [16] "extortion" "monthli"   "mortgag"   "payment"   "will"     
      ## [21] "be"        "welcom"    "to"        "a"         "degre"    
      ## [26] "but"       "i"         "dismal"    "inadequ"   "."        
      ## 
      ## Component 13 :
      ##  [1] "That"     "gener"    "of"       "worker"   "who"      "ar"      
      ##  [7] "trap"     "in"       "thi"      "nightmar" "ar"       "victim"  
      ## [13] "of"       "the"      "extort"   "perpetr"  "on"       "them"    
      ## [19] "in"       "the"      "hous"     "market"   "by"       "the"     
      ## [25] "immor"    "specul"   "legisl"   "for"      "by"       "Fianna"  
      ## [31] "Fáil"     "and"      "the"      "PD"       "."       
      ## 
      ## Component 14 :
      ##  [1] "The"     "huge"    "proport" "of"      "their"   "incom"   "that"   
      ##  [8] "continu" "to"      "go"      "to"      "the"     "bank"    "massiv" 
      ## [15] "disloc"  "the"     "economi" "."      
      ## 
      ## Component 15 :
      ##  [1] "Otherwis" "those"    "fund"     "would"    "be"       "go"      
      ##  [7] "to"       "the"      "purchas"  "of"       "good"     "and"     
      ## [13] "servic"   "in"       "the"      "domest"   "economi"  ","       
      ## [19] "stimul"   "demand"   "and"      "sustain"  "ten"      "of"      
      ## [25] "thousand" "of"       "job"      "for"      "ten"      "of"      
      ## [31] "thousand" "of"       "peopl"    "who"      "ar"       "now"     
      ## [37] "on"       "the"      "dole"     ","        "unfortun" "."
    3. Applying pre-processing to the creation of a dfm.

      quanteda’s dfm() function by default applies certain “cleaning” steps to the text, which are not the defaults in tokenize(). Create a dfm from exampleString. What are the differences between the steps applied by dfm() and the default settings for tokenize()?

      Compare the steps required in a similar text preparation package, tm:

      require(tm)
      data(crude)
      crude <- tm_map(crude, content_transformer(tolower))
      crude <- tm_map(crude, removePunctuation)
      crude <- tm_map(crude, removeNumbers)
      crude <- tm_map(crude, stemDocument)
      (dtm <- DocumentTermMatrix(crude))
      
      # same in quanteda
      require(quanteda)
      crudeCorpus <- corpus(crude)
      (crudeDfm <- dfm(crudeCorpus))

      Inspect the dimensions of the resulting objects, including the names of the words extracted as features. It is also worth comparing the structure of the document-feature matrices returned by each package. tm uses the slam simple triplet matrix format for representing a sparse matrix.

      It is also – in fact almost always – useful to inspect the structure of these objects:

      str(dtm)
      str(crudeDfm)

      This indicates that we can extract the names of the words from the tm DocumentTermMatrix object (dtm) using the Terms() function:

      head(Terms(dtm), 20)

      Compare this to the results of the same operations from quanteda. To get the “words” from a quanteda object, you can use the features() function:

      head(features(crudeDfm), 20)

      Why are the first 20 terms listed different between the two packages?

      Because the default term ordering for a tm DocumentTermMatrix is alphabetical, whereas a quanteda dfm keeps the features in their original order of occurrence.

      What proportion of the crudeDfm are zeros? Compare the sizes of dtm and crudeDfm using the object.size() function.
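
      The two orderings can be illustrated with a minimal base R sketch. The token vector below is made up for illustration and is not taken from the crude corpus; the sketch only mimics the ordering behaviour and does not use either package.

      ```r
      # Toy tokens (hypothetical, not from the crude corpus)
      toks <- c("oil", "prices", "fell", "as", "oil", "demand", "fell")

      # Feature types in order of first occurrence
      firstSeen <- unique(toks)

      # The same feature types sorted alphabetically
      alphabetical <- sort(unique(toks))

      firstSeen
      alphabetical
      ```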

      sum(crudeDfm == 0) / prod(dim(crudeDfm))
      object.size(dtm)
      object.size(crudeDfm)
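
      To see what the sparsity calculation does, here is a small sketch on an ordinary dense base R matrix. The counts and dimension names are invented for illustration; a real dfm is stored in a sparse format, so its object.size() will behave differently from this toy matrix.

      ```r
      # A hypothetical 3-document by 4-feature count matrix
      m <- matrix(c(2, 0, 0, 1,
                    0, 0, 3, 0,
                    0, 1, 0, 0),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(docs = paste0("doc", 1:3),
                                  features = c("oil", "price", "opec", "barrel")))

      # Proportion of cells equal to zero: number of zeros over number of cells
      propZero <- sum(m == 0) / prod(dim(m))
      propZero

      # Equivalent shortcut: the mean of a logical matrix is the share of TRUEs
      mean(m == 0)
      ```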

      Now we will detach the tm package to remove it from our search path.

      search()
      detach("package:tm")
      search()
  2. Key-words-in-context

    1. quanteda provides a keyword-in-context function that is easily usable and configurable to explore texts in a descriptive way. Type ?kwic to view the documentation.
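
      As a rough idea of what a keyword-in-context search does, the following base R sketch finds exact matches of a keyword in a token vector and collects a window of tokens on either side. The function toyKwic() and its arguments are hypothetical names for illustration; this is not how quanteda implements kwic().

      ```r
      # Hypothetical helper: exact-match keyword-in-context on a token vector
      toyKwic <- function(tokens, keyword, window = 2) {
        hits <- which(tokens == keyword)
        # Collect the tokens at positions from..to, dropping out-of-range positions
        context <- function(from, to) {
          idx <- seq(from, to)
          idx <- idx[idx >= 1 & idx <= length(tokens)]
          paste(tokens[idx], collapse = " ")
        }
        data.frame(
          position    = hits,
          contextPre  = vapply(hits, function(i) context(i - window, i - 1), character(1)),
          keyword     = rep(keyword, length(hits)),
          contextPost = vapply(hits, function(i) context(i + 1, i + window), character(1)),
          stringsAsFactors = FALSE
        )
      }

      toyTokens <- c("the", "Irish", "people", "and", "the", "Irish", "economy")
      toyResult <- toyKwic(toyTokens, "Irish", window = 2)
      toyResult
      ```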

    2. Using the ie2010Corpus object, examine the context for the word “Irish”. What is its predominant usage?

      kwic(ie2010Corpus, "Irish")

      A mixture of references to the Irish people, the Irish Economy, and Anglo Irish Bank.

    3. Use the kwic function to discover the context of the word “clean”. Is this associated with environmental policy?

      kwic(ie2010Corpus, "clean")

      Only one of the four usages is associated with environmental policy; the other usages refer to a metaphorical cleanup.

    4. Examine the context of words related to “disaster”. Hint: you can use the stem of the word with a wildcard, or set the valuetype argument to "regex". Execute a query using a pattern match that returns different variations of words based on “disaster” (such as disasters, disastrous, disastrously, etc.).

      kwic(ie2010Corpus, "disast*")
      kwic(ie2010Corpus, "disast", valuetype = "regex")  # alternative usage
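
      The difference between the two pattern types can be sketched in base R, as a loose analogy only. The word list is invented, and the assumption here is that a glob pattern must match a whole token while a regular expression may match anywhere inside it.

      ```r
      # Hypothetical word list for illustration
      words <- c("disaster", "disastrous", "disastrously", "predisaster", "clean")

      # glob2rx() (from utils) translates a glob pattern into an anchored regex
      globPattern <- glob2rx("disast*")
      globPattern

      # Glob-style match: the token must start with "disast"
      globHits <- words[grepl(globPattern, words)]

      # Unanchored regex match: "disast" may occur anywhere in the token
      regexHits <- words[grepl("disast", words)]

      globHits
      regexHits
      ```

      In this sketch the unanchored regular expression also picks up "predisaster", which the glob pattern does not.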
    5. Load the text of Herman Melville’s Moby Dick and assign a KWIC search for “Ahab” to an object named kwicAhab.
      Examine the structure of this object. Now plot it and describe what this plot represents. Note: You might want to resize this so that the height is just small enough to see the lines.

      To access Moby Dick, use the syntax below. (This file is distributed with quanteda but is a bit tricky to access since it is compressed “raw” text data rather than an R formatted object.)

      require(quanteda, warn.conflicts = FALSE, quietly = TRUE)
      mobydicktf <- textfile(unzip(system.file("extdata", "pg2701.txt.zip", package = "quanteda")))
      mobydickCorpus <- corpus(mobydicktf)
      
      kwicAhab <- kwic(mobydickCorpus, "Ahab")
      str(kwicAhab)  # a specially classed data.frame
      ## Classes 'kwic' and 'data.frame': 436 obs. of  5 variables:
      ##  $ docname    : Factor w/ 1 level "text1": 1 1 1 1 1 1 1 1 1 1 ...
      ##  $ position   : num  36712 36719 36734 36862 40814 ...
      ##  $ contextPre : chr  "                 ye clapped eye on Captain" "                         \"\" Who is Captain" "                     I thought so. Captain" "                     . Clap eye on Captain" ...
      ##  $ keyword    : chr  "Ahab" "Ahab" "Ahab" "Ahab" ...
      ##  $ contextPost: chr  "?\"\" Who is                                  " ", sir?\"\"                                    " "is the Captain of this                      " ", young man, and                            " ...
      ##  - attr(*, "valuetype")= chr "glob"
      ##  - attr(*, "ntoken")= Named int 264074
      ##   ..- attr(*, "names")= chr "text1"
      ##  - attr(*, "keywords")= chr "Ahab"
      ##  - attr(*, "tokenize_opts")= list()
      pdf("kwicAhab.pdf", width = 8, height = 2)
      plot(kwicAhab)
      dev.off()
      ## quartz_off_screen 
      ##                 2
      if (Sys.info()["sysname"] == "Darwin") system("open kwicAhab.pdf")
  3. Descriptive statistics

    1. We can extract basic descriptive statistics for a corpus from its document-feature matrix. Make a dfm from the 2010 Irish budget speeches corpus.

      ieDfm <- dfm(ie2010Corpus)
      ## Creating a dfm from a corpus ...
      ##    ... lowercasing
      ##    ... tokenizing
      ##    ... indexing documents: 14 documents
      ##    ... indexing features: 4,881 feature types
      ##    ... created a 14 x 4881 sparse dfm
      ##    ... complete. 
      ## Elapsed time: 0.109 seconds.
    2. Examine the most frequent word features using topfeatures(). What are the five most frequent words in the corpus?

      topfeatures(ieDfm, 5)
      ##  the   to   of  and   in 
      ## 3598 1633 1537 1359 1231
    3. Compute the average syllable length (syllables per token) for each text in the ie2010Corpus object. To do this, you will use two vectorized functions: ntoken() and syllables(). But be careful, since one of these two functions works fine for a corpus class object, but the other does not. The help pages for each function will tell you which works with what sort of object. (Hint: see also texts().)

      meanSyllableLength <- syllables(texts(ie2010Corpus)) / ntoken(ie2010Corpus)
      meanSyllableLength
      ##       2010_BUDGET_01_Brian_Lenihan_FF 
      ##                              1.472804 
      ##      2010_BUDGET_02_Richard_Bruton_FG 
      ##                              1.322912 
      ##        2010_BUDGET_03_Joan_Burton_LAB 
      ##                              1.352621 
      ##       2010_BUDGET_04_Arthur_Morgan_SF 
      ##                              1.385943 
      ##         2010_BUDGET_05_Brian_Cowen_FF 
      ##                              1.456772 
      ##          2010_BUDGET_06_Enda_Kenny_FG 
      ##                              1.378232 
      ##     2010_BUDGET_07_Kieran_ODonnell_FG 
      ##                              1.300563 
      ##      2010_BUDGET_08_Eamon_Gilmore_LAB 
      ##                              1.373963 
      ##    2010_BUDGET_09_Michael_Higgins_LAB 
      ##                              1.351708 
      ##       2010_BUDGET_10_Ruairi_Quinn_LAB 
      ##                              1.340310 
      ##     2010_BUDGET_11_John_Gormley_Green 
      ##                              1.388031 
      ##       2010_BUDGET_12_Eamon_Ryan_Green 
      ##                              1.326469 
      ##     2010_BUDGET_13_Ciaran_Cuffe_Green 
      ##                              1.386218 
      ## 2010_BUDGET_14_Caoimhghin_OCaolain_SF 
      ##                              1.355398
  4. Lexical diversity and reading difficulty

    1. Compare the post-1960 inaugural speeches grouped by president, in terms of their lexical diversity, using the CTTR measure. Plot these in a dotchart.

      presDfm <- dfm(subset(inaugCorpus, Year > 1960))
      ## Creating a dfm from a corpus ...
      ##    ... lowercasing
      ##    ... tokenizing
      ##    ... indexing documents: 14 documents
      ##    ... indexing features: 3,690 feature types
      ##    ... created a 14 x 3690 sparse dfm
      ##    ... complete. 
      ## Elapsed time: 0.057 seconds.
      (presCTTR <- lexdiv(presDfm, "CTTR"))
      ## 1961-Kennedy 1965-Johnson   1969-Nixon   1973-Nixon  1977-Carter 
      ##    10.204812     9.648534    10.862780     8.447654     9.915697 
      ##  1981-Reagan  1985-Reagan    1989-Bush 1993-Clinton 1997-Clinton 
      ##    12.075486    11.993347    10.992870    10.613237    10.946841 
      ##    2001-Bush    2005-Bush   2009-Obama   2013-Obama 
      ##    10.379045    11.265046    12.916282    11.996813
      dotchart(presCTTR, pch = 16, xlab = "CTTR Lexical Diversity")
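
      For reference, Carroll's corrected type-token ratio is the number of distinct types V divided by the square root of twice the number of tokens N. A minimal base R sketch of that formula follows; the cttr() helper and the toy tokens are hypothetical, and quanteda's own tokenization choices will affect the exact values it reports.

      ```r
      # Hypothetical helper: CTTR = V / sqrt(2 * N)
      cttr <- function(tokens) {
        tokens <- tolower(tokens)
        V <- length(unique(tokens))  # number of distinct types
        N <- length(tokens)          # total number of tokens
        V / sqrt(2 * N)
      }

      toyTokens <- c("the", "cat", "sat", "on", "the", "mat", "the", "end")
      toyCTTR <- cttr(toyTokens)     # 6 types, 8 tokens
      toyCTTR
      ```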

    2. Compare the post-1960 inaugural speeches (not grouped) in terms of their readability, on both the Flesch-Kincaid and FOG indexes. Plot these values by year, connecting them using the type = "b" option to the base package’s plot() method. (You are welcome to use ggplot2 if you prefer that graphics package.)

      presRblty <- readability(subset(inaugCorpus, Year > 1960), c("Flesch.Kincaid", "FOG"))
      plot(docvars(subset(inaugCorpus, Year > 1960), "Year"),
           presRblty$Flesch.Kincaid, type = "b", pch = 16, ylim = c(5, 15),
           xlab = "", ylab = "Readability Index")
      points(docvars(subset(inaugCorpus, Year > 1960), "Year"),
           presRblty$FOG, type = "b", pch = 15, lty = "dashed")
      text(1968, 14, "FOG")
      text(1973, 8, "Flesch-Kincaid")
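
      Both indexes are simple functions of a text's word, sentence, and syllable counts. The sketch below writes out the standard published formulas in base R; the helper names and the counts are made up for illustration, and readability() may count syllables and sentences differently.

```r
# Hypothetical helpers; inputs are raw counts for a single text
flesch_kincaid <- function(n_words, n_sentences, n_syllables) {
  0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59
}

gunning_fog <- function(n_words, n_sentences, n_complex) {
  # n_complex: number of words with three or more syllables
  0.4 * ((n_words / n_sentences) + 100 * (n_complex / n_words))
}

# Made-up counts for illustration only
flesch_kincaid(n_words = 100, n_sentences = 5, n_syllables = 150)  # 9.91
gunning_fog(n_words = 100, n_sentences = 5, n_complex = 10)        # 12
```

      Both are driven by average sentence length, which is one reason the two lines in the plot tend to move together.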

    3. Extra credit: What is/are the word(s) with the highest Scrabble value spoken in Obama’s 2013 inaugural speech? (Note: the question originally, and erroneously, asked about his non-existent 2015 inaugural speech!)

      toks <- tokenize(inaugCorpus["2013-Obama"], removePunct = TRUE, simplify = TRUE)
      scrabbleWords <- scrabble(toks)
      names(scrabbleWords) <- toks
      head(sort(scrabbleWords, decreasing = TRUE))
      ## self-executing   inextricably   relinquished   marginalized   overwhelming 
      ##             26             26             25             25             24 
      ##   Philadelphia 
      ##             23
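
      To see where these scores come from, here is a minimal base-R sketch that sums standard English Scrabble letter values. The helper scrabble_score is hypothetical, and the sketch assumes that scrabble() uses the same letter values and ignores hyphens.

```r
# Standard English Scrabble letter values
letter_values <- c(a = 1, b = 3, c = 3, d = 2, e = 1, f = 4, g = 2, h = 4,
                   i = 1, j = 8, k = 5, l = 1, m = 3, n = 1, o = 1, p = 3,
                   q = 10, r = 1, s = 1, t = 1, u = 1, v = 4, w = 4, x = 8,
                   y = 4, z = 10)

# Hypothetical helper: sum the letter values of each word, ignoring case
# and any character that is not a-z (such as a hyphen)
scrabble_score <- function(words) {
  vapply(strsplit(tolower(words), ""), function(chars) {
    sum(letter_values[chars], na.rm = TRUE)
  }, numeric(1))
}

scrabble_score(c("relinquished", "marginalized", "Philadelphia"))
# 25 25 23, matching the values shown above
```
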
  5. Document and word associations

    1. Load the presidential State of the Union (SOTU) corpus. Select just the speeches since 1980, and create a dfm from this corpus, after removing quanteda’s list of built-in English stop words and stemming the terms.

      To load in the SOTU corpus, make sure you have installed the package quantedaData, and use this command:

      data(SOTUCorpus, package = "quantedaData")

      For selection, see ?subset.corpus. In R’s “S3” object-oriented system, the .corpus means that this specific method will dispatch when supplied a corpus class object as its first argument. When calling this method for subset, you do not actually need to type subset.corpus(), just subset(). (But you may also type the full name if you wish.) Many methods in R have multiple definitions for different object classes, and understanding this and how it works will serve you well as you learn more R. quanteda is very object-oriented, and this is why many of the same methods can be applied to both corpus and dfm class objects.

      This also explains why you get certain warning messages when you attach the quanteda package, e.g.

      detach("package:quanteda")
      require(quanteda)
      ## Loading required package: quanteda
      ## 
      ## Attaching package: 'quanteda'
      ## The following object is masked from 'package:stats':
      ## 
      ##     df
      ## The following object is masked from 'package:base':
      ## 
      ##     sample

      Here, the object df() (a function) from the stats package – one of the standard packages that are always attached when you start R – has been superseded in priority in R’s “namespace” by another object (also a function) called df() from the quanteda package. Compare the two using ?df, where you should see two versions listed.

      Compute the density for an F distribution for a value of 5, with df1 = 1 and df2 = 4. How do you need to address this function to call it?

      stats::df(5, 1, 4)
      ## [1] 0.02208462

      Now compute the document frequency of the texts in your selected SOTUCorpus-based corpus. How can you address that function?

      head(docfreq(presDfm))       # preferred!
      ##      vice president   johnson        mr   speaker     chief 
      ##         9        13         2         7         5         8
      head(quanteda::df(presDfm))  # safest with df()
      ##      vice president   johnson        mr   speaker     chief 
      ##         9        13         2         7         5         8
      head(df(presDfm))            # works if quanteda attached after stats, which it almost always will be
      ##      vice president   johnson        mr   speaker     chief 
      ##         9        13         2         7         5         8

      Extra credit: Can you describe what the warning messages about the sample object mean?

      **The message about df() means that this new function definition has been placed higher in priority on the search path than the identically named function from the stats package. Calls to df(anydfm) will call the quanteda version rather than the stats version.

      The message for base::sample is similar, and concerns the sample() function from the base package. In this case, however, quanteda has simply redefined sample.default() to call base::sample(), so that calling sample() on any object other than a corpus or a dfm (see methods(sample)) will invoke base::sample(). (You would know this by examining the source code in corpus.R and dfm-methods.R.)**
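
      The pattern of masking a base function with an S3 generic whose default method hands back to the original can be sketched in a few lines. The class toycorpus and its method are hypothetical, invented for illustration; this is not quanteda's actual source code.

```r
# A generic named sample() that masks base::sample() on the search path
sample <- function(x, ...) UseMethod("sample")

# Default method: any object without its own method falls through to base
sample.default <- function(x, ...) base::sample(x, ...)

# Method for a hypothetical "toycorpus" class: sample whole documents
sample.toycorpus <- function(x, size = length(x$texts), ...) {
  x$texts <- x$texts[base::sample(length(x$texts), size, ...)]
  x
}

toy <- structure(list(texts = c(d1 = "a b", d2 = "c d", d3 = "e f")),
                 class = "toycorpus")

sample(1:10, 3)        # dispatches to sample.default(), i.e. base::sample()
sample(toy, size = 2)  # dispatches to sample.toycorpus()
```

      Running rm(sample) afterwards removes the masking definition from the workspace, so that sample() again resolves to whatever comes next on the search path.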

    2. Measure the document similarities using similarity(). Compare the results for the correlation and the cosine methods, with and without first converting the dfm to relative frequencies (also known as “normalization”).

      similarity(presDfm, method = "correlation", margin = "document", n = 5)
      ## similarity Matrix:
      ## $`1961-Kennedy`
      ##   1969-Nixon 1997-Clinton  1981-Reagan  1985-Reagan   2009-Obama 
      ##       0.9218       0.9189       0.9135       0.9135       0.9095 
      ## 
      ## $`1965-Johnson`
      ##  1985-Reagan  1981-Reagan   2009-Obama    1989-Bush 1997-Clinton 
      ##       0.9395       0.9377       0.9337       0.9266       0.9214 
      ## 
      ## $`1969-Nixon`
      ## 1961-Kennedy  1981-Reagan   1973-Nixon  1985-Reagan   2009-Obama 
      ##       0.9218       0.9211       0.9206       0.9100       0.8993 
      ## 
      ## $`1973-Nixon`
      ##   1969-Nixon  1985-Reagan  1981-Reagan 1961-Kennedy  1977-Carter 
      ##       0.9206       0.9045       0.9032       0.8979       0.8867 
      ## 
      ## $`1977-Carter`
      ##  2013-Obama 1985-Reagan  2009-Obama 1981-Reagan   1989-Bush 
      ##      0.9244      0.9221      0.9184      0.9168      0.9154 
      ## 
      ## $`1981-Reagan`
      ##  1985-Reagan   2009-Obama 1965-Johnson    1989-Bush   2013-Obama 
      ##       0.9584       0.9495       0.9377       0.9375       0.9342 
      ## 
      ## $`1985-Reagan`
      ##  1981-Reagan   2009-Obama 1997-Clinton    1989-Bush 1965-Johnson 
      ##       0.9584       0.9526       0.9485       0.9412       0.9395 
      ## 
      ## $`1989-Bush`
      ##  1985-Reagan   2009-Obama  1981-Reagan 1965-Johnson  1977-Carter 
      ##       0.9412       0.9397       0.9375       0.9266       0.9154 
      ## 
      ## $`1993-Clinton`
      ##   2009-Obama  1985-Reagan   2013-Obama  1981-Reagan 1997-Clinton 
      ##       0.9320       0.9306       0.9295       0.9198       0.9131 
      ## 
      ## $`1997-Clinton`
      ##  1985-Reagan   2009-Obama  1981-Reagan   2013-Obama 1965-Johnson 
      ##       0.9485       0.9443       0.9316       0.9260       0.9214 
      ## 
      ## $`2001-Bush`
      ##  2013-Obama 1981-Reagan 1985-Reagan  2009-Obama   1989-Bush 
      ##      0.9170      0.9170      0.9169      0.9167      0.9128 
      ## 
      ## $`2005-Bush`
      ##  1985-Reagan   2009-Obama 1997-Clinton  1981-Reagan 1965-Johnson 
      ##       0.9237       0.9194       0.9186       0.9150       0.9124 
      ## 
      ## $`2009-Obama`
      ##   2013-Obama  1985-Reagan  1981-Reagan 1997-Clinton    1989-Bush 
      ##       0.9596       0.9526       0.9495       0.9443       0.9397 
      ## 
      ## $`2013-Obama`
      ##   2009-Obama  1985-Reagan  1981-Reagan 1993-Clinton 1997-Clinton 
      ##       0.9596       0.9371       0.9342       0.9295       0.9260
      similarity(tf(presDfm, "prop"), method = "correlation", margin = "document", n = 5)
      ## similarity Matrix:
      ## $`1961-Kennedy`
      ##   1969-Nixon 1997-Clinton  1981-Reagan  1985-Reagan   2009-Obama 
      ##       0.9218       0.9189       0.9135       0.9135       0.9095 
      ## 
      ## $`1965-Johnson`
      ##  1985-Reagan  1981-Reagan   2009-Obama    1989-Bush 1997-Clinton 
      ##       0.9395       0.9377       0.9337       0.9266       0.9214 
      ## 
      ## $`1969-Nixon`
      ## 1961-Kennedy  1981-Reagan   1973-Nixon  1985-Reagan   2009-Obama 
      ##       0.9218       0.9211       0.9206       0.9100       0.8993 
      ## 
      ## $`1973-Nixon`
      ##   1969-Nixon  1985-Reagan  1981-Reagan 1961-Kennedy  1977-Carter 
      ##       0.9206       0.9045       0.9032       0.8979       0.8867 
      ## 
      ## $`1977-Carter`
      ##  2013-Obama 1985-Reagan  2009-Obama 1981-Reagan   1989-Bush 
      ##      0.9244      0.9221      0.9184      0.9168      0.9154 
      ## 
      ## $`1981-Reagan`
      ##  1985-Reagan   2009-Obama 1965-Johnson    1989-Bush   2013-Obama 
      ##       0.9584       0.9495       0.9377       0.9375       0.9342 
      ## 
      ## $`1985-Reagan`
      ##  1981-Reagan   2009-Obama 1997-Clinton    1989-Bush 1965-Johnson 
      ##       0.9584       0.9526       0.9485       0.9412       0.9395 
      ## 
      ## $`1989-Bush`
      ##  1985-Reagan   2009-Obama  1981-Reagan 1965-Johnson  1977-Carter 
      ##       0.9412       0.9397       0.9375       0.9266       0.9154 
      ## 
      ## $`1993-Clinton`
      ##   2009-Obama  1985-Reagan   2013-Obama  1981-Reagan 1997-Clinton 
      ##       0.9320       0.9306       0.9295       0.9198       0.9131 
      ## 
      ## $`1997-Clinton`
      ##  1985-Reagan   2009-Obama  1981-Reagan   2013-Obama 1965-Johnson 
      ##       0.9485       0.9443       0.9316       0.9260       0.9214 
      ## 
      ## $`2001-Bush`
      ##  2013-Obama 1981-Reagan 1985-Reagan  2009-Obama   1989-Bush 
      ##      0.9170      0.9170      0.9169      0.9167      0.9128 
      ## 
      ## $`2005-Bush`
      ##  1985-Reagan   2009-Obama 1997-Clinton  1981-Reagan 1965-Johnson 
      ##       0.9237       0.9194       0.9186       0.9150       0.9124 
      ## 
      ## $`2009-Obama`
      ##   2013-Obama  1985-Reagan  1981-Reagan 1997-Clinton    1989-Bush 
      ##       0.9596       0.9526       0.9495       0.9443       0.9397 
      ## 
      ## $`2013-Obama`
      ##   2009-Obama  1985-Reagan  1981-Reagan 1993-Clinton 1997-Clinton 
      ##       0.9596       0.9371       0.9342       0.9295       0.9260
      similarity(presDfm, method = "cosine", margin = "document", n = 5)
      ## similarity Matrix:
      ## $`1961-Kennedy`
      ##   1969-Nixon 1997-Clinton  1981-Reagan  1985-Reagan   2009-Obama 
      ##       0.9235       0.9206       0.9155       0.9155       0.9116 
      ## 
      ## $`1965-Johnson`
      ##  1985-Reagan  1981-Reagan   2009-Obama    1989-Bush 1997-Clinton 
      ##       0.9409       0.9391       0.9352       0.9283       0.9231 
      ## 
      ## $`1969-Nixon`
      ## 1961-Kennedy  1981-Reagan   1973-Nixon  1985-Reagan   2009-Obama 
      ##       0.9235       0.9228       0.9224       0.9120       0.9015 
      ## 
      ## $`1973-Nixon`
      ##   1969-Nixon  1985-Reagan  1981-Reagan 1961-Kennedy  1977-Carter 
      ##       0.9224       0.9068       0.9055       0.9002       0.8894 
      ## 
      ## $`1977-Carter`
      ##  2013-Obama 1985-Reagan  2009-Obama 1981-Reagan   1989-Bush 
      ##      0.9262      0.9240      0.9203      0.9188      0.9174 
      ## 
      ## $`1981-Reagan`
      ##  1985-Reagan   2009-Obama 1965-Johnson    1989-Bush   2013-Obama 
      ##       0.9594       0.9507       0.9391       0.9390       0.9358 
      ## 
      ## $`1985-Reagan`
      ##  1981-Reagan   2009-Obama 1997-Clinton    1989-Bush 1965-Johnson 
      ##       0.9594       0.9537       0.9495       0.9426       0.9409 
      ## 
      ## $`1989-Bush`
      ##  1985-Reagan   2009-Obama  1981-Reagan 1965-Johnson  1977-Carter 
      ##       0.9426       0.9411       0.9390       0.9283       0.9174 
      ## 
      ## $`1993-Clinton`
      ##   2009-Obama  1985-Reagan   2013-Obama  1981-Reagan 1997-Clinton 
      ##       0.9335       0.9322       0.9311       0.9217       0.9149 
      ## 
      ## $`1997-Clinton`
      ##  1985-Reagan   2009-Obama  1981-Reagan   2013-Obama 1965-Johnson 
      ##       0.9495       0.9455       0.9331       0.9276       0.9231 
      ## 
      ## $`2001-Bush`
      ## 1981-Reagan  2013-Obama 1985-Reagan  2009-Obama   1989-Bush 
      ##      0.9190      0.9190      0.9189      0.9186      0.9149 
      ## 
      ## $`2005-Bush`
      ##  1985-Reagan   2009-Obama 1997-Clinton  1981-Reagan 1965-Johnson 
      ##       0.9251       0.9210       0.9202       0.9165       0.9141 
      ## 
      ## $`2009-Obama`
      ##   2013-Obama  1985-Reagan  1981-Reagan 1997-Clinton    1989-Bush 
      ##       0.9605       0.9537       0.9507       0.9455       0.9411 
      ## 
      ## $`2013-Obama`
      ##   2009-Obama  1985-Reagan  1981-Reagan 1993-Clinton 1997-Clinton 
      ##       0.9605       0.9386       0.9358       0.9311       0.9276
      similarity(tf(presDfm, "prop"), method = "cosine", margin = "document", n = 5)
      ## similarity Matrix:
      ## $`1961-Kennedy`
      ##   1969-Nixon 1997-Clinton  1981-Reagan  1985-Reagan   2009-Obama 
      ##       0.9235       0.9206       0.9155       0.9155       0.9116 
      ## 
      ## $`1965-Johnson`
      ##  1985-Reagan  1981-Reagan   2009-Obama    1989-Bush 1997-Clinton 
      ##       0.9409       0.9391       0.9352       0.9283       0.9231 
      ## 
      ## $`1969-Nixon`
      ## 1961-Kennedy  1981-Reagan   1973-Nixon  1985-Reagan   2009-Obama 
      ##       0.9235       0.9228       0.9224       0.9120       0.9015 
      ## 
      ## $`1973-Nixon`
      ##   1969-Nixon  1985-Reagan  1981-Reagan 1961-Kennedy  1977-Carter 
      ##       0.9224       0.9068       0.9055       0.9002       0.8894 
      ## 
      ## $`1977-Carter`
      ##  2013-Obama 1985-Reagan  2009-Obama 1981-Reagan   1989-Bush 
      ##      0.9262      0.9240      0.9203      0.9188      0.9174 
      ## 
      ## $`1981-Reagan`
      ##  1985-Reagan   2009-Obama 1965-Johnson    1989-Bush   2013-Obama 
      ##       0.9594       0.9507       0.9391       0.9390       0.9358 
      ## 
      ## $`1985-Reagan`
      ##  1981-Reagan   2009-Obama 1997-Clinton    1989-Bush 1965-Johnson 
      ##       0.9594       0.9537       0.9495       0.9426       0.9409 
      ## 
      ## $`1989-Bush`
      ##  1985-Reagan   2009-Obama  1981-Reagan 1965-Johnson  1977-Carter 
      ##       0.9426       0.9411       0.9390       0.9283       0.9174 
      ## 
      ## $`1993-Clinton`
      ##   2009-Obama  1985-Reagan   2013-Obama  1981-Reagan 1997-Clinton 
      ##       0.9335       0.9322       0.9311       0.9217       0.9149 
      ## 
      ## $`1997-Clinton`
      ##  1985-Reagan   2009-Obama  1981-Reagan   2013-Obama 1965-Johnson 
      ##       0.9495       0.9455       0.9331       0.9276       0.9231 
      ## 
      ## $`2001-Bush`
      ## 1981-Reagan  2013-Obama 1985-Reagan  2009-Obama   1989-Bush 
      ##      0.9190      0.9190      0.9189      0.9186      0.9149 
      ## 
      ## $`2005-Bush`
      ##  1985-Reagan   2009-Obama 1997-Clinton  1981-Reagan 1965-Johnson 
      ##       0.9251       0.9210       0.9202       0.9165       0.9141 
      ## 
      ## $`2009-Obama`
      ##   2013-Obama  1985-Reagan  1981-Reagan 1997-Clinton    1989-Bush 
      ##       0.9605       0.9537       0.9507       0.9455       0.9411 
      ## 
      ## $`2013-Obama`
      ##   2009-Obama  1985-Reagan  1981-Reagan 1993-Clinton 1997-Clinton 
      ##       0.9605       0.9386       0.9358       0.9311       0.9276
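
      The results with and without normalization are identical for both methods. One way to see why: cosine similarity and Pearson correlation are both unchanged when a document's count vector is multiplied by a positive constant, and converting to proportions does exactly that. A minimal base-R sketch on a made-up count matrix (the helper name cosine_sim is hypothetical):

```r
# Hypothetical helper: cosine similarity between two numeric vectors
cosine_sim <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Made-up document-feature counts: 2 documents, 4 features
counts <- rbind(doc1 = c(4, 0, 2, 6),
                doc2 = c(1, 3, 0, 2))
props <- counts / rowSums(counts)  # relative frequencies within each document

# Each comparison returns TRUE: normalization changes neither measure
all.equal(cosine_sim(counts["doc1", ], counts["doc2", ]),
          cosine_sim(props["doc1", ], props["doc2", ]))
all.equal(cor(counts["doc1", ], counts["doc2", ]),
          cor(props["doc1", ], props["doc2", ]))
```

      Measures that are not scale-invariant, such as Euclidean distance, would in general give different results after normalization.
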
    3. Measure the term similarities for the following words: economy, health, women, using cosine similarity.

      similarity(presDfm, c("economy", "health", "women"), method = "cosine", margin = "feature", n = 10)
      ## similarity Matrix:
      ## $economy
      ##   political opportunity        with   strongest        much        fair 
      ##      0.8838      0.8812      0.8634      0.8427      0.8355      0.8341 
      ##      almost   threatens      months   americans 
      ##      0.8341      0.8341      0.8341      0.8227 
      ## 
      ## $health
      ##       wrong       every      ideals      reform generations      common 
      ##      0.9037      0.8502      0.8500      0.8485      0.8367      0.8281 
      ##        long        less     america     schools 
      ##      0.8232      0.8199      0.8104      0.7913 
      ## 
      ## $women
      ## throughout       hard        men        you      shown      bless 
      ##     0.9390     0.9191     0.9099     0.9002     0.8854     0.8731 
      ##        day    fathers   remember prosperous 
      ##     0.8645     0.8552     0.8552     0.8528
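
      With margin = "feature", the vectors being compared are the columns of the dfm (one count per document) rather than its rows. The sketch below does the same kind of calculation by hand on a made-up matrix; feature_cosine is a hypothetical helper, not a quanteda function.

```r
# Made-up counts: 4 documents (rows) by 4 features (columns)
m <- cbind(economy = c(2, 0, 1, 3),
           jobs    = c(4, 0, 2, 6),
           health  = c(0, 5, 1, 0),
           the     = c(9, 8, 9, 9))

# Hypothetical helper: cosine similarity of one feature to all the others,
# returning the n most similar features
feature_cosine <- function(m, target, n = 2) {
  norms <- sqrt(colSums(m^2))
  sims <- as.vector(crossprod(m, m[, target])) / (norms * norms[target])
  names(sims) <- colnames(m)
  head(sort(sims[names(sims) != target], decreasing = TRUE), n)
}

feature_cosine(m, "economy")  # "jobs" ranks first, then "the"
```

      Here "jobs" is an exact multiple of "economy", so its similarity is 1. A word that occurs in every document, such as "the", also scores fairly high, which is one reason stop words are often removed first.
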
    4. Extra credit: Now weight the dfm object using tf-idf weights, and recompute the term similarity matrix. How different are the results? Compute the same similarities for the tf-idf weighted results for a similar dfm but without removing stopwords. How different are these results?

      similarity(tfidf(presDfm), c("economy", "health", "women"), method = "cosine", margin = "feature", n = 10)
      ## similarity Matrix:
      ## $economy
      ##   political opportunity   strongest        much        fair      almost 
      ##      0.8838      0.8812      0.8427      0.8355      0.8341      0.8341 
      ##   threatens      months   americans  government 
      ##      0.8341      0.8341      0.8227      0.8226 
      ## 
      ## $health
      ##       wrong       every      ideals      reform generations      common 
      ##      0.9037      0.8502      0.8500      0.8485      0.8367      0.8281 
      ##        less     schools       force   communism 
      ##      0.8199      0.7913      0.7912      0.7906 
      ## 
      ## $women
      ## throughout       hard        men        you      shown      bless 
      ##     0.9390     0.9191     0.9099     0.9002     0.8854     0.8731 
      ##        day    fathers   remember prosperous 
      ##     0.8645     0.8552     0.8552     0.8528
      similarity(tfidf(presDfm, smoothing = 1), c("economy", "health", "women"), method = "cosine", margin = "feature", n = 10)
      ## similarity Matrix:
      ## $economy
      ##   political opportunity        with   strongest        much      almost 
      ##      0.8838      0.8812      0.8634      0.8427      0.8355      0.8341 
      ##   threatens      months        fair   americans 
      ##      0.8341      0.8341      0.8341      0.8227 
      ## 
      ## $health
      ##       wrong       every      ideals      reform generations      common 
      ##      0.9037      0.8502      0.8500      0.8485      0.8367      0.8281 
      ##        long        less     america     schools 
      ##      0.8232      0.8199      0.8104      0.7913 
      ## 
      ## $women
      ## throughout       hard        men        you      shown      bless 
      ##     0.9390     0.9191     0.9099     0.9002     0.8854     0.8731 
      ##        day    fathers   remember prosperous 
      ##     0.8645     0.8552     0.8552     0.8528

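      Almost identical: tf-idf weighting has virtually no effect here. The similarity values themselves are unchanged; the only visible difference is that, without smoothing, a few very common terms (with, long, america) drop out of the lists.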
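
      A small base-R sketch of why tf-idf weighting barely changes feature-to-feature cosine similarities: it multiplies each feature's column by a single constant, which cosine similarity ignores, except that a feature occurring in every document receives a weight of zero. The sketch assumes the common definition idf = log10(N / document frequency), which may differ in detail from what tfidf() computes; the matrix and helper name are made up.

```r
# Made-up counts: 3 documents by 3 features; "the" occurs in every document
m <- cbind(economy = c(2, 0, 1),
           jobs    = c(1, 0, 3),
           the     = c(5, 4, 6))

# Hypothetical helper: cosine similarity between two numeric vectors
cosine_sim <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Assumed weighting: idf = log10(number of documents / document frequency)
idf <- log10(nrow(m) / colSums(m > 0))
weighted <- sweep(m, 2, idf, `*`)  # scale each feature column by its idf

idf["the"]  # 0: a feature found in every document is weighted to zero

# TRUE: rescaling columns leaves the cosine between them unchanged
all.equal(cosine_sim(m[, "economy"], m[, "jobs"]),
          cosine_sim(weighted[, "economy"], weighted[, "jobs"]))
```

      Under this assumption, a zero-weighted column has no defined cosine with anything, which is consistent with such terms disappearing from the unsmoothed results.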