Most of this exercise consisted of working through the code provided at your own pace. The answers below are for those parts that asked a direct question of you.

Exercise 1

  1. Preliminaries: Installation

    1. Install the package

    Note that you cannot install from source on the LSE’s lab computers, because they do not have RTools installed.
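
    A binary install from CRAN avoids the need for RTools entirely, since no compilation is involved. A minimal sketch (the guard against reinstalling is my own addition):

```r
# install the released binary from CRAN (no compilation, so no RTools needed)
if (!requireNamespace("quanteda", quietly = TRUE)) {
    install.packages("quanteda")
}
```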

    1. Test your setup

    Did you get the same output for the word cloud? You might have experienced some issues with the size of the graphics window in RStudio. This has to do with the size of your graphics device – see [this page](https://support.rstudio.com/hc/en-us/articles/200488548-Problem-with-Plots-or-Graphics-Device) for help.

    To control the output window size in RMarkdown, you can use code like this in your chunk:

    {r, fig.width=8, fig.height=8}
    plot(ieDfm, min.freq=25, random.order=FALSE)
  2. Basic string manipulation functions in R

    1. Counting characters.
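
    Character counting is done with base R's nchar(); a minimal sketch, with example strings of my own:

```r
s <- c("one", "three", "five")
nchar(s)  # number of characters in each element: 3 5 4
```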

    2. Extracting characters.
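
    Extraction uses base R's substr() and its vectorized cousin substring(); a sketch with a made-up string:

```r
s <- "immigration"
substr(s, 1, 7)         # first seven characters: "immigra"
substring(s, 1:3, 3:5)  # vectorized starts/stops: "imm" "mmi" "mig"
```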

    3. Splitting texts and using lists.

    require(quanteda, warn.conflicts = FALSE, quietly = TRUE)
    toks <- tokenize("This is a sentence containing some caractères Français.")
    str(toks)
    ## List of 1
    ##  $ : chr [1:9] "This" "is" "a" "sentence" ...
    ##  - attr(*, "class")= chr [1:2] "tokenizedTexts" "list"
    ##  - attr(*, "what")= chr "word"
    ##  - attr(*, "ngrams")= int 1
    ##  - attr(*, "concatenator")= chr ""

    The “structure” of the toks object indicates that it is a specially classed “list” type, with additional attributes added by tokenize(), such as “what”, “ngrams”, etc. The list has one element: a 9-element character vector consisting of the tokens from the sentence that was tokenized.

    methods(class = "tokenizedTexts")
    ##  [1] collocations   dfm            kwic           ngrams        
    ##  [5] ntoken         ntype          print          selectFeatures
    ##  [9] skipgrams      syllables      toLower        wordstem      
    ## see '?methods' for accessing help and source code

    These are all of the methods (“functions”) defined in quanteda for an object of class tokenizedTexts. When you type the object’s name at the console, R auto-prints it, invoking the print.tokenizedTexts method. You can compare its output with that of the default print method:

    getS3method("print", "default")(toks)
    ## [[1]]
    ## [1] "This"       "is"         "a"          "sentence"   "containing"
    ## [6] "some"       "caractères" "Français"   "."         
    ## 
    ## attr(,"class")
    ## [1] "tokenizedTexts" "list"          
    ## attr(,"what")
    ## [1] "word"
    ## attr(,"ngrams")
    ## [1] 1
    ## attr(,"concatenator")
    ## [1] ""
    getS3method("print", "tokenizedTexts")(toks)
    ## tokenizedText object from 1 document.
    ## Component 1 :
    ## [1] "This"       "is"         "a"          "sentence"   "containing"
    ## [6] "some"       "caractères" "Français"   "."
    1. Joining character objects together.
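
    Joining is done with paste(); a minimal sketch using invented strings, showing the difference between sep and collapse:

```r
paste("one", "two", "three")            # joins arguments with a space: "one two three"
paste("one", "two", sep = "-")          # custom separator: "one-two"
paste(c("a", "b"), collapse = " and ")  # collapses a vector into one string: "a and b"
```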

    2. Manipulating case

    sVec <- c("Quanteda is the Best Text Package Ever, approved by NATO!", 
              "Quanteda является лучший текст пакет тех, утвержденной НАТО!")
    tolower(sVec)
    ## [1] "quanteda is the best text package ever, approved by nato!"   
    ## [2] "quanteda является лучший текст пакет тех, утвержденной нато!"
    toLower(sVec)
    ## [1] "quanteda is the best text package ever, approved by nato!"   
    ## [2] "quanteda является лучший текст пакет тех, утвержденной нато!"

    Here, tolower() handled the Russian text correctly, but the base tolower() is not guaranteed to work on all Unicode text. It is safer to use the Unicode-aware toLower().

  3. Counting and comparing objects.

    1. Comparing character objects.
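
    The basic comparison tools are ==, identical(), and %in%; a sketch with invented vectors:

```r
s1 <- c("apple", "banana")
s2 <- c("apple", "cherry")
s1 == s2           # element-wise comparison: TRUE FALSE
identical(s1, s2)  # whole-object comparison: FALSE
"cherry" %in% s1   # membership test: FALSE
```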

    Extra credit: Try using this with the length() function to figure out how many times “new” occurs in the tokenized text of the 57th inaugural speech, which you can access as a quanteda built-in object as inaugTexts[57]. Hint: use %in% to return a logical vector, then call sum() on the result, which coerces the TRUE/FALSE values to 1s and 0s and adds them up.

    sum(tokenize(inaugTexts[57])[[1]] %in% "new")
    ## [1] 6

    This works because inaugTexts[57] selects the 57th element of the inaugTexts character vector, which tokenize() then tokenizes; [[1]] extracts the first element of the resulting list (the tokenizedTexts object returned by tokenize()). The %in% operator produces a logical vector indicating, for each token, whether it matches “new”, and sum() counts the TRUE values, finding six.

    Of course this is also what dfm() is for:

    inaugDfm <- dfm(inaugTexts, toLower = FALSE, verbose = FALSE)
    inaugDfm[57, "new"]
    ## Document-feature matrix of: 1 document, 1 feature.
    ## 1 x 1 sparse Matrix of class "dfmSparse"
    ##             features
    ## docs         new
    ##   2013-Obama   6
    1. Pattern matching
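
    Base R's grep() family covers the essentials of pattern matching; a sketch with an invented vector:

```r
months <- c("January", "February", "March", "May")
grep("ary", months)   # indices of matching elements: 1 2
grepl("ary", months)  # logical vector: TRUE TRUE FALSE FALSE
sub("M", "m", months) # replace the first match in each element
```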
  4. Making a corpus and corpus structure

    1. From a vector of texts already in memory.

      summary(corpus(inaugTexts), 5)
      ## Corpus consisting of 57 documents, showing 5 documents.
      ## 
      ##             Text Types Tokens Sentences
      ##  1789-Washington   626   1540        24
      ##  1793-Washington    96    147         4
      ##       1797-Adams   826   2584        37
      ##   1801-Jefferson   716   1935        42
      ##   1805-Jefferson   804   2381        45
      ## 
      ## Source:  /Users/kbenoit/Dropbox/Classes/Trinity/Text Analysis 2016/Exercises/Exercise 1/* on x86_64 by kbenoit
      ## Created: Thu Mar 10 19:28:35 2016
      ## Notes:
    2. From a directory of text files.

      Example:

      mytf <- textfile("~/Dropbox/QUANTESS/corpora/amicus/all/*.txt")
      mycorpus <- corpus(mytf)
      summary(mycorpus, 5)
      ## Corpus consisting of 102 documents, showing 5 documents.
      ## 
      ##       Text Types Tokens Sentences
      ##  sAP01.txt  1825   7935       256
      ##  sAP02.txt  2131   8629       393
      ##  sAP03.txt  2200  10241       475
      ##  sAP04.txt  1377   6253       232
      ##  sAP05.txt  2227   9345       372
      ## 
      ## Source:  /Users/kbenoit/Dropbox/Classes/Trinity/Text Analysis 2016/Exercises/Exercise 1/* on x86_64 by kbenoit
      ## Created: Thu Mar 10 19:28:38 2016
      ## Notes:
    3. There are many other ways to create a corpus, most using the intermediate function textfile() to read texts into R. Explore these ways by studying ?textfile. Can you reproduce the examples?

      No, you probably cannot, since many of them refer to files that are local to my system.