Instructions

Work your way through the examples, studying each to understand what it is doing. Where questions are asked, include your answer when you write this up.

Ways to prepare your answer:

Naming your file: Please use the following convention:

Exercise1_Lastname_FirstName.pdf (or whatever extension is appropriate)

Submitting your answers: Email your file to kbenoit@tcd.ie.

Exercise 1

  1. Preliminaries: Installation

    1. Install the package

    First, you need to have quanteda installed. You can do this from inside RStudio, via the Tools…Install Packages menu, or simply by running

    install.packages("quanteda")

    (Optional) You can install some additional corpus data from quantedaData using

    ## the devtools package is required to install quantedaData from GitHub
    devtools::install_github("kbenoit/quantedaData")

    Note that on Windows it is also highly recommended that you install the Rtools suite, and on OS X that you install Xcode from the App Store.
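
    If you rerun your script often, you may prefer to install the package only when it is missing. The following is a minimal sketch using base R; the helper name needs_install is hypothetical and is not part of quanteda.

```r
# Hypothetical helper (not part of quanteda): TRUE when a package
# cannot be found in the current library paths.
needs_install <- function(pkg) {
  !requireNamespace(pkg, quietly = TRUE)
}

# Install quanteda from CRAN only when it is not already available.
# Wrapped in interactive() so that sourcing this file in a batch job
# never triggers a download.
if (interactive() && needs_install("quanteda")) {
  install.packages("quanteda")
}
```

    The same check can be reused for devtools before calling devtools::install_github().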

    2. Test your setup

    Before you can execute the quanteda commands in this file, you will need to attach its functions using a require() or library() call.

    require(quanteda)
    ## Loading required package: quanteda
    ## 
    ## Attaching package: 'quanteda'
    ## 
    ## The following object is masked from 'package:stats':
    ## 
    ##     df
    ## 
    ## The following object is masked from 'package:base':
    ## 
    ##     sample
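
    The messages above say that quanteda defines its own df and sample, which now take precedence over the versions in stats and base. The originals remain reachable with the :: operator. The sketch below imitates masking with a stand-in function, so it runs without quanteda; the stand-in is hypothetical and does not reproduce what quanteda's df does.

```r
# A stand-in for a package function that masks stats::df
# (hypothetical; quanteda's own df behaves differently).
df <- function(x) "masked version"

# An unqualified call finds the masking definition first.
df(1)

# A qualified call always reaches the F density in the stats package.
stats::df(1, df1 = 2, df2 = 3)

# Likewise, base::sample is base R's sampler regardless of masking.
set.seed(42)
picked <- base::sample(1:10, 3)
```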

    Now summarize some texts in the Irish 2010 budget speech corpus:

    summary(ie2010Corpus)
    ## Corpus consisting of 14 documents.
    ## 
    ##                                   Text Types Tokens Sentences year debate
    ##        2010_BUDGET_01_Brian_Lenihan_FF  1754   7916       404 2010 BUDGET
    ##       2010_BUDGET_02_Richard_Bruton_FG   995   4086       217 2010 BUDGET
    ##         2010_BUDGET_03_Joan_Burton_LAB  1521   5790       309 2010 BUDGET
    ##        2010_BUDGET_04_Arthur_Morgan_SF  1499   6510       345 2010 BUDGET
    ##          2010_BUDGET_05_Brian_Cowen_FF  1544   5964       252 2010 BUDGET
    ##           2010_BUDGET_06_Enda_Kenny_FG  1087   3896       155 2010 BUDGET
    ##      2010_BUDGET_07_Kieran_ODonnell_FG   638   2086       133 2010 BUDGET
    ##       2010_BUDGET_08_Eamon_Gilmore_LAB  1123   3807       202 2010 BUDGET
    ##     2010_BUDGET_09_Michael_Higgins_LAB   457   1149        44 2010 BUDGET
    ##        2010_BUDGET_10_Ruairi_Quinn_LAB   415   1181        60 2010 BUDGET
    ##      2010_BUDGET_11_John_Gormley_Green   381    939        50 2010 BUDGET
    ##        2010_BUDGET_12_Eamon_Ryan_Green   486   1519        90 2010 BUDGET
    ##      2010_BUDGET_13_Ciaran_Cuffe_Green   426   1144        45 2010 BUDGET
    ##  2010_BUDGET_14_Caoimhghin_OCaolain_SF  1110   3699       177 2010 BUDGET
    ##  number      foren     name party
    ##      01      Brian  Lenihan    FF
    ##      02    Richard   Bruton    FG
    ##      03       Joan   Burton   LAB
    ##      04     Arthur   Morgan    SF
    ##      05      Brian    Cowen    FF
    ##      06       Enda    Kenny    FG
    ##      07     Kieran ODonnell    FG
    ##      08      Eamon  Gilmore   LAB
    ##      09    Michael  Higgins   LAB
    ##      10     Ruairi    Quinn   LAB
    ##      11       John  Gormley Green
    ##      12      Eamon     Ryan Green
    ##      13     Ciaran    Cuffe Green
    ##      14 Caoimhghin OCaolain    SF
    ## 
    ## Source:  /home/paul/Dropbox/code/quantedaData/* on x86_64 by paul
    ## Created: Tue Sep 16 15:58:21 2014
    ## Notes:

    Create a document-feature matrix from this corpus, removing English stop words (plus the word "will") and stemming the remaining features:

    ieDfm <- dfm(ie2010Corpus, ignoredFeatures = c(stopwords("english"), "will"), stem = TRUE)
    ## Creating a dfm from a corpus ...
    ##    ... lowercasing
    ##    ... tokenizing
    ##    ... indexing documents: 14 documents
    ##    ... indexing features: 4,881 feature types
    ##    ... removed 118 features, from 175 supplied (glob) feature types
    ##    ... stemming features (English), trimmed 1510 feature variants
    ##    ... created a 14 x 3253 sparse dfm
    ##    ... complete. 
    ## Elapsed time: 0.102 seconds.
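
    To see what the steps in this log amount to, here is a toy document-feature matrix built with base R only. It is a sketch for intuition, not quanteda's implementation: the two texts and the two-word stop list are invented, stemming is left out, and the result is a dense matrix rather than a sparse one.

```r
# Two toy documents (invented for illustration).
texts <- c(doc1 = "The budget will cut taxes",
           doc2 = "The people want the budget cut")

# A tiny hand-picked stop word list (an assumption; the list returned
# by stopwords("english") is much longer).
stops <- c("the", "will")

# Lowercase, split on runs of non-letters, and drop stop words.
tokens <- lapply(strsplit(tolower(texts), "[^a-z]+"), function(tok) {
  tok[nzchar(tok) & !(tok %in% stops)]
})

# Index the distinct features, then count each one per document.
features <- sort(unique(unlist(tokens)))
toy_dfm <- t(vapply(tokens, function(tok) {
  tabulate(match(tok, features), nbins = length(features))
}, integer(length(features))))
colnames(toy_dfm) <- features

toy_dfm  # rows are documents, columns are features
```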

    Look at the most frequently occurring features:

    topfeatures(ieDfm)
    ##  budget   peopl  govern    year  minist     tax  public economi     cut 
    ##     271     266     242     198     197     195     179     172     172 
    ##     job 
    ##     148
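
    Conceptually, a ranking like this comes from summing each feature's count over all documents and sorting the totals. A base R sketch on an invented count matrix (this illustrates the idea and is not how topfeatures() is implemented):

```r
# A small documents-by-features count matrix (invented numbers).
counts <- matrix(c(3, 0, 2,
                   2, 4, 1),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2"),
                                 c("budget", "tax", "job")))

# Sum each feature across documents, then sort from most to least frequent.
totals <- sort(colSums(counts), decreasing = TRUE)

# Keep the n most frequent features (n = 2 here).
top <- head(totals, 2)
top
```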

    Make a word cloud:

    plot(ieDfm, min.freq=25, random.order=FALSE)
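
    The min.freq argument drops features that occur fewer than 25 times, and random.order = FALSE plots words in decreasing order of frequency. The selection step can be sketched in base R as follows; the counts for budget and peopl are taken from the output above, while bank and pension are invented for illustration.

```r
# Feature totals standing in for the column sums of a dfm.
# "budget" and "peopl" match the topfeatures() output; the rest are invented.
freq <- c(pension = 12, bank = 40, peopl = 266, budget = 271)

# Keep only features at or above the threshold, most frequent first.
min_freq <- 25
kept <- sort(freq[freq >= min_freq], decreasing = TRUE)
kept
```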