Quantitative Text Analysis (TCD 2016)

February 4th, 2016 Ken Posted in Course-related, Quantitative Methods, Text Analysis

Meeting times

5 February, 12 February, 11 March, 18 March.

Course Handout

Short Outline

The course surveys methods for systematically extracting quantitative information from political text for social scientific purposes, progressing from classical content analysis and dictionary-based methods, through classification methods, to state-of-the-art scaling methods and topic models that estimate quantities from text using statistical techniques. The course lays a theoretical foundation for text analysis but mainly takes a practical and applied approach, so that students learn how to apply these methods in actual research. The common focus across all methods is that they can be reduced to a three-step process: first, identifying texts and units of texts for analysis; second, extracting from the texts quantitatively measured features, such as coded content categories, word counts, word types, dictionary counts, or parts of speech, and converting these into a quantitative matrix; and third, using quantitative or statistical methods to analyse this matrix in order to generate inferences about the texts or their authors. The course covers these methods in a logical progression, with a practical, hands-on approach in which each technique is applied to real texts using appropriate software.
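
To make the three-step process concrete, here is a minimal sketch in R using the quanteda package assigned in the Session 1 readings. The two toy documents are invented for illustration, and the function names assume a recent quanteda release (the package's API has evolved since 2016).

library(quanteda)

## Step 1: identify the texts and units of analysis (two invented documents)
txts <- c(gov = "We will cut taxes and reduce spending.",
          opp = "We will increase spending on health and education.")
corp <- corpus(txts)

## Step 2: extract features into a quantitative document-feature matrix
toks  <- tokens(corp, remove_punct = TRUE)
dfmat <- dfm(toks)

## Step 3: analyse the matrix, e.g. inspect the most frequent features
topfeatures(dfmat)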

Objectives

The course is also designed to cover many fundamental issues in quantitative text analysis, such as inter-coder agreement, reliability, validation, accuracy, and precision. It focuses on methods of converting texts into quantitative matrices of features, and then analysing those features using statistical methods. The course briefly covers the qualitative technique of human coding and annotation, but only for the purposes of creating a validation set for automated approaches. These automated approaches include dictionary construction and application, classification and machine learning, scaling models, and topic models. For each topic, we will systematically cover published applications and examples of these methods from a variety of disciplinary and applied fields, focusing on political science. Lessons will consist of a mixture of theoretical grounding in content analysis approaches and techniques, combined with hands-on analysis of real texts using content analytic and statistical software.

Detailed Course Schedule

Session 1: Introduction and Issues in quantitative text analysis

This session will cover fundamentals, including the continuum from traditional (non-computer assisted) content analysis to fully automated quantitative text analysis. We will cover the conceptual foundations of content analysis and quantitative content analysis, discussing the objectives, the approach to knowledge, and the particular view of texts adopted when performing quantitative analysis. We will also discuss issues including where to obtain textual data; formatting and working with text files; indexing and meta-data; units of analysis; and definitions of features and measures commonly extracted from texts, including stemming and stop-words.
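
As a brief preview of these feature definitions, the sketch below shows stop-word removal and stemming with quanteda. The example sentence is invented, and the calls assume a recent quanteda release.

library(quanteda)

toks <- tokens("Taxes are taxing, but the taxed taxpayers paid their taxes.",
               remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("en"))  # drop common stop-words
toks <- tokens_wordstem(toks)                 # map word types to their stems
dfm(toks)  # counts of the resulting stemmed features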

Required Reading:

  • Vignette and instructions at http://github.com/kbenoit/quanteda
  • Grimmer and Stewart (2013)
  • Manning, Raghavan and Schütze (2008, 117–120)
  • Krippendorff (2013, Chs. 9–10)
  • Dunning (1993)

Recommended Reading:

  • Krippendorff (2013, Ch. 1–2, 5, 7)
  • Wikipedia entry on Character encoding, http://en.wikipedia.org/wiki/Text_encoding
  • Browse the different text file formats at http://www.fileinfo.com/filetypes/text
  • Neuendorf (2002, Chs. 4–7)
  • Krippendorff (2013, Ch. 6)
  • Däubler et al. (2012)

Resources:

Session 2: Quantitative methods for comparing texts

Here we focus on quantitative methods for describing texts, concentrating on summary measures that highlight particular characteristics of documents and allow these to be compared. These methods include characterizing texts through concordances, co-occurrences, and keywords in context; complexity and readability measures; and an in-depth discussion of text types, tokens, and equivalencies. We will also discuss weighting strategies for features, such as tf-idf. The emphasis will be on comparing texts, through concordances and keyword identification, dissimilarity measures, association models, and vector-space models.
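
Here is a minimal sketch of several of these techniques, again in quanteda; the two documents are invented, and recent releases place the textstat_* functions in the companion quanteda.textstats package.

library(quanteda)
library(quanteda.textstats)  # textstat_* functions in recent releases

corp <- corpus(c(a = "The budget deficit grew, and budget cuts followed.",
                 b = "A budget surplus allowed new spending on schools."))
toks <- tokens(corp, remove_punct = TRUE)

kwic(toks, pattern = "budget")            # keywords in context (concordance)
textstat_readability(corp, "Flesch")      # one readability measure
dfmat <- dfm(toks)
dfm_tfidf(dfmat)                          # tf-idf weighting of features
textstat_simil(dfmat, method = "cosine")  # vector-space document similarity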

Required Reading:

  • Krippendorff (2013, Ch. 10)
  • Lowe et al. (2011)
  • Manning, Raghavan and Schütze (2008, Section 6.3)

Recommended Reading:

  • Seale, Ziebland and Charteris-Black (2006)

Resources:

Session 3: Automated dictionary-based approaches

Automated dictionary-based methods involve associating pre-defined word lists with particular quantitative values assigned by the researcher for some characteristic of interest. This topic covers the design model behind dictionary construction, including guidelines for testing and refining dictionaries. Hands-on work will cover commonly used dictionaries such as LIWC, RID, and the Harvard IV-4, with applications. We will also review a variety of text pre-processing issues and textual data concepts such as word types, tokens, and equivalencies, including word stemming and the trimming of words based on term and/or document frequency.
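
To illustrate, here is a small sketch of dictionary application and frequency-based trimming in quanteda. The two-category dictionary and the texts are invented stand-ins for real resources such as LIWC, and the argument names assume a recent quanteda release.

library(quanteda)

dict <- dictionary(list(economy = c("tax*", "budget*", "spend*"),
                        welfare = c("health*", "school*", "pension*")))

toks  <- tokens(c("We will cut taxes and curb spending.",
                  "More money for schools, health, and pensions."),
                remove_punct = TRUE)
dfmat <- dfm(toks)

dfm_lookup(dfmat, dictionary = dict)  # counts of dictionary matches per text
dfm_trim(dfmat, min_termfreq = 2)     # trim features by overall term frequency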

Required Reading:

  • Neuendorf (2002, Ch. 6)
  • Laver and Garry (2000)
  • Rooduijn and Pauwels (2011)

Recommended Reading:

  • Pennebaker and Chung (2008)
  • Tausczik and Pennebaker (2010)
  • Loughran and McDonald (2011)

Resources:

Session 4: Machine Learning for Texts

Classification methods permit the automatic categorization of texts in a test set, based on machine learning from a training set. We will introduce machine learning methods for classifying documents, including one of the most popular classifiers, the Naive Bayes model. The topic also introduces validation and reporting methods for classifiers, and discusses where these methods are applicable. Building on the Naive Bayes classifier, we introduce the “Wordscores” method of Laver, Benoit and Garry (2003) for scaling latent traits, and show the link between classification and scaling. We also cover applications of penalized regression to score and scale texts.
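
The sketch below shows both steps on invented training texts, assuming recent releases of quanteda and its companion quanteda.textmodels package, where the Naive Bayes and Wordscores models now live.

library(quanteda)
library(quanteda.textmodels)  # textmodel_* functions in recent releases

## invented labelled training texts, plus one unlabelled test text
train <- tokens(c(left1  = "expand welfare and public spending",
                  left2  = "invest in public health and education",
                  right1 = "cut taxes and reduce public spending",
                  right2 = "deregulate markets and cut taxes"))
test  <- tokens(c(unknown = "we promise to cut taxes"))

dfm_train <- dfm(train)
dfm_test  <- dfm_match(dfm(test), features = featnames(dfm_train))

## Naive Bayes classification of the test text
nb <- textmodel_nb(dfm_train, y = c("left", "left", "right", "right"))
predict(nb, newdata = dfm_test)

## Wordscores scaling: train on reference scores, then score the test text
ws <- textmodel_wordscores(dfm_train, y = c(-1, -1, 1, 1))
predict(ws, newdata = dfm_test)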

Required Reading:

  • Manning, Raghavan and Schütze (2008, Ch. 13)
  • Lantz (2013, Ch. 3–4)
  • Evans et al. (2007)
  • Laver, Benoit and Garry (2003)

Recommended Reading:

  • Lantz (2013, Ch. 10)
  • StatSoft, “Naive Bayes Classifier Introductory Overview,” http://www.statsoft.com/textbook/naive-bayes-classifier/
  • An online article by Paul Graham on classifying spam e-mail: http://www.paulgraham.com/spam.html
  • BionicSpirit.com, 9 Feb 2012, “How to Build a Naive Bayes Classifier,” http://bionicspirit.com/blog/2012/02/09/howto-build-naive-bayes-classifier.html
  • Yu, Kaufmann and Diermeier (2008)
  • Zumel and Mount (2014, Ch. 5–6)
  • Benoit and Nulty (2013)
  • Martin and Vanberg (2007)
  • Benoit and Laver (2008)
  • Lowe (2008)

Resources:

Final Assignment

Project Guidelines

The final project is a written analysis of approximately 4,000-5,000 words. The project replaces a final exam, and is designed to give you a chance to analyze a set of texts that you have chosen, reflecting (hopefully) your own research interests. This can be, and probably makes the most sense to be, textual data from something you are already studying. Which texts you choose, what question you investigate, and how you analyze the texts is your choice, but you must justify the choice.

Content

Your content should include the following:

  1. A cover sheet including the title and your name.
  2. An abstract page, with an abstract of no more than 200 words.
  3. Introduction.
    An expanded version of your abstract, which introduces the question, states the rationale for trying to answer it, briefly describes your corpus, identifies the methods you apply to the texts, and summarizes the findings.
  4. Motivation.
    Why have you chosen to analyze this topic? Is there a compelling social reason? Does it contribute to scholarship? This can include a “literature review” (but don’t overdo it).
  5. Description of your corpus.
    You are free to choose any corpus you wish, of any size, although you must justify the choice of texts and acknowledge their source. You will need to document any format conversions or pre-processing steps you have applied to the texts prior to analysis. You should also present some basic summary statistics about the texts prior to analysis (a short sketch of one way to do this follows this list).
  6. Description of your methods.
    What techniques will you apply to analyze the texts? Is there a precedent (in previous scholarly literature) for applying such methods to texts similar to yours? Defend why the application of the methods is appropriate.
  7. Results.
    Apply the methods, present the findings. Be sure to be explicit about any steps taken.
  8. Conclusions. What conclusions on the question can we draw from the results?
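
For item 5, here is a minimal sketch of one way to produce such summary statistics with quanteda, using invented stand-in texts:

library(quanteda)

## toy stand-ins; substitute the documents of your own corpus
corp <- corpus(c(doc1 = "First sample document. It has two sentences.",
                 doc2 = "A second, shorter sample document."))
summary(corp)  # per-document counts of types, tokens, and sentences
ndoc(corp)     # number of documents in the corpus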

Formatting

There is no rigid set of guidelines, but you should use a referencing system compatible with the Chicago Manual of Style (parenthetical references rather than footnotes). Tables should be properly formatted rather than consisting of pasted output from a statistical package. Aim to make the document look (roughly) like a formatted journal article.

Deadline

30 April 2016, 5pm, by email to kbenoit@tcd.ie.


How to batch convert pdf files to text

July 31st, 2014 Ken Posted in Course-related, Quantitative Methods

Frequently I am asked: I have a bunch of pdf files; how can I convert them to plain text so that I can analyze them using quantitative techniques? Here is my recommendation.

  1. Download the xpdf suite of tools for your platform. This includes the part we will use, pdftotext.
    Alternatives are the Apache PDFBox Java pdf library, and the Python-based PDFminer.
  2. [Windows only – Mac and Linux/Unix have this built into the Terminal or shell already]: You will need a bash shell for your platform. (It is possible to do what I suggest below using the Windows shell, but it’s been so long since I programmed in the Windows DOS/command-line script language that I won’t even attempt it now.) The main options seem to be win-bash and Cygwin.
  3. Create a folder called pdfs in your home folder (for this example – of course it can be elsewhere). Copy your pdf files to this folder.
  4. In a text editor, create a text file called convertmyfiles.sh with the following contents:
    #!/bin/bash
    # convert every .pdf file in ~/pdfs to plain text with the same base name
    for f in ~/pdfs/*.pdf
    do
      echo "Processing $f file..."
      pdftotext -enc UTF-8 "$f"   # quote $f so filenames with spaces work
    done
    

    (I am not providing a link because if you cannot create a text file and copy this text to it — and crucially edit it slightly for your own needs — then you probably won’t have much luck with these steps anyway.)

  5. Open the bash shell (Terminal.app or win-bash or equivalent) and execute the following:
    cd pdfs
    chmod +x convertmyfiles.sh   # make the script executable (first run only)
    ./convertmyfiles.sh
    

    Now you will have a set of text files (ending in .txt) converted as a set. These will probably need tidying up, as the conversion tends to include cruft like headers, page numbers, etc. Sometimes the layout will be mangled too, especially if you had multi-column pages (the -layout option tries to preserve the original physical layout). Note also that the *.pdf pattern in the script is case-sensitive, so a file ending in .PDF, as in the example below, will be skipped unless you rename it or extend the pattern. You can experiment with the conversion options available from

    pdftotext -h
    

    Note that in the script above, the extracted text is given a UTF-8 (Unicode) character encoding, which is what you should use whenever possible.

Example: (from Terminal.app on my Mac)

Last login: Thu Jul 31 11:29:44 on ttys001
KBs-MBP13:~ kbenoit$ cd pdfs
KBs-MBP13:pdfs kbenoit$ pwd
/Users/kbenoit/pdfs
KBs-MBP13:pdfs kbenoit$ rm *txt
KBs-MBP13:pdfs kbenoit$ ls
11centerpartiet2004.pdf
11folkpartiet2004.pdf
11kristdemokraterna2004.pdf
11kristdemokraterna2004_300k.pdf
11miljopartiet_de_grone2004.pdf
13radikale_venste2004_ENGL.pdf
13socialdemokraterne2004.pdf
21Ecolo_programme_2004.pdf
21Mouvement_Reformateur_100_propositions_pour_2_Θlect_Vlaams_en_europe.PDF
21SPA_europeesprogramma2004.pdf
convertmyfiles.sh
KBs-MBP13:pdfs kbenoit$ ./convertmyfiles.sh 
Processing /Users/kbenoit/pdfs/11centerpartiet2004.pdf file...
Processing /Users/kbenoit/pdfs/11folkpartiet2004.pdf file...
Processing /Users/kbenoit/pdfs/11kristdemokraterna2004.pdf file...
Processing /Users/kbenoit/pdfs/11kristdemokraterna2004_300k.pdf file...
Processing /Users/kbenoit/pdfs/11miljopartiet_de_grone2004.pdf file...
Processing /Users/kbenoit/pdfs/13radikale_venste2004_ENGL.pdf file...
Processing /Users/kbenoit/pdfs/13socialdemokraterne2004.pdf file...
Processing /Users/kbenoit/pdfs/21Ecolo_programme_2004.pdf file...
Processing /Users/kbenoit/pdfs/21SPA_europeesprogramma2004.pdf file...
KBs-MBP13:pdfs kbenoit$ ls
11centerpartiet2004.pdf
11centerpartiet2004.txt
11folkpartiet2004.pdf
11folkpartiet2004.txt
11kristdemokraterna2004.pdf
11kristdemokraterna2004.txt
11kristdemokraterna2004_300k.pdf
11kristdemokraterna2004_300k.txt
11miljopartiet_de_grone2004.pdf
11miljopartiet_de_grone2004.txt
13radikale_venste2004_ENGL.pdf
13radikale_venste2004_ENGL.txt
13socialdemokraterne2004.pdf
13socialdemokraterne2004.txt
21Ecolo_programme_2004.pdf
21Ecolo_programme_2004.txt
21Mouvement_Reformateur_100_propositions_pour_2_Θlect_Vlaams_en_europe.PDF
21SPA_europeesprogramma2004.pdf
21SPA_europeesprogramma2004.txt
convertmyfiles.sh
KBs-MBP13:pdfs kbenoit$ 

Update 12 November 2015 for Windows (thanks Thomas)

For Windows, one way to do this is to use the Windows PowerShell ISE (Integrated Scripting Environment) in Programs/Accessories, as follows:

cd mypdffolder
$FILES = ls *.pdf                 # ls is an alias for Get-ChildItem here
foreach ($f in $FILES) {
    # quote the executable path because of the space in "Program Files"
    & "C:\Program Files\xpdf\bin32\pdftotext.exe" -enc UTF-8 $f.FullName
}