How to batch convert pdf files to text

Frequently I am asked: I have a bunch of pdf files, how can I convert them to plain text so that I can analyze them using quantitative techniques? Here is my recommendation.

  1. Download the xpdf suite of tools for your platform. This includes the part we will use, pdftotext.
    Alternatives are the Apache PDFBox Java pdf library, and the Python-based PDFminer.
  2. [Windows only – Mac and Linux/Unix have this built in to the Terminal or shell already]: You will need a bash shell for your platform. (It is possible to do what I suggest below using the Windows shell, but it’s been so long since I programmed in the Windows DOS/command line script language that I won’t even attempt it now.) The main options seem to be win-bash and Cygwin.
  3. Create a folder called pdfs in your home folder (for this example – of course it can be elsewhere). Copy your pdf files to this folder.
  4. In a text editor, create a text file called convertmyfiles.sh in that folder, with the following contents:
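    #!/bin/bash
    # Convert every .pdf file in ~/pdfs to a UTF-8 text file next to it.
    FILES=~/pdfs/*.pdf
    for f in $FILES
    do
     echo "Processing $f file..."
     # Quote "$f" so that file names containing spaces are passed intact.
     pdftotext -enc UTF-8 "$f"
    done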
    

    (I am not providing a link because if you cannot create a text file and copy this text to it — and crucially edit it slightly for your own needs — then you probably won’t have much luck with these steps anyway.)

  5. Open the bash shell (Terminal.app or win-bash or equivalent) and execute the following:
    cd pdfs
    chmod +x convertmyfiles.sh
    ./convertmyfiles.sh
    

    Now you will have a text file (ending with .txt) for each converted pdf. These will probably need tidying up, as the conversion tends to include cruft like headers, page numbers, etc. Sometimes the layout will be mangled too, especially if you had multi-column pages. You can experiment with the conversion options listed by

    pdftotext -h
    

    Note that the script passes -enc UTF-8, so the extracted text is written in the UTF-8 (Unicode) character encoding, which is what you should be using whenever possible.
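
To make the experimenting in step 5 concrete, here is a minimal sketch that converts one file in three modes so the results can be compared side by side. It assumes your pdftotext build supports the -layout and -raw options (check the output of pdftotext -h first). The function name compare_modes and the output file names are hypothetical, chosen only for illustration.

```shell
#!/bin/bash
# Sketch: convert one PDF three ways so the outputs can be compared by eye.
# compare_modes is a hypothetical helper name, not part of xpdf.
compare_modes() {
    local pdf="$1"
    local base="${pdf%.*}"    # path without the file extension

    # Default mode: pdftotext's own guess at the reading order.
    pdftotext -enc UTF-8 "$pdf" "$base.default.txt"
    # -layout tries to keep the physical layout of the page.
    pdftotext -enc UTF-8 -layout "$pdf" "$base.layout.txt"
    # -raw keeps the text in content stream order.
    pdftotext -enc UTF-8 -raw "$pdf" "$base.raw.txt"

    # Byte counts give a rough first impression of how the outputs differ.
    wc -c "$base.default.txt" "$base.layout.txt" "$base.raw.txt"
}

# Usage (the file name is only an example):
#   compare_modes ~/pdfs/11centerpartiet2004.pdf
```

Which mode reads best depends on the document, so it is probably worth trying this on one or two typical files before converting the whole folder.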

Example: (from Terminal.app on my Mac)

Last login: Thu Jul 31 11:29:44 on ttys001
KBs-MBP13:~ kbenoit$ cd pdfs
KBs-MBP13:pdfs kbenoit$ pwd
/Users/kbenoit/pdfs
KBs-MBP13:pdfs kbenoit$ rm *txt
KBs-MBP13:pdfs kbenoit$ ls
11centerpartiet2004.pdf
11folkpartiet2004.pdf
11kristdemokraterna2004.pdf
11kristdemokraterna2004_300k.pdf
11miljopartiet_de_grone2004.pdf
13radikale_venste2004_ENGL.pdf
13socialdemokraterne2004.pdf
21Ecolo_programme_2004.pdf
21Mouvement_Reformateur_100_propositions_pour_2_Θlect_Vlaams_en_europe.PDF
21SPA_europeesprogramma2004.pdf
convertmyfiles.sh
KBs-MBP13:pdfs kbenoit$ ./convertmyfiles.sh 
Processing /Users/kbenoit/pdfs/11centerpartiet2004.pdf file...
Processing /Users/kbenoit/pdfs/11folkpartiet2004.pdf file...
Processing /Users/kbenoit/pdfs/11kristdemokraterna2004.pdf file...
Processing /Users/kbenoit/pdfs/11kristdemokraterna2004_300k.pdf file...
Processing /Users/kbenoit/pdfs/11miljopartiet_de_grone2004.pdf file...
Processing /Users/kbenoit/pdfs/13radikale_venste2004_ENGL.pdf file...
Processing /Users/kbenoit/pdfs/13socialdemokraterne2004.pdf file...
Processing /Users/kbenoit/pdfs/21Ecolo_programme_2004.pdf file...
Processing /Users/kbenoit/pdfs/21SPA_europeesprogramma2004.pdf file...
KBs-MBP13:pdfs kbenoit$ ls
11centerpartiet2004.pdf
11centerpartiet2004.txt
11folkpartiet2004.pdf
11folkpartiet2004.txt
11kristdemokraterna2004.pdf
11kristdemokraterna2004.txt
11kristdemokraterna2004_300k.pdf
11kristdemokraterna2004_300k.txt
11miljopartiet_de_grone2004.pdf
11miljopartiet_de_grone2004.txt
13radikale_venste2004_ENGL.pdf
13radikale_venste2004_ENGL.txt
13socialdemokraterne2004.pdf
13socialdemokraterne2004.txt
21Ecolo_programme_2004.pdf
21Ecolo_programme_2004.txt
21Mouvement_Reformateur_100_propositions_pour_2_Θlect_Vlaams_en_europe.PDF
21SPA_europeesprogramma2004.pdf
21SPA_europeesprogramma2004.txt
convertmyfiles.sh
KBs-MBP13:pdfs kbenoit$ 
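
Note that in the session above the file ending in .PDF (upper case) never appears in the "Processing" lines and gets no .txt file, which is consistent with the *.pdf pattern matching case-sensitively. A minimal sketch of a loop that accepts either case follows. The function name convert_any_case is hypothetical, and the default folder ~/pdfs is the one assumed throughout this post.

```shell
#!/bin/bash
# Sketch: convert PDFs whose extension is .pdf, .PDF, or any mix of case.
# convert_any_case is a hypothetical helper name; the folder defaults to ~/pdfs.
convert_any_case() {
    local dir="${1:-$HOME/pdfs}"
    local f
    # The bracket pattern matches each letter of the extension in either case.
    for f in "$dir"/*.[pP][dD][fF]
    do
        # If nothing matched, the pattern is left unexpanded; skip it.
        [ -e "$f" ] || continue
        echo "Processing $f file..."
        pdftotext -enc UTF-8 "$f"
    done
}

# Usage:
#   convert_any_case ~/pdfs
```

Because the folder is passed as a quoted argument rather than stored in an unquoted variable, this version should also cope with folder names that contain spaces.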

Update 12 November 2015 for Windows (thanks Thomas)

For Windows, one way to do this is to use Windows PowerShell ISE (Integrated Scripting Environment) in Programs/Accessories as follows:

cd mypdffolder
$FILES= ls *.pdf
foreach ($f in $FILES) {
    C:\Program` Files\xpdf\bin32\pdftotext -enc UTF-8 $f
}


8 Responses to “How to batch convert pdf files to text”

  1. Hello! If you are interested in a free and simple toolkit to convert your pdf files to text format, you might try kitpdf.com and see the results. Maybe it’s useful and it will work smoothly for you.

  2. Brian Lloyd Says:

    Thanks – your directions & script worked beautifully! Just what I needed – many thanks –

  3. Hello,
    Many thanks for this. When you say “You can experiment the conversion options available from ‘pdftotext -h’”, does that mean there’s an option to discard headers and footers during the conversion? I have hundreds of PDFs to convert and can’t manually clean the headers.
    Cheers,
    Viola

  4. No option in the version I have (pdftotext 3.03), but you could do the batch conversion and then use regular expression substitution through sed or awk to search for and replace the headers and footers. I suggest posting a reproducible example on http://www.stackoverflow.com if you need help with this, e.g. this answer to remove stop words, except that your “stop words” will be the footer and header patterns.

  5. If I work in R, is there a reason for still doing it the above way? I saw that e.g. the tm package also allows you to do it from R (although it also requires additional non-R software like the one you mention)?

    If I can choose between obtaining my documents in Word97-03 .doc format and pdfs, are pdfs easier to work with?

    Thanks!

  6. I’d do it batch style with xpdf, because this allows you to set the output encoding flag. You can also try it with a single file and see how well it works. e.g.

    pdftotext -enc UTF-8 00005802.pdf
    

    will create a UTF-8 encoded text file. You may not be able to work well in Windows R with UTF-8 (since I don’t think it lets you set this as a locale) but you can either convert the files to Latin2 (aka ISO-8859-2) which is an 8-bit encoding, or you can use iconv to convert them after they’ve been UTF-8 converted.

  7. […] files into a single plain text file. To do so, I got pdftotext, and used Ken Benoit’s instructions to convert the folder of pdfs to a plain text file. Following his instructions, I saved the following script as […]

  8. I came across an issue when iterating over files stored in the $FILES variable. It fails if the path contains whitespace or other non-standard characters, e.g. ~/pdfs/party\ manifestos/*.pdf

    I think a more robust approach would be to use the output of find:
    FOLDER=~/pdfs/party\ manifestos/
    find "$FOLDER" -name '*.pdf' | while IFS= read -r i
    do
    echo "Processing $i"
    pdftotext -enc UTF-8 "$i"
    done

    I wrote a short gist with more details:
    https://gist.github.com/tpaskhalis/214c3976ac08cb809d846e01135d9f5f
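
The post-conversion cleanup suggested in response 4 can be sketched roughly as follows. This version uses tr and grep -v rather than sed or awk, since deleting whole lines is all it attempts. The function name strip_page_furniture is hypothetical, and the header pattern is an assumption you would replace with whatever repeated line your own documents carry. Be aware that it also drops any genuine content line that consists only of a number.

```shell
#!/bin/bash
# Sketch: remove page furniture from a text file produced by pdftotext.
# strip_page_furniture is a hypothetical helper name.
#   $1: extended regular expression matching the repeated header/footer line
#   $2: the converted text file
strip_page_furniture() {
    local header_regex="$1"
    local txt="$2"
    # 1. tr deletes form feed characters (page breaks).
    # 2. The first grep drops lines that hold nothing but a page number.
    # 3. The second grep drops lines matching the header/footer pattern.
    tr -d '\f' < "$txt" \
        | grep -E -v '^[[:space:]]*[0-9]+[[:space:]]*$' \
        | grep -E -v -- "$header_regex"
}

# Usage (the pattern and file names are only examples):
#   strip_page_furniture '^Party Manifesto 2004$' manifesto.txt > manifesto.clean.txt
```

The cleaned text goes to standard output, so the original .txt file is left untouched and you can compare the two before deleting anything.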
