How to batch convert pdf files to text

Frequently I am asked: I have a bunch of pdf files, how can I convert them to plain text so that I can analyze them using quantitative techniques? Here is my recommendation.

  1. Download the xpdf suite of tools for your platform. This includes the part we will use, pdftotext.
    Alternatives are the Apache PDFBox Java pdf library and the Python-based PDFMiner.
  2. [Windows only – Mac and Linux/Unix have this built in to the Terminal or shell already]: You will need a bash shell for your platform. (It is possible to do what I suggest below using the Windows shell, but it’s been so long since I programmed in the Windows DOS/command line script language that I won’t even attempt it now.) The main options seem to be win-bash and Cygwin.
  3. Create a folder called pdfs in your home folder (for this example – of course it can be elsewhere). Copy your pdf files to this folder.
  4. In a text editor, create a text file with the following contents (adjust the FILES path to match the location of your pdfs folder):
    #!/bin/bash
    FILES=~/pdfs/*.pdf
    for f in $FILES
    do
      echo "Processing $f file..."
      pdftotext -enc UTF-8 $f
    done

    (I am not providing a link because if you cannot create a text file and copy this text to it — and crucially edit it slightly for your own needs — then you probably won’t have much luck with these steps anyway.)

  5. Open the bash shell (or win-bash or equivalent) and execute the following:
    cd pdfs
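Then run the script from step 4. A minimal sketch of the make-executable-and-run pattern (the file names here are illustrations, not the real script):

```shell
# In real use, with the step-4 script saved as (say) ~/convertpdfs.sh:
#   chmod +x ~/convertpdfs.sh   # make it executable, once
#   ~/convertpdfs.sh            # run it (or: bash ~/convertpdfs.sh)
# Self-contained demo of the same pattern with a stand-in script:
tmp=$(mktemp -d)
printf '#!/bin/sh\necho "Processing demo.pdf file..."\n' > "$tmp/convert.sh"
chmod +x "$tmp/convert.sh"    # without this, ./convert.sh gives "Permission denied"
"$tmp/convert.sh"             # prints: Processing demo.pdf file...
rm -rf "$tmp"
```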

    Now you will have a set of text files (ending in .txt) converted as a set. These will probably need tidying up, as the conversion tends to include cruft like headers, page numbers, etc. Sometimes the layout will be mangled too, especially if you had multi-column pages. You can experiment with the conversion options listed by

    pdftotext -h

    Note that in the script provided, the extracted text is given a UTF-8 (Unicode) character encoding, which is what you should be using whenever possible.
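As an aside, pdftotext names each output file by replacing the .pdf extension with .txt, so when scripting any later clean-up you can derive the output name in the shell. A small sketch using bash parameter expansion:

```shell
# pdftotext writes file.txt next to file.pdf by default;
# the same name can be computed for post-processing steps:
f="11centerpartiet2004.pdf"
txt="${f%.pdf}.txt"   # strip the .pdf suffix, append .txt
echo "$txt"           # -> 11centerpartiet2004.txt
```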

Example (from my Mac):

Last login: Thu Jul 31 11:29:44 on ttys001
KBs-MBP13:~ kbenoit$ cd pdfs
KBs-MBP13:pdfs kbenoit$ pwd
KBs-MBP13:pdfs kbenoit$ rm *txt
KBs-MBP13:pdfs kbenoit$ ls
KBs-MBP13:pdfs kbenoit$ ./ 
Processing /Users/kbenoit/pdfs/11centerpartiet2004.pdf file...
Processing /Users/kbenoit/pdfs/11folkpartiet2004.pdf file...
Processing /Users/kbenoit/pdfs/11kristdemokraterna2004.pdf file...
Processing /Users/kbenoit/pdfs/11kristdemokraterna2004_300k.pdf file...
Processing /Users/kbenoit/pdfs/11miljopartiet_de_grone2004.pdf file...
Processing /Users/kbenoit/pdfs/13radikale_venste2004_ENGL.pdf file...
Processing /Users/kbenoit/pdfs/13socialdemokraterne2004.pdf file...
Processing /Users/kbenoit/pdfs/21Ecolo_programme_2004.pdf file...
Processing /Users/kbenoit/pdfs/21SPA_europeesprogramma2004.pdf file...
KBs-MBP13:pdfs kbenoit$ ls
KBs-MBP13:pdfs kbenoit$ 

Update 12 November 2015 for Windows (thanks Thomas)

For Windows, one way to do this is to use the Windows PowerShell ISE (Integrated Scripting Environment), found in Programs/Accessories, as follows:

cd mypdffolder
$FILES = ls *.pdf
foreach ($f in $FILES) {
    C:\Program` Files\xpdf\bin32\pdftotext -enc UTF-8 $f
}


  • Christian

    Hello! If you are interested in a free and simple toolkit to convert your pdf files to text format, you might try and see the results. Maybe it’s useful and it will work smoothly for you.

  • Brian Lloyd

    Thanks – your directions & script worked beautifully! Just what I needed – many thanks –

  • Viola

    Many thanks for this. When you say “You can experiment with the conversion options available from ‘pdftotext -h'”, does that mean there’s an option to discard headers and footers during the conversion? I have hundreds of PDFs to convert and can’t manually clean the headers.

  • Ken

    No option in the version I have (pdftotext 3.03) but you could do the batch conversion and then use regular expression substitution through sed or awk to search for and replace the headers and footers. I suggest posting a reproducible example on if you need help with this, e.g. this answer to remove stop words except your “stop words” will be the footer and header patterns.
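    For instance, if every page carries the same running header plus a bare page number, a sed pass over the converted text can strip both. A sketch with made-up patterns — substitute whatever header text your own PDFs actually contain:

```shell
# Delete lines that are exactly the (hypothetical) running header,
# and lines consisting of nothing but a page number:
printf 'Party Manifesto 2004\nReal content line\n17\n' |
  sed -e '/^Party Manifesto 2004$/d' -e '/^[0-9][0-9]*$/d'
# prints only: Real content line
```

    In practice you would run the same sed command over each .txt file rather than a printf pipe.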

  • Thomas

    If I work in R, is there a reason for still doing it the above way? I saw that e.g. the tm package also allows you to do it from R (although it also requires additional non-R software, like the one you mention)?

    If I can choose between obtaining my documents in Word97-03 .doc format and pdfs, are pdfs easier to work with?


  • Ken

    I’d do it batch style with xpdf, because this allows you to set the output encoding flag. You can also try it with a single file and see how well it works. e.g.

    pdftotext -enc UTF-8 00005802.pdf

    will create a UTF-8 encoded text file. You may not be able to work well in Windows R with UTF-8 (since I don’t think it lets you set this as a locale) but you can either convert the files to Latin2 (aka ISO-8859-2) which is an 8-bit encoding, or you can use iconv to convert them after they’ve been UTF-8 converted.
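    The iconv step can be sketched as a round trip — a quick way to check that your text survives the 8-bit encoding. The Swedish party name here is only an illustration; all of its characters happen to exist in Latin-2:

```shell
# Round-trip: UTF-8 -> ISO-8859-2 (Latin-2) -> UTF-8.
# If the output matches the input, the text converts losslessly.
orig='Miljöpartiet'
back=$(printf '%s' "$orig" | iconv -f UTF-8 -t ISO-8859-2 | iconv -f ISO-8859-2 -t UTF-8)
echo "$back"
# For a whole file you would instead run:
#   iconv -f UTF-8 -t ISO-8859-2 input.txt > output-latin2.txt
```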

  • Pingback: Bashing my head against the command line | Anne Donlon

  • Tom

    I came across an issue when iterating over the files stored in the $FILES variable. It fails if the path contains whitespace or other non-standard characters, e.g. ~/pdfs/party\ manifestos/*.pdf

    I think a more robust approach would be to use the output of find:
    FOLDER=~/pdfs/party\ manifestos/
    find "$FOLDER" -name '*.pdf' | while read i; do
      echo "Processing $i"
      pdftotext -enc UTF-8 "$i"
    done

    I wrote a short gist with more details:
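    An even more defensive variant of this loop — a sketch, requiring bash rather than plain sh — uses find’s -print0 with a null-delimited read, which also survives newlines and quotes in filenames:

```shell
# -print0 emits NUL-separated paths; read -d '' consumes them,
# so spaces, quotes, and even newlines in names are all handled.
FOLDER=~/pdfs
find "$FOLDER" -name '*.pdf' -print0 |
while IFS= read -r -d '' f; do
    echo "Processing $f"
    pdftotext -enc UTF-8 "$f"
done
```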