pdfminer: mine a PDF file

Get a PDF file

Download a PDF file such as https://www.irs.gov/pub/irs-pdf/f1040.pdf with your browser or with curl:

curl https://www.irs.gov/pub/irs-pdf/f1040.pdf > f1040.pdf
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  146k  100  146k    0     0   961k      0 --:--:-- --:--:-- --:--:--  963k

ls -l f1040.pdf
-rw-r--r--  1 myname  mygroup  149958 May 26 08:02 f1040.pdf

file f1040.pdf
f1040.pdf: PDF document, version 1.7
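
The same check can be made from Python. This is only a minimal sketch, assuming the file begins with the usual %PDF-x.y header line. The function name pdf_version is my own, not part of pdfminer.

```python
import re

def pdf_version(header):
    """Return the version from a PDF header such as b'%PDF-1.7', or None."""
    match = re.match(rb"%PDF-(\d+\.\d+)", header)
    if match is None:
        return None   # the bytes do not start like a PDF file
    return match.group(1).decode("ascii")
```

To try it on the downloaded file, open f1040.pdf in "rb" mode and pass the first few bytes, e.g. pdf_version(pdfFile.read(16)).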

Install pdfminer and the command line programs

Install pdfminer for Python 2, or pdfminer.six for Python 3.

pip3 install pdfminer.six
pip3 list
pip3 show pdfminer.six

which pdf2txt.py
/Library/Frameworks/Python.framework/Versions/3.7/bin/pdf2txt.py

pdf2txt.py --help
dumppdf.py

Run pdf2txt.py from the command line

Translate PDF to text

In the directory that holds your downloaded f1040.pdf, create a new file f1040.txt and examine it with TextEdit.app.

pdf2txt.py -o f1040.txt f1040.pdf

ls -l f1040.txt
-rw-r--r--  1 myname  mygroup  5362 May 26 08:05 f1040.txt

file f1040.txt
f1040.txt: UTF-8 Unicode text

Translate PDF to HTML or to XML

In the directory that holds your downloaded f1040.pdf, create a new file f1040.html and examine it with your browser. In my Chrome browser, I pulled down File → Open File…

pdf2txt.py -o f1040.html f1040.pdf

ls -l f1040.html
-rw-r--r--  1 myname  mygroup  121924 May 26 08:21 f1040.html

file f1040.html
f1040.html: HTML document text, UTF-8 Unicode text, with very long lines

You can also say -o f1040.xml instead of -o f1040.html to get XML. The file command then reports:

f1040.xml: XML 1.0 document text, UTF-8 Unicode text, with very long lines
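
The commands above let pdf2txt.py choose the output format from the suffix of the -o file name. If you drive pdf2txt.py from a Python script, you may prefer to state the format explicitly with its -t option. Here is a minimal sketch. The OUTPUT_TYPES table and the function name pdf2txt_command are my own illustrative names, not part of pdfminer.

```python
import pathlib

# Assumed mapping from output file suffix to the value given to pdf2txt.py -t.
OUTPUT_TYPES = {".txt": "text", ".htm": "html", ".html": "html", ".xml": "xml"}

def pdf2txt_command(pdf_path, out_path):
    """Build the pdf2txt.py argument list that converts pdf_path to out_path."""
    suffix = pathlib.Path(out_path).suffix.lower()
    output_type = OUTPUT_TYPES.get(suffix, "text")   # fall back to plain text
    return ["pdf2txt.py", "-t", output_type, "-o", str(out_path), str(pdf_path)]
```

To run the conversion, pass the list to subprocess.run(command, check = True). This assumes pdf2txt.py is on your PATH, as which pdf2txt.py showed above.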

Python code to convert PDF to text

Without the laparams argument to TextConverter, each page came out as one big line of text.

"Convert a PDF file to text and print it."

import sys
import io

import pdfminer.pdfinterp
import pdfminer.converter
import pdfminer.layout    #needed for pdfminer.layout.LAParams below
import pdfminer.pdfpage

try:
    pdfFile = open("f1040.pdf", "rb")   #read binary
except OSError as error:   #file missing or unreadable
    print(error, file = sys.stderr)
    sys.exit(1)

resourceManager = pdfminer.pdfinterp.PDFResourceManager()
stringFile = io.StringIO()
layoutParameters = pdfminer.layout.LAParams(line_margin = 0.1)
textConverter = pdfminer.converter.TextConverter(resourceManager, stringFile, laparams = layoutParameters)
pageInterpreter = pdfminer.pdfinterp.PDFPageInterpreter(resourceManager, textConverter)

try:
    pages = pdfminer.pdfpage.PDFPage.get_pages(pdfFile, caching = True, check_extractable = True)

    for page in pages:
        pageInterpreter.process_page(page)

    oneBigString = stringFile.getvalue()
except Exception as error:   #any failure while parsing or converting the PDF
    print(repr(error), file = sys.stderr)
    sys.exit(1)
finally:
    pdfFile.close()
    textConverter.close()
    stringFile.close()

if len(oneBigString) == 0:
    sys.exit(1)

print(oneBigString)
#Or print the text line by line:
#for i, line in enumerate(oneBigString.splitlines(), start = 1):
#    print(i, line)

sys.exit(0)

m1040 Department of the Treasury—Internal Revenue Service

U.S. Individual Income Tax Return  2018 OMB No. 1545-0074
etc.
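
The output above has blank lines between the lines of text. If you want the numbered listing suggested in the comments at the end of the script, but without the blank lines, something like the following might do. It is only a sketch, and the function name numbered_lines is my own.

```python
def numbered_lines(text):
    """Return the non-blank lines of text, each prefixed with a running number."""
    stripped = (line.strip() for line in text.splitlines())
    nonBlank = [line for line in stripped if line]   # drop empty lines
    return [f"{i} {line}" for i, line in enumerate(nonBlank, start = 1)]
```

In the script, you would call it as print("\n".join(numbered_lines(oneBigString))).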