Download a PDF file such as https://www.irs.gov/pub/irs-pdf/f1040.pdf from your browser, or with curl:
    curl https://www.irs.gov/pub/irs-pdf/f1040.pdf > f1040.pdf
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100  146k  100  146k    0     0   961k      0 --:--:-- --:--:-- --:--:--  963k

    ls -l f1040.pdf
    -rw-r--r--  1 myname  mygroup  149958 May 26 08:02 f1040.pdf

    file f1040.pdf
    f1040.pdf: PDF document, version 1.7
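The file command above recognizes a PDF by the header at the very start of the file. Here is a minimal sketch of the same check in Python, assuming the header has the usual form %PDF-1.7. The function names pdf_version and pdf_version_of_file are my own, not part of any library.

```python
import re
from typing import Optional


def pdf_version(header: bytes) -> Optional[str]:
    """Return the version named in a PDF header such as b'%PDF-1.7', or None."""
    match = re.match(rb"%PDF-(\d+\.\d+)", header)
    if match is None:
        return None   #the bytes do not start like a PDF file
    return match.group(1).decode("ascii")


def pdf_version_of_file(path: str) -> Optional[str]:
    """Read the first few bytes of the file and look for a PDF header."""
    with open(path, "rb") as pdfFile:   #read binary
        return pdf_version(pdfFile.read(16))
```

For example, pdf_version_of_file("f1040.pdf") should return a version string for the file downloaded above, and None for a file that is not a PDF (an HTML error page saved under the wrong name, for instance).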
Install pdfminer for Python 2, or pdfminer.six for Python 3.
    pip3 install pdfminer.six
    pip3 list
    pip3 show pdfminer.six

    which pdf2txt.py
    /Library/Frameworks/Python.framework/Versions/3.7/bin/pdf2txt.py

    pdf2txt.py --help
    dumppdf.py
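The same installation check can be made from inside Python. This is a sketch only: it assumes Python 3.8 or later (for importlib.metadata), and describe_install is a name I made up for illustration.

```python
import importlib.metadata
import shutil
from typing import Dict, Optional


def describe_install(distribution: str, script: str) -> Dict[str, Optional[str]]:
    """Report the installed version of a distribution and where its script lives."""
    try:
        version = importlib.metadata.version(distribution)
    except importlib.metadata.PackageNotFoundError:
        version = None   #pip has not installed this distribution
    #shutil.which searches the PATH, like the shell's which command.
    return {"version": version, "script": shutil.which(script)}
```

Calling describe_install("pdfminer.six", "pdf2txt.py") returns a dictionary whose values are None for anything that could not be found.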
In the directory that holds your downloaded f1040.pdf, create a new file f1040.txt and examine it with TextEdit.app.
    pdf2txt.py -o f1040.txt f1040.pdf

    ls -l f1040.txt
    -rw-r--r--  1 myname  mygroup  5362 May 26 08:05 f1040.txt

    file f1040.txt
    f1040.txt: UTF-8 Unicode text
In the directory that holds your downloaded f1040.pdf, create a new file f1040.html and examine it with your browser. In my Chrome browser, I pulled down File → Open File…
    pdf2txt.py -o f1040.html f1040.pdf

    ls -l f1040.html
    -rw-r--r--  1 myname  mygroup  121924 May 26 08:21 f1040.html

    file f1040.html
    f1040.html: HTML document text, UTF-8 Unicode text, with very long lines
You can also say -o f1040.xml instead of -o f1040.html.

    f1040.xml: XML 1.0 document text, UTF-8 Unicode text, with very long lines
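The commands above never name an output format: pdf2txt.py appears to choose it from the extension of the -o filename. Below is a sketch of that rule as I understand it. The suffix table is my assumption about the script's behavior, and guess_output_type and pdf2txt_command are hypothetical helpers of my own. Check pdf2txt.py --help on your installation to confirm the -t option before relying on it.

```python
from typing import List

#Assumed mapping from output filename suffix to output format.
#Anything else is treated as plain text.
OUTPUT_TYPES = (
    (".htm", "html"),
    (".html", "html"),
    (".xml", "xml"),
    (".tag", "tag"),
)


def guess_output_type(outfile: str) -> str:
    """Guess the output format from the output filename."""
    for suffix, outputType in OUTPUT_TYPES:
        if outfile.endswith(suffix):
            return outputType
    return "text"


def pdf2txt_command(pdfPath: str, outfile: str) -> List[str]:
    """Build a pdf2txt.py command line that states the format explicitly."""
    return ["pdf2txt.py", "-t", guess_output_type(outfile), "-o", outfile, pdfPath]
```

The resulting list can be handed to subprocess.run if you want to drive the script from Python instead of from the shell.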
Without the laparams argument, each page was one big line of text.
    "Convert a PDF file to text and print it."

    import sys
    import io
    import pdfminer.pdfinterp
    import pdfminer.converter
    import pdfminer.layout   #for LAParams
    import pdfminer.pdfpage

    try:
        pdfFile = open("f1040.pdf", "rb")   #read binary
    except OSError:
        print(sys.exc_info())
        sys.exit(1)

    resourceManager = pdfminer.pdfinterp.PDFResourceManager()
    stringFile = io.StringIO()   #the extracted text accumulates here
    layoutParameters = pdfminer.layout.LAParams(line_margin = 0.1)
    textConverter = pdfminer.converter.TextConverter(resourceManager, stringFile,
        laparams = layoutParameters)
    pageInterpreter = pdfminer.pdfinterp.PDFPageInterpreter(resourceManager, textConverter)

    try:
        pages = pdfminer.pdfpage.PDFPage.get_pages(pdfFile, caching = True,
            check_extractable = True)
        for page in pages:
            pageInterpreter.process_page(page)
        oneBigString = stringFile.getvalue()
    except Exception:
        print(sys.exc_info())
        sys.exit(1)
    finally:
        pdfFile.close()
        textConverter.close()
        stringFile.close()

    if len(oneBigString) == 0:
        sys.exit(1)

    print(oneBigString)

    #Or print the text line by line:
    #for i, line in enumerate(oneBigString.splitlines(), start = 1):
    #    print(i, line)

    sys.exit(0)
m1040 Department of the Treasury—Internal Revenue Service U.S. Individual Income Tax Return 2018 OMB No. 1545-0074 etc.
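The commented-out loop at the end of the program numbers every line, including the blank ones. As a small variation, here is a sketch that can skip the blank lines. The function numbered_lines is my own illustration and not part of pdfminer.

```python
from typing import Iterator, Tuple


def numbered_lines(text: str, skipBlank: bool = True) -> Iterator[Tuple[int, str]]:
    """Yield (number, line) pairs, optionally leaving out blank lines."""
    number = 0
    for line in text.splitlines():
        if skipBlank and not line.strip():
            continue   #don't count or yield a blank line
        number += 1
        yield number, line


sample = "U.S. Individual Income Tax Return\n\n2018\nOMB No. 1545-0074\n"
for i, line in numbered_lines(sample):
    print(i, line)
```

In the program above you would pass oneBigString instead of the sample string.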