File: composable.rst

package info (click to toggle)
pdfminer 20200726-1
  • links: PTS, VCS
  • area: main
  • in suites: bullseye
  • size: 15,716 kB
  • sloc: python: 13,377; xml: 423; makefile: 95; sh: 3
file content (33 lines) | stat: -rw-r--r-- 1,183 bytes parent folder | download | duplicates (3)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
.. _tutorial_composable:

Extract text from a PDF using Python - part 2
*********************************************

The command line tools and the high-level API are just shortcuts for often
used combinations of pdfminer.six components. You can use these components to
modify pdfminer.six to your own needs.

For example, to extract the text from a PDF file and save it in a python
variable::

    from io import StringIO

    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser

    output_string = StringIO()
    with open('samples/simple1.pdf', 'rb') as in_file:
        parser = PDFParser(in_file)
        doc = PDFDocument(parser)
        rsrcmgr = PDFResourceManager()
        device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)

    print(output_string.getvalue())