# Extract Text from a PDF
You can extract text from a PDF:
```{testsetup}
pypdf_test_setup("user/extract-text", {
"test Orient.pdf": "../resources/test Orient.pdf",
"GeoBase_NHNC1_Data_Model_UML_EN.pdf": "../resources/GeoBase_NHNC1_Data_Model_UML_EN.pdf",
})
```
```{testcode}
from pypdf import PdfReader
reader = PdfReader("test Orient.pdf")
page = reader.pages[0]
print(page.extract_text())
# extract only text oriented up
print(page.extract_text(0))
# extract text oriented up and turned left
print(page.extract_text((0, 90)))
# extract text in a fixed width format that closely adheres to the rendered
# layout in the source pdf
print(page.extract_text(extraction_mode="layout"))
# extract text preserving horizontal positioning without excess vertical
# whitespace (removes blank and "whitespace only" lines)
print(page.extract_text(extraction_mode="layout", layout_mode_space_vertically=False))
# adjust horizontal spacing
print(page.extract_text(extraction_mode="layout", layout_mode_scale_weight=1.0))
# exclude (default) or include (as shown below) text rotated w.r.t. the page
print(page.extract_text(extraction_mode="layout", layout_mode_strip_rotated=False))
```
```{testoutput}
:options: +NORMALIZE_WHITESPACE
:hide:
(T) This is box text at top
written down from top
(B) This is box text at bottom written up from bottom
(L) This is box text on left written vertically to starboard
(R) This is box text on righy written vertically to port
(T) This is box text at top
written down from top
(T) This is box text at top
written down from top
(L) This is box text on left written vertically to starboard
(B)
This is box text at bottom
from bottom upwritten
(T) This is box text at top
written down from top
(B)
This is box text at bottom
from bottom upwritten
(T) This is box text at top
written down from top
(B)
This is box text at bottom
from bottom upwritten
(T) This is box text at top
written down from top
(B)
This is box text at bottom
from bottom upwritten
(L) This is box textwritten vertically to starboard
on righy
on left
) This is box text
written vertically to port (R
(T) This is box text at top
written down from top
```
Refer to {func}`~pypdf._page.PageObject.extract_text` for more details.
```{note}
Extracting the text of a page requires parsing its whole content stream. This can require quite a lot of memory -
we have seen 10 GB RAM being required for an uncompressed content stream of about 300 MB (which should not occur
very often).
To limit the size of the content streams to process (and avoid OOM errors in your application), consider
checking `len(page.get_contents().get_data())` beforehand.
```
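The size check suggested in the note above could be wrapped in a small guard. This is a minimal sketch and not part of pypdf: the helper names and the 50 MB default limit are assumptions chosen for illustration, and the code assumes that `get_contents()` may return `None` for a page without a content stream.

```python
# Hypothetical limit; tune it to the memory available to your application.
MAX_CONTENT_BYTES = 50 * 1024 * 1024


def content_stream_size(page):
    """Return the size in bytes of the page's decoded content stream."""
    contents = page.get_contents()
    if contents is None:  # page without a content stream
        return 0
    return len(contents.get_data())


def extract_text_if_small(page, limit=MAX_CONTENT_BYTES):
    """Extract text only when the content stream is at most `limit` bytes.

    Returns None when the page is skipped.
    """
    if content_stream_size(page) > limit:
        return None
    return page.extract_text()
```

With a `PdfReader`, this would be called as `extract_text_if_small(reader.pages[0])`; a `None` result signals that the page was skipped.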
```{note}
If a PDF page appears to contain only an image (e.g., a scanned document), the extracted text may be minimal or visually empty.
In such cases, consider using OCR software such as [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) to extract text from images.
```
## Using a visitor
You can use visitor functions to control which part of a page you want to process and extract. The visitor functions
you provide will get called for each operator or for each text fragment.
The function passed as the `visitor_text` argument of `extract_text` receives five arguments:
* text: the current text (as long as possible, can be up to a full line)
* user_matrix: current matrix to move from user coordinate space (also known as CTM)
* tm_matrix: current matrix from text coordinate space
* font_dictionary: full font dictionary
* font_size: the size (in text coordinate space)
The matrix stores six parameters. The first four provide the rotation/scaling matrix, and the last two provide the translation (horizontal/vertical).
It is recommended to use the user_matrix as it takes into account all transformations.
Notes:
- As indicated in §8.3.3 of the PDF 1.7 or PDF 2.0 specification, the user matrix applies to text space/image space/form space/pattern space.
- If you want to get the full transformation from text to user space, you can use the {func}`~.pypdf.mult` function as follows:
`txt2user = mult(tm, cm)`.
The font size is the raw text size and is affected by the `user_matrix`.
The `font_dictionary` may be `None` in case of unknown fonts.
If not `None`, it could contain something like the key `"/BaseFont"` with the value `"/Arial,Bold"`.
**Caveat**: In complicated documents, the calculated positions may be difficult to determine (if you move from multiple forms to page user space, for example).
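To make the six-parameter layout concrete, the sketch below composes two matrices and maps a point through the result. It is a standalone illustration, assuming the PDF convention that `[a, b, c, d, e, f]` maps `(x, y)` to `(a*x + c*y + e, b*x + d*y + f)`. The helper names `compose` and `apply` are made up for this example and are not pypdf functions.

```python
def compose(m, n):
    """Return the matrix that applies m first and n second."""
    return [
        m[0] * n[0] + m[1] * n[2],
        m[0] * n[1] + m[1] * n[3],
        m[2] * n[0] + m[3] * n[2],
        m[2] * n[1] + m[3] * n[3],
        m[4] * n[0] + m[5] * n[2] + n[4],
        m[4] * n[1] + m[5] * n[3] + n[5],
    ]


def apply(matrix, x, y):
    """Map the point (x, y) through a six-parameter matrix."""
    a, b, c, d, e, f = matrix
    return a * x + c * y + e, b * x + d * y + f


# Illustrative values only: a text matrix that translates by (10, 20)
# and a user matrix that scales by 2 and then translates by (5, 5).
tm = [1, 0, 0, 1, 10, 20]
cm = [2, 0, 0, 2, 5, 5]

txt2user = compose(tm, cm)
print(txt2user)                # [2, 0, 0, 2, 25, 45]
print(apply(txt2user, 0, 0))   # (25, 45)
```

The last two entries of the composed matrix are where the text-space origin ends up, which is why index 4 and 5 are the ones usually read in a visitor.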
The function passed as the `visitor_operand_before` argument receives four arguments:
operator, operand-arguments, current transformation matrix, and text matrix.
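As a first step when exploring an unfamiliar page, an operator visitor can simply tally what it sees. The following is a minimal sketch; `operator_counts` and `count_operators` are names invented for this example.

```python
from collections import Counter

# Filled in by the visitor; maps operator (bytes) to number of occurrences.
operator_counts = Counter()


def count_operators(op, args, cm, tm):
    """Visitor for `visitor_operand_before`: tally each operator seen."""
    operator_counts[op] += 1
```

Passing it as `page.extract_text(visitor_operand_before=count_operators)` would fill `operator_counts` with keys such as `b"re"`, which can help to decide which operators are worth handling.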
### Example 1: Ignore header and footer
The following example reads the text of page four of [this PDF document](https://github.com/py-pdf/pypdf/blob/main/resources/GeoBase_NHNC1_Data_Model_UML_EN.pdf), but ignores the header (y > 720) and footer (y < 50). In this file, we also need to keep the newline characters, which are reported with y == 0.
```{testcode}
from pypdf import PdfReader
reader = PdfReader("GeoBase_NHNC1_Data_Model_UML_EN.pdf")
page = reader.pages[3]
parts = []
def visitor_body(text, cm, tm, font_dict, font_size):
    # tm[5] is the vertical translation of the text matrix
    y = tm[5]
    if 50 < y < 720 or y == 0:
        parts.append(text)
page.extract_text(visitor_text=visitor_body)
text_body = "".join(parts)
print(text_body)
```
```{testoutput}
:options: +NORMALIZE_WHITESPACE
:hide:
TABLE OF CONTENTS
1 OVERVIEW ............................................................................................................................................ 6
2 LRS ........................................................................................................................................................ 6
2.1 LRS MODEL ...................................................................................................................................... 7
3 MODEL .................................................................................................................................................. 8
3.1 LRS MODEL ...................................................................................................................................... 9
3.1.1 Logical view ............................................................................................................................... 9
3.1.2 Hydro network.......................................................................................................................... 10
3.1.3 Hydro events............................................................................................................................ 11
3.1.4 Hydrographic ........................................................................................................................... 14
3.1.5 Toponymy (external package) ................................................................................................. 18
3.1.6 Metadata .................................................................................................................................. 19
```
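If the same kind of filter is needed with different limits, the visitor from Example 1 could be produced by a small factory. This is only a sketch: `make_band_visitor` and its `keep_zero` parameter are hypothetical names, and the logic mirrors the condition used above rather than anything built into pypdf.

```python
def make_band_visitor(y_min, y_max, keep_zero=True):
    """Return a text visitor and the list it appends matching text to."""
    parts = []

    def visitor(text, cm, tm, font_dict, font_size):
        y = tm[5]
        # Keep text inside the vertical band; optionally keep y == 0 too,
        # which is where newline characters showed up in Example 1.
        if y_min < y < y_max or (keep_zero and y == 0):
            parts.append(text)

    return visitor, parts
```

It would be used as `visitor, parts = make_band_visitor(50, 720)` followed by `page.extract_text(visitor_text=visitor)` and `"".join(parts)`.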
### Example 2: Extract rectangles and texts into an SVG file
The following example converts page three of [this PDF document](https://github.com/py-pdf/pypdf/blob/main/resources/GeoBase_NHNC1_Data_Model_UML_EN.pdf) into
an [SVG file](https://en.wikipedia.org/wiki/Scalable_Vector_Graphics).
Such an SVG export may help to understand what is going on in a page.
% We prefer not to execute doc examples for unmaintained third-party package "svgwrite"
```{testcode}
:skipif: True
from pypdf import PdfReader
import svgwrite
reader = PdfReader("GeoBase_NHNC1_Data_Model_UML_EN.pdf")
page = reader.pages[2]
dwg = svgwrite.Drawing("GeoBase_test.svg", profile="tiny")
def visitor_svg_rect(op, args, cm, tm):
    # "re" is the PDF operator that appends a rectangle to the path
    if op == b"re":
        (x, y, w, h) = (args[i].as_numeric() for i in range(4))
        dwg.add(dwg.rect((x, y), (w, h), stroke="red", fill_opacity=0.05))
def visitor_svg_text(text, cm, tm, font_dict, font_size):
    (x, y) = (cm[4], cm[5])
    dwg.add(dwg.text(text, insert=(x, y), fill="blue"))
page.extract_text(
visitor_operand_before=visitor_svg_rect, visitor_text=visitor_svg_text
)
dwg.save()
```
The SVG generated here is bottom-up because the coordinate systems of PDF and SVG differ.
Unfortunately, in complicated PDF documents the coordinates given to the visitor functions may be wrong.
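One way to compensate for the bottom-up result is to flip the y axis before drawing. The sketch below assumes that the page height in PDF units is known (for example from `page.mediabox.height`), that widths and heights are positive, and that no further transformation is involved. The helper names are hypothetical.

```python
def flip_point(x, y, page_height):
    """Map a PDF point (origin bottom-left) to SVG (origin top-left)."""
    return x, page_height - y


def flip_rect(x, y, w, h, page_height):
    """Map a PDF rectangle, given by its lower-left corner, to the
    upper-left-corner convention used by SVG."""
    return x, page_height - (y + h), w, h


print(flip_point(10, 700, 792))          # (10, 92)
print(flip_rect(10, 700, 100, 50, 792))  # (10, 42, 100, 50)
```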
## Why Text Extraction is hard
### Unclear Objective
Extracting text from a PDF can be tricky. In several cases, there is no
clear answer to what the expected result should look like:
1. **Paragraphs**: Should the text of a paragraph have line breaks at the same places
where the original PDF had them or should it rather be one block of text?
2. **Page numbers**: Should they be included in the extract?
3. **Headers and Footers**: Similar to page numbers - should they be extracted?
4. **Outlines**: Should outlines be extracted at all?
5. **Formatting**: If the text is **bold** or *italic*, should it be included in the
output?
6. **Tables**: Should the text extraction skip tables? Should it extract just the
text? Should the borders be shown in some Markdown-like way or should the
structure be present e.g. as an HTML table? How would you deal with merged
cells?
7. **Captions**: Should image and table captions be included?
8. **Ligatures**: The Unicode symbol [U+FB00](https://www.compart.com/de/unicode/U+FB00)
is a single symbol ﬀ for two lowercase letters 'f'. Should that be parsed as
the Unicode symbol 'ﬀ' or as two ASCII symbols 'ff'?
9. **SVG images**: Should the text parts be extracted?
10. **Mathematical Formulas**: Should they be extracted? Formulas have indices
and nested fractions.
11. **Whitespace characters**: How many new lines should be extracted for 3 cm of
vertical whitespace? How many spaces should be extracted if there is 3 cm of
horizontal whitespace? When would you extract tabs and when spaces?
12. **Footnotes**: When the text of multiple pages is extracted, where should footnotes be shown?
13. **Hyperlinks and Metadata**: Should they be extracted at all? Where should they
be placed, and in which format?
14. **Linearization**: Assume you have a floating figure in the middle of a paragraph.
Do you first finish the paragraph, or do you put the figure text in between?
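Some of these choices can be made after extraction. For the ligature question, for instance, one possible post-processing step is Unicode compatibility normalization from the Python standard library. This is a sketch of one option, not a recommendation: NFKC also rewrites other compatibility characters (such as superscript digits or full-width letters), which may or may not be wanted. The function name `expand_ligatures` is made up for this example.

```python
import unicodedata


def expand_ligatures(text):
    """Replace compatibility characters such as U+FB00 by their
    plain equivalents using NFKC normalization."""
    return unicodedata.normalize("NFKC", text)


print(expand_ligatures("o\ufb00ice"))  # office
```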
Then there are issues where most people would agree on the correct output, but
the way PDF stores information just makes it hard to achieve that:
1. **Tables**: Typically, tables are just absolutely positioned text. In the worst
case, every single letter could be absolutely positioned. That makes it hard
to tell where columns / rows are.
2. **Images**: Sometimes PDFs do not contain the text as it is displayed, but
instead an image. You notice that when you cannot copy the text. Then there
are PDF files that contain an image and a text layer in the background.
That typically happens when a document was scanned. Although the scanning
software (OCR) is pretty good today, it still fails once in a while. pypdf
is no OCR software; it will not be able to detect those failures. pypdf
will also never be able to extract text from images.
Finally, there are issues that pypdf will deal with. If you find such a
text extraction bug, please share the PDF with us so we can work on it!
### Missing Semantic Layer
The PDF file format is all about producing the desired visual result for
printing. It was not created for parsing the content. PDF files don't contain a
semantic layer.
Specifically, there is no information about what the header, footer, page numbers,
tables, and paragraphs are. The visual appearance is there, and people might
find heuristics to make educated guesses, but there is no way of being certain.
This is a shortcoming of the PDF file format, not of pypdf.
It is possible to apply machine learning on PDF documents to make good
heuristics, but that will not be part of pypdf. However, pypdf could be used to
feed such a machine learning system with the relevant information.
### Whitespaces
The PDF format is meant for printing. It is not designed to be read by machines.
The text within a PDF document is absolutely positioned, meaning that every single
character could be positioned individually on the page.
The text
> This is a test document by Ethan Nelson.
can be represented as
> [(This is a )9(te)-3(st)9( do)-4(cu)13(m)-4(en)12(t )-3(b)3(y)-3( )9(Et)-2(h)3(an)4( Nels)13(o)-5(n)3(.)] TJ
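where the numbers adjust the position of the following text along the writing
direction (horizontally for ordinary left-to-right text), expressed in thousandths
of a unit of text space. This representation used within the PDF file makes it
very hard to guarantee correct whitespace.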
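The following sketch shows why this is a guessing game. It tokenizes a `TJ` array, concatenates the strings, and inserts a space only when an adjustment opens a gap wider than a threshold. The tokenizer is deliberately simplified (no escaped or nested parentheses, no hex strings), the function name is invented, and the threshold of 200 is an arbitrary assumption; this is not how pypdf decides on whitespace.

```python
import re

# Simplified tokenizer: literal strings without nested or escaped
# parentheses, and plain numbers. Hex strings are not handled.
TOKEN = re.compile(r"\(([^()]*)\)|(-?\d+(?:\.\d+)?)")


def tj_to_text(tj_array, gap_threshold=200.0):
    """Join the strings of a TJ array, guessing spaces from large gaps."""
    pieces = []
    for string, number in TOKEN.findall(tj_array):
        if number:
            # Negative adjustments widen the gap in horizontal writing mode.
            if -float(number) > gap_threshold:
                pieces.append(" ")
        else:
            pieces.append(string)
    return "".join(pieces)


sample = (
    "[(This is a )9(te)-3(st)9( do)-4(cu)13(m)-4(en)12(t )-3(b)3(y)-3( )9(Et)"
    "-2(h)3(an)4( Nels)13(o)-5(n)3(.)] TJ"
)
print(tj_to_text(sample))  # This is a test document by Ethan Nelson.
```

In this sample the spaces are stored inside the strings and all adjustments are small kerning values, so no space is guessed. A producer that writes `[(Hello)-300(World)] TJ` instead relies entirely on the gap, and the result then depends on the chosen threshold.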
More information:
* [issue #1507](https://github.com/py-pdf/pypdf/issues/1507)
* [Negative numbers in PDF content stream text object](https://stackoverflow.com/a/28203655/562769)
* Mark Stephens: [Understanding PDF text objects](https://blog.idrsolutions.com/understanding-pdf-text-objects/), 2010.
## OCR vs. Text Extraction
Optical Character Recognition (OCR) is the process of extracting text from
images. Software which does this is called *OCR software*. The
[tesseract OCR engine](https://github.com/tesseract-ocr/tesseract) is the
most commonly known open source OCR software.
pypdf is **not** OCR software.
### Digitally-born vs. Scanned PDF files
PDF documents can contain images and text. PDF files don't store text in a
semantically meaningful way, but in a way that makes it easy to show the
text on screen or print it. For this reason, text extraction from PDFs is hard.
If you scan a document, the resulting PDF typically shows the image of the scan.
Scanners then also run OCR software and put the recognized text in the background
of the image. pypdf can extract this result of the scanner's OCR software. However,
in such cases, it's recommended to directly use OCR software as
errors can accumulate: The OCR software is not perfect in recognizing the text.
Then it stores the text in a format that is not meant for text extraction and
pypdf might make mistakes parsing that.
Hence, I would distinguish three types of PDF documents:
* **Digitally born PDF files**: The file was created digitally on the computer.
It can contain images, texts, links, outline items (a.k.a., bookmarks), JavaScript, ...
If you zoom in a lot, the text still looks sharp.
* **Scanned PDF files**: Any number of pages was scanned. The images were then
stored in a PDF file. Hence, the file is just a container for those images.
You cannot copy the text, you don't have links, outline items, JavaScript.
* **OCRed PDF files**: The scanner ran OCR software and put the recognized text
in the background of the image. Hence, you can copy the text, but it still looks
like a scan. If you zoom in enough, you can recognize pixels.
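A rough first check for which of these categories a page might belong to is to look at whether it yields text and whether it contains images. The sketch below only reports what is present on the page. It cannot tell an OCRed scan apart from a digitally born page that happens to embed a picture, and the function name and return labels are invented for this example.

```python
def guess_page_kind(page):
    """Describe a page by the presence of extractable text and images."""
    has_text = bool(page.extract_text().strip())
    has_images = len(page.images) > 0
    if has_text and has_images:
        return "text and images"  # digitally born with figures, or OCRed scan
    if has_text:
        return "text only"        # most likely digitally born
    if has_images:
        return "images only"      # most likely a scan without a text layer
    return "empty"
```

A page reported as `"images only"` is a candidate for OCR software; the other labels need a closer look before drawing conclusions.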
### Can we just always use OCR?
You might now wonder if it makes sense to just always use OCR software. If the
PDF file is digitally-born, you can render it to an image.
I would recommend not to do that.
Text extraction software like pypdf can use more information from the
PDF than just the image. It can know about fonts, encodings, typical character
distances and similar topics.
That means pypdf has a clear advantage when it
comes to characters which are easy to confuse such as `oO0ö`.
**pypdf will never confuse characters**. It just reads what is in the file.
pypdf also has an edge when it comes to characters which are rare, e.g.
🤰. OCR software is unlikely to recognize such emoji correctly.
## Attempts to prevent text extraction
If people who share PDF documents want to prevent text extraction, they have
multiple ways to do so:
1. Store the contents of the PDF as an image
2. [Use a scrambled font](https://stackoverflow.com/a/43466923/562769)
However, text extraction cannot be completely prevented if people should still
be able to read the document. In the worst case, people can take a screenshot,
print it, scan it, and run OCR over it.