# Post-Processing of Text Extraction

Post-processing can noticeably improve the results of text extraction. It is,
however, a natural language processing (NLP) task and therefore outside the
scope of pypdf itself; the library does not provide direct support for it.

This page lists a few examples of what can be done as well as a community recipe
that can be used as a general-purpose post-processing step. If you know more
about the specific domain of your documents, e.g., the language, you can likely
find custom solutions that work better in your context.

## Ligature Replacement

```{testcode}
def replace_ligatures(text: str) -> str:
    ligatures = {
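        "\ufb00": "ff",  # LATIN SMALL LIGATURE FF
        "\ufb01": "fi",  # LATIN SMALL LIGATURE FI
        "\ufb02": "fl",  # LATIN SMALL LIGATURE FL
        "\ufb03": "ffi",  # LATIN SMALL LIGATURE FFI
        "\ufb04": "ffl",  # LATIN SMALL LIGATURE FFL
        "\ufb05": "st",  # LATIN SMALL LIGATURE LONG S T (long s reads as "s")
        "\ufb06": "st",  # LATIN SMALL LIGATURE ST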
        # "Ꜳ": "AA",
        # "Æ": "AE",
        "ꜳ": "aa",
    }
    for search, replace in ligatures.items():
        text = text.replace(search, replace)
    return text
```
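As an alternative to a hand-written table, the Latin ligatures in the Unicode range U+FB00 to U+FB06 can be expanded with the standard library's compatibility normalization. This is only a sketch: `expand_latin_ligatures` is a hypothetical helper name, and normalization is deliberately limited to that range because applying NFKC to the whole text would also rewrite other characters such as superscripts.

```python
import unicodedata


def expand_latin_ligatures(text: str) -> str:
    """Expand the ligatures U+FB00..U+FB06 and leave all other characters alone."""
    return "".join(
        # NFKC replaces a ligature code point with its plain letters.
        unicodedata.normalize("NFKC", char) if "\ufb00" <= char <= "\ufb06" else char
        for char in text
    )
```

Characters outside that range, such as `ꜳ`, are not touched by this variant and still need an explicit mapping.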

## Dehyphenation

Hyphens are used to break words at the end of a line so that the page looks more
even. After text extraction, these hyphens remain in the text and split words in
two.

```{testcode}
from typing import List


def remove_hyphens(text: str) -> str:
    """
    Join words that were split by a hyphen at the end of a line.

    This fails for:
    * Natural dashes: well-known, self-replication, use-cases, non-semantic,
                      Post-processing, Window-wise, viewpoint-dependent
    * Trailing math operators: 2 - 4
    * Names: Lopez-Ferreras, VGG-19, CIFAR-100
    """
    lines = [line.rstrip() for line in text.split("\n")]

    # Find dashes
    line_numbers = []
    for line_no, line in enumerate(lines[:-1]):
        if line.endswith("-"):
            line_numbers.append(line_no)

    # Replace
    for line_no in line_numbers:
        lines = dehyphenate(lines, line_no)

    return "\n".join(lines)


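def dehyphenate(lines: List[str], line_no: int) -> List[str]:
    # The first word of the next line is the second half of the split word.
    next_line = lines[line_no + 1]
    word_suffix = next_line.split(" ")[0]

    # Drop the trailing hyphen and append the second half to the current line.
    lines[line_no] = lines[line_no][:-1] + word_suffix
    # Remove the moved part, and the whitespace that followed it, from the next line.
    lines[line_no + 1] = lines[line_no + 1][len(word_suffix) :].lstrip()
    return lines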
```
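Some of the failure cases listed in the docstring can be avoided with a stricter rule: only join when the hyphen directly follows a lowercase letter and the next line starts with a lowercase letter. The following is a minimal regex-based sketch of that idea, not part of pypdf; the name `join_line_break_hyphens` is hypothetical, and the rule assumes unaccented Latin text.

```python
import re

# A lowercase letter, a hyphen at the end of the line, and a next line that
# starts with a lowercase letter. The second group is the rest of the word.
_LINE_BREAK_HYPHEN = re.compile(r"([a-z])-[ \t]*\n([a-z]\S*)[ \t]*")


def join_line_break_hyphens(text: str) -> str:
    """Move the second half of a split word up to the line with the first half."""
    return _LINE_BREAK_HYPHEN.sub(r"\1\2\n", text)
```

With this rule, `2 -` followed by `4` and `VGG-` followed by `19` are left alone. Natural dashes such as `well-` followed by `known` are still joined incorrectly; handling them would need a word list for the document's language.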

## Header/Footer Removal

The following header/footer removal has several drawbacks:

* False positives, e.g., for the first page when there is a date like 2024.
* False negatives in many cases:
    * Dynamic part, e.g., the page label is in the header.
    * Even/odd pages have different headers.
    * Some pages, e.g., the first one or chapter pages, do not have a header.

```{testcode}
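def remove_footer(extracted_texts: list[str], page_labels: list[str]) -> list[str]:
    def remove_page_labels(extracted_texts, page_labels):
        processed = []
        for text, label in zip(extracted_texts, page_labels):
            # An empty label matches everywhere, and slicing with -0 below
            # would wipe the whole page text, so leave such pages untouched.
            if not label:
                processed.append(text)
                continue

            # Label at the very start of the page (header)
            text_left = text.lstrip()
            if text_left.startswith(label):
                text = text_left[len(label) :]

            # Label at the very end of the page (footer)
            text_right = text.rstrip()
            if text_right.endswith(label):
                text = text_right[: -len(label)]

            processed.append(text)
        return processed

    extracted_texts = remove_page_labels(extracted_texts, page_labels)
    return extracted_texts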
```
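A different heuristic, shown here only as a sketch, is to drop the first or last line of a page when the same line appears on most pages. This targets static running headers and footers rather than page labels. The function name `remove_repeated_edge_lines` and the `min_share` threshold are assumptions made for illustration, and the approach still misses headers that differ between even and odd pages.

```python
from collections import Counter


def remove_repeated_edge_lines(pages: list[str], min_share: float = 0.6) -> list[str]:
    """Drop first/last lines that repeat on at least `min_share` of all pages."""
    split_pages = [page.strip().split("\n") for page in pages]

    # Count how often each first and last line occurs across all pages.
    first_counts = Counter(lines[0].strip() for lines in split_pages)
    last_counts = Counter(lines[-1].strip() for lines in split_pages)
    threshold = min_share * len(pages)

    cleaned = []
    for lines in split_pages:
        first = lines[0].strip()
        if len(lines) > 1 and first and first_counts[first] >= threshold:
            lines = lines[1:]
        last = lines[-1].strip()
        if len(lines) > 1 and last and last_counts[last] >= threshold:
            lines = lines[:-1]
        cleaned.append("\n".join(lines))
    return cleaned
```

A page is never reduced to nothing: a line is only removed while at least one other line remains on that page.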

## Other ideas

* Whitespace in units: There should be a space between a number and its unit
  ([source](https://tex.stackexchange.com/questions/20962/should-i-put-a-space-between-a-number-and-its-unit)),
  e.g., 42 ms, 42 GHz, 42 GB.
* Percent: English style guides prescribe writing the percent sign directly after the number, without a space (e.g., 50%).
* Whitespace before dots: Should typically be removed.
* Whitespace after dots: Should typically be added.
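
The ideas in this list can be sketched with a few regular expressions. This is an illustration under stated assumptions rather than a complete solution: `normalize_spacing` and the `UNITS` tuple are hypothetical names, the unit list must be extended for real documents, and the dot rules will misfire on abbreviations such as "U.S." or on file names that start with a dot.

```python
import re

# Assumption: a small, hand-picked list of units; extend it for your documents.
UNITS = ("ms", "GHz", "GB")


def normalize_spacing(text: str) -> str:
    unit_pattern = "|".join(re.escape(unit) for unit in UNITS)
    # Insert a space between a number and its unit: "42ms" -> "42 ms"
    text = re.sub(rf"(\d)({unit_pattern})\b", r"\1 \2", text)
    # Remove whitespace between a number and the percent sign: "50 %" -> "50%"
    text = re.sub(r"(\d)\s+%", r"\1%", text)
    # Remove spaces and tabs before a dot: "done ." -> "done."
    text = re.sub(r"[ \t]+\.", ".", text)
    # Add a space after a dot that is directly followed by an uppercase
    # letter; decimals such as "3.14" are left alone.
    text = re.sub(r"\.(?=[A-Z])", ". ", text)
    return text
```

For example, `normalize_spacing("Latency was 42ms .It used 50 % of 8GB.")` returns `"Latency was 42 ms. It used 50% of 8 GB."`.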