# Interactions with PDF Forms

## Reading form fields

```{testsetup}
pypdf_test_setup("user/forms", {
    "form.pdf": "../resources/form.pdf",
})
```

```{testcode}
from pypdf import PdfReader

reader = PdfReader("form.pdf")
fields = reader.get_form_text_fields()
# `fields` maps each text field name to its value,
# e.g. {"key": "value", "key2": "value2"}

# You can also get all fields:
fields = reader.get_fields()
```

## Filling out forms

```{testcode}
from pypdf import PdfReader, PdfWriter

reader = PdfReader("form.pdf")
writer = PdfWriter()

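page = reader.pages[0]
fields = reader.get_fields()  # inspect this to find the available field names

# Copy the pages together with the form structure into the writer.
writer.append(reader)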

writer.update_page_form_field_values(
    writer.pages[0],
    {"fieldname": "some filled in text"},
    auto_regenerate=False,
)

writer.write("out-filled-form.pdf")
```

In most cases, you will want to use `auto_regenerate=False`. The parameter
defaults to `True` for legacy compatibility, but that setting flags the PDF
processor to recompute the field's rendering and may trigger a "save changes"
dialog for users who open the generated PDF.

If you want to flatten your form, that is, keep all form field contents while
removing the form fields themselves, set the `flatten` parameter of
{func}`~pypdf.PdfWriter.update_page_form_field_values` to `True`. This
converts the form field contents to regular PDF content. Afterwards, call
{func}`~pypdf.PdfWriter.remove_annotations` with `subtypes="/Widget"`
to remove all form fields and obtain an actual flattened PDF.

## Some notes about form fields and annotations

PDF forms describe their fields in two places:

* Within the root object, an `/AcroForm` structure exists.
  Inside it, you can find the following optional entries:

  - some global elements (fonts, resources, ...)
  - some global flags, such as `/NeedAppearances`, which indicates whether the reading program should re-render the visual fields when the document is opened (it is set or cleared with the `auto_regenerate` parameter of `update_page_form_field_values()`)
  - `/XFA`, which houses a form in XDP format (a very specific XML that describes the form and is rendered by some viewers); the `/XFA` form overrides the page content
  - `/Fields`, which houses an array of indirect references to the top-level (root) _Field_ objects

* Within the page `/Annots`, you will find `/Widget` annotations that define the visual rendering.

To flesh out this overview:

* The core properties of a field are:
  - `/FT`: The field type (Button, Text, Choice, or Signature).
  - `/T`:  The partial field name.
  - `/V`:  The field’s value, whose format varies depending on the field type.
  - `/DV`: The default value to which the field reverts when a reset-form action is executed.
* To streamline readability, a _Field_ object and its _Widget_ object can be merged into a single object that holds all properties.
* Fields can be organized hierarchically, that is, one field can be placed under another. In such cases, `/Parent` holds an IndirectObject providing the bottom-up link, and `/Kids` is an array of IndirectObjects for top-down navigation; _Widget_ objects are still required for visual rendering. To refer to such fields, use the *fully qualified field name* (the individual names of all parent objects, separated by `.`).

  For instance, take two (visual) fields both called _city_, but attached below _sender_ and _receiver_; the corresponding full names will be _sender.city_ and _receiver.city_.
* When a field is repeated on multiple pages, the _Field_ object will have several _Widget_ objects in `/Kids`. These objects are pure _widgets_, containing no _field_-specific data.
* If a field stores only hidden values, no _Widgets_ are required.
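
To illustrate how fully qualified names follow from the hierarchy, here is a small sketch that models fields as plain Python dictionaries instead of real PDF objects. The function `qualified_names` and the sample data are hypothetical. The sketch assumes, as a simplification, that kids without a `/T` entry are pure widgets.

```python
def qualified_names(field, prefix=""):
    """Yield the fully qualified name of every terminal field below `field`."""
    name = field.get("/T")
    if name is not None:
        prefix = f"{prefix}.{name}" if prefix else name
    # In this simplified model, kids with their own /T are sub-fields;
    # kids without /T are pure widgets and do not extend the name.
    sub_fields = [kid for kid in field.get("/Kids", []) if "/T" in kid]
    if not sub_fields:
        yield prefix
        return
    for kid in sub_fields:
        yield from qualified_names(kid, prefix)


# Hypothetical hierarchy: two "city" fields below "sender" and "receiver".
sender = {"/T": "sender", "/Kids": [{"/T": "city", "/FT": "/Tx"}]}
receiver = {"/T": "receiver", "/Kids": [{"/T": "city", "/FT": "/Tx"}]}

names = [n for root in (sender, receiver) for n in qualified_names(root)]
print(names)  # ['sender.city', 'receiver.city']
```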

In _pypdf_, fields can be extracted from the `/Fields` array (first block) or collected from the `/Widget` annotations of each page (second block):

```{testcode}
from pypdf import PdfReader

reader = PdfReader("form.pdf")
fields = reader.get_fields()
```

```{testcode}
from pypdf import PdfReader
from pypdf.constants import AnnotationDictionaryAttributes

reader = PdfReader("form.pdf")
fields = []
for page in reader.pages:
    for annot in page.annotations:
        annot = annot.get_object()
        if annot[AnnotationDictionaryAttributes.Subtype] == "/Widget":
            fields.append(annot)
```

However, while similar, the two blocks of code above differ in some relevant ways. Most importantly, the first block returns a dictionary whose values are Field objects, whereas the second collects more generic dictionary-like objects in a list. The objects will *mostly* reference the same object in the underlying PDF, meaning you'll find that `obj_taken_from_first_list.indirect_reference == obj_taken_from_second_list.indirect_reference`. Field objects are generally more ergonomic, as the exposed data can be accessed via clearly named properties. However, the more generic dictionary-like objects contain data that the Field object does not expose, such as the Rect (the widget's position on the page). Therefore, the correct approach depends on your use case.

Note also that the two results do not *always* refer to the same underlying PDF objects. For example, if the form contains radio buttons, `reader.get_fields()` returns the parent object (the group of radio buttons), whereas `page.annotations` returns all the child objects (the individual radio buttons).

```{note}
Remember that fields are not stored in pages; if you use `add_page()` the field structure is not copied. It is recommended to use `.append()` with the proper parameters instead.
```

If _field_ objects are missing from `/Fields`, `writer.reattach_fields()` parses the page annotations and reattaches them. This fix cannot guess intermediate fields and will not report fields that share the same _name_.

## Identify pages where fields are used

To help locate the pages on which a field appears, you can use `get_pages_showing_field` of PdfReader or PdfWriter. This method accepts a field object, that is, a *PdfObject* that represents a field (as extracted from `_root_object["/AcroForm"]["/Fields"]`). The method returns a list of pages, because a field can have multiple widgets as mentioned previously (e.g., radio buttons or text displayed on multiple pages).

The page numbers can then be retrieved as usual by using `page.page_number`.