1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
|
# The PDF Format
It is recommended to look in the PDF specification for details and clarifications.
* [PDF Specification Archive](https://pdfa.org/resource/pdf-specification-archive/)
* [Portable Document Format Reference Manual, 1993. ISBN 0-201-62628-4](https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/pdfreference1.0.pdf)
* [ISO 32000-1:2008 (PDF 1.7)](https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf)
* ISO 32000-2:2020 (PDF 2.0)
Below is only intended to give a very rough overview of the format.
## Overall Structure
A PDF consists of:
1. Header: Contains the version of the PDF, e.g. `%PDF-1.7`
2. Body: Contains a sequence of indirect objects
3. Cross-reference table (xref): Contains a list of the indirect objects in the body
4. Trailer
## The xref table
A cross-reference table (xref) is a table of the indirect objects in the body.
It allows quick access to those objects by pointing to their location in the file.
It looks like this:
```text
xref 42 5
0000001000 65535 f
0000001234 00000 n
0000001987 00000 n
0000011987 00000 n
0000031987 00000 n
```
Let's go through it step-by-step:
* `xref` is just a keyword that specifies the start of the xref table.
* `42` is the numerical ID of the first object in this xref section; `5` is the number of entries in the xref table.
* Now every object has 3 entries `nnnnnnnnnn ggggg n`: a 10-digit byte offset,
a 5-digit generation number, and a literal keyword which is either `n` or `f`.
* `nnnnnnnnnn` is the byte offset of the object. It tells the reader where
the object is in the file.
* `ggggg` is the generation number. It tells the reader how old the object is.
* `n` means that the object is a normal in-use object, `f` means that the object
is a free object.
* The first free object always has a generation number of 65535. It forms
the head of a linked-list of all free objects.
* The generation number of a normal object is always 0. The generation
number allows the PDF format to contain multiple versions of the same
object. This is a version history mechanism.
## The body
The body is a sequence of indirect objects:
`counter generation_number << the_object >> endobj`
* `counter` (integer) is a unique identifier for the object.
* `generation_number` (integer) is the generation number of the object.
* `the_object` is the object itself. It can be empty. Starts with `/Keyword` to
specify which kind of object it is.
* `endobj` marks the end of the object.
A concrete example can be found in `test_reader.py::test_get_images_raw`:
```text
1 0 obj << /Count 1 /Kids [4 0 R] /Type /Pages >> endobj
2 0 obj << >> endobj
3 0 obj << >> endobj
4 0 obj << /Contents 3 0 R /CropBox [0.0 0.0 2550.0 3508.0]
/MediaBox [0.0 0.0 2550.0 3508.0] /Parent 1 0 R
/Resources << /Font << >> >>
/Rotate 0 /Type /Page >> endobj
5 0 obj << /Pages 1 0 R /Type /Catalog >> endobj
```
## The trailer
The trailer looks like this:
```text
trailer << /Root 5 0 R
/Size 6
>>
startxref 1234
%%EOF
```
Let's go through it:
* `trailer <<` indicates that the *trailer dictionary* starts. It ends with `>>`.
* `startxref` is a keyword followed by the byte-location of the `xref` keyword.
As the trailer is always at the bottom of the file, this allows readers to
quickly find the xref table.
* `%%EOF` is the end-of-file marker.
The trailer dictionary is a key-value list. The keys are specified in
Table 15 of the PDF Reference 1.7, e.g. `/Root` and `/Size` (both are required).
* `/Root` (dictionary) contains the document catalog.
* The `5` is the object number of the catalog dictionary.
* `0` is the generation number of the catalog dictionary.
* `R` is the keyword that indicates that the object is a reference to the
catalog dictionary.
* `/Size` (integer) contains the total number of entries in the files xref table.
## Reading PDF files
Most PDF files are compressed. If you want to read them, first uncompress them:
```bash
pdftk crazyones.pdf output crazyones-uncomp.pdf uncompress
```
Then rename `crazyones-uncomp.pdf` to `crazyones-uncomp.txt` and open it in
your favorite IDE / text editor.
|