.. include:: header.rst
.. _TheBasics:
==============================
The Basics
==============================
.. _The_Basics_Opening_Files:
Opening a File
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To open a file, do the following:
.. code-block:: python
import pymupdf
doc = pymupdf.open("a.pdf") # open a document
.. note::
**Taking it further**
See the :ref:`list of supported file types<Supported_File_Types>` and :ref:`The How to Guide on Opening Files <HowToOpenAFile>` for more advanced options.
----------
.. _The_Basics_Extracting_Text:
Extract text from a |PDF|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To extract all the text from a |PDF| file, do the following:
.. code-block:: python

    import pymupdf

    doc = pymupdf.open("a.pdf") # open a document
    out = open("output.txt", "wb") # create a text output
    for page in doc: # iterate the document pages
        text = page.get_text().encode("utf8") # get plain text, encoded as UTF-8
        out.write(text) # write text of page
        out.write(bytes((12,))) # write page delimiter (form feed 0x0C)
    out.close()
Text extraction is not limited to |PDF|: all the :ref:`supported document file formats <About_Feature_Matrix>`, such as :title:`MOBI`, :title:`EPUB` and :title:`TXT`, can have their text extracted.
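Because the example writes a form feed after every page, the output can later be split back into pages. The following is a minimal sketch in plain Python, assuming the page text itself contains no form feed characters; `split_pages` and the sample bytes are hypothetical stand-ins and not part of |PyMuPDF|:

.. code-block:: python

    FORM_FEED = b"\x0c"  # the same delimiter byte as bytes((12,))


    def split_pages(data: bytes) -> list[str]:
        """Split extracted output into one decoded string per page."""
        chunks = data.split(FORM_FEED)
        # every page is followed by a delimiter, so the last chunk is empty
        if chunks and chunks[-1] == b"":
            chunks.pop()
        return [chunk.decode("utf8") for chunk in chunks]


    # stand-in for the bytes written to "output.txt" (hypothetical content)
    sample = b"first page\n" + FORM_FEED + b"second page\n" + FORM_FEED
    pages = split_pages(sample)
    print(len(pages))

In real use, `sample` would be replaced by the bytes read from the output file.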
.. note::
**Taking it further**
If your document contains image-based text content, then use OCR on the page for subsequent text extraction:
.. code-block:: python
tp = page.get_textpage_ocr()
text = page.get_text(textpage=tp)
There are many more examples which explain how to extract text from specific areas or how to extract tables from documents. Please refer to the :ref:`How to Guide for Text<RecipesText>`.
You can now also :ref:`extract text in Markdown format<rag_outputting_as_md>`.
**API reference**
- :meth:`Page.get_text`
----------
.. _The_Basics_Extracting_Images:
Extract images from a |PDF|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To extract all the images from a |PDF| file, do the following:
.. code-block:: python

    import pymupdf

    doc = pymupdf.open("test.pdf") # open a document

    for page_index in range(len(doc)): # iterate over pdf pages
        page = doc[page_index] # get the page
        image_list = page.get_images()

        # print the number of images found on the page
        if image_list:
            print(f"Found {len(image_list)} images on page {page_index}")
        else:
            print("No images found on page", page_index)

        for image_index, img in enumerate(image_list, start=1): # enumerate the image list
            xref = img[0] # get the XREF of the image
            pix = pymupdf.Pixmap(doc, xref) # create a Pixmap

            if pix.n - pix.alpha > 3: # CMYK: convert to RGB first
                pix = pymupdf.Pixmap(pymupdf.csRGB, pix)

            pix.save("page_%s-image_%s.png" % (page_index, image_index)) # save the image as png
            pix = None
.. note::
**Taking it further**
There are many more examples which explain how to extract text from specific areas or how to extract tables from documents. Please refer to the :ref:`How to Guide for Text<RecipesText>`.
**API reference**
- :meth:`Page.get_images`
- :ref:`Pixmap<Pixmap>`
.. _The_Basics_Extracting_Vector_Graphics:
Extract vector graphics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To extract all the vector graphics from a document page, do the following:
.. code-block:: python
doc = pymupdf.open("some.file")
page = doc[0]
paths = page.get_drawings()
This will return a list of dictionaries, one for each vector drawing path found on the page.
.. note::
**Taking it further**
Please refer to: :ref:`How to Extract Drawings<RecipesDrawingAndGraphics_Extract_Drawings>`.
**API reference**
- :meth:`Page.get_drawings`
----------
.. _The_Basics_Merging_PDF:
.. _merge PDF:
.. _join PDF:
Merging |PDF| files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To merge |PDF| files, do the following:
.. code-block:: python
import pymupdf
doc_a = pymupdf.open("a.pdf") # open the 1st document
doc_b = pymupdf.open("b.pdf") # open the 2nd document
doc_a.insert_pdf(doc_b) # merge the docs
doc_a.save("a+b.pdf") # save the merged document with a new filename
Merging |PDF| files with other types of file
"""""""""""""""""""""""""""""""""""""""""""""""""""""
Use :meth:`Document.insert_file` to merge |PDF| files with other :ref:`supported file types<Supported_File_Types>`. For example:
.. code-block:: python
import pymupdf
doc_a = pymupdf.open("a.pdf") # open the 1st document
doc_b = pymupdf.open("b.svg") # open the 2nd document
doc_a.insert_file(doc_b) # merge the docs
doc_a.save("a+b.pdf") # save the merged document with a new filename
.. note::
**Taking it further**
It is easy to join PDFs with :meth:`Document.insert_pdf` and :meth:`Document.insert_file`. Given open |PDF| documents, you can copy page ranges from one to the other. You can select the point where the copied pages should be placed, reverse the page sequence and also change the page rotation.
The GUI script `join.py <https://github.com/pymupdf/PyMuPDF-Utilities/blob/master/examples/join-documents/join.py>`_ uses this method to join a list of files while also joining the respective table of contents segments. It looks like this:
.. image:: images/img-pdfjoiner.*
:scale: 60
**API reference**
- :meth:`Document.insert_pdf`
- :meth:`Document.insert_file`
----------
Working with Coordinates
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
There is one *mathematical term* that you should feel comfortable with when using **PyMuPDF** - **"coordinates"**. Please have a quick look at the :ref:`Coordinates` section to understand the coordinate system. It will help you to position objects and to understand your document space.
----------
.. _The_Basics_Watermarks:
Adding a watermark to a |PDF|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To add a watermark to a |PDF| file, do the following:
.. code-block:: python

    import pymupdf

    doc = pymupdf.open("document.pdf") # open a document

    for page_index in range(len(doc)): # iterate over pdf pages
        page = doc[page_index] # get the page

        # insert an image watermark from a file name to fit the page bounds
        page.insert_image(page.bound(), filename="watermark.png", overlay=False)

    doc.save("watermarked-document.pdf") # save the document with a new filename
.. note::
**Taking it further**
Adding watermarks is essentially as simple as adding an image at the base of each |PDF| page. You should ensure that the image has the required opacity and aspect ratio to make it look the way you need it to.
In the example above a new image is created from each file reference, but to be more performant (by saving memory and file size) this image data should be referenced only once - see the code example and explanation on :meth:`Page.insert_image` for the implementation.
**API reference**
- :meth:`Page.bound`
- :meth:`Page.insert_image`
----------
.. _The_Basics_Images:
Adding an image to a |PDF|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To add an image to a |PDF| file, for example a logo, do the following:
.. code-block:: python

    import pymupdf

    doc = pymupdf.open("document.pdf") # open a document

    for page_index in range(len(doc)): # iterate over pdf pages
        page = doc[page_index] # get the page

        # insert an image logo from a file name at the top left of the document
        page.insert_image(pymupdf.Rect(0, 0, 50, 50), filename="my-logo.png")

    doc.save("logo-document.pdf") # save the document with a new filename
.. note::
**Taking it further**
As with the watermark example, you can save memory and file size by referencing the image only once where possible - see the code example and explanation on :meth:`Page.insert_image`.
**API reference**
- :ref:`Rect<Rect>`
- :meth:`Page.insert_image`
----------
.. _The_Basics_Rotating:
Rotating a |PDF|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To add a rotation to a page, do the following:
.. code-block:: python
import pymupdf
doc = pymupdf.open("test.pdf") # open document
page = doc[0] # get the 1st page of the document
page.set_rotation(90) # rotate the page
doc.save("rotated-page-1.pdf")
.. note::
**API reference**
- :meth:`Page.set_rotation`
----------
.. _The_Basics_Cropping:
Cropping a |PDF|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To crop a page to a defined :ref:`Rect<Rect>`, do the following:
.. code-block:: python
import pymupdf
doc = pymupdf.open("test.pdf") # open document
page = doc[0] # get the 1st page of the document
page.set_cropbox(pymupdf.Rect(100, 100, 400, 400)) # set a cropbox for the page
doc.save("cropped-page-1.pdf")
.. note::
**API reference**
- :meth:`Page.set_cropbox`
----------
.. _The_Basics_Attaching_Files:
:index:`Attaching Files <triple: attach;embed;file>`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To attach another file to a page, do the following:
.. code-block:: python
import pymupdf
doc = pymupdf.open("test.pdf") # open main document
attachment = pymupdf.open("my-attachment.pdf") # open document you want to attach
page = doc[0] # get the 1st page of the document
point = pymupdf.Point(100, 100) # create the point where you want to add the attachment
attachment_data = attachment.tobytes() # get the document byte data as a buffer
# add the file annotation with the point, data and the file name
file_annotation = page.add_file_annot(point, attachment_data, "attachment.pdf")
doc.save("document-with-attachment.pdf") # save the document
.. note::
**Taking it further**
When adding the file with :meth:`Page.add_file_annot`, note that the third parameter, the `filename`, should include the actual file extension. Without it, the attachment may not be recognized as something which can be opened. For example, if the `filename` is just *"attachment"*, you may well get an error when viewing the resulting PDF and attempting to open the attachment. However, with *"attachment.pdf"* it can be recognized and opened by PDF viewers as a valid file type.
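One way to guard against a missing extension is to derive it from the source file before calling :meth:`Page.add_file_annot`. This is only a sketch; `attachment_name` is a hypothetical helper and not a |PyMuPDF| function:

.. code-block:: python

    from pathlib import Path


    def attachment_name(display_name: str, source_path: str) -> str:
        """Return display_name, borrowing the source file's extension if it has none."""
        if Path(display_name).suffix:
            return display_name  # already carries an extension
        return display_name + Path(source_path).suffix


    name = attachment_name("attachment", "my-attachment.pdf")
    print(name)

The resulting name could then be passed as the third argument of the call shown in the example above.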
The default icon for the attachment is a "push pin", however you can change this by setting the `icon` parameter.
**API reference**
- :ref:`Point<Point>`
- :meth:`Document.tobytes`
- :meth:`Page.add_file_annot`
----------
.. _The_Basics_Embedding_Files:
:index:`Embedding Files <triple: attach;embed;file>`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To embed a file in a document, do the following:
.. code-block:: python
import pymupdf
doc = pymupdf.open("test.pdf") # open main document
embedded_doc = pymupdf.open("my-embed.pdf") # open document you want to embed
embedded_data = embedded_doc.tobytes() # get the document byte data as a buffer
# embed with the file name and the data
doc.embfile_add("my-embedded_file.pdf", embedded_data)
doc.save("document-with-embed.pdf") # save the document
.. note::
**Taking it further**
As with :ref:`attaching files<The_Basics_Attaching_Files>`, when adding the file with :meth:`Document.embfile_add` note that the first parameter for the `filename` should include the actual file extension.
**API reference**
- :meth:`Document.tobytes`
- :meth:`Document.embfile_add`
----------
.. _The_Basics_Deleting_Pages:
Deleting Pages
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To delete a page from a document, do the following:
.. code-block:: python
import pymupdf
doc = pymupdf.open("test.pdf") # open a document
doc.delete_page(0) # delete the 1st page of the document
doc.save("test-deleted-page-one.pdf") # save the document
To delete multiple pages from a document, do the following:
.. code-block:: python
import pymupdf
doc = pymupdf.open("test.pdf") # open a document
doc.delete_pages(from_page=9, to_page=14) # delete a page range from the document
doc.save("test-deleted-pages.pdf") # save the document
What happens if I delete a page referred to by bookmarks or hyperlinks?
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
- A bookmark (entry in the Table of Contents) will become inactive and will no longer navigate to any page.
- A hyperlink will be removed from the page that contains it. The visible content on that page will not otherwise be changed in any way.
.. note::
**Taking it further**
The page index is zero-based, so to delete page 10 of a document you would do the following `doc.delete_page(9)`.
Similarly, `doc.delete_pages(from_page=9, to_page=14)` will delete pages 10 - 15 inclusive.
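If you think in human-readable (1-based) page numbers, a small conversion step can help avoid off-by-one mistakes. The following is a sketch in plain Python; `page_range_to_indices` is a hypothetical helper whose result merely mirrors the `from_page` and `to_page` keywords used above:

.. code-block:: python

    def page_range_to_indices(first_page: int, last_page: int) -> dict:
        """Convert 1-based, inclusive page numbers to zero-based keyword arguments."""
        if first_page < 1 or last_page < first_page:
            raise ValueError("expected 1 <= first_page <= last_page")
        return {"from_page": first_page - 1, "to_page": last_page - 1}


    kwargs = page_range_to_indices(10, 15)  # pages 10 - 15 as a reader counts them
    print(kwargs)  # could be passed on as doc.delete_pages(**kwargs)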
**API reference**
- :meth:`Document.delete_page`
- :meth:`Document.delete_pages`
----------
.. _The_Basics_Rearrange_Pages:
Re-Arranging Pages
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To change the sequence of pages, i.e. re-arrange pages, do the following:
.. code-block:: python
import pymupdf
doc = pymupdf.open("test.pdf") # open a document
doc.move_page(1,0) # move the 2nd page of the document to the start of the document
doc.save("test-page-moved.pdf") # save the document
.. note::
**API reference**
- :meth:`Document.move_page`
----------
.. _The_Basics_Copying_Pages:
Copying Pages
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To copy pages, do the following:
.. code-block:: python
import pymupdf
doc = pymupdf.open("test.pdf") # open a document
doc.copy_page(0) # copy the 1st page and put it at the end of the document
doc.save("test-page-copied.pdf") # save the document
.. note::
**API reference**
- :meth:`Document.copy_page`
----------
.. _The_Basics_Selecting_Pages:
Selecting Pages
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To select pages, do the following:
.. code-block:: python
import pymupdf
doc = pymupdf.open("test.pdf") # open a document
doc.select([0, 1]) # select the 1st & 2nd page of the document
doc.save("just-page-one-and-two.pdf") # save the document
.. note::
**Taking it further**
With |PyMuPDF| you have all options to copy, move, delete or re-arrange the pages of a |PDF|. Intuitive methods exist that allow you to do this on a page-by-page level, like the :meth:`Document.copy_page` method.
Alternatively, you can prepare a completely new page layout in the form of a :title:`Python` sequence that contains the page numbers you want, in the sequence you want, and as many times as you want each page. The following illustrates what can be done with :meth:`Document.select`:
.. code-block:: python
doc.select([1, 1, 1, 5, 4, 9, 9, 9, 0, 2, 2, 2])
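To see what such a sequence means for the resulting page order, the selection can be modelled on a plain list. This is a simplified illustration only; `simulate_select` and the page labels are hypothetical and do not touch any |PDF|:

.. code-block:: python

    def simulate_select(page_labels, order):
        """Return the page sequence that results from selecting `order`."""
        return [page_labels[i] for i in order]


    labels = [f"p{i}" for i in range(10)]  # a hypothetical 10-page document
    result = simulate_select(labels, [1, 1, 1, 5, 4, 9, 9, 9, 0, 2, 2, 2])
    print(result)

Pages that do not appear in the sequence (here 3, 6, 7 and 8) are not part of the result.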
Now let's prepare a PDF for double-sided printing (on a printer not directly supporting this):
The number of pages is given by `len(doc)` (equal to `doc.page_count`). The following lists represent the even and the odd page numbers, respectively:
.. code-block:: python
p_even = [p for p in range(doc.page_count) if p % 2 == 0]
p_odd = [p for p in range(doc.page_count) if p % 2 == 1]
This snippet creates the respective sub documents which can then be used to print the document:
.. code-block:: python
doc.select(p_even) # only the even pages left over
doc.save("even.pdf") # save the "even" PDF
doc.close() # recycle the file
doc = pymupdf.open(doc.name) # re-open
doc.select(p_odd) # and do the same with the odd pages
doc.save("odd.pdf")
For more information also have a look at this Wiki `article <https://github.com/pymupdf/PyMuPDF/wiki/Rearranging-Pages-of-a-PDF>`_.
The following example will reverse the order of all pages (**extremely fast:** sub-second time for the 756 pages of the :ref:`AdobeManual`):
.. code-block:: python

    lastPage = doc.page_count - 1
    for i in range(lastPage):
        doc.move_page(lastPage, i) # move current last page to the front
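Why this loop reverses the document can be checked on a plain list. The sketch below assumes that moving a page places it in front of the target position; `move_page` here is a simplified stand-in written for this illustration, not the |PyMuPDF| method:

.. code-block:: python

    def move_page(pages, pno, to=-1):
        """Simplified model: move item `pno` in front of position `to` (-1: to the end)."""
        page = pages.pop(pno)
        if to == -1:
            pages.append(page)
        else:
            # after the pop, positions behind `pno` have shifted down by one
            pages.insert(to if to < pno else to - 1, page)


    pages = list(range(5))  # a hypothetical 5-page document
    last_page = len(pages) - 1
    for i in range(last_page):
        move_page(pages, last_page, i)  # the current last page goes to position i
    print(pages)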
This snippet duplicates the PDF with itself so that it will contain the pages *0, 1, ..., n, 0, 1, ..., n* **(extremely fast and without noticeably increasing the file size!)**:
.. code-block:: python

    page_count = len(doc)
    for i in range(page_count):
        doc.copy_page(i) # copy this page to after last page
**API reference**
- :meth:`Document.select`
----------
.. _The_Basics_Adding_Blank_Pages:
Adding Blank Pages
~~~~~~~~~~~~~~~~~~~~~
To add a blank page, do the following:
.. code-block:: python
import pymupdf
doc = pymupdf.open(...) # some new or existing PDF document
page = doc.new_page(-1, # insertion point: end of document
width = 595, # page dimension: A4 portrait
height = 842)
doc.save("doc-with-new-blank-page.pdf") # save the document
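The `width` and `height` values are given in points, where 72 points equal one inch. As a rough check of the A4 figures used above (A4 measures 210 x 297 mm), this plain-Python sketch converts millimetres to points; `mm_to_points` is a hypothetical helper:

.. code-block:: python

    POINTS_PER_INCH = 72
    MM_PER_INCH = 25.4


    def mm_to_points(mm: float) -> float:
        """Convert a length in millimetres to points."""
        return mm / MM_PER_INCH * POINTS_PER_INCH


    a4_width = round(mm_to_points(210))
    a4_height = round(mm_to_points(297))
    print(a4_width, a4_height)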
.. note::
**Taking it further**
Use this to create the page with another pre-defined paper format:
.. code-block:: python
w, h = pymupdf.paper_size("letter-l") # 'Letter' landscape
page = doc.new_page(width = w, height = h)
The convenience function :meth:`paper_size` knows over 40 industry standard paper formats to choose from. To see them, inspect dictionary :attr:`paperSizes`. Pass the desired dictionary key to :meth:`paper_size` to retrieve the paper dimensions. Upper and lower case is supported. If you append "-L" to the format name, the landscape version is returned.
Here is a 3-liner that creates a |PDF| with one empty page. Its file size is 460 bytes:
.. code-block:: python
doc = pymupdf.open()
doc.new_page()
doc.save("A4.pdf")
**API reference**
- :meth:`Document.new_page`
- :attr:`paperSizes`
----------
.. _The_Basics_Inserting_Pages:
Inserting Pages with Text Content
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The :meth:`Document.insert_page` method also inserts a new page and accepts the same `width` and `height` parameters. In addition, it lets you insert arbitrary text into the new page and returns the number of inserted lines.
.. code-block:: python
import pymupdf
doc = pymupdf.open(...) # some new or existing PDF document
n = doc.insert_page(-1, # default insertion point
text = "The quick brown fox jumped over the lazy dog",
fontsize = 11,
width = 595,
height = 842,
fontname = "Helvetica", # default font
fontfile = None, # any font file name
color = (0, 0, 0)) # text color (RGB)
.. note::
**Taking it further**
The `text` parameter can be a string or a sequence of strings (assuming UTF-8 encoding). Insertion will start at :ref:`Point` (50, 72), which is one inch below the top of the page and 50 points from the left. The number of inserted text lines is returned.
**API reference**
- :meth:`Document.insert_page`
----------
.. _The_Basics_Spliting_Single_Pages:
Splitting Single Pages
~~~~~~~~~~~~~~~~~~~~~~~~~~
This deals with splitting up pages of a |PDF| into arbitrary pieces. For example, you may have a |PDF| with *Letter* format pages which you want to print with a magnification factor of four: each page is split up into 4 pieces, each going to a separate |PDF| page in *Letter* format again.
.. code-block:: python

    import pymupdf

    src = pymupdf.open("test.pdf")
    doc = pymupdf.open() # empty output PDF

    for spage in src: # for each page in input
        r = spage.rect # input page rectangle
        d = pymupdf.Rect(spage.cropbox_position, # CropBox displacement if not
                         spage.cropbox_position) # starting at (0, 0)
        #--------------------------------------------------------------------------
        # example: cut input page into 2 x 2 parts
        #--------------------------------------------------------------------------
        r1 = r / 2 # top left rect
        r2 = r1 + (r1.width, 0, r1.width, 0) # top right rect
        r3 = r1 + (0, r1.height, 0, r1.height) # bottom left rect
        r4 = pymupdf.Rect(r1.br, r.br) # bottom right rect
        rect_list = [r1, r2, r3, r4] # put them in a list

        for rx in rect_list: # run thru rect list
            rx += d # add the CropBox displacement
            page = doc.new_page(-1, # new output page with rx dimensions
                                width = rx.width,
                                height = rx.height)
            page.show_pdf_page(
                page.rect, # fill all new page with the image
                src, # input document
                spage.number, # input page number
                clip = rx, # which part to use of input page
            )

    # that's it, save output file
    doc.save("poster-" + src.name,
             garbage=3, # eliminate duplicate objects
             deflate=True, # compress stuff where possible
    )
Example:
.. image:: images/img-posterize.png
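The quadrant arithmetic can be followed without a document at hand. The sketch below uses plain tuples instead of :ref:`Rect<Rect>` objects and computes midpoints directly, which gives the same four parts as the rectangle expressions above for a page rectangle starting at (0, 0); `quadrants` is a hypothetical helper and 612 x 792 points is assumed as the *Letter* portrait size:

.. code-block:: python

    def quadrants(x0, y0, x1, y1):
        """Split the rectangle (x0, y0, x1, y1) into four equal parts."""
        xm = (x0 + x1) / 2  # vertical cut line
        ym = (y0 + y1) / 2  # horizontal cut line
        return [
            (x0, y0, xm, ym),  # top left
            (xm, y0, x1, ym),  # top right
            (x0, ym, xm, y1),  # bottom left
            (xm, ym, x1, y1),  # bottom right
        ]


    parts = quadrants(0, 0, 612, 792)
    for part in parts:
        print(part)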
.. note::
**API reference**
- :attr:`Page.cropbox_position`
- :meth:`Page.show_pdf_page`
--------------------------
.. _The_Basics_Combining_Single_Pages:
Combining Single Pages
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This deals with joining |PDF| pages to form a new |PDF| with pages each combining two or four original ones (also called "2-up", "4-up", etc.). This could be used to create booklets or thumbnail-like overviews.
.. code-block:: python

    import pymupdf

    src = pymupdf.open("test.pdf")
    doc = pymupdf.open() # empty output PDF

    width, height = pymupdf.paper_size("a4") # A4 portrait output page format
    r = pymupdf.Rect(0, 0, width, height)

    # define the 4 rectangles per page
    r1 = r / 2 # top left rect
    r2 = r1 + (r1.width, 0, r1.width, 0) # top right
    r3 = r1 + (0, r1.height, 0, r1.height) # bottom left
    r4 = pymupdf.Rect(r1.br, r.br) # bottom right

    # put them in a list
    r_tab = [r1, r2, r3, r4]

    # now copy input pages to output
    for spage in src:
        if spage.number % 4 == 0: # create new output page
            page = doc.new_page(-1,
                                width = width,
                                height = height)
        # insert input page into the correct rectangle
        page.show_pdf_page(r_tab[spage.number % 4], # select output rect
                           src, # input document
                           spage.number) # input page number

    # by all means, save new file using garbage collection and compression
    doc.save("4up.pdf", garbage=3, deflate=True)
Example:
.. image:: images/img-4up.png
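The placement logic rests on the page number modulo 4. As a plain-Python illustration, the sketch below lists where each input page ends up; `placement` is a hypothetical helper and the 6-page input is an assumption made for the example:

.. code-block:: python

    def placement(page_number: int, per_sheet: int = 4):
        """Return (output page index, slot index) for a zero-based input page."""
        return divmod(page_number, per_sheet)


    layout = [placement(n) for n in range(6)]  # a hypothetical 6-page input
    print(layout)

Slot 0 corresponds to the top left rectangle, slot 3 to the bottom right one. The last output page stays partly empty when the page count is not a multiple of four.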
.. note::
**API reference**
- :attr:`Page.cropbox_position`
- :meth:`Page.show_pdf_page`
--------------------------
.. _The_Basics_Encryption_and_Decryption:
|PDF| Encryption & Decryption
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Starting with version 1.16.0, |PDF| decryption and encryption (using passwords) are fully supported. You can do the following:
* Check whether a document is password protected / (still) encrypted (:attr:`Document.needs_pass`, :attr:`Document.is_encrypted`).
* Gain access authorization to a document (:meth:`Document.authenticate`).
* Set encryption details for PDF files using :meth:`Document.save` or :meth:`Document.write` and
- decrypt or encrypt the content
- set password(s)
- set the encryption method
- set permission details
.. note:: A PDF document may have two different passwords:
* The **owner password** provides full access rights, including changing passwords, encryption method, or permission detail.
* The **user password** provides access to document content according to the established permission details. If present, opening the |PDF| in a viewer will require providing it.
Method :meth:`Document.authenticate` will automatically establish access rights according to the password used.
The following snippet creates a new |PDF| and encrypts it with separate user and owner passwords. Permissions are granted to print, copy and annotate, but no changes are allowed to someone authenticating with the user password.
.. code-block:: python
import pymupdf
text = "some secret information" # keep this data secret
perm = int(
pymupdf.PDF_PERM_ACCESSIBILITY # always use this
| pymupdf.PDF_PERM_PRINT # permit printing
| pymupdf.PDF_PERM_COPY # permit copying
| pymupdf.PDF_PERM_ANNOTATE # permit annotations
)
owner_pass = "owner" # owner password
user_pass = "user" # user password
encrypt_meth = pymupdf.PDF_ENCRYPT_AES_256 # strongest algorithm
doc = pymupdf.open() # empty pdf
page = doc.new_page() # empty page
page.insert_text((50, 72), text) # insert the data
doc.save(
"secret.pdf",
encryption=encrypt_meth, # set the encryption method
owner_pw=owner_pass, # set the owner password
user_pw=user_pass, # set the user password
permissions=perm, # set permissions
)
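The permissions value is a bit field, so individual rights can be tested with a bitwise AND. The sketch below is plain Python with locally defined stand-in constants; their numeric values are an assumption based on the permission bit positions of the PDF specification, and real code should use the `pymupdf.PDF_PERM_*` constants instead:

.. code-block:: python

    # stand-in flag values (assumed from the PDF specification bit positions)
    PERM_PRINT = 1 << 2
    PERM_MODIFY = 1 << 3
    PERM_COPY = 1 << 4
    PERM_ANNOTATE = 1 << 5
    PERM_ACCESSIBILITY = 1 << 9

    # the same combination as in the example above
    perm = PERM_ACCESSIBILITY | PERM_PRINT | PERM_COPY | PERM_ANNOTATE


    def allows(permissions: int, flag: int) -> bool:
        """Check whether a single permission flag is set."""
        return bool(permissions & flag)


    print(allows(perm, PERM_PRINT), allows(perm, PERM_MODIFY))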
.. note::
**Taking it further**
Opening this document with some viewer (Nitro Reader 5) reflects these settings:
.. image:: images/img-encrypting.*
**Decrypting** happens automatically on save, as before, when no encryption parameters are provided.
To **keep the encryption method** of a PDF save it using `encryption=pymupdf.PDF_ENCRYPT_KEEP`. If `doc.can_save_incrementally() == True`, an incremental save is also possible.
To **change the encryption method** specify the full range of options above (`encryption`, `owner_pw`, `user_pw`, `permissions`). An incremental save is **not possible** in this case.
**API reference**
- :meth:`Document.save`
--------------------------
.. _The_Basics_Extracting_Tables:
Extracting Tables from a :title:`Page`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tables can be found and extracted from any document :ref:`Page`.
.. code-block:: python

    import pymupdf
    from pprint import pprint

    doc = pymupdf.open("test.pdf") # open document
    page = doc[0] # get the 1st page of the document
    tabs = page.find_tables() # locate and extract any tables on page
    print(f"{len(tabs.tables)} table(s) found on {page}") # display number of found tables

    if tabs.tables: # at least one table found?
        pprint(tabs[0].extract()) # print content of first table
.. note::
**API reference**
- :meth:`Page.find_tables`
.. important::
There is also the `pdf2docx extract tables method`_ which is capable of table extraction if you prefer.
--------------------------
.. _The_Basics_Get_Page_Links:
Getting Page Links
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Links can be extracted from a :ref:`Page` to return :ref:`Link` objects.
.. code-block:: python

    import pymupdf

    doc = pymupdf.open("test.pdf") # open a document

    for page in doc: # iterate the document pages
        link = page.first_link # a `Link` object or `None`

        while link: # iterate over the links on page
            # do something with the link, then:
            link = link.next # get next link, last one has `None` in its `next`
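The loop above walks a singly linked list. If the same traversal is needed in several places, it can be wrapped in a generator. The sketch below demonstrates the pattern with a stand-in class; `FakeLink` and `iter_links` are hypothetical names used only for this illustration:

.. code-block:: python

    class FakeLink:
        """Stand-in with just a `uri` and a `next` attribute."""

        def __init__(self, uri, next=None):
            self.uri = uri
            self.next = next


    def iter_links(first_link):
        """Yield every link in the chain starting at `first_link`."""
        link = first_link
        while link:
            yield link
            link = link.next  # the last link has `None` here


    chain = FakeLink("a", FakeLink("b", FakeLink("c")))
    uris = [link.uri for link in iter_links(chain)]
    print(uris)

With a real page, the generator would be started with `page.first_link`.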
.. note::
**API reference**
- :attr:`Page.first_link`
-----------------------------
.. _The_Basics_Get_All_Annotations:
Getting All Annotations from a Document
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Annotations (:ref:`Annot`) on pages can be retrieved with the `page.annots()` method.
.. code-block:: python

    import pymupdf

    doc = pymupdf.open("test.pdf") # open a document

    for page in doc:
        for annot in page.annots():
            print(f'Annotation on page: {page.number} with type: {annot.type} and rect: {annot.rect}')
.. note::
**API reference**
- :meth:`Page.annots`
--------------------------
.. _The_Basics_Redacting:
Redacting content from a **PDF**
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Redactions are special types of annotations which can be marked onto a document page to denote an area on the page which should be securely removed. Marking an area with a rectangle flags it for *redaction*. Once the redaction is *applied*, the content is securely removed.
For example if we wanted to redact all instances of the name "Jane Doe" from a document we could do the following:
.. code-block:: python

    import pymupdf

    # Open the PDF document
    doc = pymupdf.open('test.pdf')

    # Iterate over each page of the document
    for page in doc:
        # Find all instances of "Jane Doe" on the current page
        instances = page.search_for("Jane Doe")

        # Redact each instance of "Jane Doe" on the current page
        for inst in instances:
            page.add_redact_annot(inst)

        # Apply the redactions to the current page
        page.apply_redactions()

    # Save the modified document
    doc.save('redacted_document.pdf')

    # Close the document
    doc.close()
Another example is redacting an area of a page while keeping any line art (i.e. vector graphics) within the defined area. This is done by setting a parameter flag as follows:
.. code-block:: python
import pymupdf
# Open the PDF document
doc = pymupdf.open('test.pdf')
# Get the first page
page = doc[0]
# Add an area to redact
rect = [0,0,200,200]
# Add a redaction annotation which will have a red fill color
page.add_redact_annot(rect, fill=(1,0,0))
# Apply the redactions to the current page, but ignore vector graphics
page.apply_redactions(graphics=0)
# Save the modified document
doc.save('redacted_document.pdf')
# Close the document
doc.close()
.. warning::
Once a redacted version of a document is saved then the redacted content in the **PDF** is *irretrievable*. Thus, a redacted area in a document removes text and graphics completely from that area.
.. note::
**Taking it further**
There are a few options for creating and applying redactions to a page. For full details of the parameters that control these options, refer to the API reference.
**API reference**
- :meth:`Page.add_redact_annot`
- :meth:`Page.apply_redactions`
--------------------------
.. _The Basics_Coverting_PDF_Documents:
Converting PDF Documents
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We recommend the pdf2docx_ library which uses **PyMuPDF** and the **python-docx** library to provide simple document conversion from **PDF** to **DOCX** format.
.. include:: footer.rst