File: index.rst

package info (click to toggle)
pymupdf 1.25.4%2Bds1-3
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid, trixie
  • size: 98,632 kB
  • sloc: python: 43,379; ansic: 75; makefile: 6
file content (151 lines) | stat: -rw-r--r-- 4,570 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151

.. include:: ../header.rst

.. _pymupdf4llm


PyMuPDF4LLM
===========================================================================

|PyMuPDF4LLM| is aimed to make it easier to extract **PDF** content in the format you need for **LLM** & **RAG** environments. It supports :ref:`Markdown extraction <extracting_as_md>` as well as :ref:`LlamaIndex document output <extracting_as_llamaindex>`.

.. important::

    You can extend the supported file types to also include **Office** document formats (DOC/DOCX, XLS/XLSX, PPT/PPTX, HWP/HWPX) by :ref:`using PyMuPDF Pro with PyMuPDF4LLM <using_pymupdf4llm_withpymupdfpro>`.

Features
-------------------------------

    - Support for multi-column pages
    - Support for image and vector graphics extraction (and inclusion of references in the MD text)
    - Support for page chunking output.
    - Direct support for output as :ref:`LlamaIndex Documents <extracting_as_llamaindex>`.


Functionality
--------------------

- This package converts the pages of a file to text in **Markdown** format using |PyMuPDF|.

- Standard text and tables are detected, brought in the right reading sequence and then together converted to **GitHub**-compatible **Markdown** text.

- Header lines are identified via the font size and appropriately prefixed with one or more `#` tags.

- Bold, italic, mono-spaced text and code blocks are detected and formatted accordingly. Similar applies to ordered and unordered lists.

- By default, all document pages are processed. If desired, a subset of pages can be specified by providing a list of `0`-based page numbers.


Installation
----------------


Install the package via **pip** with:


.. code-block:: bash

    pip install pymupdf4llm


.. _extracting_as_md:

Extracting a file as **Markdown**
--------------------------------------------------------------

To retrieve your document content in **Markdown** simply install the package and then use a couple of lines of **Python** code to get results.



Then in your **Python** script do:


.. code-block:: python

    import pymupdf4llm
    md_text = pymupdf4llm.to_markdown("input.pdf")


.. note::

    Instead of the filename string as above, one can also provide a :ref:`PyMuPDF Document <Document>`. A second parameter may be a list of `0`-based page numbers, e.g. `[0,1]` would just select the first and second pages of the document.


If you want to store your **Markdown** file, e.g. store as a UTF8-encoded file, then do:


.. code-block:: python

    import pathlib
    pathlib.Path("output.md").write_bytes(md_text.encode())



.. _extracting_as_llamaindex:

Extracting a file as a **LlamaIndex** document
--------------------------------------------------------------

|PyMuPDF4LLM| supports direct conversion to a **LLamaIndex** document. A document is first converted into **Markdown** format and then a **LlamaIndex** document is returned as follows:



.. code-block:: python

    import pymupdf4llm
    llama_reader = pymupdf4llm.LlamaMarkdownReader()
    llama_docs = llama_reader.load_data("input.pdf")


.. _using_pymupdf4llm_withpymupdfpro:

Using with |PyMuPDF Pro|
---------------------------


For **Office** document support, |PyMuPDF4LLM| works seamlessly with |PyMuPDF Pro|. Assuming you have :doc:`../pymupdf-pro` installed you will be able to work with **Office** documents as expected:


.. code-block:: python

    import pymupdf4llm
    import pymupdf.pro
    pymupdf.pro.unlock()
    md_text = pymupdf4llm.to_markdown("sample.doc")


As you can see |PyMuPDF Pro| functionality will be available within the |PyMuPDF4LLM| context!



API
-------

See :ref:`the PyMuPDF4LLM API <pymupdf4llm-api>`.

Further Resources
-------------------


Sample code
~~~~~~~~~~~~~~~

- `Command line RAG Chatbot with PyMuPDF <https://github.com/pymupdf/RAG/tree/main/examples/country-capitals>`_
- `Example of a Browser Application using Langchain and PyMuPDF <https://github.com/pymupdf/RAG/tree/main/examples/GUI>`_


Blogs
~~~~~~~~~~~~~~

- `RAG/LLM and PDF: Enhanced Text Extraction <https://artifex.com/blog/rag-llm-and-pdf-enhanced-text-extraction>`_
- `Creating a RAG Chatbot with ChatGPT and PyMuPDF <https://artifex.com/blog/creating-a-rag-chatbot-with-chatgpt-and-pymupdf>`_
- `Building a RAG Chatbot GUI with the ChatGPT API and PyMuPDF <https://artifex.com/blog/building-a-rag-chatbot-gui-with-the-chatgpt-api-and-pymupdf>`_
- `RAG/LLM and PDF: Conversion to Markdown Text with PyMuPDF <https://artifex.com/blog/rag-llm-and-pdf-conversion-to-markdown-text-with-pymupdf>`_







.. include:: ../footer.rst