File: usage.md

package info (click to toggle)
python-html2text 2025.4.15-1
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid, trixie
  • size: 976 kB
  • sloc: python: 1,659; sh: 32; makefile: 5
file content (153 lines) | stat: -rw-r--r-- 8,261 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
Usage
=====

The module is simple enough to use. This tutorial will get you started.

Installing
----------

These are the methods you can get the module installed:-

### PIP

For those who have pip, we got your back.

```
$ pip install html2text
```

### Clone from Git Repository

Clone the repository from https://github.com/Alir3z4/html2text

```
$ git clone --depth 50 https://github.com/Alir3z4/html2text.git
$ python -m build -nwx
$ python -m pip install --upgrade ./dist/*.whl
```



Basic Usage
-----------

Once installed the module can be used as follows.

    import html2text
    html = function_to_get_some_html()
    text = html2text.html2text(html)
    print(text)

This converts the provided html to text( Markdown text) with all the
options set to default.

Using Options
--------------

To customize the options provided by the module the usage is as follows:

    import html2text
    text_maker = html2text.HTML2Text()
    text_maker.ignore_links = True
    text_maker.bypass_tables = False
    html = function_to_get_some_html()
    text = text_maker.handle(html)
    print(text)


Available Options
-----------------

All options exist in the config.py file. A list is provided here with
simple indications of their function.


    - UNICODE_SNOB for using unicode
    - ESCAPE_SNOB for escaping every special character
    - LINKS_EACH_PARAGRAPH for putting links after every paragraph
    - BODY_WIDTH for wrapping long lines
    - SKIP_INTERNAL_LINKS to skip #local-anchor things
    - INLINE_LINKS for formatting images and links
    - PROTECT_LINKS protect from line breaks
    - GOOGLE_LIST_INDENT no of pixels to indent nested lists
    - IGNORE_ANCHORS
    - IGNORE_IMAGES
    - IMAGES_AS_HTML always generate HTML tags for images; preserves `height`, `width`, `alt` if possible.
    - IMAGES_TO_ALT
    - IMAGES_WITH_SIZE
    - IGNORE_EMPHASIS
    - BYPASS_TABLES format tables in HTML rather than Markdown
    - IGNORE_TABLES ignore table-related tags (table, th, td, tr) while keeping rows
    - SINGLE_LINE_BREAK to use a single line break rather than two
    - UNIFIABLE is a dictionary which maps unicode abbreviations to ASCII
                values
    - RE_SPACE for finding space-only lines
    - RE_ORDERED_LIST_MATCHER for matching ordered lists in MD
    - RE_UNORDERED_LIST_MATCHER for matching unordered list matcher in MD
    - RE_MD_CHARS_MATCHER for matching Md \,[,],( and )
    - RE_MD_CHARS_MATCHER_ALL for matching `,*,_,{,},[,],(,),#,!
    - RE_MD_DOT_MATCHER for matching lines starting with <space>1.<space>
    - RE_MD_PLUS_MATCHER for matching lines starting with <space>+<space>
    - RE_MD_DASH_MATCHER for matching lines starting with <space>(-)<space>
    - RE_SLASH_CHARS a string of slash escapeable characters
    - RE_MD_BACKSLASH_MATCHER to match \char
    - USE_AUTOMATIC_LINKS to convert <a href='http://xyz'>http://xyz</a> to <http://xyz>
    - MARK_CODE to wrap 'pre' blocks with [code]...[/code] tags
    - WRAP_LINKS to decide if links have to be wrapped during text wrapping (implies INLINE_LINKS = False)
    - WRAP_LIST_ITEMS to decide if list items have to be wrapped during text wrapping
    - WRAP_TABLES to decide if tables have to be wrapped during text wrapping
    - DECODE_ERRORS to handle decoding errors. 'strict', 'ignore', 'replace' are the acceptable values.
    - DEFAULT_IMAGE_ALT takes a string as value and is used whenever an image tag is missing an `alt` value. The default for this is an empty string '' to avoid backward breakage
    - OPEN_QUOTE is the character used to open a quote when replacing the `<q>` tag. It defaults to `"`.
    - CLOSE_QUOTE is the character used to close a quote when replacing the `<q>` tag. It defaults to `"`.

Options that are not in the config.py file:

    - emphasis_mark is the character used when replacing the `<em>` tag. It defaults to `_`.
    - strong_mark is the characer used when replacing the `<strong>` tag. It defaults to `**`.

To alter any option the procedure is to create a parser with
`parser = html2text.HTML2Text()` and to set the option on the parser.
example: `parser.unicode_snob = True` to set the UNICODE_SNOB option.


Command line options
--------------------


| Option                                                 | Description
|--------------------------------------------------------|---------------------------------------------------
| `--version`                                            | Show program version number and exit
| `-h`, `--help`                                         | Show this help message and exit
| `--ignore-links`                                       | Do not include any formatting for links
|`--protect-links`                                       | Protect links from line breaks surrounding them "+" with angle brackets
|`--ignore-images`                                       | Do not include any formatting for images
|`--images-as-html`                                      | Always write image tags as raw html; preserves "height", "width" and "alt" if possible.
|`--images-to-alt`                                       | Discard image data, only keep alt text
|`--images-with-size`                                    | Write image tags with height and width attrs as raw html to retain dimensions
|`-g`, `--google-doc`                                    | Convert an html-exported Google Document
|`-d`, `--dash-unordered-list`                           | Use a dash rather than a star for unordered list items
|`-b` `BODY_WIDTH`, `--body-width`=`BODY_WIDTH`          | Number of characters per output line, `0` for no wrap
|`-i` `LIST_INDENT`, `--google-list-indent`=`LIST_INDENT`| Number of pixels Google indents nested lists
|`-s`, `--hide-strikethrough`                            | Hide strike-through text. only relevant when `-g` is specified as well
|`--escape-all`                                          | Escape all special characters.  Output is less readable, but avoids corner case formatting issues.
| `--bypass-tables`                                      | Format tables in HTML rather than Markdown syntax.
| `--ignore-tables`                                      | Ignore table-related tags (table, th, td, tr) while keeping rows.
| `--single-line-break`                                  | Use a single line break after a block element rather than two.
| `--reference-links`                                    | Use reference links instead of inline links to create markdown
| `--ignore-emphasis`                                    | Ignore all emphasis formatting in the html.
| `--include-sup-sub`                                    | Include `<sub>` and `<sup>` tags.
| `-e`, `--asterisk-emphasis`                            | Use asterisk rather than underscore to emphasize text
| `--unicode-snob`                                       | Use unicode throughout instead of ASCII
| `--no-automatic-links`                                 | Do not use automatic links like <https://www.google.com/>
| `--no-skip-internal-links`                             | Turn off skipping of internal links
| `--links-after-para`                                   | Put the links after the paragraph and not at end of document
| `--mark-code`                                          | Mark code with [code]...[/code] blocks
| `--no-wrap-links`                                      | Do not wrap links during text wrapping. Implies `--reference-links`
| `--wrap-list-items`                                    | Wrap list items during text wrapping.
| `--wrap-tables`                                        | Wrap tables during text wrapping.
| `--decode-errors`=`HANDLER`                            | What to do in case an error is encountered. `ignore`, `strict`, `replace` etc.
| `--pad-tables`                                         | Use padding to make tables look good.
| `--default-image-alt`=`Image_Here`                     | Inserts the given `alt` text whenever images are missing `alt` values.
| `--open-quote`=`"`                                     | Inserts the given text when opening a quote. Defaults to `"`.
| `--close-quote`=`"`                                    | Inserts the given text when closing a quote. Defaults to `"`.