File: control

html-text 0.7.1-2
Source: html-text
Section: python
Priority: optional
Maintainer: Christian Marillat <marillat@debian.org>
Homepage: https://github.com/zytedata/html-text
Rules-Requires-Root: no
Standards-Version: 4.7.0
Build-Depends: debhelper-compat (= 13), dh-sequence-python3, python3,
 python3-setuptools, pybuild-plugin-pyproject, python3-hatchling,
 python3-lxml, python3-pytest, python3-lxml-html-clean

Package: python3-html-text
Architecture: all
Depends: ${python3:Depends}, ${misc:Depends}
Description: extract text from HTML
 How is html_text different from .xpath('//text()') from lxml or .get_text()
 from Beautiful Soup?
 .
  * Text extracted with html_text does not contain inline styles,
    JavaScript, comments and other text that is not normally visible to
    users;
  * html_text normalizes whitespace, but in a way smarter than
    .xpath('normalize-space()'), adding spaces around inline elements (which
    are often used as block elements in HTML markup), and trying to avoid
    adding extra spaces for punctuation;
  * html_text can add newlines (e.g. after headers or paragraphs), so that
    the output text looks more like how it is rendered in browsers.