.. default-domain:: js

.. highlight:: javascript

StructuredText
===================

StructuredText objects hold text from a page that has been analyzed and grouped
into blocks, lines and spans.

Constructors
------------

.. class:: StructuredText

	|no_new|

To obtain a StructuredText instance, call `Page.prototype.toStructuredText()`.

Static properties
-----------------

.. data:: StructuredText.SEARCH_EXACT

	Search the untransformed text.

.. data:: StructuredText.SEARCH_IGNORE_CASE

	Search while ignoring differences in case.

.. data:: StructuredText.SEARCH_IGNORE_DIACRITICS

	Search while ignoring diacritics.

.. data:: StructuredText.SEARCH_REGEXP

	Treat the needle as a regular expression.

.. data:: StructuredText.SEARCH_KEEP_LINES

	Search while preserving line breaks.

.. data:: StructuredText.SEARCH_KEEP_PARAGRAPHS

	Search while preserving paragraph breaks.

.. data:: StructuredText.SEARCH_KEEP_HYPHENS

	Search while preserving hyphens, without joining lines.
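
The flags are plain numbers, so several of them can be combined into a single
options value with bitwise OR. A minimal sketch follows; ``combineSearchFlags``
is a hypothetical helper and not part of the API.

.. code-block::

	// Hypothetical helper: OR together any number of search flags.
	function combineSearchFlags() {
		var options = 0
		for (var i = 0; i < arguments.length; i++)
			options |= arguments[i]
		return options
	}

	// Usage, assuming sText came from page.toStructuredText():
	// var options = combineSearchFlags(
	// 	StructuredText.SEARCH_IGNORE_CASE,
	// 	StructuredText.SEARCH_IGNORE_DIACRITICS
	// )
	// var hits = sText.search("resume", options)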

Instance methods
----------------

.. method:: StructuredText.prototype.search(needle, options)

	Search the text for all instances of needle, and return an array with all matches found on the page.

	Each match in the result is an array containing one or more Quads that cover the matching text.

	:param string needle: The text to search for.
	:param number options: Optional search options. A bitwise OR of flags such as `StructuredText.SEARCH_EXACT`.

	:returns: Array of Array of `Quad`

	.. code-block::

		var result = sText.search("Hello World!")
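
	A single match can consist of several quads, for example when the matching
	text wraps onto a second line. The sketch below reduces each match to one
	bounding rectangle. It assumes that a `Quad` is an array of eight numbers
	(four x, y corner pairs) and returns rectangles as ``[x0, y0, x1, y1]``.
	``matchBounds`` is a hypothetical helper, and the sample data is made up.

	.. code-block::

		// Hypothetical helper: one bounding rectangle per search match.
		// Assumes each Quad is [x, y, x, y, x, y, x, y] (four corner points).
		function matchBounds(matches) {
			return matches.map(function (quads) {
				var x0 = Infinity, y0 = Infinity
				var x1 = -Infinity, y1 = -Infinity
				quads.forEach(function (quad) {
					for (var i = 0; i < 8; i += 2) {
						x0 = Math.min(x0, quad[i])
						x1 = Math.max(x1, quad[i])
						y0 = Math.min(y0, quad[i + 1])
						y1 = Math.max(y1, quad[i + 1])
					}
				})
				return [x0, y0, x1, y1]
			})
		}

		// Made-up data standing in for the result of sText.search(...):
		// the second match spans two quads.
		var sampleMatches = [
			[[10, 10, 50, 10, 10, 20, 50, 20]],
			[[60, 30, 90, 30, 60, 40, 90, 40], [10, 42, 30, 42, 10, 52, 30, 52]]
		]
		var rects = matchBounds(sampleMatches)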

.. method:: StructuredText.prototype.highlight(p, q, maxHits)

	Return an array of `Quad` used to highlight a selection defined by the start and end points.

	:param Point p: Start point.
	:param Point q: End point.
	:param number maxHits: The maximum number of hits to return. Default 500.

	:returns: Array of `Quad`

	.. code-block::

		var result = sText.highlight([100, 100], [200, 100])

.. method:: StructuredText.prototype.copy(p, q)

	Return the text from the selection defined by the start and end points.

	:param Point p: Start point.
	:param Point q: End point.

	:returns: string

	.. code-block::

		var result = sText.copy([100, 100], [200, 100])

.. method:: StructuredText.prototype.walk(walker)

	Walk through the blocks (images or text blocks) of the structured text.
	For each text block, walk over its lines of text, and for each line, over
	its characters. The walker method matching each block, line, or character
	is called as it is encountered.

	:param StructuredTextWalker walker: Callback object.

	.. code-block::

		var sText = page.toStructuredText()
		sText.walk({
			beginLine: function (bbox, wmode, direction) {
				console.log("beginLine", bbox, wmode, direction)
			},
			endLine: function () {
				console.log("endLine")
			},
			beginTextBlock: function (bbox) {
				console.log("beginTextBlock", bbox)
			},
			endTextBlock: function () {
				console.log("endTextBlock")
			},
			beginStruct: function (standard, raw, index) {
				console.log("beginStruct", standard, raw, index)
			},
			endStruct: function () {
				console.log("endStruct")
			},
			onChar: function (utf, origin, font, size, quad, argb, flags) {
				console.log("onChar", utf, origin, font, size, quad, argb, flags)
			},
			onImageBlock: function (bbox, transform, image) {
				console.log("onImageBlock", bbox, transform, image)
			},
			onVector: function (bbox, flags, argb) {
				console.log("onVector", bbox, flags, argb)
			},
		})
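
	As a sketch of a more focused walker, the factory below gathers the
	characters of each line into a string. ``makeLineCollector`` is a
	hypothetical name. The code assumes that callbacks left out of the walker
	are simply skipped, and that ``onChar`` receives the character as a string
	in its first argument.

	.. code-block::

		// Hypothetical factory: returns a walker plus the array it fills.
		function makeLineCollector() {
			var lines = []
			var current = ""
			return {
				lines: lines,
				walker: {
					beginLine: function (bbox, wmode, direction) {
						current = ""	// start a fresh line buffer
					},
					onChar: function (utf) {
						current += utf	// assumes utf is a string
					},
					endLine: function () {
						lines.push(current)
					}
				}
			}
		}

		// Usage, assuming page is a loaded Page:
		// var collector = makeLineCollector()
		// page.toStructuredText().walk(collector.walker)
		// console.log(collector.lines.join("\n"))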

.. method:: StructuredText.prototype.asText()

	Returns a plain text representation.

	:returns: string

.. method:: StructuredText.prototype.asHTML(id)

	Returns a string containing an HTML rendering of the text.

	:param number id:
		Used to number the "id" on the top div tag (as ``"page" + id``).

	:returns: string

.. method:: StructuredText.prototype.asJSON(scale)

	Returns a JSON string representing the structured text data.

	This is a simplified serialization of the information that
	`StructuredText.prototype.walk()` provides.

	Note: Extract the structured text with the "preserve-spans" option.
	Without it, font changes in the middle of a line are not present in the
	JSON output.

	:param number scale: Optional scaling factor to multiply all the coordinates by.

	:returns: string containing JSON of the following schema:

		.. code-block:: typescript

			type StructuredTextPage = {
				blocks: StructuredTextBlock[]
			}
			type StructuredTextBlock = {
				type: "image" | "text",
				bbox: {
					x: number,
					y: number,
					w: number,
					h: number
				},
				lines: StructuredTextLine[],
			}
			type StructuredTextLine = {
				wmode: 0 | 1,	// 0=horizontal, 1=vertical
				bbox: {
					x: number,
					y: number,
					w: number,
					h: number
				},
				font: {
					name: string,
					family: "serif" | "sans-serif" | "monospace",
					weight: "normal" | "bold",
					style: "normal" | "italic",
					size: number
				},
				// text origin point for first character in line
				x: number,
				y: number,
				text: string
			}

	.. code-block::

		var data = JSON.parse(page.toStructuredText("preserve-spans").asJSON())
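
	Once parsed, the data is an ordinary object tree that follows the schema
	above. As an illustration, the hypothetical helper below collects the text
	of every line set in a bold font. The sample object is made up and only
	mirrors the schema.

	.. code-block::

		// Hypothetical helper: text of all lines whose font weight is "bold".
		function boldLines(data) {
			var result = []
			data.blocks.forEach(function (block) {
				if (block.type !== "text")
					return	// image blocks carry no text to inspect
				block.lines.forEach(function (line) {
					if (line.font.weight === "bold")
						result.push(line.text)
				})
			})
			return result
		}

		// Made-up data in the shape of the schema (bbox fields omitted).
		var samplePage = {
			blocks: [
				{ type: "image", lines: [] },
				{ type: "text", lines: [
					{ wmode: 0, font: { name: "A", weight: "bold", size: 18 }, text: "Title" },
					{ wmode: 0, font: { name: "A", weight: "normal", size: 11 }, text: "Body" }
				] }
			]
		}
		var headings = boldLines(samplePage)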