.. default-domain:: js
.. highlight:: javascript
StructuredText
===================
StructuredText objects hold text from a page that has been analyzed and grouped
into blocks, lines and spans.
Constructors
------------
.. class:: StructuredText
|no_new|
To obtain a StructuredText instance, use `Page.prototype.toStructuredText()`.
Static properties
-----------------
.. data:: StructuredText.SEARCH_EXACT
Used to search untransformed text.
.. data:: StructuredText.SEARCH_IGNORE_CASE
Used to search text, ignoring case differences.
.. data:: StructuredText.SEARCH_IGNORE_DIACRITICS
Used to search text, ignoring diacritics.
.. data:: StructuredText.SEARCH_REGEXP
Used to search text, treating the needle as a regular expression.
.. data:: StructuredText.SEARCH_KEEP_LINES
Used to search text, preserving line breaks.
.. data:: StructuredText.SEARCH_KEEP_PARAGRAPHS
Used to search text, preserving paragraph breaks.
.. data:: StructuredText.SEARCH_KEEP_HYPHENS
Used to search text, preserving hyphens and not joining lines.
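The ``options`` parameter of the search method below is described as an OR of these constants, which suggests they are bit flags. The following is a minimal sketch of combining and testing such flags. The ``SearchFlags`` object, its numeric values, and the helper names are hypothetical stand-ins for illustration; they are not the real values of ``StructuredText.SEARCH_*``.

```javascript
// Hypothetical stand-in for the StructuredText.SEARCH_* constants.
// The numeric values are placeholders chosen for illustration only.
var SearchFlags = {
    IGNORE_CASE: 1,
    IGNORE_DIACRITICS: 2,
    KEEP_LINES: 4,
}

// Combine any number of flags into a single options value (bitwise OR).
function combineFlags() {
    var options = 0
    for (var i = 0; i < arguments.length; i++) {
        options |= arguments[i]
    }
    return options
}

// Check whether a given flag is set in an options value.
function hasFlag(options, flag) {
    return (options & flag) === flag
}

var options = combineFlags(SearchFlags.IGNORE_CASE, SearchFlags.IGNORE_DIACRITICS)
console.log(options, hasFlag(options, SearchFlags.KEEP_LINES))
```

With the real constants, the equivalent expression would be along the lines of ``StructuredText.SEARCH_IGNORE_CASE | StructuredText.SEARCH_IGNORE_DIACRITICS``, passed as the ``options`` parameter.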
Instance methods
----------------
.. method:: StructuredText.prototype.search(needle, options)
Search the text for all instances of needle, and return an array with all matches found on the page.
Each match in the result is an array containing one or more Quads that cover the matching text.
:param string needle: The text to search for.
:param number options: Optional options for the search. A bitwise OR of options such as `StructuredText.SEARCH_EXACT`.
:returns: Array of Array of `Quad`
.. code-block::
var result = sText.search("Hello World!")
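Because each match is an array of quads, a caller may want a single bounding rectangle per match, for example to scroll a hit into view. The sketch below assumes that a `Quad` is an array of eight numbers holding four corner points as x, y pairs, and that a rectangle is written as ``[x0, y0, x1, y1]``. ``matchBounds`` is a hypothetical helper, not part of the API, and the sample data only stands in for a real search result.

```javascript
// Hypothetical helper: bounding rectangle [x0, y0, x1, y1] of one match.
// Assumes each quad is an array of 8 numbers: four (x, y) corner points.
function matchBounds(quads) {
    var x0 = Infinity, y0 = Infinity, x1 = -Infinity, y1 = -Infinity
    quads.forEach(function (quad) {
        for (var i = 0; i < quad.length; i += 2) {
            x0 = Math.min(x0, quad[i])
            y0 = Math.min(y0, quad[i + 1])
            x1 = Math.max(x1, quad[i])
            y1 = Math.max(y1, quad[i + 1])
        }
    })
    return [x0, y0, x1, y1]
}

// Sample data shaped like a search result: two matches, the second
// wrapping onto a new line and therefore covered by two quads.
var sampleResult = [
    [[10, 10, 50, 10, 10, 20, 50, 20]],
    [[80, 10, 120, 10, 80, 20, 120, 20], [10, 22, 30, 22, 10, 32, 30, 32]],
]

var bounds = sampleResult.map(matchBounds)
console.log(bounds)
```

Taking the minimum and maximum over all corners does not depend on the order in which the corners are stored, so the helper stays valid even if the assumed corner order is different.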
.. method:: StructuredText.prototype.highlight(p, q, maxHits)
Return an array of `Quad` used to highlight a selection defined by the start and end points.
:param Point p: Start point.
:param Point q: End point.
:param number maxHits: The maximum number of hits to return. Default 500.
:returns: Array of `Quad`
.. code-block::
var result = sText.highlight([100, 100], [200, 100])
.. method:: StructuredText.prototype.copy(p, q)
Return the text from the selection defined by the start and end points.
:param Point p: Start point.
:param Point q: End point.
:returns: string
.. code-block::
var result = sText.copy([100, 100], [200, 100])
.. method:: StructuredText.prototype.walk(walker)
Walk through the blocks (images or text blocks) of the structured text.
For each text block, walk over its lines of text, and for each line, over
its characters. The corresponding method of the walker is called for each
block, line, or character.
:param StructuredTextWalker walker: Callback object.
.. code-block::
var sText = page.toStructuredText()
sText.walk({
    beginLine: function (bbox, wmode, direction) {
        console.log("beginLine", bbox, wmode, direction)
    },
    endLine: function () {
        console.log("endLine")
    },
    beginTextBlock: function (bbox) {
        console.log("beginTextBlock", bbox)
    },
    endTextBlock: function () {
        console.log("endTextBlock")
    },
    beginStruct: function (standard, raw, index) {
        console.log("beginStruct", standard, raw, index)
    },
    endStruct: function () {
        console.log("endStruct")
    },
    onChar: function (utf, origin, font, size, quad, argb, flags) {
        console.log("onChar", utf, origin, font, size, quad, argb, flags)
    },
    onImageBlock: function (bbox, transform, image) {
        console.log("onImageBlock", bbox, transform, image)
    },
    onVector: function (bbox, flags, argb) {
        console.log("onVector", bbox, flags, argb)
    },
})
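A common use of a walker is to accumulate state between callbacks instead of logging. The following sketch gathers the text of each line. It assumes that ``onChar`` receives the character as a string in its first argument and that a walker may omit the callbacks it does not need. ``makeLineCollector`` is a hypothetical helper; the callbacks are driven by hand here so the example runs without a document.

```javascript
// Build a walker that gathers the bounding box and text of every line.
// Assumes onChar receives the character as a string in its first argument.
function makeLineCollector() {
    var lines = []
    var current = null
    return {
        lines: lines,
        walker: {
            beginLine: function (bbox, wmode, direction) {
                current = { bbox: bbox, text: "" }
            },
            onChar: function (utf) {
                if (current) current.text += utf
            },
            endLine: function () {
                if (current) lines.push(current)
                current = null
            },
        },
    }
}

// Stand-in for sText.walk(collector.walker): call the callbacks manually.
var collector = makeLineCollector()
var chars = "Hello".split("")
collector.walker.beginLine([0, 0, 100, 12], 0, [1, 0])
chars.forEach(function (c) {
    collector.walker.onChar(c)
})
collector.walker.endLine()
console.log(collector.lines)
```

With a real page, the hand-driven calls would be replaced by ``sText.walk(collector.walker)``, after which ``collector.lines`` holds one entry per line.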
.. method:: StructuredText.prototype.asText()
Returns a plain text representation.
:returns: string
.. method:: StructuredText.prototype.asHTML(id)
Returns a string containing an HTML rendering of the text.
:param number id:
Used to number the "id" on the top div tag (as ``"page" + id``).
:returns: string
.. method:: StructuredText.prototype.asJSON(scale)
Returns a JSON string representing the structured text data.
This is a simplified serialization of the information that
`StructuredText.prototype.walk()` provides.
Note: You must extract the structured text with "preserve-spans"!
If you forget to set this option, any font changes in the middle of the
line will not be present in the JSON output.
:param number scale: Optional scaling factor to multiply all the coordinates by.
:returns: string containing JSON of the following schema:
.. code-block:: typescript
type StructuredTextPage = {
    blocks: StructuredTextBlock[]
}
type StructuredTextBlock = {
    type: "image" | "text",
    bbox: {
        x: number,
        y: number,
        w: number,
        h: number
    },
    lines: StructuredTextLine[],
}
type StructuredTextLine = {
    wmode: 0 | 1, // 0=horizontal, 1=vertical
    bbox: {
        x: number,
        y: number,
        w: number,
        h: number
    },
    font: {
        name: string,
        family: "serif" | "sans-serif" | "monospace",
        weight: "normal" | "bold",
        style: "normal" | "italic",
        size: number
    },
    // text origin point for first character in line
    x: number,
    y: number,
    text: string
}
.. code-block::
var data = JSON.parse(page.toStructuredText("preserve-spans").asJSON())
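Once parsed, the data can be traversed with ordinary array methods. The sketch below collects the text and font of every line, following the schema above. ``linesFromJSON`` is a hypothetical helper, and the sample string is hand-written to match the schema rather than real extractor output. It also assumes that image blocks may lack a ``lines`` array, and guards against that.

```javascript
// Collect text and font info for all lines from a string shaped like
// the asJSON() output described by the schema.
function linesFromJSON(json) {
    var page = JSON.parse(json)
    var out = []
    page.blocks.forEach(function (block) {
        if (block.type !== "text") return
        var lines = block.lines || []
        lines.forEach(function (line) {
            out.push({ text: line.text, font: line.font.name, size: line.font.size })
        })
    })
    return out
}

// Hand-written sample following the schema; not real extractor output.
var sample = JSON.stringify({
    blocks: [
        { type: "image", bbox: { x: 0, y: 0, w: 50, h: 50 } },
        {
            type: "text",
            bbox: { x: 10, y: 60, w: 200, h: 12 },
            lines: [{
                wmode: 0,
                bbox: { x: 10, y: 60, w: 200, h: 12 },
                font: { name: "Times", family: "serif", weight: "normal", style: "normal", size: 12 },
                x: 10,
                y: 70,
                text: "Hello World!",
            }],
        },
    ],
})

var extracted = linesFromJSON(sample)
console.log(extracted)
```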