File: html2textrender.xml

<?xml version="1.0" encoding="UTF-8"?>
<!--

Documentation for LCL (Lazarus Component Library) and LazUtils (Lazarus 
Utilities) are published under the Creative Commons Attribution-ShareAlike 4.0 
International public license.

https://creativecommons.org/licenses/by-sa/4.0/legalcode.txt
https://gitlab.com/freepascal.org/lazarus/lazarus/-/blob/main/docs/cc-by-sa-4-0.txt

Copyright (c) 1997-2025, by the Lazarus Development Team.

-->
<fpdoc-descriptions>
<package name="lazutils">
<!--
====================================================================
html2textrender
====================================================================
-->
<module name="html2textrender">
<short>
Contains an HTML-to-Text renderer.
</short>
<descr>
<p>
<file>html2textrender.pas</file> contains an HTML-to-Text renderer. It 
converts HTML into plain text by replacing tags and their attributes with a 
plain text representation.
</p>
<p>
<file>html2textrender.pas</file> is part of the <file>lazutils</file> package.
</p>
</descr>

<!-- unresolved externals -->
<element name="Classes"/>
<element name="SysUtils"/>
<element name="LConvEncoding"/>

<element name="THTML2TextRenderer">
<short>Implements an HTML-to-Text renderer.</short>
<descr>
<p>
<var>THTML2TextRenderer</var> is an HTML-to-Text renderer. It converts HTML 
into plain text by stripping tags and their attributes. Converted text 
includes configurable indentation for HTML tags that affect the indentation 
level. The following HTML tags include special processing in the renderer:
</p>
<ul>
<li>HTML</li>
<li>BODY</li>
<li>P</li>
<li>BR</li>
<li>HR</li>
<li>OL</li>
<li>UL</li>
<li>LI</li>
<li>DIV CLASS="TITLE" (forces title mark output)</li>
</ul>
<p>
The following named character entities are converted to their plain text 
equivalents:
</p>
<dl>
<dt>&amp;nbsp;</dt>
<dd>' '</dd>
<dt>&amp;lt;</dt>
<dd>'&lt;'</dd>
<dt>&amp;gt;</dt>
<dd>'&gt;'</dd>
<dt>&amp;amp;</dt>
<dd>'&amp;'</dd>
</dl>
<p>
Other named character entities or numeric character entities are included 
verbatim in the plain text output.
</p>
<p>
A UTF-8 Byte Order Mark in the HTML is ignored.
</p>
<p>
Set property values in the class instance to customize the content and 
formatting produced in the output. Use the Render method to parse and process 
the HTML content passed to the constructor, and generate the output for the 
class instance.
</p>
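<p>
A minimal usage sketch for the workflow described above, assuming the 
<file>lazutils</file> package is available to the compiler. The program name, 
variable names, sample HTML, and the property values assigned here are 
illustrative only.
</p>
<code>
program RenderDemo;

{$mode objfpc}{$H+}

uses
  html2textrender;

var
  Renderer: THTML2TextRenderer;
  PlainText: String;
begin
  // Sample HTML used only for this sketch.
  Renderer := THTML2TextRenderer.Create(
    '&lt;html&gt;&lt;body&gt;&lt;div class="title"&gt;Notes&lt;/div&gt;' +
    '&lt;ul&gt;&lt;li&gt;One&lt;/li&gt;&lt;li&gt;Two&lt;/li&gt;&lt;/ul&gt;' +
    '&lt;/body&gt;&lt;/html&gt;');
  try
    // Override some defaults before rendering (arbitrary example values).
    Renderer.TitleMark := '*';
    Renderer.ListItemMark := '- ';
    Renderer.IndentStep := 4;
    PlainText := Renderer.Render();
  finally
    Renderer.Free;
  end;
  WriteLn(PlainText);
end.
</code>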
</descr>
<seealso>
<link id="THTML2TextRenderer.Create"/>
<link id="THTML2TextRenderer.Render"/>
</seealso>
</element>

<element name="THTML2TextRenderer.fHTML">
<short>HTML content examined in the class.</short>
</element>
<element name="THTML2TextRenderer.fOutput">
<short>Output value without HTML tags and attributes.</short>
</element>
<element name="THTML2TextRenderer.fMaxLines">
<short>Maximum number of lines allowed in the output from the class.</short>
</element>

<!-- TODO: Added in 103d9f42. -->
<element name="THTML2TextRenderer.fMaxLineLen">
<short>Maximum length allowed for a line in the render output.</short>
</element>

<element name="THTML2TextRenderer.fLineEndMark">
<short>End of line marker, by default standard LineEnding.</short>
</element>
<element name="THTML2TextRenderer.fTitleMark">
<short>Markup used at the start/end of title text.</short>
</element>
<element name="THTML2TextRenderer.fHorzLine">
<short>Markup used for an HR Tag.</short>
</element>
<element name="THTML2TextRenderer.fLinkBegin">
<short>Markup used at the start of an Anchor Tag.</short>
</element>
<element name="THTML2TextRenderer.fLinkEnd">
<short>Markup used at the end of an Anchor Tag.</short>
</element>
<element name="THTML2TextRenderer.fListItemMark">
<short>Markup used for a list item tag.</short>
</element>
<element name="THTML2TextRenderer.fMoreMark">
<short>Text added when there are too many lines.</short>
</element>
<element name="THTML2TextRenderer.fInHeader">
<short>Flag used to suppress line breaks in the output.</short>
</element>
<element name="THTML2TextRenderer.fInDivTitle">
<short>
Flag used to indicate that a DIV tag with its CLASS attribute set to "TITLE" 
is being processed.
</short>
</element>
<element name="THTML2TextRenderer.fPendingSpace">
<short>
Flag used to indicate that a space character needs to be added to the end of 
a wrapped line.
</short>
</element>
<element name="THTML2TextRenderer.fPendingNewLineCnt">
<short>Number of pending line breaks to append to the output.</short>
</element>
<element name="THTML2TextRenderer.fIndentStep">
<short>Increment (in spaces) for each nested HTML level.</short>
</element>
<element name="THTML2TextRenderer.fIndent">
<short>
The current indentation level for the renderer.
</short>
</element>

<element name="THTML2TextRenderer.fLineLen">
<short>Length of the current line in the render output.</short>
</element>

<element name="THTML2TextRenderer.fLineCnt">
<short>Number of lines added to the output for the class.</short>
</element>
<element name="THTML2TextRenderer.fHtmlLen">
<short>Length of the HTML examined in the class.</short>
</element>
<element name="THTML2TextRenderer.p">
<short>Current character position in the HTML.</short>
</element>

<element name="THTML2TextRenderer.AddNewLine">
<short>Sets a pending line break to be added later.</short>
</element>

<element name="THTML2TextRenderer.AddOneNewLine">
<short>Sets a maximum of one pending line break to be added later.</short>
</element>

<element name="THTML2TextRenderer.AddOutput">
<short>Appends text to the plain text output for the renderer.</short>
<descr>
<p>
<var>AddOutput</var> is a <var>Boolean</var> function used to append the 
value specified in <var>aText</var> to the output for the renderer.
</p>
<p>
AddOutput ensures that a space character is included for wrapped lines in the 
HTML when there are no pending new lines. Otherwise, the required number of 
line ending sequences are appended to the output for the renderer and the line 
count is increased accordingly. If the line count exceeds the maximum number 
allowed in <var>Render</var>, the value in <var>MoreMark</var> is appended to 
the output.
</p>
<p>
Pending new line(s) also cause indentation spaces to be appended to the 
output.
</p>
<p>
The value in aText is appended to the output for the renderer prior to 
exiting from the method.
</p>
<p>
AddOutput is used in the implementation of the <var>Render</var>, 
<var>HtmlTag</var>, and <var>HtmlEntity</var> methods.
</p>
</descr>
<seealso>
<link id="THTML2TextRenderer.Render"/>
<link id="THTML2TextRenderer.LineEndMark"/>
<link id="THTML2TextRenderer.MoreMark"/>
</seealso>
</element>
<element name="THTML2TextRenderer.AddOutput.aText">
<short>Text value appended to the output for the renderer.</short>
</element>
<element name="THTML2TextRenderer.AddOutput.Result">
<short>
<b>True</b> when the value was added to the output; <b>False</b> when the 
maximum number of lines is exceeded.
</short>
</element>

<element name="THTML2TextRenderer.HtmlTag">
<short>Handles an HTML tag and its attribute values.</short>
<descr>
<p>
<var>HtmlTag</var> is a <var>Boolean</var> function used to locate and 
process an HTML start or end tag, and any attribute name/value pairs present 
in the tag. HtmlTag handles the following HTML tag and attribute/value names:
</p>
<dl>
<dt>HTML</dt>
<dd>Sets the FInHeader flag to indicate that the content is for a whole 
page.</dd>
<dt>BODY</dt>
<dd>Calls Reset to initialize the renderer.</dd>
<dt>P, /P, BR, /UL</dt>
<dd>Adds a new line sequence to the output.</dd>
<dt>DIV CLASS="Title"</dt>
<dd>
Sets the fInDivTitle flag, and adds a NewLine and a TitleMark to the output. 
When the CLASS attribute is omitted or has a different value, only a NewLine 
sequence is appended.
</dd>
<dt>/DIV</dt>
<dd>
Appends a trailing TitleMark, resets the fInDivTitle flag, appends a NewLine 
sequence, and decrements the indentation level.
</dd>
<dt>LI</dt>
<dd>
Increments the indentation level and adds a single NewLine prior to adding 
the content in the list.
</dd>
<dt>/LI</dt>
<dd>Decrements the indentation level.</dd>
<dt>A</dt>
<dd>Appends a Space character and the LinkBegin sequence to the output.</dd>
<dt>/A</dt>
<dd>Appends a LinkEnd sequence and a Space character to the output.</dd>
<dt>HR</dt>
<dd>Adds a single NewLine and the content in HorzLine to the output.</dd>
</dl>
<p>
All other tag names are ignored in the method.
</p>
<p>
The return value is <b>True</b> when the HTML content is successfully added 
by calling <var>AddOutput</var>. The return value is <b>False</b> when the 
maximum number of lines specified in the <var>Render</var> method is exceeded.
</p>
</descr>
</element>
<element name="THTML2TextRenderer.HtmlTag.Result">
<short>
<b>True</b> when output is successfully added; <b>False</b> when the maximum 
number of lines is exceeded.
</short>
</element>

<element name="THTML2TextRenderer.HtmlEntity">
<short>Handles an HTML character entity.</short>
<descr>
<p>
<var>HtmlEntity</var> is a <var>Boolean</var> function used to convert common 
character entities in HTML to their plain text equivalents. The following 
named character entities are converted:
</p>
<dl>
<dt>&amp;nbsp;</dt>
<dd>' '</dd>
<dt>&amp;lt;</dt>
<dd>'&lt;'</dd>
<dt>&amp;gt;</dt>
<dd>'&gt;'</dd>
<dt>&amp;amp;</dt>
<dd>'&amp;'</dd>
</dl>
<p>
Other named character entities or numeric character entities are included 
verbatim in the plain text output.
</p>
<p>
The return value is the result from the <var>AddOutput</var> method, and 
contains <b>False</b> when the maximum number of lines has been exceeded in 
the renderer.
</p>
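<p>
The entity mapping listed above can be sketched as a standalone routine. 
<var>EntityToText</var> is a hypothetical helper written for this 
illustration; it is not part of <file>html2textrender.pas</file>, and it 
assumes an exact, case-sensitive match on the entity text.
</p>
<code>
program EntityDemo;

{$mode objfpc}{$H+}

// Hypothetical helper (not part of html2textrender): converts the four
// named entities listed above and returns any other value unchanged.
function EntityToText(const aEntity: String): String;
begin
  if aEntity = '&amp;nbsp;' then
    Result := ' '
  else if aEntity = '&amp;lt;' then
    Result := '&lt;'
  else if aEntity = '&amp;gt;' then
    Result := '&gt;'
  else if aEntity = '&amp;amp;' then
    Result := '&amp;'
  else
    Result := aEntity; // other entities are passed through verbatim
end;

begin
  WriteLn(EntityToText('&amp;lt;'));   // converted to a single character
  WriteLn(EntityToText('&amp;copy;')); // not in the list, written unchanged
end.
</code>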
</descr>
<seealso/>
</element>
<element name="THTML2TextRenderer.HtmlEntity.Result">
<short>
<b>True</b> on success, <b>False</b> when the maximum number of lines is 
exceeded.
</short>
</element>

<element name="THTML2TextRenderer.Reset">
<short>Resets the state and output for the renderer.</short>
<descr>
<p>
<var>Reset</var> is a procedure used to reset the state and output for the 
renderer. Reset sets values for internal flags used in the class, and clears 
any content stored in the render output.
</p>
</descr>
</element>

<element name="THTML2TextRenderer.SplitLongLine">
<short>
Adds a new line to split a line when it reaches the maximum length allowed in 
the class.
</short>
</element>
<element name="THTML2TextRenderer.SplitLongLine.Result">
<short>
Returns <b>True</b> if the line exceeded the allowed length and a new line was added.
</short>
</element>

<element name="THTML2TextRenderer.Create">
<short>Constructor for the class instance.</short>
<descr>
<p>
<var>Create</var> is the overloaded constructor for the class instance. HTML 
content is passed as an argument to the method using either a 
<var>String</var> value or a <var>TStream</var> instance. The Stream-based 
variant reads the content in Stream into a String variable for processing. 
The position in the stream is not changed prior to or after reading its 
content.
</p>
<p>
Create stores the HTML content in <var>aHTML</var> to an internal member used 
when parsing and processing using methods in the class. A UTF-8 Byte Order 
Mark (BOM) at the start of the HTML content is removed prior to processing.
</p>
<p>
Create sets the default values for the following properties:
</p>
<dl>
<dt>LineEndMark</dt>
<dd>Set to the value in the LineEnding constant for the platform or OS.</dd>
<dt>TitleMark</dt>
<dd>Set to the UTF-8 character '◈' (#9672 or #x25C8)</dd>
<dt>HorzLineMark</dt>
<dd>Set to the UTF-8 characters 
'——————————————————'.</dd>
<dt>LinkBeginMark</dt>
<dd>Set to the character '_'. </dd>
<dt>LinkEndMark</dt>
<dd>Set to the character '_'. </dd>
<dt>ListItemMark</dt>
<dd>Set to the UTF-8 characters '✶ ' (Hex #$2736).</dd>
<dt>MoreMark</dt>
<dd>
Set to the characters '...' (Three Period characters - not an Ellipsis 
character).
</dd>
<dt>IndentStep</dt>
<dd>Set to 2.</dd>
</dl>
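<p>
A short sketch of the Stream-based constructor, assuming the HTML is held in 
a <var>TStringStream</var>. The program name, variable names, and sample HTML 
are illustrative. The stream position is set explicitly before the call as a 
precaution, on the assumption that the content is read from the current 
position.
</p>
<code>
program StreamDemo;

{$mode objfpc}{$H+}

uses
  Classes, html2textrender;

var
  Source: TStringStream;
  Renderer: THTML2TextRenderer;
begin
  Source := TStringStream.Create('&lt;p&gt;Loaded from a stream.&lt;/p&gt;');
  try
    // Assumption: reading starts at the current stream position.
    Source.Position := 0;
    Renderer := THTML2TextRenderer.Create(Source);
    try
      WriteLn(Renderer.Render());
    finally
      Renderer.Free;
    end;
  finally
    Source.Free;
  end;
end.
</code>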
</descr>
<seealso/>
</element>
<element name="THTML2TextRenderer.Create.aHTML">
<short>String with the HTML content examined in the class.</short>
</element>
<element name="THTML2TextRenderer.Create.Stream">
<short>TStream instance with the HTML content examined in the class.</short>
</element>

<element name="THTML2TextRenderer.Destroy">
<short>Frees the class instance.</short>
<descr>
<p>
<var>Destroy</var> is the overridden destructor for the class instance. 
Destroy calls the inherited destructor.
</p>
</descr>
<seealso/>
</element>

<element name="THTML2TextRenderer.Render">
<short>
Parses the HTML and renders the plain text output.
</short>
<descr>
<p>
<var>Render</var> is a <var>String</var> function used to parse the HTML 
passed as an argument to the constructor, and to render the plain text output 
in the return value. The output is limited to the number of lines specified 
in the <var>aMaxLines</var> argument. The default value for the argument is 
the <var>MaxInt</var> constant.
</p>
<remark>
<var>AddOutput</var>, <var>HtmlTag</var>, and <var>HtmlEntity</var> return 
<b>False</b> if <var>aMaxLines</var> was exceeded.
</remark>
<p>
Render calls the <var>Reset</var> method to set the initial values for 
members and flags used in the class instance. The parsing mechanism looks for 
HTML tags and character entities/references, processes their content, and 
calls the <var>AddOutput</var> method. Whitespace (characters #32, #9, #10, 
and #13) between tags and entities is always normalized into a single space 
character.
</p>
<p>
Render calls the HtmlTag, HtmlEntity, and AddOutput methods to process the 
HTML content passed to the method.
</p>
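<p>
The following sketch shows one way to limit the output using the 
<var>aMaxLines</var> argument. The sample HTML and the limit of 2 are 
arbitrary. Based on the description above, the value in <var>MoreMark</var> 
is expected in the result when the limit is exceeded; the exact output is not 
shown here.
</p>
<code>
program MaxLinesDemo;

{$mode objfpc}{$H+}

uses
  html2textrender;

var
  Renderer: THTML2TextRenderer;
begin
  Renderer := THTML2TextRenderer.Create(
    '&lt;p&gt;One&lt;/p&gt;&lt;p&gt;Two&lt;/p&gt;' +
    '&lt;p&gt;Three&lt;/p&gt;&lt;p&gt;Four&lt;/p&gt;');
  try
    // Request at most two lines of plain text (arbitrary example limit).
    WriteLn(Renderer.Render(2));
  finally
    Renderer.Free;
  end;
end.
</code>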
</descr>
<seealso/>
</element>
<element name="THTML2TextRenderer.Render.aMaxLines">
<short>Maximum number of lines to process in the method.</short>
</element>
<element name="THTML2TextRenderer.Render.Result">
<short>String with the plain text content extracted from the HTML.</short>
</element>

<element name="THTML2TextRenderer.LineEndMark">
<short>Defines the end-of-line character sequence.</short>
<descr>
<p>
<var>LineEndMark</var> is a <var>String</var> property which contains the 
end-of-line character sequence inserted in the plain text output for the 
renderer. By convention, the default value for the property is the value from 
the <var>LineEnding</var> constant defined for the platform or OS. The value 
is inserted in the renderer output in the <var>AddOutput</var> method.
</p>
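<p>
For example, the line ending can be forced to a single line feed regardless 
of the platform. This is a sketch only; the sample HTML and the choice of #10 
are illustrative.
</p>
<code>
program LineEndDemo;

{$mode objfpc}{$H+}

uses
  html2textrender;

var
  Renderer: THTML2TextRenderer;
begin
  Renderer := THTML2TextRenderer.Create(
    '&lt;p&gt;First&lt;/p&gt;&lt;p&gt;Second&lt;/p&gt;');
  try
    // Use a line feed instead of the platform-specific LineEnding value.
    Renderer.LineEndMark := #10;
    Write(Renderer.Render());
  finally
    Renderer.Free;
  end;
end.
</code>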
</descr>
<seealso>
<link id="#rtl.system.LineEnding">LineEnding</link>
</seealso>
</element>

<element name="THTML2TextRenderer.TitleMark">
<short>Defines the character used to delimit a title or header.</short>
<descr>
<p>
<var>TitleMark</var> is inserted both prior to and following a title/header 
found in the HTML content in the <var>HtmlTag</var> method. The default value 
is the UTF-8 character '◈' (Decimal #9672 or Hex #x25C8).
</p>
</descr>
</element>

<element name="THTML2TextRenderer.HorzLineMark">
<short>Represents an HR tag in the plain text output.</short>
<descr>
<p>
<var>HorzLineMark</var> is used in the implementation of the 
<var>HtmlTag</var> method when a <b>HR</b> tag is encountered in the HTML 
content. The default value for the property is the UTF-8 characters 
'——————————————————' (Eighteen Hex #$2013 
characters).
</p>
</descr>
<seealso/>
</element>

<element name="THTML2TextRenderer.LinkBeginMark">
<short>Represents an A start tag in the plain text output.</short>
<descr>
<p>
<var>LinkBeginMark</var> is a <var>String</var> property used to represent 
the start of the plain text output for an HTML A tag. <var>LinkEndMark</var> 
is used to represent the end of the anchor. The value is added to the plain 
text output for the renderer in the <var>HtmlTag</var> method.
</p>
</descr>
<seealso>
<link id="THTML2TextRenderer.LinkEndMark"/>
</seealso>
</element>

<element name="THTML2TextRenderer.LinkEndMark">
<short>Represents an A end tag in the plain text output.</short>
<descr>
<p>
<var>LinkEndMark</var> is a <var>String</var> property used to represent the 
end of the plain text output for an HTML A tag. <var>LinkBeginMark</var> is 
used to represent the start of the anchor. The value is added to the plain 
text output for the renderer in the <var>HtmlTag</var> method.
</p>
</descr>
<seealso>
<link id="THTML2TextRenderer.LinkBeginMark"/>
</seealso>
</element>

<element name="THTML2TextRenderer.ListItemMark">
<short>Represents a list item in the plain text output.</short>
<descr>
<p>
<var>ListItemMark</var> is a <var>String</var> property which contains the 
character(s) inserted before an HTML LI tag. The value is added to the plain 
text output for the renderer in the <var>HtmlTag</var> method.
</p>
</descr>
<seealso/>
</element>

<element name="THTML2TextRenderer.MoreMark">
<short>
Indicates that the plain text output is truncated due to a line limit 
restriction.
</short>
<descr>
<p>
The default value for the property is three (3) Period ('.') characters - 
<b>NOT</b> an Ellipsis character. The value is added to the plain text output 
for the renderer when the maximum number of lines has been exceeded in the 
<var>AddOutput</var> method.
</p>
</descr>
<seealso/>
</element>

<element name="THTML2TextRenderer.IndentStep">
<short>
Number of space characters used for each indentation level in the plain text 
output.
</short>
<descr>
<p>
<var>IndentStep</var> is an <var>Integer</var> property used to indicate the 
number of space characters generated for each indentation level in the plain 
text output for the renderer. The default value for the property is <b>2</b>. 
The value is used in the implementation of the <var>AddOutput</var> method.
</p>
</descr>
<seealso/>
</element>

<element name="THTML2TextRenderer.MaxLineLen">
<short>
Maximum number of characters allowed in a line output by the renderer.
</short>
<descr>
<p>
The default value for the property is 80 (characters) as assigned in the Create 
constructor.
</p>
<p>
<var>MaxLineLen</var> is used when the AddOutput method checks the length of 
the generated output from the class.
</p>
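<p>
A sketch of overriding the line length, assuming the property is writable and 
that LazUtils version 4.0 or later is in use. The value 40 and the sample 
HTML are arbitrary.
</p>
<code>
program WrapDemo;

{$mode objfpc}{$H+}

uses
  html2textrender;

var
  Renderer: THTML2TextRenderer;
begin
  Renderer := THTML2TextRenderer.Create(
    '&lt;p&gt;A fairly long paragraph of sample text that is expected to ' +
    'be split across several output lines.&lt;/p&gt;');
  try
    // Assumption: MaxLineLen can be assigned; 40 is an example value.
    Renderer.MaxLineLen := 40;
    WriteLn(Renderer.Render());
  finally
    Renderer.Free;
  end;
end.
</code>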
</descr>
<version>
Added in LazUtils version 4.0.
</version>
<seealso/>
</element>

<element name="THTML2TextRenderer.IndentStep">
<short>
Increment (in spaces) for each nested HTML level.
</short>
<descr>
<p>
<var>IndentStep</var> contains the number of spaces included at the start of a 
line each time a new Indent level is set for the renderer. The default value 
for the property is 2 (characters) per Indent level as assigned in the Create 
constructor.
</p>
</descr>
<version>
Added in LazUtils version 4.0.
</version>
<seealso/>
</element>

<element name="RenderHTML2Text">
<short>Converts the specified HTML content to a plain text value.</short>
<descr>
<p>
<var>RenderHTML2Text</var> is a <var>String</var> function used to convert 
the HTML content specified in <var>AHTML</var> to a string with the plain 
text for the content. RenderHTML2Text creates a temporary 
<var>THTML2TextRenderer</var> instance (using its default configuration 
values) to remove any HTML mark-up found in the AHTML argument by calling its 
<var>Render</var> method.
</p>
<p>
RenderHTML2Text is a convenience routine; use a THTML2TextRenderer instance 
when the HTML content is stored in a TStream instance, or to override the 
default configuration settings for the class instance.
</p>
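<p>
A minimal sketch of calling the routine with default settings. The program 
name and the sample HTML are illustrative only.
</p>
<code>
program QuickDemo;

{$mode objfpc}{$H+}

uses
  html2textrender;

begin
  // Convert a small HTML fragment using the default renderer settings.
  WriteLn(RenderHTML2Text(
    '&lt;p&gt;Hello &lt;a href="#"&gt;world&lt;/a&gt;&lt;/p&gt;'));
end.
</code>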
</descr>
<seealso>
<link id="THTML2TextRenderer.Render"/>
</seealso>
</element>
<element name="RenderHTML2Text.Result">
<short>
String with the plain text value for the specified HTML content.
</short>
</element>
<element name="RenderHTML2Text.AHTML">
<short>String with the HTML content converted in the routine.</short>
</element>

</module>
<!-- html2textrender -->
</package>
</fpdoc-descriptions>