1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845
|
NAME
HTML::StripScripts - Strip scripting constructs out of HTML
SYNOPSIS
use HTML::StripScripts;
my $hss = HTML::StripScripts->new({ Context => 'Inline' });
$hss->input_start_document;
$hss->input_start('<i>');
$hss->input_text('hello, world!');
$hss->input_end('</i>');
$hss->input_end_document;
print $hss->filtered_document;
DESCRIPTION
This module strips scripting constructs out of HTML, leaving as much
non-scripting markup in place as possible. This allows web applications
to display HTML originating from an untrusted source without introducing
XSS (cross site scripting) vulnerabilities.
You will probably use HTML::StripScripts::Parser rather than using this
module directly.
The process is based on whitelists of tags, attributes and attribute
values. This approach is the most secure against disguised scripting
constructs hidden in malicious HTML documents.
As well as removing scripting constructs, this module ensures that there
is a matching end for each start tag, and that the tags are properly
nested.
Previously, in order to customise the output, you needed to subclass
"HTML::StripScripts" and override methods. Now, most customisation can
be done through the "Rules" option provided to "new()". (See
examples/declaration/ and examples/tags/ for cases where subclassing is
necessary.)
The HTML document must be parsed into start tags, end tags and text
before it can be filtered by this module. Use either
HTML::StripScripts::Parser or HTML::StripScripts::Regex instead if you
want to input an unparsed HTML document.
See examples/direct/ for an example of how to feed tokens directly to
HTML::StripScripts.
CONSTRUCTORS
new ( CONFIG )
Creates a new "HTML::StripScripts" filter object, bound to a
particular filtering policy. If present, the CONFIG parameter must
be a hashref. The following keys are recognized (unrecognized keys
will be silently ignored).
$s = HTML::Stripscripts->new({
Context => 'Document|Flow|Inline|NoTags',
BanList => [qw( br img )] | {br => '1', img => '1'},
BanAllBut => [qw(p div span)],
AllowSrc => 0|1,
AllowHref => 0|1,
AllowRelURL => 0|1,
AllowMailto => 0|1,
EscapeFiltered => 0|1,
Rules => { See below for details },
});
"Context"
A string specifying the context in which the filtered document
will be used. This influences the set of tags that will be
allowed.
If present, the "Context" value must be one of:
"Document"
If "Context" is "Document" then the filter will allow a full
HTML document, including the "HTML" tag and "HEAD" and
"BODY" sections.
"Flow"
If "Context" is "Flow" then most of the cosmetic tags that
one would expect to find in a document body are allowed,
including lists and tables but not including forms.
"Inline"
If "Context" is "Inline" then only inline tags such as "B"
and "FONT" are allowed.
"NoTags"
If "Context" is "NoTags" then no tags are allowed.
The default "Context" value is "Flow".
"BanList"
If present, this option must be an arrayref or a hashref. Any
tag that would normally be allowed (because it presents no XSS
hazard) will be blocked if the lowercase name of the tag is in
this list.
For example, in a guestbook application where "HR" tags are used
to separate posts, you may wish to prevent posts from including
"HR" tags, even though "HR" is not an XSS risk.
"BanAllBut"
If present, this option must be reference to an array holding a
list of lowercase tag names. This has the effect of adding all
but the listed tags to the ban list, so that only those tags
listed will be allowed.
"AllowSrc"
By default, the filter won't allow constructs that cause the
browser to fetch things automatically, such as "SRC" attributes
in "IMG" tags. If this option is present and true then those
constructs will be allowed.
"AllowHref"
By default, the filter won't allow constructs that cause the
browser to fetch things if the user clicks on something, such as
the "HREF" attribute in "A" tags. Set this option to a true
value to allow this type of construct.
"AllowRelURL"
By default, the filter won't allow relative URLs such as
"../foo.html" in "SRC" and "HREF" attribute values. Set this
option to a true value to allow them. "AllowHref" and / or
"AllowSrc" also need to be set to true for this to have any
effect.
"AllowMailto"
By default, "mailto:" links are not allowed. If "AllowMailto" is
set to a true value, then this construct will be allowed. This
can be enabled separately from AllowHref.
"EscapeFiltered"
By default, any filtered tags are outputted as
"<!--filtered-->". If "EscapeFiltered" is set to a true value,
then the filtered tags are converted to HTML entities.
For instance:
<br> --> <br>
"Rules"
The "Rules" option provides a very flexible way of customising
the filter.
The focus is safety-first, so it is applied after all of the
previous validation. This means that you cannot all malicious
data should already have been cleared.
Rules can be specified for tags and for attributes. Any tag or
attribute not explicitly listed will be handled by the default
"*" rules.
The following is a synopsis of all of the options that you can
use to configure rules. Below, an example is broken into
sections and explained.
Rules => {
tag => 0 | 1 | sub { tag_callback }
| {
attr => 0 | 1 | 'regex' | qr/regex/ | sub { attr_callback},
'*' => 0 | 1 | 'regex' | qr/regex/ | sub { attr_callback},
required => [qw(attrname attrname)],
tag => sub { tag_callback }
},
'*' => 0 | 1 | sub { tag_callback }
| {
attr => 0 | 1 | 'regex' | qr/regex/ | sub { attr_callback},
'*' => 0 | 1 | 'regex' | qr/regex/ | sub { attr_callback},
tag => sub { tag_callback }
}
}
EXAMPLE:
Rules => {
##########################
##### EXPLICIT RULES #####
##########################
## Allow <br> tags, reject <img> tags
br => 1,
img => 0,
## Send all <div> tags to a sub
div => sub { tag_callback },
## Allow <blockquote> tags,and allow the 'cite' attribute
## All other attributes are handled by the default C<*>
blockquote => {
cite => 1,
},
## Allow <a> tags, and
a => {
## Allow the 'title' attribute
title => 1,
## Allow the 'href' attribute if it matches the regex
href => '^http://yourdomain.com'
OR href => qr{^http://yourdomain.com},
## 'style' attributes are handled by a sub
style => sub { attr_callback },
## All other attributes are rejected
'*' => 0,
## Additionally, the <a> tag should be handled by this sub
tag => sub { tag_callback},
## If the <a> tag doesn't have these attributes, filter the tag
required => [qw(href title)],
},
##########################
##### DEFAULT RULES #####
##########################
## The default '*' rule - accepts all the same options as above.
## If a tag or attribute is not mentioned above, then the default
## rule is applied:
## Reject all tags
'*' => 0,
## Allow all tags and all attributes
'*' => 1,
## Send all tags to the sub
'*' => sub { tag_callback },
## Allow all tags, reject all attributes
'*' => { '*' => 0 },
## Allow all tags, and
'*' => {
## Allow the 'title' attribute
title => 1,
## Allow the 'href' attribute if it matches the regex
href => '^http://yourdomain.com'
OR href => qr{^http://yourdomain.com},
## 'style' attributes are handled by a sub
style => sub { attr_callback },
## All other attributes are rejected
'*' => 0,
## Additionally, all tags should be handled by this sub
tag => sub { tag_callback},
},
Tag Callbacks
sub tag_callback {
my ($filter,$element) = (@_);
$element = {
tag => 'tag',
content => 'inner_html',
attr => {
attr_name => 'attr_value',
}
};
return 0 | 1;
}
A tag callback accepts two parameters, the $filter object
and the C$element>. It should return 0 to completely ignore
the tag and its content (which includes any nested HTML
tags), or 1 to accept and output the tag.
The $element is a hash ref containing the keys:
"tag"
This is the tagname in lowercase, eg "a", "br", "img". If
you set the tag value to an empty string, then the tag will
not be outputted, but the tag contents will.
"content"
This is the equivalent of DOM's innerHTML. It contains the
text content and any HTML tags contained within this
element. You can change the content or set it to an empty
string so that it is not outputted.
"attr"
"attr" contains a hashref containing the attribute names and
values
If for instance, you wanted to replace "<b>" tags with "<span>"
tags, you could do this:
sub b_callback {
my ($filter,$element) = @_;
$element->{tag} = 'span';
$element->{attr}{style} = 'font-weight:bold';
return 1;
}
Attribute Callbacks
sub attr_callback {
my ( $filter, $tag, $attr_name, $attr_val ) = @_;
return undef | '' | 'value';
}
Attribute callbacks accept four parameters, the $filter object,
the $tag name, the $attr_name and the $attr_value.
It should return either "undef" to reject the attribute, or the
value to be used. An empty string keeps the attribute, but
without a value.
"BanList" vs "BanAllBut" vs "Rules"
It is not necessary to use "BanList" or "BanAllBut" - everything
can be done via "Rules", however it may be simpler to write:
BanAllBut => [qw(p div span)]
The logic works as follows:
* If BanAllBut exists, then ban everything but the tags in the list
* Add to the ban list any elements in BanList
* Any tags mentioned explicitly in Rules (eg a => 0, br => 1)
are added or removed from the BanList
* A default rule of { '*' => 0 } would ban all tags except
those mentioned in Rules
* A default rule of { '*' => 1 } would allow all tags except
those disallowed in the ban list, or by explicit rules
METHODS
This class provides the following methods:
hss_init ()
This method is called by new() and does the actual initialisation
work for the new HTML::StripScripts object.
input_start_document ()
This method initializes the filter, and must be called once before
starting on each HTML document to be filtered.
input_start ( TEXT )
Handles a start tag from the input document. TEXT must be the full
text of the tag, including angle-brackets.
input_end ( TEXT )
Handles an end tag from the input document. TEXT must be the full
text of the end tag, including angle-brackets.
input_text ( TEXT )
Handles some non-tag text from the input document.
input_process ( TEXT )
Handles a processing instruction from the input document.
input_comment ( TEXT )
Handles an HTML comment from the input document.
input_declaration ( TEXT )
Handles an declaration from the input document.
input_end_document ()
Call this method to signal the end of the input document.
filtered_document ()
Returns the filtered document as a string.
SUBCLASSING
The only reason for subclassing this module now is to add to the list of
accepted tags, attributes and styles (See "WHITELIST INITIALIZATION
METHODS"). Everything else can be achieved with "Rules".
The "HTML::StripScripts" class is subclassable. Filter objects are plain
hashes and "HTML::StripScripts" reserves only hash keys that start with
"_hss". The filter configuration can be set up by invoking the
hss_init() method, which takes the same arguments as new().
OUTPUT METHODS
The filter outputs a stream of start tags, end tags, text, comments,
declarations and processing instructions, via the following "output_*"
methods. Subclasses may override these to intercept the filter output.
The default implementations of the "output_*" methods pass the text on
to the output() method. The default implementation of the output()
method appends the text to a string, which can be fetched with the
filtered_document() method once processing is complete.
If the output() method or the individual "output_*" methods are
overridden in a subclass, then filtered_document() will not work in that
subclass.
output_start_document ()
This method gets called once at the start of each HTML document
passed through the filter. The default implementation does nothing.
output_end_document ()
This method gets called once at the end of each HTML document passed
through the filter. The default implementation does nothing.
output_start ( TEXT )
This method is used to output a filtered start tag.
output_end ( TEXT )
This method is used to output a filtered end tag.
output_text ( TEXT )
This method is used to output some filtered non-tag text.
output_declaration ( TEXT )
This method is used to output a filtered declaration.
output_comment ( TEXT )
This method is used to output a filtered HTML comment.
output_process ( TEXT )
This method is used to output a filtered processing instruction.
output ( TEXT )
This method is invoked by all of the default "output_*" methods. The
default implementation appends the text to the string that the
filtered_document() method will return.
output_stack_entry ( TEXT )
This method is invoked when a tag plus all text and nested HTML
content within the tag has been processed. It adds the tag plus its
content to the content for its parent tag.
REJECT METHODS
When the filter encounters something in the input document which it
cannot transform into an acceptable construct, it invokes one of the
following "reject_*" methods to put something in the output document to
take the place of the unacceptable construct.
The TEXT parameter is the full text of the unacceptable construct.
The default implementations of these methods output an HTML comment
containing the text "filtered". If "EscapeFiltered" is set to true, then
the rejected text is HTML escaped instead.
Subclasses may override these methods, but should exercise caution. The
TEXT parameter is unfiltered input and may contain malicious constructs.
reject_start ( TEXT )
reject_end ( TEXT )
reject_text ( TEXT )
reject_declaration ( TEXT )
reject_comment ( TEXT )
reject_process ( TEXT )
WHITELIST INITIALIZATION METHODS
The filter refers to various whitelists to determine which constructs
are acceptable. To modify these whitelists, subclasses can override the
following methods.
Each method is called once at object initialization time, and must
return a reference to a nested data structure. These references are
installed into the object, and used whenever the filter needs to refer
to a whitelist.
The default implementations of these methods can be invoked as class
methods.
See examples/tags/ and examples/declaration/ for examples of how to
override these methods.
init_context_whitelist ()
Returns a reference to the "Context" whitelist, which determines
which tags may appear at each point in the document, and which other
tags may be nested within them.
It is a hash, and the keys are context names, such as "Flow" and
"Inline".
The values in the hash are hashrefs. The keys in these subhashes are
lowercase tag names, and the values are context names, specifying
the context that the tag provides to any other tags nested within
it.
The special context "EMPTY" as a value in a subhash indicates that
nothing can be nested within that tag.
init_attrib_whitelist ()
Returns a reference to the "Attrib" whitelist, which determines
which attributes each tag can have and the values that those
attributes can take.
It is a hash, and the keys are lowercase tag names.
The values in the hash are hashrefs. The keys in these subhashes are
lowercase attribute names, and the values are attribute value class
names, which are short strings describing the type of values that
the attribute can take, such as "color" or "number".
init_attval_whitelist ()
Returns a reference to the "AttVal" whitelist, which is a hash that
maps attribute value class names from the "Attrib" whitelist to
coderefs to subs to validate (and optionally transform) a particular
attribute value.
The filter calls the attribute value validation subs with the
following parameters:
"filter"
A reference to the filter object.
"tagname"
The lowercase name of the tag in which the attribute appears.
"attrname"
The name of the attribute.
"attrval"
The attribute value found in the input document, in canonical
form (see "CANONICAL FORM").
The validation sub can return undef to indicate that the attribute
should be removed from the tag, or it can return the new value for
the attribute, in canonical form.
init_style_whitelist ()
Returns a reference to the "Style" whitelist, which determines which
CSS style directives are permitted in "style" tag attributes. The
keys are value names such as "color" and "background-color", and the
values are class names to be used as keys into the "AttVal"
whitelist.
init_deinter_whitelist
Returns a reference to the "DeInter" whitelist, which determines
which inline tags the filter should attempt to automatically
de-interleave if they are encountered interleaved. For example, the
filter will transform:
<b>hello <i>world</b> !</i>
Into:
<b>hello <i>world</i></b><i> !</i>
because both "b" and "i" appear as keys in the "DeInter" whitelist.
CHARACTER DATA PROCESSING
These methods transform attribute values and non-tag text from the input
document into canonical form (see "CANONICAL FORM"), and transform text
in canonical form into a suitable form for the output document.
text_to_canonical_form ( TEXT )
This method is used to reduce non-tag text from the input document
to canonical form before passing it to the filter_text() method.
The default implementation unescapes all entities that map to
"US-ASCII" characters other than ampersand, and replaces any
ampersands that don't form part of valid entities with "&".
quoted_to_canonical_form ( VALUE )
This method is used to reduce attribute values quoted with
doublequotes or singlequotes to canonical form before passing it to
the handler subs in the "AttVal" whitelist.
The default behavior is the same as that of
"text_to_canonical_form()", plus it converts any CR, LF or TAB
characters to spaces.
unquoted_to_canonical_form ( VALUE )
This method is used to reduce attribute values without quotes to
canonical form before passing it to the handler subs in the "AttVal"
whitelist.
The default implementation simply replaces all ampersands with
"&", since that corresponds with the way most browsers treat
entities in unquoted values.
canonical_form_to_text ( TEXT )
This method is used to convert the text in canonical form returned
by the filter_text() method to a form suitable for inclusion in the
output document.
The default implementation runs anything that doesn't look like a
valid entity through the escape_html_metachars() method.
canonical_form_to_attval ( ATTVAL )
This method is used to convert the text in canonical form returned
by the "AttVal" handler subs to a form suitable for inclusion in
doublequotes in the output tag.
The default implementation converts CR, LF and TAB characters to a
single space, and runs anything that doesn't look like a valid
entity through the escape_html_metachars() method.
validate_href_attribute ( TEXT )
If the "AllowHref" filter configuration option is set, then this
method is used to validate "href" type attribute values. TEXT is the
attribute value in canonical form. Returns a possibly modified
attribute value (in canonical form) or "undef" to reject the
attribute.
The default implementation allows only absolute "http" and "https"
URLs, permits port numbers and query strings, and imposes reasonable
length limits.
It does not URI escape the query string, and it does not guarantee
properly formatted URIs, it just tries to give safe URIs. You can
always use an attribute callback (see "Attribute Callbacks") to
provide stricter handling.
validate_mailto ( TEXT )
If the "AllowMailto" filter configuration option is set, then this
method is used to validate "href" type attribute values which begin
with "mailto:". TEXT is the attribute value in canonical form.
Returns a possibly modified attribute value (in canonical form) or
"undef" to reject the attribute.
This uses a lightweight regex and does not guarantee that email
addresses are properly formatted. You can always use an attribute
callback (see "Attribute Callbacks") to provide stricter handling.
validate_src_attribute ( TEXT )
If the "AllowSrc" filter configuration option is set, then this
method is used to validate "src" type attribute values. TEXT is the
attribute value in canonical form. Returns a possibly modified
attribute value (in canonical form) or "undef" to reject the
attribute.
The default implementation behaves as validate_href_attribute().
OTHER METHODS TO OVERRIDE
As well as the output, reject, init and cdata methods listed above, it
might make sense for subclasses to override the following methods:
filter_text ( TEXT )
This method will be invoked to filter blocks of non-tag text in the
input document. Both input and output are in canonical form, see
"CANONICAL FORM".
The default implementation does no filtering.
escape_html_metachars ( TEXT )
This method is used to escape all HTML metacharacters in TEXT. The
return value must be a copy of TEXT with metacharacters escaped.
The default implementation escapes a minimal set of metacharacters
for security against XSS vulnerabilities. The set of characters to
escape is a compromise between the need for security and the need to
ensure that the filter will work for documents in as many different
character sets as possible.
Subclasses which make strong assumptions about the document
character set will be able to escape much more aggressively.
strip_nonprintable ( TEXT )
Returns a copy of TEXT with runs of nonprintable characters replaced
with spaces or some other harmless string. Avoids replacing anything
with the empty string, as that can lead to other security issues.
The default implementation strips out only NULL characters, in order
to avoid scrambling text for as many different character sets as
possible.
Subclasses which make some sort of assumption about the character
set in use will be able to have a much wider definition of a
nonprintable character, and hence a more secure strip_nonprintable()
implementation.
ATTRIBUTE VALUE HANDLER SUBS
References to the following subs appear in the "AttVal" whitelist
returned by the init_attval_whitelist() method.
_hss_attval_style( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
Attribute value hander for the "style" attribute.
_hss_attval_size ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
Attribute value handler for attributes who's values are some sort of
size or length.
_hss_attval_number ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
Attribute value handler for attributes who's values are a simple
integer.
_hss_attval_color ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
Attribute value handler for color attributes.
_hss_attval_text ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
Attribute value handler for text attributes.
_hss_attval_word ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
Attribute value handler for attributes who's values must consist of
a single short word, with minus characters permitted.
_hss_attval_wordlist ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
Attribute value handler for attributes who's values must consist of
one or more words, separated by spaces and/or commas.
_hss_attval_wordlistq ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
Attribute value handler for attributes who's values must consist of
one or more words, separated by commas, with optional doublequotes
around words and spaces allowed within the doublequotes.
_hss_attval_href ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
Attribute value handler for "href" type attributes. If the
"AllowHref" or "AllowMailto" configuration options are set, uses the
validate_href_attribute() method to check the attribute value.
_hss_attval_src ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
Attribute value handler for "src" type attributes. If the "AllowSrc"
configuration option is set, uses the validate_src_attribute()
method to check the attribute value.
_hss_attval_stylesrc ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
Attribute value handler for "src" type style pseudo attributes.
_hss_attval_novalue ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
Attribute value handler for attributes that have no value or a value
that is ignored. Just returns the attribute name as the value.
CANONICAL FORM
Many of the methods described above deal with text from the input
document, encoded in what I call "canonical form", defined as follows:
All characters other than ampersands represent themselves. Literal
ampersands are encoded as "&". Non "US-ASCII" characters may appear
as literals in whatever character set is in use, or they may appear as
named or numeric HTML entities such as "æ", "穩" and
"ÿ". Unknown named entities such as "&foo;" may appear.
The idea is to be able to be able to reduce input text to a minimal
form, without making too many assumptions about the character set in
use.
PRIVATE METHODS
The following methods are internal to this class, and should not be
invoked from elsewhere. Subclasses should not use or override these
methods.
_hss_prepare_ban_list (CFG)
Returns a hash ref representing all the banned tags, based on the
values of BanList and BanAllBut
_hss_prepare_rules (CFG)
Returns a hash ref representing the tag and attribute rules (See
"Rules").
Returns undef if no filters are specified, in which case the
attribute filter code has very little performance impact. If any
rules are specified, then every tag and attribute is checked.
_hss_get_attr_filter ( DEFAULT_FILTERS TAG_FILTERS ATTR_NAME)
Returns the attribute filter rule to apply to this particular
attribute.
Checks for:
- a named attribute rule in a named tag
- a default * attribute rule in a named tag
- a named attribute rule in the default * rules
- a default * attribute rule in the default * rules
_hss_join_attribs (FILTERED_ATTRIBS)
Accepts a hash ref containing the attribute names as the keys, and
the attribute values as the values. Escapes them and returns a
string ready for output to HTML
_hss_decode_numeric ( NUMERIC )
Returns the string that should replace the numeric entity NUMERIC in
the text_to_canonical_form() method.
_hss_tag_is_banned ( TAGNAME )
Returns true if the lower case tag name TAGNAME is on the list of
harmless tags that the filter is configured to block, false
otherwise.
_hss_get_to_valid_context ( TAG )
Tries to get the filter to a context in which the tag TAG is
allowed, by introducing extra end tags or start tags if necessary.
TAG can be either the lower case name of a tag or the string
'CDATA'.
Returns 1 if an allowed context is reached, or 0 if there's no
reasonable way to get to an allowed context and the tag should just
be rejected.
_hss_close_innermost_tag ()
Closes the innermost open tag.
_hss_context ()
Returns the current named context of the filter.
_hss_valid_in_context ( TAG, CONTEXT )
Returns true if the lowercase tag name TAG is valid in context
CONTEXT, false otherwise.
_hss_valid_in_current_context ( TAG )
Returns true if the lowercase tag name TAG is valid in the filter's
current context, false otherwise.
BUGS AND LIMITATIONS
Performance
This module does a lot of work to ensure that tags are correctly
nested and are not left open, causing unnecessary overhead for
applications where that doesn't matter.
Such applications may benefit from using the more lightweight
HTML::Scrubber::StripScripts module instead.
Strictness
URIs and email addresses are cleaned up to be safe, but not
necessarily accurate. That would have required adding dependencies.
Attribute callbacks can be used to add this functionality if
required, or the validation methods can be overridden.
By default, filtered HTML may not be valid strict XHTML, for
instance empty required attributes may be outputted. However, with
"Rules", it should be possible to force the HTML to validate.
REPORTING BUGS
Please report any bugs or feature requests to
bug-html-stripscripts@rt.cpan.org, or through the web interface at
<http://rt.cpan.org>.
SEE ALSO
HTML::Parser, HTML::StripScripts::Parser, HTML::StripScripts::Regex
AUTHOR
Original author Nick Cleaton <nick@cleaton.net>
New code added and module maintained by Clinton Gormley
<clint@traveljury.com>
COPYRIGHT
Copyright (C) 2003 Nick Cleaton. All Rights Reserved.
Copyright (C) 2007 Clinton Gormley. All Rights Reserved.
LICENSE
This module is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.
|