File: Strip.pm

package info (click to toggle)
libhtml-strip-perl 2.12-1
  • links: PTS, VCS
  • area: main
  • in suites: sid, trixie
  • size: 172 kB
  • sloc: ansic: 266; perl: 88; makefile: 3
file content (323 lines) | stat: -rw-r--r-- 7,910 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
package HTML::Strip;

require DynaLoader;
our @ISA = qw(DynaLoader);
our $VERSION = '2.12';
bootstrap HTML::Strip $VERSION;

use 5.008;
use warnings;
use strict;

use Carp;

my $_html_entities_p = eval { require HTML::Entities; 1 };

my %defaults = (
    striptags => [qw( title
                      style
                      script
                      applet )],
    emit_spaces	    => 1,
    emit_newlines	=> 0,
    decode_entities	=> 1,
    filter          => $_html_entities_p ? 'filter_entities' : undef,
    auto_reset      => 0,
    debug           => 0,
);

sub new {
    my $class = shift;
    my $obj = _create();
    bless $obj, $class;

    my %args = (%defaults, @_);
    while( my ($key, $value) = each %args ) {
        my $method = "set_${key}";
        if( $obj->can($method) ) {
            $obj->$method($value);
        } else {
            Carp::carp "Invalid setting '$key'";
        }
    }
    return $obj;
}

sub set_striptags {
    my ($self, @tags) = @_;
    if( ref($tags[0]) eq 'ARRAY' ) {
        $self->_set_striptags_ref( $tags[0] );
    } else {
        $self->_set_striptags_ref( \@tags );
    }
}

{
    # an inside-out object approach
    # for the 'filter' attribute
    my %filter_of;

    sub set_filter {
        my ($self, $filter) = @_;
        $filter_of{0+$self} = $filter;
    }

    sub filter {
        my $self = shift;
        return $filter_of{0+$self}
    }

    # XXX rename _xs_destroy() to DESTROY() in Strip.xs if removing this code
    sub DESTROY {
        my $self = shift;
        delete $filter_of{0+$self};
        $self->_xs_destroy;
    }
}

# $decoded_string = $self->filter_entities( $string )
sub filter_entities {
    my $self = shift;
    if( $self->decode_entities ) {
        return HTML::Entities::decode($_[0]);
    }
    return $_[0];
}

sub _do_filter {
    my $self = shift;
    my $filter = $self->filter;
    # no filter: return immediately
    return $_[0] unless defined $filter;

    if ( !ref $filter ) { # method name
        return $self->$filter( @_ );
    } else { # code ref
        return $filter->( @_ );
    }
}

sub parse {
    my ($self, $text) = @_;
    my $stripped = $self->_strip_html( $text );
    return $self->_do_filter( $stripped );
}

sub eof {
    my $self = shift;
    $self->_reset();
}

1;
__END__

=head1 NAME

HTML::Strip - Perl extension for stripping HTML markup from text.

=head1 SYNOPSIS

  use HTML::Strip;

  my $hs = HTML::Strip->new();

  my $clean_text = $hs->parse( $raw_html );
  $hs->eof;

=head1 DESCRIPTION

This module simply strips HTML-like markup from text rapidly and
brutally.  It could easily be used to strip XML or SGML markup
instead - but as removing HTML is a much more common problem, this
module lives in the HTML:: namespace.

It is written in XS, and thus about five times quicker than using
regular expressions for the same task.

It does I<not> do any syntax checking.  If you want that, use
L<HTML::Parser>.  Instead it merely applies the following rules:

=over 4

=item 1

Anything that looks like a tag, or group of tags will be replaced with
a single space character.  Tags are considered to be anything that
starts with a C<E<lt>> and ends with a C<E<gt>>; with the caveat that a
C<E<gt>> character may appear in either of the following without
ending the tag:

=over 4

=item Quote

Quotes are considered to start with either a C<'> or a C<"> character,
and end with a matching character I<not> preceded by an even number or
escaping slashes (i.e. C<\"> does not end the quote but C<\\\\"> does).

=item Comment

If the tag starts with an exclamation mark, it is assumed to be a
declaration or a comment.   Within such tags, C<E<gt>> characters do not
end the tag if they appear within pairs of double dashes
(e.g. C<E<lt>!-- E<lt>a href="old.htm"E<gt>old pageE<lt>/aE<gt> --E<gt>>
would be stripped completely).  No parsing for quotes is performed
within comments, so for instance
C<E<lt>!-- comment with both ' quote types " --E<gt>>
would be entirely stripped.

=back

=item 2

Anything that appears between tags which we term I<strip tags> is removed.
By default, these tags are C<title>, C<script>, C<style> and C<applet>.

=back

HTML::Strip maintains state between calls, so you can parse a document
in chunks should you wish.  If a call to C<parse()> ends half-way through
a tag, quote or comment; the next call to C<parse()> expects its input to
carry on from that point.

If this is not the behaviour you want, you can either call C<eof()>
between calls to C<parse()>, or set C<auto_reset> to true (either
on the constructor or with C<set_auto_reset>) so that the parser will
reset after each call.

=head2 METHODS

=over

=item new()

Constructor.  Can optionally take a hash of settings (with keys
corresponding to the C<set_> methods below).

Example:

 my $hs = HTML::Strip->new(
     striptags   => [ 'script', 'iframe' ],
     emit_spaces => 0
 );

=item parse()

Takes a string as an argument, returns it stripped of HTML.

=item eof()

Resets the current state information, ready to parse a new block of HTML.

=item clear_striptags()

Clears the current set of strip tags.

=item add_striptag()

Adds the string passed as an argument to the current set of strip tags.

=item set_striptags()

Takes a reference to an array of strings, which replace the current
set of strip tags.

=item set_emit_spaces()

Takes a boolean value.  If set to false, HTML::Strip will not attempt
any conversion of tags into spaces.  Set to true by default.

=item set_emit_newlines()

Takes a boolean value.  If set to true, HTML::Strip will output newlines
after C<E<lt>brE<gt>> and C<E<lt>pE<gt>> tags.  Set to false by default.

=item set_decode_entities()

Takes a boolean value.  If set to false, HTML::Strip will not decode HTML
entities.  Set to true by default.

=item filter_entities()

If HTML::Entities is available, this method behaves just
like invoking HTML::Entities::decode_entities, except that
it respects the current setting of 'decode_entities'.

=item set_filter()

Sets a filter to be applied after tags were stripped.
It may accept the name of a method (like 'filter_entities')
or a code ref.  By default, its value is 'filter_entities'
if HTML::Entities is available or C<undef> otherwise.

=item set_auto_reset()

Takes a boolean value.  If set to true, C<parse> resets after
each call (equivalent to calling C<eof>).  Otherwise, the
parser remembers its state from one call to C<parse> to
another, until you call C<eof> explicitly.  Set to false
by default.

=item set_debug()

Outputs extensive debugging information on internal state during the parse.
Not intended to be used by anyone except the module maintainer.

=item decode_entities()

=item filter()

=item auto_reset()

=item debug()

Readonly accessors for their respective settings.

=back

=head2 LIMITATIONS

=over 4

=item Whitespace

Despite only outputting one space character per group of tags, and
avoiding doing so when tags are bordered by spaces or the start or
end of strings, HTML::Strip can often output more than desired; such
as with the following HTML:

 <h1> HTML::Strip </h1> <p> <em> <strong> fast, and brutal </strong> </em> </p>

Which gives the following output:

C<E<nbsp>HTML::StripE<nbsp>E<nbsp>E<nbsp>E<nbsp>fast, and brutalE<nbsp>E<nbsp>E<nbsp>>

Thus, you may want to post-filter the output of HTML::Strip to remove
excess whitespace (for example, using C<tr/ / /s;>).
(This has been improved since previous releases, but is still an issue)

=item HTML Entities

HTML::Strip will only attempt decoding of HTML entities if
L<HTML::Entities> is installed.

=back

=head2 EXPORT

None by default.

=head1 AUTHOR

Alex Bowley E<lt>kilinrax@cpan.orgE<gt>

=head1 SEE ALSO

L<perl>, L<HTML::Parser>, L<HTML::Entities>

=head1 LICENSE

This program is free software; you can redistribute it and/or modify it under
the same terms as Perl itself.

=cut