File: StopWords.pm

package Lingua::StopWords;
use strict;
use warnings;

require Exporter;
our @ISA = qw(Exporter);
our %EXPORT_TAGS = ( 'all' => [ qw( getStopWords ) ] );
our @EXPORT_OK = ( @{ $EXPORT_TAGS{'all'} } );
our $VERSION = 0.12;

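sub getStopWords {
    my ( $language, $encoding ) = @_;

    # A language code is required.
    return undef unless $language;

    # Language modules are named after the upper-cased ISO code,
    # e.g. Lingua::StopWords::EN.
    $language = uc($language);
    eval { require "Lingua/StopWords/$language.pm"; };
    return undef if $@;

    # Pass the encoding on only if the caller supplied one.
    my @args = $encoding ? ($encoding) : ();
    no strict 'refs';
    return &{ "Lingua::StopWords::$language\::getStopWords" }(@args);
}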

1;

__END__

=head1 NAME

Lingua::StopWords - Stop words for several languages.

=head1 SYNOPSIS

    use Lingua::StopWords qw( getStopWords );
    my $stopwords = getStopWords('en');

    my @words = qw( i am the walrus goo goo g'joob );

    # prints "walrus goo goo g'joob"
    print join ' ', grep { !$stopwords->{$_} } @words;

=head1 DESCRIPTION

In keyword search, it is common practice to suppress a collection of
"stopwords": words such as "the", "and", "maybe", etc. which occur in a
large number of documents and tell you nothing important about any
document that contains them.  This module provides such "stoplists" in
several languages.
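As a rough sketch of how a stoplist fits into keyword search, the snippet
below reduces a sentence to its content terms.  The helper content_terms() is
hypothetical and not part of this module, the tokenizer only handles ASCII
letters and apostrophes, and the small hashref merely stands in for the value
returned by getStopWords('en').

```perl
use strict;
use warnings;

# Hypothetical helper: lowercase the text, split it into word-like tokens,
# and keep only the tokens that are not keys of the stoplist hashref.
sub content_terms {
    my ( $text, $stoplist ) = @_;
    my @tokens = lc($text) =~ /[a-z']+/g;
    return grep { !$stoplist->{$_} } @tokens;
}

# Small stand-in for the hashref returned by getStopWords('en').
my $stoplist = { map { $_ => 1 } qw( the and is a of ) };

my @terms = content_terms( 'The walrus is a friend of the carpenter', $stoplist );
print join( ' ', @terms ), "\n";    # prints "walrus friend carpenter"
```

Lowercasing before the lookup matters here because the stand-in keys, like
those in the SYNOPSIS example, are lowercase.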

=head2 Supported Languages

    |-----------------------------------------------------------|
    | Language   | ISO code | default encoding | also available |
    |-----------------------------------------------------------|
    | Danish     | da       | ISO-8859-1       | UTF-8          |
    | Dutch      | nl       | ISO-8859-1       | UTF-8          |
    | English    | en       | ISO-8859-1       | UTF-8          |
    | Finnish    | fi       | ISO-8859-1       | UTF-8          |
    | French     | fr       | ISO-8859-1       | UTF-8          |
    | German     | de       | ISO-8859-1       | UTF-8          |
    | Hungarian  | hu       | ISO-8859-2       | UTF-8          |
    | Indonesian | id       | ISO-8859-1       | UTF-8          |
    | Italian    | it       | ISO-8859-1       | UTF-8          |
    | Norwegian  | no       | ISO-8859-1       | UTF-8          |
    | Portuguese | pt       | ISO-8859-1       | UTF-8          |
    | Romanian   | ro       | ISO-8859-2       | UTF-8          |
    | Spanish    | es       | ISO-8859-1       | UTF-8          |
    | Swedish    | sv       | ISO-8859-1       | UTF-8          |
    | Russian    | ru       | KOI8-R           | UTF-8          |
    |-----------------------------------------------------------|

=head1 FUNCTIONS

=head2 getStopWords

    my $stoplist      = getStopWords('en');
    my $utf8_stoplist = getStopWords('en', 'UTF-8');

Retrieve a stoplist in the form of a hashref where the keys are all
stopwords and the values are all 1.

    $stoplist = {
        and => 1,
        if  => 1,
        # ...
    };
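Because the stoplist is a plain hashref, it can be combined with your own
words.  The sketch below copies a stoplist into a new hash and adds two
domain-specific entries; the stand-in hashref and the extra words are
illustrative assumptions, not data shipped with this module.

```perl
use strict;
use warnings;

# Stand-in for the hashref returned by getStopWords('en').
my $stoplist = { and => 1, if => 1, the => 1 };

# Hypothetical domain-specific additions.
my @extra = qw( perl module );

# Copy into a new hash so the original stoplist is left untouched.
my %extended = ( %$stoplist, map { $_ => 1 } @extra );

my @words = qw( the perl module and the walrus );
my @kept  = grep { !exists $extended{$_} } @words;
print "@kept\n";    # prints "walrus"
```

Copying rather than modifying the hashref in place is a design choice of this
sketch, so that other code holding the same stoplist is not affected.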

getStopWords() expects 1-2 arguments.  The first, which is required, is an ISO
code representing a supported language; the code is upper-cased internally, so
case does not matter.  If no stoplist can be loaded for the ISO code,
getStopWords returns undef.
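Since an unsupported code yields undef, callers may want to guard against
dereferencing it.  A minimal sketch follows; stoplist_or_empty() is a
hypothetical wrapper, and 'xx' is assumed to be a code with no stoplist.

```perl
use strict;
use warnings;

# Hypothetical wrapper: returns a stoplist hashref, or an empty hashref when
# Lingua::StopWords is not installed or the language code is unsupported.
sub stoplist_or_empty {
    my ($language) = @_;
    my $loaded = eval { require Lingua::StopWords; 1 };
    return {} unless $loaded;
    my $stoplist = Lingua::StopWords::getStopWords($language);
    return defined $stoplist ? $stoplist : {};
}

my $unknown = stoplist_or_empty('xx');
print scalar( keys %$unknown ), "\n";    # prints "0"
```

Falling back to an empty hashref means that filtering code simply keeps every
word instead of dying on an undefined reference.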

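The second argument should be 'UTF-8' if you want the stopwords encoded in
UTF-8; without it, the stoplist uses the language's default encoding listed
above.  The UTF-8 flag will be turned on, so the keys are Perl character
strings rather than byte strings; make sure you understand all the
implications of that.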
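One implication is that raw UTF-8 bytes will not match keys containing
non-ASCII characters unless the input is decoded first.  The sketch below
illustrates this with the core Encode module; the one-entry hashref is a
stand-in for a UTF-8 stoplist, not the module's actual French data.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Stand-in for a UTF-8 stoplist: the key is the character string "\x{e9}t\x{e9}".
my $stoplist = { "\x{e9}t\x{e9}" => 1 };

# The same word as raw UTF-8 bytes, as it might arrive from a file or socket.
my $bytes = "\xc3\xa9t\xc3\xa9";

# Lookup with undecoded bytes: five bytes do not equal three characters.
my $raw_hit = exists $stoplist->{$bytes} ? 1 : 0;

# Decode the bytes into characters, then look up again.
my $chars       = decode( 'UTF-8', $bytes );
my $decoded_hit = exists $stoplist->{$chars} ? 1 : 0;

print "raw=$raw_hit decoded=$decoded_hit\n";    # prints "raw=0 decoded=1"
```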

=head1 INSTALLATION

To install this module, type the following:

   perl Build.PL
   ./Build
   ./Build test
   ./Build install

=head1 SEE ALSO

The stoplists supplied by this module were created as part of the Snowball
project (see L<http://snowball.tartarus.org>,
L<Lingua::Stem::Snowball|Lingua::Stem::Snowball>).

L<Lingua::EN::StopWords|Lingua::EN::StopWords> provides a different stoplist
for English.

=head1 SOURCE REPOSITORY

L<https://github.com/wollmers/Lingua-StopWords>

=head1 AUTHOR

Maintained by Helmut Wollmersdorfer E<lt>helmut@wollmersdorfer.atE<gt>
and Marvin Humphrey E<lt>marvin at rectangular dot comE<gt>.
Original author Fabien Potencier, E<lt>fabpot at cpan dot orgE<gt>.

=head1 COPYRIGHT

Copyright 2021 Helmut Wollmersdorfer

Copyright 2004-2008 Fabien Potencier, Marvin Humphrey

=head1 LICENSE

This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself, either Perl version 5.8.3 or,
at your option, any later version of Perl 5 you may have available.

=cut