package Lingua::StopWords;
use strict;
use warnings;
require Exporter;
our @ISA = qw(Exporter);
our %EXPORT_TAGS = ( 'all' => [ qw( getStopWords ) ] );
our @EXPORT_OK = ( @{ $EXPORT_TAGS{'all'} } );
our $VERSION = 0.12;
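sub getStopWords {
    my ( $language, $encoding ) = @_;
    return undef unless $language;

    # Language modules are named by upper-cased ISO code, e.g. Lingua::StopWords::EN.
    $language = uc($language);
    eval { require "Lingua/StopWords/$language.pm"; };
    return undef if $@;    # no module for this language code

    # Pass the encoding through only if the caller supplied one.
    my @args = $encoding ? ($encoding) : ();
    no strict 'refs';
    return &{ "Lingua::StopWords::$language\::getStopWords" }(@args);
}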
1;
__END__
=head1 NAME
Lingua::StopWords - Stop words for several languages.
=head1 SYNOPSIS
    use Lingua::StopWords qw( getStopWords );
    my $stopwords = getStopWords('en');
    my @words = qw( i am the walrus goo goo g'joob );
    # prints "walrus goo goo g'joob"
    print join ' ', grep { !$stopwords->{$_} } @words;
=head1 DESCRIPTION
In keyword search, it is common practice to suppress a collection of
"stopwords": words such as "the", "and", "maybe", etc. which occur in a
large number of documents and tell you little about any document that
contains them. This module provides such "stoplists" in several languages.
=head2 Supported Languages
    |-----------------------------------------------------------|
    | Language   | ISO code | default encoding | also available |
    |-----------------------------------------------------------|
    | Danish     | da       | ISO-8859-1       | UTF-8          |
    | Dutch      | nl       | ISO-8859-1       | UTF-8          |
    | English    | en       | ISO-8859-1       | UTF-8          |
    | Finnish    | fi       | ISO-8859-1       | UTF-8          |
    | French     | fr       | ISO-8859-1       | UTF-8          |
    | German     | de       | ISO-8859-1       | UTF-8          |
    | Hungarian  | hu       | ISO-8859-2       | UTF-8          |
    | Indonesian | id       | ISO-8859-1       | UTF-8          |
    | Italian    | it       | ISO-8859-1       | UTF-8          |
    | Norwegian  | no       | ISO-8859-1       | UTF-8          |
    | Portuguese | pt       | ISO-8859-1       | UTF-8          |
    | Romanian   | ro       | ISO-8859-2       | UTF-8          |
    | Spanish    | es       | ISO-8859-1       | UTF-8          |
    | Swedish    | sv       | ISO-8859-1       | UTF-8          |
    | Russian    | ru       | KOI8-R           | UTF-8          |
    |-----------------------------------------------------------|
=head1 FUNCTIONS
=head2 getStopWords
    my $stoplist = getStopWords('en');
    my $utf8_stoplist = getStopWords('en', 'UTF-8');

Retrieve a stoplist in the form of a hashref where the keys are all
stopwords and the values are all 1.

    $stoplist = {
        and => 1,
        if => 1,
        # ...
    };

getStopWords() expects one or two arguments. The first, which is required,
is the ISO code of a supported language; it is matched case-insensitively.
If no stoplist can be found for that code, getStopWords() returns undef.

The second, optional argument should be 'UTF-8' if you want the stopwords
encoded in UTF-8; without it, they are returned in the language's default
encoding. With 'UTF-8', the UTF-8 flag will be turned on, so make sure you
understand all the implications of that.
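
Because getStopWords() returns undef for an unsupported language, callers
may want to check the result before dereferencing it. The following is a
minimal sketch, not part of this module: C<remove_stopwords> and
C<load_stoplist> are hypothetical helper names, and the three-word fallback
stoplist is an assumption made purely so the example runs even when
Lingua::StopWords is not installed.

```perl
use strict;
use warnings;

# Hypothetical helper (not provided by Lingua::StopWords): returns the
# words that are not keys of the given stoplist hashref. Words are
# lower-cased before the lookup.
sub remove_stopwords {
    my ( $stoplist, @words ) = @_;
    return grep { !$stoplist->{ lc($_) } } @words;
}

# Hypothetical helper: try the real module first. If it cannot be loaded,
# or getStopWords() returns undef for the language, fall back to a tiny
# hand-written stoplist (an assumption for illustration only).
sub load_stoplist {
    my ($language) = @_;
    my $stoplist = eval {
        require Lingua::StopWords;
        Lingua::StopWords::getStopWords($language);
    };
    return $stoplist || { map { $_ => 1 } qw( i am the ) };
}

my $stoplist = load_stoplist('en');
my @kept = remove_stopwords( $stoplist, qw( I am the walrus ) );
print join( ' ', @kept ), "\n";    # prints "walrus"
```

In real code you would more likely report the unsupported language to the
caller than substitute a fallback list silently; the fallback here only
keeps the sketch self-contained.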
=head1 INSTALLATION
To install this module, type the following:

    perl Build.PL
    ./Build
    ./Build test
    ./Build install

=head1 SEE ALSO
The stoplists supplied by this module were created as part of the Snowball
project (see L<http://snowball.tartarus.org>,
L<Lingua::Stem::Snowball|Lingua::Stem::Snowball>).
L<Lingua::EN::StopWords|Lingua::EN::StopWords> provides a different stoplist
for English.
=head1 SOURCE REPOSITORY
L<https://github.com/wollmers/Lingua-StopWords>
=head1 AUTHOR
Maintained by Helmut Wollmersdorfer E<lt>helmut@wollmersdorfer.atE<gt>
and Marvin Humphrey E<lt>marvin at rectangular dot comE<gt>.
Original author Fabien Potencier, E<lt>fabpot at cpan dot orgE<gt>.
=head1 COPYRIGHT
Copyright 2021 Helmut Wollmersdorfer
Copyright 2004-2008 Fabien Potencier, Marvin Humphrey
=head1 LICENSE
This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself, either Perl version 5.8.3 or,
at your option, any later version of Perl 5 you may have available.
=cut