1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217
|
package Unicode::CaseFold;
# ABSTRACT: Unicode case-folding for case-insensitive lookups.
BEGIN {
our $VERSION = '1.01'; # VERSION
}
our $AUTHORITY = 'cpan:ARODLAND'; # AUTHORITY
use strict;
use warnings;
use 5.008001;
use Scalar::Util 1.11 ();
require Exporter;
our @ISA = qw(Exporter);
our @EXPORT_OK = qw(case_fold);
our @EXPORT = qw(fc);
our $SIMPLE_FOLDING;
our $XS = 0;
BEGIN {
unless ($ENV{PERL_UNICODE_CASEFOLD_PP}) {
local $@;
eval {
our $VERSION;
require XSLoader;
XSLoader::load(__PACKAGE__,
exists $Unicode::CaseFold::{VERSION}
? ${ $Unicode::CaseFold::{VERSION} }
: ()
);
};
die $@ if $@ && $@ !~ /object version|loadable object/;
$XS = 1 unless $@;
$SIMPLE_FOLDING = 0 unless $@;
}
if (!$XS) {
require Unicode::CaseFoldPP;
}
}
sub fc {
@_ = ($_) unless @_;
goto &case_fold;
}
BEGIN {
# Perl 5.10+ supports the (_) prototype which does the $_-defaulting for us,
# and handles "lexical $_". Older perl doesn't, but we can fake it fairly
# closely with a (;$) prototype. Older perl didn't have lexical $_ anyway.
if ($^V ge v5.10.0) {
Scalar::Util::set_prototype(\&fc, '_');
} else {
Scalar::Util::set_prototype(\&fc, ';$');
}
}
1;
__END__
=pod
=encoding UTF-8
=head1 NAME
Unicode::CaseFold - Unicode case-folding for case-insensitive lookups.
=head1 VERSION
version 1.01
=head1 SYNOPSIS
use Unicode::CaseFold;
my $folded = fc $string;
=head2 What is Case-Folding?
In non-Unicode contexts, a common idiom to compare two strings
case-insensitively is C<lc($this) eq lc($that)>. Before comparing two strings
we I<normalize> them to an all-lowercase version. C<"Hello">, C<"HELLO">, and
C<"HeLlO"> all have the same lowercase form (C<"hello">), so it doesn't matter
which one we start with; they are all equal to one another after C<lc>.
In Unicode, things aren't so simple. A Unicode character might have mappings
for I<uppercase>, I<lowercase>, and I<titlecase>, and the lowercase mapping of
the uppercase mapping of a given character might not be the character that you
started with! For example C<< lc(uc("\N{LATIN SMALL LETTER SHARP S")) >> is
C<"ss">, not the eszett we started off with! Case-folding is a part of the
Unicode standard that allows any two strings that differ from one another only
by case to map to the same "case-folded" form, even when those strings include
characters with complex case-mappings.
=head2 Use for Case-insensitive Comparison
Simply write C<fc($this) eq fc($that)> instead of C<lc($this) eq lc($that)>.
You can also use C<index> on case-folded strings for substring search.
=head2 Use for String Lookups
Frequently we want to store data in a hash, or a database, or an external file
for later retrieval. Sometimes we want to be able to match the keys in this
data case-insensitively -- that is, we should be able to store some data under
the key "hello" and later retrieve it with the key "HELLO". Some databases
have complete support for collation, but in other databases the support is
missing or broken, and Perl hashes don't support it at all. By making
case-folding part of the process you use to normalize your keys before using
them to access a database or data structure, you get case-insensitive lookup.
$roles{fc "Samuel L. Jackson"} = ["Gin Rummy", "Nick Fury", "Mace Windu"];
$roles = $roles{fc "Samuel l. JACKSON"}; # Gets the data.
=head1 DESCRIPTION
This module provides Unicode case-folding for Perl. Case-folding is a tool
that allows a program to make case-insensitive string comparisons or do
case-insensitive lookups.
=head1 EXPORTS
=head2 fc($str)
Exported by default when you use the module. C<use Unicode::CaseFold ()> or
C<use Unicode::CaseFold qw(case_fold !fc)> if you don't want it to be
exported.
Returns the case-folded version of C<$str>. This function is prototyped to act
as much as possible like the built-ins C<lc> and C<uc>; it imposes a scalar
context on its argument, and if called with no argument it will return the
case-folded version of C<$_>.
=head2 case_fold($str)
Exported on request. Just like C<fc>, except that it has no prototype and
won't case-fold C<$_> if called without an argument.
=head1 VARIABLES
=head2 $Unicode::CaseFold::XS
Whether the XS extension is in use. The pure-perl implementation is 5-10 times
slower than the XS extension, and on versions of perl before 5.10.0 it will
use simple case-folding instead of full case-folding (see below).
=head2 $Unicode::CaseFold::SIMPLE_FOLDING
Is set to true if the perl version is prior to 5.10.0 and the XS extension is
not available. In this case, C<fc> will perform a simple case-folding instead
of a full case-folding. Although relatively few characters are affected,
strings case-folded using simple folding might not compare equal to the
corresponding strings case-folded with full folding, which may cause
compatibility issues.
Furthermore, when simple folding is in use, some strings
that would have case-folded to the same value when using full folding will
instead case-fold to different values. For example, C<fc("Wei\x{df}")> and
C<fc("Weiss")> both produce C<"weiss"> when full folding is in effect, but
the former produces C<"wei\x{df}"> when using simple folding.
If you want to check for this potentially dangerous situation, consult the
C<$Unicode::CaseFold::SIMPLE_FOLDING> variable.
=head1 COMPATIBILITY
=over 4
=item *
C<Unicode::CaseFold> requires Perl 5.8.1 or newer.
=item *
Different versions of perl include different versions of the Unicode database,
which is revised over time. If you are likely to be comparing strings that
have been folded using different versions of perl, you may need to consult the
changes for intervening Unicode standard versions to find out whether your
code will work correctly.
=item *
C<Unicode::CaseFold> uses "simple" rather than "full" case-folding when
operating in Pure-perl mode on perl versions previous to 5.10.0. For
compatibility implications, see L</$Unicode::CaseFold::SIMPLE_FOLDING>.
=back
=head1 SEE ALSO
=over 4
=item *
L<http://unicode.org/reports/tr21/tr21-5.html>: Unicode Standard Annex #21: Case Mappings
=back
=head1 AUTHOR
Andrew Rodland <arodland@cpan.org>
=head1 COPYRIGHT AND LICENSE
This software is copyright (c) 2017 by Andrew Rodland.
This is free software; you can redistribute it and/or modify it under
the same terms as the Perl 5 programming language system itself.
=cut
|