# $File: //member/autrijus/Locale-Maketext-Fuzzy/lib/Locale/Maketext/Fuzzy.pm $ $Author: autrijus $
# $Revision: #4 $ $Change: 1124 $ $DateTime: 2002/10/01 07:37:28 $

package Locale::Maketext::Fuzzy;
$Locale::Maketext::Fuzzy::VERSION = '0.02';

use strict;
use Locale::Maketext;
use base 'Locale::Maketext';

=head1 NAME

Locale::Maketext::Fuzzy - Maketext from already interpolated strings

=head1 VERSION

This document describes version 0.02 of Locale::Maketext::Fuzzy.

=head1 SYNOPSIS

    package MyApp::L10N;
    use base 'Locale::Maketext::Fuzzy'; # instead of Locale::Maketext

    package MyApp::L10N::de;
    use base 'MyApp::L10N';
    our %Lexicon = (
	# Exact match should always be preferred if possible
	"0 camels were released."
	    => "Exact match",

	# Fuzzy match candidate
	"[quant,_1,camel was,camels were] released."
	    => "[quant,_1,Kamel wurde,Kamele wurden] freigegeben.",

	# This could also match fuzzily, but is less preferred
	"[_2] released[_1]"
	    => "[_1][_2] ist frei[_1]",
    );

    package main;
    my $lh = MyApp::L10N->get_handle('de');

    # All ->maketext calls below will become ->maketext_fuzzy instead
    $lh->override_maketext(1);

    # This prints "Exact match"
    print $lh->maketext('0 camels were released.');

    # "1 Kamel wurde freigegeben." -- quant() gets 1
    print $lh->maketext('1 camel was released.');

    # "2 Kamele wurden freigegeben." -- quant() gets 2
    print $lh->maketext('2 camels were released.');

    # "3 Kamele wurden freigegeben." -- parameters are ignored
    print $lh->maketext('3 released.');

    # "4 Kamele wurden freigegeben." -- normal usage
    print $lh->maketext('[*,_1,camel was,camels were] released.', 4);

    # "!Perl ist frei!" -- matches the broader one
    # Note that the sequence ([_2] before [_1]) is preserved
    print $lh->maketext('Perl released!');

=head1 DESCRIPTION

This module is a subclass of C<Locale::Maketext>, with additional
support for localizing messages that already contains interpolated
variables.  This is most useful when the messages are returned by
external modules -- for example, to match C<dir: command not found>
against C<[_1]: command not found>.

Of course, this module is also useful if you're simply too lazy
to use the

    $lh->maketext("[quant,_1,file,files] deleted.", $count);

syntax, but wish to write

    $lh->maketext_fuzzy("$count files deleted");

instead, and have the correct plural form figured out automatically.

If C<maketext_fuzzy> seems too long to type for you, this module
also provides a C<override_maketext> method to turn I<all> C<maketext>
calls into C<maketext_fuzzy> calls.

=head1 METHODS

=head2 $lh->maketext_fuzzy(I<key>[, I<parameters...>]);

That method takes exactly the same arguments as the C<maketext> method
of C<Locale::Maketext>.

If I<key> is found in lexicons, it is applied in the same way as
C<maketext>.  Otherwise, it looks at all lexicon entries that could
possibly yield I<key>, by turning C<[...]> sequences into C<(.*?)> and
match the resulting regular expression against I<key>.

Once it finds all candidate entries, the longest one replaces the
I<key> for the real C<maketext> call.  Variables matched by its bracket
sequences (C<$1>, C<$2>...) are placed before I<parameters>; the order
of variables in the matched entry are correctly preserved.

For example, if the matched entry in C<%Lexicon> is C<Test [_1]>,
this call:

    $fh->maketext_fuzzy("Test string", "param");

is equivalent to this:

    $fh->maketext("Test [_1]", "string", "param");

However, most of the time you won't need to supply I<parameters> to
a C<maketext_fuzzy> call, since all parameters are already interpolated
into the string.

=head2 $lh->override_maketext([I<flag>]);

If I<flag> is true, this accessor method turns C<$lh-E<gt>maketext>
into an alias for C<$lh-E<gt>maketext_fuzzy>, so all consecutive
C<maketext> calls in the C<$lh>'s packages are automatically fuzzy.
A false I<flag> restores the original behaviour.  If the flag is not
specified, returns the current status of override; the default is
0 (no overriding).

Note that this call only modifies the symbol table of the I<language
class> that C<$lh> belongs to, so other languages are not affected.
If you want to override all language handles in a certain application,
try this:

    MyApp::L10N->override_maketext(1);

=head1 CAVEATS

=over 4

=item *

The "longer is better" heuristic to determine the best match is
reasonably good, but could certainly be improved.

=item *

Currently, C<"[quant,_1,file] deleted"> won't match C<"3 files deleted">;
you'll have to write C<"[quant,_1,file,files] deleted"> instead, or
simply use C<"[_1] file deleted"> as the lexicon key and put the correct
plural form handling into the corresponding value.

=item *

When used in combination with C<Locale::Maketext::Lexicon>'s C<Tie>
backend, all keys would be iterated over each time a fuzzy match is
performed, and may cause serious speed penalty.  Patches welcome.

=back

=cut

sub override_maketext {
    my ($class, $flag) = @_;
    $class = ref($class) if ref($class);

    no strict 'refs';

    if ($flag) {
	*{"$class\::maketext"} = \&maketext_fuzzy;
    }
    elsif (@_ >= 2) {
	delete ${"$class\::"}{maketext};
    }

    return (defined &{"$class\::maketext"} ? 1 : 0);
}

# Global cache of entries and their regexified forms
my %regex_cache;

sub maketext_fuzzy {
    my ($handle, $phrase) = splice(@_, 0, 2);

    # An array of all lexicon hashrefs
    my @lexicons = @{$handle->_lex_refs};

    # Try exact match if possible at all.
    foreach my $lex (@lexicons) {
	return $handle->SUPER::maketext($phrase, @_)
	    if exists $lex->{$phrase};
    }

    # Keys are matched entries; values are arrayrefs of extracted params
    my %candidate;

    # Fuzzy match phase 1 -- extract all candidates
    foreach my $lex (@lexicons) {
	# We're not interested in non-bracketed entries, so ignore them
	foreach my $entry (grep /(?:(?<!~)(?:~~)*)\[/, keys %{$lex}) {
	    my $re = ($regex_cache{$entry} ||= [ _regexify($entry) ]);
	    my @vars = ($phrase =~ $re->[0]) or next;
	    $candidate{$entry} ||= (
		@{$re->[1]} ? [ @vars[@{$re->[1]}] ] : \@vars
	    );
	}
    }

    # Fail early if we cannot find anything that matches
    return $phrase unless %candidate;

    # Fuzzy match phase 2 -- select the best candidate
    $phrase = (sort {
	# For now, we just use a very crude heuristic: "Longer is better"
	length($b) <=> length($a)
	    or $b cmp $a
    } keys %candidate)[0];

    return $handle->SUPER::maketext(
	$phrase, @{$candidate{$phrase}}, @_
    );
}

sub _regexify {
    my $text = quotemeta(shift);
    my @ords;

    $text =~ s{
	(				# capture into $1...
	    (?<!\\~)(?:\\~\\~)*		#   an even number of ~ characters
	)				#   (to be restored back)
	\\\[				# opening bracket

	(				# capture into $2...
	    (?:				#   any numbers of
		[^~\]]			#     ordinary non-] characters
		    |			#       or
		~\\?.			#     escaped characters
	    )*
	)
	\\\]				# closing bracket
    }{
	$1._paramify($2, \@ords)
    }egx;

    $text =~ s/\Q.*?\E$/.*/;
    return qr/^$text$/, \@ords;
}

sub _paramify {
    my ($text, $ordref) = @_;
    my $out = '(.*?)';
    my @choices = split(/\\,/, $text);

    if ($choices[0] =~ /^(?:\w+|\\#|\\\*)$/) {
	# Do away with the function name
	shift @choices unless $choices[0] =~ /^_(?:\d+|\\\*)$/;

	# Build an alternate regex to weed out vars
	$out .= '(?:' . join(
	    '|', sort {
		length($b) <=> length($a)	# longest first
	    } map {
		/^_(?:(\d+)|\\\*)$/ ? do {
		    push @{$ordref}, ($1 - 1) if defined $1;
		    '';
		} : $_	# turn _1, _2, _*... into ''
	    } @choices
	) . ')';

	$out =~ s/\Q(?:)\E$//;
    }

    return $out;
}

1;

=head1 SEE ALSO

L<Locale::Maketext>, L<Locale::Maketext::Lexicon>

=head1 BACKGROUND

This particular module was written to facilitate an I<auto-extraction>
layer for Slashcode's I<Template Toolkit> provider, based on
C<HTML::Parser> and C<Template::Parser>.  It would work like this:

    Input | <B>from the [% story.dept %] dept.</B>
    Output| <B>[%|loc( story.dept )%]from the [_1] dept.[%END%]</B>

Now, this layer suffers from the same linguistic problems as an
ordinary C<Msgcat> or C<Gettext> framework does -- what if we want
to make ordinates from C<[% story.dept %]> (i.e. C<from the 3rd dept.>),
or expand the C<dept.> to C<department> / C<departments>?

The same problem occurred in RT's web interface, where it had to
localize messages returned by external modules, which may already
contain interpolated variables, e.g. C<"Successfully deleted 7
ticket(s) in 'c:\temp'.">.

Since I didn't have the time to refactor C<DBI> and C<DBI::SearchBuilder>,
I devised a C<loc_match> method to pre-process their messages into one
of the I<candidate strings>, then applied the matched string to C<maketext>.

Afterwards, I realized that instead of preparing a set of candidate
strings, I could actually use the original I<lexicon file> (i.e. PO files
via C<Locale::Maketext::Lexicon>) to match against.  This is how
C<Locale::Maketext::Fuzzy> was born.

=head1 AUTHORS

Autrijus Tang E<lt>autrijus@autrijus.orgE<gt>

=head1 COPYRIGHT

Copyright 2002 by Autrijus Tang E<lt>autrijus@autrijus.orgE<gt>.

This program is free software; you can redistribute it and/or 
modify it under the same terms as Perl itself.

See L<http://www.perl.com/perl/misc/Artistic.html>

=cut
