File: Recode.pm

#! /bin/false

# vim: set autoindent shiftwidth=4 tabstop=4:

# Portable character conversion for Perl.
# Copyright (C) 2002-2017 Guido Flohr <guido.flohr@cantanea.com>,
# all rights reserved.

# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 3 of the License, or
# (at your option) any later version.

# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.

# You should have received a copy of the GNU General Public License
# along with this program.  If not, see <http://www.gnu.org/licenses/>.

package Locale::Recode;

use strict;

require Locale::Recode::_Conversions;

my $loaded = {};
my $has_encode;

sub new
{
    my $class = ref($_[0]) || $_[0];
    shift;
    my %args = @_;

    my $self = bless {}, $class;

    my ($from_codeset, $to_codeset) = @args{qw (from to)};
    
    unless ($from_codeset && $to_codeset) {
		require Carp;
        Carp::croak (<<EOF);
	Usage: $class->new (from => FROM_CODESET, to => TO_CODESET);
EOF
    }

    # Find a conversion path.
	my $path = Locale::Recode::_Conversions->findPath ($from_codeset, 
													   $to_codeset);
	unless ($path) {
		$self->{__error} = 'EINVAL';
		return $self;
	}

	my @conversions = ();
	foreach (@$path) {
		my ($module, $from, $to) = @$_;
		
		unless ($loaded->{$module}) {
			eval "require Locale::RecodeData::$module";
			if ($@) {
				$self->{__error} = $@;
				return $self;
			}
			
			$loaded->{$module} = 1;
		}
		
		my $module_name = "Locale::RecodeData::$module";
		my $method = 'new';
		my $object = $module_name->$method (from => $from,
											to => $to);
		
		push @conversions, $object;
	}

	$self->{__conversions} = \@conversions;
		
    return $self;
}

sub resolveAlias
{
	my ($class, $alias) = @_;

	return Locale::Recode::_Conversions->resolveAlias ($alias);
}

sub getSupported
{
	return [ Locale::Recode::_Conversions->listSupported ];
}

sub getCharsets
{
	my $self = shift;
	my %all = map { $_ => 1 } @{&getSupported};

	require Locale::Recode::_Aliases;

	my $conversions = Locale::Recode::_Conversions->listSupported;
	foreach my $charset (keys %{Locale::Recode::_Aliases::ALIASES()}) {
		my $mime_name = $self->resolveAlias ($charset);
		next unless exists $all{$mime_name};
		$all{$charset} = 1;
	}
	
	return [ keys %all ];
}

sub recode
{
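    my $self = $_[0];

    # Refuse to work if the constructor or an earlier conversion failed.
    return if $self->{__error};

    # Nothing to convert.  Note that $_[1] is used directly throughout,
    # so that the caller's variable is modified in place.
    return 1 unless defined $_[1];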

    my $chain = $self->{__conversions};
    
    foreach my $module (@$chain) {
		my $success = $module->_recode ($_[1]);
		
		unless ($success) {
			$self->{__error} = $module->_getError;
			return;
		}
    }

    return 1;
}

sub getError
{
    my $self = shift;
    my $error = $self->{__error} or return;

    if ('EINVAL' eq $error) {
		return 'Invalid conversion';
    } else {
		return $error;
    }
}

1;

__END__

=head1 NAME

Locale::Recode - Object-Oriented Portable Charset Conversion

=head1 SYNOPSIS

  use Locale::Recode;

  $cd = Locale::Recode->new (from => 'UTF-8',
                             to   => 'ISO-8859-1');

  die $cd->getError if $cd->getError;

  $cd->recode ($text) or die $cd->getError;

  $mime_name = Locale::Recode->resolveAlias ('latin-1');

  $supported = Locale::Recode->getSupported;

  $complete = Locale::Recode->getCharsets;

=head1 DESCRIPTION

This module provides routines that convert textual data from one
codeset to another in a portable way.  The module was started
before Encode(3) was written.  Its main purpose today is to provide
charset conversion even when Encode(3) is not available on the system.
It should also work for older Perl versions without Unicode support.

Internally Locale::Recode(3) will use Encode(3) whenever possible,
to allow for a faster conversion and for a wider range of supported
charsets, and will only fall back to the Perl implementation when
Encode(3) is not available or does not support a particular charset
that Locale::Recode(3) does.
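
Whether the faster path can be taken at all depends on whether
Encode(3) can be loaded.  The following is only a sketch of such a
check in a calling script; it does not claim to reproduce how
Locale::Recode(3) decides internally, and the variable name
C<$has_encode> is merely illustrative:

```perl
use strict;
use warnings;

# Try to load Encode at run time; eval keeps a missing module from
# aborting the script.
my $has_encode = eval { require Encode; 1 } ? 1 : 0;

if ($has_encode) {
    print "Encode ", Encode->VERSION, " is available\n";
} else {
    print "Encode is not available, expect pure Perl conversions\n";
}
```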

Locale::Recode(3) is part of libintl-perl, and its main purpose is
actually to implement a portable charset conversion framework for
the message translation facilities described in Locale::TextDomain(3).

=head1 CONSTRUCTOR

The constructor C<new()> requires two named arguments:

=over 4

=item B<from>

The encoding of the original data.  Case doesn't matter, aliases
are resolved.

=item B<to>

The target encoding.  Again, case doesn't matter, and aliases
are resolved.

=back

The constructor croaks with a usage message if either argument is
missing.  Otherwise it will never fail: if no conversion is available,
the object's internal state is set to bad and it will refuse to do any
conversions.  You can inquire the reason for the failure with the method
getError().
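
A minimal sketch of this behavior follows.  The charset name
C<X-NO-SUCH-CHARSET> is made up for the example, on the assumption
that no installed conversion recognizes it:

```perl
use strict;
use warnings;
use Locale::Recode;

# 'X-NO-SUCH-CHARSET' is a hypothetical name, used only to provoke
# an error state.
my $cd = Locale::Recode->new (from => 'UTF-8',
                              to   => 'X-NO-SUCH-CHARSET');

# An object is returned even though no conversion is possible ...
print "got an object of class ", ref ($cd), "\n";

# ... but it reports an error and refuses to convert anything.
if (my $error = $cd->getError) {
    print "constructor failed: $error\n";
}

my $data = 'abc';
my $ok = $cd->recode ($data);
print "recode() refused to convert\n" unless $ok;
```

Checking getError() right after construction, as in the synopsis,
catches the problem before any data is touched.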

=head1 OBJECT METHODS

The following object methods are available.

=over 4

=item B<recode (STRING)>

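Converts B<STRING> from the source encoding into the destination
encoding.  The conversion is done in place, so pass a variable rather
than a literal.  In case of success, a true value is returned, false
otherwise.  An undefined B<STRING> is left alone and counts as success.
You can inquire the reason for the failure with the method getError().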

=item B<getError>

Returns either false if the object is not in an error state or
an error message.

=back
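
As a short illustration of the two methods, the following sketch
converts a string from ISO-8859-1 to UTF-8.  The sample data is an
assumption made for the example; note that the variable itself is
changed by recode():

```perl
use strict;
use warnings;
use Locale::Recode;

my $cd = Locale::Recode->new (from => 'ISO-8859-1',
                              to   => 'UTF-8');
die $cd->getError if $cd->getError;

# "cafe" with an e-acute, encoded as one byte in ISO-8859-1.
my $text = "caf\xe9";

# recode() works on its argument in place and returns a truth value.
$cd->recode ($text) or die $cd->getError;

# $text now holds the converted data.
print "conversion succeeded\n";
```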

=head1 CLASS METHODS

The class provides some additional class methods:

=over 4

=item B<getSupported>

Returns a reference to a list of all supported charsets.  This
may implicitly load additional Encode(3) conversions like
Encode::HanExtra(3) which may produce considerable load on your
system.

The method is therefore not intended for regular use but rather
for getting or displaying I<once> a list of available encodings.

The members of the list are all converted to uppercase!

=item B<getCharsets>

Like getSupported() but also returns all available aliases.

=back
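
A sketch of how the two lists might be retrieved and displayed once.
The output depends on the system, so none is shown here:

```perl
use strict;
use warnings;
use Locale::Recode;

# Both methods return a reference to an array of names.
my $supported = Locale::Recode->getSupported;
my $all       = Locale::Recode->getCharsets;

printf "%d charsets, %d names including aliases\n",
       scalar @$supported, scalar @$all;

# Display the canonical names, one per line.
print "$_\n" foreach sort @$supported;
```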

=head1 SUPPORTED CHARSETS

The range of supported charsets is system-dependent.  The following
somewhat special charsets are always available:

=over 4

=item B<UTF-8>

UTF-8 is available independently of your Perl version.  For Perl 5.6
or better or in the presence of Encode(3), conversions are not done
in Perl but with the interfaces provided by these facilities which
are written in C, hence much faster.

Encoding data I<into> UTF-8 is fast, even if it is done in Perl.
Decoding it in Perl may become quite slow.  If you frequently have
to decode UTF-8 with B<Locale::Recode> you will probably want to
make sure that you do that with Perl 5.6 or better, or install Encode(3)
to speed things up.

=item B<INTERNAL>

UTF-8 is fast to write but hard to read for applications.  It is 
therefore not the worst for internal string representation but not
far from that.  Locale::Recode(3) stores strings internally as a
reference to an array of integer values like most programming languages
(Perl is an exception) do, trading memory for performance.

The integer values are the UCS-4 codes of the characters in host
byte order.

The encoding B<INTERNAL> is directly available via Locale::Recode(3)
but of course you should not really use it for data exchange, unless
you know what you are doing.

=back
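
The idea behind the B<INTERNAL> representation can be illustrated in
plain Perl.  The following sketch does not use Locale::Recode(3) at
all and is not taken from its implementation.  It relies on the fact
that every byte value in ISO-8859-1 equals the code point of the
character it encodes:

```perl
use strict;
use warnings;

# "naive" with an i-diaeresis, encoded in ISO-8859-1.
my $latin1 = "na\xefve";

# "Decode": one integer per character, stored behind an array reference.
my $internal = [ unpack 'C*', $latin1 ];

my $dump = join ' ', map { sprintf 'U+%04X', $_ } @$internal;
print "$dump\n";
# prints "U+006E U+0061 U+00EF U+0076 U+0065"

# "Encode" back to ISO-8859-1.  Code points above 0xFF have no
# representation there and are simply dropped in this sketch.
my $round_trip = pack 'C*', grep { $_ <= 0xff } @$internal;
print "round trip ok\n" if $round_trip eq $latin1;
```

Indexing into such an array gives the n-th character in constant
time, which is the performance that the representation buys with
its higher memory use.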

Locale::Recode(3) has native support for a plethora of other encodings,
most of them 8 bit encodings that are fast to decode, including most
encodings used on popular micros like the ISO-8859-* series of encodings,
most Windows-* encodings (also known as CP*), Macintosh, Atari, etc.

=head1 NAMES AND ALIASES

Each charset or encoding is available internally under a unique
name.  Whenever the information was available, the preferred MIME name
(see L<http://www.iana.org/assignments/character-sets/>) was chosen as
the internal name.

Alias handling is quite strict.  The module does not make wild guesses
at what you mean ("What's the meaning of the acronym JIS" is a valid
alias for "7bit-jis" in Encode(3)) but aims at providing common
aliases only.  The same applies to so-called aliases that are really
mistakes, like "utf8" for UTF-8.

The module knows all aliases that are listed with the IANA character
set registry (L<http://www.iana.org/assignments/character-sets/>), plus
those known to libiconv version 1.8, and a bunch of additional ones.
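
To find out how a particular name is treated on your system, you can
ask resolveAlias().  The names tried in this sketch are arbitrary
examples, and since the result depends on the installation no output
is shown:

```perl
use strict;
use warnings;
use Locale::Recode;

# 'no-such-alias' is a made-up name that should not resolve.
foreach my $alias (qw (latin1 L1 utf8 no-such-alias)) {
    my $name = Locale::Recode->resolveAlias ($alias);
    printf "%-14s => %s\n", $alias,
           defined $name && length $name ? $name : '(not resolved)';
}
```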

=head1 CONVERSION TABLES

The conversion tables have been taken either from official sources
like the IANA or the Unicode Consortium, from Bruno Haible's libiconv,
or from the sources of the GNU libc.  The regression tests for
libintl-perl check for conformance with these tables.  For some encodings
this data differs from Encode(3)'s data, which would cause these tests to
fail.  In these cases, the module will not invoke the Encode(3) methods,
but will fall back to the internal implementation for the sake of
consistency.

The few encodings that are affected are so simple that you will not
experience any real performance penalty unless you convert large chunks
of data.  But the package is not really intended for such use anyway, and
since Encode(3) is relatively new, I rather think that the differences
are bugs in Encode which will be fixed soon.

=head1 BUGS

The module should provide fallback conversions for other Unicode
encoding schemes like UCS-2 and UCS-4 (big- and little-endian).

The pure Perl UTF-8 decoder will not always handle corrupt UTF-8
correctly, especially at the end and at the beginning of the string.
This is not likely to be fixed, since the module's intention is not
to be a consistency checker for UTF-8 data.
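
If your input may be corrupt, you can validate it yourself before
handing it to this module.  The following sketch assumes that
Encode(3) is installed; the helper name C<is_valid_utf8> is
hypothetical and not part of Locale::Recode(3):

```perl
use strict;
use warnings;
use Encode ();

# Returns 1 if the octets are well-formed UTF-8, 0 otherwise.
sub is_valid_utf8
{
    my ($octets) = @_;

    # FB_CROAK makes decode() die on malformed input, LEAVE_SRC keeps
    # it from modifying the string being checked.
    my $ok = eval {
        Encode::decode ('UTF-8', $octets,
                        Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };

    return $ok ? 1 : 0;
}

print is_valid_utf8 ("caf\xc3\xa9") ? "valid\n" : "corrupt\n";
print is_valid_utf8 ("caf\xe9")     ? "valid\n" : "corrupt\n";
```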

=head1 AUTHOR

Copyright (C) 2002-2017 L<Guido Flohr|http://www.guido-flohr.net/>
(L<mailto:guido.flohr@cantanea.com>), all rights reserved.  See the source
code for details!

=head1 SEE ALSO

Encode(3), iconv(3), iconv(1), recode(1), perl(1)

=cut
Local Variables:
mode: perl
perl-indent-level: 4
perl-continued-statement-offset: 4
perl-continued-brace-offset: 0
perl-brace-offset: -4
perl-brace-imaginary-offset: 0
perl-label-offset: -4
cperl-indent-level: 4
cperl-continued-statement-offset: 2
tab-width: 4
End: