File: UTF8.pm

# $Id: UTF8.pm,v 1.21 2002/02/27 09:26:48 m-kasahr Exp $
#
# Copyright (c) 2000 Japan Network Information Center.  All rights reserved.
#  
# By using this file, you agree to the terms and conditions set forth bellow.
# 
#                      LICENSE TERMS AND CONDITIONS 
# 
# The following License Terms and Conditions apply, unless a different
# license is obtained from Japan Network Information Center ("JPNIC"),
# a Japanese association, Fuundo Bldg., 1-2 Kanda Ogawamachi, Chiyoda-ku,
# Tokyo, Japan.
# 
# 1. Use, Modification and Redistribution (including distribution of any
#    modified or derived work) in source and/or binary forms is permitted
#    under this License Terms and Conditions.
# 
# 2. Redistribution of source code must retain the copyright notices as they
#    appear in each source code file, this License Terms and Conditions.
# 
# 3. Redistribution in binary form must reproduce the Copyright Notice,
#    this License Terms and Conditions, in the documentation and/or other
#    materials provided with the distribution.  For the purposes of binary
#    distribution the "Copyright Notice" refers to the following language:
#    "Copyright (c) Japan Network Information Center.  All rights reserved."
# 
# 4. Neither the name of JPNIC may be used to endorse or promote products
#    derived from this Software without specific prior written approval of
#    JPNIC.
# 
# 5. Disclaimer/Limitation of Liability: THIS SOFTWARE IS PROVIDED BY JPNIC
#    "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
#    LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
#    PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL JPNIC BE LIABLE
#    FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
#    CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
#    SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
#    BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
#    WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
#    OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
#    ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
# 
# 6. Indemnification by Licensee
#    Any person or entities using and/or redistributing this Software under
#    this License Terms and Conditions shall defend indemnify and hold
#    harmless JPNIC from and against any and all judgements damages,
#    expenses, settlement liabilities, cost and other liabilities of any
#    kind as a result of use and redistribution of this Software or any
#    claim, suite, action, litigation or proceeding by any third party
#    arising out of or relates to this License Terms and Conditions.
# 
# 7. Governing Law, Jurisdiction and Venue
#    This License Terms and Conditions shall be governed by and and
#    construed in accordance with the law of Japan. Any person or entities
#    using and/or redistributing this Software under this License Terms and
#    Conditions hereby agrees and consent to the personal and exclusive
#    jurisdiction and venue of Tokyo District Court of Japan.
#
package MDN::UTF8;

use strict;
use vars qw($VERSION @ISA @EXPORT @EXPORT_OK);

require Exporter;
require DynaLoader;

@ISA = qw(Exporter DynaLoader);
# Items to export into callers namespace by default. Note: do not export
# names by default without a very good reason. Use EXPORT_OK instead.
# Do not simply export all your public functions/methods/constants.
@EXPORT = qw();
$VERSION = '2.4';

bootstrap MDN::UTF8 $VERSION;

# Preloaded methods go here.

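# Return the length in bytes of the first UTF-8 character of $string,
# or 0 if getwc() returns an empty list (invalid or empty input).
sub mblen {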
    my ($package_name, $string) = @_;
    my ($wc, $length);

    if (($wc, $length) = $package_name->getwc($string)) {
	return $length;
    }
    return 0;
}

# Autoload methods go after =cut, and are processed by the autosplit program.

1;
__END__

=head1 NAME

MDN::UTF8 - Perl extension for the libmdn utf8 module

=head1 SYNOPSIS

  use MDN::UTF8;
  $length = MDN::UTF8->mblen($utf8_string);
  @ucs4_characters = MDN::UTF8->unpack($utf8_string);
  $utf8_string = MDN::UTF8->pack(@ucs4_characters);
  die if (!MDN::UTF8->isvalid($utf8_string));

=head1 DESCRIPTION

C<MDN::UTF8> provides a Perl interface to the UTF-8 utility
module of the MDN library (a C library for handling
multilingual domain names) in the mDNkit.

=head1 CLASS METHODS

Although this module does not provide an object interface,
all of its functions should be called as class methods,
for consistency with the other modules in C<MDN::>.

	MDN::UTF8->mblen($string);	# OK
	MDN::UTF8::mblen($string);	# NG

=over 4

=item mblen($utf8_string)

Returns the length (in bytes) of the first character of C<$utf8_string>.
If the character is not a valid UTF-8 character, this method returns 0.

=item getwc($utf8_string)

Inspects the first character of C<$utf8_string>, and returns the
result as a list with two elements.
The first element of the list is the integer code value of the character
in UCS-4, and the second is the length (in bytes) of the
character in UTF-8.

	($wc, $length) = MDN::UTF8->getwc($string);

The value of the second element is the same as the one returned by
C<mblen()>.
If the character is not a valid UTF-8 character, this method returns
an empty list.
Note that it also returns an empty list for an empty UTF-8 string.

=item unpack($utf8_string)

Unpacks C<$utf8_string> into a list of UCS-4 characters, and
returns their integer code values as a list.
An empty list is returned if C<$utf8_string> contains an invalid
character or C<$utf8_string> is empty.

=item pack(@ucs4_characters)

Packs a list of UCS-4 characters into a UTF-8 string, and returns
the string.  This is the reverse of the C<unpack> method above.
If C<@ucs4_characters> contains an invalid UCS-4 character, it returns
C<undef>.

=item isvalid($utf8_string)

Checks if C<$utf8_string> is a valid UTF-8 encoded string.
Returns 1 if it is valid, 0 otherwise.

=back
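
The methods above can be combined as in the following sketch.
It assumes that the C<MDN::UTF8> extension has been built and installed;
the sample string and the variable names are illustrative only.

  use strict;
  use MDN::UTF8;

  # Illustrative input: U+304B followed by "a", written as raw UTF-8 bytes.
  my $utf8_string = "\xe3\x81\x8b" . "a";

  # Walk the string one character at a time.
  # getwc() returns an empty list for an invalid character.
  my $rest = $utf8_string;
  while (length $rest) {
      my ($wc, $length) = MDN::UTF8->getwc($rest)
          or die "invalid UTF-8 sequence\n";
      printf "U+%04X occupies %d byte(s)\n", $wc, $length;
      $rest = substr($rest, $length);
  }

  # Round trip through a list of UCS-4 code values.
  die "not valid UTF-8\n" unless MDN::UTF8->isvalid($utf8_string);
  my @ucs4_characters = MDN::UTF8->unpack($utf8_string);
  my $repacked = MDN::UTF8->pack(@ucs4_characters);
  die "pack failed\n" unless defined $repacked;

Checking C<isvalid> first matters here because C<unpack> returns an
empty list both for invalid input and for an empty string, so its
return value alone cannot tell the two cases apart.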

=head1 ISSUE OF HANDLING UNICODE CHARACTERS

Beginning with version 5.6, Perl supports Unicode characters, but the
implementation is incomplete and highly experimental.

Perl provides two semantics: `character' and `byte'.
In the character semantics, a Unicode character is treated as a single
character even if it occupies two or more bytes.
In the byte semantics, a Unicode character is treated as a sequence
of bytes.

Some Perl operators change their behavior according to the semantics
in effect, and Perl decides whether an operator uses the character
or byte semantics based on whether its input is byte or character
data.
For example, a string literal which contains C<\x{304B}> (Unicode
character U+304B) is recognized as character data.

The MDN modules also deal with UTF-8 data.
If you have no special reason to use the character semantics, or
you are not familiar with them, we recommend that you
use the C<bytes> pragma:

  use bytes;

That forces the byte semantics for the rest of the enclosing file or
block, since the pragma is lexically scoped.
See L<perlunicode> and L<perlbytes> for more details about this issue.
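
As a small illustration using only core Perl (the variable name is
arbitrary), C<length> reports a different value for the same character
data depending on which semantics is in effect:

  use strict;

  my $ka = "\x{304B}";            # character data: one Unicode character

  print length($ka), "\n";        # character semantics: counts characters
  {
      use bytes;                  # byte semantics inside this block only
      print length($ka), "\n";    # counts the bytes of the UTF-8 form
  }

The first call counts one character; inside the block, where
C<use bytes> is in effect, the same call counts the three bytes that
U+304B occupies in UTF-8.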

=head1 SEE ALSO

MDN library specification, L<perlunicode>, L<perlbytes>

=cut