Unicode Conversion Module for Ruby
version 0.6.1
Yoshida Masato
- Introduction
This module converts between ISO/IEC 10646 (Unicode) strings
and Japanese-encoded strings.
Supported character encodings are UCS-4, UTF-16, UTF-8,
EUC-JP, and CP932 (a variant of Shift_JIS for Japanese Windows).
This module cannot detect character encodings automatically.
Note that the EUC-JP conversion table has been changed.
- Install
This module works with ruby-1.6. Using ruby-1.6.7 or later
is recommended.
Extract this package.
gzip -dc < uconv-0.2.tar.gz | tar xvf -
cd uconv
If you do not need EUC-JP or CP932 conversion, you can
undefine USE_EUC or USE_SJIS in extconf.rb to reduce the size of this
module. On Windows, you can define USE_WIN32API in extconf.rb
to use the Win32 encoding conversion API.
Then build and install as usual.
For example, when Ruby supports dynamic linking on your OS,
ruby extconf.rb
make
make install
or using gem,
gem build uconv.gemspec
gem install uconv -- --enable-compat-win32api
-- Options of extconf.rb
* --enable-euc [default]
--disable-euc
* --enable-sjis [default]
--disable-sjis
* --enable-win32api
--disable-win32api [default]
* --enable-fullwidth-reverse-solidus [default]
--disable-fullwidth-reverse-solidus
* --enable-compat-win32api [default]
--disable-compat-win32api
* --enable-thread-local [default]
--disable-thread-local
* --enable-utf-32 [default]
--disable-utf-32
- Usage
If you do not link this module with Ruby statically,
require "uconv"
before using.
- Module Function
UTF-16 and UCS-4 strings must be little-endian. Use
u16swap (u2swap) and u4swap to convert strings in the other byte order.
The functions that formerly handled UCS-2 can now handle UTF-16.
All ZERO WIDTH NO-BREAK SPACE (U+FEFF) are regarded as
BYTE ORDER MARK (BOM) and deleted in some functions.
The function matrix is as follows.
         |                       dest
         | EUC-JP     CP932      UTF-8      UTF-16     UCS-4
---------+----------------------------------------------------
  EUC-JP | \          -          euctou8    euctou16   -
s CP932  | -          \          sjistou8   sjistou16  -
r UTF-8  | u8toeuc    u8tosjis   \          u8tou16    u8tou4
c UTF-16 | u16toeuc   u16tosjis  u16tou8    u16swap    u16tou4
  UCS-4  | -          -          u4tou8     u4tou16    u4swap
utf16 = Uconv.u16swap(utf16)
ucs2 = Uconv.u2swap(ucs2)
utf16 = Uconv.u16swap!(utf16)
ucs2 = Uconv.u2swap!(ucs2)
Byte-swaps a UTF-16 string. A little-endian string is
converted into a big-endian string, and vice versa.
Bang functions change the parameter string directly.
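The byte swap can be illustrated without the extension. The following is
a plain-Ruby sketch, not the uconv implementation; swap16 is a
hypothetical helper name used only for illustration.

```ruby
# Plain-Ruby sketch of a 16-bit byte swap (not the uconv implementation).
# swap16 is a hypothetical helper name.
def swap16(str)
  # "v*" reads 16-bit little-endian units, "n*" writes them big-endian.
  str.unpack("v*").pack("n*")
end

le = "A\x00B\x00".b   # "AB" as UTF-16 little-endian bytes
be = swap16(le)       # the same text as UTF-16 big-endian bytes
```

Applying the swap twice gives back the original bytes, so the same
operation serves for both directions.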
ucs4 = Uconv.u4swap(ucs4)
ucs4 = Uconv.u4swap!(ucs4)
Byte-swaps a UCS-4 string. A 1234-ordered string is
converted into a 4321-ordered string.
The bang function changes the parameter string directly.
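The "1234" to "4321" reordering can be made concrete with four marker
bytes. This is a plain-Ruby sketch under the assumption that each UCS-4
unit is exactly four bytes; swap32 is a hypothetical helper, not part of
uconv.

```ruby
# Plain-Ruby sketch of a 32-bit byte swap (not the uconv implementation).
# swap32 is a hypothetical helper name.
def swap32(str)
  # "V*" reads 32-bit little-endian units, "N*" writes them big-endian.
  str.unpack("V*").pack("N*")
end

ordered  = [1, 2, 3, 4].pack("C*")   # bytes in 1234 order
reversed = swap32(ordered)           # bytes in 4321 order
```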
utf16 = Uconv.u8tou16(utf8)
ucs2 = Uconv.u8tou2(utf8)
Converts a UTF-8 string into a UTF-16 string. An
illegal UTF-8 sequence raises an exception. A
character outside the range U-00000000 to
U-0010FFFF also raises an exception.
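To show what this conversion produces, here is a sketch that uses Ruby's
built-in String#encode (Ruby 1.9 or later) as a stand-in. It is not
uconv itself, and the exception class shown is the one raised by
String#encode, not necessarily the one uconv raises.

```ruby
# Plain-Ruby stand-in using String#encode, not uconv itself.
utf8  = "\u{20BB7}"                  # a character outside the BMP
utf16 = utf8.encode("UTF-16LE")
units = utf16.unpack("v*")           # two 16-bit units: a surrogate pair

# An illegal UTF-8 byte makes the conversion raise.
bad = "\xFF".b.force_encoding("UTF-8")
raised = begin
  bad.encode("UTF-16LE")
  false
rescue Encoding::InvalidByteSequenceError
  true
end
```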
utf8 = Uconv.u16tou8(utf16)
utf8 = Uconv.u2tou8(ucs2)
Converts a UTF-16 string into a UTF-8 string. ZWNBSPs
(U+FEFF) are deleted by default. An illegal surrogate pair
raises an exception.
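What counts as an illegal surrogate pair can be sketched with Ruby's own
UTF-16LE validation (Ruby 1.9 or later). This is an illustration only
and says nothing about how uconv performs the check internally.

```ruby
# Plain-Ruby illustration of legal and illegal surrogate sequences.
# A high surrogate (0xD800..0xDBFF) must be followed by a low one
# (0xDC00..0xDFFF).
pair = [0xD842, 0xDFB7].pack("v*").force_encoding("UTF-16LE")
lone = [0xD842, 0x0041].pack("v*").force_encoding("UTF-16LE")

pair_ok = pair.valid_encoding?   # high + low surrogate
lone_ok = lone.valid_encoding?   # high surrogate followed by "A"
utf8    = pair.encode("UTF-8")   # the legal pair converts normally
```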
utf8 = Uconv.u4tou8(ucs4)
Converts a UCS-4 string into a UTF-8 string. ZWNBSPs
(U+FEFF) are deleted by default.
ucs4 = Uconv.u8tou4(utf8)
Converts a UTF-8 string into a UCS-4 string. An illegal
UTF-8 sequence raises an exception.
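The little-endian UCS-4 layout can be sketched with pack and unpack
alone. This is plain Ruby, not uconv: each code point simply becomes one
32-bit little-endian integer.

```ruby
# Plain-Ruby sketch of UTF-8 <-> UCS-4 (little-endian), not uconv itself.
utf8 = "a\u3042"                         # "a" and HIRAGANA LETTER A
ucs4 = utf8.unpack("U*").pack("V*")      # code points -> 32-bit LE units
back = ucs4.unpack("V*").pack("U*")      # and back to UTF-8
```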
utf16 = Uconv.u4tou16(ucs4)
Converts a UCS-4 string into a UTF-16 string. A
character outside the range U-00000000 to
U-0010FFFF raises an exception.
ucs4 = Uconv.u16tou4(utf16)
Converts a UTF-16 string into a UCS-4 string. An illegal
surrogate pair raises an exception.
euc = Uconv.u16toeuc(utf16)
euc = Uconv.u2toeuc(ucs2)
Converts a UTF-16 string into an EUC-JP string. If the
"Uconv.unknown_unicode_handler" function is not defined,
a character that cannot be converted is replaced with '?'.
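The '?' substitution can be imitated with Ruby's built-in
String#encode (Ruby 1.9 or later). This is a stand-in for illustration;
Ruby's EUC-JP table is not guaranteed to match the uconv table for every
character.

```ruby
# Plain-Ruby stand-in for "unconvertible characters become '?'".
# "\u65E5\u672C\u8A9E" is "nihongo" in kanji; U+1F600 has no EUC-JP mapping.
utf16 = "\u65E5\u672C\u8A9E\u{1F600}".encode("UTF-16LE")
euc   = utf16.encode("EUC-JP", undef: :replace, replace: "?")
back  = euc.encode("UTF-8")          # for inspection only
```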
utf16 = Uconv.euctou16(euc)
ucs2 = Uconv.euctou2(euc)
Converts an EUC-JP string into a UTF-16 string.
euc = Uconv.u8toeuc(utf8)
Converts a UTF-8 string into an EUC-JP string. This is
equal to u16toeuc(u8tou16(utf8)).
utf8 = Uconv.euctou8(euc)
Converts an EUC-JP string into a UTF-8 string. This is
equal to u16tou8(euctou16(euc)).
sjis = Uconv.u16tosjis(utf16)
sjis = Uconv.u2tosjis(ucs2)
Converts a UTF-16 string into a CP932 string. If the
"Uconv.unknown_unicode_handler" function is not defined,
a character that cannot be converted is replaced with '?'.
utf16 = Uconv.sjistou16(sjis)
ucs2 = Uconv.sjistou2(sjis)
Converts a CP932 string into a UTF-16 string.
sjis = Uconv.u8tosjis(utf8)
Converts a UTF-8 string into a CP932 string. This is
equal to u16tosjis(u8tou16(utf8)).
utf8 = Uconv.sjistou8(sjis)
Converts a CP932 string into a UTF-8 string. This is
equal to u16tou8(sjistou16(sjis)).
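The composition through UTF-16 can be sketched with Ruby's built-in
String#encode (Ruby 1.9 or later), where CP932 is available under the
name "Windows-31J". This is an illustration, not uconv, and Ruby's table
may differ from the uconv table for some vendor-specific characters.

```ruby
# Plain-Ruby stand-in: CP932 -> UTF-16 -> UTF-8 equals CP932 -> UTF-8.
text      = "\u65E5\u672C\u8A9E"                  # "nihongo" in kanji
sjis      = text.encode("Windows-31J")            # CP932 bytes
via_utf16 = sjis.encode("UTF-16LE").encode("UTF-8")
direct    = sjis.encode("UTF-8")
```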
Uconv.unknown_unicode_euc_handler = proc_obj
Version 0.6.0 or later.
When a UTF-16 or a UTF-8 string is converted into an
EUC-JP string, this proc is called when a character
that cannot be converted is detected.
proc_obj = proc {|unicode| euc_str }
The parameter is a Unicode character code as an
integer. You must return a string. This variable is not
defined initially.
This variable is thread-local.
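The shape of such a handler (integer code point in, replacement string
out) can be tried out with the fallback option of Ruby's built-in
String#encode. This is a sketch, not uconv: String#encode passes the
character as a string, so the hypothetical wrapper below converts it to
an integer first.

```ruby
# Plain-Ruby sketch of a "code point in, string out" handler.
# handler is a hypothetical example proc with the documented shape.
handler = proc { |unicode| format("&#x%04X;", unicode) }

# String#encode hands the unconvertible character over as a String;
# ch.ord turns it into the integer the handler expects.
euc  = "A\u{1F600}".encode("EUC-JP", fallback: ->(ch) { handler.call(ch.ord) })
back = euc.encode("UTF-8")           # for inspection only
```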
Uconv.unknown_unicode_sjis_handler = proc_obj
Version 0.6.0 or later.
When a UTF-16 or a UTF-8 string is converted into a
CP932 string, this proc is called when a
character that cannot be converted is detected.
proc_obj = proc {|unicode| sjis_str }
The parameter is a Unicode character code as an
integer. You must return a string. This variable is not
defined initially.
This variable is thread-local.
Uconv.unknown_euc_handler = proc_obj
Version 0.6.0 or later.
When an EUC-JP string is converted into a UTF-16 or UTF-8
string, this proc is called when a character that is
undefined in JIS X 0208 or JIS X 0212 is detected.
proc_obj = proc {|euc_str| unicode }
The parameter is an EUC-JP string (1..3 bytes).
You must return a Unicode value as a 31-bit integer.
This variable is thread-local.
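The contract of this handler (byte string in, integer code point out)
can be sketched in plain Ruby. The proc below is a hypothetical example
that maps every unknown sequence to U+FFFD (REPLACEMENT CHARACTER); it
is called directly here, without uconv.

```ruby
# Hypothetical handler: any undefined EUC-JP sequence becomes U+FFFD.
handler = proc { |euc_str| 0xFFFD }

unknown = "\xFE\xFE".b                    # sample 2-byte input
code    = handler.call(unknown)           # an Integer code point
fits    = code.between?(0, 0x7FFFFFFF)    # must fit in 31 bits
char    = [code].pack("U")                # the code point as UTF-8
```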
Uconv.unknown_sjis_handler = proc_obj
Version 0.6.0 or later.
When a CP932 string is converted into a UTF-16 or UTF-8
string, this proc is called when a character that is
undefined in CP932 is detected.
proc_obj = proc {|sjis_str| unicode }
The parameter is a CP932 string (1 byte or 2 bytes).
You must return a Unicode value as a 31-bit integer.
This variable is thread-local.
Uconv.euc_hook = proc_obj
Version 0.6.0 or later.
Uconv.sjis_hook = proc_obj
Version 0.6.0 or later.
Uconv.unicode_euc_hook = proc_obj
Version 0.6.0 or later.
Uconv.unicode_sjis_hook = proc_obj
Version 0.6.0 or later.
euc = Uconv.unknown_unicode_handler(unicode)
** deprecated **
When a UTF-16 or a UTF-8 string is converted into an EUC-JP
or CP932 string, this function is called when a
character that cannot be converted is detected. The
parameter is a Unicode character code as an integer. You
must return a string. This function is not defined
initially.
euc = Uconv.unknown_unicode_euc_handler(unicode)
When a UTF-16 or a UTF-8 string is converted into an EUC-JP
string, this function is called when a
character that cannot be converted is detected. The
parameter is a Unicode character code as an integer. You
must return a string. This function is not defined
initially.
sjis = Uconv.unknown_unicode_sjis_handler(unicode)
When a UTF-16 or a UTF-8 string is converted into a
CP932 string, this function is called when a
character that cannot be converted is detected. The
parameter is a Unicode character code as an integer. You
must return a string. This function is not defined
initially.
unicode = Uconv.unknown_euc_handler(euc)
When an EUC-JP string is converted into a UTF-16 or UTF-8
string, this function is called when a character that is
undefined in JIS X 0208 or JIS X 0212 is detected.
The parameter is an EUC-JP string (1..3 bytes).
You must return a Unicode value as a 31-bit integer.
unicode = Uconv.unknown_sjis_handler(sjis)
When a CP932 string is converted into a UTF-16 or UTF-8
string, this function is called when a character that is
undefined in CP932 is detected. The parameter is a
CP932 string (1 byte or 2 bytes).
You must return a Unicode value as a 31-bit integer.
flag = Uconv::eliminate_zwnbsp
Uconv::eliminate_zwnbsp = flag
Gets/sets the ZWNBSP elimination flag. The flag must be true or false.
It is true in the initial state. If true, the u4tou8 and
u16tou8 functions eliminate all ZWNBSPs; if false, they
preserve all ZWNBSPs.
This variable is thread-local on version 0.6.0 or later.
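The effect of the flag can be sketched in plain Ruby. The to_utf8 helper
and its keyword argument are hypothetical stand-ins for the module-level
flag; the conversion itself uses Ruby's built-in String#encode, not
uconv.

```ruby
# Plain-Ruby sketch of ZWNBSP elimination (not the uconv implementation).
# to_utf8 and its keyword argument are hypothetical names.
def to_utf8(utf16le, eliminate_zwnbsp: true)
  utf8 = utf16le.encode("UTF-8")
  # Every U+FEFF is dropped, not only a leading one.
  eliminate_zwnbsp ? utf8.delete("\uFEFF") : utf8
end

src      = "\uFEFFab\uFEFFc".encode("UTF-16LE")
stripped = to_utf8(src)
kept     = to_utf8(src, eliminate_zwnbsp: false)
```

Note that a U+FEFF in the middle of the text is removed as well, which
matters if the input uses it as a real ZERO WIDTH NO-BREAK SPACE.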
flag = Uconv::shortest
Uconv::shortest = flag
Gets/sets the shortest form flag. The flag must be true or false.
It is true in the initial state. If true, the u8to*
functions raise an exception when the UTF-8 string is not
in the shortest form.
This variable is thread-local on version 0.6.0 or later.
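A non-shortest form is a UTF-8 sequence that uses more bytes than the
code point needs. The plain-Ruby sketch below decodes the classic
two-byte overlong form of "/" by hand to show why it is rejected; it
does not use uconv.

```ruby
# Plain-Ruby illustration of a non-shortest (overlong) UTF-8 sequence.
overlong = "\xC0\xAF".b                    # two bytes that decode to "/"
b0, b1   = overlong.bytes

# Lenient decoding of a 2-byte sequence: 110xxxxx 10yyyyyy.
code     = ((b0 & 0x1F) << 6) | (b1 & 0x3F)
shortest = [code].pack("U")                # the proper encoding of that code point

non_shortest = overlong.bytesize > shortest.bytesize
ruby_accepts = overlong.dup.force_encoding("UTF-8").valid_encoding?
```

Rejecting such sequences prevents characters like "/" from slipping past
checks that only look for the one-byte form.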
char = Uconv::replace_invalid
Uconv::replace_invalid(char)
Gets/sets the replacement character for invalid byte
sequences in UTF-8, UTF-16, and UCS-4 strings. If nil, an
invalid byte sequence raises an exception. If a non-nil
integer, the invalid sequence is replaced by the replacement
character. The initial replacement character is nil.
This variable is thread-local on version 0.6.0 or later.
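The two behaviours (raise when nil, substitute when an integer is given)
can be sketched with Ruby's built-in String#scrub (Ruby 2.1 or later).
The decode_utf8 helper, its keyword argument, and the exception class
are hypothetical and only mirror the description above.

```ruby
# Plain-Ruby sketch of "replace or raise" (not the uconv implementation).
# decode_utf8 and its keyword argument are hypothetical names.
def decode_utf8(bytes, replace_invalid: nil)
  str = bytes.dup.force_encoding("UTF-8")
  return str if str.valid_encoding?
  raise ArgumentError, "invalid UTF-8 byte sequence" if replace_invalid.nil?
  # The replacement is given as an integer code point.
  str.scrub([replace_invalid].pack("U"))
end

broken   = "ab\xFFcd".b
replaced = decode_utf8(broken, replace_invalid: 0x3F)   # 0x3F is "?"
```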
- Copying
This extension module is copyrighted free software by
Yoshida Masato.
You can redistribute it and/or modify it under the same terms
as Ruby.
- Author
Yoshida Masato <yoshidam@yoshidam.net>
- History
Aug 15, 2011 version 0.6.0 thread-local
default to --enable-fullwidth-reverse-solidus
Jan 3, 2010 version 0.5.3 Ruby 1.9.1
Aug 23, 2004 version 0.5.2 pre-conversion hook for Win32
Aug 19, 2004 version 0.5.1 u2s, s2u, shift_jis-2004
Aug 16, 2004 version 0.5.0 pre-conversion hook, euc-jis-2004, eucjp-open
Jul 18, 2004 version 0.4.13 fixes array index check
Mar 12, 2003 version 0.4.12 for ruby 1.8.0
Oct 3, 2002 version 0.4.11 adds --enable-compat-win32api for
Win32API compatible CP932 table
Sep 4, 2002 version 0.4.10 fixes memory leaks
Feb 10, 2002 version 0.4.9 adds replace_invalid
Dec 10, 2001 version 0.4.8 supports the tainted status
Nov 23, 2001 version 0.4.7 checks non-shortest form UTF-8
and changes Exception into Uconv::Error
Mar 4, 2001 version 0.4.6 fixes s2u_conv
and adds USE_WIN32API
Jan 30, 2001 version 0.4.5 fixes u2s_conv
and changes UCS/CP932 conversion table
Apr 18, 2000 version 0.4.4 SJIS to UCS conversion bug
Mar 11, 2000 version 0.4.3 Eliminates non-constant initializers
Nov 23, 1999 version 0.4.2 Appends eliminate_zwnbsp flag
Replace ustring library
Nov 5, 1999 version 0.4.0 Supports CP932
Mar 29, 1999 version 0.3.1 Removes xmallocs
Feb 22, 1999 version 0.3.0 Supports UCS-4 and UTF-16
Jan 13, 1999 version 0.2.2 Supports Japanese supplement characters
Aug 15, 1998 version 0.2.1 Appends this README file
Jul 24, 1998 version 0.2
Jul 8, 1998 version 0.1