	     Unicode Conversion Module for Ruby
			version 0.6.1

		       Yoshida Masato



- Introduction

This module converts between ISO/IEC 10646 (Unicode) strings
and Japanese-encoded strings.

The supported character encodings are UCS-4, UTF-16, UTF-8,
EUC-JP, and CP932 (a variant of Shift_JIS for Japanese Windows).

It cannot detect the character encoding automatically.

Note that the EUC-JP conversion table has been changed.


- Install

This module works with ruby-1.6. I recommend using
ruby-1.6.7 or later.

Extract this package.

  gzip -dc < uconv-0.2.tar.gz | tar xvf -
  cd uconv

If you do not need EUC-JP or CP932 conversion, you can
undefine USE_EUC or USE_SJIS in extconf.rb to reduce the size of this
module. On Windows, you can define USE_WIN32API in extconf.rb
to use the Win32 encoding conversion API.

Then build and install as usual.
For example, when Ruby supports dynamic linking on your OS,

  ruby extconf.rb
  make
  make install

or, using gem,

  gem build uconv.gemspec
  gem install uconv -- --enable-compat-win32api

-- Options of extconf.rb

  * --enable-euc [default]
    --disable-euc
  * --enable-sjis [default]
    --disable-sjis
  * --enable-win32api
    --disable-win32api [default]
  * --enable-fullwidth-reverse-solidus [default]
    --disable-fullwidth-reverse-solidus
  * --enable-compat-win32api [default]
    --disable-compat-win32api
  * --enable-thread-local [default]
    --disable-thread-local
  * --enable-utf-32 [default]
    --disable-utf-32


- Usage

If this module is not statically linked with Ruby, load it
before use:

  require "uconv"


- Module Function

  UTF-16 and UCS-4 strings must be little-endian unless they
  are converted with u16swap (u2swap) or u4swap.

  The functions that formerly handled UCS-2 can now handle UTF-16.

  Every ZERO WIDTH NO-BREAK SPACE (U+FEFF) is regarded as a
  BYTE ORDER MARK (BOM) and is deleted by some functions.

  The function matrix is as follows.

             |               dest
             |  EUC-JP    CP932     UTF-8    UTF-16    UCS-4
    ---------+------------------------------------------------
       EUC-JP|  \         -         euctou8  euctou16  -
    s  CP932 |  -         \         sjistou8 sjistou16 -
    r  UTF-8 |  u8toeuc   u8tosjis  \        u8tou16   u8tou4
    c  UTF-16|  u16toeuc  u16tosjis u16tou8  u16swap   u16tou4
       UCS-4 |  -         -         u4tou8   u4tou16   u4swap


  utf16 = Uconv.u16swap(utf16)
  ucs2 = Uconv.u2swap(ucs2)
  utf16 = Uconv.u16swap!(utf16)
  ucs2 = Uconv.u2swap!(ucs2)
    Byte-swaps a UTF-16 string. A little-endian string is
    converted into a big-endian string, and vice versa.
    The bang functions modify the parameter string in place.

  ucs4 = Uconv.u4swap(ucs4)
  ucs4 = Uconv.u4swap!(ucs4)
    Byte-swaps a UCS-4 string. A 1234-ordered string is
    converted into a 4321-ordered string.
    The bang function modifies the parameter string in place.
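
The byte swap described above can be pictured without the extension.
The following is a minimal pure-Ruby sketch, not Uconv's own
implementation. The helper names swap16 and swap32 are hypothetical,
and the input is assumed to be a binary string whose length is a
multiple of the unit size (unpack silently drops trailing odd bytes).

```ruby
# Hypothetical helpers that mimic what u16swap and u4swap are described
# as doing. They use only String#unpack and Array#pack from core Ruby.

# Read 16-bit little-endian units and write them back big-endian.
def swap16(str)
  str.unpack("v*").pack("n*")
end

# Read 32-bit little-endian units and write them back big-endian.
def swap32(str)
  str.unpack("V*").pack("N*")
end

utf16le = "A\u3042".encode("UTF-16LE").b   # bytes 41 00 42 30
utf16be = swap16(utf16le)                  # bytes 00 41 30 42
ucs4le  = [0x3042].pack("V")               # bytes 42 30 00 00
ucs4be  = swap32(ucs4le)                   # bytes 00 00 30 42
```

Because the same unit-wise reversal maps big-endian back to
little-endian, applying swap16 twice returns the original bytes.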

  utf16 = Uconv.u8tou16(utf8)
  ucs2 = Uconv.u8tou2(utf8)
    Converts a UTF-8 string into a UTF-16 string. An
    illegal UTF-8 sequence raises an exception. A
    character outside the range U-00000000 to
    U-0010FFFF also raises an exception.
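
As a rough analogue of this conversion, the sketch below uses Ruby's
built-in String#encode rather than Uconv, so the exception class is
Ruby's own and not Uconv's. The name utf8_to_utf16le is hypothetical.
It shows a BMP character becoming one 16-bit unit, a character above
U+FFFF becoming a surrogate pair, and an illegal byte being rejected.

```ruby
# Sketch with core Ruby only; utf8_to_utf16le is a hypothetical name.
def utf8_to_utf16le(utf8)
  # Tag the bytes as UTF-8, then transcode. Invalid bytes raise
  # Encoding::InvalidByteSequenceError.
  utf8.dup.force_encoding(Encoding::UTF_8).encode(Encoding::UTF_16LE)
end

bmp    = utf8_to_utf16le("A")           # one 16-bit unit: 41 00
astral = utf8_to_utf16le("\u{1F600}")   # surrogate pair: 3D D8 00 DE

begin
  utf8_to_utf16le("\xFF")               # 0xFF never appears in UTF-8
  rejected = false
rescue Encoding::InvalidByteSequenceError
  rejected = true
end
```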

  utf8 = Uconv.u16tou8(utf16)
  utf8 = Uconv.u2tou8(ucs2)
    Converts a UTF-16 string into a UTF-8 string. ZWNBSPs
    (U+FEFF) are deleted by default. An illegal surrogate pair
    raises an exception.

  utf8 = Uconv.u4tou8(ucs4)
    Converts a UCS-4 string into a UTF-8 string. ZWNBSPs
    (U+FEFF) are deleted by default.

  ucs4 = Uconv.u8tou4(utf8)
    Converts a UTF-8 string into a UCS-4 string. An illegal
    UTF-8 sequence raises an exception.

  utf16 = Uconv.u4tou16(ucs4)
    Converts a UCS-4 string into a UTF-16 string. A
    character outside the range U-00000000 to
    U-0010FFFF raises an exception.

  ucs4 = Uconv.u16tou4(utf16)
    Converts a UTF-16 string into a UCS-4 string. An illegal
    surrogate pair raises an exception.
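
To make the surrogate-pair rule concrete, here is a small pure-Ruby
sketch of decoding little-endian UTF-16 into code points. It is an
illustration only and not Uconv's code. The function name is
hypothetical, and ArgumentError stands in for whatever exception Uconv
actually raises.

```ruby
# Decode a little-endian UTF-16 byte string into an array of code points.
# A high surrogate must be followed by a low surrogate; anything else
# is treated as an illegal pair.
def utf16le_to_codepoints(bytes)
  units = bytes.unpack("v*")
  out = []
  i = 0
  while i < units.length
    u = units[i]
    if u.between?(0xD800, 0xDBFF)
      lo = units[i + 1]
      unless lo && lo.between?(0xDC00, 0xDFFF)
        raise ArgumentError, "unpaired high surrogate"
      end
      out << 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00)
      i += 2
    elsif u.between?(0xDC00, 0xDFFF)
      raise ArgumentError, "unpaired low surrogate"
    else
      out << u
      i += 1
    end
  end
  out
end

codepoints = utf16le_to_codepoints("\x3D\xD8\x00\xDE".b)  # [0x1F600]
ucs4le     = codepoints.pack("V*")                        # 00 F6 01 00
```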

  euc  = Uconv.u16toeuc(utf16)
  euc  = Uconv.u2toeuc(ucs2)
    Converts a UTF-16 string into an EUC-JP string. If no
    "Uconv.unknown_unicode_handler" function is defined,
    a character that cannot be converted is replaced with '?'.

  utf16 = Uconv.euctou16(euc)
  ucs2 = Uconv.euctou2(euc)
    Converts an EUC-JP string into a UTF-16 string.

  euc  = Uconv.u8toeuc(utf8)
    Converts a UTF-8 string into an EUC-JP string. This is
    equal to u16toeuc(u8tou16(utf8)).

  utf8 = Uconv.euctou8(euc)
    Converts an EUC-JP string into a UTF-8 string. This is
    equal to u16tou8(euctou16(euc)).
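
For comparison, the same kind of round trip can be tried with Ruby's
built-in transcoder. This sketch makes no Uconv calls, and Ruby's
EUC-JP table is not guaranteed to match Uconv's in every code point.
It also shows the '?' substitution for a character that EUC-JP cannot
represent.

```ruby
# Core-Ruby analogue of the UTF-8 <-> EUC-JP round trip.
utf8 = "\u3042"                  # HIRAGANA LETTER A
euc  = utf8.encode("EUC-JP")     # bytes A4 A2
back = euc.encode("UTF-8")       # the original character again

# An unconvertible character becomes '?' when a replacement is requested.
replaced = "\u{1F600}".encode("EUC-JP", undef: :replace, replace: "?")
```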

  sjis  = Uconv.u16tosjis(utf16)
  sjis  = Uconv.u2tosjis(ucs2)
    Converts a UTF-16 string into a CP932 string. If no
    "Uconv.unknown_unicode_handler" function is defined,
    a character that cannot be converted is replaced with '?'.

  utf16 = Uconv.sjistou16(sjis)
  ucs2 = Uconv.sjistou2(sjis)
    Converts a CP932 string into a UTF-16 string.

  sjis  = Uconv.u8tosjis(utf8)
    Converts a UTF-8 string into a CP932 string. This is
    equal to u16tosjis(u8tou16(utf8)).

  utf8 = Uconv.sjistou8(sjis)
    Converts a CP932 string into a UTF-8 string. This is
    equal to u16tou8(sjistou16(sjis)).
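
The "equal to" relationships above say that the direct UTF-8
conversions are compositions through UTF-16. The sketch below checks
the same property with Ruby's built-in transcoder, where CP932 is
named "Windows-31J". It is an analogy under that assumption and does
not call Uconv.

```ruby
# Direct conversion versus conversion through an intermediate UTF-16 step.
utf8   = "\u3042\u3044"                                   # two hiragana
direct = utf8.encode("Windows-31J")                       # one step
via16  = utf8.encode("UTF-16LE").encode("Windows-31J")    # two steps

same = (direct == via16)   # both paths give bytes 82 A0 82 A2
```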
 
  Uconv.unknown_unicode_euc_handler = proc_obj
    Version 0.6.0 or later.
    When a UTF-16 or UTF-8 string is converted into an
    EUC-JP string, this proc is called whenever a character
    that cannot be converted is detected.

      proc_obj = proc {|unicode| euc_str }

    The parameter is a Unicode character code as an
    integer. The proc must return a string. This variable is
    not defined initially.
    This variable is thread-local.

  Uconv.unknown_unicode_sjis_handler = proc_obj
    Version 0.6.0 or later.
    When a UTF-16 or UTF-8 string is converted into a
    CP932 string, this proc is called whenever a
    character that cannot be converted is detected.

      proc_obj = proc {|unicode| sjis_str }

    The parameter is a Unicode character code as an
    integer. The proc must return a string. This variable is
    not defined initially.
    This variable is thread-local.
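
The idea of a per-character handler can be tried out with the
fallback: option of Ruby's built-in String#encode. This is only an
analogy and not the Uconv API: Ruby passes the unconvertible character
to the proc as a one-character string, whereas the Uconv handlers
described above receive an integer code point. The handler below is
hypothetical.

```ruby
# Hypothetical handler: render an unconvertible character as a U+XXXX tag.
handler = proc { |ch| format("[U+%04X]", ch.ord) }

# U+1F600 has no EUC-JP or CP932 mapping, so the handler supplies the text.
euc  = "A\u{1F600}".encode("EUC-JP", fallback: handler)
sjis = "A\u{1F600}".encode("Windows-31J", fallback: handler)
```

In both results the letter A is converted normally and the emoji is
replaced by the ASCII text the handler returned.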

  Uconv.unknown_euc_handler = proc_obj
    Version 0.6.0 or later.
    When an EUC-JP string is converted into a UTF-16 or UTF-8
    string, this proc is called whenever a character that is
    undefined in JIS X 0208 or JIS X 0212 is detected.

      proc_obj = proc {|euc_str| unicode }

    The parameter is an EUC-JP string (1..3 bytes).
    The proc must return a Unicode value as a 31-bit integer.
    This variable is thread-local.

  Uconv.unknown_sjis_handler = proc_obj
    Version 0.6.0 or later.
    When a CP932 string is converted into a UTF-16 or UTF-8
    string, this proc is called whenever a character that is
    undefined in CP932 is detected.

      proc_obj = proc {|sjis_str| unicode }

    The parameter is a CP932 string (1 byte or 2 bytes).
    The proc must return a Unicode value as a 31-bit integer.
    This variable is thread-local.

  Uconv.euc_hook = proc_obj
    Version 0.6.0 or later.

  Uconv.sjis_hook = proc_obj
    Version 0.6.0 or later.

  Uconv.unicode_euc_hook = proc_obj
    Version 0.6.0 or later.

  Uconv.unicode_sjis_hook = proc_obj
    Version 0.6.0 or later.

  euc = Uconv.unknown_unicode_handler(unicode)
    ** deprecated **

    When a UTF-16 or UTF-8 string is converted into an EUC-JP
    or CP932 string, this function is called whenever a
    character that cannot be converted is detected. The
    parameter is a Unicode character code as an integer. You
    must return a string. This function is not defined
    initially.

  euc = Uconv.unknown_unicode_euc_handler(unicode)
    When a UTF-16 or UTF-8 string is converted into an EUC-JP
    string, this function is called whenever a
    character that cannot be converted is detected. The
    parameter is a Unicode character code as an integer. You
    must return a string. This function is not defined
    initially.

  sjis = Uconv.unknown_unicode_sjis_handler(unicode)
    When a UTF-16 or UTF-8 string is converted into a
    CP932 string, this function is called whenever a
    character that cannot be converted is detected. The
    parameter is a Unicode character code as an integer. You
    must return a string. This function is not defined
    initially.

  unicode = Uconv.unknown_euc_handler(euc)
    When an EUC-JP string is converted into a UTF-16 or UTF-8
    string, this function is called whenever a character that
    is undefined in JIS X 0208 or JIS X 0212 is detected.
    The parameter is an EUC-JP string (1..3 bytes).
    You must return a Unicode value as a 31-bit integer.

  unicode = Uconv.unknown_sjis_handler(sjis)
    When a CP932 string is converted into a UTF-16 or UTF-8
    string, this function is called whenever a character that
    is undefined in CP932 is detected. The parameter is a
    CP932 string (1 byte or 2 bytes).
    You must return a Unicode value as a 31-bit integer.

  flag = Uconv::eliminate_zwnbsp
  Uconv::eliminate_zwnbsp = flag
    Gets/sets the ZWNBSP elimination flag. The flag must be true
    or false. It is true in the initial state. If true, the
    u4tou8 and u16tou8 functions eliminate all ZWNBSPs; if
    false, they preserve all ZWNBSPs.
    This variable is thread-local in version 0.6.0 or later.
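
The effect of this flag can be sketched in plain Ruby on a UTF-8
string. The helper filter_zwnbsp and its eliminate: keyword are
hypothetical stand-ins for the module-level flag, and the code does
not call Uconv.

```ruby
ZWNBSP = "\uFEFF"

# eliminate: true drops every U+FEFF; false leaves the string untouched.
def filter_zwnbsp(utf8, eliminate: true)
  eliminate ? utf8.delete(ZWNBSP) : utf8
end

with_bom = "\uFEFFab\uFEFFc"
stripped = filter_zwnbsp(with_bom)                    # "abc"
kept     = filter_zwnbsp(with_bom, eliminate: false)  # unchanged
```

Note that every U+FEFF is removed, not only a leading one, which
matches the description of all ZWNBSPs being treated as BOMs.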

  flag = Uconv::shortest
  Uconv::shortest = flag
    Gets/sets the shortest form flag. The flag must be true or
    false. It is true in the initial state. If true, the u8to*
    functions raise an exception when the UTF-8 string is not
    in the shortest form.
    This variable is thread-local in version 0.6.0 or later.
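
A non-shortest (overlong) form encodes a code point in more bytes than
necessary. The classic example is C0 AF, which decodes to "/" even
though "/" fits in one byte. The sketch below decodes a two-byte
sequence by hand to show the check. The helper names are hypothetical
and the code is not taken from Uconv.

```ruby
# Decode a two-byte UTF-8 sequence (110xxxxx 10xxxxxx) by hand.
def decode_two_byte(b1, b2)
  ((b1 & 0x1F) << 6) | (b2 & 0x3F)
end

# A two-byte sequence is in shortest form only if it encodes U+0080 or above.
def shortest_two_byte?(b1, b2)
  decode_two_byte(b1, b2) >= 0x80
end

overlong_slash = decode_two_byte(0xC0, 0xAF)      # 0x2F, the code point of "/"
ok_overlong    = shortest_two_byte?(0xC0, 0xAF)   # false: "/" fits in one byte
ok_e_acute     = shortest_two_byte?(0xC3, 0xA9)   # true: U+00E9 needs two bytes

# Ruby's own UTF-8 validator also rejects the overlong form.
ruby_accepts = "\xC0\xAF".valid_encoding?         # false
```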

  char = Uconv::replace_invalid
  Uconv::replace_invalid = char
    Gets/sets the replacement character for invalid byte
    sequences in UTF-8, UTF-16, and UCS-4 strings. If nil, an
    invalid byte sequence raises an exception. If a non-nil
    integer, the invalid sequence is replaced by that
    character. The initial replacement character is nil.
    This variable is thread-local in version 0.6.0 or later.
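
The two behaviours (raise when nil, substitute when an integer) can be
mimicked for UTF-8 with Ruby's built-in String#scrub. This is a sketch
under that assumption. The name clean_utf8 is hypothetical, and
ArgumentError stands in for the exception Uconv would raise.

```ruby
# replacement is nil (raise on invalid input) or an Integer code point.
def clean_utf8(str, replacement)
  return str if str.valid_encoding?
  raise ArgumentError, "invalid byte sequence" if replacement.nil?
  # Turn the code point into a one-character UTF-8 string and substitute it
  # for each invalid byte sequence.
  str.scrub([replacement].pack("U"))
end

fixed = clean_utf8("a\xFFb", 0x3F)   # 0x3F is '?', so the result is "a?b"
```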


- Copying

This extension module is copyrighted free software by
Yoshida Masato.

You can redistribute it and/or modify it under the same terms
as Ruby.


- Author

 Yoshida Masato <yoshidam@yoshidam.net>


- History

 Aug 15, 2011 version 0.6.0 thread-local
                            default to --enable-fullwidth-reverse-solidus
 Jan  3, 2010 version 0.5.3 Ruby 1.9.1
 Aug 23, 2004 version 0.5.2 pre-conversion hook for Win32
 Aug 19, 2004 version 0.5.1 u2s, s2u, shift_jis-2004
 Aug 16, 2004 version 0.5.0  pre-conversion hook, euc-jis-2004, eucjp-open
 Jul 18, 2004 version 0.4.13 fixes array index check
 Mar 12, 2003 version 0.4.12 for ruby 1.8.0
 Oct  3, 2002 version 0.4.11 adds --enable-compat-win32api for
                             Win32API compatible CP932 table
 Sep  4, 2002 version 0.4.10 fixes memory leaks
 Feb 10, 2002 version 0.4.9 adds replace_invalid
 Dec 10, 2001 version 0.4.8 supports the tainted status
 Nov 23, 2001 version 0.4.7 checks non-shortest form UTF-8
                            and changes Exception into Uconv::Error
 Mar  4, 2001 version 0.4.6 fixes s2u_conv
                            and adds USE_WIN32API
 Jan 30, 2001 version 0.4.5 fixes u2s_conv
                            and changes UCS/CP932 conversion table
 Apr 18, 2000 version 0.4.4 SJIS to UCS conversion bug
 Mar 11, 2000 version 0.4.3 Eliminates non-constant initializers
 Nov 23, 1999 version 0.4.2 Appends eliminate_zwnbsp flag
                            Replace ustring library
 Nov  5, 1999 version 0.4.0 Supports CP932
 Mar 29, 1999 version 0.3.1 Removes xmallocs
 Feb 22, 1999 version 0.3.0 Supports UCS-4 and UTF-16
 Jan 13, 1999 version 0.2.2 Supports Japanese supplement characters
 Aug 15, 1998 version 0.2.1 Appends this README file
 Jul 24, 1998 version 0.2
 Jul  8, 1998 version 0.1