# CharlockHolmes

Character encoding detection library for Ruby using [ICU](http://site.icu-project.org/).

## Usage

First you'll need to require it:

``` ruby
require 'charlock_holmes'
```

## Encoding detection

``` ruby
contents = File.read('test.xml')
detection = CharlockHolmes::EncodingDetector.detect(contents)
# => {:encoding => 'UTF-8', :confidence => 100, :type => :text}

# optionally there will be a :language key as well, but
# that's mostly only returned for legacy encodings like ISO-8859-1
```

NOTE: `CharlockHolmes::EncodingDetector.detect` will return `nil` if it was unable to find an encoding.

For binary content, `:type` will be set to `:binary`.

It's more efficient, though, to reuse one detector instance:

``` ruby
detector = CharlockHolmes::EncodingDetector.new

detection1 = detector.detect(File.read('test.xml'))
detection2 = detector.detect(File.read('test2.json'))

# and so on...
```
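
Since `detect` can return `nil` or a `:binary` result, callers will usually want to branch on the result before reading `:encoding`. Here is a minimal sketch of that branching. `describe_detection` is a hypothetical helper written for illustration, not part of charlock_holmes, and it only assumes the result shapes described above.

``` ruby
# Hypothetical helper, not part of charlock_holmes.
# `detection` is whatever `detect` returned: nil, or a Hash with :type
# and, for text content, :encoding and :confidence (and sometimes :language).
def describe_detection(detection)
  # detect returns nil when no encoding could be found
  return 'unknown encoding' if detection.nil?
  # binary content is flagged through :type
  return 'binary content' if detection[:type] == :binary

  summary = "#{detection[:encoding]} (confidence #{detection[:confidence]})"
  # :language is optional, so only mention it when present
  summary += ", language #{detection[:language]}" if detection[:language]
  summary
end
```

With the gem loaded, `describe_detection(detector.detect(File.read('test.xml')))` would then give a short summary string for any of the three cases.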

### String monkey patch

Alternatively, you can just use the `detect_encoding` method on the `String` class:

``` ruby
require 'charlock_holmes/string'

contents = File.read('test.xml')

detection = contents.detect_encoding
```

### Ruby 1.9 specific

NOTE: This method only exists on Ruby 1.9+

If you want to use this library to detect and set the encoding flag on strings, you can use the `detect_encoding!` method on the `String` class:

``` ruby
require 'charlock_holmes/string'

contents = File.read('test.xml')

# this will detect and set the encoding of `contents`, then return self
contents.detect_encoding!
```

## Transcoding

Being able to detect the encoding of some arbitrary content is nice, but what you probably want is to be able to transcode that content into an encoding your application is using.

``` ruby
content = File.read('test2.txt')
detection = CharlockHolmes::EncodingDetector.detect(content)
utf8_encoded_content = CharlockHolmes::Converter.convert content, detection[:encoding], 'UTF-8'
```

The first parameter is the content to transcode, the second is the source encoding (the encoding the content is assumed to be in), and the third parameter is the destination encoding.
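
The snippet above assumes detection succeeded. Since `detect` may return `nil`, and binary content has no text encoding to convert from, a guard around the conversion can help. The following is a rough sketch only: `to_utf8` and its `converter` parameter are hypothetical names introduced here, not part of the library.

``` ruby
# Hypothetical wrapper, not part of charlock_holmes.
# `converter` is any object responding to convert(content, source, destination).
# The default is only looked up when the argument is omitted, so
# `require 'charlock_holmes'` must have run before calling it that way.
def to_utf8(content, detection, converter = CharlockHolmes::Converter)
  # Nothing sensible to transcode when detection failed or content is binary
  return nil if detection.nil? || detection[:type] == :binary

  # Same argument order as described above: content, source, destination
  converter.convert(content, detection[:encoding], 'UTF-8')
end
```

With the gem loaded, `to_utf8(content, CharlockHolmes::EncodingDetector.detect(content))` would return the transcoded string, or `nil` when there is nothing to transcode. Taking the converter as an argument is just a convenience for swapping in a stand-in during tests.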

## Installing

If the traditional `gem install charlock_holmes` doesn't work, you may need to specify the path to
your installation of ICU, either with the `--with-icu-dir` option during the gem install or by configuring
Bundler to pass that option to RubyGems.

Configure Bundler to always use the correct arguments when installing:

    bundle config build.charlock_holmes --with-icu-dir=/path/to/installed/icu4c

Using Gem to install directly without Bundler:

    gem install charlock_holmes -- --with-icu-dir=/path/to/installed/icu4c


### Homebrew

If you're installing on Mac OS X then using [Homebrew](http://mxcl.github.com/homebrew/) is
the easiest way to install ICU.

However, be warned: it is a keg-only install (see [homedir issue #167](https://github.com/mxcl/homebrew/issues/167)
for more info), meaning RubyGems won't find it unless you specify `--with-icu-dir`.

To install ICU with Homebrew:

    brew install icu4c

Configure Bundler to always use the correct arguments when installing:

    bundle config build.charlock_holmes --with-icu-dir=/usr/local/opt/icu4c

Using Gem to install directly without Bundler:

    gem install charlock_holmes -- --with-icu-dir=/usr/local/opt/icu4c