# Unicode Transforms
Fast Unicode 14.0.0 normalization in Haskell (NFC, NFKC, NFD, NFKD).
## What is normalization?
Unicode characters with adornments (e.g. Á) can be represented in two different
forms: as a single composed character (U+00C1 = Á) or as a sequence of
decomposed characters (U+0041 (A) followed by U+0301 ( ́ ) = Á). The two forms
are encoded as different byte sequences, but to a human reader they look
exactly the same.

A plain byte comparison may report that two strings differ even though they
are equivalent. Both strings must be converted to a
[`normalized`](http://unicode.org/reports/tr15/) form, as defined using the
[Unicode Character Database](http://www.unicode.org/Public/UCD/latest/),
before they can be compared for equivalence. For example:
```
>> :set -XOverloadedStrings
>> import Data.Text.Normalize
>> normalize NFC "\193" == normalize NFC "\65\769"
True
```
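The same comparison can be written as a small standalone program. This is a minimal sketch: `canonicallyEqual`, `composed` and `decomposed` are hypothetical names introduced for illustration and are not part of the package; only `normalize` and the `NormalizationMode` constructors come from `Data.Text.Normalize`.

```haskell
{-# LANGUAGE OverloadedStrings #-}

module Main (main) where

import Data.Text (Text)
import qualified Data.Text as T
import Data.Text.Normalize (NormalizationMode (..), normalize)

-- Hypothetical helper (not part of unicode-transforms): treat two strings
-- as canonically equivalent when their NFC forms are identical.
canonicallyEqual :: Text -> Text -> Bool
canonicallyEqual a b = normalize NFC a == normalize NFC b

-- U+00C1, the single composed character.
composed :: Text
composed = "\193"

-- U+0041 followed by U+0301, the decomposed sequence.
decomposed :: Text
decomposed = "\65\769"

main :: IO ()
main = do
  print (composed == decomposed)               -- False: the code points differ
  print (canonicallyEqual composed decomposed) -- True: equal after NFC
  print (T.length (normalize NFD composed))    -- 2: base letter + combining mark
  print (T.length (normalize NFC decomposed))  -- 1: single composed character
```

Normalizing both sides to the same form is what matters; NFD would work equally well for the equality check, at the cost of longer strings.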
## Performance
The table below compares the normalization performance of this package
(v0.3.7) with the [text-icu](http://hackage.haskell.org/package/text-icu)
package, which uses the [ICU C++ library](http://site.icu-project.org/download)
(version ICU4C 65.1), on macOS. Each benchmark measures the time taken, in
milliseconds, to normalize a file in a given language to a given normalization
form with each package. A positive `% Diff` means `unicode-transforms` is
faster. `unicode-transforms` outperforms ICU in 19 of the 28 benchmarks; all
nine cases where ICU is faster are the composing forms NFC and NFKC.
```
Benchmark        unicode-transforms(ms)  ICU(ms)    % Diff
---------------  ----------------------  -------  --------
NFKD/Korean                        7.78    37.10   +376.87
NFD/Korean                         7.86    37.06   +371.50
NFKD/Vietnamese                    6.85    12.48    +82.20
NFKD/Deutsch                       2.17     3.55    +63.30
NFKD/English                       1.71     2.78    +62.30
NFKC/Korean                        4.77     7.65    +60.28
NFD/Deutsch                        2.24     3.53    +57.41
NFD/English                        1.76     2.77    +57.32
NFC/Vietnamese                    10.66    16.63    +56.00
NFKC/Vietnamese                   10.95    16.58    +51.43
NFD/Devanagari                     6.48     8.68    +34.10
NFC/Devanagari                     6.77     8.49    +25.48
NFD/AllChars                       6.18     7.41    +19.91
NFD/Japanese                       7.80     9.20    +17.99
NFKC/Devanagari                    7.33     8.48    +15.74
NFKD/Japanese                      8.71    10.05    +15.39
NFD/Vietnamese                     5.94     6.83    +14.99
NFKD/Devanagari                    7.59     8.68    +14.27
NFKD/AllChars                      9.80    10.66     +8.82
NFKC/Deutsch                       3.21     3.18     -0.72
NFC/Korean                         4.62     4.38     -5.35
NFKC/English                       2.21     2.06     -6.88
NFC/English                        2.19     2.04     -7.21
NFKC/AllChars                     14.67     9.75    -50.51
NFC/Deutsch                        3.02     1.95    -54.39
NFKC/Japanese                     12.46     5.42   -129.93
NFC/AllChars                       9.72     3.58   -171.63
NFC/Japanese                      11.90     3.04   -292.04
```
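The README does not state how `% Diff` is computed. The figures look consistent with taking the difference between the two timings relative to the faster (smaller) one, and the sketch below works under that assumption. `percentDiff` is a hypothetical name, and the formula is inferred from the table rather than taken from the benchmark suite.

```haskell
module Main (main) where

import Text.Printf (printf)

-- Assumed formula (inferred from the table, not from the benchmark code):
-- the difference between the two timings, relative to the faster one.
-- Positive when unicode-transforms is faster, negative when ICU is faster.
percentDiff :: Double -> Double -> Double
percentDiff ours icu = (icu - ours) / min ours icu * 100

main :: IO ()
main = do
  -- Timings in milliseconds, copied from two rows of the table.
  printf "NFKD/Korean  %+.2f\n" (percentDiff 7.78 37.10)
  printf "NFC/Japanese %+.2f\n" (percentDiff 11.90 3.04)
```

Under this assumption the two rows come out at roughly +377% and -291%, close to the tabulated +376.87 and -292.04. The small mismatch is presumably because the timings in the table are rounded to two decimal places.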
## Talks
* [Functional Conf 2018 Video](https://www.youtube.com/watch?v=aJvwORrBJ0o) | [Functional Conf 2018 Slides](https://www.slideshare.net/HarendraKumar10/high-performance-haskell)
## Contributing
Please use https://github.com/harendra-kumar/unicode-transforms to raise
issues or send pull requests.