# Unicode Transforms

[![Hackage](https://img.shields.io/hackage/v/unicode-transforms.svg?style=flat)](https://hackage.haskell.org/package/unicode-transforms)
[![Build Status](https://travis-ci.com/composewell/unicode-transforms.svg?branch=master)](https://travis-ci.com/composewell/unicode-transforms)
[![Windows Build status](https://ci.appveyor.com/api/projects/status/5wov8m1m0asvbv32?svg=true)](https://ci.appveyor.com/project/harendra-kumar/unicode-transforms)
[![Coverage Status](https://coveralls.io/repos/composewell/unicode-transforms/badge.svg?branch=master&service=github)](https://coveralls.io/github/composewell/unicode-transforms?branch=master)

Fast Unicode 14.0.0 normalization in Haskell (NFC, NFKC, NFD, NFKD).

## What is normalization?

Unicode characters with adornments such as accents (e.g. Á) can be represented
in two different forms: as a single composed character (U+00C1 = Á) or as a
sequence of decomposed characters (U+0041 (A) followed by U+0301 ( ́ ) = Á).
The two forms are encoded as different byte sequences, but to a human reader
they look exactly the same.

A plain byte comparison may report that two strings are different even though
they are equivalent. To compare strings for equivalence, we first need to
convert both of them to a
[normalized](http://unicode.org/reports/tr15/) form using the [Unicode
Character Database](http://www.unicode.org/Public/UCD/latest/). For example:
```
>> :set -XOverloadedStrings
>> import Data.Text.Normalize
>> normalize NFC "\193" == normalize NFC "\65\769"
True
```
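
The same check can also be written as a small standalone program. The following
is a minimal sketch, assuming the `text` and `unicode-transforms` packages are
available; `canonicallyEqual` is a hypothetical helper introduced here for
illustration and is not part of this package's API.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import Data.Text (Text)
import qualified Data.Text as T
import Data.Text.Normalize (NormalizationMode (..), normalize)

-- | Hypothetical helper (not provided by unicode-transforms):
-- two texts are treated as equivalent if their NFC forms are equal.
canonicallyEqual :: Text -> Text -> Bool
canonicallyEqual a b = normalize NFC a == normalize NFC b

main :: IO ()
main = do
  let composed   = "\193"    :: Text  -- U+00C1
      decomposed = "\65\769" :: Text  -- U+0041 U+0301
  -- A plain comparison sees two different code point sequences.
  print (composed == decomposed)               -- False
  -- After normalization the two strings compare equal.
  print (canonicallyEqual composed decomposed) -- True
  -- NFD splits the composed character into base letter + combining mark.
  print (T.length (normalize NFD composed))    -- 2
```

Normalizing both sides to the same form (NFC here) is what makes the
comparison meaningful; any of the four forms works as long as both strings use
the same one.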

## Performance

The table below compares the normalization performance of this package
(v0.3.7) with that of the
[text-icu](http://hackage.haskell.org/package/text-icu) package, which uses the
[ICU C++ library](http://site.icu-project.org/download) (ICU4C version 65.1),
on macOS. Each benchmark measures the time taken in milliseconds to normalize
a file in a given language and normalization form with each package. A
positive `% Diff` means `unicode-transforms` was faster; a negative value means
ICU was faster. `unicode-transforms` outperforms ICU in 19 of the 28
benchmarks.

```
Benchmark       unicode-transforms(ms) ICU(ms)    % Diff
--------------- ---------------------- -------   --------
NFKD/Korean                       7.78   37.10    +376.87
NFD/Korean                        7.86   37.06    +371.50
NFKD/Vietnamese                   6.85   12.48     +82.20
NFKD/Deutsch                      2.17    3.55     +63.30
NFKD/English                      1.71    2.78     +62.30
NFKC/Korean                       4.77    7.65     +60.28
NFD/Deutsch                       2.24    3.53     +57.41
NFD/English                       1.76    2.77     +57.32
NFC/Vietnamese                   10.66   16.63     +56.00
NFKC/Vietnamese                  10.95   16.58     +51.43
NFD/Devanagari                    6.48    8.68     +34.10
NFC/Devanagari                    6.77    8.49     +25.48
NFD/AllChars                      6.18    7.41     +19.91
NFD/Japanese                      7.80    9.20     +17.99
NFKC/Devanagari                   7.33    8.48     +15.74
NFKD/Japanese                     8.71   10.05     +15.39
NFD/Vietnamese                    5.94    6.83     +14.99
NFKD/Devanagari                   7.59    8.68     +14.27
NFKD/AllChars                     9.80   10.66      +8.82
NFKC/Deutsch                      3.21    3.18      -0.72
NFC/Korean                        4.62    4.38      -5.35
NFKC/English                      2.21    2.06      -6.88
NFC/English                       2.19    2.04      -7.21
NFKC/AllChars                    14.67    9.75     -50.51
NFC/Deutsch                       3.02    1.95     -54.39
NFKC/Japanese                    12.46    5.42    -129.93
NFC/AllChars                      9.72    3.58    -171.63
NFC/Japanese                     11.90    3.04    -292.04
```
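
The `% Diff` column is not defined in this README. The figures appear to be
consistent with a signed percentage difference taken relative to the faster of
the two timings. The sketch below illustrates that reading; the formula and the
name `percentDiff` are assumptions inferred from the table, not taken from the
benchmark code.

```haskell
module Main (main) where

-- | Assumed formula (inferred from the table, not from the benchmark code):
-- the difference between the two timings as a percentage of the faster one.
-- Positive when unicode-transforms is faster, negative when ICU is faster.
percentDiff
  :: Double  -- ^ unicode-transforms time (ms)
  -> Double  -- ^ ICU time (ms)
  -> Double
percentDiff ut icu
  | ut <= icu = (icu - ut) / ut * 100
  | otherwise = negate ((ut - icu) / icu * 100)

main :: IO ()
main = do
  print (percentDiff 7.78 37.10)  -- NFKD/Korean: table reports +376.87
  print (percentDiff 11.90 3.04)  -- NFC/Japanese: table reports -292.04
```

The results differ slightly from the table, presumably because the table's
percentages were computed from unrounded timings.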

## Talks

* [Functional Conf 2018 Video](https://www.youtube.com/watch?v=aJvwORrBJ0o) | [Functional Conf 2018 Slides](https://www.slideshare.net/HarendraKumar10/high-performance-haskell)

## Contributing
Please use https://github.com/harendra-kumar/unicode-transforms to raise
issues or send pull requests.