File: README.md

package info (click to toggle)
haskell-unicode-data 0.6.0-1
  • links: PTS, VCS
  • area: main
  • in suites: sid
  • size: 1,004 kB
  • sloc: haskell: 26,075; makefile: 3
file content (162 lines) | stat: -rw-r--r-- 4,708 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
# README

`unicode-data` provides Haskell APIs to efficiently access the Unicode
character database. [Performance](#performance) is the primary goal in the
design of this package.

The Haskell data structures are generated programmatically from the
[Unicode character database](https://www.unicode.org/ucd/) (UCD) files.
The latest Unicode version supported by this library is
[`15.1.0`](https://www.unicode.org/versions/Unicode15.1.0/).

Please see the
[Haddock documentation](https://hackage.haskell.org/package/unicode-data)
for reference documentation.

## Performance

`unicode-data` is up to _5 times faster_ than `base` ≤ 4.17 (see
[partial integration to `base`](#partial-integration-of-unicode-data-into-base)).

The following benchmark compares the time taken in milliseconds to process all
the Unicode code points (except surrogates, private use areas and unassigned),
for `base-4.16` (GHC 9.2.6) and this package (v0.4).
Machine: 8 × AMD Ryzen 5 2500U on Linux.

```
All
  Unicode.Char.Case.Compat
    isLower
      base:           OK (1.19s)
        17.1 ms ± 241 μs
      unicode-data:   OK (0.52s)
        3.58 ms ± 125 μs, 0.21x
    isUpper
      base:           OK (0.63s)
        17.5 ms ± 359 μs
      unicode-data:   OK (1.02s)
        3.58 ms ±  48 μs, 0.21x
    toLower
      base:           OK (0.59s)
        16.3 ms ± 524 μs
      unicode-data:   OK (0.80s)
        5.63 ms ± 129 μs, 0.35x
    toTitle
      base:           OK (3.91s)
        14.9 ms ± 427 μs
      unicode-data:   OK (2.84s)
        5.31 ms ±  37 μs, 0.36x
    toUpper
      base:           OK (2.12s)
        15.4 ms ± 234 μs
      unicode-data:   OK (0.86s)
        5.80 ms ± 159 μs, 0.38x
  Unicode.Char.General
    generalCategory
      base:           OK (1.16s)
        16.6 ms ± 534 μs
      unicode-data:   OK (0.62s)
        4.14 ms ± 103 μs, 0.25x
    isAlphaNum
      base:           OK (0.62s)
        17.1 ms ± 655 μs
      unicode-data:   OK (0.97s)
        3.59 ms ±  51 μs, 0.21x
    isControl
      base:           OK (0.63s)
        17.6 ms ± 494 μs
      unicode-data:   OK (0.57s)
        3.59 ms ±  90 μs, 0.20x
    isMark
      base:           OK (0.34s)
        17.6 ms ± 695 μs
      unicode-data:   OK (1.00s)
        3.59 ms ±  67 μs, 0.20x
    isPrint
      base:           OK (1.22s)
        17.7 ms ± 492 μs
      unicode-data:   OK (1.92s)
        3.56 ms ±  27 μs, 0.20x
    isPunctuation
      base:           OK (2.23s)
        16.6 ms ± 619 μs
      unicode-data:   OK (1.05s)
        3.60 ms ±  52 μs, 0.22x
    isSeparator
      base:           OK (1.15s)
        16.6 ms ± 439 μs
      unicode-data:   OK (0.49s)
        3.60 ms ±  85 μs, 0.22x
    isSymbol
      base:           OK (2.11s)
        16.1 ms ± 553 μs
      unicode-data:   OK (1.05s)
        3.58 ms ±  62 μs, 0.22x
  Unicode.Char.General.Compat
    isAlpha
      base:           OK (0.58s)
        17.2 ms ± 502 μs
      unicode-data:   OK (1.02s)
        3.58 ms ±  50 μs, 0.21x
    isLetter
      base:           OK (8.57s)
        16.4 ms ± 553 μs
      unicode-data:   OK (1.05s)
        3.58 ms ±  79 μs, 0.22x
    isSpace
      base:           OK (1.09s)
        7.56 ms ± 159 μs
      unicode-data:   OK (0.97s)
        3.58 ms ±  46 μs, 0.47x
  Unicode.Char.Numeric.Compat
    isNumber
      base:           OK (0.58s)
        15.7 ms ± 462 μs
      unicode-data:   OK (0.58s)
        3.58 ms ± 107 μs, 0.23x
```

### Partial integration of `unicode-data` into `base`

Since `base` 4.18, `unicode-data` has been
_partially_ [integrated to GHC](https://gitlab.haskell.org/ghc/ghc/-/merge_requests/8072),
so there should be no relevant difference. However, using `unicode-data` allows
to select the _exact_ version of Unicode to support, therefore not relying on
the version supported by GHC.

## Unicode database version update

To update the Unicode version please update the version number in
`ucd.sh`.

To download the Unicode database, run `ucd.sh download` from the top
level directory of the repo to fetch the database in `./ucd`.

```
$ ./ucd.sh download
```

To generate the Haskell data structure files from the downloaded database
files, run `ucd.sh generate` from the top level directory of the repo.

```
$ ./ucd.sh generate
```

## Running property doctests

Temporarily add `QuickCheck` to build depends of library.

```
$ cabal build
$ cabal-docspec --check-properties --property-variables c
```

## Licensing

`unicode-data` is an [open source](https://github.com/composewell/unicode-data)
project available under a liberal [Apache-2.0 license](LICENSE).

## Contributing

As an open project we welcome contributions.