# README
`unicode-data` provides Haskell APIs to efficiently access the Unicode
character database. [Performance](#performance) is the primary goal in the
design of this package.
The Haskell data structures are generated programmatically from the
[Unicode character database](https://www.unicode.org/ucd/) (UCD) files.
The latest Unicode version supported by this library is
[`15.0.0`](https://www.unicode.org/versions/Unicode15.0.0/).
Please see the
[Haddock documentation](https://hackage.haskell.org/package/unicode-data)
for reference documentation.
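
As a rough illustration of the API, the sketch below calls a few of the functions named in the benchmark listing in this README. It assumes the `unicode-data` package is installed and that the modules export these names as in the benchmarked version; `shout` and `alphaNumOnly` are hypothetical helpers, not part of the library.

```haskell
module Main (main) where

-- Requires the unicode-data package. Module and function names are taken
-- from the benchmark listing in this README; check the Haddock documentation
-- for the version you have installed.
import qualified Unicode.Char.Case.Compat as Case
import qualified Unicode.Char.General as General
import qualified Unicode.Char.General.Compat as Compat
import qualified Unicode.Char.Numeric as Numeric

-- | Hypothetical helper: upper-case every character of a string.
shout :: String -> String
shout = map Case.toUpper

-- | Hypothetical helper: keep only alphanumeric characters.
alphaNumOnly :: String -> String
alphaNumOnly = filter General.isAlphaNum

main :: IO ()
main = do
  putStrLn (shout "unicode")
  putStrLn (alphaNumOnly "a1, b2!")
  print (Compat.isSpace ' ')
  print (Numeric.isNumber '7')
```

The qualified imports keep these functions from clashing with the same-named ones in `Data.Char`.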
## Performance
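`unicode-data` is up to _5 times faster_ than `base`.
The following benchmark compares the time taken, in milliseconds, to process all
the Unicode code points with `base-4.16` (GHC 9.2.1) and with this package (v0.3).
The factor after each `unicode-data` result is its time relative to `base`, so
`0.19x` means roughly 5 times faster.
Machine: 8 × AMD Ryzen 5 2500U on Linux.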
```
All
  Unicode.Char.Case.Compat
    isLower
      base: OK (1.53s)
        24 ms ± 3.8 ms
      unicode-data: OK (2.25s)
        4.4 ms ± 88 μs, 0.19x
    isUpper
      base: OK (1.50s)
        24 ms ± 450 μs
      unicode-data: OK (2.37s)
        4.7 ms ± 200 μs, 0.19x
    toLower
      base: OK (1.40s)
        22 ms ± 1.8 ms
      unicode-data: OK (1.89s)
        7.2 ms ± 297 μs, 0.32x
    toTitle
      base: OK (1.25s)
        20 ms ± 2.0 ms
      unicode-data: OK (1.65s)
        6.4 ms ± 509 μs, 0.32x
    toUpper
      base: OK (1.26s)
        20 ms ± 2.5 ms
      unicode-data: OK (1.72s)
        6.8 ms ± 335 μs, 0.34x
  Unicode.Char.General
    generalCategory
      base: OK (2.02s)
        134 ms ± 1.6 ms
      unicode-data: OK (1.75s)
        116 ms ± 1.6 ms, 0.87x
    isAlphaNum
      base: OK (1.53s)
        24 ms ± 1.7 ms
      unicode-data: OK (2.16s)
        4.2 ms ± 29 μs, 0.18x
    isControl
      base: OK (1.47s)
        23 ms ± 2.6 ms
      unicode-data: OK (2.23s)
        4.4 ms ± 22 μs, 0.19x
    isMark
      base: OK (1.47s)
        23 ms ± 624 μs
      unicode-data: OK (2.28s)
        4.5 ms ± 48 μs, 0.19x
    isPrint
      base: OK (1.53s)
        25 ms ± 2.4 ms
      unicode-data: OK (2.27s)
        4.4 ms ± 50 μs, 0.18x
    isPunctuation
      base: OK (1.51s)
        24 ms ± 459 μs
      unicode-data: OK (2.24s)
        4.4 ms ± 25 μs, 0.18x
    isSeparator
      base: OK (1.52s)
        24 ms ± 407 μs
      unicode-data: OK (2.43s)
        4.8 ms ± 94 μs, 0.20x
    isSymbol
      base: OK (1.49s)
        24 ms ± 863 μs
      unicode-data: OK (1.34s)
        5.2 ms ± 92 μs, 0.22x
  Unicode.Char.General.Compat
    isAlpha
      base: OK (1.46s)
        23 ms ± 322 μs
      unicode-data: OK (2.14s)
        4.1 ms ± 36 μs, 0.18x
    isLetter
      base: OK (1.44s)
        22 ms ± 640 μs
      unicode-data: OK (2.17s)
        4.3 ms ± 58 μs, 0.19x
    isSpace
      base: OK (1.44s)
        11 ms ± 1.2 ms
      unicode-data: OK (1.36s)
        5.3 ms ± 243 μs, 0.49x
  Unicode.Char.Numeric
    isNumber
      base: OK (1.52s)
        24 ms ± 368 μs
      unicode-data: OK (2.41s)
        4.7 ms ± 41 μs, 0.19x
```
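
To make "process all the Unicode code points" concrete, here is a minimal sketch of that kind of workload. It is not the package's actual benchmark harness: `countMatching` is a hypothetical helper, there is no timing code, and it uses only `Data.Char` from `base`. Swapping in the corresponding predicate from `unicode-data` would give the other side of the comparison.

```haskell
module Main (main) where

import Data.Char (isUpper)
import Data.List (foldl')

-- | Hypothetical helper: apply a predicate to every code point
-- (U+0000 through U+10FFFF) and count how many satisfy it.
countMatching :: (Char -> Bool) -> Int
countMatching p = foldl' step 0 [minBound .. maxBound]
  where
    step n c
      | p c = n + 1
      | otherwise = n

main :: IO ()
main = do
  -- The result for isUpper depends on the Unicode version shipped with base.
  print (countMatching isUpper)
  -- Every code point satisfies a constant-True predicate.
  print (countMatching (const True))
```

A strict left fold keeps the traversal in constant space, so the measured time is dominated by the predicate rather than by list allocation.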
## Unicode database version update
To update the Unicode version please update the version number in
`ucd.sh`.
To download the Unicode database, run `ucd.sh download` from the top-level
directory of the repo; it fetches the database into `./ucd`.
```
$ ./ucd.sh download
```
To generate the Haskell data structure files from the downloaded database
files, run `ucd.sh generate` from the top level directory of the repo.
```
$ ./ucd.sh generate
```
## Running property doctests
Temporarily add `QuickCheck` to the `build-depends` of the library, then run:
```
$ cabal build
$ cabal-docspec --check-properties --property-variables c
```
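
For orientation, a property doctest is a `prop>` line inside a Haddock comment. The sketch below uses a hypothetical module and function (`AsciiUpper`, `asciiUpper`) that are not part of this package. It assumes `cabal-docspec` treats the free variable `c`, named via `--property-variables c`, as a value generated by QuickCheck.

```haskell
module AsciiUpper (asciiUpper) where

-- | Upper-case ASCII letters and leave every other character unchanged.
--
-- The free variable @c@ in the property below is the one named by
-- @--property-variables c@.
--
-- prop> asciiUpper (asciiUpper c) == asciiUpper c
asciiUpper :: Char -> Char
asciiUpper c
  | c >= 'a' && c <= 'z' = toEnum (fromEnum c - 32)
  | otherwise = c
```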
## Licensing
`unicode-data` is an [open source](https://github.com/composewell/unicode-data)
project available under a liberal [Apache-2.0 license](LICENSE).
## Contributing
As an open project we welcome contributions.