ucd-generate
============
A command line tool to generate Unicode tables in Rust source code. Tables
can typically be generated in one of three formats: a sorted sequence of
character ranges, a
[finite state transducer](https://github.com/BurntSushi/fst)
or a compressed trie. Full support for name canonicalization is also provided.
This tool also supports serializing regular expressions as DFAs using the
[regex-automata](https://github.com/BurntSushi/regex-automata)
crate.

[![Linux build status](https://api.travis-ci.org/BurntSushi/ucd-generate.png)](https://travis-ci.org/BurntSushi/ucd-generate)
[![](http://meritbadge.herokuapp.com/ucd-generate)](https://crates.io/crates/ucd-generate)


### Installation

Since this is mostly intended as a developer tool for use while writing Rust
programs, the principal method of installation is from crates.io:

```
$ cargo install ucd-generate
$ ucd-generate --help
```


### Example

This somewhat arbitrary example shows the output of generating tables for
three properties, and representing them as normal Rust character literal
ranges.

To run the example, you need to download the Unicode Character Database (UCD):

```
$ mkdir /tmp/ucd-10.0.0
$ cd /tmp/ucd-10.0.0
$ curl -LO https://www.unicode.org/Public/zipped/10.0.0/UCD.zip
$ unzip UCD.zip
```

Now tell `ucd-generate` what you want and point it to the directory created
above:

```
$ ucd-generate property-bool /tmp/ucd-10.0.0 --include Hyphen,Dash,Quotation_Mark --chars
```

And the output, which is valid Rust source code:

```rust
// DO NOT EDIT THIS FILE. IT WAS AUTOMATICALLY GENERATED BY:
//
//  ucd-generate property-bool /tmp/ucd-10.0.0 --include Hyphen,Dash,Quotation_Mark --chars
//
// ucd-generate is available on crates.io.

pub const BY_NAME: &'static [(&'static str, &'static [(char, char)])] = &[
  ("Dash", DASH), ("Hyphen", HYPHEN), ("Quotation_Mark", QUOTATION_MARK),
];

pub const DASH: &'static [(char, char)] = &[
  ('-', '-'), ('֊', '֊'), ('־', '־'), ('᐀', '᐀'), ('᠆', '᠆'),
  ('‐', '―'), ('⁓', '⁓'), ('⁻', '⁻'), ('₋', '₋'),
  ('−', '−'), ('⸗', '⸗'), ('⸚', '⸚'), ('⸺', '⸻'),
  ('⹀', '⹀'), ('〜', '〜'), ('〰', '〰'), ('゠', '゠'),
  ('︱', '︲'), ('﹘', '﹘'), ('﹣', '﹣'), ('－', '－'),
];

pub const HYPHEN: &'static [(char, char)] = &[
  ('-', '-'), ('\u{ad}', '\u{ad}'), ('֊', '֊'), ('᠆', '᠆'),
  ('‐', '‑'), ('⸗', '⸗'), ('・', '・'), ('﹣', '﹣'),
  ('－', '－'), ('･', '･'),
];

pub const QUOTATION_MARK: &'static [(char, char)] = &[
  ('\"', '\"'), ('\'', '\''), ('«', '«'), ('»', '»'), ('‘', '‟'),
  ('‹', '›'), ('⹂', '⹂'), ('「', '』'), ('〝', '〟'),
  ('﹁', '﹄'), ('＂', '＂'), ('＇', '＇'), ('｢', '｣'),
];
```
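
As a rough illustration of how tables in this format might be consumed, the
sketch below does a binary search over the sorted, inclusive ranges. The
`contains` and `lookup` helpers and the shortened `DASH_EXCERPT` and
`BY_NAME_EXCERPT` tables are hypothetical names introduced here; they are not
emitted by `ucd-generate`. The sketch assumes that ranges are sorted and
non-overlapping and that the names in the `BY_NAME`-style table are sorted,
as they appear to be in the output above.

```rust
use std::cmp::Ordering;

// Shortened excerpt of the DASH table above (illustration only).
// Each pair is an inclusive range; ranges are sorted by start.
pub const DASH_EXCERPT: &'static [(char, char)] = &[
    ('-', '-'), ('\u{2010}', '\u{2015}'), ('\u{2e3a}', '\u{2e3b}'),
];

// Same shape as BY_NAME above, with a single entry.
pub const BY_NAME_EXCERPT: &'static [(&'static str, &'static [(char, char)])] =
    &[("Dash", DASH_EXCERPT)];

/// Hypothetical helper: returns true if `c` falls inside any range of
/// `table`. Assumes the ranges are sorted and do not overlap.
pub fn contains(table: &[(char, char)], c: char) -> bool {
    table
        .binary_search_by(|&(start, end)| {
            if c < start {
                // This range lies entirely above `c`.
                Ordering::Greater
            } else if c > end {
                // This range lies entirely below `c`.
                Ordering::Less
            } else {
                Ordering::Equal
            }
        })
        .is_ok()
}

/// Hypothetical helper: finds a property's table by its exact name.
/// Assumes the names in `by_name` are sorted.
pub fn lookup(
    by_name: &[(&str, &'static [(char, char)])],
    name: &str,
) -> Option<&'static [(char, char)]> {
    by_name
        .binary_search_by(|&(n, _)| n.cmp(name))
        .ok()
        .map(|i| by_name[i].1)
}
```

With these helpers, a membership check is a lookup by property name followed
by a call to `contains` on the returned table. The name lookup is an exact
string match, so any loose name matching has to happen beforehand.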


### Contributing

The `ucd-generate` tool doesn't have any specific design goals, other than to
collect Unicode table generation tasks. If you need `ucd-generate` to do
something and it's reasonably straightforward to add, then submitting a
PR would be great. Otherwise, file an issue and we can discuss.


### Future work

This tool is by no means exhaustive. In fact, it's not even close to
exhaustive, and it may never be. For the most part, the intent of this tool
is to collect virtually any kind of Unicode generation task. In theory, this
would ideally replace the hodgepodge collection of Python programs that is
responsible for this task today in various Unicode crates.

It is likely, and perhaps desirable, that this tool will eventually be
deprecated in favor of a more complete project like
[UNIC](https://github.com/behnam/rust-unic).
The `ucd-generate` tool was born out of a desire to add more principled Unicode
support to Rust's regex crate, and it was much easier to develop this
out-of-band for my specific requirements.

Finally, the structures generated by this tool may not be optimal. In
particular, I strongly suspect that the trie set generator could be improved
dramatically.


### Sub-crates

This repository is home to three sub-crates:

* [`ucd-parse`](ucd-parse) - A crate for parsing UCD files into
  structured data.
* [`ucd-trie`](ucd-trie) - Auxiliary type for handling the trie
  set table format emitted by `ucd-generate`. This crate has a `no_std` mode.
* [`ucd-util`](ucd-util) - A purposely small crate for Unicode
  auxiliary functions. This includes things like symbol or character name
  canonicalization, ideograph name generation and helper functions for
  searching property name and value tables.
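
To give a feel for what symbol name canonicalization involves, here is a
simplified sketch in the spirit of the loose matching rule UAX44-LM3: ignore
case, whitespace, underscores and hyphens. The `loose_normalize` function is
a hypothetical name used for illustration and is not the `ucd-util` API; the
real crate's functions may differ in signature and detail. This sketch also
omits the rule about ignoring an initial `is` prefix.

```rust
/// Hypothetical helper (not part of `ucd-util`): canonicalizes a property
/// name or value for loose comparison by dropping whitespace, underscores
/// and hyphens, and lowercasing everything else.
pub fn loose_normalize(name: &str) -> String {
    name.chars()
        // Drop the separators that loose matching ignores.
        .filter(|c| !c.is_whitespace() && *c != '_' && *c != '-')
        // `to_lowercase` yields an iterator, since one character may
        // lowercase to several.
        .flat_map(|c| c.to_lowercase())
        .collect()
}
```

Under this sketch, `Quotation_Mark`, `quotation mark` and `QUOTATION-MARK` all
normalize to the same string, so they would compare equal when searching a
table of property names that was normalized the same way.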


### License

This project is licensed under either of
 * Apache License, Version 2.0, ([LICENSE-APACHE](LICENSE-APACHE) or
   http://www.apache.org/licenses/LICENSE-2.0)
 * MIT license ([LICENSE-MIT](LICENSE-MIT) or
   http://opensource.org/licenses/MIT)
at your option.