# unicode-segmentation-rs

Python bindings for the Rust [unicode-segmentation](https://docs.rs/unicode-segmentation/) and [unicode-width](https://docs.rs/unicode-width/) crates, providing Unicode text segmentation and width calculation according to Unicode standards.

## Features

- **Grapheme Cluster Segmentation**: Split text into user-perceived characters
- **Word Segmentation**: Split text into words according to Unicode rules
- **Sentence Segmentation**: Split text into sentences
- **Display Width Calculation**: Get the display width of text (for terminal/monospace display)
- **Gettext PO Wrapping**: Wrap text for gettext PO files with proper handling of escape sequences and CJK characters

## Installation

### From PyPI

```bash
uv pip install unicode-segmentation-rs
```

### From source

```bash
# Install maturin
pip install maturin

# Build and install the package
maturin develop --release
```

## Usage

```python
import unicode_segmentation_rs

# Grapheme clusters (user-perceived characters)
text = "Hello ๐Ÿ‘จโ€๐Ÿ‘ฉโ€๐Ÿ‘งโ€๐Ÿ‘ฆ World"
clusters = unicode_segmentation_rs.graphemes(text, is_extended=True)
print(clusters)  # ['H', 'e', 'l', 'l', 'o', ' ', '๐Ÿ‘จโ€๐Ÿ‘ฉโ€๐Ÿ‘งโ€๐Ÿ‘ฆ', ' ', 'W', 'o', 'r', 'l', 'd']

# Get grapheme clusters with their byte indices
indices = unicode_segmentation_rs.grapheme_indices(text, is_extended=True)
print(indices)  # [(0, 'H'), (1, 'e'), ...]

# Word boundaries (includes punctuation and whitespace)
text = "Hello, world!"
words = unicode_segmentation_rs.split_word_bounds(text)
print(words)  # ['Hello', ',', ' ', 'world', '!']

# Unicode words (excludes punctuation and whitespace)
words = unicode_segmentation_rs.unicode_words(text)
print(words)  # ['Hello', 'world']

# Word indices
indices = unicode_segmentation_rs.split_word_bound_indices(text)
print(indices)  # [(0, 'Hello'), (5, ','), ...]

# Sentence segmentation
text = "Hello world. How are you? I'm fine."
sentences = unicode_segmentation_rs.unicode_sentences(text)
print(sentences)  # ['Hello world. ', 'How are you? ', "I'm fine."]

# Display width calculation
text = "Hello ไธ–็•Œ"
width = unicode_segmentation_rs.text_width(text)
print(width)  # 10 (Hello=5, space=1, ไธ–=2, ็•Œ=2)

# Character width
print(unicode_segmentation_rs.text_width('A'))    # 1
print(unicode_segmentation_rs.text_width('ไธ–'))   # 2
print(unicode_segmentation_rs.text_width('\t'))   # None (control character)
```

## Examples

### Grapheme Cluster Segmentation

```python
import unicode_segmentation_rs

# Complex emojis and combining characters
text = "Hello ๐Ÿ‘จโ€๐Ÿ‘ฉโ€๐Ÿ‘งโ€๐Ÿ‘ฆ เคจเคฎเคธเฅเคคเฅ‡"
print(f"Text: {text}")
print(f"Graphemes: {unicode_segmentation_rs.graphemes(text, is_extended=True)}")
print(f"Length (graphemes): {len(unicode_segmentation_rs.graphemes(text, is_extended=True))}")
print(f"Length (chars): {len(text)}")

# With indices
print("Grapheme indices:")
for idx, cluster in unicode_segmentation_rs.grapheme_indices(text, is_extended=True):
    print(f"  {idx:3d}: {cluster!r}")
```

### Word Segmentation

```python
text = "Hello, world! How are you?"
print(f"Text: {text}")
print(f"Word bounds: {unicode_segmentation_rs.split_word_bounds(text)}")
print(f"Unicode words: {unicode_segmentation_rs.unicode_words(text)}")

# With indices
print("Word boundary indices:")
for idx, word in unicode_segmentation_rs.split_word_bound_indices(text):
    print(f"  {idx:3d}: {word!r}")
```

### Sentence Segmentation

```python
text = "Hello world. How are you? I'm fine, thanks! What about you?"
print(f"Text: {text}")
sentences = unicode_segmentation_rs.unicode_sentences(text)
print("Sentences:")
for i, sentence in enumerate(sentences, 1):
    print(f"  {i}. {sentence!r}")
```

### Multilingual Examples

```python
# Arabic
arabic = "ู…ุฑุญุจุง ุจูƒ. ูƒูŠู ุญุงู„ูƒุŸ"
print(f"Arabic: {arabic}")
print(f"Sentences: {unicode_segmentation_rs.unicode_sentences(arabic)}")

# Japanese
japanese = "ใ“ใ‚“ใซใกใฏใ€‚ใŠๅ…ƒๆฐ—ใงใ™ใ‹๏ผŸ"
print(f"Japanese: {japanese}")
print(f"Sentences: {unicode_segmentation_rs.unicode_sentences(japanese)}")

# Mixed languages
mixed = "Helloไธ–็•Œ! This is a testๆ–‡็ซ ."
print(f"Mixed: {mixed}")
print(f"Words: {unicode_segmentation_rs.unicode_words(mixed)}")
```

### Display Width Calculation

```python
examples = [
    "Hello",
    "ไธ–็•Œ",
    "Hello ไธ–็•Œ",
    "ใ“ใ‚“ใซใกใฏ",
    "๐ŸŽ‰๐ŸŽŠ",
    "Tab\there",
]

for text in examples:
    width = unicode_segmentation_rs.text_width(text)
    width_cjk = unicode_segmentation_rs.text_width_cjk(text)
    # !s formatting also handles a None width (e.g. for control characters)
    print(f"Text: {text!r:20} Width: {width!s:>4} CJK: {width_cjk!s:>4} Chars: {len(text):2}")

# Character widths
chars = ['a', 'A', '1', ' ', 'ไธ–', '็•Œ', 'ใ‚', '๐ŸŽ‰', '\t', '\n']
for c in chars:
    w = unicode_segmentation_rs.text_width(c)
    w_cjk = unicode_segmentation_rs.text_width_cjk(c)
    print(f"  {c!r:6} width: {w!s:4} cjk: {w_cjk!s:4}")
```

### Gettext PO File Wrapping

```python
# Wrap text for PO files (gettext conventionally wraps at 77 columns)
text = "This is a long translation string that needs to be wrapped appropriately for a gettext PO file"
lines = unicode_segmentation_rs.gettext_wrap(text, 77)
for i, line in enumerate(lines, 1):
    print(f"Line {i}: {line}")

# Wrapping with CJK characters
text = "This translation contains ไธญๆ–‡ๅญ—็ฌฆ (Chinese characters) and should wrap correctly"
lines = unicode_segmentation_rs.gettext_wrap(text, 40)
for line in lines:
    width = unicode_segmentation_rs.text_width(line)
    print(f"[{width:2d} cols] {line}")

# Escape sequences are preserved
text = "This has\\nline breaks\\tand tabs"
lines = unicode_segmentation_rs.gettext_wrap(text, 20)
print(lines)
```

## API Reference

### `graphemes(text: str, is_extended: bool) -> list[str]`

Split a string into grapheme clusters. Set `is_extended=True` for extended grapheme clusters (recommended).

### `grapheme_indices(text: str, is_extended: bool) -> list[tuple[int, str]]`

Split a string into grapheme clusters with their byte indices.
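
Following the underlying Rust crate, these indices are UTF-8 byte offsets, not Python character indices, so a multi-byte character advances the offset by more than one. A pure-Python sketch of mapping a byte offset back to a character index (the `byte_to_char_index` helper is illustrative, not part of this package):

```python
# Byte offsets vs. character indices: in UTF-8, 'ไธ–' and '็•Œ' are 3 bytes
# each, so byte offsets advance faster than Python character indices.
text = "ไธ–็•Œ!"
data = text.encode("utf-8")

def byte_to_char_index(byte_offset: int) -> int:
    # Decode the UTF-8 prefix up to the offset and count its characters.
    return len(data[:byte_offset].decode("utf-8"))

print(byte_to_char_index(3))  # 1  ('ไธ–' occupies bytes 0..3)
print(byte_to_char_index(6))  # 2  ('็•Œ' occupies bytes 3..6)
```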

### `split_word_bounds(text: str) -> list[str]`

Split a string at word boundaries (includes punctuation and whitespace).

### `split_word_bound_indices(text: str) -> list[tuple[int, str]]`

Split a string at word boundaries with byte indices.

### `unicode_words(text: str) -> list[str]`

Get Unicode words from a string (excludes punctuation and whitespace).

### `unicode_sentences(text: str) -> list[str]`

Split a string into sentences according to Unicode rules.

### `text_width(text: str) -> int | None`

Get the display width of a string in columns (as it would appear in a monospace terminal). East Asian wide characters take 2 columns. Returns `None` for input with no defined width, such as control characters.

### `text_width_cjk(text: str) -> int | None`

Like `text_width`, but treats East Asian "ambiguous-width" characters as wide (2 columns), matching legacy CJK terminal environments.

### `gettext_wrap(text: str, width: int) -> list[str]`

Wrap text for gettext PO files. This function follows gettext's wrapping behavior:

- Never breaks escape sequences (`\n`, `\"`, etc.)
- Prefers breaking after spaces
- Handles CJK characters with proper width calculation
- Breaks long words only when necessary
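
The rules above can be sketched in pure Python. This is a simplified greedy illustration using character counts; the actual binding implements gettext's algorithm in Rust and measures display widths, so CJK text wraps differently:

```python
import re

def naive_gettext_wrap(text: str, width: int) -> list[str]:
    """Simplified sketch (illustration only): break after spaces, never
    inside a two-character escape sequence like \\n or \\t."""
    # Tokenize into escape sequences, whitespace runs, and other runs,
    # so an escape sequence can never be split across lines.
    tokens = re.findall(r'\\.|\s+|[^\s\\]+', text)
    lines, current = [], ""
    for tok in tokens:
        if current and len(current) + len(tok) > width:
            lines.append(current.rstrip())
            current = tok.lstrip()
        else:
            current += tok
    if current:
        lines.append(current)
    return lines

print(naive_gettext_wrap("This has\\nline breaks\\tand tabs", 20))
```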

## Building for Distribution

```bash
# Build wheel
maturin build --release

# Build and publish to PyPI
maturin publish
```

## Running Tests

```bash
# Install test dependencies
pip install pytest

# Run tests
pytest tests/
```

## License

This project follows the same license as the underlying unicode-segmentation crate.