# unicode-segmentation-rs
Python bindings for the Rust [unicode-segmentation](https://docs.rs/unicode-segmentation/) and [unicode-width](https://docs.rs/unicode-width/) crates, providing Unicode-correct text segmentation (UAX #29) and display-width calculation (UAX #11).
## Features
- **Grapheme Cluster Segmentation**: Split text into user-perceived characters
- **Word Segmentation**: Split text into words according to Unicode rules
- **Sentence Segmentation**: Split text into sentences
- **Display Width Calculation**: Get the display width of text (for terminal/monospace display)
- **Gettext PO Wrapping**: Wrap text for gettext PO files with proper handling of escape sequences and CJK characters
## Installation
### From PyPI
```bash
uv pip install unicode-segmentation-rs
```
### From source
```bash
# Install maturin
pip install maturin
# Build and install the package
maturin develop --release
```
## Usage
```python
import unicode_segmentation_rs
# Grapheme clusters (user-perceived characters)
text = "Hello 👨‍👩‍👧‍👦 World"
clusters = unicode_segmentation_rs.graphemes(text, is_extended=True)
print(clusters)  # ['H', 'e', 'l', 'l', 'o', ' ', '👨‍👩‍👧‍👦', ' ', 'W', 'o', 'r', 'l', 'd']
# Get grapheme clusters with their byte indices
indices = unicode_segmentation_rs.grapheme_indices(text, is_extended=True)
print(indices)  # [(0, 'H'), (1, 'e'), ...]
# Word boundaries (includes punctuation and whitespace)
text = "Hello, world!"
words = unicode_segmentation_rs.split_word_bounds(text)
print(words)  # ['Hello', ',', ' ', 'world', '!']
# Unicode words (excludes punctuation and whitespace)
words = unicode_segmentation_rs.unicode_words(text)
print(words)  # ['Hello', 'world']
# Word indices
indices = unicode_segmentation_rs.split_word_bound_indices(text)
print(indices)  # [(0, 'Hello'), (5, ','), ...]
# Sentence segmentation
text = "Hello world. How are you? I'm fine."
sentences = unicode_segmentation_rs.unicode_sentences(text)
print(sentences)  # ['Hello world. ', 'How are you? ', "I'm fine."]
# Display width calculation
text = "Hello 世界"
width = unicode_segmentation_rs.text_width(text)
print(width)  # 10 (Hello=5, space=1, 世=2, 界=2)
# Character width
print(unicode_segmentation_rs.text_width('A'))   # 1
print(unicode_segmentation_rs.text_width('世'))  # 2
print(unicode_segmentation_rs.text_width('\t'))  # None (control character)
```
## Examples
### Grapheme Cluster Segmentation
```python
import unicode_segmentation_rs
# Complex emojis and combining characters
text = "Hello 👨‍👩‍👧‍👦 नमस्ते"
print(f"Text: {text}")
print(f"Graphemes: {unicode_segmentation_rs.graphemes(text, is_extended=True)}")
print(f"Length (graphemes): {len(unicode_segmentation_rs.graphemes(text, is_extended=True))}")
print(f"Length (chars): {len(text)}")
# With indices
print("Grapheme indices:")
for idx, cluster in unicode_segmentation_rs.grapheme_indices(text, is_extended=True):
    print(f"  {idx:3d}: {cluster!r}")
```
### Word Segmentation
```python
text = "Hello, world! How are you?"
print(f"Text: {text}")
print(f"Word bounds: {unicode_segmentation_rs.split_word_bounds(text)}")
print(f"Unicode words: {unicode_segmentation_rs.unicode_words(text)}")
# With indices
print("Word boundary indices:")
for idx, word in unicode_segmentation_rs.split_word_bound_indices(text):
    print(f"  {idx:3d}: {word!r}")
```
### Sentence Segmentation
```python
text = "Hello world. How are you? I'm fine, thanks! What about you?"
print(f"Text: {text}")
sentences = unicode_segmentation_rs.unicode_sentences(text)
print("Sentences:")
for i, sentence in enumerate(sentences, 1):
    print(f"  {i}. {sentence!r}")
```
### Multilingual Examples
```python
# Arabic
arabic = "مرحبا بك. كيف حالك؟"
print(f"Arabic: {arabic}")
print(f"Sentences: {unicode_segmentation_rs.unicode_sentences(arabic)}")
# Japanese
japanese = "こんにちは。お元気ですか？"
print(f"Japanese: {japanese}")
print(f"Sentences: {unicode_segmentation_rs.unicode_sentences(japanese)}")
# Mixed languages
mixed = "Hello世界! This is a test文章."
print(f"Mixed: {mixed}")
print(f"Words: {unicode_segmentation_rs.unicode_words(mixed)}")
```
### Display Width Calculation
```python
examples = [
    "Hello",
    "世界",
    "Hello 世界",
    "こんにちは",
    "😀😀",
    "Tab\there",
]
for text in examples:
    width = unicode_segmentation_rs.text_width(text)
    width_cjk = unicode_segmentation_rs.text_width_cjk(text)
    print(f"Text: {text!r:20} Width: {str(width):4} CJK: {str(width_cjk):4} Chars: {len(text):2}")
# Character widths
chars = ['a', 'A', '1', ' ', '世', '界', 'あ', '😀', '\t', '\n']
for c in chars:
    w = unicode_segmentation_rs.text_width(c)
    w_cjk = unicode_segmentation_rs.text_width_cjk(c)
    w_str = str(w) if w is not None else "None"
    w_cjk_str = str(w_cjk) if w_cjk is not None else "None"
    print(f"  {c!r:6} width: {w_str:4} cjk: {w_cjk_str:4}")
```
### Gettext PO File Wrapping
```python
# Wrap text for PO files (default width is 77 characters)
text = "This is a long translation string that needs to be wrapped appropriately for a gettext PO file"
lines = unicode_segmentation_rs.gettext_wrap(text, 77)
for i, line in enumerate(lines, 1):
    print(f"Line {i}: {line}")
# Wrapping with CJK characters
text = "This translation contains 中文字符 (Chinese characters) and should wrap correctly"
lines = unicode_segmentation_rs.gettext_wrap(text, 40)
for line in lines:
    width = unicode_segmentation_rs.text_width(line)
    print(f"[{width:2d} cols] {line}")
# Escape sequences are preserved
text = "This has\\nline breaks\\tand tabs"
lines = unicode_segmentation_rs.gettext_wrap(text, 20)
print(lines)
```
## API Reference
### `graphemes(text: str, is_extended: bool) -> list[str]`
Split a string into grapheme clusters. Set `is_extended=True` for extended grapheme clusters (recommended).
### `grapheme_indices(text: str, is_extended: bool) -> list[tuple[int, str]]`
Split a string into grapheme clusters with their byte indices.
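Note that the returned indices are byte offsets into the string's UTF-8 encoding, not Python character indices; the two diverge as soon as non-ASCII text appears. A quick stdlib-only illustration of the relationship (independent of this package):

```python
# Byte offsets into the UTF-8 encoding differ from Python's character
# indices once multi-byte characters appear: 'é' is 2 bytes, '漢' is 3.
text = "aé漢"
for char_index, ch in enumerate(text):
    byte_index = len(text[:char_index].encode("utf-8"))
    print(f"char index {char_index} -> byte index {byte_index}: {ch!r}")
```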
### `split_word_bounds(text: str) -> list[str]`
Split a string at word boundaries (includes punctuation and whitespace).
### `split_word_bound_indices(text: str) -> list[tuple[int, str]]`
Split a string at word boundaries with byte indices.
### `unicode_words(text: str) -> list[str]`
Get Unicode words from a string (excludes punctuation and whitespace).
### `unicode_sentences(text: str) -> list[str]`
Split a string into sentences according to Unicode rules.
### `text_width(text: str) -> int | None`
Get the display width of a string in columns (as it would appear in a terminal). East Asian wide characters take 2 columns; `None` is returned for control characters such as `\t`.
### `text_width_cjk(text: str) -> int | None`
Like `text_width`, but treats characters of ambiguous East Asian width as wide (2 columns), matching East Asian legacy environments.
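As a rough sketch of the underlying idea (not the crate's actual algorithm), width can be approximated with the stdlib `unicodedata` module: characters whose East Asian Width property is Fullwidth or Wide count as two columns, everything else as one.

```python
import unicodedata

def approx_width(text: str) -> int:
    # Count fullwidth (F) and wide (W) characters as two columns, everything
    # else as one. The real unicode-width crate also handles zero-width
    # combining marks and control characters, which this sketch ignores.
    return sum(2 if unicodedata.east_asian_width(ch) in ("F", "W") else 1
               for ch in text)

print(approx_width("Hello 世界"))  # 10
```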
### `gettext_wrap(text: str, width: int) -> list[str]`
Wrap text for gettext PO files. This function follows gettext's wrapping behavior:
- Never breaks escape sequences (`\n`, `\"`, etc.)
- Prefers breaking after spaces
- Handles CJK characters with proper width calculation
- Breaks long words only when necessary
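The space-preferring strategy above can be sketched in pure Python. This is a simplified greedy wrapper that ignores escape sequences and character widths, not the function's actual implementation:

```python
def greedy_wrap(text: str, width: int) -> list[str]:
    # Pack words onto the current line while they fit; break after spaces;
    # an over-long word is emitted on its own line rather than split.
    lines, current = [], ""
    for word in text.split(" "):
        candidate = word if not current else current + " " + word
        if len(candidate) <= width or not current:
            current = candidate
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines

print(greedy_wrap("a long string of short words", 10))
# ['a long', 'string of', 'short', 'words']
```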
## Building for Distribution
```bash
# Build wheel
maturin build --release
# Build and publish to PyPI
maturin publish
```
## Running Tests
```bash
# Install test dependencies
pip install pytest
# Run tests
pytest tests/
```
## License
This project follows the same license as the underlying unicode-segmentation crate.