# pyuca: Python Unicode Collation Algorithm implementation
This is a Python implementation of the
[Unicode Collation Algorithm (UCA)](http://unicode.org/reports/tr10/). It
passes 100% of the UCA conformance tests for Unicode 5.2.0 (Python 2.7),
Unicode 6.3.0 (Python 3.3+), Unicode 8.0.0 (Python 3.5+), Unicode 9.0.0
(Python 3.6+), and Unicode 10.0.0 (Python 3.7+) with a variable-weighting
setting of Non-ignorable.
## What do you use it for?
In short, sorting non-English strings properly.

The core of the algorithm involves multi-level comparison. For example,
``café`` comes before ``caff`` because at the primary level, the accent is
ignored and the first word is treated as if it were ``cafe``. The secondary
level (which considers accents) then applies only to words that are equivalent
at the primary level.
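
The two-level idea can be sketched in plain Python. This is a deliberately simplified illustration, not how pyuca builds its keys: the hypothetical `toy_sort_key` uses only the standard `unicodedata` module, treats the unaccented letters as the primary level, and uses combining classes as a rough stand-in for secondary weights.

```python
import unicodedata


def toy_sort_key(word):
    """Return a (primary, secondary) key; a toy model, not the real UCA."""
    # Decompose so that an accented letter becomes base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", word)
    # Primary level: base characters only, accents dropped.
    primary = tuple(ch for ch in decomposed if not unicodedata.combining(ch))
    # Secondary level: combining classes (0 for base characters), used only
    # to break ties between words whose primary keys are equal.
    secondary = tuple(unicodedata.combining(ch) for ch in decomposed)
    return (primary, secondary)


words = ["cafe", "caff", "caf\u00e9"]  # "caf\u00e9" is "café"
print(sorted(words, key=toy_sort_key))
# ['cafe', 'café', 'caff']
```

Because Python compares tuples element by element, the secondary part is consulted only when the primary parts are identical, which mirrors the behaviour described above.
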

The Unicode Collation Algorithm and pyuca also support contraction and
expansion. **Contraction** is where multiple letters are treated as a single
unit. In traditional Spanish ordering, ``ch`` is treated as a letter coming
between ``c`` and ``d`` so that, for example, words beginning with ``ch`` sort
after all other words beginning with ``c``. **Expansion** is where a single
letter is treated as though it were multiple letters. In German phone-book
ordering, ``ä`` is sorted as if it were ``ae``, i.e. after ``ad`` but before
``af``.
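
Both ideas can be illustrated with a small sketch in plain Python. The tables and the names `contraction_key` and `expansion_key` are hypothetical and exist only for this example; pyuca itself works from a collation element table rather than from hand-written rules like these.

```python
import string

# Toy alphabet (an assumption for illustration): "ch" is its own unit,
# ranked between "c" and "d".
UNITS = list(string.ascii_lowercase)
UNITS.insert(UNITS.index("c") + 1, "ch")
RANK = {unit: position for position, unit in enumerate(UNITS)}


def contraction_key(word):
    """Map a lowercase a-z word to ranks, consuming "ch" as a single unit."""
    key = []
    i = 0
    while i < len(word):
        unit = "ch" if word.startswith("ch", i) else word[i]
        key.append(RANK[unit])
        i += len(unit)
    return key


# Toy expansion table: "\u00e4" (ä) is compared as if it were "ae".
EXPANSIONS = {"\u00e4": "ae"}


def expansion_key(word):
    """Replace each expandable letter by its multi-letter equivalent."""
    return "".join(EXPANSIONS.get(ch, ch) for ch in word)


print(sorted(["chico", "dama", "cuna", "casa"], key=contraction_key))
# ['casa', 'cuna', 'chico', 'dama']
print(sorted(["af", "\u00e4", "ad"], key=expansion_key))
# ['ad', 'ä', 'af']
```
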
## How to use it
Install the ``pyuca`` module with pip:

    pip install pyuca

Usage example:

    from pyuca import Collator
    c = Collator()

    assert sorted(["cafe", "caff", "café"]) == ["cafe", "caff", "café"]
    assert sorted(["cafe", "caff", "café"], key=c.sort_key) == ["cafe", "café", "caff"]

``Collator`` can also take an optional filename for specifying a custom
collation element table.

You can also import collators for specific Unicode versions,
e.g. `from pyuca.collator import Collator_8_0_0`.
Using just `from pyuca import Collator`, however, ensures that the collator
version matches the version of `unicodedata` provided by the standard library
for your version of Python.
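
To see which Unicode version your interpreter's `unicodedata` module reports, the following sketch may help. `unicodedata.unidata_version` is part of the standard library; `unicode_version` is a hypothetical helper written for this example and is not part of pyuca.

```python
import sys
import unicodedata


def unicode_version():
    """Return the Unicode version of the stdlib unicodedata as a tuple of ints."""
    return tuple(int(part) for part in unicodedata.unidata_version.split("."))


# Show the Python version next to the Unicode version it ships with.
print("Python", sys.version_info[:2], "provides Unicode", unicode_version())
```
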
## How to cite it
Tauber, J. K. (2016). pyuca: a Python implementation of the Unicode Collation Algorithm. The Journal of Open Source Software. DOI: 10.21105/joss.00021
## License
Python code is made available under an MIT license (see `LICENSE`).
`allkeys.txt` is made available under the similar license defined in
`LICENSE-allkeys`.
## Contacting the Developer
If you have any problems, questions or suggestions, it's best to file an issue
on GitHub, although you can also contact me at jtauber@jtauber.com.
For more of my work on linguistics and Ancient Greek, see
<http://jktauber.com/>.