
# mutf-8
This package contains simple pure-Python as well as C encoders and decoders for
the MUTF-8 character encoding. In most cases, you can also parse the even-rarer
CESU-8.
These days, you'll most likely encounter MUTF-8 when working on files or
protocols related to the JVM. Strings in a Java `.class` file are encoded using
MUTF-8, as are strings passed through the JNI and strings exported by the
object serializer.
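
To illustrate how MUTF-8 differs from standard UTF-8, here is a minimal sketch that uses only the standard library. It is not this package's implementation, and `sketch_encode_mutf8` is a hypothetical name chosen for illustration. It relies on two rules of the format: U+0000 is written as the two bytes `C0 80`, and characters above U+FFFF are written as a UTF-16 surrogate pair with each surrogate encoded as three bytes.

```python
def sketch_encode_mutf8(text: str) -> bytes:
    """Illustrative MUTF-8 encoder (hypothetical helper, not the package API)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # NUL uses the two-byte form, so the output never contains 0x00.
            out += b"\xc0\x80"
        elif cp > 0xFFFF:
            # Split the code point into a UTF-16 surrogate pair, then write
            # each surrogate as its own three-byte sequence.
            offset = cp - 0x10000
            high = 0xD800 + (offset >> 10)
            low = 0xDC00 + (offset & 0x3FF)
            for surrogate in (high, low):
                out += chr(surrogate).encode("utf-8", "surrogatepass")
        else:
            # Everything else matches standard UTF-8.
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)


print(sketch_encode_mutf8("\x00").hex())        # c080
print(sketch_encode_mutf8("\U00020000").hex())  # eda180edb080
```

For comparison, standard UTF-8 encodes U+20000 as the four bytes `F0 A0 80 80`, while the sketch above produces six bytes.
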
This library was extracted from [Lawu][], a Python library for working with JVM
class files.
## 🎉 Installation
Install the package from PyPI:
```
pip install mutf8
```
Binary wheels are available for the following:
| | py3.6 | py3.7 | py3.8 | py3.9 |
| ---------------- | ----- | ----- | ----- | ----- |
| OS X (x86_64) | y | y | y | y |
| Windows (x86_64) | y | y | y | y |
| Linux (x86_64) | y | y | y | y |
If no binary wheel is available for your platform, installation will attempt to
build the C extension from source with any C99 compiler. If the build fails,
the package falls back to the pure-Python version.
## Usage
Encoding and decoding are simple:
```python
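from mutf8 import encode_modified_utf8, decode_modified_utf8

# Any bytes-like object works; names avoid shadowing the built-in `bytes`.
byte_like_object = b'\xED\xA1\x80\xED\xB0\x80'
text = decode_modified_utf8(byte_like_object)
data = encode_modified_utf8(text)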
```
This module *does not* register itself globally as a codec, since importing
should be side-effect-free.
## 📈 Benchmarks
The C extension is significantly faster than the pure-Python version, often by 20x to 40x.
<!-- BENCHMARK START -->
### MUTF-8 Decoding
| Name | Min (μs) | Max (μs) | StdDev | Ops |
|------------------------------|------------|------------|----------|---------------|
| cmutf8-decode_modified_utf8 | 0.00009 | 0.00080 | 0.00000 | 9957678.56358 |
| pymutf8-decode_modified_utf8 | 0.00190 | 0.06040 | 0.00000 | 450455.96019 |
### MUTF-8 Encoding
| Name | Min (μs) | Max (μs) | StdDev | Ops |
|------------------------------|------------|------------|----------|----------------|
| cmutf8-encode_modified_utf8 | 0.00008 | 0.00151 | 0.00000 | 11897361.05101 |
| pymutf8-encode_modified_utf8 | 0.00180 | 0.16650 | 0.00000 | 474390.98091 |
<!-- BENCHMARK END -->
## C Extension
The C extension is optional. If a binary package is not available, or a C
compiler is not present, the pure-Python version will be used instead. If you
want to ensure you're using the C version, import it directly:
```python
from mutf8.cmutf8 import decode_modified_utf8
decode_modified_utf8(b'\xED\xA1\x80\xED\xB0\x80')
```
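
If you would rather detect the C extension without raising on import, a sketch along these lines may help. It assumes only the `mutf8.cmutf8` module path used in the import above. `has_c_extension` is a hypothetical helper and not part of the package.

```python
import importlib.util


def has_c_extension() -> bool:
    """Return True if the mutf8.cmutf8 module can be located."""
    try:
        return importlib.util.find_spec("mutf8.cmutf8") is not None
    except ModuleNotFoundError:
        # The mutf8 package itself is not installed.
        return False


print("C extension available:", has_c_extension())
```

Note that locating the module does not guarantee it will load cleanly, so a direct import remains the stricter check.
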
[Lawu]: https://github.com/tktech/lawu