File: CHANGELOG.md

# Changelog
All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.13.2] 

- [#1096] Python 3.11 support

## [0.13.1] 

- [#1072] Fixing Roberta type ids.

## [0.13.0] 

- [#956] PyO3 version upgrade
- [#1055] M1 automated builds
- [#1008] `Decoder` is now a composable trait, while remaining backward compatible
- [#1047, #1051, #1052] `Processor` is now a composable trait, while remaining backward compatible

Both trait changes warrant a "major" version bump since, despite best efforts to preserve backward
  compatibility, the code is different enough that we cannot be entirely sure nothing breaks.

## [0.12.1] 

- [#938] **Reverted breaking change**. https://github.com/huggingface/transformers/issues/16520

## [0.12.0] YANKED

Bump minor version because of a breaking change.

- [#938] [REVERTED IN 0.12.1] **Breaking change**. The `Decoder` trait is modified to be composable. This is only breaking if you are using decoders on their own; regular use of `tokenizers` should be unaffected.
- [#939] Making the regex in `ByteLevel` pre_tokenizer optional (necessary for BigScience)

- [#952] Fixed the vocabulary size of UnigramTrainer output (to respect added tokens)
- [#954] Fixed not being able to save vocabularies with holes in the vocab (ConvBert). Emit warnings instead of panicking.
- [#962] Fix tests for python 3.10
- [#961] Added link for Ruby port of `tokenizers`

## [0.11.6]

- [#919] Fixing single_word AddedToken. (regression from 0.11.2)
- [#916] Faster deserialization of `added_tokens` by loading them in batch.

## [0.11.5]

- [#895] Build `python 3.10` wheels.

## [0.11.4]

- [#884] Fixing bad deserialization following inclusion of a default for Punctuation

## [0.11.3]

- [#882] Fixing Punctuation deserialize without argument.
- [#868] Fixing missing direction in TruncationParams
- [#860] Adding TruncationSide to TruncationParams

## [0.11.0]

### Fixed

- [#585] Conda version should now work on old CentOS
- [#844] Fixing interaction between `is_pretokenized` and `trim_offsets`.
- [#851] Doc links

### Added
- [#657]: Add SplitDelimiterBehavior customization to Punctuation constructor
- [#845]: Documentation for `Decoders`.

### Changed
- [#850]: Added a feature gate to enable disabling `http` features
- [#718]: Fix `WordLevel` tokenizer determinism during training
- [#762]: Add a way to specify the unknown token in `SentencePieceUnigramTokenizer`
- [#770]: Improved documentation for `UnigramTrainer`
- [#780]: Add `Tokenizer.from_pretrained` to load tokenizers from the Hugging Face Hub (see the sketch after this list)
- [#793]: Saving a pretty JSON file by default when saving a tokenizer
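
A minimal sketch of loading from the Hub with [#780]; the `bert-base-uncased` identifier is only an example, and network access is assumed:

```
from tokenizers import Tokenizer

# Load a tokenizer.json published on the Hugging Face Hub.
# "bert-base-uncased" is just an example repository id.
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.encode("Hello, world!").tokens)
```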

## [0.10.3]

### Fixed
- [#686]: Fix SPM conversion process for whitespace deduplication
- [#707]: Fix stripping strings containing Unicode characters

### Added
- [#693]: Add a CTC Decoder for Wav2Vec2 models

### Removed
- [#714]: Removed support for Python 3.5

## [0.10.2]

### Fixed
- [#652]: Fix offsets for `Precompiled` corner case
- [#656]: Fix BPE `continuing_subword_prefix`
- [#674]: Fix `Metaspace` serialization problems

## [0.10.1]

### Fixed
- [#616]: Fix SentencePiece tokenizers conversion
- [#617]: Fix offsets produced by Precompiled Normalizer (used by tokenizers converted from SPM)
- [#618]: Fix Normalizer.normalize with `PyNormalizedStringRefMut`
- [#620]: Fix serialization/deserialization for overlapping models
- [#621]: Fix `ByteLevel` instantiation from a previously saved state (using `__getstate__()`)

## [0.10.0]

### Added
- [#508]: Add a Visualizer for notebooks to help understand how the tokenizers work
- [#519]: Add a `WordLevelTrainer` used to train a `WordLevel` model
- [#533]: Add support for conda builds
- [#542]: Add Split pre-tokenizer to easily split using a pattern
- [#544]: Ability to train from memory. This also improves the integration with `datasets` (see the sketch after this list)
- [#590]: Add getters/setters for components on BaseTokenizer
- [#574]: Add `fuse_unk` option to SentencePieceBPETokenizer
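
A rough sketch of training from memory ([#544]); the two short strings stand in for a real corpus, and the empty `BPE` model is only an example:

```
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# Train directly from an in-memory iterator instead of files on disk.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(special_tokens=["[UNK]"])
tokenizer.train_from_iterator(["hello world", "training from memory"], trainer=trainer)
print(tokenizer.get_vocab_size())
```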

### Changed
- [#509]: Automatically stubbing the `.pyi` files
- [#519]: Each `Model` can return its associated `Trainer` with `get_trainer()`
- [#530]: The various attributes of each component can now be get/set (e.g.
`tokenizer.model.dropout = 0.1`)
- [#538]: The API Reference has been improved and is now up-to-date.

### Fixed
- [#519]: During training, the `Model` is now trained in-place. This fixes several bugs that
forced reloading the `Model` after training.
- [#539]: Fix `BaseTokenizer` enable_truncation docstring

## [0.9.4]

### Fixed
- [#492]: Fix `from_file` on `BertWordPieceTokenizer`
- [#498]: Fix the link to download `sentencepiece_model_pb2.py`
- [#500]: Fix a typo in the docs quicktour

### Changed
- [#506]: Improve Encoding mappings for pairs of sequences

## [0.9.3]

### Fixed
- [#470]: Fix hanging error when training with custom component
- [#476]: TemplateProcessing serialization is now deterministic
- [#481]: Fix SentencePieceBPETokenizer.from_files

### Added
- [#477]: UnicodeScripts PreTokenizer to avoid merges between various scripts
- [#480]: Unigram now accepts an `initial_alphabet` and handles `special_tokens` correctly

## [0.9.2]

### Fixed
- [#464]: Fix a problem with RobertaProcessing being deserialized as BertProcessing

## [0.9.1]

### Fixed
- [#459]: Fix a problem with deserialization

## [0.9.0]

### Fixed
- [#362]: Fix training deadlock with Python components.
- [#363]: Fix a crash when calling `.train` with some non-existent files
- [#355]: Remove a lot of possible crashes
- [#389]: Improve truncation (crash and consistency)

### Added
- [#379]: Add the ability to call `encode`/`encode_batch` with numpy arrays
- [#292]: Support for the Unigram algorithm
- [#378], [#394], [#416], [#417]: Many new Normalizers and PreTokenizers
- [#403]: Add `TemplateProcessing` `PostProcessor` (see the sketch after this list).
- [#420]: Ability to fuse the "unk" token in BPE.
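
A hedged sketch of a `TemplateProcessing` post-processor ([#403]); the ids given for `[CLS]`/`[SEP]` are placeholders for whatever ids those tokens have in a real vocabulary:

```
from tokenizers.processors import TemplateProcessing

# Wrap single sequences and pairs with [CLS]/[SEP], assigning type id 1 to the second sequence.
processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
)
# Typically attached with: tokenizer.post_processor = processor
```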

### Changed
- [#360]: Lots of improvements related to words/alignment tracking
- [#426]: Improvements on error messages thanks to PyO3 0.12

## [0.8.1]

### Fixed
- [#333]: Fix deserialization of `AddedToken`, where the content was not restored properly

### Changed
- [#329]: Improved warning and behavior when we detect a fork
- [#330]: BertNormalizer now keeps the same behavior as the original implementation when
`strip_accents` is not specified.

## [0.8.0]

### Highlights of this release
- We can now encode both pre-tokenized inputs and raw strings. This is especially useful when
processing datasets that are already pre-tokenized, as in NER (Named Entity Recognition), and helps
when applying labels to each word.
- Full tokenizer serialization. It is now easy to save a tokenizer to a single JSON file and later
load it back with just one line of code. That's what sharing a Tokenizer means now: 1 line of code
(see the sketch after this list).
- With the serialization comes the compatibility with `Pickle`! The Tokenizer, all of its components,
Encodings, everything can be pickled!
- Training a tokenizer is now even faster (up to 5-10x) than before!
- Compatibility with `multiprocessing`, even when using the `fork` start method. Since this library
makes heavy use of multithreading to provide very fast tokenization, this led to problems (deadlocks)
when used with `multiprocessing`. This version now allows disabling the parallelism, and will warn
you when that is necessary.
- And a lot of other improvements and fixes.
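
A minimal sketch of the serialization and pickling highlights; the empty `BPE()` model and the `tokenizer.json` file name are placeholders:

```
import pickle

from tokenizers import Tokenizer
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE())
tokenizer.save("tokenizer.json")                  # the whole pipeline in a single JSON file
reloaded = Tokenizer.from_file("tokenizer.json")  # loading back: one line of code
blob = pickle.dumps(tokenizer)                    # the Tokenizer is picklable too
```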

### Fixed
- [#286]: Fix various crashes when training a BPE model
- [#309]: Fixed a few bugs related to additional vocabulary/tokens

### Added
- [#272]: Serialization of the `Tokenizer` and all the parts (`PreTokenizer`, `Normalizer`, ...).
This adds some methods to easily save/load an entire tokenizer (`from_str`, `from_file`).
- [#273]: `Tokenizer` and its parts are now picklable
- [#289]: Ability to pad to a multiple of a specified value. This is especially useful to ensure
the activation of Tensor Cores, by padding to a multiple of 8. Use
`enable_padding(pad_to_multiple_of=8)` for example (see the sketch after this list).
- [#298]: Ability to get the currently set truncation/padding params
- [#311]: Ability to enable/disable the parallelism using the `TOKENIZERS_PARALLELISM` environment
variable. This is especially useful when using `multiprocessing`, with the `fork`
start method, which happens to be the default on Linux systems. Without disabling the parallelism,
the process dead-locks while encoding. (Cf [#187] for more information)
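
A short sketch of [#289] and [#311] together; the empty `BPE()` model is only a stand-in:

```
import os

from tokenizers import Tokenizer
from tokenizers.models import BPE

# #311: disable Rust-side parallelism before forking (e.g. with `multiprocessing`).
os.environ["TOKENIZERS_PARALLELISM"] = "false"

tokenizer = Tokenizer(BPE())
# #289: pad every encoded sequence up to a multiple of 8 (handy for Tensor Cores).
tokenizer.enable_padding(pad_to_multiple_of=8)
```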

### Changed
- Improved errors generated during truncation: cases where the provided max length is too low are
now handled properly.
- [#249] `encode` and `encode_batch` now accept pre-tokenized inputs. When the input is pre-tokenized,
the argument `is_pretokenized=True` must be specified (see the sketch after this list).
- [#276]: Improve BPE training speed by reading files sequentially, but parallelizing the
processing of each file
- [#280]: Use `onig` for byte-level pre-tokenization to remove all the differences with the original
implementation from GPT-2
- [#309]: Improved the management of the additional vocabulary. This introduces an option
`normalized`, controlling whether a token should be extracted from the normalized version of the
input text.
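
A minimal sketch of encoding pre-tokenized input ([#249]); the tiny `WordLevel` vocabulary is only there to keep the example self-contained:

```
from tokenizers import Tokenizer
from tokenizers.models import WordLevel

tokenizer = Tokenizer(WordLevel({"Hello": 0, "world": 1, "[UNK]": 2}, unk_token="[UNK]"))

# The input is already split into words, so pass is_pretokenized=True.
encoding = tokenizer.encode(["Hello", "world"], is_pretokenized=True)
print(encoding.tokens)  # ['Hello', 'world']
```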

## [0.7.0]

### Changed
- Only one progress bar while reading files during training. This is better for use-cases with
a high number of files, as it avoids having too many progress bars on screen. It also avoids reading
the size of each file before actually starting to read them, as this could take a really long time.
- [#193]: `encode` and `encode_batch` now take a new optional argument, specifying whether we
should add the special tokens. This is activated by default.
- [#197]: `original_str` and `normalized_str` have been removed from the `Encoding` returned by
`encode` and `encode_batch`. This brings a reduction of 70% of the memory footprint.
- [#197]: The offsets provided on `Encoding` are now relative to the original string, and not the
normalized one anymore.
- The added token given to `add_special_tokens` or `add_tokens` on a `Tokenizer`, or while using
`train(special_tokens=...)` can now be instances of `AddedToken` to provide more control over these
tokens.
- [#136]: Updated Pyo3 version
- [#136]: Static methods `Model.from_files` and `Model.empty` are removed in favor of using
constructors.
- [#239]: `CharBPETokenizer` now corresponds to OpenAI GPT BPE implementation by default.

### Added
- [#188]: `ByteLevel` is also a `PostProcessor` now and handles trimming the offsets if activated.
This avoids the unintuitive inclusion of the whitespaces in the produced offsets, even if these
whitespaces are part of the actual token.
It has been added to `ByteLevelBPETokenizer` but it is off by default (`trim_offsets=False`).
- [#236]: `RobertaProcessing` also handles trimming the offsets.
- [#234]: New alignment mappings on the `Encoding`. Provide methods to easily convert between `char`
or `word` (input space) and `token` (output space) (see the sketch after this list).
- `post_process` can be called on the `Tokenizer`
- [#208]: Ability to retrieve the vocabulary from the `Tokenizer` with
`get_vocab(with_added_tokens: bool)`
- [#136] Models can now be instantiated through object constructors.
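
A rough sketch of the alignment helpers ([#234]) and `get_vocab` ([#208]); the tiny vocabulary and the `Whitespace` pre-tokenizer are only there to make the example self-contained:

```
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(WordLevel({"hello": 0, "world": 1, "[UNK]": 2}, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

encoding = tokenizer.encode("hello world")
print(encoding.token_to_word(1))   # token index -> word index in the input
print(encoding.char_to_token(6))   # character position -> token index
print(tokenizer.get_vocab(with_added_tokens=True))
```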

### Fixed
- [#193]: Fix some issues with the offsets being wrong with the `ByteLevel` BPE:
	- when `add_prefix_space=True`
	- [#156]: when a Unicode character gets split up into multiple byte-level characters
- Fix a bug where offsets were wrong when there were any added tokens in the sequence being encoded.
- [#175]: Fix a bug that prevented the addition of more than a certain number of tokens (adding that
many is not advised, but it should still work).
- [#205]: Trim the decoded string in `BPEDecoder` used by `CharBPETokenizer`

### How to migrate
- Add the `ByteLevel` `PostProcessor` to your byte-level BPE tokenizers if relevant. If you are
using `ByteLevelBPETokenizer`, this option is disabled by default (`trim_offsets=False`).
- `BertWordPieceTokenizer` option to `add_special_tokens` must now be given to `encode` or
`encode_batch`
- Access to the `original_str` on the `Encoding` has been removed. The original string is the input
of `encode` so it didn't make sense to keep it here.
- No need to call `original_str.offsets(offsets[N])` to convert offsets to the original string. They
are now relative to the original string by default.
- Access to the `normalized_str` on the `Encoding` has been removed. Can be retrieved by calling
`normalize(sequence)` on the `Tokenizer`
- Change `Model.from_files` and `Model.empty` to use constructors. The model constructor should take
the same arguments as the old methods (i.e. `BPE(vocab, merges)` or `BPE()`)
- If you were using the `CharBPETokenizer` and want to keep the same behavior as before, set
`bert_normalizer=False` and `split_on_whitespace_only=True`.

## [0.6.0]

### Changed
- [#165]: Big improvements in speed for BPE (Both training and tokenization)

### Fixed
- [#160]: Some default tokens were missing from `BertWordPieceTokenizer`
- [#156]: There was a bug in the ByteLevel PreTokenizer that caused offsets to be wrong if a char got
split up into multiple bytes.
- [#174]: The `longest_first` truncation strategy had a bug

## [0.5.2]
- [#163]: Do not open all files directly while training

### Fixed
- We introduced a bug related to the saving of the WordPiece model in 0.5.1: The `vocab.txt` file
was named `vocab.json`. This is now fixed.
- The `WordLevel` model was also saving its vocabulary to the wrong format.

## [0.5.1]

### Changed
- `name` argument is now optional when saving a `Model`'s vocabulary. When the name is not
specified, the files get a more generic naming, like `vocab.json` or `merges.txt`.

## [0.5.0]

### Changed
- [#145]: `BertWordPieceTokenizer` now cleans up some tokenization artifacts while decoding
- [#149]: `ByteLevelBPETokenizer` now has `dropout`.
- `do_lowercase` has been changed to `lowercase` for consistency between the different tokenizers.
(Especially `ByteLevelBPETokenizer` and `CharBPETokenizer`)
- [#139]: Expose `__len__` on `Encoding`
- Improved padding performance.

### Added
- Added a new `Strip` normalizer

### Fixed
- [#145]: Decoding was buggy on `BertWordPieceTokenizer`.
- [#152]: Some documentation and examples were still using the old `BPETokenizer`

### How to migrate
- Use `lowercase` when initializing `ByteLevelBPETokenizer` or `CharBPETokenizer` instead of
`do_lowercase`.
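
A minimal sketch of the rename, using `ByteLevelBPETokenizer` as an example:

```
from tokenizers import ByteLevelBPETokenizer

# Before: ByteLevelBPETokenizer(do_lowercase=True)
# After:
tokenizer = ByteLevelBPETokenizer(lowercase=True)
```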

## [0.4.2]

### Fixed
- [#137]: Fix a bug in the class `WordPieceTrainer` that prevented `BertWordPieceTokenizer` from
being trained.

## [0.4.1]

### Fixed
- [#134]: Fix a bug related to the punctuation in BertWordPieceTokenizer

## [0.4.0]

### Changed
- [#131]: Replaced all `.new()` class methods with a proper `__new__` implementation
- Improved typings

### How to migrate
- Remove all `.new` calls on all class instantiations

## [0.3.0]

### Changed
- `BPETokenizer` has been renamed to `CharBPETokenizer` for clarity.
- Improve truncation/padding and the handling of overflowing tokens. Now when a sequence gets
truncated, we provide a list of overflowing `Encoding`s that are ready to be processed by a language
model, just like the main `Encoding`.
- Provide mapping to the original string offsets using:
```
output = tokenizer.encode(...)
print(output.original_str.offsets(output.offsets[3]))
```
- [#99]: Exposed the vocabulary size on all tokenizers

### Added
- Added `CharDelimiterSplit`: a new `PreTokenizer` that allows splitting sequences on the given
delimiter (Works like `.split(delimiter)`)
- Added `WordLevel`: a new model that simply maps `tokens` to their `ids`.

### Fixed
- Fix a bug with IndexableString
- Fix a bug with truncation

### How to migrate
- Rename `BPETokenizer` to `CharBPETokenizer`
- `Encoding.overflowing` is now a List instead of an `Optional[Encoding]`

## [0.2.1]

### Fixed
- Fix a bug with the IDs associated with added tokens.
- Fix a bug that was causing crashes in Python 3.5

[#1096]: https://github.com/huggingface/tokenizers/pull/1096
[#1072]: https://github.com/huggingface/tokenizers/pull/1072
[#956]: https://github.com/huggingface/tokenizers/pull/956
[#1008]: https://github.com/huggingface/tokenizers/pull/1008
[#1009]: https://github.com/huggingface/tokenizers/pull/1009
[#1047]: https://github.com/huggingface/tokenizers/pull/1047
[#1055]: https://github.com/huggingface/tokenizers/pull/1055
[#1051]: https://github.com/huggingface/tokenizers/pull/1051
[#1052]: https://github.com/huggingface/tokenizers/pull/1052
[#938]: https://github.com/huggingface/tokenizers/pull/938
[#939]: https://github.com/huggingface/tokenizers/pull/939
[#952]: https://github.com/huggingface/tokenizers/pull/952
[#954]: https://github.com/huggingface/tokenizers/pull/954
[#962]: https://github.com/huggingface/tokenizers/pull/962
[#961]: https://github.com/huggingface/tokenizers/pull/961
[#960]: https://github.com/huggingface/tokenizers/pull/960
[#919]: https://github.com/huggingface/tokenizers/pull/919
[#916]: https://github.com/huggingface/tokenizers/pull/916
[#895]: https://github.com/huggingface/tokenizers/pull/895
[#884]: https://github.com/huggingface/tokenizers/pull/884
[#882]: https://github.com/huggingface/tokenizers/pull/882
[#868]: https://github.com/huggingface/tokenizers/pull/868
[#860]: https://github.com/huggingface/tokenizers/pull/860
[#850]: https://github.com/huggingface/tokenizers/pull/850
[#844]: https://github.com/huggingface/tokenizers/pull/844
[#845]: https://github.com/huggingface/tokenizers/pull/845
[#851]: https://github.com/huggingface/tokenizers/pull/851
[#585]: https://github.com/huggingface/tokenizers/pull/585
[#793]: https://github.com/huggingface/tokenizers/pull/793
[#780]: https://github.com/huggingface/tokenizers/pull/780
[#770]: https://github.com/huggingface/tokenizers/pull/770
[#762]: https://github.com/huggingface/tokenizers/pull/762
[#718]: https://github.com/huggingface/tokenizers/pull/718
[#714]: https://github.com/huggingface/tokenizers/pull/714
[#707]: https://github.com/huggingface/tokenizers/pull/707
[#693]: https://github.com/huggingface/tokenizers/pull/693
[#686]: https://github.com/huggingface/tokenizers/pull/686
[#674]: https://github.com/huggingface/tokenizers/pull/674
[#657]: https://github.com/huggingface/tokenizers/pull/657
[#656]: https://github.com/huggingface/tokenizers/pull/656
[#652]: https://github.com/huggingface/tokenizers/pull/652
[#621]: https://github.com/huggingface/tokenizers/pull/621
[#620]: https://github.com/huggingface/tokenizers/pull/620
[#618]: https://github.com/huggingface/tokenizers/pull/618
[#617]: https://github.com/huggingface/tokenizers/pull/617
[#616]: https://github.com/huggingface/tokenizers/pull/616
[#590]: https://github.com/huggingface/tokenizers/pull/590
[#574]: https://github.com/huggingface/tokenizers/pull/574
[#544]: https://github.com/huggingface/tokenizers/pull/544
[#542]: https://github.com/huggingface/tokenizers/pull/542
[#539]: https://github.com/huggingface/tokenizers/pull/539
[#538]: https://github.com/huggingface/tokenizers/pull/538
[#533]: https://github.com/huggingface/tokenizers/pull/533
[#530]: https://github.com/huggingface/tokenizers/pull/530
[#519]: https://github.com/huggingface/tokenizers/pull/519
[#509]: https://github.com/huggingface/tokenizers/pull/509
[#508]: https://github.com/huggingface/tokenizers/pull/508
[#506]: https://github.com/huggingface/tokenizers/pull/506
[#500]: https://github.com/huggingface/tokenizers/pull/500
[#498]: https://github.com/huggingface/tokenizers/pull/498
[#492]: https://github.com/huggingface/tokenizers/pull/492
[#481]: https://github.com/huggingface/tokenizers/pull/481
[#480]: https://github.com/huggingface/tokenizers/pull/480
[#477]: https://github.com/huggingface/tokenizers/pull/477
[#476]: https://github.com/huggingface/tokenizers/pull/476
[#470]: https://github.com/huggingface/tokenizers/pull/470
[#464]: https://github.com/huggingface/tokenizers/pull/464
[#459]: https://github.com/huggingface/tokenizers/pull/459
[#426]: https://github.com/huggingface/tokenizers/pull/426
[#420]: https://github.com/huggingface/tokenizers/pull/420
[#417]: https://github.com/huggingface/tokenizers/pull/417
[#416]: https://github.com/huggingface/tokenizers/pull/416
[#403]: https://github.com/huggingface/tokenizers/pull/403
[#394]: https://github.com/huggingface/tokenizers/pull/394
[#389]: https://github.com/huggingface/tokenizers/pull/389
[#379]: https://github.com/huggingface/tokenizers/pull/379
[#378]: https://github.com/huggingface/tokenizers/pull/378
[#363]: https://github.com/huggingface/tokenizers/pull/363
[#362]: https://github.com/huggingface/tokenizers/pull/362
[#360]: https://github.com/huggingface/tokenizers/pull/360
[#355]: https://github.com/huggingface/tokenizers/pull/355
[#333]: https://github.com/huggingface/tokenizers/pull/333
[#330]: https://github.com/huggingface/tokenizers/pull/330
[#329]: https://github.com/huggingface/tokenizers/pull/329
[#311]: https://github.com/huggingface/tokenizers/pull/311
[#309]: https://github.com/huggingface/tokenizers/pull/309
[#292]: https://github.com/huggingface/tokenizers/pull/292
[#298]: https://github.com/huggingface/tokenizers/pull/298
[#289]: https://github.com/huggingface/tokenizers/pull/289
[#286]: https://github.com/huggingface/tokenizers/pull/286
[#280]: https://github.com/huggingface/tokenizers/pull/280
[#276]: https://github.com/huggingface/tokenizers/pull/276
[#273]: https://github.com/huggingface/tokenizers/pull/273
[#272]: https://github.com/huggingface/tokenizers/pull/272
[#249]: https://github.com/huggingface/tokenizers/pull/249
[#239]: https://github.com/huggingface/tokenizers/pull/239
[#236]: https://github.com/huggingface/tokenizers/pull/236
[#234]: https://github.com/huggingface/tokenizers/pull/234
[#208]: https://github.com/huggingface/tokenizers/pull/208
[#205]: https://github.com/huggingface/tokenizers/issues/205
[#197]: https://github.com/huggingface/tokenizers/pull/197
[#193]: https://github.com/huggingface/tokenizers/pull/193
[#190]: https://github.com/huggingface/tokenizers/pull/190
[#188]: https://github.com/huggingface/tokenizers/pull/188
[#187]: https://github.com/huggingface/tokenizers/issues/187
[#175]: https://github.com/huggingface/tokenizers/issues/175
[#174]: https://github.com/huggingface/tokenizers/issues/174
[#165]: https://github.com/huggingface/tokenizers/pull/165
[#163]: https://github.com/huggingface/tokenizers/issues/163
[#160]: https://github.com/huggingface/tokenizers/issues/160
[#156]: https://github.com/huggingface/tokenizers/pull/156
[#152]: https://github.com/huggingface/tokenizers/issues/152
[#149]: https://github.com/huggingface/tokenizers/issues/149
[#145]: https://github.com/huggingface/tokenizers/issues/145
[#139]: https://github.com/huggingface/tokenizers/issues/139
[#137]: https://github.com/huggingface/tokenizers/issues/137
[#134]: https://github.com/huggingface/tokenizers/issues/134
[#131]: https://github.com/huggingface/tokenizers/issues/131
[#99]: https://github.com/huggingface/tokenizers/pull/99