# Changelog
All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## [0.13.2]
- [#1096] Python 3.11 support
## [0.13.1]
- [#1072] Fixing Roberta type ids.
## [0.13.0]
- [#956] PyO3 version upgrade
- [#1055] M1 automated builds
- [#1008] `Decoder` is now a composable trait, without breaking backward compatibility
- [#1047, #1051, #1052] `Processor` is now a composable trait, without breaking backward compatibility
Both trait changes warrant a "major" version bump since, despite best efforts to preserve backward
compatibility, the code is different enough that we cannot be entirely sure nothing breaks.
## [0.12.1]
- [#938] **Reverted breaking change**. https://github.com/huggingface/transformers/issues/16520
## [0.12.0] YANKED
Bump minor version because of a breaking change.
- [#938] [REVERTED IN 0.12.1] **Breaking change**. The `Decoder` trait was modified to be composable. This is only breaking if you use decoders on their own; regular `tokenizers` usage should be error free.
- [#939] Making the regex in `ByteLevel` pre_tokenizer optional (necessary for BigScience)
- [#952] Fixed the vocabulary size of UnigramTrainer output (to respect added tokens)
- [#954] Fixed being unable to save vocabularies with holes in the vocab (e.g. ConvBert): emit warnings instead of panicking.
- [#962] Fix tests for Python 3.10
- [#961] Added link for Ruby port of `tokenizers`
## [0.11.6]
- [#919] Fixing single_word AddedToken. (regression from 0.11.2)
- [#916] Faster deserialization of `added_tokens` by loading them in batch.
## [0.11.5]
- [#895] Build `python 3.10` wheels.
## [0.11.4]
- [#884] Fixed broken deserialization caused by the newly added default argument for `Punctuation`
## [0.11.3]
- [#882] Fixing Punctuation deserialize without argument.
- [#868] Fixing missing direction in TruncationParams
- [#860] Adding TruncationSide to TruncationParams
## [0.11.0]
### Fixed
- [#585] Conda version should now work on old CentOS
- [#844] Fixing interaction between `is_pretokenized` and `trim_offsets`.
- [#851] Doc links
### Added
- [#657]: Add SplitDelimiterBehavior customization to Punctuation constructor
- [#845]: Documentation for `Decoders`.
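A minimal sketch of the [#657] entry above, assuming the Python bindings' `pre_tokenizers.Punctuation`; the behavior value is illustrative:
```
from tokenizers import pre_tokenizers

# Punctuation now accepts a SplitDelimiterBehavior; "isolated" keeps each
# punctuation character as a separate piece.
pre_tok = pre_tokenizers.Punctuation(behavior="isolated")
print(pre_tok.pre_tokenize_str("Hello, world!"))
```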
### Changed
- [#850]: Added a feature gate to enable disabling `http` features
- [#718]: Fix `WordLevel` tokenizer determinism during training
- [#762]: Add a way to specify the unknown token in `SentencePieceUnigramTokenizer`
- [#770]: Improved documentation for `UnigramTrainer`
- [#780]: Add `Tokenizer.from_pretrained` to load tokenizers from the Hugging Face Hub
- [#793]: Saving a pretty JSON file by default when saving a tokenizer
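A minimal sketch of the [#780] and [#793] entries above; the Hub identifier and file name are illustrative:
```
from tokenizers import Tokenizer

# Load a tokenizer.json hosted on the Hugging Face Hub.
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

# Saving now writes pretty-printed JSON by default; pass pretty=False for compact output.
tokenizer.save("tokenizer.json")
```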
## [0.10.3]
### Fixed
- [#686]: Fix SPM conversion process for whitespace deduplication
- [#707]: Fix stripping strings containing Unicode characters
### Added
- [#693]: Add a CTC Decoder for Wav2Vec2 models
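A minimal sketch of the [#693] entry, assuming the default CTC tokens; the token sequence is illustrative:
```
from tokenizers import decoders

# Collapse repeated tokens, drop the pad token, and turn the word delimiter into a space.
decoder = decoders.CTC(pad_token="<pad>", word_delimiter_token="|", cleanup=True)

# The pad token between the two "l"s keeps them from being collapsed.
print(decoder.decode(["h", "e", "l", "<pad>", "l", "o", "|", "w", "o", "r", "l", "d"]))
```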
### Removed
- [#714]: Removed support for Python 3.5
## [0.10.2]
### Fixed
- [#652]: Fix offsets for `Precompiled` corner case
- [#656]: Fix BPE `continuing_subword_prefix`
- [#674]: Fix `Metaspace` serialization problems
## [0.10.1]
### Fixed
- [#616]: Fix SentencePiece tokenizers conversion
- [#617]: Fix offsets produced by Precompiled Normalizer (used by tokenizers converted from SPM)
- [#618]: Fix Normalizer.normalize with `PyNormalizedStringRefMut`
- [#620]: Fix serialization/deserialization for overlapping models
- [#621]: Fix `ByteLevel` instantiation from a previously saved state (using `__getstate__()`)
## [0.10.0]
### Added
- [#508]: Add a Visualizer for notebooks to help understand how the tokenizers work
- [#519]: Add a `WordLevelTrainer` used to train a `WordLevel` model
- [#533]: Add support for conda builds
- [#542]: Add Split pre-tokenizer to easily split using a pattern
- [#544]: Ability to train from memory. This also improves the integration with `datasets`
- [#590]: Add getters/setters for components on BaseTokenizer
- [#574]: Add `fuse_unk` option to `SentencePieceBPETokenizer`
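A minimal sketch of the [#544] and [#574] entries above; the in-memory corpus and `vocab_size` are illustrative, `fuse_unk` is assumed to be the constructor option added by [#574], and `train_from_iterator` is assumed to be the entry point for training from memory:
```
from tokenizers import SentencePieceBPETokenizer

# fuse_unk merges consecutive unknown pieces into a single unk token.
tokenizer = SentencePieceBPETokenizer(fuse_unk=True)

# Train directly from an in-memory iterator instead of files.
corpus = ["hello world", "training from memory", "no files needed"]
tokenizer.train_from_iterator(corpus, vocab_size=100)
```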
### Changed
- [#509]: Automatically stubbing the `.pyi` files
- [#519]: Each `Model` can return its associated `Trainer` with `get_trainer()`
- [#530]: The various attributes on each component can be get/set (e.g.
  `tokenizer.model.dropout = 0.1`; see the sketch after this list)
- [#538]: The API Reference has been improved and is now up-to-date.
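A minimal sketch of the [#519] and [#530] entries above, using an empty BPE model:
```
from tokenizers import Tokenizer
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE())

# Component attributes can now be read and set directly.
tokenizer.model.dropout = 0.1

# Each Model can hand back its associated Trainer.
trainer = tokenizer.model.get_trainer()
```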
### Fixed
- [#519]: During training, the `Model` is now trained in-place. This fixes several bugs that
required reloading the `Model` after training.
- [#539]: Fix `BaseTokenizer` enable_truncation docstring
## [0.9.4]
### Fixed
- [#492]: Fix `from_file` on `BertWordPieceTokenizer`
- [#498]: Fix the link to download `sentencepiece_model_pb2.py`
- [#500]: Fix a typo in the docs quicktour
### Changed
- [#506]: Improve Encoding mappings for pairs of sequences
## [0.9.3]
### Fixed
- [#470]: Fix hanging error when training with custom component
- [#476]: TemplateProcessing serialization is now deterministic
- [#481]: Fix SentencePieceBPETokenizer.from_files
### Added
- [#477]: UnicodeScripts PreTokenizer to avoid merges between various scripts
- [#480]: Unigram now accepts an `initial_alphabet` and handles `special_tokens` correctly
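A minimal sketch of the [#477] and [#480] entries above, assuming the new options live on `trainers.UnigramTrainer`; the option values are illustrative:
```
from tokenizers import Tokenizer, pre_tokenizers, trainers
from tokenizers.models import Unigram

# Split on Unicode script boundaries to avoid merges across scripts.
tokenizer = Tokenizer(Unigram())
tokenizer.pre_tokenizer = pre_tokenizers.UnicodeScripts()

# Unigram training now takes an initial_alphabet and handles special_tokens.
trainer = trainers.UnigramTrainer(special_tokens=["<unk>"], initial_alphabet=["a", "b", "c"])
```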
## [0.9.2]
### Fixed
- [#464]: Fix a problem with RobertaProcessing being deserialized as BertProcessing
## [0.9.1]
### Fixed
- [#459]: Fix a problem with deserialization
## [0.9.0]
### Fixed
- [#362]: Fix training deadlock with Python components.
- [#363]: Fix a crash when calling `.train` with some non-existent files
- [#355]: Remove a lot of possible crashes
- [#389]: Improve truncation (crash and consistency)
### Added
- [#379]: Add the ability to call `encode`/`encode_batch` with numpy arrays
- [#292]: Support for the Unigram algorithm
- [#378], [#394], [#416], [#417]: Many new Normalizers and PreTokenizers
- [#403]: Add `TemplateProcessing` `PostProcessor`.
- [#420]: Ability to fuse the "unk" token in BPE.
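A minimal sketch of the [#403] and [#420] entries above; the template and special-token ids are illustrative:
```
from tokenizers.models import BPE
from tokenizers.processors import TemplateProcessing

# BPE can now fuse consecutive unknown pieces into a single unk token.
model = BPE(unk_token="[UNK]", fuse_unk=True)

# A BERT-style post-processing template for single sequences and pairs.
processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[("[CLS]", 101), ("[SEP]", 102)],
)
```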
### Changed
- [#360]: Lots of improvements related to words/alignment tracking
- [#426]: Improvements on error messages thanks to PyO3 0.12
## [0.8.1]
### Fixed
- [#333]: Fix deserialization of `AddedToken`, where the content was not restored properly
### Changed
- [#329]: Improved warning and behavior when we detect a fork
- [#330]: BertNormalizer now keeps the same behavior as the original implementation when
`strip_accents` is not specified.
## [0.8.0]
### Highlights of this release
- We can now encode both pre-tokenized inputs and raw strings. This is especially useful when
processing datasets that are already pre-tokenized, as in NER (Named Entity Recognition), and helps
when applying labels to each word (see the sketch after these highlights).
- Full tokenizer serialization. It is now easy to save a tokenizer to a single JSON file, to later
load it back with just one line of code. That's what sharing a Tokenizer means now: 1 line of code.
- With the serialization comes the compatibility with `Pickle`! The Tokenizer, all of its components,
Encodings, everything can be pickled!
- Training a tokenizer is now even faster (up to 5-10x) than before!
- Compatibility with `multiprocessing`, even when using the `fork` start method. Since this library
makes heavy use of the multithreading capabilities of our computers to allow very fast tokenization,
this led to problems (deadlocks) when used with `multiprocessing`. This version allows disabling
the parallelism, and will warn you when this is necessary.
- And a lot of other improvements, and fixes.
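A minimal sketch of the pre-tokenized encoding highlight, assuming an already-built tokenizer saved as `tokenizer.json` (the file name and input are illustrative):
```
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")

# The input is already split into words (e.g. for NER); is_pretokenized=True keeps
# offsets and labels aligned with the provided words.
encoding = tokenizer.encode(["My", "name", "is", "John"], is_pretokenized=True)
print(encoding.tokens)
```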
### Fixed
- [#286]: Fix various crashes when training a BPE model
- [#309]: Fixed a few bugs related to additional vocabulary/tokens
### Added
- [#272]: Serialization of the `Tokenizer` and all the parts (`PreTokenizer`, `Normalizer`, ...).
This adds some methods to easily save/load an entire tokenizer (`from_str`, `from_file`).
- [#273]: `Tokenizer` and its parts are now picklable
- [#289]: Ability to pad to a multiple of a specified value. This is especially useful to activate
Tensor Cores, which benefit from padding to a multiple of 8. Use
`enable_padding(pad_to_multiple_of=8)` for example.
- [#298]: Ability to get the currently set truncation/padding params
- [#311]: Ability to enable/disable the parallelism using the `TOKENIZERS_PARALLELISM` environment
variable. This is especially useful when using `multiprocessing`, with the `fork`
start method, which happens to be the default on Linux systems. Without disabling the parallelism,
the process deadlocks while encoding. (See [#187] for more information.)
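A minimal sketch combining the [#272], [#289], and [#311] entries above; the file name is illustrative:
```
import os

# Opt out of the Rust-side parallelism before fork-based multiprocessing ([#311]).
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from tokenizers import Tokenizer

# A full tokenizer reloads from the single JSON file it was saved to ([#272]).
tokenizer = Tokenizer.from_file("tokenizer.json")

# Pad to a multiple of 8, e.g. to help Tensor Core utilization ([#289]).
tokenizer.enable_padding(pad_id=0, pad_token="[PAD]", pad_to_multiple_of=8)
```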
### Changed
- Improved errors generated during truncation: cases where the provided max length is too low are
now handled properly.
- [#249] `encode` and `encode_batch` now accept pre-tokenized inputs. When the input is pre-tokenized,
the argument `is_pretokenized=True` must be specified.
- [#276]: Improved BPE training speed by reading files sequentially while parallelizing the
processing of each file
- [#280]: Use `onig` for byte-level pre-tokenization to remove all the differences with the original
implementation from GPT-2
- [#309]: Improved the management of the additional vocabulary. This introduces an option
`normalized`, controlling whether a token should be extracted from the normalized version of the
input text.
## [0.7.0]
### Changed
- Only one progress bar while reading files during training. This is better for use cases with
a high number of files, as it avoids having too many progress bars on screen. It also avoids reading
the size of each file before actually reading the files, which could take a really long time.
- [#193]: `encode` and `encode_batch` now take a new optional argument, specifying whether we
should add the special tokens. This is activated by default.
- [#197]: `original_str` and `normalized_str` have been removed from the `Encoding` returned by
`encode` and `encode_batch`. This brings a reduction of 70% of the memory footprint.
- [#197]: The offsets provided on `Encoding` are now relative to the original string, and not the
normalized one anymore.
- The added token given to `add_special_tokens` or `add_tokens` on a `Tokenizer`, or while using
`train(special_tokens=...)` can now be instances of `AddedToken` to provide more control over these
tokens.
- [#136]: Updated PyO3 version
- [#136]: Static methods `Model.from_files` and `Model.empty` are removed in favor of using
constructors.
- [#239]: `CharBPETokenizer` now corresponds to OpenAI GPT BPE implementation by default.
### Added
- [#188]: `ByteLevel` is also a `PostProcessor` now and handles trimming the offsets if activated.
This avoids the unintuitive inclusion of the whitespaces in the produced offsets, even if these
whitespaces are part of the actual token.
It has been added to `ByteLevelBPETokenizer` but it is off by default (`trim_offsets=False`).
- [#236]: `RobertaProcessing` also handles trimming the offsets.
- [#234]: New alignment mappings on the `Encoding`. Provide methods to easily convert between `char`
or `word` (input space) and `token` (output space).
- `post_process` can be called on the `Tokenizer`
- [#208]: Ability to retrieve the vocabulary from the `Tokenizer` with
`get_vocab(with_added_tokens: bool)`
- [#136] Models can now be instantiated through object constructors.
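A minimal sketch of the [#136] and [#208] entries above, using an empty BPE model:
```
from tokenizers import Tokenizer
from tokenizers.models import BPE

# Models are now built through their constructors instead of Model.empty / Model.from_files.
tokenizer = Tokenizer(BPE())

# Retrieve the vocabulary, optionally including the added tokens.
vocab = tokenizer.get_vocab(with_added_tokens=True)
```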
### Fixed
- [#193]: Fix some issues with the offsets being wrong with the `ByteLevel` BPE:
- when `add_prefix_space=True`
- [#156]: when a Unicode character gets split-up in multiple byte-level characters
- Fix a bug where offsets were wrong when there were any added tokens in the sequence being encoded.
- [#175]: Fix a bug that prevented the addition of more than a certain amount of tokens (even
though adding that many tokens is not advised).
- [#205]: Trim the decoded string in `BPEDecoder` used by `CharBPETokenizer`
### How to migrate
- Add the `ByteLevel` `PostProcessor` to your byte-level BPE tokenizers if relevant. If you are
using `ByteLevelBPETokenizer`, this option is disabled by default (`trim_offsets=False`).
- The `add_special_tokens` option of `BertWordPieceTokenizer` must now be given to `encode` or
`encode_batch`
- Access to the `original_str` on the `Encoding` has been removed. The original string is the input
of `encode` so it didn't make sense to keep it here.
- No need to call `original_str.offsets(offsets[N])` to convert offsets to the original string. They
are now relative to the original string by default.
- Access to the `normalized_str` on the `Encoding` has been removed. Can be retrieved by calling
`normalize(sequence)` on the `Tokenizer`
- Replace `Model.from_files` and `Model.empty` with the constructor. The model constructor takes
the same arguments as the old methods (i.e. `BPE(vocab, merges)` or `BPE()`)
- If you were using the `CharBPETokenizer` and want to keep the same behavior as before, set
`bert_normalizer=False` and `split_on_whitespace_only=True`.
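A minimal sketch of the constructor migration and the `CharBPETokenizer` flags above; the vocabulary file names are illustrative, and the constructor call mirrors the migration note for this release:
```
from tokenizers import CharBPETokenizer
from tokenizers.models import BPE

# Model.from_files / Model.empty are gone; the constructor takes the same arguments.
model = BPE("vocab.json", "merges.txt")

# Keep the pre-0.7.0 CharBPETokenizer behavior.
tokenizer = CharBPETokenizer(bert_normalizer=False, split_on_whitespace_only=True)
```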
## [0.6.0]
### Changed
- [#165]: Big improvements in speed for BPE (Both training and tokenization)
### Fixed
- [#160]: Some default tokens were missing from `BertWordPieceTokenizer`
- [#156]: There was a bug in the `ByteLevel` PreTokenizer that caused offsets to be wrong if a
character got split into multiple bytes.
- [#174]: The `longest_first` truncation strategy had a bug
## [0.5.2]
- [#163]: Do not open all files directly while training
### Fixed
- We introduced a bug related to the saving of the WordPiece model in 0.5.1: The `vocab.txt` file
was named `vocab.json`. This is now fixed.
- The `WordLevel` model was also saving its vocabulary to the wrong format.
## [0.5.1]
### Changed
- `name` argument is now optional when saving a `Model`'s vocabulary. When the name is not
specified, the files get a more generic naming, like `vocab.json` or `merges.txt`.
## [0.5.0]
### Changed
- [#145]: `BertWordPieceTokenizer` now cleans up some tokenization artifacts while decoding
- [#149]: `ByteLevelBPETokenizer` now has `dropout`.
- `do_lowercase` has been changed to `lowercase` for consistency between the different tokenizers.
(Especially `ByteLevelBPETokenizer` and `CharBPETokenizer`)
- [#139]: Expose `__len__` on `Encoding`
- Improved padding performances.
### Added
- Added a new `Strip` normalizer
### Fixed
- [#145]: Decoding was buggy on `BertWordPieceTokenizer`.
- [#152]: Some documentation and examples were still using the old `BPETokenizer`
### How to migrate
- Use `lowercase` when initializing `ByteLevelBPETokenizer` or `CharBPETokenizer` instead of
`do_lowercase`.
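A minimal sketch of the rename above:
```
from tokenizers import ByteLevelBPETokenizer

# `do_lowercase` is gone; pass `lowercase` instead.
tokenizer = ByteLevelBPETokenizer(lowercase=True)
```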
## [0.4.2]
### Fixed
- [#137]: Fix a bug in the class `WordPieceTrainer` that prevented `BertWordPieceTokenizer` from
being trained.
## [0.4.1]
### Fixed
- [#134]: Fix a bug related to the punctuation in BertWordPieceTokenizer
## [0.4.0]
### Changed
- [#131]: Replaced all `.new()` class methods by a proper `__new__` implementation
- Improved typings
### How to migrate
- Remove `.new` from all class instantiations
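A minimal sketch of the migration above; the `.new` spelling in the comment is only shown for contrast, and the empty `BPE()` constructor is used for brevity:
```
from tokenizers import Tokenizer
from tokenizers.models import BPE

# Classes are now instantiated directly instead of via `.new(...)`,
# e.g. `Tokenizer(model)` rather than `Tokenizer.new(model)`.
tokenizer = Tokenizer(BPE())
```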
## [0.3.0]
### Changed
- `BPETokenizer` has been renamed to `CharBPETokenizer` for clarity.
- Improve truncation/padding and the handling of overflowing tokens. Now when a sequence gets
truncated, we provide a list of overflowing `Encoding` objects that are ready to be processed by a
language model, just like the main `Encoding`.
- Provide mapping to the original string offsets using:
```
output = tokenizer.encode(...)
print(output.original_str.offsets(output.offsets[3]))
```
- [#99]: Exposed the vocabulary size on all tokenizers
### Added
- Added `CharDelimiterSplit`: a new `PreTokenizer` that allows splitting sequences on the given
delimiter (Works like `.split(delimiter)`)
- Added `WordLevel`: a new model that simply maps `tokens` to their `ids`.
### Fixed
- Fix a bug with IndexableString
- Fix a bug with truncation
### How to migrate
- Rename `BPETokenizer` to `CharBPETokenizer`
- `Encoding.overflowing` is now a List instead of an `Optional[Encoding]`
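A minimal sketch of the two migration points above; the vocabulary file names and the input sentence are illustrative:
```
from tokenizers import CharBPETokenizer  # formerly BPETokenizer

tokenizer = CharBPETokenizer("vocab.json", "merges.txt")
tokenizer.enable_truncation(max_length=8)

output = tokenizer.encode("a long sentence that will overflow the maximum length")

# `overflowing` is now a list of Encoding objects, not Optional[Encoding].
for extra in output.overflowing:
    print(extra.tokens)
```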
## [0.2.1]
### Fixed
- Fix a bug with the IDs associated with added tokens.
- Fix a bug that was causing crashes in Python 3.5
[#1096]: https://github.com/huggingface/tokenizers/pull/1096
[#1072]: https://github.com/huggingface/tokenizers/pull/1072
[#956]: https://github.com/huggingface/tokenizers/pull/956
[#1008]: https://github.com/huggingface/tokenizers/pull/1008
[#1009]: https://github.com/huggingface/tokenizers/pull/1009
[#1047]: https://github.com/huggingface/tokenizers/pull/1047
[#1055]: https://github.com/huggingface/tokenizers/pull/1055
[#1051]: https://github.com/huggingface/tokenizers/pull/1051
[#1052]: https://github.com/huggingface/tokenizers/pull/1052
[#938]: https://github.com/huggingface/tokenizers/pull/938
[#939]: https://github.com/huggingface/tokenizers/pull/939
[#952]: https://github.com/huggingface/tokenizers/pull/952
[#954]: https://github.com/huggingface/tokenizers/pull/954
[#962]: https://github.com/huggingface/tokenizers/pull/962
[#961]: https://github.com/huggingface/tokenizers/pull/961
[#960]: https://github.com/huggingface/tokenizers/pull/960
[#919]: https://github.com/huggingface/tokenizers/pull/919
[#916]: https://github.com/huggingface/tokenizers/pull/916
[#895]: https://github.com/huggingface/tokenizers/pull/895
[#884]: https://github.com/huggingface/tokenizers/pull/884
[#882]: https://github.com/huggingface/tokenizers/pull/882
[#868]: https://github.com/huggingface/tokenizers/pull/868
[#860]: https://github.com/huggingface/tokenizers/pull/860
[#850]: https://github.com/huggingface/tokenizers/pull/850
[#844]: https://github.com/huggingface/tokenizers/pull/844
[#845]: https://github.com/huggingface/tokenizers/pull/845
[#851]: https://github.com/huggingface/tokenizers/pull/851
[#585]: https://github.com/huggingface/tokenizers/pull/585
[#793]: https://github.com/huggingface/tokenizers/pull/793
[#780]: https://github.com/huggingface/tokenizers/pull/780
[#770]: https://github.com/huggingface/tokenizers/pull/770
[#762]: https://github.com/huggingface/tokenizers/pull/762
[#718]: https://github.com/huggingface/tokenizers/pull/718
[#714]: https://github.com/huggingface/tokenizers/pull/714
[#707]: https://github.com/huggingface/tokenizers/pull/707
[#693]: https://github.com/huggingface/tokenizers/pull/693
[#686]: https://github.com/huggingface/tokenizers/pull/686
[#674]: https://github.com/huggingface/tokenizers/pull/674
[#657]: https://github.com/huggingface/tokenizers/pull/657
[#656]: https://github.com/huggingface/tokenizers/pull/656
[#652]: https://github.com/huggingface/tokenizers/pull/652
[#621]: https://github.com/huggingface/tokenizers/pull/621
[#620]: https://github.com/huggingface/tokenizers/pull/620
[#618]: https://github.com/huggingface/tokenizers/pull/618
[#617]: https://github.com/huggingface/tokenizers/pull/617
[#616]: https://github.com/huggingface/tokenizers/pull/616
[#590]: https://github.com/huggingface/tokenizers/pull/590
[#574]: https://github.com/huggingface/tokenizers/pull/574
[#544]: https://github.com/huggingface/tokenizers/pull/544
[#542]: https://github.com/huggingface/tokenizers/pull/542
[#539]: https://github.com/huggingface/tokenizers/pull/539
[#538]: https://github.com/huggingface/tokenizers/pull/538
[#533]: https://github.com/huggingface/tokenizers/pull/533
[#530]: https://github.com/huggingface/tokenizers/pull/530
[#519]: https://github.com/huggingface/tokenizers/pull/519
[#509]: https://github.com/huggingface/tokenizers/pull/509
[#508]: https://github.com/huggingface/tokenizers/pull/508
[#506]: https://github.com/huggingface/tokenizers/pull/506
[#500]: https://github.com/huggingface/tokenizers/pull/500
[#498]: https://github.com/huggingface/tokenizers/pull/498
[#492]: https://github.com/huggingface/tokenizers/pull/492
[#481]: https://github.com/huggingface/tokenizers/pull/481
[#480]: https://github.com/huggingface/tokenizers/pull/480
[#477]: https://github.com/huggingface/tokenizers/pull/477
[#476]: https://github.com/huggingface/tokenizers/pull/476
[#470]: https://github.com/huggingface/tokenizers/pull/470
[#464]: https://github.com/huggingface/tokenizers/pull/464
[#459]: https://github.com/huggingface/tokenizers/pull/459
[#420]: https://github.com/huggingface/tokenizers/pull/420
[#417]: https://github.com/huggingface/tokenizers/pull/417
[#416]: https://github.com/huggingface/tokenizers/pull/416
[#403]: https://github.com/huggingface/tokenizers/pull/403
[#394]: https://github.com/huggingface/tokenizers/pull/394
[#389]: https://github.com/huggingface/tokenizers/pull/389
[#379]: https://github.com/huggingface/tokenizers/pull/379
[#378]: https://github.com/huggingface/tokenizers/pull/378
[#363]: https://github.com/huggingface/tokenizers/pull/363
[#362]: https://github.com/huggingface/tokenizers/pull/362
[#360]: https://github.com/huggingface/tokenizers/pull/360
[#355]: https://github.com/huggingface/tokenizers/pull/355
[#333]: https://github.com/huggingface/tokenizers/pull/333
[#330]: https://github.com/huggingface/tokenizers/pull/330
[#329]: https://github.com/huggingface/tokenizers/pull/329
[#311]: https://github.com/huggingface/tokenizers/pull/311
[#309]: https://github.com/huggingface/tokenizers/pull/309
[#292]: https://github.com/huggingface/tokenizers/pull/292
[#289]: https://github.com/huggingface/tokenizers/pull/289
[#286]: https://github.com/huggingface/tokenizers/pull/286
[#280]: https://github.com/huggingface/tokenizers/pull/280
[#276]: https://github.com/huggingface/tokenizers/pull/276
[#273]: https://github.com/huggingface/tokenizers/pull/273
[#272]: https://github.com/huggingface/tokenizers/pull/272
[#249]: https://github.com/huggingface/tokenizers/pull/249
[#239]: https://github.com/huggingface/tokenizers/pull/239
[#236]: https://github.com/huggingface/tokenizers/pull/236
[#234]: https://github.com/huggingface/tokenizers/pull/234
[#208]: https://github.com/huggingface/tokenizers/pull/208
[#205]: https://github.com/huggingface/tokenizers/issues/205
[#197]: https://github.com/huggingface/tokenizers/pull/197
[#193]: https://github.com/huggingface/tokenizers/pull/193
[#190]: https://github.com/huggingface/tokenizers/pull/190
[#188]: https://github.com/huggingface/tokenizers/pull/188
[#187]: https://github.com/huggingface/tokenizers/issues/187
[#175]: https://github.com/huggingface/tokenizers/issues/175
[#174]: https://github.com/huggingface/tokenizers/issues/174
[#165]: https://github.com/huggingface/tokenizers/pull/165
[#163]: https://github.com/huggingface/tokenizers/issues/163
[#160]: https://github.com/huggingface/tokenizers/issues/160
[#156]: https://github.com/huggingface/tokenizers/pull/156
[#152]: https://github.com/huggingface/tokenizers/issues/152
[#149]: https://github.com/huggingface/tokenizers/issues/149
[#145]: https://github.com/huggingface/tokenizers/issues/145
[#139]: https://github.com/huggingface/tokenizers/issues/139
[#137]: https://github.com/huggingface/tokenizers/issues/137
[#134]: https://github.com/huggingface/tokenizers/issues/134
[#131]: https://github.com/huggingface/tokenizers/issues/131
[#99]: https://github.com/huggingface/tokenizers/pull/99