File: CHANGELOG.md

# Changelog
All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.13.2] 

- [#1096] Python 3.11 support

## [0.13.1] 

- [#1072] Fixing Roberta type ids.

## [0.13.0] 

- [#956] PyO3 version upgrade
- [#1055] M1 automated builds
- [#1008] `Decoder` is now a composable trait, while remaining backward compatible
- [#1047, #1051, #1052] `Processor` is now a composable trait, while remaining backward compatible

Both trait changes warrant a "major" version bump since, despite best efforts to preserve backward
  compatibility, the code is different enough that we cannot be entirely sure nothing breaks.

## [0.12.1] 

- [#938] **Reverted breaking change**. https://github.com/huggingface/transformers/issues/16520

## [0.12.0] YANKED

Bump minor version because of a breaking change.

- [#938] [REVERTED IN 0.12.1] **Breaking change**. The `Decoder` trait is modified to be composable. This is only breaking if you are using decoders on their own; regular use of `tokenizers` should be unaffected.
- [#939] Making the regex in `ByteLevel` pre_tokenizer optional (necessary for BigScience)

- [#952] Fixed the vocabulary size of UnigramTrainer output (to respect added tokens)
- [#954] Fixed not being able to save vocabularies with holes in the vocab (ConvBert). Emit warnings instead of panicking.
- [#962] Fix tests for python 3.10
- [#961] Added link for Ruby port of `tokenizers`

## [0.11.6]

- [#919] Fixing single_word AddedToken. (regression from 0.11.2)
- [#916] Faster deserialization of `added_tokens` by loading them in batch.

## [0.11.5]

- [#895] Build `python 3.10` wheels.

## [0.11.4]

- [#884] Fixing bad deserialization following inclusion of a default for Punctuation

## [0.11.3]

- [#882] Fixing Punctuation deserialize without argument.
- [#868] Fixing missing direction in TruncationParams
- [#860] Adding TruncationSide to TruncationParams

## [0.11.0]

### Fixed

- [#585] Conda version should now work on old CentOS
- [#844] Fixing interaction between `is_pretokenized` and `trim_offsets`.
- [#851] Doc links

### Added
- [#657]: Add SplitDelimiterBehavior customization to Punctuation constructor
- [#845]: Documentation for `Decoders`.

### Changed
- [#850]: Added a feature gate to enable disabling `http` features
- [#718]: Fix `WordLevel` tokenizer determinism during training
- [#762]: Add a way to specify the unknown token in `SentencePieceUnigramTokenizer`
- [#770]: Improved documentation for `UnigramTrainer`
- [#780]: Add `Tokenizer.from_pretrained` to load tokenizers from the Hugging Face Hub (see the sketch after this list)
- [#793]: Saving a pretty JSON file by default when saving a tokenizer
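
A minimal sketch of loading from the Hub with [#780]; the `bert-base-uncased` identifier is only an example, and network access is assumed:

```
from tokenizers import Tokenizer

# Load a tokenizer.json published on the Hugging Face Hub.
# "bert-base-uncased" is just an example repository id.
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.encode("Hello, world!").tokens)
```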

## [0.10.3]

### Fixed
- [#686]: Fix SPM conversion process for whitespace deduplication
- [#707]: Fix stripping strings containing Unicode characters

### Added
- [#693]: Add a CTC Decoder for Wav2Vec2 models

### Removed
- [#714]: Removed support for Python 3.5

## [0.10.2]

### Fixed
- [#652]: Fix offsets for `Precompiled` corner case
- [#656]: Fix BPE `continuing_subword_prefix`
- [#674]: Fix `Metaspace` serialization problems

## [0.10.1]

### Fixed
- [#616]: Fix SentencePiece tokenizers conversion
- [#617]: Fix offsets produced by Precompiled Normalizer (used by tokenizers converted from SPM)
- [#618]: Fix Normalizer.normalize with `PyNormalizedStringRefMut`
- [#620]: Fix serialization/deserialization for overlapping models
- [#621]: Fix `ByteLevel` instantiation from a previously saved state (using `__getstate__()`)

## [0.10.0]

### Added
- [#508]: Add a Visualizer for notebooks to help understand how the tokenizers work
- [#519]: Add a `WordLevelTrainer` used to train a `WordLevel` model
- [#533]: Add support for conda builds
- [#542]: Add Split pre-tokenizer to easily split using a pattern
- [#544]: Ability to train from memory. This also improves the integration with `datasets` (see the sketch after this list)
- [#590]: Add getters/setters for components on BaseTokenizer
- [#574]: Add `fuse_unk` option to SentencePieceBPETokenizer
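
A rough sketch of training from memory ([#544]); the two short strings stand in for a real corpus, and the empty `BPE` model is only an example:

```
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# Train directly from an in-memory iterator instead of files on disk.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(special_tokens=["[UNK]"])
tokenizer.train_from_iterator(["hello world", "training from memory"], trainer=trainer)
print(tokenizer.get_vocab_size())
```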

### Changed
- [#509]: Automatically stubbing the `.pyi` files
- [#519]: Each `Model` can return its associated `Trainer` with `get_trainer()`
- [#530]: The various attributes of each component can now be get/set (e.g.
`tokenizer.model.dropout = 0.1`)
- [#538]: The API Reference has been improved and is now up-to-date.

### Fixed
- [#519]: During training, the `Model` is now trained in-place. This fixes several bugs that
forced reloading the `Model` after training.
- [#539]: Fix `BaseTokenizer` enable_truncation docstring

## [0.9.4]

### Fixed
- [#492]: Fix `from_file` on `BertWordPieceTokenizer`
- [#498]: Fix the link to download `sentencepiece_model_pb2.py`
- [#500]: Fix a typo in the docs quicktour

### Changed
- [#506]: Improve Encoding mappings for pairs of sequences

## [0.9.3]

### Fixed
- [#470]: Fix hanging error when training with custom component
- [#476]: TemplateProcessing serialization is now deterministic
- [#481]: Fix SentencePieceBPETokenizer.from_files

### Added
- [#477]: UnicodeScripts PreTokenizer to avoid merges between various scripts
- [#480]: Unigram now accepts an `initial_alphabet` and handles `special_tokens` correctly

## [0.9.2]

### Fixed
- [#464]: Fix a problem with RobertaProcessing being deserialized as BertProcessing

## [0.9.1]

### Fixed
- [#459]: Fix a problem with deserialization

## [0.9.0]

### Fixed
- [#362]: Fix training deadlock with Python components.
- [#363]: Fix a crash when calling `.train` with some non-existent files
- [#355]: Remove a lot of possible crashes
- [#389]: Improve truncation (crash and consistency)

### Added
- [#379]: Add the ability to call `encode`/`encode_batch` with numpy arrays
- [#292]: Support for the Unigram algorithm
- [#378], [#394], [#416], [#417]: Many new Normalizers and PreTokenizers
- [#403]: Add `TemplateProcessing` `PostProcessor` (see the sketch after this list).
- [#420]: Ability to fuse the "unk" token in BPE.
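
A hedged sketch of a `TemplateProcessing` post-processor ([#403]); the ids given for `[CLS]`/`[SEP]` are placeholders for whatever ids those tokens have in a real vocabulary:

```
from tokenizers.processors import TemplateProcessing

# Wrap single sequences and pairs with [CLS]/[SEP], assigning type id 1 to the second sequence.
processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
)
# Typically attached with: tokenizer.post_processor = processor
```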

### Changed
- [#360]: Lots of improvements related to words/alignment tracking
- [#426]: Improvements on error messages thanks to PyO3 0.12

## [0.8.1]

### Fixed
- [#333]: Fix deserialization of `AddedToken`, where the content was not restored properly

### Changed
- [#329]: Improved warning and behavior when we detect a fork
- [#330]: BertNormalizer now keeps the same behavior as the original implementation when
`strip_accents` is not specified.

## [0.8.0]

### Highlights of this release
- We can now encode both pre-tokenized inputs and raw strings. This is especially useful when
processing datasets that are already pre-tokenized, as in NER (Named Entity Recognition), and helps
when applying labels to each word.
- Full tokenizer serialization. It is now easy to save a tokenizer to a single JSON file and later
load it back with just one line of code. That's what sharing a Tokenizer means now: 1 line of code
(see the sketch after this list).
- With the serialization comes the compatibility with `Pickle`! The Tokenizer, all of its components,
Encodings, everything can be pickled!
- Training a tokenizer is now even faster (up to 5-10x) than before!
- Compatibility with `multiprocessing`, even when using the `fork` start method. Since this library
makes heavy use of multithreading to provide very fast tokenization, this led to problems (deadlocks)
when used with `multiprocessing`. This version now allows disabling the parallelism, and will warn
you when that is necessary.
- And a lot of other improvements and fixes.
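
A minimal sketch of the serialization and pickling highlights; the empty `BPE()` model and the `tokenizer.json` file name are placeholders:

```
import pickle

from tokenizers import Tokenizer
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE())
tokenizer.save("tokenizer.json")                  # the whole pipeline in a single JSON file
reloaded = Tokenizer.from_file("tokenizer.json")  # loading back: one line of code
blob = pickle.dumps(tokenizer)                    # the Tokenizer is picklable too
```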

### Fixed
- [#286]: Fix various crashes when training a BPE model
- [#309]: Fixed a few bugs related to additional vocabulary/tokens

### Added
- [#272]: Serialization of the `Tokenizer` and all the parts (`PreTokenizer`, `Normalizer`, ...).
This adds some methods to easily save/load an entire tokenizer (`from_str`, `from_file`).
- [#273]: `Tokenizer` and its parts are now picklable
- [#289]: Ability to pad to a multiple of a specified value. This is especially useful to ensure
the activation of Tensor Cores, by padding to a multiple of 8. Use
`enable_padding(pad_to_multiple_of=8)` for example (see the sketch after this list).
- [#298]: Ability to get the currently set truncation/padding params
- [#311]: Ability to enable/disable the parallelism using the `TOKENIZERS_PARALLELISM` environment
variable. This is especially useful when using `multiprocessing`, with the `fork`
start method, which happens to be the default on Linux systems. Without disabling the parallelism,
the process dead-locks while encoding. (Cf [#187] for more information)
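
A short sketch of [#289] and [#311] together; the empty `BPE()` model is only a stand-in:

```
import os

from tokenizers import Tokenizer
from tokenizers.models import BPE

# #311: disable Rust-side parallelism before forking (e.g. with `multiprocessing`).
os.environ["TOKENIZERS_PARALLELISM"] = "false"

tokenizer = Tokenizer(BPE())
# #289: pad every encoded sequence up to a multiple of 8 (handy for Tensor Cores).
tokenizer.enable_padding(pad_to_multiple_of=8)
```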

### Changed
- Improved errors generated during truncation: cases where the provided max length is too low are
now handled properly.
- [#249] `encode` and `encode_batch` now accept pre-tokenized inputs. When the input is pre-tokenized,
the argument `is_pretokenized=True` must be specified (see the sketch after this list).
- [#276]: Improve BPE training speed by reading files sequentially, but parallelizing the
processing of each file
- [#280]: Use `onig` for byte-level pre-tokenization to remove all the differences with the original
implementation from GPT-2
- [#309]: Improved the management of the additional vocabulary. This introduces an option
`normalized`, controlling whether a token should be extracted from the normalized version of the
input text.
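
A minimal sketch of encoding pre-tokenized input ([#249]); the tiny `WordLevel` vocabulary is only there to keep the example self-contained:

```
from tokenizers import Tokenizer
from tokenizers.models import WordLevel

tokenizer = Tokenizer(WordLevel({"Hello": 0, "world": 1, "[UNK]": 2}, unk_token="[UNK]"))

# The input is already split into words, so pass is_pretokenized=True.
encoding = tokenizer.encode(["Hello", "world"], is_pretokenized=True)
print(encoding.tokens)  # ['Hello', 'world']
```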

## [0.7.0]

### Changed
- Only one progress bar while reading files during training. This is better for use-cases with
a high number of files, as it avoids having too many progress bars on screen. It also avoids reading
the size of each file before actually starting to read them, as this could take a really long time.
- [#193]: `encode` and `encode_batch` now take a new optional argument, specifying whether we
should add the special tokens. This is activated by default.
- [#197]: `original_str` and `normalized_str` have been removed from the `Encoding` returned by
`encode` and `encode_batch`. This brings a reduction of 70% of the memory footprint.
- [#197]: The offsets provided on `Encoding` are now relative to the original string, and not the
normalized one anymore.
- The added token given to `add_special_tokens` or `add_tokens` on a `Tokenizer`, or while using
`train(special_tokens=...)` can now be instances of `AddedToken` to provide more control over these
tokens.
- [#136]: Updated Pyo3 version
- [#136]: Static methods `Model.from_files` and `Model.empty` are removed in favor of using
constructors.
- [#239]: `CharBPETokenizer` now corresponds to OpenAI GPT BPE implementation by default.

### Added
- [#188]: `ByteLevel` is also a `PostProcessor` now and handles trimming the offsets if activated.
This avoids the unintuitive inclusion of the whitespaces in the produced offsets, even if these
whitespaces are part of the actual token.
It has been added to `ByteLevelBPETokenizer` but it is off by default (`trim_offsets=False`).
- [#236]: `RobertaProcessing` also handles trimming the offsets.
- [#234]: New alignment mappings on the `Encoding`. Provide methods to easily convert between `char`
or `word` (input space) and `token` (output space) (see the sketch after this list).
- `post_process` can be called on the `Tokenizer`
- [#208]: Ability to retrieve the vocabulary from the `Tokenizer` with
`get_vocab(with_added_tokens: bool)`
- [#136] Models can now be instantiated through object constructors.
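
A rough sketch of the alignment helpers ([#234]) and `get_vocab` ([#208]); the tiny vocabulary and the `Whitespace` pre-tokenizer are only there to make the example self-contained:

```
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(WordLevel({"hello": 0, "world": 1, "[UNK]": 2}, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

encoding = tokenizer.encode("hello world")
print(encoding.token_to_word(1))   # token index -> word index in the input
print(encoding.char_to_token(6))   # character position -> token index
print(tokenizer.get_vocab(with_added_tokens=True))
```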

### Fixed
- [#193]: Fix some issues with the offsets being wrong with the `ByteLevel` BPE:
	- when `add_prefix_space=True`
	- [#156]: when a Unicode character gets split up into multiple byte-level characters
- Fix a bug where offsets were wrong when there were any added tokens in the sequence being encoded.
- [#175]: Fix a bug that prevented the addition of more than a certain number of tokens (adding that
many is not advised, but it should still work).
- [#205]: Trim the decoded string in `BPEDecoder` used by `CharBPETokenizer`

### How to migrate
- Add the `ByteLevel` `PostProcessor` to your byte-level BPE tokenizers if relevant. If you are
using `ByteLevelBPETokenizer`, this option is disabled by default (`trim_offsets=False`).
- `BertWordPieceTokenizer` option to `add_special_tokens` must now be given to `encode` or
`encode_batch`
- Access to the `original_str` on the `Encoding` has been removed. The original string is the input
of `encode` so it didn't make sense to keep it here.
- No need to call `original_str.offsets(offsets[N])` to convert offsets to the original string. They
are now relative to the original string by default.
- Access to the `normalized_str` on the `Encoding` has been removed. Can be retrieved by calling
`normalize(sequence)` on the `Tokenizer`
- Change `Model.from_files` and `Model.empty` to use constructors. The model constructor should take
the same arguments as the old methods (i.e. `BPE(vocab, merges)` or `BPE()`)
- If you were using the `CharBPETokenizer` and want to keep the same behavior as before, set
`bert_normalizer=False` and `split_on_whitespace_only=True`.

## [0.6.0]

### Changed
- [#165]: Big improvements in speed for BPE (Both training and tokenization)

### Fixed
- [#160]: Some default tokens were missing from `BertWordPieceTokenizer`
- [#156]: There was a bug in the ByteLevel PreTokenizer that caused offsets to be wrong if a char got
split up into multiple bytes.
- [#174]: The `longest_first` truncation strategy had a bug

## [0.5.2]
- [#163]: Do not open all files directly while training

### Fixed
- We introduced a bug related to the saving of the WordPiece model in 0.5.1: The `vocab.txt` file
was named `vocab.json`. This is now fixed.
- The `WordLevel` model was also saving its vocabulary to the wrong format.

## [0.5.1]

### Changed
- `name` argument is now optional when saving a `Model`'s vocabulary. When the name is not
specified, the files get a more generic naming, like `vocab.json` or `merges.txt`.

## [0.5.0]

### Changed
- [#145]: `BertWordPieceTokenizer` now cleans up some tokenization artifacts while decoding
- [#149]: `ByteLevelBPETokenizer` now has `dropout`.
- `do_lowercase` has been changed to `lowercase` for consistency between the different tokenizers.
(Especially `ByteLevelBPETokenizer` and `CharBPETokenizer`)
- [#139]: Expose `__len__` on `Encoding`
- Improved padding performance.

### Added
- Added a new `Strip` normalizer

### Fixed
- [#145]: Decoding was buggy on `BertWordPieceTokenizer`.
- [#152]: Some documentation and examples were still using the old `BPETokenizer`

### How to migrate
- Use `lowercase` when initializing `ByteLevelBPETokenizer` or `CharBPETokenizer` instead of
`do_lowercase`.
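
A minimal sketch of the rename, using `ByteLevelBPETokenizer` as an example:

```
from tokenizers import ByteLevelBPETokenizer

# Before: ByteLevelBPETokenizer(do_lowercase=True)
# After:
tokenizer = ByteLevelBPETokenizer(lowercase=True)
```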

## [0.4.2]

### Fixed
- [#137]: Fix a bug in the class `WordPieceTrainer` that prevented `BertWordPieceTokenizer` from
being trained.

## [0.4.1]

### Fixed
- [#134]: Fix a bug related to the punctuation in BertWordPieceTokenizer

## [0.4.0]

### Changed
- [#131]: Replaced all `.new()` class methods with a proper `__new__` implementation
- Improved typings

### How to migrate
- Remove all `.new` calls on all class instantiations

## [0.3.0]

### Changed
- `BPETokenizer` has been renamed to `CharBPETokenizer` for clarity.
- Improve truncation/padding and the handling of overflowing tokens. Now when a sequence gets
truncated, we provide a list of overflowing `Encoding`s that are ready to be processed by a language
model, just like the main `Encoding`.
- Provide mapping to the original string offsets using:
```
output = tokenizer.encode(...)
print(output.original_str.offsets(output.offsets[3]))
```
- [#99]: Exposed the vocabulary size on all tokenizers

### Added
- Added `CharDelimiterSplit`: a new `PreTokenizer` that allows splitting sequences on the given
delimiter (Works like `.split(delimiter)`)
- Added `WordLevel`: a new model that simply maps `tokens` to their `ids`.

### Fixed
- Fix a bug with IndexableString
- Fix a bug with truncation

### How to migrate
- Rename `BPETokenizer` to `CharBPETokenizer`
- `Encoding.overflowing` is now a List instead of an `Optional[Encoding]`

## [0.2.1]

### Fixed
- Fix a bug with the IDs associated with added tokens.
- Fix a bug that was causing crashes in Python 3.5

[#1096]: https://github.com/huggingface/tokenizers/pull/1096
[#1072]: https://github.com/huggingface/tokenizers/pull/1072
[#956]: https://github.com/huggingface/tokenizers/pull/956
[#1008]: https://github.com/huggingface/tokenizers/pull/1008
[#1009]: https://github.com/huggingface/tokenizers/pull/1009
[#1047]: https://github.com/huggingface/tokenizers/pull/1047
[#1055]: https://github.com/huggingface/tokenizers/pull/1055
[#1051]: https://github.com/huggingface/tokenizers/pull/1051
[#1052]: https://github.com/huggingface/tokenizers/pull/1052
[#938]: https://github.com/huggingface/tokenizers/pull/938
[#939]: https://github.com/huggingface/tokenizers/pull/939
[#952]: https://github.com/huggingface/tokenizers/pull/952
[#954]: https://github.com/huggingface/tokenizers/pull/954
[#962]: https://github.com/huggingface/tokenizers/pull/962
[#961]: https://github.com/huggingface/tokenizers/pull/961
[#960]: https://github.com/huggingface/tokenizers/pull/960
[#919]: https://github.com/huggingface/tokenizers/pull/919
[#916]: https://github.com/huggingface/tokenizers/pull/916
[#895]: https://github.com/huggingface/tokenizers/pull/895
[#884]: https://github.com/huggingface/tokenizers/pull/884
[#882]: https://github.com/huggingface/tokenizers/pull/882
[#868]: https://github.com/huggingface/tokenizers/pull/868
[#860]: https://github.com/huggingface/tokenizers/pull/860
[#850]: https://github.com/huggingface/tokenizers/pull/850
[#844]: https://github.com/huggingface/tokenizers/pull/844
[#845]: https://github.com/huggingface/tokenizers/pull/845
[#851]: https://github.com/huggingface/tokenizers/pull/851
[#585]: https://github.com/huggingface/tokenizers/pull/585
[#793]: https://github.com/huggingface/tokenizers/pull/793
[#780]: https://github.com/huggingface/tokenizers/pull/780
[#770]: https://github.com/huggingface/tokenizers/pull/770
[#762]: https://github.com/huggingface/tokenizers/pull/762
[#718]: https://github.com/huggingface/tokenizers/pull/718
[#714]: https://github.com/huggingface/tokenizers/pull/714
[#707]: https://github.com/huggingface/tokenizers/pull/707
[#693]: https://github.com/huggingface/tokenizers/pull/693
[#686]: https://github.com/huggingface/tokenizers/pull/686
[#674]: https://github.com/huggingface/tokenizers/pull/674
[#657]: https://github.com/huggingface/tokenizers/pull/657
[#656]: https://github.com/huggingface/tokenizers/pull/656
[#652]: https://github.com/huggingface/tokenizers/pull/652
[#621]: https://github.com/huggingface/tokenizers/pull/621
[#620]: https://github.com/huggingface/tokenizers/pull/620
[#618]: https://github.com/huggingface/tokenizers/pull/618
[#617]: https://github.com/huggingface/tokenizers/pull/617
[#616]: https://github.com/huggingface/tokenizers/pull/616
[#590]: https://github.com/huggingface/tokenizers/pull/590
[#574]: https://github.com/huggingface/tokenizers/pull/574
[#544]: https://github.com/huggingface/tokenizers/pull/544
[#542]: https://github.com/huggingface/tokenizers/pull/542
[#539]: https://github.com/huggingface/tokenizers/pull/539
[#538]: https://github.com/huggingface/tokenizers/pull/538
[#533]: https://github.com/huggingface/tokenizers/pull/533
[#530]: https://github.com/huggingface/tokenizers/pull/530
[#519]: https://github.com/huggingface/tokenizers/pull/519
[#509]: https://github.com/huggingface/tokenizers/pull/509
[#508]: https://github.com/huggingface/tokenizers/pull/508
[#506]: https://github.com/huggingface/tokenizers/pull/506
[#500]: https://github.com/huggingface/tokenizers/pull/500
[#498]: https://github.com/huggingface/tokenizers/pull/498
[#492]: https://github.com/huggingface/tokenizers/pull/492
[#481]: https://github.com/huggingface/tokenizers/pull/481
[#480]: https://github.com/huggingface/tokenizers/pull/480
[#477]: https://github.com/huggingface/tokenizers/pull/477
[#476]: https://github.com/huggingface/tokenizers/pull/476
[#470]: https://github.com/huggingface/tokenizers/pull/470
[#464]: https://github.com/huggingface/tokenizers/pull/464
[#459]: https://github.com/huggingface/tokenizers/pull/459
[#426]: https://github.com/huggingface/tokenizers/pull/426
[#420]: https://github.com/huggingface/tokenizers/pull/420
[#417]: https://github.com/huggingface/tokenizers/pull/417
[#416]: https://github.com/huggingface/tokenizers/pull/416
[#403]: https://github.com/huggingface/tokenizers/pull/403
[#394]: https://github.com/huggingface/tokenizers/pull/394
[#389]: https://github.com/huggingface/tokenizers/pull/389
[#379]: https://github.com/huggingface/tokenizers/pull/379
[#378]: https://github.com/huggingface/tokenizers/pull/378
[#363]: https://github.com/huggingface/tokenizers/pull/363
[#362]: https://github.com/huggingface/tokenizers/pull/362
[#360]: https://github.com/huggingface/tokenizers/pull/360
[#355]: https://github.com/huggingface/tokenizers/pull/355
[#333]: https://github.com/huggingface/tokenizers/pull/333
[#330]: https://github.com/huggingface/tokenizers/pull/330
[#329]: https://github.com/huggingface/tokenizers/pull/329
[#311]: https://github.com/huggingface/tokenizers/pull/311
[#309]: https://github.com/huggingface/tokenizers/pull/309
[#292]: https://github.com/huggingface/tokenizers/pull/292
[#298]: https://github.com/huggingface/tokenizers/pull/298
[#289]: https://github.com/huggingface/tokenizers/pull/289
[#286]: https://github.com/huggingface/tokenizers/pull/286
[#280]: https://github.com/huggingface/tokenizers/pull/280
[#276]: https://github.com/huggingface/tokenizers/pull/276
[#273]: https://github.com/huggingface/tokenizers/pull/273
[#272]: https://github.com/huggingface/tokenizers/pull/272
[#249]: https://github.com/huggingface/tokenizers/pull/249
[#239]: https://github.com/huggingface/tokenizers/pull/239
[#236]: https://github.com/huggingface/tokenizers/pull/236
[#234]: https://github.com/huggingface/tokenizers/pull/234
[#208]: https://github.com/huggingface/tokenizers/pull/208
[#205]: https://github.com/huggingface/tokenizers/issues/205
[#197]: https://github.com/huggingface/tokenizers/pull/197
[#193]: https://github.com/huggingface/tokenizers/pull/193
[#190]: https://github.com/huggingface/tokenizers/pull/190
[#188]: https://github.com/huggingface/tokenizers/pull/188
[#187]: https://github.com/huggingface/tokenizers/issues/187
[#175]: https://github.com/huggingface/tokenizers/issues/175
[#174]: https://github.com/huggingface/tokenizers/issues/174
[#165]: https://github.com/huggingface/tokenizers/pull/165
[#163]: https://github.com/huggingface/tokenizers/issues/163
[#160]: https://github.com/huggingface/tokenizers/issues/160
[#156]: https://github.com/huggingface/tokenizers/pull/156
[#152]: https://github.com/huggingface/tokenizers/issues/152
[#149]: https://github.com/huggingface/tokenizers/issues/149
[#145]: https://github.com/huggingface/tokenizers/issues/145
[#139]: https://github.com/huggingface/tokenizers/issues/139
[#137]: https://github.com/huggingface/tokenizers/issues/137
[#134]: https://github.com/huggingface/tokenizers/issues/134
[#131]: https://github.com/huggingface/tokenizers/issues/131
[#99]: https://github.com/huggingface/tokenizers/pull/99