# subtile-ocr
`subtile-ocr` is a blazingly fast and accurate DVD `VobSub` to SRT subtitle conversion tool.
It started as a fork of [vobsubocr](https://github.com/elizagamedev/vobsubocr).
## Background
DVD subtitles are unfortunately encoded essentially as a series of images. This
is a problem when you need a text representation of the subtitles, e.g. for
language learning. `subtile-ocr` alleviates this problem by generating SRT
subtitles from an input `VobSub` file, leveraging the power of
[Tesseract](https://github.com/tesseract-ocr/tesseract).
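For readers unfamiliar with the target format, an SRT file is plain text: each cue has a 1-based index, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time range, the subtitle text, and a blank line. The following is a minimal sketch of that layout; the function names `srt_timestamp` and `srt_cue` are made up for illustration and are not taken from `subtile-ocr`'s code.

```rust
use std::time::Duration;

/// Format a time offset as an SRT timestamp: `HH:MM:SS,mmm`.
/// (Hypothetical helper, not part of subtile-ocr.)
pub fn srt_timestamp(t: Duration) -> String {
    let ms = t.as_millis();
    format!(
        "{:02}:{:02}:{:02},{:03}",
        ms / 3_600_000,
        (ms / 60_000) % 60,
        (ms / 1_000) % 60,
        ms % 1_000
    )
}

/// Render one SRT cue: index, time range, text, then a blank separator line.
/// (Hypothetical helper, not part of subtile-ocr.)
pub fn srt_cue(index: usize, start: Duration, end: Duration, text: &str) -> String {
    format!(
        "{}\n{} --> {}\n{}\n\n",
        index,
        srt_timestamp(start),
        srt_timestamp(end),
        text
    )
}
```

The OCR step supplies the text for each cue, while the start and end times come from the `VobSub` stream itself.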
## Installation
Install the latest release with cargo:
```sh
cargo install subtile-ocr
```
Alternatively, install the development version from Git:
```sh
cargo install --git https://github.com/gwen-lg/subtile-ocr
```
You will need to have Tesseract's development libraries installed; see the
[leptess readme](https://github.com/houqp/leptess) for more details. If you use
Nix, the included `shell.nix` provides an environment with all of the necessary
dependencies.
## Usage
```sh
# Convert Simplified Chinese VobSub subtitles and print them to stdout.
subtile-ocr -l chi_sim shrek_chi.idx
# Convert English VobSub subtitles and write them to a file named "shrek_eng.srt".
subtile-ocr -l eng -o shrek_eng.srt shrek_eng.idx
```
We can also specify more advanced configuration options for Tesseract with `-c`.
```sh
# Convert subtitles and blacklist characters that would otherwise be recognized by mistake.
subtile-ocr -l eng -c tessedit_char_blacklist='|\/`_~' shrek_eng.idx
```
## How does it work/compare to similar tools?
The most comparable tool to `subtile-ocr` is
[VobSub2SRT](https://github.com/ruediger/VobSub2SRT). `subtile-ocr` produces
significantly better output, especially for non-English languages, mainly
because `VobSub2SRT` does little preprocessing of the image before sending it
to Tesseract. For example, Tesseract 4.0 expects black text on a white
background, which `subtile-ocr` guarantees and `VobSub2SRT` does not.
Additionally, `subtile-ocr` splits each subtitle into one image per line of
text to take advantage of page segmentation mode 7 (treat the image as a single
text line), which greatly improves accuracy, for non-English languages in
particular.
Official documentation on how to improve accuracy of Tesseract output can be
viewed [here](https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html).
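The two preprocessing ideas described above, forcing black text on a white background and cutting a subtitle into one image per text line, can be sketched as follows. This is only an illustration under stated assumptions: the `GrayImage` type, the function names, and the fixed brightness threshold are hypothetical and are not `subtile-ocr`'s actual implementation, which may differ.

```rust
/// A grayscale image stored row by row; 0 is black, 255 is white.
/// (Hypothetical type for illustration; not subtile-ocr's real API.)
#[derive(Debug, Clone, PartialEq)]
pub struct GrayImage {
    pub width: usize,
    pub height: usize,
    pub pixels: Vec<u8>, // expected length: width * height
}

/// Binarize so that text becomes black (0) and everything else white (255).
/// Assumption: pixels at or above `threshold` belong to the (light) glyphs.
pub fn to_black_on_white(img: &GrayImage, threshold: u8) -> GrayImage {
    let pixels = img
        .pixels
        .iter()
        .map(|&p| if p >= threshold { 0 } else { 255 })
        .collect();
    GrayImage { width: img.width, height: img.height, pixels }
}

/// Split a black-on-white image into one image per text line by cutting
/// wherever a row contains no black pixel.
pub fn split_lines(img: &GrayImage) -> Vec<GrayImage> {
    let mut lines = Vec::new();
    let mut start: Option<usize> = None;
    // Iterate one row past the end so a line touching the bottom is flushed.
    for y in 0..=img.height {
        let has_ink = y < img.height
            && img.pixels[y * img.width..(y + 1) * img.width]
                .iter()
                .any(|&p| p == 0);
        match (start, has_ink) {
            (None, true) => start = Some(y),
            (Some(s), false) => {
                lines.push(GrayImage {
                    width: img.width,
                    height: y - s,
                    pixels: img.pixels[s * img.width..y * img.width].to_vec(),
                });
                start = None;
            }
            _ => {}
        }
    }
    lines
}
```

Each image returned by `split_lines` holds a single line of text, which is the input shape that page segmentation mode 7 expects.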
## Miscellaneous Notes
From my understanding, the `chi_sim` and `chi_tra` Tesseract models both work on
simplified and traditional Chinese text, but each automatically converts the
recognized text to its own form (simplified for `chi_sim`, traditional for
`chi_tra`).