1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199
|
17/09/2023 - 0.8.5:
- Add build-system.build-backend to pyproject.toml
16/09/2023 - 0.8.4:
- Fix LineBoxBuilder: Take into account headers and footers too, not just the
body.
- switch from setup.py to pyproject.toml
- switch from tox to pytest
26/06/2022 - 0.8.3:
- Workaround https://github.com/pypa/setuptools_scm/issues/727
15/04/2022 - 0.8.2:
- Add support for Tesseravt 5 + Linux
- Fix file descriptor leak (thanks to oda)
05/12/2021 - 0.8.1:
- Make the dependency on setuptools_scm optional
01/01/2020 - 0.8.0:
- Replaced libtesseract.image_to_pdf() by an object-oriented API that allows
creating PDF with more than 1 page (thanks to Matthias Kraus).
- Tesseract 4 + sys.frozen=True: Fix TESSDATA_PREFIX: starting with
Tesseract 4, the path must include tessdata/
22/06/2019 - 0.7.2:
- Fix setup.py on Windows
22/06/2019 - 0.7.1:
- tesseract.can_detect_orientation(): only returns True if 'osd' date files are
installed
- setup.py: Fix installation in MSYS2
12/05/2019 - 0.7:
- Drop support for Python <= 2.7
- Fix: Make sure the builder objects can be used to parse box files
even if Tesseract is not installed.
- PyOCR version is now automatically set in the module by setuptools_scm
instead of PyOCR's Makefile (except on Windows)
- Tesseract: optim: keep the get_version() in memory instead of calling
Tesseract everytime (get_version() by psm_parameter() which is called each
time a box file is parsed ...)
18/02/2019 - 0.6:
- Complete rewrite of unit tests (thanks to Thomas Perret)
- Libtesseract 4.0: Fix segfault when running orientation detection
(thanks to Marián Skrip)
- Libtesseract 4.0: Add a workaround: Tesseract need the locale to be set to
'C' (thanks to Thomas Perret)
- Libtesseract: Specify DPI of the image to Libtesseract (thanks to Thomas
Perret)
- Tesseract 4.0: Improve Tesseract version parsing
09/04/2018 - 0.5.3:
- Really fix tesseract 4.0 support (thanks to David Martin)
- Tests: switch from nose to pytest (thanks to Elliott Sales de Andrade)
25/07/2018 - 0.5.2:
- Fix tesseract 4.0 support: Use option '--psm' instead of '-psm'
- tesseract.detection_orientation(): Fix exception generation
01/03/2017 - 0.5.1:
- libtesseract/Windows: Add possible DLL names for libtesseract
- libtesseract: Keep track of library-loading errors in
pyocr.libtesseract.lib_load_errors (useful for debugging)
- Build method has been changed: Use now "make install" instead of
"python3 ./setup.py install"
- cosmetic: builders/WordHTMLParser: Message "OCR confidence not found"
floods the logs when working with old documents --> switch to debug
instead of info.
14/12/2017 - 0.5:
- Tesseract/Libtesseract + LineBoxBuilder: Add confidence scores to
every word boxes and to hOCR files (thanks to Adriano Pagano)
- Tesseract 4 (shell): Add '--oem 0' to specify legacy model when doing
orientation detection as orientation detection does not work yet with
Tesseract 4 (thanks to Adriano Pagano)
- Libtesseract: Fix multi-language support
- Tesseract (shell) + Windows: Never let the cmd window appear
- Libtesseract: Implements image_to_pdf() (thanks to Marian Skrip)
- Libtesseract: Hide debug messages (thanks to Ashish Kulkarni)
13/05/2017 - 0.4.7:
- Tesseract 4.00.00alpha:
- Version parsing: Ignore suffix (so '4.00.00alpha' == (4, 0, 0))
- Libtesseract: Load libtesseract.so.4 instead of libtesseract.so.3 if
available
- Support for Tesseract 3.05.00:
- Builders: Split field 'tess_conf' into 'tess_flags' and 'tess_conf'
- Libtesseract: If available, use TessBaseAPIDetectOrientationScript()
instead of TessBaseAPIDetectOS
- Libtesseract: Workaround: Prevents possible segfault in image_to_string()
when the target language is not available
26/01/2017 - 0.4.6:
- hOCR outputs: Generate valid XHTML files
10/01/2017 - 0.4.5:
- Clean up exceptions raised when OCR fails:
- Now, all tools raise only exceptions inheriting from
pyocr.PyocrException
- There is now one and only one TesseractError (shared between
pyocr.libtesseract and pyocr.tesseract)
08/12/2016 - 0.4.4:
- Fix Python 2.7 support (broken import)
06/12/2016 - 0.4.3:
- (temporary) Use tesseract-sh by default instead of libtesseract. Some
people have reported crashes with Paperwork+libtesseract. It needs more
stress-testing
- DigitBuilder is now available in 'pyocr.builders' (can be used
with libtesseract and cuneiform)
- New builder: DigitLineBoxBuilder
- Windows: Fix pyinstaller packaging suport: env variable TESSDATA_PREFIX
wasn't set correctly
- Windows: Tesseract-sh: Prevent CMD windows from appearing
05/10/2016 - 0.4.2:
* Tesseract: orientation detection: Ignore errors printed by libleptonic
on stderr (thanks to TeisD)
* Tesseract: Fix support of dev builds (thanks to Fjup)
* Libtesseract: Fix support of dev builds (thanks to Jakub Semerák)
* Tesseract: Use '--list-langs' to get the available languages instead of
looking for the data directory (thanks to Bernhard Liebl)
06/04/2016 - 0.4.1:
* Disable 'libtesseract' with Tesseract <= 3.03. It tends to segfault.
Libtesseract: Disable it with Tesseract <= 3.03. It tends to segfault.
Note: the segfault may not actually be related to Libtesseract. It may be due to other things in Debian stable (jessie).
Anyway, Paperwork cannot work on Debian stable because of that --> disabled just to be safe
13/03/2013 - 0.4.0:
* New module: 'libtesseract'. Use the C API of Tesseract for OCR.
This module is more efficient and cleaner than the old 'tesseract' module
(no more fork + exec + sh, less image manipulation, etc).
Note that with this module the images are just loaded and uncompressed
by Pillow. With 'tesseract', they were loaded, uncompressed, re-compressed
and saved by Pillow, then be reloaded by Leptonica. So the results may
vary slightly.
* Tesseract: Add support for Win32
* Tesseract: Fix orientation detection for version >= 3.04.01
0.3.1:
* tesseract.detect_orientation(): Use a temporary file instead of stdin
to transmit the image to Tesseract. Tesseract 3.04 doesn't support
stdin + "-psm 0" (regression ?)
* tesseract.detect_orientation(): Improve output parsing reliability
* optim: Avoid unnecessary convert to RGB and allow using image formats
different from PNG
* TextBuilder + Cuneiform: add extra settings for Cuneiform
(cuneiform_dotmatrix, cuneiform_fax=False, cuneiform_singlecolumn)
0.3.0:
* New API: pyocr.<tool>.can_detect_orientation() and
pyocr.<tool>.detect_orientation()
0.2.4:
* Tesseract : add digit-only support
* Tesseract : add support for Tesseract subsets of layout analysis (-psm)
0.2.3:
* Strip the alpha channel from images before running the OCR. It's basically
useless and can prevent the tool from working correctly.
* Make hOCR parsing more resistant (handle extra data around box positions)
* Fix: Take into account that new versions of Tesseract uses the file
extension .hocr instead of .html
0.2.2:
* Fix Python 3 support
* Add support for Tesseract on Heroku
0.2.1:
* Make it possible to use 'import pyocr' instead of 'from pyocr import pyocr'.
'from pyocr import pyocr' still works but is obsolete.
* Fix dependency list: depends on Pillow (it's untested with PIL)
* Fix pyocr.VERSION
0.2.0:
* Python 3.x support
0.1.2:
* Tesseract: Fix version parsing
* Tesseract: Fix Tesseract 3.02.01's hOCR format support
0.1.1:
* hOCR: Parse lines as well as words
* tesseract.get_available_languages() : Fix fedora support
* Fix UTF-8 support
|