File: ChangeLog

package info (click to toggle)
python-pyocr 0.8.5-2
links: PTS, VCS
area: main
in suites: forky, sid, trixie
size: 548 kB
sloc: python: 4,921; makefile: 90; sh: 3
file content (199 lines) | stat: -rw-r--r-- 7,491 bytes
17/09/2023 - 0.8.5:
- Add build-system.build-backend to pyproject.toml

16/09/2023 - 0.8.4:
- Fix LineBoxBuilder: Take into account headers and footers too, not just the
  body.
- switch from setup.py to pyproject.toml
- switch from tox to pytest

26/06/2022 - 0.8.3:
- Workaround https://github.com/pypa/setuptools_scm/issues/727

15/04/2022 - 0.8.2:
- Add support for Tesseravt 5 + Linux
- Fix file descriptor leak (thanks to oda)

05/12/2021 - 0.8.1:
- Make the dependency on setuptools_scm optional

01/01/2020 - 0.8.0:
- Replaced libtesseract.image_to_pdf() by an object-oriented API that allows
  creating PDF with more than 1 page (thanks to Matthias Kraus).
- Tesseract 4 + sys.frozen=True: Fix TESSDATA_PREFIX: starting with
  Tesseract 4, the path must include tessdata/

22/06/2019 - 0.7.2:
- Fix setup.py on Windows

22/06/2019 - 0.7.1:
- tesseract.can_detect_orientation(): only returns True if 'osd' date files are
  installed
- setup.py: Fix installation in MSYS2

12/05/2019 - 0.7:
- Drop support for Python <= 2.7
- Fix: Make sure the builder objects can be used to parse box files
  even if Tesseract is not installed.
- PyOCR version is now automatically set in the module by setuptools_scm
  instead of PyOCR's Makefile (except on Windows)
- Tesseract: optim: keep the get_version() in memory instead of calling
  Tesseract everytime (get_version() by psm_parameter() which is called each
  time a box file is parsed ...)

18/02/2019 - 0.6:
- Complete rewrite of unit tests (thanks to Thomas Perret)
- Libtesseract 4.0: Fix segfault when running orientation detection
  (thanks to Marián Skrip)
- Libtesseract 4.0: Add a workaround: Tesseract need the locale to be set to
  'C' (thanks to Thomas Perret)
- Libtesseract: Specify DPI of the image to Libtesseract (thanks to Thomas
  Perret)
- Tesseract 4.0: Improve Tesseract version parsing

09/04/2018 - 0.5.3:
- Really fix tesseract 4.0 support (thanks to David Martin)
- Tests: switch from nose to pytest (thanks to Elliott Sales de Andrade)

25/07/2018 - 0.5.2:
- Fix tesseract 4.0 support: Use option '--psm' instead of '-psm'
- tesseract.detection_orientation(): Fix exception generation

01/03/2017 - 0.5.1:
- libtesseract/Windows: Add possible DLL names for libtesseract
- libtesseract: Keep track of library-loading errors in
  pyocr.libtesseract.lib_load_errors (useful for debugging)
- Build method has been changed: Use now "make install" instead of
  "python3 ./setup.py install"
- cosmetic: builders/WordHTMLParser: Message "OCR confidence not found"
  floods the logs when working with old documents --> switch to debug
  instead of info.

14/12/2017 - 0.5:
- Tesseract/Libtesseract + LineBoxBuilder: Add confidence scores to
  every word boxes and to hOCR files (thanks to Adriano Pagano)
- Tesseract 4 (shell): Add '--oem 0' to specify legacy model when doing
  orientation detection as orientation detection does not work yet with
  Tesseract 4 (thanks to Adriano Pagano)
- Libtesseract: Fix multi-language support
- Tesseract (shell) + Windows: Never let the cmd window appear
- Libtesseract: Implements image_to_pdf() (thanks to Marian Skrip)
- Libtesseract: Hide debug messages (thanks to Ashish Kulkarni)

13/05/2017 - 0.4.7:
- Tesseract 4.00.00alpha:
  - Version parsing: Ignore suffix (so '4.00.00alpha' == (4, 0, 0))
  - Libtesseract: Load libtesseract.so.4 instead of libtesseract.so.3 if
    available
- Support for Tesseract 3.05.00:
  - Builders: Split field 'tess_conf' into 'tess_flags' and 'tess_conf'
  - Libtesseract: If available, use TessBaseAPIDetectOrientationScript()
    instead of TessBaseAPIDetectOS
- Libtesseract: Workaround: Prevents possible segfault in image_to_string()
  when the target language is not available

26/01/2017 - 0.4.6:
- hOCR outputs: Generate valid XHTML files

10/01/2017 - 0.4.5:
- Clean up exceptions raised when OCR fails:
  - Now, all tools raise only exceptions inheriting from
    pyocr.PyocrException
  - There is now one and only one TesseractError (shared between
    pyocr.libtesseract and pyocr.tesseract)

08/12/2016 - 0.4.4:
- Fix Python 2.7 support (broken import)

06/12/2016 - 0.4.3:
- (temporary) Use tesseract-sh by default instead of libtesseract. Some
  people have reported crashes with Paperwork+libtesseract. It needs more
  stress-testing
- DigitBuilder is now available in 'pyocr.builders' (can be used
  with libtesseract and cuneiform)
- New builder: DigitLineBoxBuilder
- Windows: Fix pyinstaller packaging suport: env variable TESSDATA_PREFIX
  wasn't set correctly
- Windows: Tesseract-sh: Prevent CMD windows from appearing

05/10/2016 - 0.4.2:
* Tesseract: orientation detection: Ignore errors printed by libleptonic
  on stderr (thanks to TeisD)
* Tesseract: Fix support of dev builds (thanks to Fjup)
* Libtesseract: Fix support of dev builds (thanks to Jakub Semerák)
* Tesseract: Use '--list-langs' to get the available languages instead of
  looking for the data directory (thanks to Bernhard Liebl)

06/04/2016 - 0.4.1:
* Disable 'libtesseract' with Tesseract <= 3.03. It tends to segfault.
  Libtesseract: Disable it with Tesseract <= 3.03. It tends to segfault.
  Note: the segfault may not actually be related to Libtesseract. It may be due to other things in Debian stable (jessie).
  Anyway, Paperwork cannot work on Debian stable because of that --> disabled just to be safe


13/03/2013 - 0.4.0:
* New module: 'libtesseract'. Use the C API of Tesseract for OCR.
  This module is more efficient and cleaner than the old 'tesseract' module
  (no more fork + exec + sh, less image manipulation, etc).
  Note that with this module the images are just loaded and uncompressed
  by Pillow. With 'tesseract', they were loaded, uncompressed, re-compressed
  and saved by Pillow, then be reloaded by Leptonica. So the results may
  vary slightly.
* Tesseract: Add support for Win32
* Tesseract: Fix orientation detection for version >= 3.04.01


0.3.1:
* tesseract.detect_orientation(): Use a temporary file instead of stdin
  to transmit the image to Tesseract. Tesseract 3.04 doesn't support
  stdin + "-psm 0" (regression ?)
* tesseract.detect_orientation(): Improve output parsing reliability
* optim: Avoid unnecessary convert to RGB and allow using image formats
  different from PNG
* TextBuilder + Cuneiform: add extra settings for Cuneiform
  (cuneiform_dotmatrix, cuneiform_fax=False, cuneiform_singlecolumn)


0.3.0:
* New API: pyocr.<tool>.can_detect_orientation() and
  pyocr.<tool>.detect_orientation()


0.2.4:
* Tesseract : add digit-only support
* Tesseract : add support for Tesseract subsets of layout analysis (-psm)


0.2.3:
* Strip the alpha channel from images before running the OCR. It's basically
  useless and can prevent the tool from working correctly.
* Make hOCR parsing more resistant (handle extra data around box positions)
* Fix: Take into account that new versions of Tesseract uses the file
  extension .hocr instead of .html


0.2.2:
* Fix Python 3 support
* Add support for Tesseract on Heroku


0.2.1:
* Make it possible to use 'import pyocr' instead of 'from pyocr import pyocr'.
  'from pyocr import pyocr' still works but is obsolete.
* Fix dependency list: depends on Pillow (it's untested with PIL)
* Fix pyocr.VERSION


0.2.0:
* Python 3.x support


0.1.2:
* Tesseract: Fix version parsing
* Tesseract: Fix Tesseract 3.02.01's hOCR format support


0.1.1:
* hOCR: Parse lines as well as words
* tesseract.get_available_languages() : Fix fedora support
* Fix UTF-8 support