1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381
|
v2.15.0 (13th August 2025)
- Overhaul sorbet types, moving from an external RBI file to inline comments in RBS syntax
- multiple PRs, but mainly https://github.com/yob/pdf-reader/pull/562
- See https://railsatscale.com/2025-04-23-rbs-support-for-sorbet/
- No impact expected for most users, but projects that use sorbet may find subtle changes in
the RBI file that is shipped with the gem
- Relax version requirements for dependency `afm`, allow 1.x (https://github.com/yob/pdf-reader/pull/557)
- Improve text positioning logic in some PDFs (https://github.com/yob/pdf-reader/pull/554)
- Multiple fixes for encrypted files
- Some files with passwords > 32 bytes long (https://github.com/yob/pdf-reader/pull/555)
- Some files that contain cipher text with a 16 byte IV and no further blocks (https://github.com/yob/pdf-reader/pull/561)
- Some files that encrypted data with no padding (https://github.com/yob/pdf-reader/pull/564)
- Add jruby 10 to CI matrix (https://github.com/yob/pdf-reader/pull/552)
v2.14.1 (4th February 2025)
- Fix issue in RBI signatures, introduced in v2.14.0(https://github.com/yob/pdf-reader/pull/550)
v2.14.0 (29th January 2025)
- Raise minimum supported ruby to 2.1 (https://github.com/yob/pdf-reader/pull/543)
- Add support for filtering to Page#text (https://github.com/yob/pdf-reader/pull/545)
v2.13.0 (2nd November 2024)
- Permit Ascii86 v1.0 and v2.0 (https://github.com/yob/pdf-reader/pull/539)
- Allow StringIO type for PDF::Reader input (https://github.com/yob/pdf-reader/pull/535)
v2.12.0 (26th December 2023)
- Fix a sorbet method signature (http://github.com/yob/pdf-reader/pull/512)
- Reduce allocations when parsing PDFs with hex strings (http://github.com/yob/pdf-reader/pull/528)
- Fix text extraction of some rare unicode codepoints (http://github.com/yob/pdf-reader/pull/529)
v2.11.0 (26th October 2022)
- Various bug fixes
- Expanded sorbet type annotations
v2.10.0 (12th May 2022)
- Various bug fixes
- Expanded sorbet type annotations
v2.9.2 (20th February 2022)
- Fix PDF::Reader::ObjectHash#page_references to return an Array of PDF::Reader::Reference (http://github.com/yob/pdf-reader/pull/444)
v2.9.1 (4th February 2022)
- Fix exception in Page#walk introduced in 2.9.0 (http://github.com/yob/pdf-reader/pull/442)
- Other small bug fixes
v2.9.0 (24th January 2022)
- Support additional encryption standards (http://github.com/yob/pdf-reader/pull/419)
- Return CropBox correctly from Page#rectangles (https://github.com/yob/pdf-reader/pull/420)
- For sorbet users, additional type annotations are included in the gem
v2.8.0 (28th Decemeber 2021)
- Add PDF::Reader::Page#runs for extracting text from a page with positioning metadata (http://github.com/yob/pdf-reader/pull/411)
- Add options to PDF::Reader::Page#text to make some behaviour configurable (http://github.com/yob/pdf-reader/pull/411)
- including extracting the text for only part of the page
- Improve text positioning and extraction for Type3 fonts (http://github.com/yob/pdf-reader/pull/412)
- Skip extracting text that is positioned outside the page (http://github.com/yob/pdf-reader/pull/413)
- Fix occasional crash when reading some streams (http://github.com/yob/pdf-reader/pull/405)
v2.7.0 (13th December 2021)
- Include RBI type files in the gem
- Downstream users of pdf-reader who also use sorbet *should* find many parts of the API will
now be typed checked by sorbet
- Fix glyph positioning in some rotation scenarios (http://github.com/yob/pdf-reader/pull/403)
- Improved text extraction on some rotated pages, and rotated text on normal pages
- Add PDF::Reader::Page#rectangles (http://github.com/yob/pdf-reader/pull/402)
- Returns page boxes (MediaBox, etc) with rotation applied, and as PORO rather than arrays of numbers
- Add PDF::Reader::Page#origin (http://github.com/yob/pdf-reader/pull/400)
- Add PDF::Reader::Page#{height,width} (http://github.com/yob/pdf-reader/pull/399)
- Overlap filter should only drop characters that overlap *and* match (http://github.com/yob/pdf-reader/pull/401)
v2.6.0 (12th November 2021)
- Text extraction improvements
- Improved text layout on pages with a variety of font sizes (http://github.com/yob/pdf-reader/pull/355)
- Fixed text positioning for some rotated pages (http://github.com/yob/pdf-reader/pull/356)
- Improved character width calculation for PDFs using built-in (non-embedded) ZapfDingbats (http://github.com/yob/pdf-reader/pull/373)
- Skip zero-width characters (http://github.com/yob/pdf-reader/pull/372)
- Performance improvements
- Reduced memory pressure when decoding TIFF images (http://github.com/yob/pdf-reader/pull/360)
- Optional dependency on ascii81_native gem for faster processing of files using the ascii85 filter (http://github.com/yob/pdf-reader/pull/359)
- Successfully parse more files
- Gracefully handle some non-spec compliant CR/LF issues (http://github.com/yob/pdf-reader/pull/364)
- Fix parsing of some escape sequences in content streams (http://github.com/yob/pdf-reader/pull/368)
- Increase the amount of junk bytes we detect and skip at the end of a file (382)
- Ignore "/Prev 0" in trailers (http://github.com/yob/pdf-reader/pull/383)
- Fix parsing of some inline images (BI ID EI tokens) (http://github.com/yob/pdf-reader/pull/389)
- Gracefully handle some xref tables that incorrectly start with 1 (http://github.com/yob/pdf-reader/pull/384)
v2.5.0 (6th June 2021)
- bump minimum ruby version to 2.0
- Correctly handle trascoding to UTF-8 from some fonts that use a difference table [#344](https://github.com/yob/pdf-reader/pull/344/)
- Fix some character spacing issues with the TJ operator [#343](https://github.com/yob/pdf-reader/pull/343)
- Fix crash with some encrypted PDFs [#348](https://github.com/yob/pdf-reader/pull/348/)
- Fix positions of text on some PDFs with pages rotated 90° [#350](https://github.com/yob/pdf-reader/pull/350/)
v2.4.2 (28th January 2021)
- relax ASCII85 dependency to allow 1.x
- improved support for decompressing objects with slightly malformed zlib data
v.2.4.1 (24th September 2020)
- Re-vendor font metrics from Adobe to clarify their license
v2.4.0 (21st November 2019)
- Optimise overlapping characters code introduced in 2.3.0. Text extraction of pages with
thousands of characters is still slower than it was in 2.2.1, but it might tolerable
for now. See https://github.com/yob/pdf-reader/pull/308 for details.
- Implement very basic font substitution for Type1 and TrueType fonts that aren't embedded
- Remove PDF::Hash class. It's been deprecated since 2010, and it's hard to believe anyone
is still using it.
- Several small bug fixes
v2.3.0 (7th November 2019)
- Text extraction now makes an effort to skip duplicate characters that overlap, a
common approach used for a fake "bold" effect, This will make text extraction a bit
slower - if that turns out to be an issue I'll look into further optimisations or
provide a toggle to turn it off
- Several small bug fixes
v2.2.1 (27th July 2019)
- Improve utf8 text extraction from CMaps that contain surrogate pair ligatures
v2.2.0 (18th December 2018)
- Support additional XRef Stream variants (thanks Stefan Wienert)
- Add frozen_strings pragma to reduce object allocations on ruby 2.3+
- various bug fixes
v2.1.0 (15th February 2018)
- Support extra encrypted PDF variants (thanks to Gyuchang Jun)
- various bug fixes
v2.0.0 (25th February 2017)
- various bug fixes
v2.0.0.beta1 (15th February 2017)
- BREAKING CHANGE: remove all methods that were deprecated in 1.0.0
- Bug: Support extra encrypted PDF variants (thanks to Gyuchang Jun)
- various bug fixes
v1.4.1 (2nd January 2017)
- improve compatibility with ruby 2.4 (thanks Akira Matsuda)
- various bug fixes
v1.4.0 (22nd February 2016)
- raise minimum ruby version to 1.9.3
- print warnings to stderr when deprecated methods are used. These methods have been
deprecated for 4 years, so hopefully few people are depending on them
- Fix exception when a non-breaking space (character 160) is used with a
built-in font (helvetica, etc)
- various bug fixes
v1.3.3 (7th April 2013)
- various bug fixes
v1.3.2 (26th February 2013)
- various bug fixes
v1.3.1 (12th February 2013)
- various bug fixes
v1.3.0 (30th December 2012)
- Numerous performance optimisations (thanks Alex Dowad)
- Improved text extraction (thanks Nathaniel Madura)
- Load less of the hashery gem to reduce core monkey patches
- various bug fixes
v1.2.0 (28th August 2012)
- Feature: correctly extract text using surrogate pairs and ligatures
(thanks Nathaniel Madura)
- Speed optimisation: cache tokenised Form XObjects to avoid re-parsing them
- Feature: support opening documents with some junk bytes prepended to file
(thanks Paul Gallagher)
- Acrobat does this, so it seemed reasonable to add support
v1.1.1 (9th May 2012)
- bugfix release to improve parsing of some PDFs
v1.1.0 (25th March 2012)
- new PageState class for handling common state tracking in page receivers
- see PageTextReceiver for example usage
- various bugfixes to support reading more PDF dialects
v1.0.0 (16th January 2012)
- support a new encryption variation
- bugfix in PageTextRender (thanks Paul Gallagher)
v1.0.0.rc1 (19th December 2011)
- performance optimisations (all by Bernerd Schaefer)
- some improvements to text extraction from form xobjects
- assume invalid font encodings are StandardEncoding
- use binary mode when opening PDFs to stop ruby being helpful and transcoding
bytes for us
v1.0.0.beta1 (6th October 2011)
- ensure inline images that contain "EI" are correctly parsed
(thanks Bernard Schaefer)
- fix parsing of inline image data
v0.12.0.alpha (28th August 2011)
- small breaking changes to the page-based API - it's alpha for a reason
- resource related methods on Page object return raw PDF objects
- if the caller wants the resources wrapped in a more convenient
Ruby object (like PDF::Reader::Font or PDF::Reader::FormXObject) will
need to do so themselves
- add support for RunLengthDecode filters (thanks Bernerd Schaefer)
- add support for standard PDF encryption (thanks Evan Brunner)
- add support for decoding stream with TIFF prediction
- new PDF::Reader::FormXObject class to simplify working with form XObjects
v0.11.0.alpha (19th July 2011)
- introduce experimental new page-based API
- old API is deprecated but will continue to work with no warnings
- add transparent caching of common objects to ObjectHash
v0.10.0 (6th July 2011)
- support multiple receivers within a single pass over a source file
- massive time saving when dealing with multiple receivers
v0.9.3 (2nd July 2011)
- add PDF::Reader::Reference#hash method
- improves behaviour of Reference objects when tehy're used as Hash keys
v0.9.2 (24th April 2011)
- add basic support for fonts with Identity-V encoding.
- bug: improve robustness of text extraction
- thanks to Evan Arnold for reporting
- bug: fix loading of nested resources on XObjects
- thanks to Samuel Williams for reporting
- bug: improve parsing of files with XRef object streams
v0.9.1 (21st December 2010)
- force gem to only install on ruby 1.8.7 or higher
- maintaining support for earlier versions takes more time than I have
available at the moment
- bug: fix parsing of obscure pdf name format
- bug: fix behaviour when loaded in conjunction with htmldoc gem
v0.9.0 (19th November 2010)
- support for pdf 1.5+ files that use object and xref streams
- support streams that use a flate filter with the predictor option
- ensure all content instructions are parsed when split over multiple stream
- thanks to Jack Rusher for reporting
- Various string parsing bug
- some character conversions to utf-8 were failing (thanks Andrea Barisani)
- hashes with nested hex strings were tokenising wronly (thanks Evan Arnold)
- escaping bug in tokenising of literal strings (thanks David Westerink)
- Fix a bug that prevented PDFs with white space after the EOF marker from loading
- thanks to Solomon White for reporting the issue
- Add support for de-filtering some LZW compressed streams
- thanks to Jose Ignacio Rubio Iradi for the patch
- some small speed improvements
- API CHANGE: PDF::Hash renamed to PDF::Reader::ObjectHash
- having a class named Hash was confusing for users
v0.8.6 (27th August 2010)
- new method: hash#page_references
- returns references to all page objects, gives rapid access to objects
for a given page
v0.8.5 (11th April 2010)
- fix a regression introduced in 0.8.4.
- Parameters passed to resource_font callback were inadvertently changed
v0.8.4 (30th March 2010)
- fix parsing of files that use Form XObjects
- thanks to Andrea Barisani for reporting the issue
- fix two issues that caused a small number of characters to convert to Unicode
incorrectly
- thanks to Andrea Barisani for reporting the issue
- require 'pdf-reader' now works a well as 'pdf/reader'
- good practice to have the require file match the gem name
- thanks to Chris O'Meara for highlighting this
v0.8.3 (14th February 2010)
- Fix a bug in tokenising of hex strings inside dictionaries
- Thanks to Brad Ediger for detecting the issue and proposing a solution
v0.8.2 (1st January 2010)
- Fix parsing of files that use Form XObjects behind an indirect reference
(thanks Cornelius Illi and Patrick Crosby)
- Rewrote Buffer class to fix various speed issues reported over the years
- On my sample file extracting full text reduced from 220 seconds to 9 seconds.
v0.8.1 (27th November 2009)
- Added PDF::Hash#version. Provides access to the source file PDF version
v0.8.0 (20th November 2009)
- Added PDF::Hash. It provides direct access to objects from a PDF file
with an API that emulates the standard Ruby hash
v0.7.7 (11th September 2009)
- Trigger callbacks contained in Form XObjects when we encounter them in a
content stream
- Fix inheritance of page resources to comply with section 3.6.2
v0.7.6 (28th August 2009)
- Various bug fixes that increase the files we can successfully parse
- Treat float and integer tokens differently (thanks Neil)
- Correctly handle PDFs where the Kids element of a Pages dict is an indirect
reference (thanks Rob Holland)
- Fix conversion of PDF strings to Ruby strings on 1.8.6 (thanks Andrès Koetsier)
- Fix decoding with ASCII85 and ASCIIHex filters (thanks Andrès Koetsier)
- Fix extracting inline images from content streams (thanks Andrès Koetsier)
- Fix extracting [ ] from content streams (thanks Christian Rishøj)
- Fix conversion of text to UTF8 when the cmap uses bfrange (thanks Federico Gonzalez Lutteroth)
v0.7.5 (27th August 2008)
- Fix a 1.8.7ism
v0.7.4 (7th August 2008)
- Raise a MalformedPDFError if a content stream contains an unterminated string
- Fix an bug that was causing an endless loop on some OSX systems
- valid strings were incorrectly thought to be unterminated
- thanks to Jeff Webb for playing email ping pong with me as I tracked this
issue down
v0.7.3 (11th June 2008)
- Add a high level way to get direct access to a PDF object, including a new executable: pdf_object
- Fix a hard loop bug caused by a content stream that is missing a final operator
- Significantly simplified the internal code for encoding conversions
- Fixes YACC parsing bug that occurs on Fedora 8's ruby VM
- New callbacks
- page_count
- pdf_version
- Fix a bug that prevented a font's BaseFont from being recorded correctly
v0.7.2 (20th May 2008)
- Throw an UnsupportedFeatureError if we try to open an encrypted/secure PDF
- Correctly handle page content instruction sets with trailing whitespace
- Represent PDF Streams with a new object, PDF::Reader::Stream
- their really wasn't any point in separating the stream content from it's associated dict. You need both
parts to correctly interpret the content
v0.7.1 (6th May 2008)
- Non-page strings (ie. metadata, etc) are now converted to UTF-8 more accurately
- Fixed a regression between 0.6.2 and 0.7 that prevented difference tables from being applied
correctly when translating text into UTF-8
v0.7 (6th May 2008)
- API INCOMPATIBLE CHANGE: any hashes that are passed to callbacks use symbols as keys instead of PDF::Reader::Name instances.
- Improved support for converting text in some PDF files to unicode
- Behave as expected if the Contents key in a Page Dict is a reference
- Include some basic metadata callbacks
- Don't interpret a comment token (%) inside a string as a comment
- Small fixes to improve 1.9 compatibility
- Improved our Zlib deflating to make it slightly more robust - still some more issues to work out though
- Throw an UnsupportedFeatureError if a pdf that uses XRef streams is opened
- Added an option to PDF::Reader#file and PDF::Reader#string to enable parsing of only parts of a PDF file(ie. only metadata, etc)
v0.6.2 (22nd March 2008)
- Catch low level errors when applying filters to a content stream and raise a MalformedPDFError instead.
- Added support for processing inline images
- Support for parsing XRef tables that have multiple subsections
- Added a few callbacks to improve the way we supply information on page resources
- Ignore whitespace in hex strings, as required by the spec (section 3.2.3)
- Use our "unknown character box" when a single character in an Identity-H string fails to decode
- Support ToUnicode CMaps that use the bfrange operator
- Tweaked tokenising code to ensure whitespace doesn't get in the way
v0.6.1 (12th March 2008)
- Tweaked behaviour when we encounter Identity-H encoded text that doesn't have a ToUnicode mapping. We
just replace each character with a little box.
- Use the same little box when invalid characters are found in other encodings instead of throwing an ugly
NoMethodError.
- Added a method to RegisterReceiver that returns all occurrences of a callback
v0.6.0 (27th February 2008)
- all text is now transparently converted to UTF-8 before being passed to the callbacks.
before this version, text was just passed as a byte level copy of what was in the PDF file, which
was mildly annoying with some encodings, and resulted in garbled text for Unicode encoded text.
- Fonts that use a difference table are now handled correctly
- fixed some 1.9 incompatible syntax
- expanded RegisterReceiver class to record extra info
- expanded rspec coverage
- tweaked a README example
v0.5.1 (1st January 2008)
- Several documentation tweaks
- Improve support for parsing PDFs under windows (thanks to Jari Williamsson)
v0.5 (14th December 2007)
- Initial Release
|