v1.4 : 15/05/2014
New feature:
- Added configuration variable config_unzip_opts. This removes the dependency
on the unzip program and lets users use other unzipping programs such as 7z,
pkzipc and winzip.
Updates:
- Fixed list numbering.
- Improved list/paragraph indentation and corresponding code.
- Updated README with brief guidance on how this utility can be used to recover
text from a corrupted docx file.
v1.3 : 07/04/2014
New feature:
- Added support for handling lists (bullet, decimal, letter, roman) along
with (attempt at) indentation.
Updates:
- Added configuration variable config_twipsPerChar.
- Removed configuration variables: config_listIndent, config_exp_extra_deEscape.
- Text output omits deleted text. This matters when changes are being tracked
in the docx document.
- Text output omits non-document_text content marked by wp/wp14 tags.
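The list handling noted for v1.3 (bullet, decimal, letter, roman, plus
indentation) can be illustrated with a minimal Python sketch. This is not the
Perl script's code: the function names, the "*" bullet character, the
two-space indent step and the spreadsheet-style letter sequence (a..z, aa, ab)
are all assumptions made for the example.

```python
def to_roman(n):
    """Lowercase roman numeral for a positive integer."""
    pairs = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
             (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
             (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i")]
    out = []
    for value, symbol in pairs:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


def to_letter(n):
    """Letter label: 1 -> a, 26 -> z, 27 -> aa (assumed sequence)."""
    label = ""
    while n > 0:
        n, rem = divmod(n - 1, 26)
        label = chr(ord("a") + rem) + label
    return label


def list_marker(kind, n):
    """Marker text for the n-th item of a list of the given kind."""
    if kind == "bullet":
        return "*"          # assumed bullet character
    if kind == "decimal":
        return f"{n}."
    if kind == "letter":
        return f"{to_letter(n)}."
    if kind == "roman":
        return f"{to_roman(n)}."
    raise ValueError(f"unknown list kind: {kind}")


def format_item(kind, n, level, text, indent=2):
    """Indent by nesting level (assumed 2 spaces per level), then marker and text."""
    return " " * (indent * level) + list_marker(kind, n) + " " + text
```

For example, format_item("roman", 3, 1, "third point") yields the item indented
one level with an "iii." marker.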
v1.2 : 15/01/2012
New features:
- The Perl script now accepts a docx file from standard input and works with
input/output redirection. Please refer to the documentation for more
information.
- Script files and the configuration file can be installed in separate
directories on non-Windows systems when the Makefile is used for installation.
- The Linux Makefile also attempts to set the system configuration directory in
the installed Perl script to the desired directory.
- User-specific and system-wide configuration files can be maintained separately
even on Windows.
Updates:
- "-h" has to be given as the first argument to Perl script to get usage help.
- Added new configuration variable "config_tempDir".
- Configuration file is uniformly looked for in current directory, user
configuration directory (APPDATA on Windows and HOME on non-Windows), system
configuration directory (same location as script files on Windows, /etc or as
set during installation on non-Windows systems) in the specified order.
- Documentation has been updated with usage examples and information on how
.docx file text content can directly be viewed using Vim and Emacs editors.
- Improved handling of special (non-text) characters, along with support for
more non-text characters like fractions.
- Fixed Bug #3463033: added ' and " to docx specific escape character
conversions.
- Fixed the incorrect code that was committed during the earlier nullDevice fix
for Cygwin.
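The configuration file lookup order described in the v1.2 updates (current
directory, then user configuration directory, then system configuration
directory) can be sketched in Python as below. This is an illustration only,
not the script's Perl implementation: the function names and the filename and
system_dir parameters are hypothetical, and the platform check is a
simplifying assumption.

```python
import os
import sys


def candidate_config_dirs(system_dir, env=os.environ, platform=sys.platform):
    """Directories to search, in the order the changelog describes."""
    # User configuration directory: APPDATA on Windows, HOME elsewhere.
    user_var = "APPDATA" if platform.startswith("win") else "HOME"
    dirs = ["."]                      # 1. current directory
    user_dir = env.get(user_var)
    if user_dir:
        dirs.append(user_dir)         # 2. user configuration directory
    dirs.append(system_dir)           # 3. system configuration directory
    return dirs


def find_config(filename, system_dir, env=os.environ, platform=sys.platform):
    """Return the first existing config file path, or None if none is found."""
    for directory in candidate_config_dirs(system_dir, env, platform):
        path = os.path.join(directory, filename)
        if os.path.isfile(path):
            return path
    return None
```

The first match wins, so a file in the current directory takes precedence over
the user and system copies.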
v1.1 : 11/12/2011
New features:
- Added a check for existence of unzip command.
- Configuration file is looked for in HOME directory as well.
Updates:
- Configuration variables now begin with config_ .
- Fixed bugs #3003903, #3082018 and #3082035.
- Fixed nulldevice for Cygwin.
- Superscripted cross-references are placed within [...] now.
v1.0 : 04/10/2009
New features:
- Input argument can also be a directory holding the unzipped content of .docx
file.
- Windows wrapper script, and support for using CakeCmd command line unzipper.
- Configuration file support for easy control over settings.
- Windows installation script.
Updates:
- A hyperlink is not displayed if it is the same as the hyperlinked text, even
when the user has enabled hyperlink display.
- Improved handling of short line justification, capturing many cases that were
missed by the earlier approach.
- Path names containing spaces are now handled.
Please refer to the updated documentation for more details.
v0.4 : 06/09/2009
New features: [suggestions from "Sergei Kulakov (sergei>AT<dewia>DOT<com)"].
- user can control display of hyperlink along with linked text.
- TOC related cleanup. TOC was not addressed so far.
Updates:
- many new character conversions (check the script code for details).
- character conversion mappings are now organised in a tabular form.
- currency characters are converted to their respective full currency names.
- code tweaks to speed up the conversion process.
v0.3 : 23/09/2008
New features:
- center and right justification of text fitting in a line of (adjustable) 80
columns.
- indicating hyperlinked text along with the hyperlink.
- BSD makefile [Thanks to "Rene Maroufi" (info>AT<maroufi>DOT<net) for giving
guest access on an OpenBSD host for it].
Please refer to the release documentation for details.
- docx2txt.pl invocation has been changed a little.
- user involvement during installation is reduced.
- some suggestions on how Windows users can use this tool.
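The center and right justification introduced in v0.3 applies to text that
fits within a line of the configured width (80 columns by default). A minimal
Python sketch of that idea follows; the function name and parameters are
hypothetical, and the script's own Perl logic may differ in detail.

```python
def justify(text, align="left", width=80):
    """Pad text on the left so it appears centered or right-aligned.

    Text that does not fit within `width` columns is returned unchanged.
    """
    free = width - len(text)
    if free <= 0 or align == "left":
        return text
    if align == "center":
        return " " * (free // 2) + text   # no trailing padding is added
    if align == "right":
        return " " * free + text
    raise ValueError(f"unknown alignment: {align}")
```

Only leading spaces are added, so the output carries no trailing whitespace.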
v0.2 : 15/08/2008
Docx text extraction can now be done in two ways (check version README for
further details).
- docx2txt.sh file.docx
- docx2txt.pl infile.docx outfile.txt
v0.1 : 10/08/2008
Initial Sourceforge release with attempts to handle the following features
during text extraction.
- horizontal ruler, line breaks, paragraphs separation, tabs
- naive nested list formatting - assumes 8 levels of nesting; to deal with
deeper nesting, comment/uncomment the relevant lines in the Perl script. :)
- capitalisation of text blocks, i.e. in document.xml text is stored either as
lowercase or in mixed case, but in the corresponding text files generated by
MSOffice it comes out as all caps.
- character conversions (" ' < & > - ... etc.). The Euro character is converted
to E; this behaviour can be changed by commenting/uncommenting lines in the
Perl script.