v1.4 : 15/05/2014
- Added configuration variable config_unzip_opts. This removes the dependency on
the unzip program and lets users use other unzipping programs such as 7z,
pkzipc, or winzip.
- Fixed list numbering.
- Improved list/paragraph indentation and corresponding code.
- Updated README with brief guidance on how this utility can be used to recover
text from a corrupted docx file.
v1.3 : 07/04/2014
- Added support for handling lists (bullet, decimal, letter, roman), along
with an attempt at indentation.
- Added configuration variable config_twipsPerChar.
- Removed configuration variables: config_listIndent, config_exp_extra_deEscape.
- Text output omits deleted text. This matters when changes are being
tracked in the docx document.
- Text output omits non-document_text content marked by wp/wp14 tags.
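A minimal sketch of how a twips-per-character setting such as
config_twipsPerChar could turn a docx indentation value into leading spaces.
The function name and the default of 120 are assumptions made for
illustration; they are not taken from the script.

```python
# Hypothetical helper: convert an indentation given in twips
# (1/1440 of an inch) into a number of leading characters.
def indent_chars(indent_twips, twips_per_char=120):
    # twips_per_char stands in for config_twipsPerChar; 120 is an assumed default.
    if twips_per_char <= 0:
        raise ValueError("twips_per_char must be positive")
    # Integer division drops partial characters; negative indents clamp to 0.
    return max(0, indent_twips // twips_per_char)


if __name__ == "__main__":
    # A half-inch indent (720 twips) at 120 twips per character.
    print(repr(" " * indent_chars(720) + "list item"))
```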
v1.2 : 15/01/2012
- The Perl script now accepts a docx file from standard input and also works
with input/output redirection. Please refer to the documentation for
- Script files and the configuration file can be installed in separate
directories on (non-Windows) systems when the Makefile is used for installation.
- The Linux Makefile also attempts to set the system configuration directory in
the installed Perl script to the desired directory.
- User-specific and system-wide configuration files can be maintained separately,
even on Windows.
- "-h" has to be given as the first argument to Perl script to get usage help.
- Added new configuration variable "config_tempDir".
- The configuration file is looked for, in this order, in the current directory,
the user configuration directory (APPDATA on Windows and HOME on non-Windows),
and the system configuration directory (same location as the script files on
Windows, /etc or as set during installation on non-Windows systems).
- Documentation has been updated with usage examples and information on how
the text content of a .docx file can be viewed directly in the Vim and Emacs
editors.
- Improved handling of special (non-text) characters, along with support for
more non-text characters like fractions.
- Fixed Bug #3463033: added ' and " to docx specific escape character
- Fixed the incorrect code that was committed during the earlier nullDevice fix
for Cygwin.
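The configuration lookup order described in this release can be sketched as
follows. This is an illustration only: the function names and parameters are
hypothetical, and the actual Perl script may resolve these directories
differently.

```python
import os
import sys


def config_search_dirs(script_dir, system_dir="/etc",
                       platform=sys.platform, env=os.environ):
    """Return candidate directories in lookup order: current, user, system."""
    is_windows = platform.startswith("win")
    # User directory: APPDATA on Windows, HOME elsewhere.
    user_dir = env.get("APPDATA" if is_windows else "HOME")
    # System directory: script location on Windows, /etc (or the
    # directory chosen at installation time) elsewhere.
    system = script_dir if is_windows else system_dir
    # Skip entries that are unset (for example, no HOME in the environment).
    return [d for d in (os.curdir, user_dir, system) if d]


def find_config(filename, dirs, exists=os.path.isfile):
    """Return the first existing config path, or None if none is found."""
    for directory in dirs:
        path = os.path.join(directory, filename)
        if exists(path):
            return path
    return None
```

The first match wins, so a configuration file in the current directory
overrides the user one, which in turn overrides the system one.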
v1.1 : 11/12/2011
- Added a check for existence of unzip command.
- Configuration file is looked for in HOME directory as well.
- Configuration variables now begin with config_ .
- Fixed bugs #3003903, #3082018 and #3082035.
- Fixed nullDevice for Cygwin.
- Superscripted cross-references are placed within [...] now.
v1.0 : 04/10/2009
- Input argument can also be a directory holding the unzipped content of .docx
- Windows wrapper script, and support for using CakeCmd command line unzipper.
- Configuration file support for easy control over settings.
- Windows installation script.
- A hyperlink is not displayed if the hyperlink and the hyperlinked text are the
same, even when the user has enabled hyperlink display.
- Improved handling of short line justification, capturing many cases that were
missed by the earlier approach.
- Path names containing spaces are now handled.
Please refer to the updated documentation for more details.
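The hyperlink display rule from this release can be sketched as below. The
function name, the flag name, and the bracketed output format are assumptions
chosen for illustration, not the script's actual output.

```python
def render_link(text, url, show_hyperlinks=True):
    """Append the link target only when it adds information."""
    # Suppress the URL when display is disabled, when there is no URL,
    # or when the URL is identical to the visible text.
    if show_hyperlinks and url and url != text:
        return f"{text} [HYPERLINK: {url}]"
    return text


if __name__ == "__main__":
    print(render_link("project page", "http://example.com/"))
    print(render_link("http://example.com/", "http://example.com/"))
```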
v0.4 : 06/09/2009
New features: [suggestions from "Sergei Kulakov (sergei>AT<dewia>DOT<com)"].
- user can control display of the hyperlink along with the linked text.
- TOC related cleanup. TOC was not addressed so far.
- many new character conversions (check the script code for details).
- character conversion mappings are now organised in a tabular form.
- currency characters are converted to their respective full currency names.
- code tweaks to speed up the conversion process.
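A table-driven conversion of currency characters to full names might look like
the sketch below. The table contents and names are illustrative assumptions;
the mappings actually used are in the script code.

```python
# Hypothetical conversion table: one entry per currency character.
CURRENCY_NAMES = {
    "\u20ac": "Euro",   # euro sign
    "\u00a3": "Pound",  # pound sign
    "\u00a5": "Yen",    # yen sign
}


def convert_currency(text, table=CURRENCY_NAMES):
    """Replace each currency character found in the table with its name."""
    # Characters not present in the table pass through unchanged.
    return "".join(table.get(ch, ch) for ch in text)
```

Keeping the mappings in one table means a new conversion is a one-line
addition rather than a new code branch.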
v0.3 : 23/09/2008
- center and right justification of text fitting in a line of (adjustable) 80
- indicating hyperlinked text along with the hyperlink.
- BSD makefile [Thanks to "Rene Maroufi" (info>AT<maroufi>DOT<net) for giving
guest access on an OpenBSD host for it].
Please refer to the release documentation for details.
- docx2txt.pl invocation has been changed a little,
- user involvement during installation is reduced.
- some suggestions on how Windows users can use this tool.
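The center and right justification of short lines mentioned in this release
can be sketched as follows, assuming the adjustable 80 is a line width in
characters. The function name and mode strings are hypothetical.

```python
def justify(text, mode, width=80):
    """Pad a short line with leading spaces; longer lines are left alone."""
    if len(text) >= width:
        return text  # does not fit in one line, so no justification
    if mode == "center":
        # Only leading padding is added, so no trailing spaces are produced.
        return " " * ((width - len(text)) // 2) + text
    if mode == "right":
        return text.rjust(width)
    return text  # left-justified or unknown mode: unchanged
```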
v0.2 : 15/08/2008
Docx text extraction can now be done in two ways (check version README for
- docx2txt.sh file.docx
- docx2txt.pl infile.docx outfile.txt
v0.1 : 10/08/2008
Initial Sourceforge release with attempts to handle following features during
- horizontal ruler, line breaks, paragraphs separation, tabs
- naive nested list formatting - assumes 8 levels of nesting; to deal with
deeper nesting, comment/uncomment the relevant lines in the perl script. :)
- capitalisation of text blocks, i.e. in document.xml the text is stored either
in lowercase or in mixed case, but in the corresponding text files generated by
MSOffice it appears in all caps.
- character conversions (" ' < & > - ... etc.). The Euro character is converted
to E; you can change this behaviour by commenting/uncommenting lines in the
perl script.