1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230
|
docx2txt (http://docx2txt.sourceforge.net/) is a simple tool to generate
equivalent text files from (even corrupted) Microsoft .docx documents, with an
attempt towards preserving sufficient formatting and document information, and
appropriate character conversions for a good text experience.
Dependencies
------------
You will need to have following programs installed on your system for using this
tool.
Mandatory :
* PERL
* A commandline unzipping program that can silently extract single file from
zip archive to console/standard output/pipe.
Unzip, 7-Zip (7z), PKZip (pkzipc), and WinZip (wzunzip) are some such known
programs that serve the purpose when invoked with appropriate command/options.
Optional :
* Bash [Needed only if you want to use wrapper docx2txt.sh]
How to Use
----------
You can do the text conversion in different ways depending upon your usage
environment.
1. Using docx2txt.sh :
docx2txt.sh file.docx
OR
docx2txt.sh file
In both these cases output text will be saved in file.txt .
2. Using docx2txt.bat :
docx2txt.bat file.docx
OR
docx2txt.bat file
In both these cases output text will be saved in file.txt .
3. Using docx2txt.pl :
a. docx2txt.pl infile.docx outfile.txt
Use - as the name of output text file, to send extracted text to STDOUT,
that is, console.
b. docx2txt.pl file.docx
OR
docx2txt.pl file
In both these cases output text will be saved in file.txt .
Input can also be provided via STDIN (console) using - as the name of input
docx file. Moreover redirection of input/output is possible with this script,
making it feasible to invoke it in even more ways as illustrated below.
c. docx2txt.pl < infile.docx
In this case input is read from infile.docx and output is sent to STDOUT.
d. docx2txt.pl < infile.docx > outfile.txt
In this case input is read from infile.docx and output is sent to
outfile.txt .
e. cat infile.docx | ./docx2txt.pl
In this case content of infile.docx is read via STDIN and output is sent
to STDOUT.
f. cat infile.docx | ./docx2txt.pl - outfile.txt
In this case content of infile.docx is read via STDIN and output is sent
to outfile.txt .
Input argument in all the above cases can also be a directory holding the
unzipped content of a .docx file. This feature is particulary useful if you do
not have a commandline unzipping program as required in dependencies.
Usage help can be obtained by giving '-h' as the first argument to the script.
docx2txt.pl -h
Tune your Experience
--------------------
You can change following settings via docx2txt.config file that is looked for
- in the current directory,
- user configuration directory (APPDATA on Windows, HOME on non-Windows), and
- in the system configuration directory (same directory that holds the script
files on Windows, /etc or as set during installation on non-Windows),
in the specified order. In case script does not find any configuration file, it
continues with builtin default settings.
a. Path to unzip program, and relevant command/options to be passed to it [#]
b. Path to temp directory
c. Newline in output text file (Unix/Dos way)
d. Line width (used for short line justification)
e. Showing of hyperlink along with linked text
f. Twips per character, to obtain desired list indentation in text output
You can also adjust representative bullet indicators in docx2txt.pl, but be
careful while modifying it.
[#] Unzipping Program | Relevant Command/Options
------------------+------------------------------------------------------
unzip | -p
7z | e -so -tzip
pkzipc | -console -silent -translate=none -noarchiveextension
wzunzip | -c
Viewing the text content of Docx file in Editors and File browsers
------------------------------------------------------------------
1. MC (Midnight Commander)
-----------------------
You can add following binding in ~/.mc/bindings and view the text content of a
.docx file by hitting F3 key (assuming default key mappings) after moving the
cursor over concerned filename in mc pannel.
# Microsoft .docx Document
regex/\.(docx|DOCX|Docx)$
View=%view{ascii} docx2txt.pl %f -
2. VIm Editor
----------
You can add following lines in your ~/.vimrc to view the text content of a .docx
file directly when using vim.
"use docx2txt.pl to allow VIm to view the text content of a .docx file directly.
autocmd BufReadPre *.docx set ro
autocmd BufReadPost *.docx %!docx2txt.pl
Note that above .vimrc addition will allow you to view the text content of .docx
files specified as command line argument to vim, but not of those read using
":r file.docx".
Please refer to http://vimdoc.sourceforge.net/htmldoc/autocmd.html for more
information on autocmd.
3. Emacs Editor
------------
You can add following lines in your ~/.emacs file to view the text content of
a .docx file directly when using emacs.
(add-to-list 'auto-mode-alist '("\\.docx\\'" . docx2txt))
(defun docx2txt ()
"Run docx2txt on the entire buffer."
(shell-command-on-region (point-min) (point-max) "docx2txt.pl" t t))
Be warned that with above ~/.emacs code addition, if you happen to save the
buffer/file, it will overwrite the .docx file with the text content.
Please explore "Filters -- making things readable:" section at
http://www.emacswiki.org/emacs/CategoryExternalUtilities for more ways to view
.docx file text content directly in emacs.
Recovering text from corrupted .docx file
-----------------------------------------
A .docx file is a zip archive of a collection of XML files. Two kind of
corruptions - zip archive corruption, and component XML file(s) corruption, can
cause a common Docx Reader/Viewer to fail while reading the file.
The way docx2txt.pl extracts text content from .docx file, it is somewhat immune
to XML corruption, and can extract reasonable text content from even corrupted
XML file.
As for zip archive corruption, if you have an unzipper that can fix a corrupted
zip archive[#] and/or extract atleast required XML files from a corrupted zip
archive[*], you are ready to extract text from the corrupted .docx file. You may
temporarily need to rename .docx file as .zip file, if required by unzipper.
If this unzipper can extract specific files to pipe/standard output/console, you
can simply specify it in config file. Otherwise you can extract the archive
content in a directory, suitably named as per your need, and specify this
directory as the filename argument to the docx2txt.pl script.
[#] Program Name | Example Usage
--------------+------------------------------------------------------
zip | zip -FF corrupted.docx --out fixed.docx
ALZipCon | ALZipCon.exe -r corrupted.zip
pkzipc | pkzipc.exe -fix t.zip
winrar[GUI] | Repair archive, [*] Keep broken extracted files
Request
-------
If you are using this work directly/indirectly for non-personal purpose(s),
please inform the author about it along with relevant url(s), so that it can be
mentioned on the project homepage.
In case you come across some issue with it, or need a feature that can be
handled in docx to text conversion, please feel free to communicate. An
accompanying test .docx document depicting the issue/need and the corresponding
text file generated by MSOffice with character substitution enabled (or as you
would like the text file to be) will be helpful.
You can track the project via http://sourceforge.net/projects/docx2txt and refer
to project cvs if there have been changes since this release.
Disclaimer
----------
This program includes no warranty whatsoever. It is provided "AS IS". For more
information please read the COPYING document, which should be included with the
package, and describes the GNU Public License, which covers docx2txt.
Sandeep Kumar ( shimple0 -AT- yahoo .DOT. com )
|