Commit e27356ca authored by Hideki Yamane

Merge tag 'upstream/1.4'

Upstream version 1.4
parents 9676bea2 c539ea08
v1.4 : 15/05/2014
New feature:
- Added configuration variable config_unzip_opts. This removes dependency on
unzip program, and allows users to use unzipping programs like 7z, pkzipc,
winzip as well.
Updates:
- Fixed list numbering.
- Improved list/paragraph indentation and corresponding code.
- Updated README with brief guidance on how this utility can be used to recover
text from corrupted docx file.
v1.3 : 07/04/2014
New features:
New feature:
- Added support for handling lists (bullet, decimal, letter, roman) along
with (attempt at) indentation.
......
......@@ -89,7 +89,7 @@ d. You can now use this tool from within C:\docx2txt as follows.
perl docx2txt.pl C:/somedir/file.docx
perl docx2txt.pl C:\somedir\file.docx C:\otherdir\converted.txt
Please view README for further usage information.
Please view README for further examples using I/O redirection.
III. You can also install this utility via WInstall.bat and follow the
instructions during installation. WInstall.bat can be invoked in two ways.
......@@ -107,6 +107,9 @@ III. You can also install this utility via WInstall.bat and follow the
- Cygwin perl or Strawberry perl [http://strawberryperl.com/] or any other
Windows native perl implementation
- Cygwin unzip or UnZip for Windows [http://gnuwin32.sourceforge.net/downlinks/unzip.php]
- CakeCmd unzipper [http://www.quickzip.org/cakecmd.html]
- Any commandline unzipper meeting the dependencies requirement in README.
* Cygwin unzip or UnZip for Windows [http://gnuwin32.sourceforge.net/downlinks/unzip.php]
* 7z [http://www.7-zip.org/]
* pkzipc [http://www.pkware.com/software/pkzip/]
* wzunzip [http://www.winzip.com/]
* CakeCmd unzipper [http://www.quickzip.org/cakecmd.html]
docx2txt (http://docx2txt.sourceforge.net/) is a simple tool to generate
equivalent text files from Microsoft .docx documents, with an attempt towards
preserving sufficient formatting and document information, and appropriate
character conversions for a good text experience.
equivalent text files from (even corrupted) Microsoft .docx documents, with an
attempt towards preserving sufficient formatting and document information, and
appropriate character conversions for a good text experience.
You need to at least have perl installed on your system for using this tool.
Dependencies
------------
You will need to have the following programs installed on your system to use
this tool.
Mandatory :
* PERL
* A commandline unzipping program that can silently extract a single file from
a zip archive to console/standard output/pipe.
Unzip, 7-Zip (7z), PKZip (pkzipc), and WinZip (wzunzip) are some such known
programs that serve the purpose when invoked with appropriate command/options.
Optional :
* Bash [Needed only if you want to use wrapper docx2txt.sh]
How to Use
......@@ -66,8 +84,7 @@ environment.
Input argument in all the above cases can also be a directory holding the
unzipped content of a .docx file. This feature is particularly useful if you do
not have a commandline unzipping tool like Unzip/CakeCmd installed on your
system.
not have a commandline unzipping program as required in dependencies.
Usage help can be obtained by giving '-h' as the first argument to the script.
......@@ -85,7 +102,7 @@ You can change following settings via docx2txt.config file that is looked for
in the specified order. In case script does not find any configuration file, it
continues with builtin default settings.
a. Path to unzip program
a. Path to unzip program, and relevant command/options to be passed to it [#]
b. Path to temp directory
c. Newline in output text file (Unix/Dos way)
d. Line width (used for short line justification)
......@@ -95,6 +112,13 @@ f. Twips per character, to obtain desired list indentation in text output
You can also adjust representative bullet indicators in docx2txt.pl, but be
careful while modifying it.
[#] Unzipping Program | Relevant Command/Options
------------------+------------------------------------------------------
unzip | -p
7z | e -so -tzip
pkzipc | -console -silent -translate=none -noarchiveextension
wzunzip | -c
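As a rough illustration of how the two settings combine, the sketch below builds
the extraction command for each program in the table above. It follows the
'$config_unzip' $config_unzip_opts string that appears in docx2txt.pl in this
commit. The build_cmd helper, the bare program names, and file.docx are
assumptions made for the example, and nothing is executed.

```perl
use strict;
use warnings;

# Options copied from the table above; the bare program names stand in for
# whatever full path config_unzip would hold on a given system.
my %unzip_opts = (
    'unzip'   => '-p',
    '7z'      => 'e -so -tzip',
    'pkzipc'  => '-console -silent -translate=none -noarchiveextension',
    'wzunzip' => '-c',
);

# Hypothetical helper: assemble the command that extracts one archive member
# to standard output. It only returns the string; it does not run anything.
sub build_cmd {
    my ($prog, $opts, $archive, $member) = @_;
    return "'$prog' $opts \"$archive\" $member";
}

for my $prog (sort keys %unzip_opts) {
    print build_cmd($prog, $unzip_opts{$prog}, 'file.docx', 'word/document.xml'), "\n";
}
```

Each printed line is the kind of command the script would run in backticks to
read word/document.xml from the archive.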
Viewing the text content of Docx file in Editors and File browsers
------------------------------------------------------------------
......@@ -149,6 +173,35 @@ http://www.emacswiki.org/emacs/CategoryExternalUtilities for more ways to view
.docx file text content directly in emacs.
Recovering text from corrupted .docx file
-----------------------------------------
A .docx file is a zip archive of a collection of XML files. Two kinds of
corruption, zip archive corruption and component XML file(s) corruption, can
cause a common Docx Reader/Viewer to fail while reading the file.
Because of the way docx2txt.pl extracts text content from a .docx file, it is
somewhat immune to XML corruption, and can extract reasonable text content even
from a corrupted XML file.
As for zip archive corruption, if you have an unzipper that can fix a corrupted
zip archive[#] and/or extract at least the required XML files from a corrupted
zip archive[*], you are ready to extract text from the corrupted .docx file. You
may temporarily need to rename the .docx file as a .zip file, if required by the
unzipper.
If this unzipper can extract specific files to pipe/standard output/console, you
can simply specify it in config file. Otherwise you can extract the archive
content in a directory, suitably named as per your need, and specify this
directory as the filename argument to the docx2txt.pl script.
[#] Program Name | Example Usage
--------------+------------------------------------------------------
zip | zip -FF corrupted.docx --out fixed.docx
ALZipCon | ALZipCon.exe -r corrupted.zip
pkzipc | pkzipc.exe -fix t.zip
winrar[GUI] | Repair archive, [*] Keep broken extracted files
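To give a feel for why pattern-based extraction tolerates damaged XML, here is a
minimal sketch. It is a simplified illustration, not the set of substitutions
docx2txt.pl actually performs; the recover_text helper and the truncated sample
fragment are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical, deliberately truncated fragment of word/document.xml: the
# final </w:p> and </w:body> closing tags are missing, so a strict XML
# parser would reject it.
my $broken = '<w:body><w:p><w:r><w:t>Hello</w:t></w:r></w:p>'
           . '<w:p><w:r><w:t>world</w:t></w:r>';

# Hypothetical helper: recover text with plain substitutions, which never
# require the markup to be well-formed.
sub recover_text {
    my ($xml) = @_;
    $xml =~ s{</w:p>}{\n}g;    # paragraph ends become newlines
    $xml =~ s{<[^>]*>}{}g;     # drop every remaining complete tag
    return $xml;
}

print recover_text($broken), "\n";
```

Since the substitutions work on whatever tags are present, the text of both
paragraphs survives even though the second paragraph is never closed.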
Request
-------
......
......@@ -16,9 +16,17 @@
#
# Default : '/usr/bin/unzip'
#
#config_unzip => 'C:\Program Files\GnuWin32\bin\unzip.exe',
config_unzip => '/usr/bin/unzip',
#
# Specify the commandline option(s) to be supplied to the program specified in
# config_unzip, that allow silent extraction of specified file from zip archive
# to console/standard output/pipe.
#
# Default : '-p' (for unzip)
#
# config_unzip_opts => '-p',
#
# How the newline should be in output text file - "\n" or "\r\n".
#
......
......@@ -94,6 +94,11 @@
# roman) along with (attempt at) indentation.
# Added new configuration variable config_twipsPerChar.
# Removed configuration variable config_listIndent.
# 14/04/2014 - Fixed list numbering - lvl start value needs to be considered.
# Improved list indentation and corresponding code.
# 27/04/2014 - Improved paragraph content layout/indentation.
# 13/05/2014 - Added new configuration variable config_unzip_opts. Users can
# now use unzipping programs like 7z, pkzipc, winzip as well.
#
......@@ -103,6 +108,7 @@
#
our $config_unzip = '/usr/bin/unzip'; # Windows path like 'C:/path/to/unzip.exe'
our $config_unzip_opts = '-p'; # To extract file on standard output
our $config_newLine = "\n"; # Alternative is "\r\n".
our $config_lineWidth = 80; # Line width, used for short line justification.
......@@ -236,6 +242,10 @@ my %splchars = (
"\xA0" => '!=', # <neq>
"\xA4" => '<=', # <leq>
"\xA5" => '>=', # <geq>
},
"\xEF\x82" => {
"\xB7" => '*' # small white square
}
);
......@@ -391,10 +401,12 @@ else {
# Extract xml document content from argument docx file/directory.
#
my $unzip_cmd = "'$config_unzip' $config_unzip_opts";
if ($inpIsDir eq 'y') {
readFileInto("$ARGV[0]/word/document.xml", $content);
} else {
$content = `"$config_unzip" -p "$ARGV[0]" word/document.xml 2>$nullDevice`;
$content = `$unzip_cmd "$ARGV[0]" word/document.xml 2>$nullDevice`;
}
cleandie "Failed to extract required information from <$inputFileName>!\n" if ! $content;
......@@ -426,7 +438,7 @@ binmode $txtfile; # Ensure no auto-conversion of '\n' to '\r\n' on Windows.
if ($inpIsDir eq 'y') {
readFileInto("$ARGV[0]/word/_rels/document.xml.rels", $_);
} else {
$_ = `"$config_unzip" -p "$ARGV[0]" word/_rels/document.xml.rels 2>$nullDevice`;
$_ = `$unzip_cmd "$ARGV[0]" word/_rels/document.xml.rels 2>$nullDevice`;
}
my %docurels;
......@@ -443,7 +455,7 @@ $_ = "";
if ($inpIsDir eq 'y') {
readOptionalFileInto("$ARGV[0]/word/numbering.xml", $_);
} else {
$_ = `"$config_unzip" -p "$ARGV[0]" word/numbering.xml 2>$nullDevice`;
$_ = `$unzip_cmd "$ARGV[0]" word/numbering.xml 2>$nullDevice`;
}
my %abstractNum;
......@@ -463,15 +475,16 @@ if ($_) {
{
my $abstractNumId = $1, $temp = $2;
while ($temp =~ /<w:lvl w:ilvl="(\d+)"[^>]*>.*?<w:numFmt w:val="(.*?)"[^>]*>.*?<w:lvlText w:val="(.*?)"[^>]*>.*?<w:ind w:left="(\d+)" [^>]*>/g )
while ($temp =~ /<w:lvl w:ilvl="(\d+)"[^>]*><w:start w:val="(\d+)"[^>]*><w:numFmt w:val="(.*?)"[^>]*>.*?<w:lvlText w:val="(.*?)"[^>]*>.*?<w:ind w:left="(\d+)" w:hanging="(\d+)"[^>]*>/g )
{
# $2: NumFmt, $3: LvlText, $4: Indent (twips)
# $2: Start $3: NumFmt, $4: LvlText, ($5,$6): (Indent (twips), hanging)
@{$abstractNum{"$abstractNumId:$1"}} = (
$NFList{$2},
$3,
int (($4 / $config_twipsPerChar) + 0.5),
$4
$NFList{$3},
$4,
$2,
int ((($5-$6) / $config_twipsPerChar) + 0.5),
$5
);
}
}
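The indentation arithmetic in the hunk above can be tried in isolation. This is
a minimal sketch: the value of 120 twips per character is assumed purely for
illustration (the script takes it from config_twipsPerChar), and indent_chars is
a hypothetical helper name.

```perl
use strict;
use warnings;

# Assumed value for illustration only; the script reads config_twipsPerChar.
my $twips_per_char = 120;

# Hypothetical helper: convert a w:ind left/hanging pair (in twips) into a
# number of leading spaces, rounding to the nearest whole character.
sub indent_chars {
    my ($left, $hanging) = @_;
    return int((($left - $hanging) / $twips_per_char) + 0.5);
}

printf "%d\n", indent_chars(720, 360);    # (720 - 360) / 120 = 3
printf "%d\n", indent_chars(1440, 360);   # (1440 - 360) / 120 = 9
```

Subtracting the hanging value places the list marker at the start of the hanging
region rather than at the left edge of the paragraph text.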
......@@ -591,38 +604,39 @@ my $ssiz = 1;
sub listNumbering {
my $aref = \@{$abstractNum{"$N2ANId[$_[0]]:$_[1]"}};
my $key = "$N2ANId[$_[0]]:$_[1]";
my $ccnt;
if ($aref->[3] < $twipStack[$ssiz-1]) {
while ($twipStack[$ssiz-1] > $aref->[3]) {
pop @twipStack;
pop @keyStack;
pop @lastCnt;
$ssiz--;
my $lvlText;
if ($aref->[0] != \&bullet) {
my $key = "$N2ANId[$_[0]]:$_[1]";
my $ccnt;
if ($aref->[4] < $twipStack[$ssiz-1]) {
while ($twipStack[$ssiz-1] > $aref->[4]) {
pop @twipStack;
pop @keyStack;
pop @lastCnt;
$ssiz--;
}
}
}
if ($aref->[3] == $twipStack[$ssiz-1]) {
if ($key eq $keyStack[$ssiz-1]) {
++$lastCnt[$ssiz-1];
if ($aref->[4] == $twipStack[$ssiz-1]) {
if ($key eq $keyStack[$ssiz-1]) {
++$lastCnt[$ssiz-1];
}
else {
$keyStack[$ssiz-1] = $key;
$lastCnt[$ssiz-1] = $aref->[2];
}
}
else {
$keyStack[$ssiz-1] = $key;
$lastCnt[$ssiz-1] = 1;
elsif ($aref->[4] > $twipStack[$ssiz-1]) {
push @twipStack, $aref->[4];
push @keyStack, $key;
push @lastCnt, $aref->[2];
$ssiz++;
}
}
elsif ($aref->[3] > $twipStack[$ssiz-1]) {
push @twipStack, $aref->[3];
push @keyStack, $key;
push @lastCnt, 1;
$ssiz++;
}
$ccnt = $lastCnt[$ssiz-1];
my $lvlText;
$ccnt = $lastCnt[$ssiz-1];
if ($aref->[0] != \&bullet) {
$lvlText = $aref->[1];
$lvlText =~ s/%\d([^%]*)$/($aref->[0]->($ccnt)).$1/oe;
......@@ -633,7 +647,7 @@ sub listNumbering {
$lvlText = $aref->[0]->($aref->[1]);
}
return ' ' x $aref->[2] . $lvlText . ' ';
return ' ' x $aref->[3] . $lvlText . ' ';
}
#
......@@ -644,8 +658,6 @@ sub processParagraph {
my $para = $_[0] . "$config_newLine";
my $align = $1 if ($_[0] =~ /<w:jc w:val="([^"]*?)"\/>/);
$para =~ s|<w:numPr><w:ilvl w:val="(\d+)"/><w:numId w:val="(\d+)"\/>|listNumbering($2,$1)|oge;
$para =~ s/<.*?>//og;
return justify($align,$para) if $align;
......@@ -662,12 +674,9 @@ $content =~ s/<?xml .*?\?>(\r)?\n//;
$content =~ s{<(wp14|wp):[^>]*>.*?</\1:[^>]*>}||og;
# Remove the field instructions (instrText) and data (fldData).
$content =~ s|<w:instrText[^>]*>.*?</w:instrText>||og;
$content =~ s|<w:fldData[^>]*>[^<]*?</w:fldData>||og;
# Remove deleted text.
$content =~ s|<w:delText[^>]*>.*?</w:delText>||og;
# Remove the field instructions (instrText) and data (fldData), and deleted
# text.
$content =~ s{<w:(instrText|fldData|delText)[^>]*>.*?</w:\1>}||ogs;
# Mark cross-reference superscripting within [...].
$content =~ s|<w:vertAlign w:val="superscript"/></w:rPr><w:t>(.*?)</w:t>|[$1]|og;
......@@ -681,9 +690,14 @@ $content =~ s{<w:caps/>.*?(<w:t>|<w:t [^>]+>)(.*?)</w:t>}/uc $2/oge;
$content =~ s{<w:hyperlink r:id="(.*?)".*?>(.*?)</w:hyperlink>}/hyperlink($1,$2)/oge;
$content =~ s/<w:p[^>]+?>(.*?)<\/w:p>/processParagraph($1)/oge;
$content =~ s|<w:numPr><w:ilvl w:val="(\d+)"/><w:numId w:val="(\d+)"\/>|listNumbering($2,$1)|oge;
$content =~ s{<w:ind w:(left|firstLine)="(\d+)"( w:hanging="(\d+)")?[^>]*>}|' ' x int((($2-$4)/$config_twipsPerChar)+0.5)|oge;
$content =~ s{<w:p [^/>]+?/>|<w:br/>}|$config_newLine|og;
$content =~ s/<w:p[^>]+?>(.*?)<\/w:p>/processParagraph($1)/ogse;
$content =~ s{<w:p [^/>]+?/>|</w:p>|<w:br/>}|$config_newLine|og;
$content =~ s/<.*?>//og;
......@@ -691,7 +705,7 @@ $content =~ s/<.*?>//og;
# Convert non-ASCII characters/character sequences to ASCII characters.
#
$content =~ s/(\xC2|\xC3|\xCF|\xE2.)(.)/($splchars{$1}{$2} ? $splchars{$1}{$2} : $1.$2)/oge;
$content =~ s/(\xC2|\xC3|\xCF|\xE2.|\xEF.)(.)/($splchars{$1}{$2} ? $splchars{$1}{$2} : $1.$2)/oge;
#
# Convert docx specific (reserved HTML/XHTML) escape characters.
......