Commit 5f159dbe authored by Hideki Yamane 🐈

Imported Upstream version 1.2

Sandeep Kumar ( shimple0 -AT- yahoo .DOT. com )
#
# BSD makefile for docx2txt
#
BINDIR ?= /usr/local/bin
CONFIGDIR ?= /etc
INSTALL != which install
BINFILES = docx2txt.sh docx2txt.pl
CONFIGFILE = docx2txt.config
.PHONY: install installbin installconfig
install: installbin installconfig
installbin: $(BINFILES)
	[ -d $(BINDIR) ] || mkdir -p $(BINDIR)
	$(INSTALL) -m 755 $> $(BINDIR)
installconfig: $(CONFIGFILE)
	[ -d $(CONFIGDIR) ] || mkdir -p $(CONFIGDIR)
	$(INSTALL) -m 644 $> $(CONFIGDIR)
v1.2 : 15/01/2012
New features:
- Perl script usage is extended to accept docx file from standard input. It also
works with input/output redirection now. Please refer to the documentation for
more information.
- Script files and configuration file can be installed in separate directories
on (non-Windows) systems using Makefile for installation.
- Linux Makefile also attempts to update the system configuration directory to
desired directory in installed Perl script.
- User specific and system wide configuration files can be maintained separately
even on windows.
Updates:
- "-h" has to be given as the first argument to Perl script to get usage help.
- Added new configuration variable "config_tempDir".
- Configuration file is uniformly looked for in current directory, user
configuration directory (APPDATA on Windows and HOME on non-Windows), system
configuration directory (same location as script files on Windows, /etc or as
set during installation on non-Windows systems) in the specified order.
- Documentation has been updated with usage examples and information on how
.docx file text content can directly be viewed using Vim and Emacs editors.
- Improved handling of special (non-text) characters, along with support for
more non-text characters like fractions.
- Fixed Bug #3463033: added ' and " to docx specific escape character
conversions.
- Fixed the wrong code that had got committed during earlier fixing of
nullDevice for Cygwin.
v1.1 : 11/12/2011
New features:
- Added a check for existence of unzip command.
- Configuration file is looked for in HOME directory as well.
Updates:
- Configuration variables now begin with config_ .
- Fixed bugs #3003903, #3082018 and #3082035.
- Fixed nulldevice for Cygwin.
- Superscripted cross-references are placed within [...] now.
v1.0 : 04/10/2009
New features:
- Input argument can also be a directory holding the unzipped content of .docx
file.
- Windows wrapper script, and support for using CakeCmd command line unzipper.
- Configuration file support for easy control over settings.
- Windows installation script.
Updates:
- Hyperlink is not displayed if hyperlink and hyperlinked text are same, even
though user has enabled hyperlink display.
- Improved handling of short line justification, capturing many cases that were
missed in earlier approach.
- Path names containing spaces are now handled.
Please refer to the updated documentation for more details.
v0.4 : 06/09/2009
New features: [suggestions from "Sergei Kulakov (sergei>AT<dewia>DOT<com)"].
- user can control display of hyperlink along with linked text.
- TOC related cleanup. TOC was not addressed so far.
Updates:
- many new character conversions (check the script code for details).
- character conversion mappings are now organised in a tabular form.
- currency characters are converted to respective full currency name.
- code tweaks to speedup the conversion process.
v0.3 : 23/09/2008
New features:
- center and right justification of text fitting in a line of (adjustable) 80
columns.
- indicating hyperlinked text along with the hyperlink.
- BSD makefile [Thanks to "Rene Maroufi" (info>AT<maroufi>DOT<net) for giving
guest access on an OpenBSD host for it].
Please refer to the release documentation for details.
- docx2txt.pl invocation has been changed a little,
- user involvement during installation is reduced.
- some suggestions on how Windows users can use this tool.
v0.2 : 15/08/2008
Docx text extraction can now be done in two ways (check version README for
further details).
- docx2txt.sh file.docx
- docx2txt.pl infile.docx outfile.txt
v0.1 : 10/08/2008
Initial Sourceforge release with attempts to handle following features during
text extraction.
- horizontal ruler, line breaks, paragraphs separation, tabs
- naive nested list formatting - assumed 8 level nesting, however if you want
to deal with further nesting, play comment-uncomment in perl script. :)
- capitalisation of text blocks i.e. in document.xml text is stored either as
lowercase or in mixed case, but in corresponding text files generated by
MSOffice it comes as all caps.
- character conversions (" ' < & > - ... etc.). Euro character is converted to
E, however you can change this behaviour by comment-uncomment in perl script.
Non-Windows users, please adjust following executables paths before proceeding
for installation.
- #! path for env in docx2txt.sh and docx2txt.pl
- path for unzip in docx2txt.config
You can skip installing docx2txt.sh and docx2txt.bat wrapper scripts (as
applicable) during manual installation. These check for overwriting the output
text file and have slightly restricted usage as compared to core docx2txt.pl
script. [check README for details]
However if you are using CakeCmd unzipper, docx2txt.bat can be quite handy as
it internally manages unzipping the .docx files that do not have .zip extension.
Installation on Linux, Cygwin, BSD and similar systems
------------------------------------------------------
Type "make" as root to install docx2txt script files for all users in
/usr/local/bin and system-wide configuration file in /etc .
If you want to install these in some other directory, you can do so via
make BINDIR=/path/to/scripts/directory CONFIGDIR=/path/to/config/directory
BSD users can use either GNU make or BSD make.
Linux "make" installation also attempts to set systemConfigDir variable in
installed docx2txt.pl file to specified CONFIGDIR.
You will need make and install utilities installed on your system for
installation via Makefile.
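Before running make, it can help to confirm that the required utilities are on
your PATH. The following is only a minimal sketch and is not shipped with
docx2txt; the function name check_tools is hypothetical, and the tool names in
the usage line are taken from the requirements stated in this file and README.

```shell
#!/bin/sh
# check_tools: hypothetical helper, not part of the docx2txt package.
# Prints each named tool that is not found on PATH and returns 1 if any
# tool is missing, 0 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example invocation:
#   check_tools make install perl unzip
```

If the function reports nothing and returns 0, the Makefile based installation
should at least find the commands it relies on.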
If you do not want to use the Makefile for installation, you can follow these
steps for manual installation.
1. Copy docx2txt.pl, docx2txt.sh and docx2txt.config to the desired directories.
cp docx2txt.pl docx2txt.sh /path/to/scripts/directory
cp docx2txt.config /path/to/config/directory
2. Change the permission of copied files to 755 for docx2txt.pl and docx2txt.sh,
and 644 for docx2txt.config .
chmod 755 /path/to/scripts/directory/docx2txt.*
chmod 644 /path/to/config/directory/docx2txt.config
3. Change the value of systemConfigDir variable (in non-Windows settings) in
installed docx2txt.pl file from "/etc" to specified config directory.
4. Add the concerned scripts directory to your PATH, if not already in PATH.
PATH=$PATH:/path/to/scripts/directory
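Manual steps 1 to 3 above can be sketched as a small shell function. This is an
illustration under stated assumptions, not a script shipped with docx2txt: the
name manual_install is hypothetical, it must be run from the directory holding
the unpacked files, and it assumes that docx2txt.pl contains the literal text
"/etc"; (the same text the Linux Makefile substitutes) and that the config
directory name contains no %, & or \ characters.

```shell
#!/bin/sh
# manual_install: hypothetical helper mirroring manual steps 1-3.
# Usage: manual_install /path/to/scripts/directory /path/to/config/directory
manual_install() {
    scripts_dir=$1
    config_dir=$2

    mkdir -p "$scripts_dir" "$config_dir" || return 1

    # Step 1: copy the wrapper script and the configuration file.
    cp docx2txt.sh "$scripts_dir" || return 1
    cp docx2txt.config "$config_dir" || return 1

    # Steps 1 and 3 for docx2txt.pl: write a copy in which the assumed
    # literal "/etc"; is replaced by the chosen configuration directory.
    sed "s%\"/etc\";%\"$config_dir\";%" docx2txt.pl \
        > "$scripts_dir/docx2txt.pl" || return 1

    # Step 2: scripts get mode 755, the configuration file gets mode 644.
    chmod 755 "$scripts_dir/docx2txt.pl" "$scripts_dir/docx2txt.sh" || return 1
    chmod 644 "$config_dir/docx2txt.config" || return 1
}
```

Step 4 (extending PATH) is left to the caller, since it only affects the
current shell session.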
Installation on Windows
-----------------------
I. You can install minimal Cygwin packages from http://www.cygwin.com/ to have
working bash, cat, env, install, make, perl and unzip utilities and thus
create the required Cygwin environment for using this utility.
II. If you do not want to install even minimal Cygwin, you can try following
sequence for manual installation.
a. Get following files from /usr/bin/ of cygwin installation and place them in,
say C:\docx2txt .
cygwin1.dll
perl.exe
cygperl*.dll
unzip.exe
cygcrypt*.dll
b. Copy docx2txt.pl, docx2txt.bat and docx2txt.config to C:\docx2txt .
c. Change path for unzip in docx2txt.config to C:/docx2txt/unzip.exe and path
for perl in docx2txt.bat to C:\docx2txt\perl.exe .
d. You can now use this tool from within C:\docx2txt as follows.
docx2txt.bat file.docx
docx2txt.bat path-to-directory\file.docx
perl docx2txt.pl file.docx
perl docx2txt.pl directory\file.docx -
perl docx2txt.pl directory/file.docx file.txt
perl docx2txt.pl C:/somedir/file.docx
perl docx2txt.pl C:\somedir\file.docx C:\otherdir\converted.txt
Please view README for further usage information.
III. You can also install this utility via WInstall.bat and follow the
instructions during installation. WInstall.bat can be invoked in two ways.
WInstall.bat installation-folder-name
WInstall.bat
In second case, install script will ask user for installation folder name.
It is advisable to have working installations of perl and at least one command
line unzipper (Unzip/CakeCmd) before running this install script, so that it
can automatically set the desired paths in installed files.
You can use
- Cygwin perl or Strawberry perl [http://strawberryperl.com/] or any other
Windows native perl implementation
- Cygwin unzip or UnZip for Windows [http://gnuwin32.sourceforge.net/downlinks/unzip.php]
- CakeCmd unzipper [http://www.quickzip.org/cakecmd.html]
#
# Makefile for docx2txt
#
BINDIR ?= /usr/local/bin
CONFIGDIR ?= /etc
INSTALL = $(shell which install 2>/dev/null)
ifeq ($(INSTALL),)
$(error "Need 'install' to install docx2txt")
endif
PERL = $(shell which perl 2>/dev/null)
ifeq ($(PERL),)
$(warning "*** Make sure 'perl' is installed and is in your PATH, before running the installed script. ***")
endif
BINFILES = docx2txt.sh docx2txt.pl
CONFIGFILE = docx2txt.config
.PHONY: install installbin installconfig
install: installbin installconfig
installbin: $(BINFILES)
	@echo "Installing script files [$(BINFILES)] in \"$(BINDIR)\" .."
	@[ -d "$(BINDIR)" ] || mkdir -p "$(BINDIR)"
	$(INSTALL) -m 755 $^ "$(BINDIR)"
ifneq ($(PERL),)
	@echo "Setting systemConfigDir to [$(CONFIGDIR)] in \"$(BINDIR)/docx2txt.pl\" .."
	$(PERL) -pi -e "s%\"/etc\";%\"$(CONFIGDIR)\";%" "$(BINDIR)/docx2txt.pl"\
	&& rm -f "$(BINDIR)/docx2txt.pl.bak"
else
	@echo "*** Set systemConfigDir to \"$(CONFIGDIR)\" in \"$(BINDIR)/docx2txt.pl\"."
endif
installconfig: $(CONFIGFILE)
	@echo "Installing config file [$(CONFIGFILE)] in \"$(CONFIGDIR)\" .."
	@[ -d "$(CONFIGDIR)" ] || mkdir -p "$(CONFIGDIR)"
	$(INSTALL) -m 644 $^ "$(CONFIGDIR)"
docx2txt (http://docx2txt.sourceforge.net/) is a simple tool to generate
equivalent text files from Microsoft .docx documents, with an attempt towards
preserving sufficient formatting and document information, and appropriate
character conversions for a good text experience.
You need at least perl installed on your system to use this tool.
How to Use
----------
You can do the text conversion in different ways depending upon your usage
environment.
1. Using docx2txt.sh :
docx2txt.sh file.docx
OR
docx2txt.sh file
In both these cases output text will be saved in file.txt .
2. Using docx2txt.bat :
docx2txt.bat file.docx
OR
docx2txt.bat file
In both these cases output text will be saved in file.txt .
3. Using docx2txt.pl :
a. docx2txt.pl infile.docx outfile.txt
Use - as the name of output text file, to send extracted text to STDOUT,
that is, console.
b. docx2txt.pl file.docx
OR
docx2txt.pl file
In both these cases output text will be saved in file.txt .
Input can also be provided via STDIN (console) using - as the name of input
docx file. Moreover redirection of input/output is possible with this script,
making it feasible to invoke it in even more ways as illustrated below.
c. docx2txt.pl < infile.docx
In this case input is read from infile.docx and output is sent to STDOUT.
d. docx2txt.pl < infile.docx > outfile.txt
In this case input is read from infile.docx and output is sent to
outfile.txt .
e. cat infile.docx | ./docx2txt.pl
In this case content of infile.docx is read via STDIN and output is sent
to STDOUT.
f. cat infile.docx | ./docx2txt.pl - outfile.txt
In this case content of infile.docx is read via STDIN and output is sent
to outfile.txt .
Input argument in all the above cases can also be a directory holding the
unzipped content of a .docx file. This feature is particularly useful if you do
not have a commandline unzipping tool like Unzip/CakeCmd installed on your
system.
Usage help can be obtained by giving '-h' as the first argument to the script.
docx2txt.pl -h
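The two-argument form shown in 3a lends itself to converting several files in
one go. The following is a minimal sketch, not part of docx2txt: the function
name convert_dir and the DOCX2TXT override variable are hypothetical, and the
extension is assumed to be spelled .docx exactly.

```shell
#!/bin/sh
# convert_dir: hypothetical helper, not part of the docx2txt package.
# Converts every *.docx file in a directory using the documented form
#   docx2txt.pl infile.docx outfile.txt
# DOCX2TXT is an assumed override for the converter command; it defaults
# to docx2txt.pl found on PATH.
convert_dir() {
    dir=${1:-.}
    for infile in "$dir"/*.docx; do
        [ -e "$infile" ] || continue      # unmatched glob stays literal
        outfile=${infile%.docx}.txt       # file.docx -> file.txt
        "${DOCX2TXT:-docx2txt.pl}" "$infile" "$outfile" || return 1
    done
}

# Example invocation:
#   convert_dir /path/to/documents
```

Unlike the wrapper scripts, this sketch does not check whether an output file
already exists before writing it.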
Tune your Experience
--------------------
You can change following settings via docx2txt.config file that is looked for
- in the current directory,
- user configuration directory (APPDATA on Windows, HOME on non-Windows), and
- in the system configuration directory (same directory that holds the script
files on Windows, /etc or as set during installation on non-Windows),
in the specified order. In case script does not find any configuration file, it
continues with builtin default settings.
a. Path to unzip program
b. Path to temp directory
c. Newline in output text file (Unix/Dos way)
d. List level indentation amount
e. Line width (used for short line justification)
f. Showing of hyperlink along with linked text
g. Extra conversion of &...; sequences [Experimental, not needed normally]
You can also adjust list element indicator characters for different levels, in
docx2txt.pl to suit your formatting taste. Currently 8 level list nesting is
assumed, however if you want to deal with deeper nesting, you can adjust that
as well in the perl script, by following the related comments there.
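The configuration lookup order described above can be illustrated for
non-Windows systems with a short shell function. This is a sketch only: the
name find_config is hypothetical, and it assumes that the first file found is
the one used, which is how the stated order reads but is not confirmed here
against the Perl code.

```shell
#!/bin/sh
# find_config: hypothetical helper illustrating the documented lookup order
# on non-Windows systems: current directory, HOME, system config directory.
# Prints the first docx2txt.config found and returns 0; returns 1 when no
# file exists (docx2txt.pl would then use its builtin defaults).
# The optional argument is the system configuration directory (default /etc).
find_config() {
    system_dir=${1:-/etc}
    for dir in . "$HOME" "$system_dir"; do
        if [ -f "$dir/docx2txt.config" ]; then
            printf '%s\n' "$dir/docx2txt.config"
            return 0
        fi
    done
    return 1
}

# Example invocation:
#   find_config /etc
```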
Viewing the text content of Docx file in Editors and File browsers
------------------------------------------------------------------
1. MC (Midnight Commander)
-----------------------
You can add the following binding in ~/.mc/bindings and view the text content of
a .docx file by hitting F3 key (assuming default key mappings) after moving the
cursor over the concerned filename in the mc panel.
# Microsoft .docx Document
regex/\.(docx|DOCX|Docx)$
View=%view{ascii} docx2txt.pl %f -
2. VIm Editor
----------
You can add following lines in your ~/.vimrc to view the text content of a .docx
file directly when using vim.
"use docx2txt.pl to allow VIm to view the text content of a .docx file directly.
autocmd BufReadPre *.docx set ro
autocmd BufReadPost *.docx %!docx2txt.pl
Note that above .vimrc addition will allow you to view the text content of .docx
files specified as command line argument to vim, but not of those read using
":r file.docx".
Please refer to http://vimdoc.sourceforge.net/htmldoc/autocmd.html for more
information on autocmd.
3. Emacs Editor
------------
You can add following lines in your ~/.emacs file to view the text content of
a .docx file directly when using emacs.
(add-to-list 'auto-mode-alist '("\\.docx\\'" . docx2txt))
(defun docx2txt ()
"Run docx2txt on the entire buffer."
(shell-command-on-region (point-min) (point-max) "docx2txt.pl" t t))
Be warned that with above ~/.emacs code addition, if you happen to save the
buffer/file, it will overwrite the .docx file with the text content.
Please explore "Filters -- making things readable:" section at
http://www.emacswiki.org/emacs/CategoryExternalUtilities for more ways to view
.docx file text content directly in emacs.
Request
-------
If you are using this work directly/indirectly for non-personal purpose(s),
please inform the author about it along with relevant url(s), so that it can be
mentioned on the project homepage.
In case you come across some issue with it, or need a feature that can be
handled in docx to text conversion, please feel free to communicate. An
accompanying test .docx document depicting the issue/need and the corresponding
text file generated by MSOffice with character substitution enabled (or as you
would like the text file to be) will be helpful.
You can track the project via http://sourceforge.net/projects/docx2txt and refer
to project cvs if there have been changes since this release.
Disclaimer
----------
This program includes no warranty whatsoever. It is provided "AS IS". For more
information please read the COPYING document, which should be included with the
package, and describes the GNU General Public License, which covers docx2txt.
Sandeep Kumar ( shimple0 -AT- yahoo .DOT. com )
1. Handle lists in better way. [partly worked on, target latest by v2.0]
2. Heuristics based cleanup of damaged document content. [Looking for more test samples.]
3. Extract images. Now there has been a user request as well. [target pre v2.0]
4. Handle footnotes.
5. Improve table and short line justification handling. Ideally table columns
in a single row should be separated by pipe. Short line justification needs
to be adjusted to situations when tab occurs in line. A quick look into these
issues suggests that logic/code will need to be reorganised to handle these.
6. Create a simple manpage, hopefully after resolving footnote and list issues.
7. Implement simple state-machine for speedup [partially worked towards it].
8. XML parsing??? and making things more efficient. When it has matured enough,
maybe a C/C++ version should be looked into.
@echo off
:: docx2txt, a command-line utility to convert Docx documents to text format.
:: Copyright (C) 2008-now Sandeep Kumar
::
:: This program is free software; you can redistribute it and/or modify
:: it under the terms of the GNU General Public License as published by
:: the Free Software Foundation; either version 3 of the License, or
:: (at your option) any later version.
::
:: This program is distributed in the hope that it will be useful,
:: but WITHOUT ANY WARRANTY; without even the implied warranty of
:: MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
:: GNU General Public License for more details.
::
:: You should have received a copy of the GNU General Public License
:: along with this program; if not, write to the Free Software
:: Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
::
:: A simple commandline installer for docx2txt on Windows.
::
:: Author : Sandeep Kumar (shimple0 -AT- Yahoo .DOT. COM)
::
:: ChangeLog :
::
:: 02/10/2009 - Initial version of command line installation script for
:: Windows users. Script will prompt user for perl, unzip and
:: cakecmd paths and will update these paths in the installed
:: files using perl, if perl path is valid. Else it will simply
:: copy the concerned files to the installation folder.
::
::
:: Ensure that required command extensions are enabled.
::
setlocal enableextensions
setlocal enabledelayedexpansion
echo.
echo Welcome to command line installer for docx2txt.
echo.
::
:: Check if this install script is invoked correctly.
::
if not "%~2" == "" (
echo.
echo Usage : "%~0" [WhereToInstall]
echo.
echo WhereToInstall specifies a folder to install into.
echo.
echo If destination folder is not specified on command line,
echo then it will be asked for during the installation.
echo.
goto END
)
::
:: Check if destination folder was specified on command line, else ask for it.
::
if "%~1" == "" (
echo.
echo Where should the docx2txt tool be installed? Specify the location
echo without surrounding quotes.
echo.
set /P destdir=Installation Folder :
echo.
) else (
set destdir=%~1
)
if not exist "%destdir%" (
echo.
echo ** Folder "%destdir%" does not exist. It will be created now.
echo.
mkdir "%destdir%"
)
::
:: Check if user specified destdir is a valid folder or not.
::
pushd "%destdir%" 2>nul
if ERRORLEVEL 1 (
echo.
echo ** "%destdir%" does not specify a valid folder name.
echo ** Exiting installer.
echo.
goto END
) else if ERRORLEVEL 0 (
popd
)
echo.
echo Please specify fully qualified paths to utilities when requested.
echo Perl.exe is required for docx2txt tool as well as for this installation.
echo.
set /A attempts=0
:GET_PERL_PATH
set /P PERL=Path to Perl.exe :
call :CHECK_FILE_EXISTENCE "%PERL%" "perl"
if ERRORLEVEL 7 (
set /A attempts=attempts+1
if !attempts! == 3 (
echo.
echo Continuing with simple installation ....
echo.
goto SIMPLE_INSTALL
) else (
goto GET_PERL_PATH
)
)
echo.
echo.
echo If you do not have CakeCmd.exe installed, simply press Enter/Return key.
echo.
set /P CAKECMD=Path to CakeCmd.exe :
echo.
echo.
echo In case you are using Cygwin Perl.exe, you need to specify Unzip.exe path
echo using forward slashes i.e. like C:/path/to/unzip.exe .
echo If you do not have Unzip.exe installed, simply press Enter/Return key.
echo.
set /P UNZIP=Path to Unzip.exe :
echo.
echo.
echo Here is the information you have provided.
echo.
echo Installation folder = %destdir%
echo Perl = %PERL%
echo CakeCmd = %CAKECMD%
echo Unzip = %UNZIP%
echo.
pause
echo.
echo Installing script files to "%destdir%" ....
copy docx2txt.pl "%destdir%" > nul
if not "%UNZIP%" == "" (
"%PERL%" -e "undef $/; $_ = <>; s/(unzip\s*=>)[^,]*,/$1 '$ARGV[0]',/; print;" docx2txt.config "%UNZIP%" > "%destdir%\docx2txt.config"
)
if "%CAKECMD%" == "" (
"%PERL%" -e "undef $/; $_ = <>; s/(set PERL=).*?(\r?\n)/$1$ARGV[0]$2/; print;" docx2txt.bat "%PERL%" > "%destdir%\docx2txt.bat"
) else (
"%PERL%" -e "undef $/; $_ = <>; s/(set PERL=).*?(\r?\n)/$1$ARGV[0]$2/; s/:: (set CAKECMD=).*?(\r?\n)/$1$ARGV[1]$2/; print;" docx2txt.bat "%PERL%" "%CAKECMD%" > "%destdir%\docx2txt.bat"
)
goto END
:SIMPLE_INSTALL
echo Copying script files to "%destdir%" ....
copy docx2txt.bat "%destdir%" > nul
copy docx2txt.pl "%destdir%" > nul
copy docx2txt.config "%destdir%" > nul
echo.
echo Please adjust perl, unzip and cakecmd paths (as needed) in
echo "%destdir%\docx2txt.bat" and "%destdir%\docx2txt.config"
echo.
goto END
::
:: Check whether the executable given as argument exists.
::
:CHECK_FILE_EXISTENCE
if not exist "%~1" (
echo.
echo ** Can not find executable "%~1".
echo.
) else if /I "%~nx1" NEQ "%~2.exe" (
echo.
echo ** "%~1" does not seem to be an executable file.
echo.
) else exit /B 0
exit /B 7
:END
endlocal
endlocal
set PERL=
set CAKECMD=
set UNZIP=
set FILES=
set attempts=
@echo off
:: docx2txt, a command-line utility to convert Docx documents to text format.
:: Copyright (C) 2008-now Sandeep Kumar
::
:: This program is free software; you can redistribute it and/or modify
:: it under the terms of the GNU General Public License as published by
:: the Free Software Foundation; either version 3 of the License, or
:: (at your option) any later version.
::
:: This program is distributed in the hope that it will be useful,
:: but WITHOUT ANY WARRANTY; without even the implied warranty of
:: MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
:: GNU General Public License for more details.
::
:: You should have received a copy of the GNU General Public License
:: along with this program; if not, write to the Free Software
:: Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
::
:: A simple commandline .docx to .txt converter
::
:: This batch file is a wrapper around core docx2txt.pl script.
::
:: Author : Sandeep Kumar (shimple0 -AT- Yahoo .DOT. COM)
::
:: ChangeLog :
::
:: 17/09/2009 - Initial version of this file. It has similar functionality
:: as corresponding unix shell script.
:: 21/09/2009 - Updates to deal with paths containing spaces.
:: 22/09/2009 - Code reorganization, mainly around delayedexpansion command
:: extension.
:: 24/09/2009 - Required docx2txt.pl is expected in same location as this
:: batch file.
::
::
:: Set path (without surrounding quotes) to perl binary.
::
set PERL=C:\Program Files\strawberry-perl-5.10.0.6\perl\bin\perl.exe
::
:: If CAKECMD variable is set, batch file will unzip the content of argument
:: .docx file in a directory and pass that directory as the argument to the
:: docx2txt.pl script.
::
:: set CAKECMD=C:\Program Files\cake\CakeCmd.exe
::
:: Ensure that required command extensions are enabled.
::
setlocal enableextensions
setlocal enabledelayedexpansion
::
:: docx2txt.pl is expected to be in same location as this batch file.
::
set DOCX2TXT_PL=%~dp0docx2txt.pl
if not exist "%DOCX2TXT_PL%" (
echo.
echo Can not continue without "%DOCX2TXT_PL%".
echo.
goto END
)
::
:: Check if this batch file is invoked correctly.
::
if "%~1" == "" goto USAGE
if not "%~2" == "" goto USAGE
goto CHECK_ARG
:USAGE
echo.
echo Usage : "%~0" file.docx
echo.
echo "file.docx" can also specify a directory holding the unzipped
echo content of a .docx file.