Commit d8689de3 authored by SVN-Git Migration

Imported Upstream version 20110515+dfsg

parent e61b1fef
Metadata-Version: 1.0
Name: pdfminer
Version: 20110227
Version: 20110515
Summary: PDF parser and analyzer
Home-page: http://www.unixuser.org/~euske/python/pdfminer/index.html
Author: Yusuke Shinyama
......
......@@ -9,7 +9,7 @@
<div align=right class=lastmod>
<!-- hhmts start -->
Last Modified: Sun Feb 27 10:51:18 UTC 2011
Last Modified: Sat May 14 16:33:16 UTC 2011
<!-- hhmts end -->
</div>
......@@ -48,12 +48,12 @@ Python PDF parser and analyzer
<p>
PDFMiner is a tool for extracting information from PDF documents.
Unlike other PDF-related tools, it focuses entirely on getting
and analyzing text data. PDFMiner allows to obtain
the exact location of texts in a page, as well as
and analyzing text data. PDFMiner allows one to obtain
the exact location of text in a page, as well as
other information such as fonts or lines.
It includes a PDF converter that can transform PDF files
into other text formats (such as HTML). It has an extensible
PDF parser that can be used for other purposes instead of text analysis.
PDF parser that can be used for other purposes than text analysis.
<p>
<h3>Features</h3>
......@@ -167,9 +167,9 @@ PDFMiner comes with two handy tools:
<h3><a name="pdf2txt">pdf2txt.py</a></h3>
<p>
<code>pdf2txt.py</code> extracts text contents from a PDF file.
It extracts all the texts that are to be rendered programmatically,
ie. text represented as ASCII or Unicode strings.
It cannot recognize texts drawn as images that would require optical character recognition.
It extracts all the text that is to be rendered programmatically,
i.e. text represented as ASCII or Unicode strings.
It cannot recognize text drawn as images that would require optical character recognition.
It also extracts the corresponding locations, font names, font sizes, writing
direction (horizontal or vertical) for each text portion.
You need to provide a password for protected PDF documents when its access is restricted.
......@@ -199,8 +199,8 @@ By default, it prints the extracted contents to stdout in text format.
<p>
<dt> <code>-p <em>pageno[,pageno,...]</em></code>
<dd> Specifies the comma-separated list of the page numbers to be extracted.
Page numbers are starting from one.
By default, it extracts texts from all the pages.
Page numbers start at one.
By default, it extracts text from all the pages.
<p>
<dt> <code>-c <em>codec</em></code>
<dd> Specifies the output codec.
......@@ -210,7 +210,7 @@ By default, it extracts texts from all the pages.
<ul>
<li> <code>text</code> : TEXT format. (Default)
<li> <code>html</code> : HTML format. Not recommended for extraction purposes because the markup is messy.
<li> <code>xml</code> : XML format. Provides the most information available.
<li> <code>xml</code> : XML format. Provides the most information.
<li> <code>tag</code> : "Tagged PDF" format. A tagged PDF has its own contents annotated with
HTML-like tags. pdf2txt tries to extract its content streams rather than inferring its text locations.
Tags used here are defined in the PDF specification (See &sect;10.7 "<em>Tagged PDF</em>").
......@@ -224,14 +224,14 @@ Currently only JPEG images are supported.
<dt> <code>-L <em>line_margin</em></code>
<dt> <code>-W <em>word_margin</em></code>
<dd> These are the parameters used for layout analysis.
In an actual PDF file, texts might be split into several chunks
In an actual PDF file, text portions might be split into several chunks
in the middle of its running, depending on the authoring software.
Therefore, text extraction needs to splice text chunks.
In the figure below, two text chunks whose distance is closer than
the <em>char_margin</em> (shown as <em><font color="red">M</font></em>) is considered
continuous and get grouped into one. Also, two lines whose distance is closer than
the <em>line_margin</em> (<em><font color="blue">L</font></em>) is grouped
as a text box, which is a rectangular area that contains a "cluster" of texts.
as a text box, which is a rectangular area that contains a "cluster" of text portions.
Furthermore, it may be required to insert blank characters (spaces) as necessary
if the distance between two words is greater than the <em>word_margin</em>
(<em><font color="green">W</font></em>), as a blank between words might not be
......@@ -263,12 +263,16 @@ are M = 1.0, L = 0.3, and W = 0.2, respectively.
<td style="border-top:1px blue solid" align=right>&uarr;</td>
</tr></table>
<p>
<dt> <code>-C</code>
<dd> Suppress object caching.
This will reduce the memory consumption but also slows down the process.
<p>
<dt> <code>-n</code>
<dd> Suppress layout analysis.
<p>
<dt> <code>-A</code>
<dd> Forces to perform layout analysis for all the text strings,
including texts contained in figures.
including text contained in figures.
<p>
<dt> <code>-V</code>
<dd> Allows vertical writing detection.
......@@ -329,7 +333,7 @@ Comma-separated IDs, or multiple <code>-i</code> options are accepted.
<dt> <code>-p <em>pageno,pageno, ...</em></code>
<dd> Specifies the page number to be extracted.
Comma-separated page numbers, or multiple <code>-p</code> options are accepted.
Note that page numbers start from one, not zero.
Note that page numbers start at one, not zero.
<p>
<dt> <code>-r</code> (raw)
<dt> <code>-b</code> (binary)
......@@ -357,6 +361,11 @@ no stream header is displayed for the ease of saving it to a file.
<h2><a name="changes">Changes</a></h2>
<ul>
<li> 2011/05/15: Speed improvements for layout analysis.
<li> 2011/05/15: API changes. <code>LTText.get_text()</code> is added.
<li> 2011/04/20: API changes. LTPolygon class was renamed as LTCurve.
<li> 2011/04/20: LTLine now represents horizontal/vertical lines only. Thanks to Koji Nakagawa.
<li> 2011/03/07: Documentation improvements by Jakub Wilk. Memory usage patch by Jonathan Hunt.
<li> 2011/02/27: Bugfixes and layout analysis improvements. Thanks to fujimoto.report.
<li> 2010/12/26: A couple of bugfixes and minor improvements. Thanks to Kevin Brubeck Unhammer and Daniel Gerber.
<li> 2010/10/17: A couple of bugfixes and minor improvements. Thanks to standardabweichung and Alastair Irving.
......
%TGIF 4.1.45-QPL
%TGIF 4.2.2
state(0,37,100.000,0,0,0,16,1,9,1,1,0,0,0,0,1,1,'Helvetica-Bold',1,69120,0,0,1,5,0,0,1,1,0,16,0,0,1,1,1,1,1050,1485,1,0,2880,0).
%
% @(#)$Header$
......@@ -30,6 +30,8 @@ script_frac("0.6").
fg_bg_colors('black','white').
dont_reencode("FFDingbests:ZapfDingbats").
objshadow_info('#c0c0c0',2,2).
rotate_pivot(0,0,0,0).
spline_tightness(1).
page(1,"",1,'').
box('black','',50,45,300,355,2,2,1,0,0,0,0,0,0,'2',0,[
]).
......@@ -147,12 +149,12 @@ str_seg('black','Helvetica-Bold',1,69120,43,12,3,0,0,0,0,0,0,0,
"LTRect")])
])
])]).
text('black',190,333,1,1,1,62,15,118,12,3,0,0,0,0,2,62,15,0,0,"",0,0,0,0,345,'',[
minilines(62,15,0,0,1,0,0,[
mini_line(62,12,3,0,0,0,[
str_block(0,62,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,62,12,3,0,-1,0,0,0,0,0,
"LTPolygon")])
text('black',190,333,1,1,1,50,15,118,12,3,0,0,0,0,2,50,15,0,0,"",0,0,0,0,345,'',[
minilines(50,15,0,0,1,0,0,[
mini_line(50,12,3,0,0,0,[
str_block(0,50,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,50,12,3,0,-1,0,0,0,0,0,
"LTCurve")])
])
])]).
text('black',170,138,1,1,1,42,15,121,12,3,0,0,0,0,2,42,15,0,0,"",0,0,0,0,150,'',[
......@@ -298,12 +300,12 @@ str_seg('black','Helvetica-Bold',1,69120,43,12,3,0,0,0,0,0,0,0,
"LTRect")])
])
])]).
text('black',580,178,1,1,1,62,15,182,12,3,0,0,0,0,2,62,15,0,0,"",0,0,0,0,190,'',[
minilines(62,15,0,0,1,0,0,[
mini_line(62,12,3,0,0,0,[
str_block(0,62,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,62,12,3,0,-1,0,0,0,0,0,
"LTPolygon")])
text('black',580,178,1,1,1,50,15,182,12,3,0,0,0,0,2,50,15,0,0,"",0,0,0,0,190,'',[
minilines(50,15,0,0,1,0,0,[
mini_line(50,12,3,0,0,0,[
str_block(0,50,12,3,0,-1,0,0,0,[
str_seg('black','Helvetica-Bold',1,69120,50,12,3,0,-1,0,0,0,0,0,
"LTCurve")])
])
])]).
text('black',775,108,1,1,1,51,15,186,12,3,0,0,0,0,2,51,15,0,0,"",0,0,0,0,120,'',[
......
docs/layout.png (binary image changed: 3.52 KB before, 3.5 KB after)
......@@ -9,7 +9,7 @@
<div align=right class=lastmod>
<!-- hhmts start -->
Last Modified: Sun Oct 17 09:18:29 UTC 2010
Last Modified: Sat May 14 16:36:12 UTC 2011
<!-- hhmts end -->
</div>
......@@ -137,24 +137,26 @@ these objects.
<dt> <code>LTPage</code>
<dd> Represents an entire page. May contain child objects like
<code>LTTextBox</code>, <code>LTFigure</code>, <code>LTImage</code>, <code>LTRect</code>,
<code>LTPolygon</code> and <code>LTLine</code>.
<code>LTCurve</code> and <code>LTLine</code>.
<dt> <code>LTTextBox</code>
<dd> Represents a group of text chunks that can be contained in a rectangular area.
Note that this box is created by geometric analysis and does not necessarily
represent a logical boundary of the text.
It contains a list of <code>LTTextLine</code> objects.
<code>get_text()</code> method returns the text content.
<dt> <code>LTTextLine</code>
<dd> Contains a list of <code>LTChar</code> objects that represent
a single text line. The characters are aligned either horizontally
or vertically, depending on the text's writing mode.
<code>get_text()</code> method returns the text content.
<dt> <code>LTChar</code>
<dt> <code>LTText</code>
<dd> These objects represent an actual letter in the text as a Unicode string.
<dt> <code>LTAnon</code>
<dd> Represent an actual letter in the text as a Unicode string.
Note that, while a <code>LTChar</code> object has actual boundaries,
<code>LTText</code> objects does not, as these are "virtual" characters,
<code>LTAnon</code> objects do not, as these are "virtual" characters,
inserted by a layout analyzer according to the relationship between two characters
(e.g. a space).
......@@ -169,15 +171,15 @@ in JPEG or other formats, but currently PDFMiner does not
pay much attention to graphical objects.
<dt> <code>LTLine</code>
<dd> Represents a single straight line shown in a page.
Could be used for separating texts or figures.
<dd> Represents a single straight line.
Could be used for separating text or figures.
<dt> <code>LTRect</code>
<dd> Represents a rectangle shown in a page.
<dd> Represents a rectangle.
Could be used for framing another pictures or figures.
<dt> <code>LTPolygon</code>
<dd> Represents a polygon in a page.
<dt> <code>LTCurve</code>
<dd> Represents a generic bezier curve.
</dl>
<p>
......
#!/usr/bin/env python2
__version__ = '20110227'
__version__ = '20110515'
if __name__ == '__main__': print __version__
......@@ -18,7 +18,7 @@ import os.path
import gzip
import cPickle as pickle
import cmap
from struct import pack, unpack
import struct
from psparser import PSStackParser
from psparser import PSException, PSSyntaxError, PSTypeError, PSEOF
from psparser import PSLiteral, PSKeyword
......@@ -98,7 +98,7 @@ class IdentityCMap(object):
def decode(self, code):
n = len(code)/2
if n:
return unpack('>%dH' % n, code)
return struct.unpack('>%dH' % n, code)
else:
return ()
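The format string built in `decode` above reads the byte string as consecutive big-endian unsigned 16-bit values. The following is a minimal Python 3 sketch of that idea, not the library code itself: the function name `decode_identity` is hypothetical, and `//` stands in for the Python 2 integer `/`.

```python
import struct


def decode_identity(code):
    """Split a byte string into big-endian unsigned 16-bit codes.

    Hypothetical stand-alone helper; like the hunk above, it expects
    an even number of bytes (struct.unpack raises struct.error otherwise).
    """
    n = len(code) // 2  # number of 2-byte codes
    if n:
        return struct.unpack('>%dH' % n, code)
    return ()
```

For example, the four bytes `00 41 30 42` decode to the two codes `0x0041` and `0x3042`.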
......@@ -348,7 +348,7 @@ class CMapParser(PSStackParser):
vlen = len(svar)
#assert s1 <= e1
for i in xrange(e1-s1+1):
x = sprefix+pack('>L',s1+i)[-vlen:]
x = sprefix+struct.pack('>L',s1+i)[-vlen:]
self.cmap.add_code2cid(x, cid+i)
return
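The expression `struct.pack('>L', n)[-vlen:]` used in this loop packs a number as four big-endian bytes and keeps only the last `vlen` of them, so each code in the range keeps the width of its variable part. A small Python 3 sketch under that reading, where `code_range` and its parameters are hypothetical names chosen for illustration:

```python
import struct


def code_range(prefix, start, count, vlen):
    """Build `count` consecutive byte codes sharing a fixed prefix.

    `vlen` is the width in bytes (1 to 4) of the variable part;
    this helper is illustrative and not part of pdfminer.
    """
    codes = []
    for i in range(count):
        packed = struct.pack('>L', start + i)  # always 4 bytes, big-endian
        codes.append(prefix + packed[-vlen:])  # keep the low-order vlen bytes
    return codes
```

Note that `vlen` must be at least 1 here: a slice of `[-0:]` would keep all four bytes instead of none.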
......@@ -382,7 +382,7 @@ class CMapParser(PSStackParser):
prefix = code[:-4]
vlen = len(var)
for i in xrange(e1-s1+1):
x = prefix+pack('>L',base+i)[-vlen:]
x = prefix+struct.pack('>L',base+i)[-vlen:]
self.cmap.add_cid2unichr(s1+i, x)
return
......
......@@ -4,7 +4,7 @@ from pdfdevice import PDFDevice, PDFTextDevice
from pdffont import PDFUnicodeNotDefined
from pdftypes import LITERALS_DCT_DECODE
from pdfcolor import LITERAL_DEVICE_GRAY, LITERAL_DEVICE_RGB
from layout import LTContainer, LTPage, LTText, LTLine, LTRect, LTPolygon
from layout import LTContainer, LTPage, LTText, LTLine, LTRect, LTCurve
from layout import LTFigure, LTImage, LTChar, LTTextLine
from layout import LTTextBox, LTTextBoxVertical, LTTextGroup
from utils import apply_matrix_pt, mult_matrix
......@@ -47,8 +47,6 @@ class PDFLayoutAnalyzer(PDFTextDevice):
def end_figure(self, _):
fig = self.cur_item
assert isinstance(self.cur_item, LTFigure)
if self.laparams is not None:
self.cur_item.analyze(self.laparams)
self.cur_item = self._stack.pop()
self.cur_item.add(fig)
return
......@@ -69,8 +67,10 @@ class PDFLayoutAnalyzer(PDFTextDevice):
(_,x1,y1) = path[1]
(x0,y0) = apply_matrix_pt(self.ctm, (x0,y0))
(x1,y1) = apply_matrix_pt(self.ctm, (x1,y1))
self.cur_item.add(LTLine(gstate.linewidth, (x0,y0), (x1,y1)))
elif shape == 'mlllh':
if x0 == x1 or y0 == y1:
self.cur_item.add(LTLine(gstate.linewidth, (x0,y0), (x1,y1)))
return
if shape == 'mlllh':
# rectangle
(_,x0,y0) = path[0]
(_,x1,y1) = path[1]
......@@ -83,13 +83,13 @@ class PDFLayoutAnalyzer(PDFTextDevice):
if ((x0 == x1 and y1 == y2 and x2 == x3 and y3 == y0) or
(y0 == y1 and x1 == x2 and y2 == y3 and x3 == x0)):
self.cur_item.add(LTRect(gstate.linewidth, (x0,y0,x2,y2)))
else:
# other polygon
pts = []
for p in path:
for i in xrange(1, len(p), 2):
pts.append(apply_matrix_pt(self.ctm, (p[i], p[i+1])))
self.cur_item.add(LTPolygon(gstate.linewidth, pts))
return
# other shapes
pts = []
for p in path:
for i in xrange(1, len(p), 2):
pts.append(apply_matrix_pt(self.ctm, (p[i], p[i+1])))
self.cur_item.add(LTCurve(gstate.linewidth, pts))
return
def render_char(self, matrix, font, fontsize, scaling, rise, cid):
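After this change, a two-point path becomes an `LTLine` only when it is horizontal or vertical, an axis-aligned closed quadrilateral becomes an `LTRect`, and everything else falls through to `LTCurve`. The sketch below mirrors that decision in Python 3, assuming each path segment is a tuple such as `('m', x, y)`, `('l', x, y)` or `('h',)`. The function `classify_path` is hypothetical, and it skips the `apply_matrix_pt` coordinate transform that the real code applies first.

```python
def classify_path(path):
    """Return 'line', 'rect' or 'curve' for a list of path segments.

    Illustrative only: returns a label instead of creating layout objects.
    """
    shape = ''.join(seg[0] for seg in path)
    if shape == 'ml':
        (_, x0, y0), (_, x1, y1) = path
        if x0 == x1 or y0 == y1:
            return 'line'  # horizontal or vertical only
    if shape == 'mlllh':
        (_, x0, y0), (_, x1, y1), (_, x2, y2), (_, x3, y3) = path[:4]
        if ((x0 == x1 and y1 == y2 and x2 == x3 and y3 == y0) or
                (y0 == y1 and x1 == x2 and y2 == y3 and x3 == x0)):
            return 'rect'  # axis-aligned rectangle
    return 'curve'  # diagonal lines and all other shapes
```

A diagonal two-point path, which the old code would have reported as a line, is classified as a curve under this logic.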
......@@ -183,7 +183,7 @@ class TextConverter(PDFConverter):
for child in item:
render(child)
elif isinstance(item, LTText):
self.write_text(item.text)
self.write_text(item.get_text())
if isinstance(item, LTTextBox):
self.write_text('\n')
if self.showpageno:
......@@ -192,6 +192,14 @@ class TextConverter(PDFConverter):
self.write_text('\f')
return
# Some dummy functions to save memory/CPU when all that is wanted is text.
# This stops all the image and drawing ouput from being recorded and taking

# up RAM.
def render_image(self, name, stream):
pass
def paint_path(self, gstate, stroke, fill, evenodd, path):
pass
## HTMLConverter
##
......@@ -203,7 +211,7 @@ class HTMLConverter(PDFConverter):
'textline': 'magenta',
'textbox': 'cyan',
'textgroup': 'red',
'polygon': 'black',
'curve': 'black',
'page': 'gray',
}
......@@ -215,7 +223,7 @@ class HTMLConverter(PDFConverter):
def __init__(self, rsrcmgr, outfp, codec='utf-8', pageno=1, laparams=None,
scale=1, fontscale=0.7, layoutmode='normal', showpageno=True,
pagemargin=50, outdir=None,
rect_colors={'polygon':'black', 'page':'gray'},
rect_colors={'curve':'black', 'page':'gray'},
text_colors={'char':'black'}):
PDFConverter.__init__(self, rsrcmgr, outfp, codec=codec, pageno=pageno, laparams=laparams)
self.scale = scale
......@@ -321,11 +329,11 @@ class HTMLConverter(PDFConverter):
return
def receive_layout(self, ltpage):
def show_layout(item):
def show_group(item):
if isinstance(item, LTTextGroup):
self.place_border('textgroup', 1, item)
for child in item:
show_layout(child)
show_group(child)
return
def render(item):
if isinstance(item, LTPage):
......@@ -337,10 +345,11 @@ class HTMLConverter(PDFConverter):
self.write('<a name="%s">Page %s</a></div>\n' % (item.pageid, item.pageid))
for child in item:
render(child)
if item.layout:
show_layout(item.layout)
elif isinstance(item, LTPolygon):
self.place_border('polygon', 1, item)
if item.groups is not None:
for group in item.groups:
show_group(group)
elif isinstance(item, LTCurve):
self.place_border('curve', 1, item)
elif isinstance(item, LTFigure):
self.place_border('figure', 1, item)
for child in item:
......@@ -360,7 +369,7 @@ class HTMLConverter(PDFConverter):
render(child)
elif isinstance(item, LTChar):
self.place_border('char', 1, item)
self.place_text('char', item.text, item.x0, item.y1, item.size)
self.place_text('char', item.get_text(), item.x0, item.y1, item.size)
else:
if isinstance(item, LTTextLine):
for child in item:
......@@ -374,9 +383,9 @@ class HTMLConverter(PDFConverter):
render(child)
self.end_textbox('textbox')
elif isinstance(item, LTChar):
self.put_text(item.text, item.fontname, item.size)
self.put_text(item.get_text(), item.fontname, item.size)
elif isinstance(item, LTText):
self.write_text(item.text)
self.write_text(item.get_text())
return
render(ltpage)
self._yoffset += self.pagemargin
......@@ -411,14 +420,14 @@ class XMLConverter(PDFConverter):
return
def receive_layout(self, ltpage):
def show_layout(item):
def show_group(item):
if isinstance(item, LTTextBox):
self.outfp.write('<textbox id="%d" bbox="%s" />\n' %
(item.index, bbox2str(item.bbox)))
elif isinstance(item, LTTextGroup):
self.outfp.write('<textgroup bbox="%s">\n' % bbox2str(item.bbox))
for child in item:
show_layout(child)
show_group(child)
self.outfp.write('</textgroup>\n')
return
def render(item):
......@@ -427,9 +436,10 @@ class XMLConverter(PDFConverter):
(item.pageid, bbox2str(item.bbox), item.rotate))
for child in item:
render(child)
if item.layout:
if item.groups is not None:
self.outfp.write('<layout>\n')
show_layout(item.layout)
for group in item.groups:
show_group(group)
self.outfp.write('</layout>\n')
self.outfp.write('</page>\n')
elif isinstance(item, LTLine):
......@@ -438,8 +448,8 @@ class XMLConverter(PDFConverter):
elif isinstance(item, LTRect):
self.outfp.write('<rect linewidth="%d" bbox="%s" />\n' %
(item.linewidth, bbox2str(item.bbox)))
elif isinstance(item, LTPolygon):
self.outfp.write('<polygon linewidth="%d" bbox="%s" pts="%s"/>\n' %
elif isinstance(item, LTCurve):
self.outfp.write('<curve linewidth="%d" bbox="%s" pts="%s"/>\n' %
(item.linewidth, bbox2str(item.bbox), item.get_pts()))
elif isinstance(item, LTFigure):
self.outfp.write('<figure name="%s" bbox="%s">\n' %
......@@ -464,10 +474,10 @@ class XMLConverter(PDFConverter):
elif isinstance(item, LTChar):
self.outfp.write('<text font="%s" bbox="%s" size="%.3f">' %
(enc(item.fontname), bbox2str(item.bbox), item.size))
self.write_text(item.text)
self.write_text(item.get_text())
self.outfp.write('</text>\n')
elif isinstance(item, LTText):
self.outfp.write('<text>%s</text>\n' % item.text)
self.outfp.write('<text>%s</text>\n' % item.get_text())
elif isinstance(item, LTImage):
if self.outdir:
name = self.write_image(item)
......
This diff is collapsed.
#!/usr/bin/env python2
import sys
from sys import stderr
try:
from cStringIO import StringIO
except ImportError:
......@@ -84,8 +83,8 @@ class LZWDecoder(object):
x = self.feed(code)
yield x
if self.debug:
print >>stderr, ('nbits=%d, code=%d, output=%r, table=%r' %
(self.nbits, code, x, self.table[258:]))
print >>sys.stderr, ('nbits=%d, code=%d, output=%r, table=%r' %
(self.nbits, code, x, self.table[258:]))
return
# lzwdecode
......
#!/usr/bin/env python2
import sys
import struct
try:
from cStringIO import StringIO
except ImportError:
from StringIO import StringIO
from cmapdb import CMapDB, CMapParser, FileUnicodeMap, CMap
from encodingdb import EncodingDB, name2unicode
from struct import pack, unpack
from psparser import PSStackParser
from psparser import PSSyntaxError, PSEOF
from psparser import LIT, KWD, STRICT
......@@ -154,7 +154,7 @@ def getdict(data):
if b0 == 28:
value = b1<<8 | b2
else:
value = b1<<24 | b2<<16 | unpack('>H', fp.read(2))[0]
value = b1<<24 | b2<<16 | struct.unpack('>H', fp.read(2))[0]
stack.append(value)
return d
......@@ -246,7 +246,7 @@ class CFFFont(object):
def __init__(self, fp):
self.fp = fp
self.offsets = []
(count, offsize) = unpack('>HB', self.fp.read(3))
(count, offsize) = struct.unpack('>HB', self.fp.read(3))
for i in xrange(count+1):
self.offsets.append(nunpack(self.fp.read(offsize)))
self.base = self.fp.tell()-1
......@@ -270,7 +270,7 @@ class CFFFont(object):
self.name = name
self.fp = fp
# Header
(_major,_minor,hdrsize,offsize) = unpack('BBBB', self.fp.read(4))
(_major,_minor,hdrsize,offsize) = struct.unpack('BBBB', self.fp.read(4))
self.fp.read(hdrsize-4)
# Name INDEX
self.name_index = self.INDEX(self.fp)
......@@ -296,16 +296,16 @@ class CFFFont(object):
format = self.fp.read(1)
if format == '\x00':
# Format 0
(n,) = unpack('B', self.fp.read(1))
for (code,gid) in enumerate(unpack('B'*n, self.fp.read(n))):
(n,) = struct.unpack('B', self.fp.read(1))
for (code,gid) in enumerate(struct.unpack('B'*n, self.fp.read(n))):
self.code2gid[code] = gid
self.gid2code[gid] = code
elif format == '\x01':
# Format 1
(n,) = unpack('B', self.fp.read(1))
(n,) = struct.unpack('B', self.fp.read(1))
code = 0
for i in xrange(n):
(first,nleft) = unpack('BB', self.fp.read(2))
(first,nleft) = struct.unpack('BB', self.fp.read(2))
for gid in xrange(first,first+nleft+1):
self.code2gid[code] = gid
self.gid2code[gid] = code
......@@ -320,17 +320,17 @@ class CFFFont(object):
if format == '\x00':
# Format 0
n = self.nglyphs-1
for (gid,sid) in enumerate(unpack('>'+'H'*n, self.fp.read(2*n))):
for (gid,sid) in enumerate(struct.unpack('>'+'H'*n, self.fp.read(2*n))):
gid += 1
name = self.getstr(sid)
self.name2gid[name] = gid
self.gid2name[gid] = name
elif format == '\x01':
# Format 1
(n,) = unpack('B', self.fp.read(1))
(n,) = struct.unpack('B', self.fp.read(1))
sid = 0
for i in xrange(n):
(first,nleft) = unpack('BB', self.fp.read(2))
(first,nleft) = struct.unpack('BB', self.fp.read(2))
for gid in xrange(first,first+nleft+1):
name = self.getstr(sid)
self.name2gid[name] = gid
......@@ -363,9 +363,9 @@ class TrueTypeFont(object):
self.fp = fp
self.tables = {}
self.fonttype = fp.read(4)
(ntables, _1, _2, _3) = unpack('>HHHH', fp.read(8))
(ntables, _1, _2, _3) = struct.unpack('>HHHH', fp.read(8))
for _ in xrange(ntables):
(name, tsum, offset, length) = unpack('>4sLLL', fp.read(16))
(name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
self.tables[name] = (offset, length)
return
......@@ -375,50 +375,50 @@ class TrueTypeFont(object):
(base_offset, length) = self.tables['cmap']
fp = self.fp
fp.seek(base_offset)
(version, nsubtables) = unpack('>HH', fp.read(4))
(version, nsubtables) = struct.unpack('>HH', fp.read(4))
subtables = []
for i in xrange(nsubtables):
subtables.append(unpack('>HHL', fp.read(8)))
subtables.append(struct.unpack('>HHL', fp.read(8)))
char2gid = {}
# Only supports subtable type 0, 2 and 4.
for (_1, _2, st_offset) in subtables:
fp.seek(base_offset+st_offset)
(fmttype, fmtlen, fmtlang) = unpack('>HHH', fp.read(6))
(fmttype, fmtlen, fmtlang) = struct.unpack('>HHH', fp.read(6))
if fmttype == 0:
char2gid.update(enumerate(unpack('>256B', fp.read(256))))
char2gid.update(enumerate(struct.unpack('>256B', fp.read(256))))
elif fmttype == 2:
subheaderkeys = unpack('>256H', fp.read(512))
subheaderkeys = struct.unpack('>256H', fp.read(512))
firstbytes = [0]*8192
for (i,k) in enumerate(subheaderkeys):
firstbytes[k/8] = i
nhdrs = max(subheaderkeys)/8 + 1
hdrs = []
for i in xrange(nhdrs):
(firstcode,entcount,delta,offset) = unpack('>HHhH', fp.read(8))
(firstcode,entcount,delta,offset) = struct.unpack('>HHhH', fp.read(8))
hdrs.append((i,firstcode,entcount,delta,fp.tell()-2+offset))
for (i,firstcode,entcount,delta,pos) in hdrs:
if not entcount: continue
first = firstcode + (firstbytes[i] << 8)
fp.seek(pos)
for c in xrange(entcount):
gid = unpack('>H', fp.read(2))
gid = struct.unpack('>H', fp.read(2))
if gid:
gid += delta
char2gid[first+c] = gid
elif fmttype == 4:
(segcount, _1, _2, _3) = unpack('>HHHH', fp.read(8))
(segcount, _1, _2, _3) = struct.unpack('>HHHH', fp.read(8))
segcount /= 2
ecs = unpack('>%dH' % segcount, fp.read(2*segcount))
ecs = struct.unpack('>%dH' % segcount, fp.read(2*segcount))
fp.read(2)
scs = unpack('>%dH' % segcount, fp.read(2*segcount))
idds = unpack('>%dh' % segcount, fp.read(2*segcount))
scs = struct.unpack('>%dH' % segcount, fp.read(2*segcount))
idds = struct.unpack('>%dh' % segcount, fp.read(2*segcount))
pos = fp.tell()
idrs = unpack('>%dH' % segcount, fp.read(2*segcount))
idrs = struct.unpack('>%dH' % segcount, fp.read(2*segcount))
for (ec,sc,idd,idr) in zip(ecs, scs, idds, idrs):
if idr:
fp.seek(pos+idr)
for c in xrange(sc, ec+1):
char2gid[c] = (unpack('>H', fp.read(2))[0] + idd) & 0xffff
char2gid[c] = (struct.unpack('>H', fp.read(2))[0] + idd) & 0xffff
else:
for c in xrange(sc, ec+1):
char2gid[c] = (c + idd) & 0xffff
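In the final branch above, a segment with no range offset maps each character code to a glyph ID by adding the segment's delta and wrapping to 16 bits. A minimal Python 3 sketch of only that branch, with the hypothetical name `map_segment`:

```python
def map_segment(start_code, end_code, id_delta):
    """Map every code in [start_code, end_code] to (code + id_delta) mod 65536.

    Illustrative helper covering only the branch without a range offset.
    """
    return dict((c, (c + id_delta) & 0xffff)
                for c in range(start_code, end_code + 1))
```

The `& 0xffff` mask is what lets a negative delta, or a sum past 65535, wrap around to a valid 16-bit glyph ID.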
......
#!/usr/bin/env python2
import sys
import re
from sys import stderr
from struct import pack, unpack
try:
from cStringIO import StringIO
except ImportError:
......@@ -132,8 +131,9 @@ class PDFResourceManager(object):
"""
debug = 0
def __init__(self):
self.fonts = {}
def __init__(self, caching=True):
self.caching = caching
self._cached_fonts = {}
return
def get_procset(self, procs):
......@@ -155,11 +155,11 @@ class PDFResourceManager(object):
return CMap()
def get_font(self, objid, spec):
if objid and objid in self.fonts:
font = self.fonts[objid]
if objid and objid in self._cached_fonts:
font = self._cached_fonts[objid]
else:
if 2 <= self.debug:
print >>stderr, 'get_font: create: objid=%r, spec=%r' % (objid, spec)
print >>sys.stderr, 'get_font: create: objid=%r, spec=%r' % (objid, spec)
if STRICT:
if spec['Type'] is not LITERAL_FONT:
raise PDFFontError('Type is not /Font')
......@@ -195,8 +195,8 @@ class PDFResourceManager(object):
if STRICT:
raise PDFFontError('Invalid Font spec: %r' % spec)
font = PDFType1Font(self, spec) # this is so wrong!
if objid:
self.fonts[objid] = font
if objid and self.caching:
self._cached_fonts[objid] = font
return font
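The effect of the new `caching` flag can be seen in isolation: fonts are still looked up in the cache, but they are only stored when caching is enabled. The class below is a simplified Python 3 sketch of that pattern. `FontCache` and the `build` callable are hypothetical stand-ins for `PDFResourceManager` and its font construction logic.

```python
class FontCache(object):
    """Cache objects by ID, with an option to disable storing them."""

    def __init__(self, caching=True):
        self.caching = caching
        self._cached_fonts = {}

    def get_font(self, objid, build):
        # Reuse a cached object when one exists for this ID.
        if objid and objid in self._cached_fonts:
            return self._cached_fonts[objid]
        font = build()  # hypothetical factory; stands in for font parsing
        # Store only when the object has an ID and caching is enabled.
        if objid and self.caching:
            self._cached_fonts[objid] = font
        return font
```

With `caching=False`, every call rebuilds the font, which trades speed for lower memory use, as the `-C` option description in the documentation hunk states.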
......@@ -263,7 +263,7 @@ class PDFContentParser(PSStackParser):
data += self.buf[self.charpos:]
self.charpos = len(self.buf)
data = data[:-(len(target)+1)] # strip the last part
data = re.sub(r'(\x0d\x0a|[\x0d\x0a])', '', data)
data = re.sub(r'(\x0d\x0a|[\x0d\x0a])$', '', data)
return (pos, data)
def flush(self):
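The one-character change to the regular expression above matters: without the `$` anchor the substitution deletes every CR, LF or CRLF in the data, while with it only a line ending at the very end is removed. A short Python 3 comparison on a made-up sample string (the variable names are illustrative):

```python
import re

sample = 'AB\r\nCD\r\n'

# Old pattern: removes every line ending, including those inside the data.
old_result = re.sub(r'(\x0d\x0a|[\x0d\x0a])', '', sample)

# New pattern: anchored with $, so only the final line ending is removed.
new_result = re.sub(r'(\x0d\x0a|[\x0d\x0a])$', '', sample)
```

Here `old_result` is `'ABCD'`, whereas `new_result` keeps the interior line break and is `'AB\r\nCD'`.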
......@@ -329,7 +329,7 @@ class PDFPageInterpreter(object):
return PREDEFINED_COLORSPACE[name]
for (k,v) in dict_value(resources).iteritems():
if 2 <= self.debug:
print >>stderr, 'Resource: %r: %r' % (k,v)
print >>sys.stderr, 'Resource: %r: %r' % (k,v)
if k == 'Font':
for (fontid,spec) in dict_value(v).iteritems():
objid = None
......@@ -649,7 +649,7 @@ class PDFPageInterpreter(object):
(a,b,c,d,e,f) = self.textstate.matrix
self.textstate.matrix = (a,b,c,d,tx*a+ty*c+e,tx*b+ty*d+f)
self.textstate.linematrix = (0, 0)
#print >>stderr, 'Td(%r,%r): %r' % (tx,ty,self.textstate)
#print >>sys.stderr, 'Td(%r,%r): %r' % (tx,ty,self.textstate)
return
# text-move
def do_TD(self, tx, ty):
......@@ -657,7 +657,7 @@ class PDFPageInterpreter(object):
self.textstate.matrix = (a,b,c,d,tx*a+ty*c+e,tx*b+ty*d+f)
self.textstate.leading = ty
self.textstate