Commit be234947 authored by Ana Guerrero López

Import Upstream version 3.7

parent b63f50d4
......@@ -2,6 +2,12 @@
People who contributed a substantial amount of work and who share
the copyright over some portions of the code:
NAME: H. Turgut Uyar
EMAIL: <uyar --> tekir.org>
CONTRIBUTION: the whole new "http" data access system (using a DOM and
XPath-based approach) is based on his work.
NAME: Giuseppe "Cowo" Corbelli
EMAIL: <cowo --> lugbs.linux.it>
CONTRIBUTION: provided a lot of code and hints to integrate IMDbPY
......@@ -9,9 +15,9 @@ with SQLObject, working on the imdbpy2sql.py script and the dbschema.py
module.
Actually, besides Giuseppe and me, these other people are listed
as developers for the IMDbPY project on sourceforge and may share
copyright on some (minor) portions of the code:
Actually, besides Turgut, Giuseppe and me, these other people are
listed as developers for the IMDbPY project on sourceforge and may
share copyright on some (minor) portions of the code:
NAME: Martin Kirst
EMAIL: <martin.kirst --> s1998.tu-chemnitz.de>
......@@ -19,11 +25,6 @@ CONTRIBUTION: has done an important refactoring of the imdbpyweb
program and shares with me the copyright on the whole program.
NAME: H. Turgut Uyar
EMAIL: <uyar --> itu.edu.tr>
CONTRIBUTION: has created some tests for the test-suite.
NAME: Jesper Nøhr
EMAIL: <jesper --> noehr.org>
CONTRIBUTION: provided extensive testing and some patches for
......
Changelog for IMDbPY
====================
* What's new in release 3.7 "Burn After Reading" (22 Sep 2008)
[http]
- introduced a new set of parsers, active by default, based on DOM/XPath.
- old parsers fixed: 'news', 'genres', 'keywords', 'ratings', 'votes',
'tech', 'taglines' and 'episodes'.
[sql]
- the pure python soundex function now behaves correctly.
[general]
- minor updates to the documentation, with an introduction to the
new set of parsers and notes for packagers.
* What's new in release 3.6 "RahXephon" (08 Jun 2008)
[general]
- support for company objects for every data access system.
......
IMDbPY FAQS
===========
Q1: Since version 3.7, parsing the data from the IMDb web site is slow,
sloow, slooow! Why?
A1: If python-lxml is not installed on your system, IMDbPY uses the
pure-python BeautifulSoup module as a fall-back; BeautifulSoup does
an impressive job, but it can't be as fast as a parser written in C.
You can install python-lxml following the instructions in the
README.newparsers file.
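A quick way to check which parser will be used is to try the import
yourself (a minimal sketch; 'lxml.etree' is assumed to be the
relevant module):
try:
    import lxml.etree
    print 'python-lxml found: the fast parsers will be used.'
except ImportError:
    print 'python-lxml not found: falling back to BeautifulSoup.'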
......@@ -36,6 +36,10 @@ imdb (package)
| +-> searchCharacterParser
| +-> searchCompanyParser
| +-> utils
| +-> bsoupadapter.py
| +-> _bsoup.py
| +-> bsoupxpath.py
| +-> lxmladapter.py
|
+-> local (package)
| |
......@@ -92,6 +96,10 @@ http.searchCharacterParser: parse an html string, result of a query for a
http.searchCompanyParser: parse an html string, result of a query for a
company name.
http.utils: miscellaneous utilities used only by the http package.
http._bsoup: just a copy of the BeautifulSoup module, so that it's not
an external dependency.
http.bsoupadapter, http.bsoupxpath and http.lxmladapter: adapters for
BeautifulSoup and lxml.
The modules under the parser.local package are the same as those of
the parser.http package (the search functions are placed directly in the
......@@ -130,9 +138,9 @@ IMDbPY-based programs.
===================
I wanted to stay independent from the source of the data for a given
movie/person/character, and so the imdb.IMDb function returns an instance
of a class that provides specific methods to access a given data
source (web server, local installation, SQL database, etc.)
movie/person/character/company, and so the imdb.IMDb function returns
an instance of a class that provides specific methods to access a given
data source (web server, local installation, SQL database, etc.)
Unfortunately that means that the movieID in the Movie class, the
personID in the Person class and the characterID in the Character class
......
IMDbPY'S NEW HTML PARSERS
=========================
Since version 3.7, IMDbPY has moved its parsers for the HTML of
the IMDb website from a set of subclasses of SGMLParser (they
were finite-state machines, SGMLParser being a SAX parser) to
a set of parsers based on the libxml2 library or on the BeautifulSoup
module (and so using a DOM/XPath-based approach).
The idea and the implementation of these new parsers are mostly the
work of H. Turgut Uyar, and they result in parsers that are shorter,
easier to write and maybe even faster.
LIBXML AND/OR BEAUTIFULSOUP
===========================
To use "lxml", you need the libxml2 library installed (and its
python-lxml binding). If it's not present on your system, you'll
fall-back to BeautifulSoup - distributed alongside IMDbPY, and so
you don't need to install anything.
However, beware that being pure-Python, BeautifulSoup is much
slower than lxml, so install it, if you can.
If for some reason you can't get lxml and BeautifulSoup is too
slow for your needs, consider the use of the 'mobile' data
access system.
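E.g., a minimal sketch of switching to the 'mobile' data access system:
from imdb import IMDb
ia = IMDb('mobile')
...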
GETTING LIBXML, LIBXSLT AND PYTHON-LXML
=======================================
If you're in a Microsoft Windows environment, all you need is
python-lxml (it includes all the required libraries), which can
be downloaded from here:
http://pypi.python.org/pypi/lxml/
Otherwise, if you're in a Unix environment, you can download libxml2
and libxslt from here (you need both, to install python-lxml):
http://xmlsoft.org/downloads.html
http://xmlsoft.org/XSLT/downloads.html
The python-lxml package can be found here:
http://codespeak.net/lxml/index.html#download
Obviously you should first check if these libraries are already
packaged for your distribution/operating system.
IMDbPY was tested with libxml2 2.7.1, libxslt 1.1.24 and
python-lxml 2.1.1. Older versions can work, too; if
you have problems, submit a bug report specifying your versions.
You can also get the latest version of BeautifulSoup from here:
http://www.crummy.com/software/BeautifulSoup/
but since it's distributed with IMDbPY, you don't need it (unless
you want to override the '_bsoup.py' file in the imdb/parser/http
directory).
USING THE OLD PARSERS
=====================
The old set of parsers is still around, even if it may contain
many bugs.
You can force the use of the old parsers by setting the 'oldParsers'
parameter to True. E.g.:
from imdb import IMDb
ia = IMDb('http', oldParsers=True)
...
FORCING LXML OR BEAUTIFULSOUP
=============================
By default, IMDbPY uses python-lxml, if it's installed.
You can force the use of one given parser by passing the 'useModule'
parameter. Valid values are 'lxml' and 'BeautifulSoup'. E.g.:
from imdb import IMDb
ia = IMDb('http', useModule='BeautifulSoup')
...
useModule can also be a list/tuple of strings, to specify the
preferred order.
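E.g., to try lxml first, falling back to BeautifulSoup:
from imdb import IMDb
ia = IMDb('http', useModule=('lxml', 'BeautifulSoup'))
...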
......@@ -500,7 +500,7 @@ available from the imdb package.
You can catch any type of error raised by the IMDbPY package with
something like:
from imdb impotr IMDb, IMDbError
from imdb import IMDb, IMDbError
try:
i = IMDb()
......
......@@ -66,8 +66,22 @@ Refer to the web site http://imdbpy.sf.net/ and subscribe to the
mailing list: http://imdbpy.sf.net/?page=help#ml
UNICODE AND CHARACTER PAGES NOTICE
==================================
NOTES FOR PACKAGERS
===================
If you plan to package IMDbPY for your distribution/operating system,
keep in mind that, while IMDbPY can work out-of-the-box, some external
packages may be required for certain functionality:
- SQLObject: it's REQUIRED if you want to use the 'sql' data access
system.
- python-lxml: the 'http' data access system will be much faster, if
it's installed.
Both should probably be "suggested" dependencies.
RECENT IMPORTANT CHANGES
========================
Since release 2.4, IMDbPY internally manages all information about
movies and people using unicode strings. Please read the README.utf8 file.
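E.g., a minimal illustration (assuming 'movie' is an already
retrieved Movie object):
title = movie['title'] # a unicode string
print title.encode('utf-8')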
......@@ -78,6 +92,10 @@ README.currentRole file for more information.
Since release 3.6, IMDbPY supports IMDb's company pages; see the
README.companies file for more information.
Since release 3.7, IMDbPY has moved its main parsers from a SAX-based
approach to a DOM/XPath-based one; see the README.newparsers file
for more information.
FEATURES
========
......
......@@ -23,6 +23,9 @@ NOTE: it's always time to clean the code! <g>
functions. But how much code will be broken?
* for local and sql data access systems: some episode titles are
marked as {{SUSPENDED}}; they should probably be ignored.
* the text data can be stored as instances of a hypothetical TextInfo
class, so that values and notes can be easily retrieved separately:
tinfo.txt and tinfo.notes (must provide __str__ and __unicode__).
[searches]
......
......@@ -30,16 +30,16 @@
[imdbpy]
# Default.
accessSystem = http
# Optional:
#proxy = http://localhost:8080/
# Optional (options common to every data access system):
#adultSearch = on
#results = 20
# Optional (options common to http and mobile data access systems):
#proxy = http://localhost:8080/
#cookie_id = string_representing_the_cookie_id
#cookie_uu = string_representing_the_cookie_uu
# Parameters for the 'mobile' data access system.
#accessSystem = mobile
# Optional:
#proxy = http://localhost:8080/
# Parameters for the 'sql' data access system.
#accessSystem = sql
......
......@@ -25,7 +25,7 @@ Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
__all__ = ['IMDb', 'IMDbError', 'Movie', 'Person', 'Character', 'Company',
'available_access_systems']
__version__ = VERSION = '3.6'
__version__ = VERSION = '3.7'
# Import compatibility module (importing it is enough).
import _compat
......@@ -681,8 +681,8 @@ class IMDbBase:
params = 's=tt&q=%s' % str(urllib.quote_plus(title))
content = self._searchIMDb(params)
if not content: return None
from imdb.parser.http.searchMovieParser import BasicMovieParser
mparser = BasicMovieParser()
from imdb.parser.http.searchMovieParser import DOMBasicMovieParser
mparser = DOMBasicMovieParser()
result = mparser.parse(content)
if not (result and result.get('data')): return None
return result['data'][0][0]
......@@ -701,8 +701,8 @@ class IMDbBase:
params = 's=nm&q=%s' % str(urllib.quote_plus(name))
content = self._searchIMDb(params)
if not content: return None
from imdb.parser.http.searchPersonParser import BasicPersonParser
pparser = BasicPersonParser()
from imdb.parser.http.searchPersonParser import DOMBasicPersonParser
pparser = DOMBasicPersonParser()
result = pparser.parse(content)
if not (result and result.get('data')): return None
return result['data'][0][0]
......@@ -721,11 +721,9 @@ class IMDbBase:
content = self._searchIMDb(params)
if not content: return None
if content[:512].find('<title>IMDb Search') != -1:
from imdb.parser.http.searchCharacterParser \
import HTMLSearchCharacterParser, BasicCharacterParser
search_character_parser = HTMLSearchCharacterParser()
search_character_parser.kind = 'character'
search_character_parser._basic_parser = BasicCharacterParser
from imdb.parser.http.searchCharacterParser import \
DOMHTMLSearchCharacterParser
search_character_parser = DOMHTMLSearchCharacterParser()
result = search_character_parser.parse(content)
if not result: return None
if not result.has_key('data'): return None
......@@ -737,8 +735,10 @@ class IMDbBase:
if name == rname:
return chID
return None
from imdb.parser.http.searchCharacterParser import BasicCharacterParser
cparser = BasicCharacterParser()
# XXX: still needed?
from imdb.parser.http.searchCharacterParser import \
DOMBasicCharacterParser
cparser = DOMBasicCharacterParser()
result = cparser.parse(content)
if not (result and result.get('data')): return None
return result['data'][0][0]
......@@ -757,11 +757,9 @@ class IMDbBase:
content = self._searchIMDb(params)
if not content: return None
if content[:512].find('<title>IMDb Search') != -1:
from imdb.parser.http.searchCompanyParser \
import HTMLSearchCompanyParser, BasicCompanyParser
search_company_parser = HTMLSearchCompanyParser()
search_company_parser.kind = 'company'
search_company_parser._basic_parser = BasicCompanyParser
from imdb.parser.http.searchCompanyParser import \
DOMHTMLSearchCompanyParser
search_company_parser = DOMHTMLSearchCompanyParser()
result = search_company_parser.parse(content)
if not result: return None
if not result.has_key('data'): return None
......@@ -773,8 +771,9 @@ class IMDbBase:
if name == rname:
return chID
return None
from imdb.parser.http.searchCompanyParser import BasicCompanyParser
cparser = BasicCompanyParser()
# Still needed?
from imdb.parser.http.searchCompanyParser import DOMBasicCompanyParser
cparser = DOMBasicCompanyParser()
result = cparser.parse(content)
if not (result and result.get('data')): return None
return result['data'][0][0]
......
......@@ -8,6 +8,7 @@ called with the 'accessSystem' argument set to "http" or "web"
or "html" (this is the default).
Copyright 2004-2008 Davide Alberani <da@erlug.linux.it>
2008 H. Turgut Uyar <uyar@tekir.org>
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
......@@ -44,9 +45,12 @@ import companyParser
class _ModuleProxy:
"""A proxy to instantiate and access parsers."""
def __init__(self, module, defaultKeys=None):
def __init__(self, module, defaultKeys=None, oldParsers=False,
useModule=None):
"""Initialize a proxy for the given module; defaultKeys, if set,
muste be a dictionary of values to set for instanced objects."""
self.oldParsers = oldParsers
self.useModule = useModule
if defaultKeys is None:
defaultKeys = {}
self._defaultKeys = defaultKeys
......@@ -59,7 +63,10 @@ class _ModuleProxy:
if name in _sm._OBJECTS:
_entry = _sm._OBJECTS[name]
# Initialize the parser.
obj = _entry[0]()
kwds = {}
if not self.oldParsers and self.useModule:
kwds = {'useModule': self.useModule}
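# _entry[0] is assumed to hold a (new parser, old parser) pair;
# indexing it with the oldParsers boolean selects the old SAX-based
# class when oldParsers is True (since True == 1).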
obj = _entry[0][self.oldParsers](**kwds)
attrsToSet = self._defaultKeys.copy()
attrsToSet.update(_entry[1] or {})
# Set attribute to the object.
......@@ -86,6 +93,7 @@ _cookie_uu = 'su4/m8cho4c6HP+W1qgq6wchOmhnF0w+lIWvHjRUPJ6nRA9sccEafjGADJ6hQGrMd4
class IMDbURLopener(FancyURLopener):
"""Fetch web pages and handle errors."""
def __init__(self, *args, **kwargs):
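# Keep track of the URL of the last page actually fetched; it is
# read back later (e.g. by _search_company), presumably to know
# the final URL after any redirect.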
self._last_url = u''
FancyURLopener.__init__(self, *args, **kwargs)
# Headers to add to every request.
# XXX: IMDb's web server doesn't like urllib-based programs,
......@@ -137,6 +145,7 @@ class IMDbURLopener(FancyURLopener):
if PY_VERSION > (2, 3):
kwds['size'] = size
content = uopener.read(**kwds)
self._last_url = uopener.url
# Maybe the server is so nice to tell us the charset...
server_encode = uopener.info().getparam('charset')
# Otherwise, look at the content-type HTML meta tag.
......@@ -201,8 +210,9 @@ class IMDbHTTPAccessSystem(IMDbBase):
accessSystem = 'http'
def __init__(self, isThin=0, adultSearch=1, proxy=-1,
cookie_id=-1, cookie_uu=None, *arguments, **keywords):
def __init__(self, isThin=0, adultSearch=1, proxy=-1, oldParsers=False,
useModule=None, cookie_id=-1, cookie_uu=None,
*arguments, **keywords):
"""Initialize the access system."""
IMDbBase.__init__(self, *arguments, **keywords)
self.urlOpener = IMDbURLopener()
......@@ -230,14 +240,22 @@ class IMDbHTTPAccessSystem(IMDbBase):
self.set_proxy(proxy)
_def = {'_modFunct': self._defModFunct, '_as': self.accessSystem}
# Proxy objects.
self.smProxy = _ModuleProxy(searchMovieParser, defaultKeys=_def)
self.spProxy = _ModuleProxy(searchPersonParser, defaultKeys=_def)
self.scProxy = _ModuleProxy(searchCharacterParser, defaultKeys=_def)
self.scompProxy = _ModuleProxy(searchCompanyParser, defaultKeys=_def)
self.mProxy = _ModuleProxy(movieParser, defaultKeys=_def)
self.pProxy = _ModuleProxy(personParser, defaultKeys=_def)
self.cProxy = _ModuleProxy(characterParser, defaultKeys=_def)
self.compProxy = _ModuleProxy(companyParser, defaultKeys=_def)
self.smProxy = _ModuleProxy(searchMovieParser, defaultKeys=_def,
oldParsers=oldParsers, useModule=useModule)
self.spProxy = _ModuleProxy(searchPersonParser, defaultKeys=_def,
oldParsers=oldParsers, useModule=useModule)
self.scProxy = _ModuleProxy(searchCharacterParser, defaultKeys=_def,
oldParsers=oldParsers, useModule=useModule)
self.scompProxy = _ModuleProxy(searchCompanyParser, defaultKeys=_def,
oldParsers=oldParsers, useModule=useModule)
self.mProxy = _ModuleProxy(movieParser, defaultKeys=_def,
oldParsers=oldParsers, useModule=useModule)
self.pProxy = _ModuleProxy(personParser, defaultKeys=_def,
oldParsers=oldParsers, useModule=useModule)
self.cProxy = _ModuleProxy(characterParser, defaultKeys=_def,
oldParsers=oldParsers, useModule=useModule)
self.compProxy = _ModuleProxy(companyParser, defaultKeys=_def,
oldParsers=oldParsers, useModule=useModule)
def _normalize_movieID(self, movieID):
"""Normalize the given movieID."""
......@@ -331,6 +349,7 @@ class IMDbHTTPAccessSystem(IMDbBase):
def _retrieve(self, url, size=-1):
"""Retrieve the given URL."""
##print url
return self.urlOpener.retrieve_unicode(url, size=size)
def _get_search_content(self, kind, ton, results):
......@@ -361,8 +380,7 @@ class IMDbHTTPAccessSystem(IMDbBase):
##params = 'q=%s&tt=on&mx=%s' % (quote_plus(title), str(results))
##cont = self._retrieve(imdbURL_find % params)
cont = self._get_search_content('tt', title, results)
return self.smProxy.search_movie_parser.parse(cont,
results=results)['data']
return self.smProxy.search_movie_parser.parse(cont, results=results)['data']
def get_movie_main(self, movieID):
if not self.isThin:
......@@ -491,7 +509,7 @@ class IMDbHTTPAccessSystem(IMDbBase):
def get_movie_guests(self, movieID):
cont = self._retrieve(imdbURL_movie_main % movieID + 'epcast')
return self.mProxy.episodes_parser.parse(cont)
return self.mProxy.episodes_cast_parser.parse(cont)
get_movie_episodes_cast = get_movie_guests
def get_movie_merchandising_links(self, movieID):
......@@ -547,8 +565,7 @@ class IMDbHTTPAccessSystem(IMDbBase):
#params = 'q=%s&nm=on&mx=%s' % (quote_plus(name), str(results))
#cont = self._retrieve(imdbURL_find % params)
cont = self._get_search_content('nm', name, results)
return self.spProxy.search_person_parser.parse(cont,
results=results)['data']
return self.spProxy.search_person_parser.parse(cont, results=results)['data']
def get_person_main(self, personID):
cont = self._retrieve(imdbURL_person_main % personID + 'maindetails')
......@@ -605,8 +622,7 @@ class IMDbHTTPAccessSystem(IMDbBase):
def _search_character(self, name, results):
cont = self._get_search_content('char', name, results)
return self.scProxy.search_character_parser.parse(cont,
results=results)['data']
return self.scProxy.search_character_parser.parse(cont, results=results)['data']
def get_character_main(self, characterID):
cont = self._retrieve(imdbURL_character_main % characterID)
......@@ -633,7 +649,8 @@ class IMDbHTTPAccessSystem(IMDbBase):
def _search_company(self, name, results):
cont = self._get_search_content('co', name, results)
return self.scompProxy.search_company_parser.parse(cont,
url = self.urlOpener._last_url
return self.scompProxy.search_company_parser.parse(cont, url=url,
results=results)['data']
def get_company_main(self, companyID):
......
"""
parser.http.bsoupadapter module (imdb.parser.http package).
This module adapts the BeautifulSoup XPath support to the internal mechanism.
Copyright 2008 H. Turgut Uyar <uyar@tekir.org>
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
"""
import _bsoup as BeautifulSoup
import bsoupxpath
def fromstring(html_string):
"""Return a DOM representation of the string.
"""
return BeautifulSoup.BeautifulSoup(html_string,
convertEntities=BeautifulSoup.BeautifulSoup.HTML_ENTITIES)
def tostring(element):
"""Return a unicode representation of an element.
"""
try:
return unicode(element)
except AttributeError:
return str(element)
def fix_rowspans(html_string):
"""Repeat td elements according to their rowspan attributes in subsequent
tr elements.
"""
dom = fromstring(html_string)
cols = dom.findAll('td', rowspan=True)
for col in cols:
span = int(col.get('rowspan'))
position = len(col.findPreviousSiblings('td'))
row = col.parent
next = row
for i in xrange(span-1):
next = next.findNextSibling('tr')
# if not cloned, child will be moved to new parent
clone = fromstring(tostring(col)).td
next.insert(position, clone)
return tostring(dom)
def apply_xpath(node, path):
"""Apply an xpath expression to a node. Return a list of nodes.
"""
#xpath = bsoupxpath.Path(path)
xpath = bsoupxpath.get_path(path)
return xpath.apply(node)
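if __name__ == '__main__':
    # Minimal usage sketch of the functions above; the HTML string
    # is only an illustrative assumption.
    html = u'<table><tr><td rowspan="2">a</td><td>b</td></tr>' \
           u'<tr><td>c</td></tr></table>'
    # fix_rowspans clones the spanning cell into the following row,
    # then fromstring parses the result back into a DOM.
    dom = fromstring(fix_rowspans(html))
    for cell in apply_xpath(dom, '//td'):
        print tostring(cell)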
......@@ -9,6 +9,7 @@ E.g., for "Jesse James" the referred pages would be:
...and so on...
Copyright 2007-2008 Davide Alberani <da@erlug.linux.it>
2008 H. Turgut Uyar <uyar@tekir.org>
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
......@@ -25,8 +26,76 @@ along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
"""
from utils import ParserBase
import re
from imdb.Movie import Movie
from utils import ParserBase, Attribute, Extractor, DOMParserBase, \
build_movie, analyze_imdbid
from personParser import DOMHTMLMaindetailsParser
_personIDs = re.compile(r'/name/nm([0-9]{7})')
class DOMHTMLCharacterMaindetailsParser(DOMHTMLMaindetailsParser):
"""Parser for the "biography" page of a given character.
The page should be provided as a string, as taken from
the akas.imdb.com server. The final result will be a
dictionary, with a key for every relevant section.
Example:
mparser = DOMHTMLCharacterMaindetailsParser()
result = mparser.parse(character_maindetails_html_string)
"""
_containsObjects = True
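# For every filmography item (an "li" element), collect link, title,
# status and roleID via XPath relative to the item, then let
# build_movie assemble a Movie instance from the collected pieces.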
_film_attrs = [Attribute(key=None,
multi=True,
path={
'link': "./a[1]/@href",
'title': ".//text()",
'status': "./i/a//text()",
'roleID': "./a/@href"
},
postprocess=lambda x:
build_movie(x.get('title') or u'',
movieID=analyze_imdbid(x.get('link') or u''),
roleID=_personIDs.findall(x.get('roleID') or u''),
status=x.get('status') or None,
_parsingCharacter=True))]
extractors = [
Extractor(label='title',
path="//title",
attrs=Attribute(key='name',
path="./text()",
postprocess=lambda x: \
x.replace(' (Character)', '').strip())),
Extractor(label='headshot',
path="//a[@name='headshot']",
attrs=Attribute(key='headshot',
path="./img/@src")),
Extractor(label='akas',
path="//div[h5='Alternate Names:']",
attrs=Attribute(key='akas',
path="./text()",
postprocess=lambda x: x.strip().split(' / '))),
Extractor(label='filmography',
path="//div[@class='filmo'][not(h5)]/ol/li",
attrs=_film_attrs),
Extractor(label='filmography sections',
group="//div[@class='filmo'][h5]",
group_key="./h5/a/text()",
group_key_normalize=lambda x: x.lower()[:-1],
path="./ol/li",
attrs=_film_attrs),
]
preprocessors = [
# Check that this doesn't cut "status"...
(re.compile(r'<br>(\.\.\.| ).+?</li>', re.I | re.M), '</li>')]
class HTMLCharacterBioParser(ParserBase):
"""Parser for the "biography" page of a given character.
......@@ -112,6 +181,51 @@ class HTMLCharacterBioParser(ParserBase):
self._cur_bio += data.replace('\n', ' ')
class DOMHTMLCharacterBioParser(DOMParserBase):
"""Parser for the "biography" page of a given character.
The page should be provided as a string, as taken from
the akas.imdb.com server. The final result will be a
dictionary, with a key for every relevant section.
Example:
bparser = DOMHTMLCharacterBioParser()
result = bparser.parse(character_biography_html_string)
"""
_defGetRefs = True
extractors = [
Extractor(label='introduction',
path="//div[@id='_intro']",
attrs=Attribute(key='introduction',
path=".//text()",
postprocess=lambda x: x.strip())),
Extractor(label='biography',
path="//span[@class='_biography']",
attrs=Attribute(key='biography',
multi=True,
path={
'info': "./preceding-sibling::h4[1]//text()",
'text': ".//text()",
},
postprocess=lambda x: u'%s::%s' % (
x.get('info').strip(),
x.get('text').replace('\n',
' ').replace('||', '\n\n').strip()))),
]
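# The preprocessors below inject synthetic anchors ("_intro",
# "_biography") into the HTML, giving the XPath extractors above
# stable elements to target.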
preprocessors = [
(re.compile('(<div id="swiki.2.3.1">)', re.I), r'\1<div id="_intro">'),
(re.compile('(<a name="history">)\s*(<table .*?</table>)',
re.I | re.DOTALL),
r'</div>\2\1</a>'),
(re.compile('(<a name="[^"]+">)(<h4>)', re.I), r'</span>\1</a>\2'),
(re.compile('(</h4>)</a>', re.I), r'\1<span class="_biography">'),
(re.compile('<br/><br/>', re.I), r'||'),
(re.compile('\|\|\n', re.I), r'</span>'),
]
class HTMLCharacterQuotesParser(ParserBase):
"""Parser for the "quotes" page of a given character.
The page should be provided as a string, as taken from
......@@ -199,13 +313,54 @@ class HTMLCharacterQuotesParser(ParserBase):
self._quotes[-1] += data
class DOMHTMLCharacterQuotesParser(DOMParserBase):
"""Parser for the "quotes" page of a given character.
The page should be provided as a string, as taken from
the akas.imdb.com server. The final result will be a
dictionary, with a key for every relevant section.
Example:
qparser = DOMHTMLCharacterQuotesParser()
result = qparser.parse(character_quotes_html_string)
"""
_defGetRefs = True
extractors = [
Extractor(label='quotes',
group="//h5",
group_key="./a/text()",
path="./following-sibling::div[1]",
attrs=Attribute(key=None,
path=".//text()",
postprocess=lambda x: x.strip().replace(':   ',
': ').replace(':  ', ': ').split('||'))),
]