Commit ff15b6b6 authored by Ana Guerrero López

Import Upstream version 4.9

parent 5842f76f
syntax: glob
build
dist
*.egg-info
*.mo
*.pyc
*.pyo
*.so
*.pyd
*~
*.swp
setuptools-*.egg
c3dba80881f0a810b3bf93051a56190b297e7a50 4.6
c8b07121469a2173a587b1a34beb4f1fecd640b6 4.7
ba221c9050599463b4b78c89a8bdada7d7aef173 4.8
e807ba790392d406018af0f98d5dad5117721a4d 4.8.1
b02c61369b27e0d5af0a755a8a2fc3355c08bb67 4.8.2
......@@ -22,7 +22,8 @@ listed as developers for the IMDbPY project on sourceforge and may
share copyright on some (minor) portions of the code:
NAME: Alberto Malagoli
CONTRIBUTION: developed the new web site, and detain the copyright of it.
CONTRIBUTION: developed the new web site, and retains the copyright of it,
and provided helper functions and other code.
NAME: Martin Kirst
......
......@@ -21,6 +21,24 @@ of help, and also for the wonderful http://bitbucket.org)
Below, a list of persons who contributed with bug reports, small
patches and hints (kept in reverse order since IMDbPY 4.5):
* John Lambert, Rick Summerhill and Maciej for reports and fixes
for the search query.
* Kaspars "Darklow" Sprogis for an impressive number of tests and reports about
bugs parsing the plain text data files and many new ideas.
* Damien Stewart for many bug reports about the Windows environment.
* Vincenzo Ampolo for a bug report about the new imdbIDs save/restore queries.
* Tomáš Hnyk for the idea of an option to reraise caught exceptions.
* Emmanuel Tabard for ideas, code and testing on restoring imdbIDs.
* Fabian Roth for a bug report about the new style of episodes list.
* Y. Josuin for a bug report on missing info in crazy credits file.
* Arfrever Frehtes Taifersar Arahesis for a patch for locales.
* Gustaf Nilsson for bug reports about BeautifulSoup.
......@@ -41,9 +59,6 @@ patches and hints (kept in a reverse order since IMDbPY 4.5):
* Jef "ofthelit", for a patch for the reduce.sh script bug
reports for Windows.
* "Darklow" for an impressive amount of tests and reports about
a bug about data parsing in the plain text data files.
* Reiner Herrmann for benchmarks using SSD hard drives.
* Thomas Stewart for some tests and reports about a bug
......
Changelog for IMDbPY
====================
* What's new in release 4.9 "Iron Sky" (15 Jun 2012)
[general]
- URLs used to access the IMDb site can be configured (see the sketch after this list).
- helper functions to handle movie AKAs in various
languages (code by Alberto Malagoli).
- renamed the 'articles' module to 'linguistics'.
- introduced the 'reraiseExceptions' option, to re-raise
every caught exception.
[http]
- fix for changed search parameters.
- introduced a 'timeout' parameter for connections to the web server.
- fix for business information.
- parser for the new style of episodes list.
- unicode searches handled as iso8859-1.
- fix for garbage in AKA titles.
[sql]
- vastly improved the store/restore of imdbIDs; now it should be faster
and more accurate.
- now the 'name' table contains a 'gender' field that can be 'm', 'f' or NULL.
- fix for nicknames.
- fix for missing titles in the crazy credits file.
- handled exceptions creating indexes, foreign keys and
executing custom queries.
- fixed creation of the index for keywords.
- excluded {{SUSPENDED}} titles.
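A minimal sketch of the two new [general] options above (the keyword
names are taken from this release's code; everything else is illustrative):

from imdb import IMDb

# Sketch: use a different IMDb mirror and re-raise every caught
# exception instead of logging and swallowing it.
ia = IMDb('http',
          imdbURL_base='http://www.imdb.com/',
          reraiseExceptions=True)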
* What's new in release 4.8.2 "The Big Bang Theory" (02 Nov 2011)
[general]
- fixed install path of locales.
......
......@@ -18,7 +18,7 @@ imdb (package)
+-> _compat
+-> _exceptions
+-> _logging
+-> articles
+-> linguistics
+-> Movie
+-> Person
+-> Character
......@@ -64,7 +64,7 @@ _compat: compatibility functions and class for some strange environments
(internally used).
_exceptions: defines the exceptions internally used.
_logging: provides the logging facility used by IMDbPY.
articles: defines some functions and data useful to smartly guess the
linguistics: defines some functions and data useful to smartly guess the
language of a movie title (internally used).
Movie: contains the Movie class, used to describe and manage a movie.
Person: contains the Person class, used to describe and manage a person.
......
......@@ -84,8 +84,8 @@ To solve this problem, there are other keys: "smart canonical title",
converting a title into its canonical format.
It works, but it needs to know something about articles in various
languages: if you want to help, see the LANG_ARTICLES and _LANG_COUNTRIES
dictionaries in the 'articles' module.
languages: if you want to help, see the LANG_ARTICLES and LANG_COUNTRIES
dictionaries in the 'linguistics' module.
To know the language in which a movie title is assumed to be,
call its 'guessLanguage' method (it will return None, if unable to guess).
......@@ -93,3 +93,17 @@ If you want to force a given language instead of the guessed one, you
can call its 'smartCanonicalTitle' method, setting the 'lang' argument
appropriately.
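A short sketch of both methods (the movieID is illustrative):

from imdb import IMDb
ia = IMDb('http')
movie = ia.get_movie('0094226')
lang = movie.guessLanguage()  # e.g. 'English'; None if it can't guess
# Force Italian article handling instead of the guessed language.
title = movie.smartCanonicalTitle(lang='Italian')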
TITLE AKAS
==========
Sometimes it's useful to manage a title's AKAs knowing their languages.
In the 'helpers' module there are some (hopefully) useful functions
(a usage sketch follows this list):
akasLanguages(movie) - given a movie, return a list of tuples
in (lang, AKA) format (lang can be None, if unable to detect).
sortAKAsBySimilarity(movie, title) - sorts the AKAs of a movie by
how similar they are to a given title (see
the code for more options).
getAKAsInLanguage(movie, lang) - return a list of AKAs of the movie in the given
language (see the code for more options).
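A usage sketch of the three helpers (the movieID and outputs are
illustrative):

from imdb import IMDb
from imdb.helpers import akasLanguages, sortAKAsBySimilarity, \
     getAKAsInLanguage

ia = IMDb('http')
movie = ia.get_movie('0094226')
for lang, aka in akasLanguages(movie):
    print lang, aka               # e.g. Italian Gli intoccabili
best_first = sortAKAsBySimilarity(movie, u'The Untouchables')
italian_akas = getAKAsInLanguage(movie, 'Italian')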
......@@ -21,8 +21,6 @@ NOTE: it's always time to clean the code! <g>
at least for 'http' and 'mobile', since they are used by mobile devices.
* The analyze_title/build_title functions have grown too complex and
beyond their initial goals.
* for the sql data access system: some episode titles are
marked as {{SUSPENDED}}; they should probably be ignored.
[searches]
......
......@@ -28,39 +28,50 @@
#
[imdbpy]
# Default.
## Default.
accessSystem = http
# Optional (options common to every data access system):
## Optional (options common to every data access system):
# Activate adult searches (on, by default).
#adultSearch = on
# Number of results for searches (20 by default).
#results = 20
# Re-raise all caught exceptions (off, by default).
#reraiseExceptions = off
# Optional (options common to http and mobile data access systems):
## Optional (options common to http and mobile data access systems):
# Proxy used to access the network. If it requires authentication,
# try with: http://username:password@server_address:port/
#proxy = http://localhost:8080/
# Cookies of the IMDb.com account
#cookie_id = string_representing_the_cookie_id
#cookie_uu = string_representing_the_cookie_uu
## Timeout for the connection to IMDb (30 seconds, by default).
#timeout = 30
# Base url to access pages on the IMDb.com web server.
#imdbURL_base = http://akas.imdb.com/
# Parameters for the 'http' data access system.
## Parameters for the 'http' data access system.
# Parser to use; can be a single value or a list of values separated by
# a comma, to express order preference. Valid values: "lxml", "beautifulsoup"
#useModule = lxml,beautifulsoup
# Parameters for the 'mobile' data access system.
## Parameters for the 'mobile' data access system.
#accessSystem = mobile
# Parameters for the 'sql' data access system.
## Parameters for the 'sql' data access system.
#accessSystem = sql
#uri = mysql://user:password@localhost/imdb
# ORM to use; can be a single value or a list of values separated by
# a comma, to express order preference. Valid values: "sqlobject", "sqlalchemy"
#useORM = sqlobject,sqlalchemy
# Set the threshold for logging messages.
## Set the threshold for logging messages.
# Can be one of "debug", "info", "warning", "error", "critical" (default:
# "warning").
#loggingLevel = debug
# Path to a configuration file for the logging facility;
## Path to a configuration file for the logging facility;
# see: http://docs.python.org/library/logging.html#configuring-logging
#loggingConfig = ~/.imdbpy-logger.cfg
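Most of the options above can also be passed as keyword arguments to
imdb.IMDb(); a sketch mapping the sample values to code (the keyword
names mirror the option names above):

from imdb import IMDb
ia = IMDb('http',
          adultSearch=True,
          results=20,
          timeout=30,
          reraiseExceptions=False,
          imdbURL_base='http://akas.imdb.com/')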
......
......@@ -23,7 +23,7 @@ Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
from copy import deepcopy
from imdb import articles
from imdb import linguistics
from imdb.utils import analyze_title, build_title, canonicalTitle, \
flatten, _Container, cmpMovies
......@@ -206,7 +206,7 @@ class Movie(_Container):
else:
country = self.get('countries')
if country:
lang = articles.COUNTRY_LANG.get(country[0])
lang = linguistics.COUNTRY_LANG.get(country[0])
return lang
def smartCanonicalTitle(self, title=None, lang=None):
......
......@@ -6,7 +6,7 @@ a person from the IMDb database.
It can fetch data through different media (e.g.: the IMDb web pages,
a SQL database, etc.)
Copyright 2004-2011 Davide Alberani <da@erlug.linux.it>
Copyright 2004-2012 Davide Alberani <da@erlug.linux.it>
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
......@@ -25,7 +25,7 @@ Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
__all__ = ['IMDb', 'IMDbError', 'Movie', 'Person', 'Character', 'Company',
'available_access_systems']
__version__ = VERSION = '4.8.2'
__version__ = VERSION = '4.9'
# Import compatibility module (importing it is enough).
import _compat
......@@ -35,7 +35,7 @@ from types import MethodType
from imdb import Movie, Person, Character, Company
import imdb._logging
from imdb._exceptions import IMDbError, IMDbDataAccessError
from imdb._exceptions import IMDbError, IMDbDataAccessError, IMDbParserError
from imdb.utils import build_title, build_name, build_company_name
_aux_logger = logging.getLogger('imdbpy.aux')
......@@ -43,6 +43,10 @@ _aux_logger = logging.getLogger('imdbpy.aux')
# URLs of the main pages for movies, persons, characters and queries.
imdbURL_base = 'http://akas.imdb.com/'
# NOTE: the urls below will be removed in a future version.
# please use the values in the 'urls' attribute
# of the IMDbBase subclass instance.
# http://akas.imdb.com/title/
imdbURL_movie_base = '%stitle/' % imdbURL_base
# http://akas.imdb.com/title/tt%s/
......@@ -242,6 +246,9 @@ class IMDbBase:
# Top-level logger for IMDbPY.
_imdb_logger = logging.getLogger('imdbpy')
# Whether to re-raise caught exceptions or not.
_reraise_exceptions = False
def __init__(self, defaultModFunct=None, results=20, keywordsResults=100,
*arguments, **keywords):
"""Initialize the access system.
......@@ -267,6 +274,53 @@ class IMDbBase:
if keywordsResults < 1:
keywordsResults = 100
self._keywordsResults = keywordsResults
self._reraise_exceptions = keywords.get('reraiseExceptions') or False
self.set_imdb_urls(keywords.get('imdbURL_base') or imdbURL_base)
def set_imdb_urls(self, imdbURL_base):
"""Set the urls used accessing the IMDb site."""
imdbURL_base = imdbURL_base.strip().strip('"\'')
if not imdbURL_base.startswith('http://'):
imdbURL_base = 'http://%s' % imdbURL_base
if not imdbURL_base.endswith('/'):
imdbURL_base = '%s/' % imdbURL_base
# http://akas.imdb.com/title/
imdbURL_movie_base='%stitle/' % imdbURL_base
# http://akas.imdb.com/title/tt%s/
imdbURL_movie_main=imdbURL_movie_base + 'tt%s/'
# http://akas.imdb.com/name/
imdbURL_person_base='%sname/' % imdbURL_base
# http://akas.imdb.com/name/nm%s/
imdbURL_person_main=imdbURL_person_base + 'nm%s/'
# http://akas.imdb.com/character/
imdbURL_character_base='%scharacter/' % imdbURL_base
# http://akas.imdb.com/character/ch%s/
imdbURL_character_main=imdbURL_character_base + 'ch%s/'
# http://akas.imdb.com/company/
imdbURL_company_base='%scompany/' % imdbURL_base
# http://akas.imdb.com/company/co%s/
imdbURL_company_main=imdbURL_company_base + 'co%s/'
# http://akas.imdb.com/keyword/%s/
imdbURL_keyword_main=imdbURL_base + 'keyword/%s/'
# http://akas.imdb.com/chart/top
imdbURL_top250=imdbURL_base + 'chart/top'
# http://akas.imdb.com/chart/bottom
imdbURL_bottom100=imdbURL_base + 'chart/bottom'
# http://akas.imdb.com/find?%s
imdbURL_find=imdbURL_base + 'find?%s'
self.urls = dict(
movie_base=imdbURL_movie_base,
movie_main=imdbURL_movie_main,
person_base=imdbURL_person_base,
person_main=imdbURL_person_main,
character_base=imdbURL_character_base,
character_main=imdbURL_character_main,
company_base=imdbURL_company_base,
company_main=imdbURL_company_main,
keyword_main=imdbURL_keyword_main,
top250=imdbURL_top250,
bottom100=imdbURL_bottom100,
find=imdbURL_find)
def _normalize_movieID(self, movieID):
"""Normalize the given movieID."""
......@@ -721,6 +775,9 @@ class IMDbBase:
'"%s" (accessSystem: %s)',
i, mopID, mop.accessSystem, exc_info=True)
ret = {}
# If requested by the user, reraise the exception.
if self._reraise_exceptions:
raise
keys = None
if 'data' in ret:
res.update(ret['data'])
......
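A sketch of the new 'urls' attribute in action; set_imdb_urls (above)
adds the missing scheme and trailing slash:

from imdb import IMDb
ia = IMDb('http', imdbURL_base='www.imdb.com')
print ia.urls['movie_main'] % '0094226'
# -> http://www.imdb.com/title/tt0094226/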
......@@ -4,7 +4,8 @@ helpers module (imdb package).
This module provides functions not used directly by the imdb package,
but useful for IMDbPY-based programs.
Copyright 2006-2010 Davide Alberani <da@erlug.linux.it>
Copyright 2006-2012 Davide Alberani <da@erlug.linux.it>
2012 Alberto Malagoli <albemala AT gmail.com>
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
......@@ -24,6 +25,7 @@ Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
# XXX: find better names for the functions in this module.
import re
import difflib
from cgi import escape
import gettext
from gettext import gettext as _
......@@ -35,7 +37,9 @@ from imdb.utils import modClearRefs, re_titleRef, re_nameRef, \
re_characterRef, _tagAttr, _Container, TAGS_TO_MODIFY
from imdb import IMDb, imdbURL_movie_base, imdbURL_person_base, \
imdbURL_character_base
import imdb.locale
from imdb.linguistics import COUNTRY_LANG
from imdb.Movie import Movie
from imdb.Person import Person
from imdb.Character import Character
......@@ -546,3 +550,91 @@ def parseXML(xml):
return None
_re_akas_lang = re.compile('(?:[(])([a-zA-Z]+?)(?: title[)])')
_re_akas_country = re.compile('\(.*?\)')
# akasLanguages, sortAKAsBySimilarity and getAKAsInLanguage code
# copyright of Alberto Malagoli (refactoring by Davide Alberani).
def akasLanguages(movie):
"""Given a movie, return a list of tuples in (lang, AKA) format;
lang can be None, if unable to detect."""
lang_and_aka = []
akas = set((movie.get('akas') or []) +
(movie.get('akas from release info') or []))
for aka in akas:
# split aka
aka = aka.encode('utf8').split('::')
# sometimes there is no countries information
if len(aka) == 2:
# search for something like "(... title)" where ... is a language
language = _re_akas_lang.search(aka[1])
if language:
language = language.groups()[0]
else:
# split countries using , and keep only the first one (it's sufficient)
country = aka[1].split(',')[0]
# remove parenthesis
country = _re_akas_country.sub('', country).strip()
# given the country, get corresponding language from dictionary
language = COUNTRY_LANG.get(country)
else:
language = None
lang_and_aka.append((language, aka[0].decode('utf8')))
return lang_and_aka
def sortAKAsBySimilarity(movie, title, _titlesOnly=True, _preferredLang=None):
"""Return a list of movie AKAs, sorted by their similarity to
the given title.
If _titlesOnly is not True, similarity information is also returned.
If _preferredLang is specified, AKAs in the given language will get
a higher score.
The return value is a list of titles, or a list of tuples if _titlesOnly is False."""
language = movie.guessLanguage()
# estimate string distance between current title and given title
m_title = movie['title'].lower()
l_title = title.lower()
if isinstance(l_title, unicode):
l_title = l_title.encode('utf8')
scores = []
score = difflib.SequenceMatcher(None, m_title.encode('utf8'), l_title).ratio()
# set original title and corresponding score as the best match for given title
scores.append((score, movie['title'], None))
for language, aka in akasLanguages(movie):
# estimate string distance between current title and given title
m_title = aka.lower()
if isinstance(m_title, unicode):
m_title = m_title.encode('utf8')
score = difflib.SequenceMatcher(None, m_title, l_title).ratio()
# if current language is the same as the given one, increase score
if _preferredLang and _preferredLang == language:
score += 1
scores.append((score, aka, language))
scores.sort(reverse=True)
if _titlesOnly:
return [x[1] for x in scores]
return scores
def getAKAsInLanguage(movie, lang, _searchedTitle=None):
"""Return a list of AKAs of a movie, in the specified language.
If _searchedTitle is given, the AKAs are sorted by their similarity
to it."""
akas = []
for language, aka in akasLanguages(movie):
if lang == language:
akas.append(aka)
if _searchedTitle:
scores = []
if isinstance(_searchedTitle, unicode):
_searchedTitle = _searchedTitle.encode('utf8')
for aka in akas:
m_aka = aka
if isinstance(m_aka, unicode):
m_aka = m_aka.encode('utf8')
scores.append((difflib.SequenceMatcher(None, m_aka.lower(),
_searchedTitle.lower()).ratio(), aka))
scores.sort(reverse=True)
akas = [x[1] for x in scores]
return akas
"""
articles module (imdb package).
linguistics module (imdb package).
This module provides functions and data to handle in a smart way
articles (in various languages) at the beginning of movie titles.
languages and articles (in various languages) at the beginning of movie titles.
Copyright 2009 Davide Alberani <da@erlug.linux.it>
Copyright 2009-2012 Davide Alberani <da@erlug.linux.it>
2012 Alberto Malagoli <albemala AT gmail.com>
2009 H. Turgut Uyar <uyar@tekir.org>
This program is free software; you can redistribute it and/or modify
......@@ -74,20 +75,80 @@ LANG_ARTICLESget = LANG_ARTICLES.get
# Maps a language to countries where it is the main language.
# If you want to add an entry for another language or country, mail it at
# imdbpy-devel@lists.sourceforge.net .
_LANG_COUNTRIES = {
'English': ('USA', 'UK', 'Canada', 'Ireland', 'Australia'),
'Italian': ('Italy',),
'Spanish': ('Spain', 'Mexico'),
'Portuguese': ('Portugal', 'Brazil'),
'Turkish': ('Turkey',),
#'German': ('Germany', 'East Germany', 'West Germany'),
#'French': ('France'),
LANG_COUNTRIES = {
'English': ('Canada', 'Swaziland', 'Ghana', 'St. Lucia', 'Liberia', 'Jamaica', 'Bahamas', 'New Zealand', 'Lesotho', 'Kenya', 'Solomon Islands', 'United States', 'South Africa', 'St. Vincent and the Grenadines', 'Fiji', 'UK', 'Nigeria', 'Australia', 'USA', 'St. Kitts and Nevis', 'Belize', 'Sierra Leone', 'Gambia', 'Namibia', 'Micronesia', 'Kiribati', 'Grenada', 'Antigua and Barbuda', 'Barbados', 'Malta', 'Zimbabwe', 'Ireland', 'Uganda', 'Trinidad and Tobago', 'South Sudan', 'Guyana', 'Botswana', 'United Kingdom', 'Zambia'),
'Italian': ('Italy', 'San Marino', 'Vatican City'),
'Spanish': ('Spain', 'Mexico', 'Argentina', 'Bolivia', 'Guatemala', 'Uruguay', 'Peru', 'Cuba', 'Dominican Republic', 'Panama', 'Costa Rica', 'Ecuador', 'El Salvador', 'Chile', 'Equatorial Guinea', 'Colombia', 'Nicaragua', 'Venezuela', 'Honduras', 'Paraguay'),
'French': ('Cameroon', 'Burkina Faso', 'Dominica', 'Gabon', 'Monaco', 'France', "Cote d'Ivoire", 'Benin', 'Togo', 'Central African Republic', 'Mali', 'Niger', 'Congo, Republic of', 'Guinea', 'Congo, Democratic Republic of the', 'Luxembourg', 'Haiti', 'Chad', 'Burundi', 'Madagascar', 'Comoros', 'Senegal'),
'Portuguese': ('Portugal', 'Brazil', 'Sao Tome and Principe', 'Cape Verde', 'Angola', 'Mozambique', 'Guinea-Bissau'),
'German': ('Liechtenstein', 'Austria', 'West Germany', 'Switzerland', 'East Germany', 'Germany'),
'Arabic': ('Saudi Arabia', 'Kuwait', 'Jordan', 'Oman', 'Yemen', 'United Arab Emirates', 'Mauritania', 'Lebanon', 'Bahrain', 'Libya', 'Palestinian State (proposed)', 'Qatar', 'Algeria', 'Morocco', 'Iraq', 'Egypt', 'Djibouti', 'Sudan', 'Syria', 'Tunisia'),
'Turkish': ('Turkey', 'Azerbaijan'),
'Swahili': ('Tanzania',),
'Swedish': ('Sweden',),
'Icelandic': ('Iceland',),
'Estonian': ('Estonia',),
'Romanian': ('Romania',),
'Samoan': ('Samoa',),
'Slovenian': ('Slovenia',),
'Tok Pisin': ('Papua New Guinea',),
'Palauan': ('Palau',),
'Macedonian': ('Macedonia',),
'Hindi': ('India',),
'Dutch': ('Netherlands', 'Belgium', 'Suriname'),
'Marshallese': ('Marshall Islands',),
'Korean': ('Korea, North', 'Korea, South', 'North Korea', 'South Korea'),
'Vietnamese': ('Vietnam',),
'Danish': ('Denmark',),
'Khmer': ('Cambodia',),
'Lao': ('Laos',),
'Somali': ('Somalia',),
'Filipino': ('Philippines',),
'Hungarian': ('Hungary',),
'Ukrainian': ('Ukraine',),
'Bosnian': ('Bosnia and Herzegovina',),
'Georgian': ('Georgia',),
'Lithuanian': ('Lithuania',),
'Malay': ('Brunei',),
'Tetum': ('East Timor',),
'Norwegian': ('Norway',),
'Armenian': ('Armenia',),
'Russian': ('Russia',),
'Slovak': ('Slovakia',),
'Thai': ('Thailand',),
'Croatian': ('Croatia',),
'Turkmen': ('Turkmenistan',),
'Nepali': ('Nepal',),
'Finnish': ('Finland',),
'Uzbek': ('Uzbekistan',),
'Albanian': ('Albania', 'Kosovo'),
'Hebrew': ('Israel',),
'Bulgarian': ('Bulgaria',),
'Greek': ('Cyprus', 'Greece'),
'Burmese': ('Myanmar',),
'Latvian': ('Latvia',),
'Serbian': ('Serbia',),
'Afar': ('Eritrea',),
'Catalan': ('Andorra',),
'Chinese': ('China', 'Taiwan'),
'Czech': ('Czech Republic', 'Czechoslovakia'),
'Bislama': ('Vanuatu',),
'Japanese': ('Japan',),
'Kinyarwanda': ('Rwanda',),
'Amharic': ('Ethiopia',),
'Persian': ('Afghanistan', 'Iran'),
'Tajik': ('Tajikistan',),
'Mongolian': ('Mongolia',),
'Dzongkha': ('Bhutan',),
'Urdu': ('Pakistan',),
'Polish': ('Poland',),
'Sinhala': ('Sri Lanka',),
}
# Maps countries to their main language.
COUNTRY_LANG = {}
for lang in _LANG_COUNTRIES:
for country in _LANG_COUNTRIES[lang]:
for lang in LANG_COUNTRIES:
for country in LANG_COUNTRIES[lang]:
COUNTRY_LANG[country] = lang
......
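The loop above simply inverts LANG_COUNTRIES; a sketch of the
resulting mapping:

from imdb.linguistics import COUNTRY_LANG
print COUNTRY_LANG['Italy']   # -> Italian
print COUNTRY_LANG['Brazil']  # -> Portuguese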
......@@ -42,7 +42,7 @@ AXIS_PRECEDING_SIBLING = 'preceding-sibling'
AXES = (AXIS_ANCESTOR, AXIS_ATTRIBUTE, AXIS_CHILD, AXIS_DESCENDANT,
AXIS_FOLLOWING, AXIS_FOLLOWING_SIBLING, AXIS_PRECEDING_SIBLING)
XPATH_FUNCTIONS = ('starts-with', 'string-length')
XPATH_FUNCTIONS = ('starts-with', 'string-length', 'contains')
def tokenize_path(path):
......@@ -306,8 +306,11 @@ class PredicateFilter:
self.__filter = self.__axis
self.node_test = arguments
self.value = value
elif name == 'starts-with':
self.__filter = self.__starts_with
elif name in ('starts-with', 'contains'):
if name == 'starts-with':
self.__filter = self.__starts_with
else:
self.__filter = self.__contains
args = map(string.strip, arguments.split(','))
if args[0][0] == '@':
self.arguments = (True, args[0][1:], args[1][1:-1])
......@@ -362,6 +365,19 @@ class PredicateFilter:
return first.startswith(self.arguments[2])
return False
def __contains(self, node):
if self.arguments[0]:
# this is an attribute
attribute_name = self.arguments[1]
if node.has_key(attribute_name):
first = node[attribute_name]
return self.arguments[2] in first
elif self.arguments[1] == 'text()':
first = node.contents and node.contents[0]
if isinstance(first, BeautifulSoup.NavigableString):
return self.arguments[2] in first
return False
def __string_length(self, node):
if self.arguments[0]:
# this is an attribute
......
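The new contains() predicate follows standard XPath semantics; for
reference, a sketch of the same test written with lxml (not the
bsoupxpath module patched above):

from lxml import etree
root = etree.fromstring('<ul><li class="odd info">a</li>'
                        '<li class="even">b</li></ul>')
print root.xpath("//li[contains(@class, 'info')]/text()")
# -> ['a']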
......@@ -9,7 +9,7 @@ pages would be:
plot summary: http://akas.imdb.com/title/tt0094226/plotsummary
...and so on...
Copyright 2004-2011 Davide Alberani <da@erlug.linux.it>
Copyright 2004-2012 Davide Alberani <da@erlug.linux.it>
2008 H. Turgut Uyar <uyar@tekir.org>
This program is free software; you can redistribute it and/or modify
......@@ -450,12 +450,18 @@ class DOMHTMLMovieParser(DOMParserBase):
akas = data.get('akas') or []
other_akas = data.get('other akas') or []
akas += other_akas
nakas = []
for aka in akas:
aka = aka.strip()
if aka.endswith('" -'):
aka = aka[:-3].rstrip()
nakas.append(aka)
if 'akas' in data:
del data['akas']
if 'other akas' in data:
del data['other akas']
if akas:
data['akas'] = akas
if nakas:
data['akas'] = nakas
if 'runtimes' in data:
data['runtimes'] = [x.replace(' min', u'')
for x in data['runtimes']]
......@@ -952,10 +958,10 @@ class DOMHTMLReleaseinfoParser(DOMParserBase):
akas = data.get('akas') or []
nakas = []
for aka in akas:
title = aka.get('title', '').strip()
title = (aka.get('title') or '').strip()
if not title:
continue
countries = aka.get('countries', '').split('/')
countries = (aka.get('countries') or '').split('/')
if not countries:
nakas.append(title)
else:
......@@ -1125,6 +1131,7 @@ class DOMHTMLEpisodesRatings(DOMParserBase):
def _normalize_href(href):
if (href is not None) and (not href.lower().startswith('http://')):
if href.startswith('/'): href = href[1:]
# TODO: imdbURL_base may be set by the user!
href = '%s%s' % (imdbURL_base, href)
return href
......@@ -1252,7 +1259,7 @@ class DOMHTMLTechParser(DOMParserBase):
for t in x.split('\n') if t.strip()]))]
preprocessors = [
(re.compile('(<h5>.*?</h5>)', re.I), r'\1<div class="_imdbpy">'),
(re.compile('(<h5>.*?</h5>)', re.I), r'</div>\1<div class="_imdbpy">'),
(re.compile('((<br/>|</p>|</table>))\n?<br/>(?!<a)', re.I),
r'\1</div>'),
# the ones below are for the publicity parser
......@@ -1399,6 +1406,107 @@ def _parse_review(x):
return result
class DOMHTMLSeasonEpisodesParser(DOMParserBase):
"""Parser for the "episode list" page of a given movie.
The page should be provided as a string, as taken from
the akas.imdb.com server. The final result will be a
dictionary, with a key for every relevant section.
Example:
sparser = DOMHTMLSeasonEpisodesParser()
result = sparser.parse(episodes_html_string)
"""
extractors = [
Extractor(label='series link',
path="//div[@class='parent']",
attrs=[Attribute(key='series link',
path=".//a/@href")]
),
Extractor(label='series title',
path="//head/meta[@property='og:title']",
attrs=[Attribute(key='series title',
path="./@content")]
),
Extractor(label='seasons list',
path="//select[@id='bySeason']//option",
attrs=[Attribute(key='_seasons',
multi=True,
path="./@value")]),
Extractor(label='selected season',
path="//select[@id='bySeason']//option[@selected]",
attrs=[Attribute(key='_current_season',
path='./@value')]),
Extractor(label='episodes',
path=".",
group="//div[@class='info']",
group_key=".//meta/@content",
group_key_normalize=lambda x: 'episode %s' % x,
attrs=[Attribute(key=None,
multi=True,
path={
"link": ".//strong//a[@href][1]/@href",
"original air date": ".//div[@class='airdate']/text()",
"title": ".//strong//text()",
"plot": ".//div[@class='item_description']//text()"
}
)]
)
]
def postprocess_data(self, data):
series_id = analyze_imdbid(data.get('series link'))
series_title = data.get('series title', '').strip()
selected_season = data.get('_current_season',
'unknown season').strip()