Commit efc87ed9 authored by Ana Guerrero López

Import Upstream version 2.9

parent a0580248
See also CONTRIBUTORS.txt for a list of the most important developers.
See also CONTRIBUTORS.txt for a list of the most important developers
who share the copyright on some portions of the code.
I'd like to thank the following people for their help:
......@@ -9,6 +10,11 @@ I'd like to thank the following people for their help:
* Ana Guerrero, for the official debian package.
* Hadley Rich for reporting bugs and providing patches for troubles
parsing tv series' episodes and searching for tv series' titles.
* Vincent Crevot, for a bug report about unicode support.
* Jay Klein for a bug report and testing to fix a nasty bug in the
imdbpy2sql.py script (splitting too large data sets).
......@@ -30,7 +36,7 @@ I'd like to thank the following people for their help:
retrieve a movie/person object, given an URL.
* Sebastian Pölsterl, for a bug report about the cover url for
tv (mini) series.
tv (mini) series, and another one about search_* methods.
* Martin Kirst for many hints and the work on the imdbpyweb program.
......@@ -48,8 +54,6 @@ I'd like to thank the following people for their help:
* Trevor MacPhail, for a bug report about search_* methods and
the ParserBase.parse method.
* Sebastian Pölsterl, for a bug report about search_* methods.
* Guillaume Wisniewski, for a bug report.
* Kent Johnson, for a bug report.
......
Changelog for IMDbPY
====================
* What's the new in release 2.9 "Rodan! The Flying Monster" (21 Feb 2007)
[global]
- on 19 February IMDb has redesigned its site; this is the last
IMDbPY's release to parse the "old layout" pages; from now on,
the development will be geared to support the new web pages.
See the README.redesign file for more information.
- minor clean-ups and functions added to the helpers module.
[http]
- fixed some unicode-related problems searching for movie titles and
person names; also changed the queries used to search titles/names.
- fixed a bug parsing episodes for tv series.
- fixed a bug retrieving movieID for tv series, searching for titles.
[mobile]
- fixed a problem searching exact matches (movie titles only).
- fixed a bug with cast entries, after minor changes to the IMDb's
web site HTML.
[local and sql]
- fixed a bug parsing birth/death dates and notes.
[sql]
- (maybe) fixed another unicode-related bug fetching data from a
MySQL database. Maybe. Maybe. Maybe.
* What's the new in release 2.8 "Apollo 13" (14 Dec 2006)
[general]
- fix for environments where sys.stdin was overridden by a custom object.
......
IMDb's web site redesign
========================
On 19 February 2007, IMDb introduced a complete redesign of their
web site. This means that the 'http' and 'mobile' parser are no
more able to parse the new html; as a temporary solution, the account
used by IMDbPY was set to "use previous layout", meaning that - for
a certain amount of time - the current IMDbPY version (2.9) will work.
This (2.9) will be the last version of IMDbPY to parse the old layout:
from now on, on the CVS, the development will be geared to use the new
layout - and a new IMDb's account will be used.
Conclusion: if you find a bug in 'http' or 'mobile' in this release,
please report it anyway (it can also affect the new code), but consider
that a bit of time will be needed, to fix everything.
Even better, help the development subscribing to the mailing list:
http://imdbpy.sourceforge.net/?page=devel
......@@ -9,7 +9,8 @@ NOTE: it's always time to clean the code! <g>
[general]
* Write better summary() methods for Movie and Person classes.
* Some portions of code are poorly commented.
* The documentation is written in my funny English.
* The documentation is written in my funny Anglo-Bolognese.
* a better test-suite is really needed.
* Compatibility with Python 2.2 and previous versions is no more assured
for every data access system (the imdbpy2sql.py script for sure
requires at least Python 2.3).
......@@ -17,13 +18,14 @@ NOTE: it's always time to clean the code! <g>
beyond their initial goals.
* the 'year' keyword can probably be an int, instead of a string;
the '????' case can be handled directly by the analyze_title/build_title
functions.
functions. But how much code will be broken?
* for local and sql data access systems: some episode titles are
marked as {{SUSPENDED}}; they should probably be ignored.
[searches]
* Support advanced query for movie titles/person names.
* Support advanced query for movie titles/person names - if possible
this should be available in every data access systems.
[Movie objects]
......@@ -41,7 +43,8 @@ NOTE: it's always time to clean the code! <g>
notes ("written by", "as Aka Name", ...)
* The 'laserdisc' information for 'local' and 'sql' is probabily
wrong: I think they merge data from different laserdisc titles.
* there are links to hollywoodreporter.com that are not bathered in
Anyway these data are no more updated by IMDb, and so...
* there are links to hollywoodreporter.com that are not gathered in
the "external reviews" page.
......
......@@ -341,7 +341,9 @@ class IMDbBase:
return None if it's unable to get the imdbID."""
if not title: return None
import urllib
params = 'q=%s&s=pt' % str(urllib.quote_plus(title))
if isinstance(title, unicode):
title = title.encode('utf-8')
params = 'q=%s;s=pt' % str(urllib.quote_plus(title))
content = self._searchIMDb(params)
if not content: return None
from imdb.parser.http.searchMovieParser import BasicMovieParser
......@@ -356,7 +358,9 @@ class IMDbBase:
return None if it's unable to get the imdbID."""
if not name: return None
import urllib
params = 'q=%s&s=pn' % str(urllib.quote_plus(name))
if isinstance(name, unicode):
name = name.encode('utf-8')
params = 'q=%s;s=pn' % str(urllib.quote_plus(name))
content = self._searchIMDb(params)
if not content: return None
from imdb.parser.http.searchPersonParser import BasicPersonParser
......
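Review note: the two hunks above encode a unicode title or name to UTF-8 before percent-quoting it, and switch the parameter separator from `&` to `;`. A minimal sketch of the same idea in Python 3 follows. It assumes `urllib.parse.quote_plus` as the modern counterpart of `urllib.quote_plus`, and `build_title_query` is a hypothetical helper name that does not exist in IMDbPY.

```python
from urllib.parse import quote_plus


def build_title_query(title):
    """Build the 'q=...;s=pt' query string for a title lookup.

    Hypothetical helper, not part of IMDbPY: it only mirrors the
    encode-then-quote step shown in the diff.
    """
    # Text is encoded to UTF-8 bytes first, so non-ASCII characters
    # are percent-quoted byte by byte.
    if isinstance(title, str):
        title = title.encode('utf-8')
    return 'q=%s;s=pt' % quote_plus(title)


print(build_title_query('Amélie'))
print(build_title_query('The Matrix'))
```

Encoding explicitly keeps the quoted bytes independent of whatever default encoding the interpreter or the quoting function would otherwise pick.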
......@@ -55,11 +55,12 @@ def makeCgiPrintEncoding(encoding):
cgiPrint = makeCgiPrintEncoding('latin_1')
def makeModCGILinks(movieTxt, personTxt):
def makeModCGILinks(movieTxt, personTxt, encoding='latin_1'):
"""Make a function used to pretty-print movies and persons refereces;
movieTxt and personTxt are the strings used for the substitutions.
movieTxt must contains %(movieID)s and %(title)s, while personTxt
must contains %(personID)s and %(name)s."""
_cgiPrint = makeCgiPrintEncoding(encoding)
def modCGILinks(s, titlesRefs, namesRefs):
"""Substitute movies and persons references."""
# XXX: look ma'... more nested scopes! <g>
......@@ -69,8 +70,8 @@ def makeModCGILinks(movieTxt, personTxt):
if item:
movieID = item.movieID
to_replace = movieTxt % {'movieID': movieID,
'title': unicode(cgiPrint(to_replace),
'latin_1',
'title': unicode(_cgiPrint(to_replace),
encoding,
'xmlcharrefreplace')}
return to_replace
def _replacePerson(match):
......@@ -79,8 +80,8 @@ def makeModCGILinks(movieTxt, personTxt):
if item:
personID = item.personID
to_replace = personTxt % {'personID': personID,
'name': unicode(cgiPrint(to_replace),
'latin_1',
'name': unicode(_cgiPrint(to_replace),
encoding,
'xmlcharrefreplace')}
return to_replace
s = s.replace('<', '&lt;').replace('>', '&gt;')
......@@ -91,9 +92,11 @@ def makeModCGILinks(movieTxt, personTxt):
return modCGILinks
# links to the imdb.com web site.
modHtmlLinks = makeModCGILinks(
movieTxt='<a href="http://akas.imdb.com/title/tt%(movieID)s">%(title)s</a>',
personTxt='<a href="http://akas.imdb.com/name/nm%(personID)s">%(name)s</a>')
_movieTxt = '<a href="http://akas.imdb.com/title/tt%(movieID)s">%(title)s</a>'
_personTxt = '<a href="http://akas.imdb.com/name/nm%(personID)s">%(name)s</a>'
modHtmlLinks = makeModCGILinks(movieTxt=_movieTxt, personTxt=_personTxt)
modHtmlLinksASCII = makeModCGILinks(movieTxt=_movieTxt, personTxt=_personTxt,
encoding='ascii')
everyentcharrefs = entcharrefs.copy()
......
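Review note: the new `encoding` argument makes it possible to build an ASCII-only variant (`modHtmlLinksASCII`) next to the latin_1 default. The body of `makeCgiPrintEncoding` is not part of this diff, so the sketch below is only an assumption about the general technique: a factory that closes over the encoding and uses Python's standard `xmlcharrefreplace` error handler. `make_printer` is a hypothetical name, and the code is written for Python 3.

```python
def make_printer(encoding='latin_1'):
    """Return a function that encodes text for output in `encoding`.

    Hypothetical stand-in for makeCgiPrintEncoding: characters the
    target encoding cannot represent become numeric XML references.
    """
    def printer(text):
        return text.encode(encoding, 'xmlcharrefreplace')
    return printer


print_latin1 = make_printer()         # default, as in the diff
print_ascii = make_printer('ascii')   # ASCII-only variant

print(print_latin1('Pölsterl'))
print(print_ascii('Pölsterl'))
```

With latin_1 the `ö` survives as a single byte, while with ascii it is replaced by the reference `&#246;`, which is safe to embed in HTML output regardless of the page charset.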
......@@ -7,7 +7,7 @@ the imdb.IMDb function will return an instance of this class when
called with the 'accessSystem' argument set to "http" or "web"
or "html" (this is the default).
Copyright 2004-2006 Davide Alberani <da@erlug.linux.it>
Copyright 2004-2007 Davide Alberani <da@erlug.linux.it>
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
......@@ -238,16 +238,20 @@ class IMDbHTTPAccessSystem(IMDbBase):
kind can be tt (for titles) or nm (for names)
ton is the title or the name to search.
results is the maximum number of results to be retrieved."""
params = 'q=%s&%s=on&mx=%s' % (quote_plus(ton), kind, str(results))
if isinstance(ton, unicode):
ton = ton.encode('utf-8')
##params = 'q=%s&%s=on&mx=%s' % (quote_plus(ton), kind, str(results))
params = 's=%s;mx=%s;q=%s' % (kind, str(results), quote_plus(ton))
cont = self._retrieve(imdbURL_search % params)
if cont.find('more than 500 partial matches') == -1:
return cont
# The retrieved page contains no results, because too many
# titles or names contain the string we're looking for.
if kind == 'nm':
params = 'q=%s;more=nm' % quote_plus(ton)
else:
params = 'q=%s;more=tt' % quote_plus(ton)
params = 'q=%s;more=%s' % (quote_plus(ton), kind)
##if kind == 'nm':
## params = 'q=%s;more=nm' % quote_plus(ton)
##else:
## params = 'q=%s;more=tt' % quote_plus(ton)
size = 22528 + results * 512
return self._retrieve(imdbURL_search % params, size=size)
......@@ -256,8 +260,8 @@ class IMDbHTTPAccessSystem(IMDbBase):
# XXX: To retrieve the complete results list:
# params = urllib.urlencode({'more': 'tt', 'q': title})
##params = urllib.urlencode({'tt': 'on','mx': str(results),'q': title})
#params = 'q=%s&tt=on&mx=%s' % (quote_plus(title), str(results))
#cont = self._retrieve(imdbURL_search % params)
##params = 'q=%s&tt=on&mx=%s' % (quote_plus(title), str(results))
##cont = self._retrieve(imdbURL_search % params)
cont = self._get_search_content('tt', title, results)
return search_movie_parser.parse(cont, results=results)['data']
......
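Review note: the reworked `_get_search_content` first issues a bounded search and, if the page reports too many partial matches, retries with the `more=` query and a size limit derived from the requested number of results. The control flow can be sketched as below, in Python 3. `fetch` is a hypothetical callable standing in for `self._retrieve(imdbURL_search % params, size=...)`, and the helper names are illustrative only.

```python
from urllib.parse import quote_plus

TOO_MANY = 'more than 500 partial matches'


def search_params(kind, text, results):
    """Primary query: kind is 'tt' (titles) or 'nm' (names)."""
    return 's=%s;mx=%s;q=%s' % (kind, results, quote_plus(text))


def fallback_params(kind, text):
    """Query used when the primary search returns no result list."""
    return 'q=%s;more=%s' % (quote_plus(text), kind)


def get_search_content(fetch, kind, text, results):
    """Fetch the search page, retrying with the fallback query if needed.

    `fetch(params, size=-1)` is a hypothetical stand-in for the
    access system's own page retrieval method.
    """
    cont = fetch(search_params(kind, text, results))
    if TOO_MANY not in cont:
        return cont
    # Same size formula as in the diff: a fixed header allowance plus
    # roughly 512 bytes per requested result.
    size = 22528 + results * 512
    return fetch(fallback_params(kind, text), size=size)
```

Passing `kind` straight into the fallback query is what lets the diff drop the old `if kind == 'nm'` branch.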
......@@ -702,6 +702,7 @@ class HTMLAwardsParser(ParserBase):
"""Reset the parser."""
self._aw_data = []
self._is_big = 0
self._is_small = 0
self._is_current_assigner = 0
self._begin_aw = 0
self._in_td = 0
......@@ -2571,7 +2572,7 @@ class HTMLEpisodesParser(ParserBase):
self._cur_episode = None
self._in_episodes = 0
self._in_td_eps = 0
self._in_td_title = 1
self._in_td_title = 0
self._in_title = 0
self._cur_title = u''
self._curid = ''
......@@ -2619,7 +2620,9 @@ class HTMLEpisodesParser(ParserBase):
self._in_td_eps = 0
def start_a(self, attrs):
if self._ignore_this_table: return
# Commented to prevent a whole season to be skipped, if the last
# episode of the previous season has the "next US airing" info.
##if self._ignore_this_table: return
href = self.get_attr_value(attrs, 'href')
if href and href.startswith('/title/tt'):
curid = self.re_imdbID.findall(href)
......
......@@ -193,6 +193,10 @@ class HTMLSearchMovieParser(ParserBase):
self._current_imdbID = ''
def start_a(self, attrs):
# Prevent tv series to get the (wrong) movieID from the
# last episode, sometimes listed in the <li>...</li> tag
# along with the series' title.
if self._current_imdbID: return
link = self.get_attr_value(attrs, 'href')
# The next data is a movie title; now store the imdbID.
if link and link.lower().startswith('/title'):
......
......@@ -53,9 +53,11 @@ def _parseList(l, prefix, mline=1):
if ltmp:
reslapp(joiner(ltmp))
ltmp[:] = []
ltmpapp(line[firstlen:].strip())
data = line[firstlen:].strip()
if data: ltmpapp(data)
elif mline and line[:otherlen] == otherl:
ltmpapp(line[otherlen:].strip())
data = line[otherlen:].strip()
if data: ltmpapp(data)
else:
if ltmp:
reslapp(joiner(ltmp))
......@@ -115,6 +117,28 @@ def _parseBioBy(l):
tmpbio[:] = []
return bios
def _getDateAndNotes(s):
"""Parse (birth|death) date and notes."""
s = s.strip()
if not s: return ('', '')
notes = ''
if s[0].isdigit() or s.split()[0].lower() in ('c.', 'january', 'february',
'march', 'april', 'may', 'june',
'july', 'august', 'september',
'october', 'november',
'december', 'ca.', 'circa',
'????,'):
i = s.find(',')
if i != -1:
notes = s[i+1:].strip()
s = s[:i]
else:
notes = s
s = ''
if s == '????': s = ''
return s, notes
def _parseBiography(biol):
"""Parse the biographies.data file."""
res = {}
......@@ -126,19 +150,29 @@ def _parseBiography(biol):
x4 = x[:4]
x6 = x[:6]
if x4 == 'DB: ':
bdate = x.strip()
i = bdate.find(',')
if i != -1:
res['birth notes'] = bdate[i+1:].strip()
bdate = bdate[:i]
res['birth date'] = bdate[4:]
date, notes = _getDateAndNotes(x[4:])
if date:
res['birth date'] = date
if notes:
res['birth notes'] = notes
#bdate = x.strip()
#i = bdate.find(',')
#if i != -1:
# res['birth notes'] = bdate[i+1:].strip()
# bdate = bdate[:i]
#res['birth date'] = bdate[4:]
elif x4 == 'DD: ':
ddate = x.strip()
i = ddate.find(',')
if i != -1:
res['death notes'] = ddate[i+1:].strip()
ddate = ddate[:i]
res['death date'] = ddate[4:]
date, notes = _getDateAndNotes(x[4:])
if date:
res['death date'] = date
if notes:
res['death notes'] = notes
#ddate = x.strip()
#i = ddate.find(',')
#if i != -1:
# res['death notes'] = ddate[i+1:].strip()
# ddate = ddate[:i]
#res['death date'] = ddate[4:]
elif x6 == 'SP: * ':
res.setdefault('spouse', []).append(x[6:].strip())
elif x4 == 'RN: ':
......
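Review note: the new `_getDateAndNotes` helper treats the text before the first comma as a date only when the string starts with a digit or a known date-like word; otherwise the whole string is kept as notes. The sketch below restates that logic for Python 3 so its behaviour can be checked in isolation. The name `get_date_and_notes` and the sample inputs are illustrative, not taken from the biographies.data file.

```python
_DATE_STARTS = ('c.', 'ca.', 'circa', '????,',
                'january', 'february', 'march', 'april', 'may', 'june',
                'july', 'august', 'september', 'october', 'november',
                'december')


def get_date_and_notes(s):
    """Split a birth/death line into (date, notes)."""
    s = s.strip()
    if not s:
        return ('', '')
    notes = ''
    if s[0].isdigit() or s.split()[0].lower() in _DATE_STARTS:
        # Looks like a date: everything after the first comma is notes.
        i = s.find(',')
        if i != -1:
            notes = s[i + 1:].strip()
            s = s[:i]
    else:
        # No recognizable date: the whole string is notes.
        notes = s
        s = ''
    if s == '????':
        # An unknown date is reported as empty.
        s = ''
    return s, notes


print(get_date_and_notes('25 March 1942, Chicago, Illinois, USA'))
print(get_date_and_notes('????, Paris, France'))
print(get_date_and_notes('Paris, France'))
```

The old inline code always took the text before the first comma as the date, so a line holding only a place would have stored part of the place as the birth or death date.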
......@@ -6,7 +6,7 @@ IMDb's data for mobile systems.
the imdb.IMDb function will return an instance of this class when
called with the 'accessSystem' argument set to "mobile".
Copyright 2005-2006 Davide Alberani <da@erlug.linux.it>
Copyright 2005-2007 Davide Alberani <da@erlug.linux.it>
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
......@@ -23,14 +23,14 @@ along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
"""
import re, urllib
import re
from types import ListType, TupleType
from imdb.Movie import Movie
from imdb.Person import Person
from imdb.utils import analyze_title, analyze_name, canonicalName, re_episodes
from imdb._exceptions import IMDbDataAccessError
from imdb.parser.http import IMDbHTTPAccessSystem, imdbURL_search, \
from imdb.parser.http import IMDbHTTPAccessSystem, \
imdbURL_movie, imdbURL_person
from imdb.parser.http.utils import subXMLRefs, subSGMLRefs
......@@ -146,8 +146,10 @@ class IMDbMobileAccessSystem(IMDbHTTPAccessSystem):
if aonly:
stripped = _findBetween(name, '>', '</a>')
if len(stripped) == 1: name = stripped[0]
if name[0:1] == '>': name = name[1:]
name = _unHtml(name)
gt_indx = name.find('>')
if gt_indx != -1:
name = name[gt_indx+1:].lstrip()
if not (pid and name): continue
plappend(Person(personID=str(pid[0]), name=name,
currentRole=currentRole, notes=notes,
......@@ -158,8 +160,8 @@ class IMDbMobileAccessSystem(IMDbHTTPAccessSystem):
def _search_movie(self, title, results):
##params = urllib.urlencode({'tt': 'on','mx': str(results),'q': title})
#params = 'q=%s&tt=on&mx=%s' % (urllib.quote_plus(title), str(results))
#cont = self._mretrieve(imdbURL_search % params)
##params = 'q=%s&tt=on&mx=%s' % (urllib.quote_plus(title), str(results))
##cont = self._mretrieve(imdbURL_search % params)
cont = subXMLRefs(self._get_search_content('tt', title, results))
title = _findBetween(cont, '<title>', '</title>')
res = []
......@@ -169,8 +171,12 @@ class IMDbMobileAccessSystem(IMDbHTTPAccessSystem):
# XXX: a direct hit!
title = _unHtml(title[0])
midtag = _getTagWith(cont, 'name="arg"')
if not midtag: midtag = _getTagWith(cont, 'name="auto"')
mid = None
if midtag: mid = _findBetween(midtag[0], 'value="', '"')
if midtag:
mid = _findBetween(midtag[0], 'value="', '"')
if mid and not mid[0].isdigit():
mid = re_imdbID.findall(mid[0])
if not (mid and title): return res
res[:] = [(str(mid[0]), analyze_title(title, canonical=1))]
else:
......@@ -304,8 +310,6 @@ class IMDbMobileAccessSystem(IMDbHTTPAccessSystem):
if smie != -1:
castdata = castdata[:smib].strip() + \
castdata[smie+18:].strip()
castdata = castdata.replace(' bgcolor="#F0F0F0"', '')
castdata = castdata.replace(' bgcolor="#FFFFFF"', '')
castdata = castdata.replace('/tr> <tr', '/tr><tr')
cast = self._getPersons(castdata, sep='</tr><tr', hasCr=1)
if cast: d['cast'] = cast
......@@ -371,8 +375,8 @@ class IMDbMobileAccessSystem(IMDbHTTPAccessSystem):
def _search_person(self, name, results):
##params = urllib.urlencode({'nm': 'on', 'mx': str(results), 'q': name})
#params = 'q=%s&nm=on&mx=%s' % (urllib.quote_plus(name), str(results))
#cont = self._mretrieve(imdbURL_search % params)
##params = 'q=%s&nm=on&mx=%s' % (urllib.quote_plus(name), str(results))
##cont = self._mretrieve(imdbURL_search % params)
cont = subXMLRefs(self._get_search_content('nm', name, results))
name = _findBetween(cont, '<title>', '</title>')
res = []
......
......@@ -142,9 +142,12 @@ DB_TABLES = [Name, KindType, Title, AkaName, AkaTitle, RoleType, CastInfo,
def setConnection(uri, debug=False):
"""Set connection for every table."""
kw = {}
if uri.lower().startswith('mysql'):
kw['use_unicode'] = 1
kw['sqlobject_encoding'] = 'utf8'
# FIXME: it's absolutely unclear what we should do to correctly
# support unicode in MySQL; with the last versions of SQLObject,
# it seems that setting use_unicode=1 is the _wrong_ thing to do.
##if uri.lower().startswith('mysql'):
## kw['use_unicode'] = 1
## kw['sqlobject_encoding'] = 'utf8'
conn = connectionForURI(uri, **kw)
conn.debug = debug
for table in DB_TABLES:
......
......@@ -34,7 +34,7 @@ DO_SCRIPTS = 1
# version of the software; CVS releases contain a string
# like ".cvsYearMonthDay(OptionalChar)".
version = '2.8'
version = '2.9'
home_page = 'http://imdbpy.sf.net/'
......