Commit fde10387 authored by Ana Guerrero López

Import Upstream version 4.1

parent 8bc38665
......@@ -21,6 +21,10 @@ I'd like to thank the following people for their help:
* James Rubino for many bug reports.
* Andre LeBlanc for a bug report about airing date of tv series episodes.
* neonrush for a bug parsing Malcolm McDowell filmography!
* Alen Ribic for some bug reports and hints.
* Joachim Selke for some bug reports with SQLAlchemy and DB2 and a lot
......
Changelog for IMDbPY
====================
* What's new in release 4.1 "State Of Play" (02 May 2009)
[general]
- DTD definition.
- support for locale.
- support for the new style of movie titles ("The Title" is now
used internally, instead of "Title, The").
- minor fix to XML code to work with the test-suite.
[http]
- char references in the &#xHEXCODE; format are handled.
- fixed a bug with movies containing '....' in titles. And I'm
talking about Malcolm McDowell's filmography!
- 'airing' now contains objects (so the accessSystem variable is set).
- 'tv schedule' ('airing') pages of episodes can be parsed.
- 'tv schedule' is now a valid alias for 'airing'.
- minor fixes for empty/wrong strings.
[sql]
- in the database, soundex values for titles are always calculated
after the article (if any) is stripped.
- imdbpy2sql.py has the --fix-old-style-titles option, to handle
files in the old format.
- fixed a bug saving imdbIDs.
[local]
- the 'local' data access system should be considered obsolete, and
will probably be removed in the next release.
* What's new in release 4.0 "Watchmen" (12 Mar 2009)
[general]
- the installer is now based on setuptools.
......
......@@ -64,3 +64,21 @@ the README.package file), to replace these references with their own
strings (e.g.: a link to a web page); it's up to the user to ensure
that the output of the defaultModFunct function is valid XML.
DTD
===
Since version 4.1 a DTD is available; it can be found in this
directory or on the web, at:
http://imdbpy.sf.net/dtd/imdbpy41.dtd
The version number may (or may not) change with new releases, if
changes to the DTD have been introduced.
LOCALIZATION
============
Since version 4.1 it's possible to translate the XML tags;
see README.locale.
LOCALIZATION FOR IMDbPY
=======================
Since version 4.1 it's easy to translate the labels that describe
sets of information.
LIMITATION
==========
So far no internal message or exception is translated; the
internationalization is limited to the "tags" returned by the getAsXML
and asXML methods of the Movie, Person, Character or Company classes.
Beware that in many cases these "tags" are not the same as the "keys"
used to access information in the same classes when they are accessed
as dictionaries.
E.g.: you can translate "long-imdb-name" - the tag returned by
the call person.getAsXML('long imdb name') - but not "long imdb name"
directly.
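E.g., a minimal sketch (assuming 'person' is a Person instance that
already contains name information; the exact XML output may differ):
# 'long imdb name' is the key used for dictionary-like access...
print person['long imdb name']
# ...while 'long-imdb-name' is the corresponding tag in the XML output,
# and only the tag can be translated:
print person.getAsXML('long imdb name')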
USAGE
=====
If you want to add i18n to your IMDbPY-based application, all you need
to do is to switch to the 'imdbpy' text domain.
E.g.:
import imdb.locale
# Standard gettext stuff.
import gettext
from gettext import gettext as _
# Switch to the imdbpy domain.
gettext.textdomain('imdbpy')
# Request a translation.
print _(u'long-imdb-name')
ADD A NEW LANGUAGE
==================
In the imdb.locale package, you'll find some scripts useful for building
your own internationalization files.
If you create a new translation or update an existing one, you can send
it to the <imdbpy-devel@lists.sourceforge.net> mailing list, for
inclusion in future releases.
- the generatepot.py script should be used only when the DTD changes;
it creates the imdbpy.pot file (the one shipped is always
up-to-date).
- you can copy the imdbpy.pot file to your language's .po file (e.g.
imdbpy-fr.po for French) and modify it according to your needs.
- then you must run rebuildmo.py (which is automatically called at
install time by the setup.py script) to create the .mo files.
If you need to upgrade an existing .po file, after changes to the .pot
file (usually because the DTD was changed), you can use the msgmerge
tool, part of the GNU gettext suite.
E.g.:
msgmerge -N imdbpy-fr.po imdbpy.pot > new-imdbpy-fr.po
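As a rough sketch, the whole workflow for a new French translation could
look something like this (run from the imdb/locale directory; the DTD
file name and its path depend on the release):
python generatepot.py imdbpy41.dtd > imdbpy.pot
cp imdbpy.pot imdbpy-fr.po
# edit imdbpy-fr.po, filling in the msgstr entries
python rebuildmo.py
rebuildmo.py will then create the fr/LC_MESSAGES/imdbpy.mo file.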
......@@ -25,14 +25,14 @@ by the web server ("The Series" The Episode (2005)) or in the format
of the plain text data files ("The Series" (2004) {The Episode (#ser.epi)})
An example of the returned dictionary: call the function:
analyze_title('"The Series" The Episode (2005)', canonical=1)
analyze_title('"The Series" The Episode (2005)')
the result will be:
{'kind': 'episode', # kind is set to 'episode'.
'year': '2005', # the release year of this episode.
'title': 'Episode, The', # episode title
'title': 'The Episode', # episode title
'episode of': {'kind': 'tv series', # 'episode of' will contain
'title': 'Series, The'} # information about the series.
'title': 'The Series'} # information about the series.
}
......@@ -41,9 +41,9 @@ the same information.
The build_title() function takes an optional argument: ptdf; if it's
set to false (the default), it returns the title of the episode in
the format used by the IMDb's web server ("Series, The" An Episode (2005)),
the format used by the IMDb's web server ("The Series" An Episode (2006)),
otherwise it uses the format used by the plain text data files (something
like "Series, The" (2004) {An Episode (#2.5)})
like "The Series" (2004) {An Episode (#2.5)})
SERIES
......
......@@ -31,6 +31,12 @@ in a set of CSV (Comma Separated Values) files, to be later imported
in a database. Currently only MySQL, PostgreSQL and IBM DB2 are
supported.
In version 4.1 the imdbpy2sql.py script has the '--fix-old-style-titles'
command line argument; if used, every movie title will be converted to
the new style ("The Title", instead of the old "Title, The").
This option will go away in 4.2; it is intended only to support old
sets of plain text data files.
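E.g., assuming the usual -d (directory containing the plain text data
files) and -u (database URI) arguments of imdbpy2sql.py:
imdbpy2sql.py -d /path/to/plain/text/data/files/ -u mysql://user:password@localhost/imdb --fix-old-style-titles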
REQUIREMENTS
============
......
......@@ -132,6 +132,12 @@ Another new feature is the ability to get top250 and bottom100 lists;
see the "TOP250 / BOTTOM100 LISTS" section of the README.package file
for more information.
Since release 4.1 a DTD for the XML output is available (see
imdbpyXY.dtd). Other important features are locale (i18n) support (see
README.locale) and support for the new style of movie titles used by IMDb
(now in the "The Title" style, and no more as "Title, The").
The 'local' data access system should be considered obsolete.
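As a minimal sketch of the new title style (assuming network access
through the 'http' data access system; movieID 0133093 is used here
just as an example):
from imdb import IMDb
ia = IMDb('http')
movie = ia.get_movie('0133093')
print movie['title'] # new style, e.g. u'The Matrix'
print movie['canonical title'] # old style, e.g. u'Matrix, The'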
FEATURES
========
......
......@@ -23,7 +23,7 @@ Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
from copy import deepcopy
from imdb.utils import analyze_title, build_title, normalizeTitle, \
from imdb.utils import analyze_title, build_title, canonicalTitle, \
flatten, _Container, cmpMovies
......@@ -42,6 +42,7 @@ class Movie(_Container):
# Aliases for some not-so-intuitive keys.
keys_alias = {
'tv schedule': 'airing',
'user rating': 'rating',
'plot summary': 'plot',
'plot summaries': 'plot',
......@@ -171,7 +172,7 @@ class Movie(_Container):
def set_title(self, title):
"""Set the title of the movie."""
# XXX: convert title to unicode, if it's a plain string?
d_title = analyze_title(title, canonical=1)
d_title = analyze_title(title)
self.data.update(d_title)
def _additional_keys(self):
......@@ -190,26 +191,23 @@ class Movie(_Container):
"""Handle special keys."""
if self.data.has_key('episode of'):
if key == 'long imdb episode title':
return build_title(self.data, canonical=0)
return build_title(self.data)
elif key == 'series title':
ser_title = self.data['episode of'].get('canonical title') or \
self.data['episode of']['title']
return normalizeTitle(ser_title)
return self.data['episode of']['title']
elif key == 'canonical series title':
ser_title = self.data['episode of'].get('canonical title') or \
self.data['episode of']['title']
return ser_title
ser_title = self.data['episode of']['title']
return canonicalTitle(ser_title)
elif key == 'episode title':
return normalizeTitle(self.data.get('title', u''))
elif key == 'canonical episode title':
return self.data.get('title', u'')
elif key == 'canonical episode title':
return canonicalTitle(self.data.get('title', u''))
if self.data.has_key('title'):
if key == 'title':
return normalizeTitle(self.data['title'])
return self.data['title']
elif key == 'long imdb title':
return build_title(self.data, canonical=0)
return build_title(self.data)
elif key == 'canonical title':
return self.data['title']
return canonicalTitle(self.data['title'])
elif key == 'long imdb canonical title':
return build_title(self.data, canonical=1)
return None
......@@ -232,8 +230,8 @@ class Movie(_Container):
if not isinstance(other, self.__class__): return 0
if self.data.has_key('title') and \
other.data.has_key('title') and \
build_title(self.data, canonical=1) == \
build_title(other.data, canonical=1):
build_title(self.data, canonical=0) == \
build_title(other.data, canonical=0):
return 1
if self.accessSystem == other.accessSystem and \
self.movieID is not None and self.movieID == other.movieID:
......@@ -284,7 +282,7 @@ class Movie(_Container):
if self.has_key('long imdb episode title'):
title = self.get('long imdb episode title')
else:
title = self.get('long imdb canonical title')
title = self.get('long imdb title')
r = '<Movie id:%s[%s] title:_%s_>' % (self.movieID, self.accessSystem,
title)
if isinstance(r, unicode): r = r.encode('utf_8', 'replace')
......
......@@ -146,9 +146,9 @@ class Person(_Container):
elif key == 'canonical name':
return self.data['name']
elif key == 'long imdb name':
return build_name(self.data)
return build_name(self.data, canonical=0)
elif key == 'long imdb canonical name':
return build_name(self.data, canonical=1)
return build_name(self.data)
return None
def getID(self):
......
......@@ -25,7 +25,7 @@ Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
__all__ = ['IMDb', 'IMDbError', 'Movie', 'Person', 'Character', 'Company',
'available_access_systems']
__version__ = VERSION = '4.0'
__version__ = VERSION = '4.1'
# Import compatibility module (importing it is enough).
import _compat
......@@ -819,7 +819,7 @@ class IMDbBase:
if mop.movieID is not None:
imdbID = aSystem.get_imdbMovieID(mop.movieID)
else:
imdbID = aSystem.title2imdbID(build_title(mop, canonical=1,
imdbID = aSystem.title2imdbID(build_title(mop, canonical=0,
ptdf=1))
elif isinstance(mop, Person.Person):
if mop.personID is not None:
......@@ -830,6 +830,7 @@ class IMDbBase:
if mop.characterID is not None:
imdbID = aSystem.get_imdbCharacterID(mop.characterID)
else:
# canonical=0 ?
imdbID = aSystem.character2imdbID(build_name(mop, canonical=1))
elif isinstance(mop, Company.Company):
if mop.companyID is not None:
......
......@@ -36,7 +36,7 @@ from imdb.Person import Person
from imdb.Character import Character
from imdb.Company import Company
from imdb.parser.http.utils import re_entcharrefssub, entcharrefs, \
entcharrefsget, subXMLRefs, subSGMLRefs
subXMLRefs, subSGMLRefs
# An URL, more or less.
......
"""
locale package (imdb package).
This package provides scripts and files for internationalization
of IMDbPY.
Copyright 2009 H. Turgut Uyar <uyar@tekir.org>
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
"""
import gettext
import os
LOCALE_DIR = os.path.dirname(__file__)
gettext.bindtextdomain('imdbpy', LOCALE_DIR)
#!/usr/bin/env python
"""
generatepot.py script.
This script generates the imdbpy.pot file, from the DTD.
Copyright 2009 H. Turgut Uyar <uyar@tekir.org>
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
"""
import re
import sys
from datetime import datetime as dt
DEFAULT_MESSAGES = { }
ELEMENT_PATTERN = r"""<!ELEMENT\s+([^\s]+)"""
re_element = re.compile(ELEMENT_PATTERN)
POT_HEADER_TEMPLATE = r"""# Gettext message file for imdbpy
msgid ""
msgstr ""
"Project-Id-Version: imdbpy\n"
"POT-Creation-Date: %(now)s\n"
"PO-Revision-Date: YYYY-MM-DD HH:MM+0000\n"
"Last-Translator: YOUR NAME <YOUR@EMAIL>\n"
"Language-Team: TEAM NAME <TEAM@EMAIL>\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"Language-Code: en\n"
"Language-Name: English\n"
"Preferred-Encodings: utf-8\n"
"Domain: imdbpy\n"
"""
if len(sys.argv) != 2:
    print "Usage: %s dtd_file" % sys.argv[0]
    sys.exit()

dtdfilename = sys.argv[1]
dtd = open(dtdfilename).read()
elements = re_element.findall(dtd)
uniq = set(elements)
elements = list(uniq)

print POT_HEADER_TEMPLATE % {
    'now': dt.strftime(dt.now(), "%Y-%m-%d %H:%M+0000")
}
for element in sorted(elements):
    if element in DEFAULT_MESSAGES:
        print '# Default: %s' % DEFAULT_MESSAGES[element]
    else:
        print '# Default: %s' % element.replace('-', ' ').capitalize()
    print 'msgid "%s"' % element
    print 'msgstr ""'
    # use this part instead of the line above to generate the po file for English
    #if element in DEFAULT_MESSAGES:
    #    print 'msgstr "%s"' % DEFAULT_MESSAGES[element]
    #else:
    #    print 'msgstr "%s"' % element.replace('-', ' ').capitalize()
    print
#! /usr/bin/env python
# -*- coding: iso-8859-1 -*-
# Written by Martin v. Löwis <loewis@informatik.hu-berlin.de>
"""Generate binary message catalog from textual translation description.
This program converts a textual Uniforum-style message catalog (.po file) into
a binary GNU catalog (.mo file). This is essentially the same function as the
GNU msgfmt program, however, it is a simpler implementation.
Usage: msgfmt.py [OPTIONS] filename.po

Options:
    -o file
    --output-file=file
        Specify the output file to write to.  If omitted, output will go to a
        file named filename.mo (based off the input file name).

    -h
    --help
        Print this message and exit.

    -V
    --version
        Display version information and exit.
"""
import sys
import os
import getopt
import struct
import array
__version__ = "1.1"
MESSAGES = {}
def usage(code, msg=''):
    print >> sys.stderr, __doc__
    if msg:
        print >> sys.stderr, msg
    sys.exit(code)


def add(id, str, fuzzy):
    "Add a non-fuzzy translation to the dictionary."
    global MESSAGES
    if not fuzzy and str:
        MESSAGES[id] = str
def generate():
    "Return the generated output."
    global MESSAGES
    keys = MESSAGES.keys()
    # the keys are sorted in the .mo file
    keys.sort()
    offsets = []
    ids = strs = ''
    for id in keys:
        # For each string, we need size and file offset.  Each string is NUL
        # terminated; the NUL does not count into the size.
        offsets.append((len(ids), len(id), len(strs), len(MESSAGES[id])))
        ids += id + '\0'
        strs += MESSAGES[id] + '\0'
    output = ''
    # The header is 7 32-bit unsigned integers.  We don't use hash tables, so
    # the keys start right after the index tables.
    # translated string.
    keystart = 7*4+16*len(keys)
    # and the values start after the keys
    valuestart = keystart + len(ids)
    koffsets = []
    voffsets = []
    # The string table first has the list of keys, then the list of values.
    # Each entry has first the size of the string, then the file offset.
    for o1, l1, o2, l2 in offsets:
        koffsets += [l1, o1+keystart]
        voffsets += [l2, o2+valuestart]
    offsets = koffsets + voffsets
    output = struct.pack("Iiiiiii",
                         0x950412deL,       # Magic
                         0,                 # Version
                         len(keys),         # # of entries
                         7*4,               # start of key index
                         7*4+len(keys)*8,   # start of value index
                         0, 0)              # size and offset of hash table
    output += array.array("i", offsets).tostring()
    output += ids
    output += strs
    return output
def make(filename, outfile):
    ID = 1
    STR = 2

    # Compute .mo name from .po name and arguments
    if filename.endswith('.po'):
        infile = filename
    else:
        infile = filename + '.po'
    if outfile is None:
        outfile = os.path.splitext(infile)[0] + '.mo'

    try:
        lines = open(infile).readlines()
    except IOError, msg:
        print >> sys.stderr, msg
        sys.exit(1)

    section = None
    fuzzy = 0

    # Parse the catalog
    lno = 0
    for l in lines:
        lno += 1
        # If we get a comment line after a msgstr, this is a new entry
        if l[0] == '#' and section == STR:
            add(msgid, msgstr, fuzzy)
            section = None
            fuzzy = 0
        # Record a fuzzy mark
        if l[:2] == '#,' and 'fuzzy' in l:
            fuzzy = 1
        # Skip comments
        if l[0] == '#':
            continue
        # Now we are in a msgid section, output previous section
        if l.startswith('msgid'):
            if section == STR:
                add(msgid, msgstr, fuzzy)
            section = ID
            l = l[5:]
            msgid = msgstr = ''
        # Now we are in a msgstr section
        elif l.startswith('msgstr'):
            section = STR
            l = l[6:]
        # Skip empty lines
        l = l.strip()
        if not l:
            continue
        # XXX: Does this always follow Python escape semantics?
        l = eval(l)
        if section == ID:
            msgid += l
        elif section == STR:
            msgstr += l
        else:
            print >> sys.stderr, 'Syntax error on %s:%d' % (infile, lno), \
                  'before:'
            print >> sys.stderr, l
            sys.exit(1)
    # Add last entry
    if section == STR:
        add(msgid, msgstr, fuzzy)

    # Compute output
    output = generate()

    try:
        open(outfile, "wb").write(output)
    except IOError, msg:
        print >> sys.stderr, msg
def main():
    try:
        opts, args = getopt.getopt(sys.argv[1:], 'hVo:',
                                   ['help', 'version', 'output-file='])
    except getopt.error, msg:
        usage(1, msg)

    outfile = None
    # parse options
    for opt, arg in opts:
        if opt in ('-h', '--help'):
            usage(0)
        elif opt in ('-V', '--version'):
            print >> sys.stderr, "msgfmt.py", __version__
            sys.exit(0)
        elif opt in ('-o', '--output-file'):
            outfile = arg
    # do it
    if not args:
        print >> sys.stderr, 'No input file given'
        print >> sys.stderr, "Try `msgfmt --help' for more information."
        return

    for filename in args:
        make(filename, outfile)


if __name__ == '__main__':
    main()
#!/usr/bin/env python
"""
rebuildmo.py script.
This script builds the .mo files, from the .po files.
Copyright 2009 H. Turgut Uyar <uyar@tekir.org>
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
"""
import glob
import msgfmt
import os
#LOCALE_DIR = os.path.dirname(__file__)
def rebuildmo():
    lang_glob = 'imdbpy-*.po'
    created = []
    for input_file in glob.glob(lang_glob):
        lang = input_file[7:-3]
        if not os.path.exists(lang):
            os.mkdir(lang)
        mo_dir = os.path.join(lang, 'LC_MESSAGES')
        if not os.path.exists(mo_dir):
            os.mkdir(mo_dir)
        output_file = os.path.join(mo_dir, 'imdbpy.mo')
        msgfmt.make(input_file, output_file)
        created.append(lang)
    return created


if __name__ == '__main__':
    languages = rebuildmo()
    print 'Created locale for: %s.' % ' '.join(languages)
......@@ -31,7 +31,7 @@ from imdb.Movie import Movie
from imdb.utils import analyze_title, build_title, analyze_name, \
build_name, canonicalTitle, canonicalName, \
normalizeName, normalizeTitle, re_titleRef, \
re_nameRef, re_year_index, _articles, \
re_nameRef, re_year_index, _unicodeArticles, \
analyze_company_name
re_nameIndex = re.compile(r'\(([IVXLCDM]+)\)')
......@@ -51,8 +51,8 @@ class IMDbLocalAndSqlAccessSystem(IMDbBase):
"""Find titles or names references in strings."""
if isinstance(o, (unicode, str)):
for title in re_titleRef.findall(o):
a_title = analyze_title(title, canonical=1)
rtitle = build_title(a_title, canonical=1, ptdf=1)
a_title = analyze_title(title, canonical=0)
rtitle = build_title(a_title, ptdf=1)
if trefs.has_key(rtitle): continue
movieID = self._getTitleID(rtitle)
if movieID is None:
......@@ -157,7 +157,7 @@ def titleVariations(title, fromPtdf=0):
if title1:
title2 = title1
t2s = title2.split(u', ')
if t2s[-1].lower() in _articles:
if t2s[-1].lower() in _unicodeArticles:
title2 = u', '.join(t2s[:-1])
return title1, title2, title3
......@@ -329,12 +329,13 @@ def scan_titles(titles_list, title1, title2, title3, results=0,
# titleS -> titleR
# titleS, the -> titleR, the
if not searchingEpisode:
til = canonicalTitle(til)
ratios = [ratcliff(title1, til, sm1) + 0.05]
# til2 is til without the article, if present.
til2 = til
tils = til2.split(', ')
matchHasArt = 0
if tils[-1].lower() in _articles: