Commit 8cf69e1c authored by Ana Guerrero López

Import Upstream version 3.5

parent 6895fa63
......@@ -5,7 +5,7 @@ characters4local.py script.
This script creates some files to manage characters' information
for the 'local' data access system.
Copyright 2007 Davide Alberani <da@erlug.linux.it>
Copyright 2007-2008 Davide Alberani <da@erlug.linux.it>
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
......@@ -92,7 +92,13 @@ def doCast(dataF, roleCount=0):
if i < noWith:
# Eat 'attributeID'.
fread(3)
length = ord(fread(1))
try:
length = ord(fread(1))
except TypeError:
# Prevent the strange case where fread(1) returns '';
# it should not happen; maybe there's some garbage in
# the files...
length = 0
if length > 0:
curRole = fread(length)
noterixd = curRole.rfind('(')
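The guard added in this hunk can be illustrated in isolation. The following is a minimal sketch, not the script's actual code: read_length_prefixed and the sample bytes are hypothetical, and it is written for Python 3 (the script itself targets Python 2). It relies only on the fact that ord() raises TypeError when given an empty string, which is what a read past the end of the data returns.

```python
import io


def read_length_prefixed(stream):
    """Read one length-prefixed field; return b'' when no data is left.

    Hypothetical helper: it mirrors the try/except around ord(fread(1)),
    not the real doCast() function.
    """
    prefix = stream.read(1)
    try:
        length = ord(prefix)
    except TypeError:
        # read(1) returned an empty string (truncated or garbage input).
        length = 0
    if length > 0:
        return stream.read(length)
    return b''


# A one-byte length (4) followed by the four-byte payload.
record = io.BytesIO(b'\x04Rick')
first = read_length_prefixed(record)   # the payload
second = read_length_prefixed(record)  # nothing left: empty result
```

Treating a missing length byte as a zero-length field lets the loop continue instead of aborting on a damaged file.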
......@@ -123,7 +129,7 @@ def doCast(dataF, roleCount=0):
def writeData(d, directory):
"""Write d data into file in the specified directory."""
"""Write d data into files in the specified directory."""
# Open files.
print 'Start writing data to directory %s.' % directory
char2id = anydbm.open(os.path.join(directory, 'character2id.index'), 'n')
......
......@@ -15,12 +15,24 @@ I'd like to thank the following people for their help:
* Jesper Nøhr for a lot of testing, especially on 'sql'.
* Mark Armendariz for a bug report about too long field in MySQL db
and some tests/analyses.
* Alexy Khrabrov, for a report about a subtle bug in imdbpy2sql.py.
* Clark Bassett for bug reports and fixes about the imdbpy2sql.py
script and the cutils.c C module.
* mumas for reporting a bug in summary methods.
* Ken R. Garland for a bug report about 'cover url' and a lot of
other hints.
* Steven Ovits for hints and tests with Microsoft SQL Server, SQLExpress
and preliminary work on supporting diff files.
* Fredrik Arnell for tests and bug reports about the imdbpy2sql.py script.
* Arnab for a bug report in the imdbpy2sql.py script.
* Elefterios Stamatogiannakis for the hint about transactions and SQLite,
......
Changelog for IMDbPY
====================
* What's the new in release 3.5 "Blade Runner" (19 Apr 2008)
[general]
- first changes to work on Symbian mobile phones.
- now there is an imdb.available_access_systems() function, that can
be used to get a list of available data access systems.
- it's possible to pass 'results' as a parameter of the imdb.IMDb
function; it sets the number of results to return for queries.
- fixed summary() method in Movie and Person, to correctly handle
unicode chars.
- the helpers.makeObject2Txt function now supports recursion over
dictionaries.
- cutils.c MXLINELEN increased from 512 to 1024; some critical
strcpy replaced with strncpy.
- fixed configuration parser to be compatible with Python 2.2.
- updated list of articles and some stats in the comments.
- documentation updated.
[sql]
- fixed minor bugs in imdbpy2sql.py.
- restores imdbIDs for characters.
- now CharactersCache honors custom queries.
- the imdbpy2sql.py's --mysql-force-myisam command line option can be
used to force usage of MyISAM tables on InnoDB databases.
- added some warnings to the imdbpy2sql.py script.
[local]
- fixed a bug in the fall-back function used to scan movie titles,
when the cutils module is not available.
- mini biographies are cut up to 2**16-1 chars, to prevent troubles
with some MySQL servers.
- fixed bug in characters4local.py, dealing with some garbage in the files.
* What's the new in release 3.4 "Flatliners" (16 Dec 2007)
[general]
- *** NOTE FOR PACKAGERS *** in the docs directory there is the
......
......@@ -7,7 +7,7 @@ database; this required some substantial changes to how actors'
and actresses' roles were handled.
Starting with release 3.4, "local" and "sql" data access systems
are supported, too - but they work a bit differently from "http"
and "mobile". See "MOBILE AND LOCAL" below.
and "mobile". See "SQL AND LOCAL" below.
The currentRole instance attribute can be found in every instance
of Person, Movie and Character classes, even if actually the Character
......@@ -50,14 +50,16 @@ will return a good-old-unicode string, like expected in the previous
version of IMDbPY.
MOBILE AND LOCAL
================
SQL AND LOCAL
=============
Fetching data from the web, only characters with an active page
on the web site will have their characterID; we don't have this
information when accessing "sql" and "local", so _every_ character
will have an associated characterID.
This way, every character with the same name will share the same ID.
This way, every character with the same name will share the same
characterID, even if - in fact - they may not be portraying the
same character.
For "local", to activate support for characters, you have to
run the characters4local.py script, specifying the directory
......
......@@ -8,8 +8,10 @@ file.
Obviously you can still prefer to use the 'local' data access
system if you're already using the moviedb program.
NOTE: see README.currentRole for information about character support.
NOTE: see README.currentRole for information about character support;
to put it simply, after you've installed everything you can use the
characters4local.py script to generate files for characters (it will
require some time).
Select a mirror of the "The Plain Text Data Files" from
the http://www.imdb.com/interfaces.html page and download
NOTE: the current (3.24) moviedb version is old and it was not
thought with tv series episodes support in mind.
It can still work very well, but you've to modify some constants
in the code: edit the "moviedb.h" file in the "src" directory,
and change MAXTITLES to _at least_ 1400000, MAXNAKAENTRIES
to 700000 and LINKSTART to 1000000.
and change MAXTITLES to _at least_ 1600000, MAXNAKAENTRIES
to 700000, MAXFILMOGRAPHIES to 20470 and LINKSTART to 1000000.
Also, setting MXLINELEN to 1023 is a good idea.
See http://us.imdb.com/database_statistics for more up-to-date
statistics.
......
......@@ -84,8 +84,8 @@ is available at:
On some mobile phone a pair of modules can be missing, and
you have to install it manually as libraries; you can find
these two modules (sgmllib.py and htmlentitydefs.py) here:
http://imdbpy.sourceforge.net/symbiangui/mobile-imdbpy-modules-0.1.tar.gz
these modules (sgmllib.py, htmlentitydefs.py and ConfigParser.py) here:
http://imdbpy.sourceforge.net/?page=mobile
THE "HTTPTHIN" DATA ACCESS SYSTEM
......
......@@ -89,7 +89,7 @@ The fastest database appears to be MySQL, with about 95 minutes to
complete on my test system (read below).
A lot of memory (RAM or swap space) is required, in the range of
at least 150/200 megabytes (plus more for the database server).
In the end, the database will require between 1.5GB and 3GB of disc space.
In the end, the database will require between 2.5GB and 5GB of disc space.
As said, performance varies greatly from one database server to another:
MySQL, for instance, has an executemany() method of the cursor object
......@@ -98,21 +98,18 @@ database requires a call to the execute() method for every single row
of data, and they will be much slower - from 2 to 7 times slower than
MySQL.
I've done some tests, using an AMD Athlon 1800+, 512MB of RAM, over a
complete plain text data files set (as of 12 Nov 2006, with about
890.000 titles and over 2.000.000 names):
I've done some tests, using an AMD Athlon 1800+, 1GB of RAM, over a
complete plain text data files set (as of 11 Apr 2008, with more than
1.200.000 titles and over 2.200.000 names):
database | time in minutes: total (insert data/create indexes)
----------------------+-----------------------------------------------------
MySQL 5.0 MyISAM | 115 (95/20)
MySQL 5.0 InnoDB | ??? (80/???)
| maybe I've not cofigurated it properly: it
| looks like the creation of the indexes will
| take more than 2 or 3 hours. But see NOTES below.
PostgreSQL 8.1 | 190 (177/13)
SQLite 3.2 | ??? (80/???)
| with the "--sqlite-transactions" command line
| option; otherwise it's _really_ slow: even
MySQL 5.0 MyISAM | 205 (160/45)
MySQL 5.0 InnoDB | _untested_, see NOTES below.
PostgreSQL 8.1 | 560 (530/30)
SQLite 3.3 | ??? (150/???) - very slow building indexes.
| Timed with the "--sqlite-transactions" command
| line option; otherwise it's _really_ slow: even
| 35 hours or more.
SQL Server | about 3 or 4 hours.
......@@ -127,12 +124,24 @@ The imdbpy2sql.py will print a lot of debug information on standard output;
you can save it in a file, appending (without quotes) "2>&1 | tee output.txt"
[MySQL InnoDB]
[MySQL]
In general, if you get an embarrassingly high number of "TOO MANY DATA
... SPLITTING" lines, consider increasing max_allowed_packet (in the
configuration of your MySQL server) to at least 8M or 16M.
Otherwise, inserting the data will be very slow, and some data may
be lost.
[MySQL InnoDB and MyISAM]
InnoDB is abysmally slow for our purposes: my suggestion is to always
use MyISAM tables and - if you really want to use InnoDB - convert
the tables later.
The imdbpy2sql.py script provides a simple way to manage this case,
The imdbpy2sql.py script provides a simple way to manage these cases,
see ADVANCED FEATURES below.
In my opinion, the cleaner thing to do is to set the server to use
MyISAM tables or - if you can't modify the server - use the
--mysql-force-myisam command line option of imdbpy2sql.py.
Anyway, if you really need to use InnoDB, in the server-side settings
I recommend setting innodb_file_per_table to "true".
......@@ -232,9 +241,9 @@ or BEFORE_CREATE time...), replacing the "%(table)s" text in the QUERY
with the appropriate table name.
Other available TIMEs are: 'BEFORE_MOVIES_TODB', 'AFTER_MOVIES_TODB',
'BEFORE_PERSONS_TODB', 'AFTER_PERSONS_TODB', 'BEFORE_SQLDATA_TODB',
'AFTER_SQLDATA_TODB', 'BEFORE_AKAMOVIES_TODB' and 'AFTER_AKAMOVIES_TODB';
they take no modifiers.
'BEFORE_PERSONS_TODB', 'AFTER_PERSONS_TODB', 'BEFORE_CHARACTERS_TODB',
'AFTER_CHARACTERS_TODB', 'BEFORE_SQLDATA_TODB', 'AFTER_SQLDATA_TODB',
'BEFORE_AKAMOVIES_TODB' and 'AFTER_AKAMOVIES_TODB'; they take no modifiers.
Special TIMEs 'BEFORE_EVERY_TODB' and 'AFTER_EVERY_TODB' apply to
every BEFORE_* and AFTER_* TIME above mentioned.
These commands are executed before and after every _toDB() call in
......
......@@ -32,8 +32,9 @@
accessSystem = http
# Optional:
#proxy = http://localhost:8080/
# Optional (option common to every data access system):
# Optional (options common to every data access system):
#adultSearch = on
#results = 20
# Parameters for the 'mobile' data access system.
#accessSystem = mobile
......
......@@ -4,7 +4,7 @@ Character module (imdb package).
This module provides the Character class, used to store information about
a given character.
Copyright 2007 Davide Alberani <da@erlug.linux.it>
Copyright 2007-2008 Davide Alberani <da@erlug.linux.it>
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
......@@ -178,16 +178,16 @@ class Character(_Container):
def summary(self):
"""Return a string with a pretty-printed summary for the character."""
if not self: return u''
s = 'Character\n=====\nName: %s\n' % \
s = u'Character\n=====\nName: %s\n' % \
self.get('name', u'')
bio = self.get('biography')
if bio:
s += 'Biography: %s\n' % bio[0]
s += u'Biography: %s\n' % bio[0]
filmo = self.get('filmography')
if filmo:
a_list = [x.get('long imdb canonical title', u'')
for x in filmo[:5]]
s += 'Last movies with this character: %s.\n' % '; '.join(a_list)
s += u'Last movies with this character: %s.\n' % u'; '.join(a_list)
return s
......@@ -4,7 +4,7 @@ Movie module (imdb package).
This module provides the Movie class, used to store information about
a given movie.
Copyright 2004-2007 Davide Alberani <da@erlug.linux.it>
Copyright 2004-2008 Davide Alberani <da@erlug.linux.it>
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
......@@ -293,51 +293,51 @@ class Movie(_Container):
def summary(self):
"""Return a string with a pretty-printed summary for the movie."""
if not self: return u''
def _nameAndRole(personList, joiner=', '):
def _nameAndRole(personList, joiner=u', '):
"""Build a pretty string with name and role."""
nl = []
for person in personList:
n = person.get('name', u'')
if person.currentRole: n += ' (%s)' % person.currentRole
if person.currentRole: n += u' (%s)' % person.currentRole
nl.append(n)
return joiner.join(nl)
s = 'Movie\n=====\nTitle: %s\n' % \
s = u'Movie\n=====\nTitle: %s\n' % \
self.get('long imdb canonical title', u'')
genres = self.get('genres')
if genres: s += 'Genres: %s.' % ', '.join(genres)
if genres: s += u'Genres: %s.' % u', '.join(genres)
director = self.get('director')
if director:
s += 'Director: %s.\n' % _nameAndRole(director)
s += u'Director: %s.\n' % _nameAndRole(director)
writer = self.get('writer')
if writer:
s += 'Writer: %s.\n' % _nameAndRole(writer)
s += u'Writer: %s.\n' % _nameAndRole(writer)
cast = self.get('cast')
if cast:
cast = cast[:5]
s += 'Cast: %s.\n' % _nameAndRole(cast)
s += u'Cast: %s.\n' % _nameAndRole(cast)
runtime = self.get('runtimes')
if runtime:
s += 'Runtime: %s.\n' % ', '.join(runtime)
s += u'Runtime: %s.\n' % u', '.join(runtime)
countries = self.get('countries')
if countries:
s += 'Country: %s.\n' % ', '.join(countries)
s += u'Country: %s.\n' % u', '.join(countries)
lang = self.get('languages')
if lang:
s += 'Language: %s.\n' % ', '.join(lang)
s += u'Language: %s.\n' % u', '.join(lang)
rating = self.get('rating')
if rating:
s += 'Rating: %s' % rating
s += u'Rating: %s' % rating
nr_votes = self.get('votes')
if nr_votes:
s += '(%s votes)' % nr_votes
s += '.\n'
s += u'(%s votes)' % nr_votes
s += u'.\n'
plot = self.get('plot')
if plot:
plot = plot[0]
i = plot.find('::')
if i != -1:
plot = plot[i+2:]
s += 'Plot: %s' % plot
s += u'Plot: %s' % plot
return s
......@@ -4,7 +4,7 @@ Person module (imdb package).
This module provides the Person class, used to store information about
a given person.
Copyright 2004-2007 Davide Alberani <da@erlug.linux.it>
Copyright 2004-2008 Davide Alberani <da@erlug.linux.it>
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
......@@ -226,35 +226,35 @@ class Person(_Container):
def summary(self):
"""Return a string with a pretty-printed summary for the person."""
if not self: return u''
s = 'Person\n=====\nName: %s\n' % \
s = u'Person\n=====\nName: %s\n' % \
self.get('long imdb canonical name', u'')
bdate = self.get('birth date')
if bdate:
s += 'Birth date: %s' % bdate
s += u'Birth date: %s' % bdate
bnotes = self.get('birth notes')
if bnotes:
s += ' (%s)' % bnotes
s += '.\n'
s += u' (%s)' % bnotes
s += u'.\n'
ddate = self.get('death date')
if ddate:
s += 'Death date: %s' % ddate
s += u'Death date: %s' % ddate
dnotes = self.get('death notes')
if dnotes:
s += ' (%s)' % dnotes
s += '.\n'
s += u' (%s)' % dnotes
s += u'.\n'
bio = self.get('mini biography')
if bio:
s += 'Biography: %s\n' % bio[0]
s += u'Biography: %s\n' % bio[0]
director = self.get('director')
if director:
d_list = [x.get('long imdb canonical title', u'')
for x in director[:3]]
s += 'Last movies directed: %s.\n' % '; '.join(d_list)
s += u'Last movies directed: %s.\n' % u'; '.join(d_list)
act = self.get('actor') or self.get('actress')
if act:
a_list = [x.get('long imdb canonical title', u'')
for x in act[:5]]
s += 'Last movies acted: %s.\n' % '; '.join(a_list)
s += u'Last movies acted: %s.\n' % u'; '.join(a_list)
return s
......@@ -6,7 +6,7 @@ a person from the IMDb database.
It can fetch data through different media (e.g.: the IMDb web pages,
a local installation, a SQL database, etc.)
Copyright 2004-2007 Davide Alberani <da@erlug.linux.it>
Copyright 2004-2008 Davide Alberani <da@erlug.linux.it>
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
......@@ -23,8 +23,12 @@ along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
"""
__all__ = ['IMDb', 'IMDbError', 'Movie', 'Person', 'Character']
__version__ = VERSION = '3.4'
__all__ = ['IMDb', 'IMDbError', 'Movie', 'Person', 'Character',
'available_access_systems']
__version__ = VERSION = '3.5'
# Import compatibility module.
import _compat
import sys, os, ConfigParser
from types import UnicodeType, TupleType, ListType, MethodType
......@@ -95,6 +99,8 @@ class ConfigParserWithCase(ConfigParser.ConfigParser):
def _manageValue(self, value):
"""Custom substitutions for values."""
if not isinstance(value, (str, unicode)):
return value
vlower = value.lower()
if vlower in self._boolean_states:
return self._boolean_states[vlower]
......@@ -111,10 +117,10 @@ class ConfigParserWithCase(ConfigParser.ConfigParser):
def items(self, section, *args, **kwds):
"""Return a list of (key, value) tuples of items of the
given section."""
if not self.has_section(section):
if section != 'DEFAULT' and not self.has_section(section):
return []
items = ConfigParser.ConfigParser.items(self, section, *args, **kwds)
return [(key, self._manageValue(value)) for key, value in items]
keys = ConfigParser.ConfigParser.options(self, section)
return [(k, self.get(section, k, *args, **kwds)) for k in keys]
def getDict(self, section):
"""Return a dictionary of items of the specified section."""
......@@ -172,6 +178,33 @@ def IMDb(accessSystem=None, *arguments, **keywords):
% accessSystem
def available_access_systems():
"""Return the list of available data access systems."""
asList = []
# XXX: trying to import modules is a good thing?
try:
from parser.http import IMDbHTTPAccessSystem
asList += ['http', 'httpThin']
except ImportError:
pass
try:
from parser.mobile import IMDbMobileAccessSystem
asList.append('mobile')
except ImportError:
pass
try:
from parser.local import IMDbLocalAccessSystem
asList.append('local')
except ImportError:
pass
try:
from parser.sql import IMDbSqlAccessSystem
asList.append('sql')
except ImportError:
pass
return asList
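The probing pattern used by available_access_systems() (try the import, skip the entry on ImportError) can be shown with a generic helper. This is a sketch under assumptions: available_modules and the module names in probe are hypothetical and chosen only so the example runs anywhere; it is written for Python 3 and does not import anything from the imdb package.

```python
import importlib


def available_modules(candidates):
    """Return the labels whose module can actually be imported.

    candidates is a list of (label, module_name) pairs; both the helper
    and its argument format are assumptions made for this sketch.
    """
    found = []
    for label, module_name in candidates:
        try:
            importlib.import_module(module_name)
        except ImportError:
            # Optional dependency missing: leave this label out.
            continue
        found.append(label)
    return found


# 'json' ships with Python; the second module name is deliberately bogus.
probe = [('json', 'json'), ('missing', 'no_such_module_for_this_sketch')]
usable = available_modules(probe)
```

The cost of this approach, hinted at by the XXX comment in the hunk, is that probing really imports each module, with whatever side effects and start-up time that implies.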
# XXX: I'm not sure this is a good guess.
# I suppose that an argument of the IMDb function can be used to
# set a default encoding for the output, and then Movie, Person and
......@@ -191,7 +224,8 @@ class IMDbBase:
# in the subclasses).
accessSystem = 'UNKNOWN'
def __init__(self, defaultModFunct=None, *arguments, **keywords):
def __init__(self, defaultModFunct=None, results=20,
*arguments, **keywords):
"""Initialize the access system.
If specified, defaultModFunct is the function used by
default by the Person, Movie and Character objects, when
......@@ -200,6 +234,14 @@ class IMDbBase:
# The function used to output the strings that need modification (the
# ones containing references to movie titles and person names).
self._defModFunct = defaultModFunct
# Number of results to get.
try:
results = int(results)
except (TypeError, ValueError):
results = 20
if results < 1:
results = 20
self._results = results
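The normalization of the results argument shown above can be read as a small pure function. A minimal sketch, assuming a hypothetical helper name normalize_results (the real code does this inline in __init__) and Python 3:

```python
def normalize_results(results, default=20):
    """Coerce results to a positive int, falling back to default.

    Hypothetical stand-alone version of the inline checks: values that
    cannot be converted, and values below 1, yield the default.
    """
    try:
        results = int(results)
    except (TypeError, ValueError):
        return default
    if results < 1:
        return default
    return results


from_config = normalize_results('5')   # strings from a config file work
from_none = normalize_results(None)    # TypeError path: default
from_zero = normalize_results(0)       # below 1: default
```

Accepting strings matters here because the value may come from the configuration file (the "results = 20" option), where everything is read as text.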
def _normalize_movieID(self, movieID):
"""Normalize the given movieID."""
......@@ -281,9 +323,11 @@ class IMDbBase:
# subclass, somewhere under the imdb.parser package.
raise NotImplementedError, 'override this method'
def search_movie(self, title, results=20):
def search_movie(self, title, results=None):
"""Return a list of Movie objects for a query for the given title.
The results argument is the maximum number of results to return."""
if results is None:
results = self._results
try:
results = int(results)
except (ValueError, OverflowError):
......@@ -325,10 +369,12 @@ class IMDbBase:
# subclass, somewhere under the imdb.parser package.
raise NotImplementedError, 'override this method'
def search_person(self, name, results=20):
def search_person(self, name, results=None):
"""Return a list of Person objects for a query for the given name.
The results argument is the maximum number of results to return."""
if results is None:
results = self._results
try:
results = int(results)
except (ValueError, OverflowError):
......@@ -368,10 +414,12 @@ class IMDbBase:
# subclass, somewhere under the imdb.parser package.
raise NotImplementedError, 'override this method'
def search_character(self, name, results=20):
def search_character(self, name, results=None):
"""Return a list of Character objects for a query for the given name.
The results argument is the maximum number of results to return."""
if results is None:
results = self._results
try:
results = int(results)
except (ValueError, OverflowError):
......
"""
_compat module (imdb package).
This module provides compatibility functions used by the imdb package
to deal with unusual environments.
Copyright 2008 Davide Alberani <da@erlug.linux.it>
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
"""
import os
# If true, we're working on a Symbian device.
if os.name == 'e32':
# Replace os.path.expandvars and os.path.expanduser, if needed.
def _noact(x):
"""Ad-hoc replacement for IMDbPY."""
return x
try:
os.path.expandvars
except AttributeError:
os.path.expandvars = _noact
try:
os.path.expanduser
except AttributeError:
os.path.expanduser = _noact
# time.strptime is missing, on Symbian devices.
import time
try:
time.strptime
except AttributeError:
import re
_re_web_time = re.compile(r'Episode dated (\d+) (\w+) (\d+)')
_re_ptdf_time = re.compile(r'\((\d+)-(\d+)-(\d+)\)')
_month2digit = {'January': '1', 'February': '2', 'March': '3',
'April': '4', 'May': '5', 'June': '6', 'July': '7',
'August': '8', 'September': '9', 'October': '10',
'November': '11', 'December': '12'}
def strptime(s, format):
"""Ad-hoc strptime replacement for IMDbPY."""
try:
if format.startswith('Episode'):
res = _re_web_time.findall(s)[0]
return (int(res[2]), int(_month2digit[res[1]]), int(res[0]),
0, 0, 0, 0, 1, 0)
else:
res = _re_ptdf_time.findall(s)[0]
return (int(res[0]), int(res[1]), int(res[2]),
0, 0, 0, 0, 1, 0)
except:
raise ValueError, u'error in IMDbPY\'s ad-hoc strptime!'
time.strptime = strptime
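The two date shapes handled by the replacement strptime can be exercised without touching the time module. The sketch below reuses the two regular expressions from the module but is otherwise an assumption: parse_date is a hypothetical name, it dispatches on the input text rather than on a format string, it returns only (year, month, day), and it is written for Python 3.

```python
import re

_RE_WEB = re.compile(r'Episode dated (\d+) (\w+) (\d+)')
_RE_PTDF = re.compile(r'\((\d+)-(\d+)-(\d+)\)')
_MONTH2DIGIT = {'January': 1, 'February': 2, 'March': 3, 'April': 4,
                'May': 5, 'June': 6, 'July': 7, 'August': 8,
                'September': 9, 'October': 10, 'November': 11,
                'December': 12}


def parse_date(s):
    """Return (year, month, day) for either date shape, else ValueError."""
    match = _RE_WEB.search(s)
    if match:
        day, month_name, year = match.groups()
        month = _MONTH2DIGIT.get(month_name)
        if month is None:
            raise ValueError('unknown month: %r' % (month_name,))
        return (int(year), month, int(day))
    match = _RE_PTDF.search(s)
    if match:
        year, month, day = match.groups()
        return (int(year), int(month), int(day))
    raise ValueError('unrecognized date: %r' % (s,))


web_style = parse_date('Episode dated 19 April 2008')
ptdf_style = parse_date('(2008-04-19)')
```

Unlike the bare except in the module, this sketch raises ValueError only for inputs it cannot match, so unrelated errors are not masked.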
......@@ -88,12 +88,17 @@ def makeObject2Txt(movieTxt=None, personTxt=None, characterTxt=None,
if _limitRecursion is None:
_limitRecursion = 0
elif _limitRecursion > 5:
return ''
return u''
_limitRecursion += 1
# XXX: recur also on dictionaries' keys and values?
if isinstance(obj, (list, tuple)):
return joiner.join([object2txt(o, _limitRecursion=_limitRecursion)
for o in obj])
elif isinstance(obj, dict):
# XXX: not exactly nice, neither useful, I fear.
return joiner.join([u'%s::%s' %
(object2txt(k, _limitRecursion=_limitRecursion),
object2txt(v, _limitRecursion=_limitRecursion))
for k, v in obj.items()])
objData = {}
if isinstance(obj, Movie):
objData['movieID'] = obj.movieID
......@@ -113,7 +118,7 @@ def makeObject2Txt(movieTxt=None, personTxt=None, characterTxt=None,
if proceed:
return matchobj.group(2)
else:
return ''
return u''
return matchobj.group(2)
while re_conditional.search(outs):
outs = re_conditional.sub(_excludeFalseConditionals, outs)
......@@ -131,7 +136,7 @@ def makeObject2Txt(movieTxt=None, personTxt=None, characterTxt=None,
value = u''
elif not isinstance(value, (unicode, str)):
value = unicode(value)
outs = outs.replace('%(' + key + ')s', value)
outs = outs.replace(u'%(' + key + u')s', value)
return outs
return object2txt
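The recursion added for dictionaries, together with the depth limit, can be sketched on plain Python values. This is an illustration only: to_text is a hypothetical stand-in for object2txt, it ignores Movie, Person and Character objects and the template handling, and it targets Python 3.

```python
def to_text(obj, joiner=u', ', _depth=0):
    """Flatten nested lists, tuples and dicts into one string.

    Dictionaries are rendered as key::value pairs, as in the hunk above;
    recursion stops (returning an empty string) past five levels.
    """
    if _depth > 5:
        return u''
    _depth += 1
    if isinstance(obj, (list, tuple)):
        return joiner.join([to_text(o, joiner, _depth) for o in obj])
    if isinstance(obj, dict):
        return joiner.join([u'%s::%s' % (to_text(k, joiner, _depth),
                                         to_text(v, joiner, _depth))
                            for k, v in obj.items()])
    return u'%s' % (obj,)


flat = to_text({'a': [1, 2]})
mixed = to_text(['x', {'k': 'v'}])
too_deep = to_text([[[[[[['x']]]]]]])  # seven nested lists
```

The depth cap is what keeps a self-referencing structure from recursing forever; the price is that very deep data is silently dropped.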
......
......@@ -20,7 +20,7 @@
* - pysoundex():
* Return a soundex code string, for the given string.
*
* Copyright 2004-2007 Davide Alberani <da@erlug.linux.it>
* Copyright 2004-2008 Davide Alberani <da@erlug.linux.it>
* Released under the GPL license.
*
* NOTE: The Ratcliff-Obershelp part was heavily based on code from the
......@@ -67,8 +67,8 @@
#define COMPARE 2.0
#define STRING_MAXLENDIFFER 0.7
/* As of 26 Mar 2006, the longest title is 280 chars. */
#define MXLINELEN 512
/* As of 05 Mar 2008, the longest title is ~600 chars. */
#define MXLINELEN 1023
#define FSEP '|'
#define RO_THRESHOLD 0.6
......@@ -78,20 +78,20 @@
/* List of articles.
XXX: see comments about articles in the imdb.utils module. */
#define ART_COUNT 45
#define ART_COUNT 46
char *articles[ART_COUNT] = {"the ", "la ", "a ", "die ", "der ", "le ", "el ",
"l'", "il ", "das ", "les ", "o ", "ein ", "i ", "un ", "los ", "de ",
"an ", "una ", "las ", "eine ", "den ", "gli ", "het ","os ", "lo ",
"az ", "det ","ha-", "een ", "ang ", "oi ", "ta ", "al-", "dem ",
"mga ", "uno ", "un'", "ett ", " ", "eines ", " ","els ", " ",
" "};
"l'", "il ", "das ", "les ", "i ", "o ", "ein ", "un ", "de ", "los ",
"an ", "una ", "las ", "eine ", "den ", "het ", "gli ", "lo ", "os ",
"ang ", "oi ", "az ", "een ", "ha-", "det ", "ta ", "al-",
"mga ", "un'", "uno ", "ett ", "dem ", "egy ", "els ", "eines ", " ",
" ", " ", " "};
char *articlesNoSP[ART_COUNT] = {"the", "la", "a", "die", "der", "le", "el",
"l'", "il", "das", "les", "o", "ein", "i", "un", "los", "de",
"an", "una", "las", "eine", "den", "gli", "het", "os", "lo",
"az", "det","ha-", "een", "ang", "oi", "ta", "al-", "dem",
"mga", "uno", "un'", "ett", "", "eines", "", "els", "",
""};
"l'", "il", "das", "les", "i", "o", "ein", "un", "de", "los",
"an", "una", "las", "eine", "den", "het", "gli", "lo", "os",
"ang", "oi", "az", "een", "ha-", "det", "ta", "al-",
"mga", "un'", "uno", "ett", "dem", "egy", "els", "eines", "",
"", "", ""};
//*****************************************
......@@ -202,8 +202,8 @@ pyratcliff(PyObject *self, PyObject *pArgs)
char *s1 = NULL;
char *s2 = NULL;
PyObject *discard = NULL;
char s1copy[MXLINELEN];
char s2copy[MXLINELEN];
char s1copy[MXLINELEN+1];
char s2copy[MXLINELEN+1];
/* The optional PyObject parameter is here to be compatible
* with the pure python implementation, which uses a
......@@ -211,8 +211,8 @@ pyratcliff(PyObject *self, PyObject *pArgs)
if (!PyArg_ParseTuple(pArgs, "ss|O", &s1, &s2, &discard))
return NULL;
str<