DEVELOPMENT OF IMDbPY
=====================
A lot of other information useful to IMDbPY developers is available
in the "README.package" file.
Sections in this file:
* STRUCTURE OF THE IMDbPY PACKAGE
* GENERIC DESCRIPTION
* HOW TO EXTEND
STRUCTURE OF THE IMDbPY PACKAGE
===============================
  imdb (package)
   |
   +-> _exceptions
   +-> Movie
   +-> Person
   +-> utils
   +-> helpers
   +-> parser (package)
        |
        +-> http (package)
        |    |
        |    +-> movieParser
        |    +-> personParser
        |    +-> searchMovieParser
        |    +-> searchPersonParser
        |    +-> utils
        |
        +-> local (package)
        |    |
        |    +-> movieParser
        |    +-> personParser
        |    +-> utils
        |
        +-> mobile (package)
        |
        +-> sql (package)
        |    |
        |    +-> dbschema
        |
        +-> common (package)
             |
             +-> cutils (C module)
             +-> locsql
Description:
imdb (package): contains the IMDb function and the IMDbBase class, and
                imports the IMDbError exception class.

_exceptions: defines the exceptions used internally.

Movie: contains the Movie class, used to describe and manage a movie.

Person: contains the Person class, used to describe and manage a person.

utils: miscellaneous utilities used by many IMDbPY modules.

parser (package): a package containing one sub-package for every data
                  access system implemented.

http (package): contains the IMDbHTTPAccessSystem class, which is a subclass
                of the imdb.IMDbBase class; it provides the methods used to
                retrieve and manage data from the web server (using,
                in turn, the other modules in the package).
                It defines methods to get a movie and to search for a title.

http.movieParser: parses HTML strings from the pages on the IMDb web server
                  about a movie; returns dictionaries of {key: value}.

http.personParser: parses HTML strings from the pages on the IMDb web server
                   about a person; returns dictionaries.

http.searchMovieParser: parses an HTML string, the result of a query for a
                        movie title.

http.searchPersonParser: parses an HTML string, the result of a query for a
                         person name.

http.utils: miscellaneous utilities used only by the http package.
The modules under the parser.local package mirror those of the
parser.http package (the search functions are placed directly in the
IMDbLocalAccessSystem class); they work on a local installation
of the database.

The parser.sql package manages access to the data in the SQL
database created with the imdbpy2sql.py script; see the README.sqldb file.
The dbschema module contains the table definitions and some useful functions.

The class in the parser.mobile package is a subclass of the one found
in parser.http, with some methods overridden to be many times faster (from
2 to 20 times); it's useful for systems with limited bandwidth and not
much CPU power.

The parser.common package contains code shared by different packages;
so far it holds only the locsql and cutils modules, both used by the
"local" and "sql" data access systems.

The cutils module is a C module containing C functions that speed up the
IMDbPY package; for every function defined there, a (slower!) pure Python
fall back is provided where it's needed.
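
The fall-back arrangement described above can be sketched as follows.
This is only an illustration, not the actual IMDbPY code: the module name
hypothetical_cutils and the function name ratcliff are assumptions, and the
pure Python version relies on difflib.SequenceMatcher from the standard
library, whose ratio is closely related to (but not guaranteed to be
identical with) a C implementation of string similarity.

```python
import difflib

try:
    # Hypothetical compiled module: it stands in for a C extension and is
    # not expected to exist, so the except branch normally runs.
    from hypothetical_cutils import ratcliff
    USING_C_VERSION = True
except ImportError:
    USING_C_VERSION = False

    def ratcliff(s1, s2):
        """Slower pure Python fall back: similarity between 0.0 and 1.0."""
        return difflib.SequenceMatcher(None, s1, s2).ratio()


print(USING_C_VERSION)
print(ratcliff("the matrix", "the matrix"))
print(ratcliff("the matrix", "matrix, the"))
```

Callers only ever import the name ratcliff, so they do not need to know
which of the two implementations they got.
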
So far, the cutils module includes functions used by the "local" data
access system to search the names.key and titles.key files, a function
that returns the list of episode titles of a TV series given its movieID
(by scanning titles.key), a function that calculates the
Ratcliff-Obershelp similarity of two Python strings, and a function
that returns the soundex code of a given Python string.
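
To give an idea of what a soundex code is, here is a minimal pure Python
sketch of the standard American Soundex algorithm. It is an assumption that
this matches the general idea only: the variant implemented in cutils may
differ in details such as the length of the code, and the names used here
(soundex, _CODES, length) are chosen for the example.

```python
# Map every coded consonant to its Soundex digit.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name, length=4):
    """Return the American Soundex code of name ('' if it has no letters)."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = [letters[0]]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue            # H and W do not separate equal codes
        code = _CODES.get(ch, "")
        if code and code != prev:
            result.append(code)
        prev = code             # a vowel resets prev to ""
    # Truncate or pad with zeros to the requested length.
    return "".join(result)[:length].ljust(length, "0")


print(soundex("Robert"))    # R163
print(soundex("Rupert"))    # R163
print(soundex("Gibson"))    # G125
```

Names that sound alike get the same code, which is what makes the code
useful when searching for a misspelled name.
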
The helpers module contains functions and other goodies that are not
directly used by the IMDbPY package, but that can be useful when developing
IMDbPY-based programs.
GENERIC DESCRIPTION
===================
I wanted to stay independent from the source of the data for a given
movie/person, and so the imdb.IMDb function returns an instance of a class
that provides specific methods to access a given data source (web server,
local installation, SQL database, etc.).

Unfortunately that means that the movieID in the Movie class and the
personID in the Person class depend on the data access system
used. So, when a Movie or Person object is instantiated, the accessSystem
instance variable is set to a string that identifies the data access
system in use.
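
A minimal sketch of why the accessSystem string matters: two objects carry
comparable IDs only if they come from the same data access system. The
MovieStub class and the same_movie helper are hypothetical and exist only
for this example; they are not part of IMDbPY.

```python
class MovieStub:
    """Hypothetical stand-in for a Movie object (not the real class)."""

    def __init__(self, movieID, accessSystem):
        self.movieID = movieID
        self.accessSystem = accessSystem


def same_movie(a, b):
    """IDs can be compared only within one data access system."""
    return a.accessSystem == b.accessSystem and a.movieID == b.movieID


from_web = MovieStub("0094226", "http")
also_from_web = MovieStub("0094226", "http")
from_db = MovieStub("0094226", "sql")

print(same_movie(from_web, also_from_web))
print(same_movie(from_web, from_db))
```
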
HOW TO EXTEND
=============
To introduce a new data access system, you have to write a new package
inside the "parser" package; this new package must provide a subclass
of the imdb.IMDbBase class which must define at least the following methods:
  _search_movie(title) - to search for a given title; must return a
                         list of (movieID, {movieData}) tuples.
  _search_person(name) - to search for a given name; must return a
                         list of (personID, {personData}) tuples.
  get_movie_*(movieID) - a set of methods, one for every set of information
                         defined for a Movie object; should return
                         a dictionary with the relevant information.
                         This dictionary can contain some optional keys:
                           'data': must be a dictionary with the movie info.
                           'titlesRefs': a dictionary of
                                         'movie title': movieObj pairs.
                           'namesRefs': a dictionary of
                                        'person name': personObj pairs.
  get_person_*(personID) - a set of methods, one for every set of information
                           defined for a Person object; should return
                           a dictionary with the relevant information.
  get_imdbMovieID(movieID) - must convert the given movieID to a string
                             representing the imdbID, as used by the IMDb web
                             server (e.g.: '0094226' for Brian De Palma's
                             "The Untouchables").
  get_imdbPersonID(personID) - must convert the given personID to a string
                               representing the imdbID, as used by the IMDb
                               web server (e.g.: '0000154' for "Mel Gibson").
  _normalize_movieID(movieID) - must convert the provided movieID into a
                                format suitable for internal use (e.g.:
                                convert a string to a long int).
                                NOTE: as a rule of thumb you _always_ need
                                to provide a way to convert a "string
                                representation of the movieID" into the
                                internally used format, and it must _always_
                                be possible to convert the internally used
                                format to a string, one way or another.
                                Rationale: a movieID can be passed from the
                                command line, or from a web browser.
  _normalize_personID(personID) - idem, for a personID.
  _get_real_movieID(movieID) - return the true movieID; useful to handle
                               title aliases.
  _get_real_personID(personID) - idem, for a personID.
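
The list above can be sketched as a toy data access system that keeps its
data in two dictionaries. Several things here are assumptions made so that
the example runs on its own (Python 3): IMDbBaseStandIn is a hypothetical
replacement for imdb.IMDbBase, "main" is used as the name of the only
information set, the internal IDs are plain integers that happen to match
the numeric part of the imdbID, and the release year is illustrative data.

```python
class IMDbBaseStandIn:
    """Hypothetical stand-in for imdb.IMDbBase, so the sketch runs alone."""

    def search_movie(self, title):
        # The public method delegates to the hook defined by the subclass.
        return self._search_movie(title)


class InMemoryAccessSystem(IMDbBaseStandIn):
    """Toy data access system backed by two dictionaries."""

    accessSystem = "memory"
    _MOVIES = {94226: {"title": "The Untouchables", "year": 1987}}
    _PERSONS = {154: {"name": "Mel Gibson"}}

    def _normalize_movieID(self, movieID):
        return int(movieID)      # string (command line, browser) -> int

    def _normalize_personID(self, personID):
        return int(personID)

    def _get_real_movieID(self, movieID):
        return movieID           # this toy data set has no title aliases

    def _get_real_personID(self, personID):
        return personID

    def get_imdbMovieID(self, movieID):
        return "%07d" % movieID  # internal int -> zero-padded string

    def get_imdbPersonID(self, personID):
        return "%07d" % personID

    def _search_movie(self, title):
        wanted = title.lower()
        return [(movieID, dict(data))
                for movieID, data in self._MOVIES.items()
                if wanted in data["title"].lower()]

    def _search_person(self, name):
        wanted = name.lower()
        return [(personID, dict(data))
                for personID, data in self._PERSONS.items()
                if wanted in data["name"].lower()]

    def get_movie_main(self, movieID):
        return {"data": dict(self._MOVIES[movieID])}

    def get_person_main(self, personID):
        return {"data": dict(self._PERSONS[personID])}


system = InMemoryAccessSystem()
print(system.search_movie("untouchables"))
movie_id = system._normalize_movieID("0094226")
print(system.get_imdbMovieID(movie_id))
print(system.get_movie_main(movie_id))
```

Note that a search with no matches returns an empty list, and that the
string form of an ID survives the round trip through the internal format.
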
The class should raise the appropriate exceptions when needed:
IMDbDataAccessError must be raised when you cannot access the resource
you need to retrieve movie info or you're unable to run a query (this is
_not_ the case when a query returns zero matches: in this situation an
empty list must be returned); IMDbParserError should be raised when an
error occurs while parsing some data.
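
The three outcomes described above (resource not accessible, malformed data,
zero matches) can be sketched as follows. The exception classes are
stand-ins that reuse the names from the text so that the example runs on its
own; their inheritance from a common IMDbError is an assumption, and the
search_titles function with its "movieID|title" file format is purely
hypothetical.

```python
import os
import tempfile


class IMDbError(Exception):
    """Stand-in base exception for this sketch."""


class IMDbDataAccessError(IMDbError):
    """Stand-in: the data source cannot be accessed or queried."""


class IMDbParserError(IMDbError):
    """Stand-in: the retrieved data cannot be parsed."""


def search_titles(path, title):
    """Hypothetical query over a text file of 'movieID|title' lines."""
    try:
        with open(path, encoding="utf-8") as handle:
            lines = handle.read().splitlines()
    except OSError as exc:
        raise IMDbDataAccessError("cannot read %s: %s" % (path, exc))
    results = []
    for line in lines:
        if not line.strip():
            continue
        try:
            movie_id, movie_title = line.split("|", 1)
            movie_id = int(movie_id)
        except ValueError:
            raise IMDbParserError("malformed line: %r" % line)
        if title.lower() in movie_title.lower():
            results.append((movie_id, {"title": movie_title}))
    return results   # an empty list when nothing matches, never an error


workdir = tempfile.mkdtemp()
good_file = os.path.join(workdir, "titles.txt")
with open(good_file, "w", encoding="utf-8") as handle:
    handle.write("94226|The Untouchables\n")

print(search_titles(good_file, "untouchables"))
print(search_titles(good_file, "the matrix"))
try:
    search_titles(os.path.join(workdir, "missing.txt"), "the matrix")
except IMDbDataAccessError as exc:
    print("data access error:", exc)
```
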
Now you have to modify the imdb.IMDb function so that, when the right
data access system is selected with the "accessSystem" parameter, an
instance of your newly created class is returned.
NOTE: this is a somewhat misleading example: we already have a
data access system for SQL databases (it's called 'sql' and it also
supports MySQL, amongst others). Maybe I'll find a better example...
E.g.: if you want to call your new data access system "mysql" (meaning
that the data are stored in a MySQL database), you have to add to the
imdb.IMDb function something like:
    if accessSystem == 'mysql':
        from parser.mysql import IMDbMysqlAccessSystem
        return IMDbMysqlAccessSystem(*arguments, **keywords)
where "parser.mysql" is the package you've created to access the
local installation, and "IMDbMysqlAccessSystem" is the subclass of
imdb.IMDbBase.
Then it's possible to use the new data access system like this:
    from imdb import IMDb
    i = IMDb(accessSystem='mysql')
    results = i.search_movie('the matrix')
    print(results)
A specific data access system implementation can define its own
methods.
As an example, the IMDbHTTPAccessSystem class in the parser.http package
defines the set_proxy() method to manage the use of a web proxy; you
can use it this way:
    from imdb import IMDb
    i = IMDb(accessSystem='http') # the 'accessSystem' argument is not
                                  # really needed, since "http" is the default.
    i.set_proxy('http://localhost:8080/')
A list of the special methods provided by the imdb.IMDbBase subclass, along
with their descriptions, is always available by calling the
get_special_methods() method of the object returned by IMDb.
E.g.:
    from imdb import IMDb
    i = IMDb(accessSystem='http')
    print(i.get_special_methods())

will print a dictionary with the format:
    {'method_name': 'method_description', ...}
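
One way such a dictionary could be assembled is by introspection, taking
the description of each method from its docstring. The sketch below
(Python 3) is an assumption about the general technique, not a copy of the
IMDbPY implementation; BaseStandIn and HTTPStandIn are hypothetical classes
that stand in for imdb.IMDbBase and one of its subclasses.

```python
import inspect


class BaseStandIn:
    """Hypothetical stand-in for the common base class."""

    def search_movie(self, title):
        """Search for a title."""
        return []

    def get_special_methods(self):
        """Map every public method added by a subclass to its docstring."""
        base_names = set(dir(BaseStandIn))
        special = {}
        for name, member in inspect.getmembers(type(self),
                                               inspect.isfunction):
            # Skip private helpers and everything the base class provides.
            if name.startswith("_") or name in base_names:
                continue
            special[name] = inspect.getdoc(member) or ""
        return special


class HTTPStandIn(BaseStandIn):
    """Hypothetical subclass that adds one special method."""

    def set_proxy(self, proxy):
        """Set the web proxy to use."""
        self.proxy = proxy

    def _private_helper(self):
        """Not listed, because the name starts with an underscore."""


print(HTTPStandIn().get_special_methods())
```
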