File: README.devel

  DEVELOPMENT OF IMDbPY
  =====================

A lot of other information useful to IMDbPY developers is available
in the "README.package" file.

Sections in this file:
* STRUCTURE OF THE IMDbPY PACKAGE
* GENERIC DESCRIPTION
* HOW TO EXTEND


  STRUCTURE OF THE IMDbPY PACKAGE
  ===============================

imdb (package)
 |
 +-> _exceptions
 +-> Movie
 +-> Person
 +-> utils
 +-> helpers
 +-> parser (package)
       |
       +-> http (package)
       |    |
       |    +-> movieParser
       |    +-> personParser
       |    +-> searchMovieParser
       |    +-> searchPersonParser
       |    +-> utils
       |
       +-> local (package)
       |    |
       |    +-> movieParser
       |    +-> personParser
       |    +-> utils
       |
       +-> mobile (package)
       |
       +-> sql (package)
       |    |
       |    +-> dbschema
       |
       +-> common (package)
            |
            +-> cutils (C module)
            +-> locsql


Description:
imdb (package): contains the IMDb function, the IMDbBase class and imports
                the IMDbError exception class.
_exceptions: defines the exceptions internally used.
Movie: contains the Movie class, used to describe and manage a movie.
Person: contains the Person class, used to describe and manage a person.
utils: miscellaneous utilities used by many IMDbPY modules.
parser (package): a package containing a package for every data access system
                  implemented.
http (package): contains the IMDbHTTPAccessSystem class which is a subclass
                of the imdb.IMDbBase class; it provides the methods used to
                retrieve and manage data from the web server (using,
                in turn, the other modules in the package).
                It defines methods to get a movie and to search for a title.
http.movieParser: parses HTML strings from the IMDb web server's pages about
                  a movie; returns dictionaries of {key: value}.
http.personParser: parses HTML strings from the IMDb web server's pages
                   about a person; returns dictionaries.
http.searchMovieParser: parses an HTML string, the result of a query for a
                        movie title.
http.searchPersonParser: parses an HTML string, the result of a query for a
                         person name.
http.utils: miscellaneous utilities used only by the http package.

The modules under the parser.local package mirror those of the
parser.http package, except that the search functions are placed
directly in the IMDbLocalAccessSystem class; they work on a local
installation of the database.

The parser.sql package manages access to the data in the SQL
database, created with the imdbpy2sql.py script; see the README.sqldb file.
The dbschema module contains table definitions and some useful functions.

The class in the parser.mobile package is a subclass of the one found
in parser.http, with some methods overridden to be considerably faster
(from 2 to 20 times); it's useful for systems with limited bandwidth and
not much CPU power.

The parser.common package contains code shared by different packages;
so far it holds only the locsql and cutils modules, both used by the
"local" and "sql" data access systems.
The cutils module is a C module containing C functions that speed up the
IMDbPY package; for every function defined there, a (slower!) pure Python
fallback is provided where it's needed.
So far the cutils module includes functions used by the "local" data
access system to search the names.key and titles.key files, a function
that returns the list of episode titles of a TV series given its movieID
(scanning titles.key), a function that calculates the Ratcliff-Obershelp
similarity of two Python strings, and a function that returns the
soundex code of a given Python string.
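
The "C module with a pure Python fallback" arrangement can be sketched
as below.  This is only an illustration, not the package's actual code:
the module name _hypothetical_cutils and the helper best_match() are
made up, and the fallback relies on difflib.SequenceMatcher from the
standard library, whose ratio() is based on (but is not identical to)
the Ratcliff-Obershelp algorithm.

```python
import difflib

try:
    # Hypothetical name for a compiled extension; it is not expected to
    # exist, so the pure Python fallback below is normally the one used.
    from _hypothetical_cutils import ratcliff
except ImportError:
    def ratcliff(s1, s2):
        """Return a similarity score between 0.0 and 1.0 (slower fallback)."""
        return difflib.SequenceMatcher(None, s1, s2).ratio()


def best_match(name, candidates):
    """Return the candidate most similar to name, or None if there is none."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: ratcliff(name.lower(), c.lower()))
```

Callers only ever see the name ratcliff, so they do not need to know
which of the two implementations was picked up at import time.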


The helpers module contains functions and other goodies that are not
directly used by the IMDbPY package, but that can be useful when
developing IMDbPY-based programs.


  GENERIC DESCRIPTION
  ===================

I wanted to stay independent from the source of the data for a given
movie/person, and so the imdb.IMDb function returns an instance of a class
that provides specific methods to access a given data source (web server,
local installation, SQL database, etc.)

Unfortunately that means that the movieID in the Movie class and the
personID in the Person class depend on the data access system used.
So, when a Movie or Person object is instantiated, the accessSystem
instance variable is set to a string that identifies the data access
system in use.
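
Since an ID is only meaningful within one data access system, code that
compares objects should look at accessSystem as well.  A minimal sketch,
using a stand-in class: MovieStub and same_movie() are hypothetical
names, not part of IMDbPY, and the ID given for the "local" system is
invented.

```python
class MovieStub(object):
    """Hypothetical stand-in holding only the two attributes discussed."""

    def __init__(self, movieID, accessSystem):
        self.movieID = movieID
        self.accessSystem = accessSystem


def same_movie(m1, m2):
    """IDs are comparable only when both objects share a data access system."""
    return (m1.accessSystem == m2.accessSystem
            and m1.movieID == m2.movieID)


from_web = MovieStub('0094226', 'http')
from_web_again = MovieStub('0094226', 'http')
# Invented value: an equal ID in another system need not be the same movie.
from_local = MovieStub('0094226', 'local')
```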


  HOW TO EXTEND
  =============

To introduce a new data access system, you have to write a new package
inside the "parser" package; this new package must provide a subclass
of the imdb.IMDbBase class which must define at least the following methods:
 _search_movie(title)  - to search for a given title; must return a
                         list of (movieID, {movieData}) tuples.
 _search_person(name)  - to search for a given name; must return a
                         list of (personID, {personData}) tuples.
 get_movie_*(movieID)  - a set of methods, one for every set of information
                         defined for a Movie object; should return
                         a dictionary with the relative information.
                         This dictionary can contain some optional keys:
                         'data': must be a dictionary with the movie info.
                         'titlesRefs': a dictionary of 'movie title': movieObj
                                       pairs.
                         'namesRefs': a dictionary of 'person name': personObj
                                      pairs.
 get_person_*(personID) - a set of methods, one for every set of information
                          defined for a Person object; should return
                          a dictionary with the relative information.
 get_imdbMovieID(movieID) - must convert the given movieID to a string
                            representing the imdbID, as used by the IMDb web
                            server (e.g.: '0094226' for Brian De Palma's
                            "The Untouchables").
 get_imdbPersonID(personID) - must convert the given personID to a string
                              representing the imdbID, as used by the IMDb web
                              server (e.g.: '0000154' for "Mel Gibson").
 _normalize_movieID(movieID) - must convert the provided movieID into a
                               format suitable for internal use (e.g.:
                               convert a string to a long int).
                               NOTE: as a rule of thumb you _always_ need
                               to provide a way to convert a "string
                               representation of the movieID" into the
                               internally used format, and the internally
                               used format should _always_ be convertible
                               to a string, one way or another.
                               Rationale: a movieID can be passed from the
                               command line, or from a web browser.
 _normalize_personID(personID) - idem.
 _get_real_movieID(movieID) - return the true movieID; useful to handle
                              title aliases.
 _get_real_personID(personID) - idem.
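
The method list above can be sketched as a skeleton class.  This is a
toy under stated assumptions, not a real data access system: the
IMDbBase class here is an empty stand-in for imdb.IMDbBase, the class
name IMDbMemoryAccessSystem and its in-memory sample data are made up,
and "main" is only an assumed name for one set of information.

```python
class IMDbBase(object):
    """Empty stand-in for imdb.IMDbBase, so that the sketch runs alone."""


class IMDbMemoryAccessSystem(IMDbBase):
    """Hypothetical data access system backed by in-memory dictionaries."""

    accessSystem = 'memory'

    # Internal IDs are integers here; the data is a tiny invented sample.
    _movies = {94226: {'title': 'The Untouchables'}}
    _persons = {154: {'name': 'Mel Gibson'}}

    def _normalize_movieID(self, movieID):
        # Accept the string form (command line, web browser) or an int.
        return int(movieID)

    def _normalize_personID(self, personID):
        return int(personID)

    def _get_real_movieID(self, movieID):
        # No title aliases in this sketch: every ID is already the real one.
        return movieID

    def _get_real_personID(self, personID):
        return personID

    def _search_movie(self, title):
        # Returns a list of (movieID, {movieData}) tuples, possibly empty.
        wanted = title.lower()
        return [(mid, dict(data))
                for mid, data in sorted(self._movies.items())
                if wanted in data['title'].lower()]

    def _search_person(self, name):
        # Returns a list of (personID, {personData}) tuples, possibly empty.
        wanted = name.lower()
        return [(pid, dict(data))
                for pid, data in sorted(self._persons.items())
                if wanted in data['name'].lower()]

    def get_movie_main(self, movieID):
        # Only the optional 'data' key is provided in this sketch.
        return {'data': dict(self._movies[movieID])}

    def get_person_main(self, personID):
        return {'data': dict(self._persons[personID])}

    def get_imdbMovieID(self, movieID):
        # Here the internal ID is simply the imdbID without zero padding.
        return '%07d' % movieID

    def get_imdbPersonID(self, personID):
        return '%07d' % personID
```

A real implementation would replace the two dictionaries with queries
to its own data source and would provide one get_movie_*/get_person_*
method for each set of information it supports.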

The class should raise the appropriate exceptions, when needed;
IMDbDataAccessError must be raised when you cannot access the resource
you need to retrieve movie info or you're unable to do a query (this is
_not_ the case when a query returns zero matches: in this situation an
empty list must be returned); IMDbParserError should be raised when an
error occurred parsing some data.
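
The exception rules can be illustrated with a small sketch.  The
exception classes below are stand-ins for the ones the package defines,
and both search_titles() and the "movieID|title" line format are
invented for the example.

```python
class IMDbError(Exception):
    """Stand-in for the package's base exception."""


class IMDbDataAccessError(IMDbError):
    """Stand-in: the resource cannot be accessed or queried."""


class IMDbParserError(IMDbError):
    """Stand-in: the retrieved data cannot be parsed."""


def search_titles(fetch, title):
    """Search using fetch, a hypothetical callable that returns a list of
    raw 'movieID|title' lines for the given title."""
    try:
        raw_lines = fetch(title)
    except (IOError, OSError) as e:
        # The data source is unreachable: this is a data access error.
        raise IMDbDataAccessError('unable to query the data source: %s' % e)
    results = []
    for line in raw_lines:
        try:
            movieID, found = line.split('|', 1)
        except ValueError:
            # The line has no separator: this is a parser error.
            raise IMDbParserError('malformed line: %r' % line)
        results.append((movieID, {'title': found}))
    # Zero matches is not an error: an empty list is returned.
    return results
```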

Now you have to modify the imdb.IMDb function so that, when the right
data access system is selected with the "accessSystem" parameter, an
instance of your newly created class is returned.

NOTE: the following is a somewhat misleading example: we already have a
data access system for SQL databases (it's called 'sql' and it also
supports MySQL, amongst others).  Maybe I'll find a better example...
E.g.: if you want to call your new data access system "mysql" (meaning
that the data are stored in a MySQL database), you have to add to the
imdb.IMDb function something like:
  if accessSystem == 'mysql':
      from parser.mysql import IMDbMysqlAccessSystem
      return IMDbMysqlAccessSystem(*arguments, **keywords)

where "parser.mysql" is the package you've created to access the
MySQL database, and "IMDbMysqlAccessSystem" is the subclass of
imdb.IMDbBase.
Then it's possible to use the new data access system like:
  from imdb import IMDb
  i = IMDb(accessSystem='mysql')
  results = i.search_movie('the matrix')
  print(results)
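
The dispatch performed by the imdb.IMDb function can also be pictured
as a lookup from the accessSystem string to a class.  The sketch below
is self-contained and uses stand-in classes; the _ACCESS_SYSTEMS table
is a hypothetical alternative to a chain of "if" tests, and raising
IMDbError for an unknown name is an assumption.

```python
class IMDbError(Exception):
    """Stand-in for the package's base exception."""


class IMDbBase(object):
    """Stand-in for imdb.IMDbBase; it only stores its arguments."""

    def __init__(self, *arguments, **keywords):
        self.arguments = arguments
        self.keywords = keywords


class IMDbHTTPAccessSystem(IMDbBase):
    accessSystem = 'http'


class IMDbMysqlAccessSystem(IMDbBase):
    accessSystem = 'mysql'


# Hypothetical registry: one entry for every data access system.
_ACCESS_SYSTEMS = {
    'http': IMDbHTTPAccessSystem,
    'mysql': IMDbMysqlAccessSystem,
}


def IMDb(accessSystem='http', *arguments, **keywords):
    """Return an instance of the class registered for accessSystem."""
    try:
        cls = _ACCESS_SYSTEMS[accessSystem]
    except KeyError:
        raise IMDbError('unknown data access system: %r' % accessSystem)
    return cls(*arguments, **keywords)
```

With a table like this, adding a data access system means adding one
entry instead of another branch.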

A specific data access system implementation can define its own
methods.
As an example, the IMDbHTTPAccessSystem class in the parser.http package
defines the method set_proxy() to manage the use of a web proxy; you
can use it this way:
      from imdb import IMDb
      i = IMDb(accessSystem='http') # the 'accessSystem' argument is not
                              # really needed, since "http" is the default.
      i.set_proxy('http://localhost:8080/')

A list of the special methods provided by the imdb.IMDbBase subclass in
use, along with their descriptions, is always available by calling the
get_special_methods() method of the instance returned by IMDb.
E.g.:
     from imdb import IMDb
     i = IMDb(accessSystem='http')
     print(i.get_special_methods())

will print a dictionary with the format:
  {'method_name': 'method_description', ...}
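
One way such a dictionary could be built is by introspection: collect
the public methods that the subclass adds to the base class, and use
their docstrings as descriptions.  The sketch below shows that idea
with stand-in classes; it is an assumption about a possible approach,
not a copy of the package's implementation.

```python
import inspect


class IMDbBase(object):
    """Stand-in for imdb.IMDbBase."""

    def search_movie(self, title):
        """Search for a title."""
        return []

    def get_special_methods(self):
        """Return {'method_name': 'method_description'} for the public
        methods that the subclass defines in addition to the base ones."""
        base_names = set(dir(IMDbBase))
        special = {}
        for name in dir(self.__class__):
            if name.startswith('_') or name in base_names:
                continue
            member = getattr(self.__class__, name)
            if inspect.isroutine(member):
                special[name] = inspect.getdoc(member) or ''
        return special


class IMDbHTTPAccessSystem(IMDbBase):
    """Stand-in subclass that adds one special method."""

    def set_proxy(self, proxy):
        """Set the web proxy to use."""
        self.proxy = proxy
```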