File: README.sqldb

NOTE: this release (2.6) contains huge performance improvements;
while still using the SQLObject ORM, the imdbpy2sql.py script can
complete in just a couple of hours, instead of the 6 or more hours
required by the previous version.

If you want to help, please subscribe to the imdbpy-devel mailing list at:
  http://imdbpy.sf.net/?page=help#ml
 

  SQL
  ===

Since version 2.1 it's possible to transfer the whole IMDb
database from the plain text data files into a SQL database.
Starting with version 2.5, every database supported by the SQLObject
Object Relational Mapper can be used to store and retrieve
movie and person information.
This means that MySQL, PostgreSQL, SQLite, Firebird, MAX DB,
Sybase and MSSQL are supported and, as you read this text,
other database backends may have been added.

You need the SQLObject package, at least version 0.8; even better,
download the latest SVN snapshot.

SQLObject home page: http://sqlobject.org/
SVN command to download the latest development version:
  svn co http://svn.colorstudy.com/SQLObject/trunk SQLObject
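
To quickly check that the package is importable, a couple of lines of
Python will do (a minimal sketch; the 'version' attribute is an
assumption and may not be present in every SQLObject release):
  try:
      import sqlobject
      print sqlobject.version  # assumed attribute; may differ
  except ImportError:
      print 'SQLObject is not installed'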


  SQL DATABASE INSTALLATION
  =========================

Select a mirror of "The Plain Text Data Files" from
the http://www.imdb.com/interfaces.html page and download
every file in the main directory (beware: the "diffs"
subdirectory contains _a lot_ of files you _don't_ need,
so don't start mirroring everything!).

Starting from release 2.4, you can download just the files you need,
instead of every single file; the files not downloaded will be skipped.
This feature is still quite untested, so please report any bugs.

Create a database named "imdb" (or whatever you like),
using the tool provided by your database server; as an example, for
MySQL you would use the 'mysqladmin' command:
  # mysqladmin -p create imdb
For PostgreSQL, use the "createdb" command:
  # createdb -W imdb
For SQLite no creation step is needed: the database file is created
automatically the first time it's opened.

To create the tables and to populate the database, you must run
the imdbpy2sql.py script:
  # imdbpy2sql.py -d /dir/with/plainTextDataFiles/ -u 'URI'

Where the 'URI' argument is a string representing the connection
to your database, in the form:
  scheme://[user[:password]@]host[:port]/database[?parameters]

Where 'scheme' is one of "sqlite", "mysql", "postgres", "firebird",
"interbase", "maxdb", "sapdb", "mssql" or "sybase".

Some examples:
    mysql://user:password@host/database
    postgres://user:password@host/database
    mysql://host/database?debug=1
    postgres:///full/path/to/socket/database
    postgres://host:5432/database
    sqlite:///full/path/to/database
    sqlite:/C|/full/path/to/database
    sqlite:/:memory:
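
Putting it all together, a complete invocation could look like this
(the path and the URI below are just illustrative):
  # imdbpy2sql.py -d /home/user/imdbFiles/ -u 'mysql://user:password@localhost/imdb'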

For more information, see the SQLObject documentation.
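
If you want to check in advance that SQLObject accepts your URI, you
can test it from Python (a minimal sketch; connectionForURI is part of
the SQLObject API, while the URI itself is just an example):
  from sqlobject import connectionForURI
  conn = connectionForURI('mysql://user:password@localhost/imdb')
  conn.getConnection()  # raises an exception if the database is unreachable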


  TIMING
  ======

Performance depends heavily upon the underlying Python
module/package used to access the database.
The fastest database appears to be MySQL, which takes about 95 minutes
to complete on my test system (see below).
A lot of memory (RAM or swap space) is required: at least
150/200 megabytes (plus more for the database server).
In the end, the database will require between 1.5GB and 3GB of disk space.

As said, performance varies greatly from one database server to
another: MySQL, for instance, has an executemany() method on its
cursor objects that accepts multiple rows of data in a single SQL
statement; other databases require a call to the execute() method for
every single row of data, and they will be much slower - from 2 to 7
times slower than MySQL.
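
To illustrate the difference (a sketch with made-up table and column
names, assuming 'cursor' is an already-open DB API cursor):
  data = [(1, 'first movie'), (2, 'second movie')]
  # MySQLdb can rewrite this into a single multi-row INSERT:
  cursor.executemany('INSERT INTO titles (id, title) VALUES (%s, %s)', data)
  # drivers without an optimized executemany() fall back to one
  # execute() call, and one round-trip, per row:
  for row in data:
      cursor.execute('INSERT INTO titles (id, title) VALUES (%s, %s)', row)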

I've done some tests, using an AMD Athlon 1800+ with 512MB of RAM:

      database         |  time in minutes: total (insert data/create indexes)
 ----------------------+-----------------------------------------------------
   MySQL 5.0 MyISAM    |  95 (75/20)
   MySQL 5.0 InnoDB    |  ??? (80/???)
                       |  maybe I've not configured it properly: it
                       |  looks like the creation of the indexes alone
                       |  will take more than 2 or 3 hours.
   PostgreSQL 8.1      |  190 (177/13)
   SQLite 3.2          |  not tested: it seems way too slow, maybe 35 _hours_
                       |  to complete; maybe I've misconfigured it or I'm
                       |  misusing it.

If you have different experiences, please tell me!


  NOTE
  ====

The imdbpy2sql.py script will print a lot of debug information on standard
output; you can save it to a file by appending (without quotes)
"2>&1 | tee output.txt" to the command line.
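For example:
  # imdbpy2sql.py -d /dir/with/plainTextDataFiles/ -u 'URI' 2>&1 | tee output.txt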


  SQLITE NOTE
  ===========

It seems that, with older versions of the python-sqlite package, the first
run may fail; if you get a DatabaseError exception saying "no such table",
try running the command again with the same arguments.


  SQL USAGE
  =========

Now you can use IMDbPY with the database:
  from imdb import IMDb
  # 'sql' selects the SQL data access system; pass the same URI
  # used to populate the database
  i = IMDb('sql', uri='YOUR_URI_STRING')
  resList = i.search_movie('the incredibles')
  for x in resList: print x
  ti = resList[0]
  # fetch the complete set of information about the first result
  i.update(ti)
  print ti['director'][0]

and so on...
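
Searching for people works in the same way (a short sketch; the name
below is just an example):
  pList = i.search_person('mel gibson')
  p = pList[0]
  i.update(p)
  print p['name']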