1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354
|
\chapter{Python compiler package \label{compiler}}
\sectionauthor{Jeremy Hylton}{jeremy@zope.com}
The Python compiler package is a tool for analyzing Python source code
and generating Python bytecode. The compiler contains libraries to
generate an abstract syntax tree from Python source code and to
generate Python bytecode from the tree.
The \refmodule{compiler} package is a Python source to bytecode
translator written in Python. It uses the built-in parser and
standard \refmodule{parser} module to generated a concrete syntax
tree. This tree is used to generate an abstract syntax tree (AST) and
then Python bytecode.
The full functionality of the package duplicates the builtin compiler
provided with the Python interpreter. It is intended to match its
behavior almost exactly. Why implement another compiler that does the
same thing? The package is useful for a variety of purposes. It can
be modified more easily than the builtin compiler. The AST it
generates is useful for analyzing Python source code.
This chapter explains how the various components of the
\refmodule{compiler} package work. It blends reference material with
a tutorial.
The following modules are part of the \refmodule{compiler} package:
\localmoduletable
\section{The basic interface}
\declaremodule{}{compiler}
The top-level of the package defines four functions. If you import
\module{compiler}, you will get these functions and a collection of
modules contained in the package.
\begin{funcdesc}{parse}{buf}
Returns an abstract syntax tree for the Python source code in \var{buf}.
The function raises SyntaxError if there is an error in the source
code. The return value is a \class{compiler.ast.Module} instance that
contains the tree.
\end{funcdesc}
\begin{funcdesc}{parseFile}{path}
Return an abstract syntax tree for the Python source code in the file
specified by \var{path}. It is equivalent to
\code{parse(open(\var{path}).read())}.
\end{funcdesc}
\begin{funcdesc}{walk}{ast, visitor\optional{, verbose}}
Do a pre-order walk over the abstract syntax tree \var{ast}. Call the
appropriate method on the \var{visitor} instance for each node
encountered.
\end{funcdesc}
\begin{funcdesc}{compile}{source, filename, mode, flags=None,
dont_inherit=None}
Compile the string \var{source}, a Python module, statement or
expression, into a code object that can be executed by the exec
statement or \function{eval()}. This function is a replacement for the
built-in \function{compile()} function.
The \var{filename} will be used for run-time error messages.
The \var{mode} must be 'exec' to compile a module, 'single' to compile a
single (interactive) statement, or 'eval' to compile an expression.
The \var{flags} and \var{dont_inherit} arguments affect future-related
statements, but are not supported yet.
\end{funcdesc}
\begin{funcdesc}{compileFile}{source}
Compiles the file \var{source} and generates a .pyc file.
\end{funcdesc}
The \module{compiler} package contains the following modules:
\refmodule[compiler.ast]{ast}, \module{consts}, \module{future},
\module{misc}, \module{pyassem}, \module{pycodegen}, \module{symbols},
\module{transformer}, and \refmodule[compiler.visitor]{visitor}.
\section{Limitations}
There are some problems with the error checking of the compiler
package. The interpreter detects syntax errors in two distinct
phases. One set of errors is detected by the interpreter's parser,
the other set by the compiler. The compiler package relies on the
interpreter's parser, so it get the first phases of error checking for
free. It implements the second phase itself, and that implementation is
incomplete. For example, the compiler package does not raise an error
if a name appears more than once in an argument list:
\code{def f(x, x): ...}
A future version of the compiler should fix these problems.
\section{Python Abstract Syntax}
The \module{compiler.ast} module defines an abstract syntax for
Python. In the abstract syntax tree, each node represents a syntactic
construct. The root of the tree is \class{Module} object.
The abstract syntax offers a higher level interface to parsed Python
source code. The \ulink{\module{parser}}
{http://www.python.org/doc/current/lib/module-parser.html}
module and the compiler written in C for the Python interpreter use a
concrete syntax tree. The concrete syntax is tied closely to the
grammar description used for the Python parser. Instead of a single
node for a construct, there are often several levels of nested nodes
that are introduced by Python's precedence rules.
The abstract syntax tree is created by the
\module{compiler.transformer} module. The transformer relies on the
builtin Python parser to generate a concrete syntax tree. It
generates an abstract syntax tree from the concrete tree.
The \module{transformer} module was created by Greg
Stein\index{Stein, Greg} and Bill Tutt\index{Tutt, Bill} for an
experimental Python-to-C compiler. The current version contains a
number of modifications and improvements, but the basic form of the
abstract syntax and of the transformer are due to Stein and Tutt.
\subsection{AST Nodes}
\declaremodule{}{compiler.ast}
The \module{compiler.ast} module is generated from a text file that
describes each node type and its elements. Each node type is
represented as a class that inherits from the abstract base class
\class{compiler.ast.Node} and defines a set of named attributes for
child nodes.
\begin{classdesc}{Node}{}
The \class{Node} instances are created automatically by the parser
generator. The recommended interface for specific \class{Node}
instances is to use the public attributes to access child nodes. A
public attribute may be bound to a single node or to a sequence of
nodes, depending on the \class{Node} type. For example, the
\member{bases} attribute of the \class{Class} node, is bound to a
list of base class nodes, and the \member{doc} attribute is bound to
a single node.
Each \class{Node} instance has a \member{lineno} attribute which may
be \code{None}. XXX Not sure what the rules are for which nodes
will have a useful lineno.
\end{classdesc}
All \class{Node} objects offer the following methods:
\begin{methoddesc}{getChildren}{}
Returns a flattened list of the child nodes and objects in the
order they occur. Specifically, the order of the nodes is the
order in which they appear in the Python grammar. Not all of the
children are \class{Node} instances. The names of functions and
classes, for example, are plain strings.
\end{methoddesc}
\begin{methoddesc}{getChildNodes}{}
Returns a flattened list of the child nodes in the order they
occur. This method is like \method{getChildren()}, except that it
only returns those children that are \class{Node} instances.
\end{methoddesc}
Two examples illustrate the general structure of \class{Node}
classes. The \keyword{while} statement is defined by the following
grammar production:
\begin{verbatim}
while_stmt: "while" expression ":" suite
["else" ":" suite]
\end{verbatim}
The \class{While} node has three attributes: \member{test},
\member{body}, and \member{else_}. (If the natural name for an
attribute is also a Python reserved word, it can't be used as an
attribute name. An underscore is appended to the word to make it a
legal identifier, hence \member{else_} instead of \keyword{else}.)
The \keyword{if} statement is more complicated because it can include
several tests.
\begin{verbatim}
if_stmt: 'if' test ':' suite ('elif' test ':' suite)* ['else' ':' suite]
\end{verbatim}
The \class{If} node only defines two attributes: \member{tests} and
\member{else_}. The \member{tests} attribute is a sequence of test
expression, consequent body pairs. There is one pair for each
\keyword{if}/\keyword{elif} clause. The first element of the pair is
the test expression. The second elements is a \class{Stmt} node that
contains the code to execute if the test is true.
The \method{getChildren()} method of \class{If} returns a flat list of
child nodes. If there are three \keyword{if}/\keyword{elif} clauses
and no \keyword{else} clause, then \method{getChildren()} will return
a list of six elements: the first test expression, the first
\class{Stmt}, the second text expression, etc.
The following table lists each of the \class{Node} subclasses defined
in \module{compiler.ast} and each of the public attributes available
on their instances. The values of most of the attributes are
themselves \class{Node} instances or sequences of instances. When the
value is something other than an instance, the type is noted in the
comment. The attributes are listed in the order in which they are
returned by \method{getChildren()} and \method{getChildNodes()}.
\input{asttable}
\subsection{Assignment nodes}
There is a collection of nodes used to represent assignments. Each
assignment statement in the source code becomes a single
\class{Assign} node in the AST. The \member{nodes} attribute is a
list that contains a node for each assignment target. This is
necessary because assignment can be chained, e.g. \code{a = b = 2}.
Each \class{Node} in the list will be one of the following classes:
\class{AssAttr}, \class{AssList}, \class{AssName}, or
\class{AssTuple}.
Each target assignment node will describe the kind of object being
assigned to: \class{AssName} for a simple name, e.g. \code{a = 1}.
\class{AssAttr} for an attribute assigned, e.g. \code{a.x = 1}.
\class{AssList} and \class{AssTuple} for list and tuple expansion
respectively, e.g. \code{a, b, c = a_tuple}.
The target assignment nodes also have a \member{flags} attribute that
indicates whether the node is being used for assignment or in a delete
statement. The \class{AssName} is also used to represent a delete
statement, e.g. \class{del x}.
When an expression contains several attribute references, an
assignment or delete statement will contain only one \class{AssAttr}
node -- for the final attribute reference. The other attribute
references will be represented as \class{Getattr} nodes in the
\member{expr} attribute of the \class{AssAttr} instance.
\subsection{Examples}
This section shows several simple examples of ASTs for Python source
code. The examples demonstrate how to use the \function{parse()}
function, what the repr of an AST looks like, and how to access
attributes of an AST node.
The first module defines a single function. Assume it is stored in
\file{/tmp/doublelib.py}.
\begin{verbatim}
"""This is an example module.
This is the docstring.
"""
def double(x):
"Return twice the argument"
return x * 2
\end{verbatim}
In the interactive interpreter session below, I have reformatted the
long AST reprs for readability. The AST reprs use unqualified class
names. If you want to create an instance from a repr, you must import
the class names from the \module{compiler.ast} module.
\begin{verbatim}
>>> import compiler
>>> mod = compiler.parseFile("/tmp/doublelib.py")
>>> mod
Module('This is an example module.\n\nThis is the docstring.\n',
Stmt([Function(None, 'double', ['x'], [], 0,
'Return twice the argument',
Stmt([Return(Mul((Name('x'), Const(2))))]))]))
>>> from compiler.ast import *
>>> Module('This is an example module.\n\nThis is the docstring.\n',
... Stmt([Function(None, 'double', ['x'], [], 0,
... 'Return twice the argument',
... Stmt([Return(Mul((Name('x'), Const(2))))]))]))
Module('This is an example module.\n\nThis is the docstring.\n',
Stmt([Function(None, 'double', ['x'], [], 0,
'Return twice the argument',
Stmt([Return(Mul((Name('x'), Const(2))))]))]))
>>> mod.doc
'This is an example module.\n\nThis is the docstring.\n'
>>> for node in mod.node.nodes:
... print node
...
Function(None, 'double', ['x'], [], 0, 'Return twice the argument',
Stmt([Return(Mul((Name('x'), Const(2))))]))
>>> func = mod.node.nodes[0]
>>> func.code
Stmt([Return(Mul((Name('x'), Const(2))))])
\end{verbatim}
\section{Using Visitors to Walk ASTs}
\declaremodule{}{compiler.visitor}
The visitor pattern is ... The \refmodule{compiler} package uses a
variant on the visitor pattern that takes advantage of Python's
introspection features to eliminate the need for much of the visitor's
infrastructure.
The classes being visited do not need to be programmed to accept
visitors. The visitor need only define visit methods for classes it
is specifically interested in; a default visit method can handle the
rest.
XXX The magic \method{visit()} method for visitors.
\begin{funcdesc}{walk}{tree, visitor\optional{, verbose}}
\end{funcdesc}
\begin{classdesc}{ASTVisitor}{}
The \class{ASTVisitor} is responsible for walking over the tree in the
correct order. A walk begins with a call to \method{preorder()}. For
each node, it checks the \var{visitor} argument to \method{preorder()}
for a method named `visitNodeType,' where NodeType is the name of the
node's class, e.g. for a \class{While} node a \method{visitWhile()}
would be called. If the method exists, it is called with the node as
its first argument.
The visitor method for a particular node type can control how child
nodes are visited during the walk. The \class{ASTVisitor} modifies
the visitor argument by adding a visit method to the visitor; this
method can be used to visit a particular child node. If no visitor is
found for a particular node type, the \method{default()} method is
called.
\end{classdesc}
\class{ASTVisitor} objects have the following methods:
XXX describe extra arguments
\begin{methoddesc}{default}{node\optional{, \moreargs}}
\end{methoddesc}
\begin{methoddesc}{dispatch}{node\optional{, \moreargs}}
\end{methoddesc}
\begin{methoddesc}{preorder}{tree, visitor}
\end{methoddesc}
\section{Bytecode Generation}
The code generator is a visitor that emits bytecodes. Each visit method
can call the \method{emit()} method to emit a new bytecode. The basic
code generator is specialized for modules, classes, and functions. An
assembler converts that emitted instructions to the low-level bytecode
format. It handles things like generator of constant lists of code
objects and calculation of jump offsets.
|