1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153
|
\declaremodule{standard}{email.Parser}
\modulesynopsis{Parse flat text email messages to produce a message
object structure.}
Message object structures can be created in one of two ways: they can be
created from whole cloth by instantiating \class{Message} objects and
stringing them together via \method{attach()} and
\method{set_payload()} calls, or they can be created by parsing a flat text
representation of the email message.
The \module{email} package provides a standard parser that understands
most email document structures, including MIME documents. You can
pass the parser a string or a file object, and the parser will return
to you the root \class{Message} instance of the object structure. For
simple, non-MIME messages the payload of this root object will likely
be a string containing the text of the message. For MIME
messages, the root object will return \code{True} from its
\method{is_multipart()} method, and the subparts can be accessed via
the \method{get_payload()} and \method{walk()} methods.
Note that the parser can be extended in limited ways, and of course
you can implement your own parser completely from scratch. There is
no magical connection between the \module{email} package's bundled
parser and the \class{Message} class, so your custom parser can create
message object trees any way it finds necessary.
The primary parser class is \class{Parser} which parses both the
headers and the payload of the message. In the case of
\mimetype{multipart} messages, it will recursively parse the body of
the container message. Two modes of parsing are supported,
\emph{strict} parsing, which will usually reject any non-RFC compliant
message, and \emph{lax} parsing, which attempts to adjust for common
MIME formatting problems.
The \module{email.Parser} module also provides a second class, called
\class{HeaderParser} which can be used if you're only interested in
the headers of the message. \class{HeaderParser} can be much faster in
these situations, since it does not attempt to parse the message body,
instead setting the payload to the raw body as a string.
\class{HeaderParser} has the same API as the \class{Parser} class.
\subsubsection{Parser class API}
\begin{classdesc}{Parser}{\optional{_class\optional{, strict}}}
The constructor for the \class{Parser} class takes an optional
argument \var{_class}. This must be a callable factory (such as a
function or a class), and it is used whenever a sub-message object
needs to be created. It defaults to \class{Message} (see
\refmodule{email.Message}). The factory will be called without
arguments.
The optional \var{strict} flag specifies whether strict or lax parsing
should be performed. When things like MIME terminating
boundaries are missing, or when messages contain other formatting
problems, the \class{Parser} will raise a
\exception{MessageParseError}, if the \var{strict} flag is \code{True}.
However, when lax parsing is enabled (i.e. \var{strict} is \code{False}),
the \class{Parser} will attempt to work around such broken formatting to
produce a usable message structure (this doesn't mean
\exception{MessageParseError}s are never raised; some ill-formatted
messages just can't be parsed). The \var{strict} flag defaults to
\code{False} since lax parsing usually provides the most convenient
behavior.
\versionchanged[The \var{strict} flag was added]{2.2.2}
\end{classdesc}
The other public \class{Parser} methods are:
\begin{methoddesc}[Parser]{parse}{fp\optional{, headersonly}}
Read all the data from the file-like object \var{fp}, parse the
resulting text, and return the root message object. \var{fp} must
support both the \method{readline()} and the \method{read()} methods
on file-like objects.
The text contained in \var{fp} must be formatted as a block of \rfc{2822}
style headers and header continuation lines, optionally preceded by a
envelope header. The header block is terminated either by the
end of the data or by a blank line. Following the header block is the
body of the message (which may contain MIME-encoded subparts).
Optional \var{headersonly} is as with the \method{parse()} method.
\versionchanged[The \var{headersonly} flag was added]{2.2.2}
\end{methoddesc}
\begin{methoddesc}[Parser]{parsestr}{text\optional{, headersonly}}
Similar to the \method{parse()} method, except it takes a string
object instead of a file-like object. Calling this method on a string
is exactly equivalent to wrapping \var{text} in a \class{StringIO}
instance first and calling \method{parse()}.
Optional \var{headersonly} is a flag specifying whether to stop
parsing after reading the headers or not. The default is \code{False},
meaning it parses the entire contents of the file.
\versionchanged[The \var{headersonly} flag was added]{2.2.2}
\end{methoddesc}
Since creating a message object structure from a string or a file
object is such a common task, two functions are provided as a
convenience. They are available in the top-level \module{email}
package namespace.
\begin{funcdesc}{message_from_string}{s\optional{, _class\optional{, strict}}}
Return a message object structure from a string. This is exactly
equivalent to \code{Parser().parsestr(s)}. Optional \var{_class} and
\var{strict} are interpreted as with the \class{Parser} class constructor.
\versionchanged[The \var{strict} flag was added]{2.2.2}
\end{funcdesc}
\begin{funcdesc}{message_from_file}{fp\optional{, _class\optional{, strict}}}
Return a message object structure tree from an open file object. This
is exactly equivalent to \code{Parser().parse(fp)}. Optional
\var{_class} and \var{strict} are interpreted as with the
\class{Parser} class constructor.
\versionchanged[The \var{strict} flag was added]{2.2.2}
\end{funcdesc}
Here's an example of how you might use this at an interactive Python
prompt:
\begin{verbatim}
>>> import email
>>> msg = email.message_from_string(myString)
\end{verbatim}
\subsubsection{Additional notes}
Here are some notes on the parsing semantics:
\begin{itemize}
\item Most non-\mimetype{multipart} type messages are parsed as a single
message object with a string payload. These objects will return
\code{False} for \method{is_multipart()}. Their
\method{get_payload()} method will return a string object.
\item All \mimetype{multipart} type messages will be parsed as a
container message object with a list of sub-message objects for
their payload. The outer container message will return
\code{True} for \method{is_multipart()} and their
\method{get_payload()} method will return the list of
\class{Message} subparts.
\item Most messages with a content type of \mimetype{message/*}
(e.g. \mimetype{message/delivery-status} and
\mimetype{message/rfc822}) will also be parsed as container
object containing a list payload of length 1. Their
\method{is_multipart()} method will return \code{True}. The
single element in the list payload will be a sub-message object.
\end{itemize}
|