<HEAD>
<TITLE>EMIL version 2 TUTORIAL</TITLE>
</HEAD>
<BODY>

                                             
<H1>TUTORIAL FOR EMIL VERSION 2.1
</H1>
<EM>Written by Martin Wendel, ITS, Uppsala university.
Martin.Wendel@its.uu.se
</EM>
<HR>
<A HREF=design-spec.html><IMG ALIGN=MIDDLE SRC=arrow_right3.gif></A><A HREF=main.html><IMG ALIGN=MIDDLE SRC=arrow_up2.gif></A><A HREF=problem-statement.html><IMG ALIGN=MIDDLE SRC=arrow_left3.gif></A>


<H2>ANALYSIS</H2>
<H3>Introduction</H3>
<P>Analysis is concerned with building an abstract model of the problem and an object structure 
for solving the problem. The details are stripped off, leaving a highly abstract description of how the 
problem can be solved. The goal is to make the problem domain understandable and to provide a 
framework for the design.
<H3>Components of a Message</H3>
<P>A message has a structure. The structure decomposes the message at different levels of 
abstraction. The highest level of abstraction is represented as an <EM>object</EM> describing the entire 
message. Less abstract levels consist of objects that represent the parts of the message. 
Each of these objects has relations to <EM>attributes</EM>, <EM>data</EM> and to other 
objects. Typically a message 
object has relations to a header object and a body object (figure 1). Note that the object 
structure itself does not contain any data obtained from the message. It describes the structure of 
the message, using the message format, and the objects of this structure have relations to 
pieces of data that may be part of the message.
<BR><BR>
<img src="figure1.gif">
<BR>
<PRE>
    Figure 1. Rudimentary structure of a message.
</PRE>
<BR>

<H4>The Message Object</H4>
<P>The Message Object has a few attributes that are common to the entire message. Some of these
attributes are:
<UL>
<LI>The sender's address
<LI>The recipient's address
<LI>The subject
<LI>The format type, for example MIME
</UL>
<P>Typically the values of these attributes are found in the envelope and the header of the message.
However, the format type of the message cannot always be unambiguously derived from the header;
it may be necessary to scan the body as well. 

<H4>The Header Object</H4>
<P>The header of a message is also structured. It may consist of several 
lines of text conforming to a syntactic pattern described by the message format, each having 
some semantic meaning and containing some data or information. 
<P>A header line consists of a <EM>pattern</EM> and some data, or <EM>arguments</EM> 
to the format. These arguments are contained in <EM>data objects</EM> (figure 2). 
<BR><BR>
<IMG SRC="figure2.gif">
<BR>
<PRE>
    Figure 2. The Header Object and its relations.</PRE>
<BR>
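<P>As an illustration of the header structure described above, the following is a minimal sketch in Python, 
not Emil's own code. It unfolds continuation lines (lines starting with a space or a tab, as in RFC822) and 
splits each header line into its pattern and its argument data. The function names <EM>unfold</EM> and 
<EM>split_header_line</EM> are hypothetical, and joining folded lines with a single space is a simplification.

```python
def unfold(header_text):
    """Join folded header lines (those starting with space or tab) to the line above."""
    lines = []
    for raw in header_text.splitlines():
        if raw[:1] in (" ", "\t") and lines:
            # Continuation line: append to the previous header line.
            lines[-1] += " " + raw.strip()
        else:
            lines.append(raw)
    return lines


def split_header_line(line):
    """Split 'Name: value' into the pattern (field name) and its argument data."""
    name, _, value = line.partition(":")
    return name.strip(), value.strip()


header = "Subject: a long\r\n subject line\r\nTo: someone@example.org"
tokens = [split_header_line(line) for line in unfold(header)]
```

<P>Each resulting pair corresponds to one header line object with a single data object as its argument.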
<H4>The Body Object</H4>
<P>The body may also contain 
structured data or information. Data may have a type, in which case it is information. Data may 
be encoded and it may be represented in a character set. Data may also be binary, which 
distinguishes it from text (see below).

<P>RFC822 defines the body of a message as lines of text. Text is typically represented 
in a particular character set. It may be encoded and it may also contain encoded binary parts; this is 
a wide definition of text. MIME elaborates on this and defines a grammar for body type and encoding. 
Still, a MIME text is no different from a general text and can thus contain encoded binary parts.

<P>There are two types of bodies: <EM>single part bodies</EM> and 
<EM>multi part bodies</EM>. This distinction is handled by a boolean attribute of the body object. Each 
body object has relations to <EM>body part objects</EM>. When the body object has relations to
more than one body part object, it is a multi part body. 

<P>The body part object has some attributes describing the body part. Among those are:
<UL>
<LI>Type
<LI>Encoding
<LI>Size
<LI>Character set (for text)
</UL>

<P>In the case of a MIME message, those attributes can be retrieved from the data objects in
the header structure <EM>(the attributes are references to other objects)</EM>. In some cases the 
attributes can only be retrieved from the body part
data itself. The character set may also be retrieved from the configuration files.

<P>The body part objects have relations to data objects containing the data of the body part
(figure 3).
<BR><BR>
<IMG SRC="figure3.gif">
<BR>
<PRE>
    Figure 3. The Body Object.</PRE>
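<P>The object structure of figures 1 to 3 can be sketched as a few Python dataclasses. This is only an 
illustration under stated assumptions: all class and field names are hypothetical and are not taken from 
Emil's source, and the boolean multi part attribute is derived here from the number of body parts rather 
than stored.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DataObject:
    """A piece of data taken from the message (hypothetical name)."""
    data: bytes


@dataclass
class HeaderLine:
    """One header line: a pattern plus its argument data objects."""
    pattern: str
    arguments: List[DataObject] = field(default_factory=list)


@dataclass
class BodyPart:
    """Attributes follow the list in the text; values are illustrative."""
    content_type: str = "text"
    encoding: Optional[str] = None
    size: int = 0
    charset: Optional[str] = None
    data: List[DataObject] = field(default_factory=list)


@dataclass
class Body:
    parts: List[BodyPart] = field(default_factory=list)

    @property
    def is_multipart(self) -> bool:
        # More than one body part object makes this a multi part body.
        return len(self.parts) > 1


@dataclass
class Message:
    sender: str
    recipient: str
    subject: str
    format_type: str
    header: List[HeaderLine] = field(default_factory=list)
    body: Body = field(default_factory=Body)


msg = Message("a@example.org", "b@example.org", "Hello", "MIME")
msg.body.parts.append(BodyPart(charset="us-ascii", data=[DataObject(b"Hi")]))
```

<P>Note that the structure objects only refer to data objects; they hold no message data themselves.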

<H3>Relations within a Message</H3>

<P>As described in the previous section, a message is a structured body of data. It is equally
important to understand the relationships within the structure; understanding the structure
alone is not enough.

<H4>Envelope, Format and the Message</H4>

<P>The envelope contains information about the sender and the recipient of a message.
The header of a message contains a To-address and a From-address, but these have no direct relation
to the information in the envelope. While the envelope gives information about the origin and destination
of the message to the MTA (Mail Transfer Agent), the header gives similar, but not necessarily identical, 
information to the recipient.
The envelope is related to the Message object, the message itself.

<P>The format <EM>(For example MIME)</EM> is a description of the structure and the
syntax of a message. It describes what the message should look like for it to be properly handled
by the MTA and the UA (User Agent). If the message does not conform to the mutually agreed
format, chances are that the transport of the message will fail, and even if transport does not fail
the recipient may not be able to view the message as intended. The format describes the message
but the format is not part of the message. The relationships between the format and the message 
are primarily to the message object, the header line object and the body part object.

<P>The objects within the message also have relations other than those of the structure itself.
One of these is the relationship between the header line data and the body part object: in a MIME
message, file name and type are described in a header line for each body part. Another relation is the
one between the body part object and the body part data: a BinHexed attachment contains file name
and file type within the attachment.
<P>Another relationship is between the body part data and the message object: in a MIME multi part
message the body parts can be seen as message objects, since they contain headers and a body. Some
body parts may contain an empty header, but this is supported by the structure. This makes
MIME multi part messages very different from other message types with multiple parts. This
difference needs a workaround for the structure to be suitable.
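<P>That MIME body parts behave like messages can be observed with the <EM>email</EM> package of the Python 
standard library, used here purely as an illustration and not as part of Emil. In this sketch the sample 
message and the boundary string XYZ are made up. Each part of the parsed multi part message is itself a 
message object with its own header, and the second part has an empty header.

```python
from email import message_from_string

# A made-up two-part MIME message; the second part has an empty header.
raw = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/plain\n"
    "\n"
    "first part\n"
    "--XYZ\n"
    "\n"
    "second part, empty header\n"
    "--XYZ--\n"
)

msg = message_from_string(raw)
parts = msg.get_payload()  # for a multi part message: a list of message objects

for part in parts:
    # Every part offers the same interface as the top-level message.
    print(part.keys(), part.get_content_type())
```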

<P>The relations described above are displayed in figure 4. 
<BR><BR>
<IMG SRC="figure4.gif">
<BR>
<PRE>
    Figure 4. Relations between objects within a message. </PRE>
<H3>
The Format
</H3>
<P>The previous section described the format as one object. By looking at the relationships
it is quite easy to see that the format must be described in a more complex manner. Indeed
the structure of the format is not very different from the structure of the message.
<P>The format needs to declare what kinds of header lines and what types of body part
objects it supports. If dealing with encoded body part data, the format also declares what
methods of encoding are available for use.
<P>Elaborating the format yields figure 5.
<BR><BR>
<IMG SRC="figure5.gif">
<BR>
<PRE>
    Figure 5. Including a structured format object.</PRE>
<H3>
Message Structure
</H3>
<P>Looking back at the basic structure of a message, it is clear that different formats generate
different structures (figure 6). The greatest difference is between the unstructured formats of 6a and 6c and
the structured formats of 6b and 6d. Converting between these two kinds of structure is not easy.
The problem becomes simpler if the internal representation of the unstructured formats is changed 
towards the representation of the structured formats (figure 7). Here the top level of the message (named
0, <EM>zero</EM>) is common to all formats, while the lower level 1 is only available for structured 
formats. The MIME format allows an arbitrary depth of the structure while SUN Mailtool allows
only level 0 and 1.
<BR><BR>
<IMG SRC="figure6.gif">
<BR>
<PRE>
    Figure 6. Basic message structures. The object names are abbreviated 
    according to: M = message object, H = header object, B = body object, 
    P = body part object, D = data object. 6a shows a single part message. 
    6b shows a message with a single part attachment in SUN Mailtool 
    format. 6c shows a multipart unstructured format message. 6d shows 
    a multipart message according to MIME and SUN Mailtool.
</PRE><BR><BR><BR>
<IMG SRC="figure7.gif">
<BR>
<PRE>
    Figure 7. Preparing structure levels in the message structure.
</PRE>

<P>Using this representation it will be possible to structure a message similarly for
different formats. A non structured format must ignore the effects of message objects
other than level 0 while a SUN Mailtool format ignores level 2 and deeper. When
constructing the structure, care must be taken that the recursion of this representation
is strictly controlled, avoiding unwanted loops. A message object generated because of
a single body part object should be marked so that the descendant of the body part object
one level below is always a data object.
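<P>The level handling can be sketched as a depth-limited walk over the message structure. This is a 
minimal illustration and not Emil's implementation: the <EM>Node</EM> class and the <EM>walk</EM> function are 
hypothetical. A maximum level of 0 corresponds to an unstructured format, 1 to SUN Mailtool, and no limit 
to MIME. A node that is reached twice means the structure is not a tree, which is reported as a loop.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A message object at some level of the structure (hypothetical)."""
    name: str
    children: List["Node"] = field(default_factory=list)


def walk(node, max_level=None, level=0, _seen=None):
    """Yield (level, name) pairs, ignoring levels deeper than max_level."""
    seen = set() if _seen is None else _seen
    if id(node) in seen:
        # Reaching the same object twice would make the recursion unbounded.
        raise ValueError("loop in message structure")
    seen.add(id(node))
    yield level, node.name
    if max_level is not None and level >= max_level:
        return
    for child in node.children:
        yield from walk(child, max_level, level + 1, seen)


tree = Node("M0", [Node("M1", [Node("M2")])])
unstructured_view = list(walk(tree, max_level=0))
mailtool_view = list(walk(tree, max_level=1))
mime_view = list(walk(tree))
```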

<H3>Boundaries and other delimiters</H3>
<P>One of the main problems of parsing a message is identifying the delimiters of the
body parts. The end delimiter of a header is easy to find: a CRLF on a line of its own.
The end delimiter of a text is somewhat harder to find.
<P>If the body parts of an unstructured message are divided into two groups, parts of type text and parts
of type encoded data, the classification becomes:
<UL>

<LI>Text - This is the default type, thus the start condition is met when the start condition for
an encoding is not. The end condition is met at the start of an encoding or at end of data (this should
work in a data buffer as well, therefore end of file is not generally applicable).

<LI>Encodings - An encoding is, as far as this document is concerned, a BinHexed or uuencoded
body of data. The start conditions for these are the fairly unique starting strings used by these
encodings. The end condition is also an encoding specific issue. However, the start condition is 
only met when the body of the encoding conforms with the syntax of the encoding format. This
makes three stages that must be correct (start, body and end) for the start condition to be fulfilled.

</UL>
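<P>The three-stage check for an encoding (start, body and end) can be sketched for uuencoded data as follows. 
This is a simplified illustration, not Emil's parser: the function names are hypothetical, the line check 
only tests the character range and the length byte, and BinHex is not covered. A start string is accepted 
only if every following line up to the end line looks like uuencoded data.

```python
import re

# Start string of a uuencoded block: "begin", a file mode and a file name.
BEGIN = re.compile(r"^begin [0-7]{3,4} \S")


def is_uu_line(line):
    """Rough syntax check of one uuencoded body line."""
    if not line:
        return False
    if any(ord(c) not in range(32, 97) for c in line):
        return False
    n = (ord(line[0]) - 32) % 64   # number of bytes encoded on this line
    needed = (n + 2) // 3 * 4      # characters required to carry n bytes
    return n in range(0, 46) and len(line) - 1 >= needed


def find_uuencoded(lines):
    """Return (start, end) line indexes of the first valid block, or None."""
    for start, line in enumerate(lines):
        if not BEGIN.match(line):
            continue
        for end in range(start + 1, len(lines)):
            if lines[end].strip() == "end":
                return start, end
            if not is_uu_line(lines[end]):
                break  # body does not conform: the start condition is not met
    return None


lines = ["Hello,", "begin 644 cat.txt", "#0V%T", "`", "end", "Bye"]
block = find_uuencoded(lines)
```

<P>Everything outside the returned range is treated as text, the default type.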
<P>There are two more methods of decomposing a message:

<UL>

<LI>Unique boundaries - This is the method used by MIME. A unique boundary string acts as
delimiter of the body parts. If the boundary is not unique, the message is
corrupt. Because of its nesting capabilities, MIME also uses a unique boundary
as end condition both of the body part and of the multi part structure.

<LI>Size specification - This is used by SUN Mailtool, together with a non unique boundary
string. A size specification is really only an end condition; SUN Mailtool uses a boundary as
start condition.

</UL>
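<P>Decomposition by unique boundaries can be sketched as below. In MIME a part starts at a line consisting 
of two hyphens followed by the boundary string, and the multi part structure ends at the same line with two 
more hyphens appended. The function name <EM>split_on_boundary</EM> is hypothetical, and the sketch ignores 
nested multi part bodies and the size-based method of SUN Mailtool.

```python
def split_on_boundary(body_lines, boundary):
    """Split body lines into parts delimited by a MIME-style boundary."""
    delimiter = "--" + boundary   # start condition of each body part
    closing = delimiter + "--"    # end condition of the multi part structure
    parts, current = [], None
    for line in body_lines:
        stripped = line.rstrip()
        if stripped == closing:
            if current is not None:
                parts.append(current)
            return parts
        if stripped == delimiter:
            if current is not None:
                parts.append(current)
            current = []
        elif current is not None:
            current.append(line)
        # Lines before the first delimiter (the preamble) are skipped.
    raise ValueError("closing boundary not found")


body = ["preamble", "--XYZ", "a", "--XYZ", "b", "c", "--XYZ--", "epilogue"]
parts = split_on_boundary(body, "XYZ")
```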

<H3>Format recognition</H3>
<P>When a message comes in, Emil must be able to recognize the structure and format
information to use when parsing it. Other information that can be of great importance is
the default character set used by the sender. MIME and SUN Mailtool formats are
specified in the message itself by specific header lines. When the message is
in neither of those formats, a default character set for the sender must be used.
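<P>A minimal sketch of format recognition from parsed header lines follows. It rests on assumptions that 
are not stated in this document: MIME is recognized by the presence of a MIME-Version header line, SUN 
Mailtool by a Content-Type value starting with X-sun-attachment, and anything else falls back to plain 
RFC822 with the sender's default character set. The function name and the returned format names are 
hypothetical.

```python
def recognize_format(headers, sender_charset="us-ascii"):
    """Return (format name, default charset or None) for a list of (name, value) pairs."""
    names = {name.lower(): value for name, value in headers}
    content_type = names.get("content-type", "").lower()
    if "mime-version" in names:
        # Assumption: a MIME-Version header line marks a MIME message.
        return "MIME", None
    if content_type.startswith("x-sun-attachment"):
        # Assumption: this Content-Type value marks a SUN Mailtool message.
        return "SUN Mailtool", None
    # Neither format: the sender's default character set must be used.
    return "RFC822", sender_charset


result = recognize_format([("Subject", "hi")], sender_charset="iso-8859-1")
```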

<P>

<H3>Functional Model</H3>
<P>The difference between input and output, although at the same level of abstraction, is 
too great to span in a single step. Emil uses a multipass design to accomplish the transformation. 
Because of this, the problem of transformation can be divided into several smaller problems, 
making it easier to grasp.
<P>Decomposing the overall function of Emil into clearly defined functions yields:
<OL>
<LI><STRONG>Load and parse message (tag objects with corresponding format name)</STRONG>
<OL>
<LI>Get sender and recipient<EM> (these are specified as arguments)</EM>
<LI>Get target format<EM> (lookup recipient in the target database)</EM>
<LI>Get sender's default charset <EM>(lookup sender in the target database)</EM>
<LI>Load message<EM> (into a single character buffer)</EM>
<LI>Parse header <EM>(unfold and structure into tokens)</EM>
<LI>Try formats <EM>(compare header with the patterns defined in the formats)</EM>
<LI>Parse body <EM>(look for boundaries and other delimiters)</EM>
<UL>
<LI>Assign body part descriptors, boundaries and method of decomposition
<LI>In case of a MIME multipart body return to step 5 (parse header) until end delimiter is found
<LI>In case of a SUN Mailtool multipart body return to step 5 (parse header) until end of buffer
<LI>Structure body into body part objects
</UL>
</OL>
<LI><STRONG>Apply conversion of data objects</STRONG>
<OL>
<LI>Compare object descriptors with target format
<LI>Convert non-conforming encodings
<UL>
<LI>Decode into a common representation
<LI>Encode into the specified target encoding
</UL>
</OL>
<LI><STRONG>Add target format</STRONG>
<OL>
<LI>Add target headers
<LI>Add target method of decomposition and boundaries
</OL>
<LI><STRONG>Output message (use only objects tagged with the target format name)</STRONG>
</OL>
  
<P>This is a rough model of the functions performed by Emil.
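<P>The conversion step of the model (decode into a common representation, then encode into the target 
encoding) can be sketched with the <EM>base64</EM> and <EM>quopri</EM> modules of the Python standard library. 
This is an illustration only: the table names and the <EM>convert</EM> function are hypothetical, raw bytes 
are assumed as the common representation, and only three encodings are covered.

```python
import base64
import quopri

# Decoders turn encoded data into the common representation (raw bytes).
DECODERS = {
    "base64": base64.b64decode,
    "quoted-printable": quopri.decodestring,
    "7bit": lambda data: data,
}

# Encoders turn the common representation into the target encoding.
ENCODERS = {
    "base64": base64.encodebytes,
    "quoted-printable": quopri.encodestring,
    "7bit": lambda data: data,
}


def convert(data, source_encoding, target_encoding):
    """Re-encode data only if its encoding differs from the target encoding."""
    if source_encoding == target_encoding:
        return data
    raw = DECODERS[source_encoding](data)
    return ENCODERS[target_encoding](raw)


converted = convert(b"caf=E9", "quoted-printable", "base64")
```

<P>Going through a common representation means that each encoding needs only one decoder and one encoder, 
rather than a converter for every pair of encodings.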

<HR>
<A HREF=design-spec.html><IMG ALIGN=MIDDLE SRC=arrow_right3.gif></A><A HREF=main.html><IMG ALIGN=MIDDLE SRC=arrow_up2.gif></A><A HREF=problem-statement.html><IMG ALIGN=MIDDLE SRC=arrow_left3.gif></A>


<hr size="4" noshade>
<ADDRESS>
<table WIDTH="95%">
<td>
March 1996<p>
<B>ITS Uppsala university</B><BR>
Box 887<BR>
751 08 Uppsala<BR>
SWEDEN<P>
</td>
<td ALIGN="right" VALIGN="middle">
<a href="mailto:Martin.Wendel@its.uu.se">Martin Wendel</a>
</td>
<td ALIGN="left" VALIGN="middle">
<a href="mailto:Martin.Wendel@its.uu.se">
<IMG border="0" SRC="binpobox.gif" ALT="E-Mail: "></a>
</td>
</table>
</ADDRESS>
</body>
</html>