<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>LamsonProject: Unicode Encoding And Decoding</title>
<link rel="stylesheet" href="/styles/global.css" type="text/css" charset="utf-8" />
<link rel="stylesheet" href="/css/code.css" type="text/css" charset="utf-8" />
<!--[if IE 7]>
<style type="text/css" media="screen">
div#column_left ul.sidebar_menu li div.color{
display: none;
}
</style>
<![endif]-->
<link href="/prettify.css" type="text/css" rel="stylesheet" />
<script type="text/javascript" src="/prettify.js"></script>
</head>
<body onload="prettyPrint()">
<div id="content_centered">
<div id="header">
<h1><img id="logo" src="/images/lamson.png" alt="Lamson Project(TM) - Pipes and aliases are so 1970." /></h1>
<ul id="header_menu">
<li><a href="/">Home</a></li>
<li><a href="/blog/">News</a></li>
<li><a href="/feed.xml">Feed</a></li>
<li><a href="/download.html">Download</a></li>
<li><a href="/docs/">Documentation</a></li>
<li><a href="/docs/api/">API</a></li>
</ul>
</div>
<div id="main_content">
<h1>Unicode Encoding And Decoding</h1>
<p>The world is Unicode, but email existed long before that world. Lamson uses
special encoding/decoding gear that takes just about any nasty horrible pre-Unicode
email it gets and converts it to a perfect little Unicode fantasy. You do all
your work in the Unicode world, which Python favors, and then when you’re ready
to send, Lamson intelligently figures out how to make an email that anyone can
read.</p>
<p>The added advantage of doing this conversion is that Lamson by default also
cleans up all the various tricks spammers use to get around spam checkers. Because
the conversion works from the single first principle that “everything must become Unicode”,
the end result is a crystal clear piece of data you can filter. When you then
output the same message, it gets reconverted to a sane, easily parseable message.</p>
<p>Lamson is also so persistent in this conversion that it can convert nearly every
message that’s valid, and the very tiny percentage it can’t (less than 1/1000th of a percent)
is entirely screwed up spam that should not be transmitted anyway.</p>
<blockquote>
<p>This gear apparently doesn’t support pre-Hindic Sanskrit, but then again neither
does Python.</p>
</blockquote>
<h2>Why</h2>
<p>Sometimes people ask why Lamson would bother converting everything to Unicode.
The reason is simple: everything you deal with now is Unicode. Either it’s in
a representation like UTF-8, or it’s internally a Python Unicode string. Databases,
web frameworks, <span class="caps">ORM</span>s, network protocols, and the entire internet are Unicode.</p>
<p>However, there’s another practical reason. If you were to start using Lamson to
process email, you would end up inventing almost the exact same gear to convert
headers and bodies to Unicode. The only difference is you would do it in a typical
hacker half-assed fashion that only solved your immediate problems, and you wouldn’t
do it in a globally useful way.</p>
<p>The Lamson encoding gear is basically what you should be making to deal with email
in a modern language.</p>
<h2>How </h2>
<p>The <a href="http://lamsonproject.org/docs/api/lamson.encoding-module.html">lamson.encoding</a> module
is the main meat of this conversion magic. This code is probably the most dense
part of Lamson since it has to use various parsing tricks and heuristics to figure
out how to get every part of the message into a Unicode string.</p>
<p>The primary trick used is that Lamson does not trust anything it’s given, and when
Python fails to convert a header or body, it uses the wonderful <a href="http://chardet.feedparser.org/">chardet</a>
library to guess at what the encoding should be based on the contents. If this fails
too, then all of the data is considered suspect and the message is rejected.</p>
<p>This turns out to be surprisingly accurate in practice, and it even corrects for some
broken clients that fail to encode Subject lines but still place Chinese encodings
in them. Lamson will take those unencoded headers, fail to convert them from <span class="caps">ASCII</span>,
run chardet on them to see they are actually Chinese, and then convert them accurately.</p>
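<p>To make the fallback concrete, here is a minimal sketch of the idea in Python 3. It is
not the code in lamson.encoding: the names <code>to_unicode</code>, <code>guess_with_chardet</code>
and <code>UndecodableError</code> are made up for illustration, and the only chardet call assumed
is <code>chardet.detect()</code>, which returns a dict with an <code>encoding</code> entry.</p>

```python
class UndecodableError(Exception):
    """Hypothetical error: neither the declared codec nor the guess worked."""


def guess_with_chardet(raw):
    # chardet is a third-party package; detect() returns a dict whose
    # "encoding" entry may be None when it cannot make a guess.
    import chardet
    return chardet.detect(raw).get("encoding")


def to_unicode(raw, declared=None, guess=guess_with_chardet):
    """Decode raw bytes, trusting the declared codec only if it actually works."""
    try:
        return raw.decode(declared or "ascii")
    except (UnicodeDecodeError, LookupError):
        pass  # the declared codec lied, or names a codec Python does not know

    guessed = guess(raw)
    if guessed is None:
        raise UndecodableError("no codec could be detected")
    try:
        return raw.decode(guessed)
    except (UnicodeDecodeError, LookupError):
        raise UndecodableError("detected codec %r failed as well" % guessed)
```

<p>A Subject line that claims to be ASCII but actually holds Big5 bytes would fail the first
decode, go through the guesser, and come back as a Unicode string. If the guesser has
no answer either, the sketch raises, which is where a real handler would reject the message.</p>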
<h2>The Rules</h2>
<p>A set of axioms controls how the encoding is done:</p>
<ol>
<li>NO <span class="caps">ENCODING</span> IS <span class="caps">TRUSTED</span>, NO <span class="caps">LANGUAGE</span> IS <span class="caps">SACRED</span>, <span class="caps">ALL</span> <span class="caps">ARE</span> <span class="caps">SUSPECT</span>.</li>
<li>Python wants Unicode, it will get Unicode.</li>
<li>Any email that <span class="caps">CANNOT</span> become Unicode, <span class="caps">CANNOT</span> be processed by Lamson or Python.</li>
<li>Email addresses are <span class="caps">ESSENTIAL</span> to Lamson’s routing and security, and therefore will be canonicalized and properly encoded.</li>
<li>Lamson will therefore try to “upgrade” all email it receives to Unicode internally, and clean all email addresses.</li>
<li>It does this by decoding all codecs, and if the codec <span class="caps">LIES</span>, then it will attempt to statistically detect the codec using chardet.</li>
<li>If it can’t detect the codec, and the codec lies, then the email is bad.</li>
<li>All text bodies and attachments are then converted to Python Unicode in the same way as the headers.</li>
<li>All other attachments are converted to raw strings as-is.</li>
</ol>
<p>Once Lamson has done this, your Python handler can now assume that all
MailRequest objects are happily Unicode enabled and ready to go. The rule is:</p>
<p>IF IT <span class="caps">CANNOT</span> BE <span class="caps">UNICODE</span>, <span class="caps">THEN</span> <span class="caps">PYTHON</span> <span class="caps">CANNOT</span> <span class="caps">WORK</span> <span class="caps">WITH</span> IT.</p>
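<p>As a rough illustration of what “decoding all codecs” means for a single header, the
sketch below uses only Python 3’s standard <code>email.header.decode_header</code>. The helper name
<code>header_to_unicode</code> is made up, and this is far simpler than what Lamson really does: it
trusts the declared charset and ignores the finer points of whitespace between mixed chunks.</p>

```python
from email.header import decode_header


def header_to_unicode(value):
    """Turn an RFC 2047 encoded header value into a plain Unicode string."""
    pieces = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            # Chunks without a declared charset are expected to be ASCII.
            chunk = chunk.decode(charset or "ascii")
        pieces.append(chunk)
    return "".join(pieces)
```

<p>If the declared charset lies, the <code>decode</code> call raises <code>UnicodeDecodeError</code>. Under
the rules above, that is the point where detection would take over, and if detection
also fails, the email is bad.</p>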
<p>On the outgoing end (when you send a MailResponse), Lamson tries to create the
email it wants to receive by canonicalizing it:</p>
<ol>
<li>All email will be encoded in the simplest cleanest way possible without losing information.</li>
<li>All headers are converted to 'ascii', and if that doesn’t work, then 'utf-8'.</li>
<li>All text/* attachments and bodies are converted to 'ascii', and if that doesn’t work, 'utf-8'.</li>
<li>All other attachments are left alone.</li>
<li>All email addresses are normalized and encoded if they have not been already.</li>
</ol>
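<p>The “ascii first, then utf-8” rule is easy to picture in code. This is only a sketch in
Python 3 with made-up helper names (<code>simplest_charset</code>, <code>encode_header_value</code>),
using the standard <code>email.header.Header</code> class rather than Lamson’s own encoder:</p>

```python
from email.header import Header


def simplest_charset(text):
    """Return 'ascii' when the text fits in ASCII, otherwise 'utf-8'."""
    try:
        text.encode("ascii")
        return "ascii"
    except UnicodeEncodeError:
        return "utf-8"


def encode_header_value(text):
    """Leave ASCII values alone; wrap anything else as a UTF-8 encoded word."""
    if simplest_charset(text) == "ascii":
        return text
    return Header(text, "utf-8").encode()
```

<p>Plain ASCII passes through untouched, so most headers stay readable in the raw message,
and only values that really need it pay the cost of an encoded word.</p>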
<h2>Neat Tricks</h2>
<p>The end result of this is that you can now take any email and then convert it to
any modern data format and send it over new protocols. For example, here’s the
code from <a href="http://librelist.com/">librelist.com</a> to implement the <span class="caps">JSON</span> dump
for the <a href="http://librelist.com/browser/">archive browser</a> Javascript interface:</p>
<pre class="code prettyprint">
import base64
import json

def json_encoding(base):
    # content_encoding maps a header name to a (value, params) pair.
    ctype, ctp = base.content_encoding['Content-Type']
    cdisp, cdp = base.content_encoding['Content-Disposition']
    ctype = ctype or "text/plain"
    filename = ctp.get('name', None) or cdp.get('filename', None)

    # Text and message parts are already Unicode; everything else is binary.
    if ctype.startswith('text') or ctype.startswith('message'):
        encoding = None
    else:
        encoding = "base64"

    return {'filename': filename, 'type': ctype, 'disposition': cdisp,
            'format': encoding}

def json_build(base):
    data = {'headers': base.headers,
            'body': base.body,
            'encoding': json_encoding(base),
            'parts': [json_build(p) for p in base.parts],
            }

    # Only binary parts need base64 to survive the trip through JSON.
    if data['encoding']['format'] and base.body:
        data['body'] = base64.b64encode(base.body)

    return data

def to_json(base):
    return json.dumps(json_build(base), sort_keys=True, indent=4)
</pre>
<p>Since this code can assume that everything inside Lamson is fully Unicode, it only needs
to worry about binary items like images and encode those as base64.</p>
<h2>lamson cleanse</h2>
<p>You can also use the <code>lamson cleanse</code> command to take a Maildir or mbox as input, and
then convert it into a cleaned up Maildir as output. Try it on some of your spam
folders to watch the magic.</p>
<h2>Reporting Problems</h2>
<p>If you happen to run into a message that you feel Lamson should decode
accurately but doesn’t, feel free to <a href="/contact.html">report it</a> and I’ll take a look.</p>
</div>
<div id="column_left">
<ul class="sidebar_menu">
<li>
<div class="item">
<div class="color" style="background-color: #ff0000;"> </div>
<a href="/blog/">Latest News</a>
</div>
</li>
<li>
<div class="item">
<div class="color" style="background-color: #ff9900;"> </div>
<a href="/download.html">Download the Gear</a>
</div>
</li>
<li>
<div class="item">
<div class="color" style="background-color: #99cc00;"> </div>
<a href="/docs/getting_started.html">Getting Started</a>
</div>
</li>
<li>
<div class="item">
<div class="color" style="background-color: #3399ff;"> </div>
<a href="/docs/">Documentation</a>
</div>
</li>
<li>
<div class="item">
<div class="color" style="background-color: #ff3399;"> </div>
<a href="/docs/faq.html">Frequently Asked Questions</a>
</div>
</li>
<li>
<div class="item">
<div class="color" style="background-color: #006699;"> </div>
<a href="/about.html">About Lamson</a>
</div>
</li>
<li>
<div class="item">
<div class="color" style="background-color: #0099cc;"> </div>
<a href="/contact.html">Getting Help with Lamson</a>
</div>
</li>
</ul>
<div class="sidebar_item">
<h3>Quick Start</h3>
<p>See the download instructions for information on getting lamson, and read the getting started instructions to start your own application in less than 10 minutes.</p>
</div>
<br/>
<div class="sidebar_item">
<h3>Mailing Lists</h3>
<p>Lamson hosts its own <a href="/lists/">mailing lists</a> and also provides a free, open mailing list
service for anyone who needs one. Simply send an email to the list you want @librelist.com and it will
get you started.</p>
</div>
</div>
<div id="footer">
<div class="footer_content">
Lamson Project(TM) and all material on this site Copyright © 2009 <a href="http://zedshaw.com/" title="Zed Shaw's blog">Zed Shaw</a> unless otherwise stated.<br/>
Website Designed by <a href="http://kenkeiter.com/">Kenneth Keitner</a> and donated to the LamsonProject.
</div>
</div>
<!-- end:centered_content -->
</div>
</body>
</html>