1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235
|
<!DOCTYPE html
PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<title>18.3. Optimizing Regular Expressions</title>
<link rel="stylesheet" href="../diveintopython.css" type="text/css">
<link rev="made" href="mailto:f8dy@diveintopython.org">
<meta name="generator" content="DocBook XSL Stylesheets V1.52.2">
<meta name="keywords" content="Python, Dive Into Python, tutorial, object-oriented, programming, documentation, book, free">
<meta name="description" content="Python from novice to pro">
<link rel="home" href="../toc/index.html" title="Dive Into Python">
<link rel="up" href="index.html" title="Chapter 18. Performance Tuning">
<link rel="previous" href="timeit.html" title="18.2. Using the timeit Module">
<link rel="next" href="dictionary_lookups.html" title="18.4. Optimizing Dictionary Lookups">
</head>
<body>
<table id="Header" width="100%" border="0" cellpadding="0" cellspacing="0" summary="">
<tr>
<td id="breadcrumb" colspan="5" align="left" valign="top">You are here: <a href="../index.html">Home</a> > <a href="../toc/index.html">Dive Into Python</a> > <a href="index.html">Performance Tuning</a> > <span class="thispage">Optimizing Regular Expressions</span></td>
<td id="navigation" align="right" valign="top"> <a href="timeit.html" title="Prev: “Using the timeit Module”"><<</a> <a href="dictionary_lookups.html" title="Next: “Optimizing Dictionary Lookups”">>></a></td>
</tr>
<tr>
<td colspan="3" id="logocontainer">
<h1 id="logo"><a href="../index.html" accesskey="1">Dive Into Python</a></h1>
<p id="tagline">Python from novice to pro</p>
</td>
<td colspan="3" align="right">
<form id="search" method="GET" action="http://www.google.com/custom">
<p><label for="q" accesskey="4">Find: </label><input type="text" id="q" name="q" size="20" maxlength="255" value=" "> <input type="submit" value="Search"><input type="hidden" name="cof" value="LW:752;L:http://diveintopython.org/images/diveintopython.png;LH:42;AH:left;GL:0;AWFID:3ced2bb1f7f1b212;"><input type="hidden" name="domains" value="diveintopython.org"><input type="hidden" name="sitesearch" value="diveintopython.org"></p>
</form>
</td>
</tr>
</table>
<!--#include virtual="/inc/ads" -->
<div class="section" lang="en">
<div class="titlepage">
<div>
<div>
<h2 class="title"><a name="soundex.stage1"></a>18.3. Optimizing Regular Expressions
</h2>
</div>
</div>
<div></div>
</div>
<div class="abstract">
<p>The first thing the Soundex function checks is whether the input is a non-empty string of letters. What's the best way to
do this?
</p>
</div>
<p>If you answered “<span class="quote">regular expressions</span>”, go sit in the corner and contemplate your bad instincts. Regular expressions are almost never the right answer; they should
be avoided whenever possible. Not only for performance reasons, but simply because they're difficult to debug and maintain.
Also for performance reasons.
</p>
<p>This code fragment from <tt class="filename">soundex/stage1/soundex1a.py</tt> checks whether the function argument <tt class="varname">source</tt> is a word made entirely of letters, with at least one letter (not the empty string):
</p>
<div class="informalexample"><pre class="programlisting">
allChars = string.uppercase + string.lowercase
<span class='pykeyword'>if</span> <span class='pykeyword'>not</span> re.search(<span class='pystring'>'^[%s]+$'</span> % allChars, source):
<span class='pykeyword'>return</span> <span class='pystring'>"0000"</span>
</pre></div>
<p>How does <tt class="filename">soundex1a.py</tt> perform? For convenience, the <tt class="literal">__main__</tt> section of the script contains this code that calls the <tt class="filename">timeit</tt> module, sets up a timing test with three different names, tests each name three times, and displays the minimum time for
each:
</p>
<div class="informalexample"><pre class="programlisting"><span class='pykeyword'>
if</span> __name__ == <span class='pystring'>'__main__'</span>:
<span class='pykeyword'>from</span> timeit <span class='pykeyword'>import</span> Timer
names = (<span class='pystring'>'Woo'</span>, <span class='pystring'>'Pilgrim'</span>, <span class='pystring'>'Flingjingwaller'</span>)
<span class='pykeyword'>for</span> name <span class='pykeyword'>in</span> names:
statement = <span class='pystring'>"soundex('%s')"</span> % name
t = Timer(statement, <span class='pystring'>"from __main__ import soundex"</span>)
<span class='pykeyword'>print</span> name.ljust(15), soundex(name), min(t.repeat())
</pre></div>
<p>So how does <tt class="filename">soundex1a.py</tt> perform with this regular expression?
</p>
<div class="informalexample"><pre class="screen">
<tt class="prompt">C:\samples\soundex\stage1></tt><span class="userinput">python soundex1a.py</span>
<span class="computeroutput">Woo W000 19.3356647283
Pilgrim P426 24.0772053431
Flingjingwaller F452 35.0463220884</span>
</pre></div>
<p>As you might expect, the algorithm takes significantly longer when called with longer names. There will be a few things we
can do to narrow that gap (make the function take less relative time for longer input), but the nature of the algorithm dictates
that it will never run in constant time.
</p>
<p>The other thing to keep in mind is that we are testing a representative sample of names. <tt class="literal">Woo</tt> is a kind of trivial case, in that it gets shorted down to a single letter and then padded with zeros. <tt class="literal">Pilgrim</tt> is a normal case, of average length and a mixture of significant and ignored letters. <tt class="literal">Flingjingwaller</tt> is extraordinarily long and contains consecutive duplicates. Other tests might also be helpful, but this hits a good range
of different cases.
</p>
<p>So what about that regular expression? Well, it's inefficient. Since the expression is testing for ranges of characters
(<tt class="literal">A-Z</tt> in uppercase, and <tt class="literal">a-z</tt> in lowercase), we can use a shorthand regular expression syntax. Here is <tt class="filename">soundex/stage1/soundex1b.py</tt>:
</p>
<div class="informalexample"><pre class="programlisting">
<span class='pykeyword'>if</span> <span class='pykeyword'>not</span> re.search(<span class='pystring'>'^[A-Za-z]+$'</span>, source):
<span class='pykeyword'>return</span> <span class='pystring'>"0000"</span>
</pre></div>
<p><tt class="filename">timeit</tt> says <tt class="filename">soundex1b.py</tt> is slightly faster than <tt class="filename">soundex1a.py</tt>, but nothing to get terribly excited about:
</p>
<div class="informalexample"><pre class="screen">
<tt class="prompt">C:\samples\soundex\stage1></tt><span class="userinput">python soundex1b.py</span>
<span class="computeroutput">Woo W000 17.1361133887
Pilgrim P426 21.8201693232
Flingjingwaller F452 32.7262294509</span>
</pre></div>
<p>We saw in <a href="../refactoring/refactoring.html" title="15.3. Refactoring">Section 15.3, “Refactoring”</a> that regular expressions can be compiled and reused for faster results. Since this regular expression never changes across
function calls, we can compile it once and use the compiled version. Here is <tt class="filename">soundex/stage1/soundex1c.py</tt>:
</p>
<div class="informalexample"><pre class="programlisting">
isOnlyChars = re.compile(<span class='pystring'>'^[A-Za-z]+$'</span>).search
<span class='pykeyword'>def</span><span class='pyclass'> soundex</span>(source):
<span class='pykeyword'>if</span> <span class='pykeyword'>not</span> isOnlyChars(source):
<span class='pykeyword'>return</span> <span class='pystring'>"0000"</span>
</pre></div>
<p>Using a compiled regular expression in <tt class="filename">soundex1c.py</tt> is significantly faster:
</p>
<div class="informalexample"><pre class="screen">
<tt class="prompt">C:\samples\soundex\stage1></tt><span class="userinput">python soundex1c.py</span>
<span class="computeroutput">Woo W000 14.5348347346
Pilgrim P426 19.2784703084
Flingjingwaller F452 30.0893873383</span>
</pre></div>
<p>But is this the wrong path? The logic here is simple: the input <tt class="varname">source</tt> needs to be non-empty, and it needs to be composed entirely of letters. Wouldn't it be faster to write a loop checking each
character, and do away with regular expressions altogether?
</p>
<p>Here is <tt class="filename">soundex/stage1/soundex1d.py</tt>:
</p>
<div class="informalexample"><pre class="programlisting">
<span class='pykeyword'>if</span> <span class='pykeyword'>not</span> source:
<span class='pykeyword'>return</span> <span class='pystring'>"0000"</span>
<span class='pykeyword'>for</span> c <span class='pykeyword'>in</span> source:
<span class='pykeyword'>if</span> <span class='pykeyword'>not</span> (<span class='pystring'>'A'</span> <= c <= <span class='pystring'>'Z'</span>) <span class='pykeyword'>and</span> <span class='pykeyword'>not</span> (<span class='pystring'>'a'</span> <= c <= <span class='pystring'>'z'</span>):
<span class='pykeyword'>return</span> <span class='pystring'>"0000"</span>
</pre></div>
<p>It turns out that this technique in <tt class="filename">soundex1d.py</tt> is <span class="emphasis"><em>not</em></span> faster than using a compiled regular expression (although it is faster than using a non-compiled regular expression):
</p>
<div class="informalexample"><pre class="screen">
<tt class="prompt">C:\samples\soundex\stage1></tt><span class="userinput">python soundex1d.py</span>
<span class="computeroutput">Woo W000 15.4065058548
Pilgrim P426 22.2753567842
Flingjingwaller F452 37.5845122774</span>
</pre></div>
<p>Why isn't <tt class="filename">soundex1d.py</tt> faster? The answer lies in the interpreted nature of <span class="application">Python</span>. The regular expression engine is written in C, and compiled to run natively on your computer. On the other hand, this
loop is written in <span class="application">Python</span>, and runs through the <span class="application">Python</span> interpreter. Even though the loop is relatively simple, it's not simple enough to make up for the overhead of being interpreted.
Regular expressions are never the right answer... except when they are.
</p>
<p>It turns out that <span class="application">Python</span> offers an obscure string method. You can be excused for not knowing about it, since it's never been mentioned in this book.
The method is called <tt class="methodname">isalpha()</tt>, and it checks whether a string contains only letters.
</p>
<p>This is <tt class="filename">soundex/stage1/soundex1e.py</tt>:
</p>
<div class="informalexample"><pre class="programlisting">
<span class='pykeyword'>if</span> (<span class='pykeyword'>not</span> source) <span class='pykeyword'>and</span> (<span class='pykeyword'>not</span> source.isalpha()):
<span class='pykeyword'>return</span> <span class='pystring'>"0000"</span>
</pre></div>
<p>How much did we gain by using this specific method in <tt class="filename">soundex1e.py</tt>? Quite a bit.
</p>
<div class="informalexample"><pre class="screen">
<tt class="prompt">C:\samples\soundex\stage1></tt><span class="userinput">python soundex1e.py</span>
<span class="computeroutput">Woo W000 13.5069504644
Pilgrim P426 18.2199394057
Flingjingwaller F452 28.9975225902</span>
</pre></div>
<div class="example"><a name="d0e39318"></a><h3 class="title">Example 18.3. Best Result So Far: <tt class="filename">soundex/stage1/soundex1e.py</tt></h3><pre class="programlisting"><span class='pykeyword'>
import</span> string, re
charToSoundex = {<span class='pystring'>"A"</span>: <span class='pystring'>"9"</span>,
<span class='pystring'>"B"</span>: <span class='pystring'>"1"</span>,
<span class='pystring'>"C"</span>: <span class='pystring'>"2"</span>,
<span class='pystring'>"D"</span>: <span class='pystring'>"3"</span>,
<span class='pystring'>"E"</span>: <span class='pystring'>"9"</span>,
<span class='pystring'>"F"</span>: <span class='pystring'>"1"</span>,
<span class='pystring'>"G"</span>: <span class='pystring'>"2"</span>,
<span class='pystring'>"H"</span>: <span class='pystring'>"9"</span>,
<span class='pystring'>"I"</span>: <span class='pystring'>"9"</span>,
<span class='pystring'>"J"</span>: <span class='pystring'>"2"</span>,
<span class='pystring'>"K"</span>: <span class='pystring'>"2"</span>,
<span class='pystring'>"L"</span>: <span class='pystring'>"4"</span>,
<span class='pystring'>"M"</span>: <span class='pystring'>"5"</span>,
<span class='pystring'>"N"</span>: <span class='pystring'>"5"</span>,
<span class='pystring'>"O"</span>: <span class='pystring'>"9"</span>,
<span class='pystring'>"P"</span>: <span class='pystring'>"1"</span>,
<span class='pystring'>"Q"</span>: <span class='pystring'>"2"</span>,
<span class='pystring'>"R"</span>: <span class='pystring'>"6"</span>,
<span class='pystring'>"S"</span>: <span class='pystring'>"2"</span>,
<span class='pystring'>"T"</span>: <span class='pystring'>"3"</span>,
<span class='pystring'>"U"</span>: <span class='pystring'>"9"</span>,
<span class='pystring'>"V"</span>: <span class='pystring'>"1"</span>,
<span class='pystring'>"W"</span>: <span class='pystring'>"9"</span>,
<span class='pystring'>"X"</span>: <span class='pystring'>"2"</span>,
<span class='pystring'>"Y"</span>: <span class='pystring'>"9"</span>,
<span class='pystring'>"Z"</span>: <span class='pystring'>"2"</span>}
<span class='pykeyword'>def</span><span class='pyclass'> soundex</span>(source):
<span class='pykeyword'>if</span> (<span class='pykeyword'>not</span> source) <span class='pykeyword'>and</span> (<span class='pykeyword'>not</span> source.isalpha()):
<span class='pykeyword'>return</span> <span class='pystring'>"0000"</span>
source = source[0].upper() + source[1:]
digits = source[0]
<span class='pykeyword'>for</span> s <span class='pykeyword'>in</span> source[1:]:
s = s.upper()
digits += charToSoundex[s]
digits2 = digits[0]
<span class='pykeyword'>for</span> d <span class='pykeyword'>in</span> digits[1:]:
<span class='pykeyword'>if</span> digits2[-1] != d:
digits2 += d
digits3 = re.sub(<span class='pystring'>'9'</span>, <span class='pystring'>''</span>, digits2)
<span class='pykeyword'>while</span> len(digits3) < 4:
digits3 += <span class='pystring'>"0"</span>
<span class='pykeyword'>return</span> digits3[:4]
<span class='pykeyword'>if</span> __name__ == <span class='pystring'>'__main__'</span>:
<span class='pykeyword'>from</span> timeit <span class='pykeyword'>import</span> Timer
names = (<span class='pystring'>'Woo'</span>, <span class='pystring'>'Pilgrim'</span>, <span class='pystring'>'Flingjingwaller'</span>)
<span class='pykeyword'>for</span> name <span class='pykeyword'>in</span> names:
statement = <span class='pystring'>"soundex('%s')"</span> % name
t = Timer(statement, <span class='pystring'>"from __main__ import soundex"</span>)
<span class='pykeyword'>print</span> name.ljust(15), soundex(name), min(t.repeat())
</pre></div>
</div>
<table class="Footer" width="100%" border="0" cellpadding="0" cellspacing="0" summary="">
<tr>
<td width="35%" align="left"><br><a class="NavigationArrow" href="timeit.html"><< Using the timeit Module</a></td>
<td width="30%" align="center"><br> <span class="divider">|</span> <a href="index.html#soundex.divein" title="18.1. Diving in">1</a> <span class="divider">|</span> <a href="timeit.html" title="18.2. Using the timeit Module">2</a> <span class="divider">|</span> <span class="thispage">3</span> <span class="divider">|</span> <a href="dictionary_lookups.html" title="18.4. Optimizing Dictionary Lookups">4</a> <span class="divider">|</span> <a href="list_operations.html" title="18.5. Optimizing List Operations">5</a> <span class="divider">|</span> <a href="string_manipulation.html" title="18.6. Optimizing String Manipulation">6</a> <span class="divider">|</span> <a href="summary.html" title="18.7. Summary">7</a> <span class="divider">|</span>
</td>
<td width="35%" align="right"><br><a class="NavigationArrow" href="dictionary_lookups.html">Optimizing Dictionary Lookups >></a></td>
</tr>
<tr>
<td colspan="3"><br></td>
</tr>
</table>
<div class="Footer">
<p class="copyright">Copyright © 2000, 2001, 2002, 2003, 2004 <a href="mailto:mark@diveintopython.org">Mark Pilgrim</a></p>
</div>
</body>
</html>
|