<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<title>Drinking TagSoup by Example</title>
<style type="text/css">
pre {
    border: 2px solid gray;
    padding: 1px;
    padding-left: 5px;
    margin-left: 10px;
    background-color: #eee;
}
pre.define {
    background-color: #ffb;
    border-color: #cc0;
}
body {
    font-family: sans-serif;
}
h1, h2, h3 {
    font-family: serif;
}
h1 {
    color: rgb(23,54,93);
    border-bottom: 1px solid rgb(79,129,189);
    padding-bottom: 2px;
    font-variant: small-caps;
    text-align: center;
}
a {
    color: rgb(54,95,145);
}
h2 {
    color: rgb(54,95,145);
}
h3 {
    color: rgb(79,129,189);
}
p.rule {
    background-color: #ffb;
    padding: 3px;
    margin-left: 50px;
    margin-right: 50px;
}
</style>
</head>
<body>
<h1>Drinking TagSoup by Example</h1>
<p style="text-align:right;margin-bottom:25px;">
by <a href="http://community.haskell.org/~ndm/">Neil Mitchell</a>
</p>
<p>
TagSoup is a library for parsing HTML/XML. It supports the HTML 5 specification, and can be used to parse either well-formed XML, or unstructured and malformed HTML from the web. The library also provides useful functions to extract information from an HTML document, making it ideal for screen-scraping.
</p>
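<p>
As a quick taste of the library, parsing a fragment in GHCi yields a flat list of tags. This is a minimal sketch - the fragment is made up, but <tt>parseTags</tt> comes straight from <tt>Text.HTML.TagSoup</tt>:
</p>
<pre>
ghci> import Text.HTML.TagSoup
ghci> parseTags "<p class=intro>Hello <b>soup</b>"
[TagOpen "p" [("class","intro")],TagText "Hello ",TagOpen "b" [],TagText "soup",TagClose "b"]
</pre>
<p>
Note that there is no closing <tt>p</tt> tag in the output, because there was none in the input - the parser does not invent structure for you.
</p>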
<p>
This document gives two particular examples of scraping information from the web, while a few more may be found in the <a href="http://community.haskell.org/~ndm/darcs/tagsoup/TagSoup/Sample.hs">Sample</a> file from the darcs repository. The examples we give are:
</p>
<ol>
<li>Obtaining the Hit Count from Haskell.org</li>
<li>Obtaining a list of Simon Peyton-Jones' latest papers</li>
<li>A brief overview of some other examples</li>
</ol>
<p>
The initial version of this library was written in Javascript and has been used for various commercial projects involving screen scraping. The examples include general hints on screen scraping, learnt from bitter experience. It should be noted that if you depend on data which someone else may change at any time, you may be in for a shock!
</p>
<p>
This library was written without knowledge of the Java version of <a href="http://home.ccil.org/~cowan/XML/tagsoup/">TagSoup</a>. That library makes a very different design decision: it ensures default attributes are present and properly nests parsed tags. We do not do this - tags are merely a flat list, devoid of nesting information.
</p>
<h3>Acknowledgements</h3>
<p>
Thanks to Mike Dodds for persuading me to write this up as a library. Thanks to many people for debugging and code contributions, including: Gleb Alexeev, Ketil Malde, Conrad Parker, Henning Thielemann, Dino Morelli, Emily Mitchell, Gwern Branwen.
</p>
<h3>Version History</h3>
<p>A changelog is kept at <a href="https://github.com/ndmitchell/tagsoup/blob/master/CHANGES.txt">https://github.com/ndmitchell/tagsoup/blob/master/CHANGES.txt</a>.</p>
<h2>Potential Bugs</h2>
<p>
There are two things that may go wrong with these examples:
</p>
<ul>
<li><i>The Websites being scraped may change.</i> There is nothing I can do about this, but if you suspect this is the case let me know, and I'll update the examples and tutorials. I have already done so several times; it's only a few minutes' work.</li>
<li><i>The <tt>openURL</tt> method may not work.</i> This happens quite regularly; depending on your server, proxies and the direction of the wind, it may not work. The solution is to use <tt>wget</tt> to download the page locally, then use <tt>readFile</tt> instead (see the sketch after this list). Hopefully a decent Haskell HTTP library will emerge, and that can be used instead.</li>
</ul>
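<p>
For example (a sketch - the file name is arbitrary), download the page once with <tt>wget</tt>:
</p>
<pre>
$ wget http://www.haskell.org/haskellwiki/Haskell -O haskell.htm
</pre>
<p>
and then, in the examples that follow, replace the <tt>openURL</tt> call with <tt>readFile "haskell.htm"</tt>.
</p>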
<h2>Haskell Hit Count</h2>
<p>
Our goal is to develop a program that displays the Haskell.org hit count. This example covers all the basics of designing a web-scraping application.
</p>
<h3>Finding the Page</h3>
<p>
We first need to find where the information is displayed, and in what format. Taking a look at the <a href="http://www.haskell.org/haskellwiki/Haskell">front web page</a>, when not logged in, we see:
</p>
<pre>
<ul id="f-list">
  <li id="lastmod"> This page was last modified on 9 September 2013, at 22:38.</li>
  <li id="viewcount">This page has been accessed 6,985,922 times.</li>
  <li id="copyright">Recent content is available under <a href="/haskellwiki/HaskellWiki:Copyrights" title="HaskellWiki:Copyrights">a simple permissive license</a>.</li>
  <li id="privacy"><a href="/haskellwiki/HaskellWiki:Privacy_policy" title="HaskellWiki:Privacy policy">Privacy policy</a></li>
  <li id="about"><a href="/haskellwiki/HaskellWiki:About" title="HaskellWiki:About">About HaskellWiki</a></li>
  <li id="disclaimer"><a href="/haskellwiki/HaskellWiki:General_disclaimer" title="HaskellWiki:General disclaimer">Disclaimers</a></li>
</ul>
</pre>
<p>
So we see the hit count is available. This leads us to rule 1:
</p>
<p class="rule">
<b>Rule 1:</b><br/>
Scrape from what the page returns, not what a browser renders, or what view-source gives.
</p>
<p>
Some web servers will serve different content depending on the user agent, some browsers will have scripting modify their displayed HTML, some pages will display differently depending on your cookies. Before you can figure out how to scrape the page, first decide what the input to your program will be. There are two ways to get the page as it will appear to your program.
</p>
<h4>Using the HTTP package</h4>
<p>
We can write a simple HTTP downloader using the <a href="http://hackage.haskell.org/package/HTTP">HTTP package</a>:
</p>
<pre>
import Network.HTTP

openURL x = getResponseBody =<< simpleHTTP (getRequest x)

main = do
    src <- openURL "http://www.haskell.org/haskellwiki/Haskell"
    writeFile "temp.htm" src
</pre>
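<p>
One caveat: the HTTP package speaks plain HTTP only, so if the page you want is served over https you will need to fall back on the <tt>wget</tt>/<tt>readFile</tt> approach described above.
</p>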
<p>
Now open <tt>temp.htm</tt>, find the fragment of HTML containing the hit count, and examine it.
</p>
<h4>Using the <tt>tagsoup</tt> Program</h4>
<p>
TagSoup installs as both a library and a program. The program contains all the examples mentioned on this page, along with a few other useful functions. To download a URL to a file:
</p>
<pre>
$ tagsoup grab http://www.haskell.org/haskellwiki/Haskell > temp.htm
</pre>
<h3>Finding the Information</h3>
<p>
Now we examine both the fragment that contains our snippet of information, and the wider page. What does the fragment have that nothing else has? What algorithm would we use to obtain that particular element? How can we still return the element as the content changes? What if the design changes? But wait, before going any further:
</p>
<p class="rule">
<b>Rule 2:</b><br/>
Do not be robust to design changes, do not even consider the possibility when writing the code.
</p>
<p>
If the site owners change their website, they will do so in unpredictable ways. They may move the page, they may put the information somewhere else, they may remove the information entirely. If you want something robust, talk to the site owner, or buy the data from someone. If you try to anticipate design changes, you will complicate your design, and it still won't work. It is better to write an extraction method quickly, and happily rewrite it when things change.
</p>
<p>
So now, let's consider the fragment above. It is useful to find a tag just above your snippet which is unique - something with a nice "id" or "class" attribute that is unlikely to occur multiple times. In the above example, the id "viewcount" seems perfect.
</p>
<pre>
haskellHitCount = do
    src <- openURL "http://haskell.org/haskellwiki/Haskell"
    let count = fromFooter $ parseTags src
    putStrLn $ "haskell.org has been hit " ++ count ++ " times"
  where fromFooter = filter isDigit . innerText . take 2 . dropWhile (~/= "<li id=viewcount>")
</pre>
<p>
Now we start writing the code! The first thing to do is open the required URL, then parse the HTML into a list of <tt>Tag</tt>s with <tt>parseTags</tt>. The <tt>fromFooter</tt> function does the interesting work, and can be read right to left:
</p>
<ul>
<li>First we throw away everything (<tt>dropWhile</tt>) until we get to an <tt>li</tt> tag with <tt>id=viewcount</tt>. The <tt>(~==)</tt> and <tt>(~/=)</tt> operators differ from standard equality and inequality in that they allow the matched tag to carry additional attributes. We write <tt>"<li id=viewcount>"</tt> as syntactic sugar for <tt>TagOpen "li" [("id","viewcount")]</tt>. If we just wanted any open tag with the given id we could have written <tt>(~== TagOpen "" [("id","viewcount")])</tt> and this would have matched: any empty strings in the pattern act as wildcards. (A short GHCi demonstration follows this list.)</li>
<li>Next we take two elements, the <tt><li></tt> tag and the text node immediately following.</li>
<li>We call the <tt>innerText</tt> function to get all the text values from inside, which will just be the text node following the <tt>viewcount</tt>.</li>
<li>We keep only the numbers, getting rid of the surrounding text and the commas.</li>
</ul>
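<p>
To make the matching concrete, here is a short GHCi session - a sketch where the fragment is made up, but the functions are the real TagSoup API:
</p>
<pre>
ghci> import Text.HTML.TagSoup
ghci> import Data.Char
ghci> let tags = parseTags "<li id=viewcount class=x>6,985,922 times.</li>"
ghci> dropWhile (~/= "<li id=viewcount>") tags
[TagOpen "li" [("id","viewcount"),("class","x")],TagText "6,985,922 times.",TagClose "li"]
ghci> filter isDigit $ innerText $ take 2 $ dropWhile (~/= "<li id=viewcount>") tags
"6985922"
</pre>
<p>
Note that the extra <tt>class</tt> attribute does not stop the match, but a pattern mentioning an attribute the tag lacks would.
</p>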
<p>
This code may seem slightly messy, and indeed it is - often that is the nature of extracting information from a tag soup.
</p>
<p class="rule">
<b>Rule 3:</b><br/>
TagSoup is for extracting information where structure has been lost; use more structured information if it is available.
</p>
<h2>Simon's Papers</h2>
<p>
Our next very important task is to extract a list of all Simon Peyton Jones' recent research papers from his <a href="http://research.microsoft.com/en-us/people/simonpj/">home page</a>. The largest change from the previous example is that now we want a list of papers, rather than just a single result.
</p>
<p>
As before, we start by writing a simple program that downloads the appropriate page, and we look for common patterns. This time we want to look for patterns which occur every time a paper is mentioned, but nowhere else. The other difference from last time is that previously we grabbed an automatically generated piece of information - this time the information is entered in a more freeform way by a human.
</p>
<p>
First we spot that the page helpfully has named anchors: there is one for current work, and after that one for Haskell. We can extract all the information between them with a simple <tt>take</tt>/<tt>drop</tt> pair:
</p>
<pre>
takeWhile (~/= "<a name=haskell>") $
    drop 5 $ dropWhile (~/= "<a name=current>") tags
</pre>
<p>
This code drops everything until we reach the "current" anchor, then takes everything until we reach the "haskell" anchor, ensuring we only look at the important bit of the page. Next we want to find all hyperlinks within this section:
</p>
<pre>
map f $ sections (~== "<A>") $ ...
</pre>
<p>
Remember that the function to select all tags with name "A" could have been written as <tt>(~== TagOpen "A" [])</tt>, or alternatively <tt>isTagOpenName "A"</tt>. Afterwards we map the function <tt>f</tt> over each section. This function takes the tags starting at the link, and finds the text inside the link.
</p>
<pre>
f = dequote . unwords . words . fromTagText . head . filter isTagText
</pre>
<p>
Here the complexity of interfacing to human-written markup comes through. Some of the links are in italics, some are not - the <tt>filter</tt> skips over tags until we find a pure text node. The <tt>unwords . words</tt> combination collapses runs of spaces, replaces tabs and newlines with spaces, and trims the front and back - a neat trick when dealing with text which has spacing in the source code but not when displayed. The final thing to take account of is that some paper names are given with quotes around them, some are not - <tt>dequote</tt> removes the quotes if they are present.
</p>
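<p>
A quick GHCi check makes the whitespace and quote handling concrete (<tt>dequote</tt> is defined in the full example below):
</p>
<pre>
ghci> (unwords . words) "  A   History \n\tof  Haskell "
"A History of Haskell"
ghci> dequote "\"A History of Haskell\""
"A History of Haskell"
</pre>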
<p>
For completeness, we now present the entire example:
</p>
<pre>
spjPapers :: IO ()
spjPapers = do
    tags <- fmap parseTags $ openURL "http://research.microsoft.com/en-us/people/simonpj/"
    let links = map f $ sections (~== "<A>") $
                takeWhile (~/= "<a name=haskell>") $
                drop 5 $ dropWhile (~/= "<a name=current>") tags
    putStr $ unlines links
  where
    f :: [Tag String] -> String
    f = dequote . unwords . words . fromTagText . head . filter isTagText

    -- strip matching quotes from around a title, if present
    dequote ('\"':xs) | last xs == '\"' = init xs
    dequote x = x
</pre>
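<p>
To see what <tt>sections</tt> produces, and why <tt>f</tt> filters for text nodes, we can try a made-up fragment in GHCi (lowercase tags here, where the real page uses uppercase <tt><A></tt>):
</p>
<pre>
ghci> let tags = parseTags "<a href=x>One</a> and <a href=y><i>Two</i></a>"
ghci> map (fromTagText . head . filter isTagText) $ sections (~== "<a>") tags
["One","Two"]
</pre>
<p>
The second link wraps its text in an <tt>i</tt> tag, but <tt>filter isTagText</tt> skips past it and still finds the title.
</p>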
<h2>Other Examples</h2>
<p>
Several more examples are given in the Sample file, including obtaining the (short) list of papers from my site, getting the current time, and a basic XML validator. All can be invoked using the <tt>tagsoup</tt> executable program. All use very much the same style as presented here - writing screen scrapers follows a standard pattern. We present the code for two of them, for enjoyment only.
</p>
<h3>My Papers</h3>
<pre>
ndmPapers :: IO ()
ndmPapers = do
    tags <- fmap parseTags $ openURL "http://community.haskell.org/~ndm/downloads/"
    let papers = map f $ sections (~== "<li class=paper>") tags
    putStr $ unlines papers
  where
    f :: [Tag String] -> String
    -- the paper title is the text node two tags after the opening <li>
    f xs = fromTagText (xs !! 2)
</pre>
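<p>
The <tt>xs !! 2</tt> relies on the shape of each list item. Assuming each paper looks roughly like the fragment below (an illustrative guess, not the page's actual markup), the title is the text node at index 2:
</p>
<pre>
ghci> parseTags "<li class=paper><a href=x>A Paper Title</a></li>"
[TagOpen "li" [("class","paper")],TagOpen "a" [("href","x")],TagText "A Paper Title",TagClose "a",TagClose "li"]
</pre>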
<h3>UK Time</h3>
<pre>
currentTime :: IO ()
currentTime = do
    tags <- fmap parseTags $ openURL "http://www.timeanddate.com/worldclock/city.html?n=136"
    let time = fromTagText (dropWhile (~/= "<strong id=ct>") tags !! 1)
    putStrLn time
</pre>
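<p>
Here <tt>!! 1</tt> picks the element immediately after the matching <tt>strong</tt> tag, which is the text node holding the time. On a made-up fragment:
</p>
<pre>
ghci> let tags = parseTags "<strong id=ct>Thursday, 2 January 2014, 13:38:20 GMT</strong>"
ghci> fromTagText (dropWhile (~/= "<strong id=ct>") tags !! 1)
"Thursday, 2 January 2014, 13:38:20 GMT"
</pre>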
</body>
</html>