# In Depth Guide: HTML Scraping

!!! warning "Under Construction"
    This section is being updated. Some information may be outdated or inaccurate.

The preferred method of scraping recipe information from a web page is to use the schema.org
Recipe data. This is a machine-readable, structured format intended to provide a standardised
method of extracting information. However, whilst most recipe websites use the schema.org Recipe
format, not all do, and those that do don't always include all the information we are looking
for. In these cases, we can use HTML scraping to extract the information from the HTML markup.

## `soup`

Each scraper has a `BeautifulSoup` object that can be accessed using the `self.soup` attribute.
The `BeautifulSoup` object is a representation of the web page HTML that has been parsed into a
format that we can query and extract information from.

The [Beautiful Soup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is the best resource for learning how to use `BeautifulSoup`
objects to interact with HTML documents.

This guide covers a number of common patterns that are used in this library.
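
To experiment with these patterns outside of a scraper, a `BeautifulSoup` object can be built directly from an HTML string. The snippet below is a minimal sketch: the HTML and variable names are made up for illustration, and inside a scraper class you would use `self.soup` rather than creating your own object.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded recipe page.
html = """
<html>
  <body>
    <h1 class="title">Jackfruit Curry</h1>
    <span id="total-time">45 minutes</span>
  </body>
</html>
"""

# "html.parser" is Python's built-in parser, so no extra dependency is needed.
soup = BeautifulSoup(html, "html.parser")

title = soup.find("h1").get_text()
total_time = soup.find(id="total-time").get_text()
print(title)
print(total_time)
```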

## `_schema_cls` and `_opengraph_cls`

It should rarely be necessary to override the default behaviour of schema.org and OpenGraph
metadata retrieval; recipe websites should generally adhere to their respective standard formats
when including metadata on their webpages. However, bugs and mistakes do happen. If you need to
override the implementations provided by the `SchemaOrg` and `OpenGraph` classes, you can subclass
them and add a `_schema_cls` or `_opengraph_cls` attribute to your scraper class to instruct
the library to use your subclasses instead.
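
The general shape of this override can be sketched with plain Python classes. Everything below is a local stand-in written for illustration: the `SchemaOrg` and `AbstractScraper` classes here are simplified mock-ups, not the library's real classes, and `TrimmedSchemaOrg` and `ExampleSiteScraper` are hypothetical names. How the library wires this up internally may differ.

```python
# Local stand-ins for illustration only; these are NOT the library's real classes.
class SchemaOrg:
    def __init__(self, data):
        self.data = data

    def title(self):
        return self.data.get("name")


class AbstractScraper:
    # The base class names which metadata class to instantiate.
    _schema_cls = SchemaOrg

    def __init__(self, data):
        self.schema = self._schema_cls(data)


# Hypothetical fix for a site that publishes titles with stray whitespace.
class TrimmedSchemaOrg(SchemaOrg):
    def title(self):
        return super().title().strip()


class ExampleSiteScraper(AbstractScraper):
    # Point this scraper at the subclass instead of the default.
    _schema_cls = TrimmedSchemaOrg


scraper = ExampleSiteScraper({"name": "  Jackfruit Curry  "})
print(scraper.schema.title())
```

Because the override is a class attribute, only the scraper for the misbehaving site changes; every other scraper keeps the default behaviour.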

## Finding a single element

The `self.soup.find()` function returns the first element matching the arguments. This is useful if
you are trying to extract some information that should only occur once, for example the prep time
or total time.

```python
# To find a particular element
self.soup.find("h1") # Returns the first h1 element

# To find an element with particular class (note the underscore at the end of class_)
self.soup.find(class_="total-time") # Returns the first element with total-time class.

# To find an element with a particular ID
self.soup.find(id="total-time")

# You can include multiple arguments to be more specific
# To find the first h1 element with "title" class
self.soup.find("h1", class_="title")
```

`self.soup.find()` returns a `bs4.element.Tag` object, or `None` if nothing matches. Usually we
just want the text from the selected element, and the best way to get it is `.get_text()`.

```python
title_tag = self.soup.find("h1") # bs4.element.Tag object
title_text = title_tag.get_text() # str
```

`.get_text()` returns the text of the element and all of its descendants, joined in document
order, so there is no need to iterate through all the children, call `.get_text()` on each one,
then join the results afterwards.

As an example, consider one of the ingredients in [this recipe](https://rainbowplantlife.com/instant-pot-jackfruit-curry/#wprm-recipe-container-5618). The markup looks like this:

```html
<li class="wprm-recipe-ingredient" style="list-style-type: none;" data-uid="0">
  <span class="wprm-checkbox-container">
    <input type="checkbox" id="wprm-checkbox-1" class="wprm-checkbox" aria-label="&nbsp;1 tablespoon coconut oil (or oil of choice)">
    <label for="wprm-checkbox-1" class="wprm-checkbox-label">
      <span class="sr-only screen-reader-text wprm-screen-reader-text">▢ </span>
    </label>
  </span>
  <span class="wprm-recipe-ingredient-amount">1</span>
  <span class="wprm-recipe-ingredient-unit">tablespoon</span>
  <span class="wprm-recipe-ingredient-name">coconut oil</span>
  <span class="wprm-recipe-ingredient-notes wprm-recipe-ingredient-notes-normal">(or oil of choice)</span>
</li>
```

We can select this element using its tag and class (we're pretending this recipe only has this one
ingredient), and extract the text like so:

```python
ingredient_tag = self.soup.find("li", class_="wprm-recipe-ingredient")
ingredient_text = ingredient_tag.get_text()
# '1 tablespoon coconut oil (or oil of choice)'
```

The Beautiful Soup documentation for `find` is [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find).
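
Because `find()` returns `None` when no element matches, calling `.get_text()` on the result directly raises an `AttributeError` on pages that lack the element. The following is a minimal sketch of guarding against that; the HTML and the `text_or_none` helper are made up for illustration and are not part of the library.

```python
from bs4 import BeautifulSoup

# Made-up HTML that has a title but no total-time element.
html = '<div><h1 class="title">Jackfruit Curry</h1></div>'
soup = BeautifulSoup(html, "html.parser")


def text_or_none(tag):
    """Return the tag's text, or None if find() did not match anything."""
    return tag.get_text() if tag is not None else None


title = text_or_none(soup.find("h1", class_="title"))
missing = text_or_none(soup.find(class_="total-time"))
print(title)
print(missing)
```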

### Normalizing strings

A convenience function called `normalize_string()` is provided in the `_utils` module. This
function will convert any characters escaped for HTML to their actual character (e.g. `&amp;`
to `&`) and remove unnecessary white space. It is best practice to always use this when extracting
text from the HTML.

```python
from ._utils import normalize_string

# ...
ingredient_tag = self.soup.find("li", class_="wprm-recipe-ingredient")
ingredient_text = normalize_string(ingredient_tag.get_text())
```
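
To make the idea concrete, here is a rough sketch of this kind of normalization using only the standard library. It is an illustrative stand-in, not the library's `normalize_string()` implementation, which may handle more cases; the `normalize_sketch` name and the sample text are made up.

```python
import html
import re


def normalize_sketch(text: str) -> str:
    """Illustrative stand-in, not the library's normalize_string()."""
    # Turn HTML entities such as &amp; into the characters they represent.
    unescaped = html.unescape(text)
    # Collapse each run of whitespace (newlines, tabs, non-breaking spaces)
    # into a single space, then trim both ends.
    return re.sub(r"\s+", " ", unescaped).strip()


raw = "\n  1 tablespoon&nbsp;coconut oil\n  &amp; a pinch of salt  \n"
cleaned = normalize_sketch(raw)
print(cleaned)
```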

### Getting yields

A convenience function called `get_yields()` is provided in the `_utils` module. This function
accepts a `str` or `bs4.element.Tag` and will return the yield, handling many of the common
formats yields can appear in and normalizing them to a standard format.

```python
from ._utils import get_yields

# ...
yield_tag = self.soup.find(class_="wprm-recipe-servings")
yield_text = get_yields(yield_tag)
# or
yield_text = get_yields(yield_tag.get_text())
# both return '4 servings'
```
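
As a rough picture of what "normalizing to a standard format" means, the sketch below pulls the first number out of a yield string and rewrites it in one consistent form. It is a simplified stand-in with a made-up name (`yields_sketch`), not the library's `get_yields()`, which recognises many more formats.

```python
import re


def yields_sketch(text: str) -> str:
    """Illustrative stand-in, not the library's get_yields()."""
    match = re.search(r"\d+", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    count = int(match.group())
    # Use the singular form for exactly one serving.
    return f"{count} serving" if count == 1 else f"{count} servings"


print(yields_sketch("Serves 4"))
print(yields_sketch("Yield: 1"))
```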

### Getting times

A convenience function called `get_minutes()` is provided in the `_utils` module. This function
accepts a `str` or `bs4.element.Tag` and will return the number of minutes as an `int`. This
function handles a number of common formats that times can be expressed in.

```python
from ._utils import get_minutes

# ...
prep_time_tag = self.soup.find(class_="wprm-recipe-prep_time-minutes")
prep_time_value = get_minutes(prep_time_tag)
# or
prep_time_value = get_minutes(prep_time_tag.get_text())
# both return 25
```
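
The sketch below shows one way a time string could be reduced to a number of minutes. It is a simplified stand-in with a made-up name (`minutes_sketch`) that handles only plain "hours" and "minutes" wording; it is not the library's `get_minutes()`, which supports more formats.

```python
import re


def minutes_sketch(text: str) -> int:
    """Illustrative stand-in, not the library's get_minutes()."""
    hours = re.search(r"(\d+)\s*(?:hours?|hrs?|h)\b", text, re.IGNORECASE)
    minutes = re.search(r"(\d+)\s*(?:minutes?|mins?|m)\b", text, re.IGNORECASE)
    if hours is None and minutes is None:
        raise ValueError(f"no time found in {text!r}")
    total = 0
    if hours is not None:
        total += 60 * int(hours.group(1))
    if minutes is not None:
        total += int(minutes.group(1))
    return total


print(minutes_sketch("25 minutes"))
print(minutes_sketch("1 hour 25 minutes"))
```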

## Finding multiple elements

Some information in a recipe, like the ingredients or instructions, comes in the form of lists
where we need to find multiple elements with the same attributes. We can use
`self.soup.find_all()` for this. `find_all()` takes the same arguments as `find()`, but returns a
list of `bs4.element.Tag` objects containing all the matching elements.

Using the same site as above, we can find all the ingredients like so:

```python
ingredient_tags = self.soup.find_all("li", class_="wprm-recipe-ingredient")
ingredients_text = [normalize_string(tag.get_text()) for tag in ingredient_tags]
"""
[
 '2 (20-ounce // 565g) cans of jackfruit (in water or brine)*',
 '1 tablespoon coconut oil (or oil of choice)',
 '1 1/2 teaspoons cumin seeds',
 '1 1/2 teaspoons black mustard seeds (can substitute brown mustard seeds)',
 '1 large yellow onion, diced',
 ...
]
"""
```

The Beautiful Soup documentation for `find_all` is [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all).
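
When a page contains several lists, a common approach is to find the containing element first and then call `find_all()` on that element rather than on the whole document. The following is a minimal sketch using made-up markup and class names:

```python
from bs4 import BeautifulSoup

# Made-up markup with two lists that both use plain <li> elements.
html = """
<ul class="ingredients">
  <li>1 tablespoon coconut oil</li>
  <li>1 large yellow onion, diced</li>
</ul>
<ol class="instructions">
  <li>Heat the oil.</li>
  <li>Add the onion.</li>
</ol>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all("li") on the whole document would mix both lists together,
# so narrow the search to the ingredients container first.
container = soup.find("ul", class_="ingredients")
ingredients = [li.get_text().strip() for li in container.find_all("li")]
print(ingredients)

# find_all() returns an empty list (not None) when nothing matches.
no_matches = soup.find_all("table")
print(len(no_matches))
```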

## Using CSS selectors

If you are already familiar with CSS selectors, then you can use `select()` to achieve the same
result as `find_all()`, or `select_one()` to achieve the same result as `find()`.

```python
# Match all li elements with wprm-recipe-ingredient class (returns a list)
ingredient_tags = self.soup.select("li.wprm-recipe-ingredient")
```

The Beautiful Soup documentation for `select` is [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors-through-the-css-property). MDN has a guide on
CSS selectors [here](https://developer.mozilla.org/en-US/docs/Learn/CSS/Building_blocks/Selectors).
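
The correspondence between the two styles can be checked side by side. This is a small sketch on made-up markup; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = """
<div class="recipe">
  <ul>
    <li class="wprm-recipe-ingredient">1 tablespoon coconut oil</li>
    <li class="wprm-recipe-ingredient">1 large yellow onion, diced</li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# select() returns a list of every match, like find_all().
via_select = soup.select("li.wprm-recipe-ingredient")
via_find_all = soup.find_all("li", class_="wprm-recipe-ingredient")

# select_one() returns the first match (or None), like find().
first_via_select = soup.select_one("li.wprm-recipe-ingredient")
first_via_find = soup.find("li", class_="wprm-recipe-ingredient")

# A descendant combinator expresses nesting in a single selector.
nested = soup.select("div.recipe li")

print([tag.get_text() for tag in via_select])
print(first_via_select.get_text())
```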

## Finding elements using a partial attribute

Sometimes you might want to find elements using a part of an attribute. This is particularly
helpful for websites that automatically generate CSS in a way that appends a random string to
the end of class names.

An example of this is [cooking.nytimes.com](https://cooking.nytimes.com/recipes/1024605-cumin-and-cashew-yogurt-rice). If we wanted to select the yield element from
this page, we could use the class `ingredients_recipeYield__DN65p`. However, when the website is
updated in the future, the `DN65p` at the end of the class name is likely to change, so we only
want to match on the stable part of the class name.

There are two ways we can do this:

### Using `find`

Instead of passing a string as the argument to `find()`, we can pass a compiled regular
expression.

```python
import re

# ...
yield_tag = self.soup.find(class_=re.compile("ingredients_recipeYield"))
yield_text = yield_tag.get_text()
# Yield:4 servings
```

### Using `select`

CSS also supports partial attribute matching. MDN has a useful guide [here](https://developer.mozilla.org/en-US/docs/Web/CSS/Attribute_selectors).

```python
# Select elements where class contains 'ingredients_recipeYield' (returns a list)
yield_tags = self.soup.select("[class*='ingredients_recipeYield']")

# Select elements where the class attribute starts with 'ingredients_recipeYield'
yield_tags = self.soup.select("[class^='ingredients_recipeYield']")
```
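
The approaches differ slightly when an element has more than one class. The sketch below uses made-up markup (the `card` and `extra` classes and the second hash suffix are invented) to show that `^=` compares against the start of the whole `class` attribute, so it misses elements where another class comes first.

```python
import re

from bs4 import BeautifulSoup

html = """
<div class="card ingredients_recipeYield__DN65p">Yield: 4 servings</div>
<div class="ingredients_recipeYield__Zx91q extra">Yield: 6 servings</div>
"""
soup = BeautifulSoup(html, "html.parser")

# The regular expression is searched for within the element's classes.
by_regex = soup.find_all(class_=re.compile("ingredients_recipeYield"))

# *= looks for the substring anywhere in the full class attribute.
by_contains = soup.select("[class*='ingredients_recipeYield']")

# ^= compares against the start of the full attribute string, so the first
# div (whose attribute starts with "card ") is not matched.
by_prefix = soup.select("[class^='ingredients_recipeYield']")

print(len(by_regex), len(by_contains), len(by_prefix))
```

For this reason the "contains" form (`*=`) or the regular expression is usually the safer choice when the position of the class within the attribute is not guaranteed.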