File: in-depth-guide-scraper-functions.md

# In Depth Guide: Scraper Functions

!!! warning "Under Construction"
    This section is being updated. Some information may be outdated or inaccurate.

Each website scraper has a number of functions that return information about the recipe that has
been scraped. Due to the variability in how recipes are written, not all of them are always
applicable. These functions fall into three categories:

1. Mandatory functions

   These functions can be expected to be available for all Scraper classes and together provide the
   majority of the information for a recipe.

2. Inherited functions

   These functions are always available for all Scraper classes. They are implemented in the
   `AbstractScraper` base class and rarely require overriding in the Scraper class.

3. Optional functions

   These functions provide extra information about a recipe, from the particular websites that
   support them.

All of the examples below come from <https://www.bbcgoodfood.com/recipes/monster-cupcakes>.

```py
>>> from recipe_scrapers import scrape_html
>>> scraper = scrape_html(html=None, org_url="https://www.bbcgoodfood.com/recipes/monster-cupcakes", online=True)
```

## Mandatory functions

### `author() -> str`

Returns the author of the recipe. This is typically a person's name but can sometimes be an
organization or the name of the website the recipe came from. If the recipe does not explicitly
define an author, this should return the name of the website.

```py
>>> scraper.author()
'Good Food team'
```

### `host() -> str`

Returns the host of the website the Scraper class is for. This is a constant `str` set in each
Scraper class.

```py
>>> scraper.host()
'bbcgoodfood.com'
```

### `description() -> str`

Returns a description of the recipe. This is normally a sentence or short paragraph describing the
recipe. Often the website defines the description, but sometimes it has to be inferred from the
page content.

```py
>>> scraper.description()
'Let your little monsters do their worst, decorating these spooky Halloween treats'
```

### `image() -> str`

Returns the URL to the main image associated with the recipe, usually a photograph of the completed
recipe.

```py
>>> scraper.image()
'https://images.immediate.co.uk/production/volatile/sites/30/2020/08/recipe-image-legacy-id-405483_12-cee017a.jpg?resize=768,574'
```

### `ingredients() -> List[str]`

Returns the ingredients needed to make the recipe as a `list` of `str`. Each element of the list is
usually a single sentence stating an ingredient, the required amount and any additional comments.
The elements of the list should mirror the ingredients written on the website and should not include
non-ingredient sentences such as sub-headings.

```py
>>> scraper.ingredients()
[
    '250g self-raising flour',
    '25g cocoa powder',
    '175g light muscovado sugar',
    '85g unsalted butter , melted',
    '5 tbsp vegetable or sunflower oil',
    '150g pot fat-free natural yogurt',
    '1 tsp vanilla extract',
    '3 large eggs',
    '85g unsalted butter , softened',
    '1 tbsp milk',
    '½ tsp vanilla extract',
    '200g icing sugar , sifted',
    'food colourings (optional)',
    'sweets and sprinkles, to decorate'
]
```

### `instructions() -> str`

Returns a single `str` containing all instruction steps. Where a recipe has multiple instructions,
each step is separated in the returned `str` by a newline character (`\n`).

```py
>>> scraper.instructions()
'Heat oven to 190C/170C fan/gas 5 and line a 12-hole muffin tin with deep cake cases. Put all the cake ingredients into a large bowl and beat together with electric hand beaters until smooth. Spoon the mix into the cases, then bake for 20 mins until risen and a skewer inserted into the middle comes out dry. Cool completely on a rack. Can be made up to 3 days ahead and kept in an airtight container, or frozen for up to 1 month.\nFor the frosting, work the butter, milk and vanilla into the icing sugar until creamy and pale. Colour with food colouring, if using, then create your own gruesome monster faces using sweets and sprinkles.'
```

> [!IMPORTANT]
>
> Occasionally, a recipe will have steps that have new lines within them. At the moment, this
> library cannot distinguish between a newline within a step and a newline between steps.

### `title() -> str`

Returns the title of the recipe, usually a short sentence or phrase.

```py
>>> scraper.title()
'Monster cupcakes'
```

### `total_time() -> int`

Returns the total time required to complete the recipe, in minutes.

```py
>>> scraper.total_time()
50
```
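
Because the time functions return a plain `int` number of minutes, formatting the value for display is left to the caller. The following is a minimal sketch of one way to do that; `format_minutes` is a hypothetical helper and not part of the library.

```python
def format_minutes(total_minutes: int) -> str:
    """Format a duration given in minutes, e.g. 125 -> '2 h 5 min'."""
    hours, minutes = divmod(total_minutes, 60)
    if hours and minutes:
        return f"{hours} h {minutes} min"
    if hours:
        return f"{hours} h"
    return f"{minutes} min"


# 50 is the value returned by total_time() in the example above.
print(format_minutes(50))   # 50 min
print(format_minutes(125))  # 2 h 5 min
```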

### `yields() -> str`

Returns the number of items or servings the recipe will make. This `str` includes the quantity and
unit of the yield, for example: 4 servings, 6 items, 12 cookies.

```py
>>> scraper.yields()
'12 items'
```
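
Since the yield is returned as a single `str`, a caller who needs the quantity as a number has to split it out. A minimal sketch, assuming the string starts with a whole number as in the examples above; `split_yield` is a hypothetical helper and not part of the library.

```python
import re
from typing import Optional, Tuple


def split_yield(yields: str) -> Tuple[Optional[int], str]:
    """Split a yield string such as '12 items' into (12, 'items').

    Returns (None, original text) when no leading whole number is found.
    """
    match = re.match(r"\s*(\d+)\s*(.*)", yields)
    if match is None:
        return None, yields.strip()
    return int(match.group(1)), match.group(2).strip()


print(split_yield("12 items"))    # (12, 'items')
print(split_yield("4 servings"))  # (4, 'servings')
```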

## Inherited functions

### `canonical_url() -> str`

Returns the canonical URL for the scraped recipe. The canonical URL is the direct URL (defined by
the website) at which the recipe can be found. This URL will generally not contain any query
parameters or fragments, except those required to load the recipe.

```py
>>> scraper.canonical_url()
'https://www.bbcgoodfood.com/recipes/monster-cupcakes'
```
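
To illustrate what a URL without query parameters or fragments looks like, the sketch below strips both using the standard library. It is only an illustration: `canonical_url()` takes its value from the website, and some sites need certain query parameters to load the recipe. The `utm_source` parameter and `#comments` fragment are made up for the example.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query_and_fragment(url: str) -> str:
    """Return the URL with its query string and fragment removed."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


# Hypothetical URL with tracking parameter and fragment appended.
noisy = "https://www.bbcgoodfood.com/recipes/monster-cupcakes?utm_source=example#comments"
print(strip_query_and_fragment(noisy))
# https://www.bbcgoodfood.com/recipes/monster-cupcakes
```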

### `ingredient_groups() -> List[IngredientGroup]`

Returns a `list` of groups of ingredients. Some recipes on some websites will present the
ingredients in groups, where each group contains the ingredients required for a particular aspect
of the recipe.

Each element of the returned `list` is an `IngredientGroup` object. An `IngredientGroup` object has
a `purpose` (for example, *for the sauce*) and a `list` of ingredients.

> [!IMPORTANT]
>
> All scrapers inherit this function. By default, it returns a single group with purpose of `None`
> and the ingredients set to the output of `ingredients()`. This function should be overridden in
> scrapers for websites that use ingredient groups. See [this guide](in-depth-guide-ingredient-groups.md) for help on implementing
> this.

```py
>>> scraper.ingredient_groups()
[
    IngredientGroup(
        ingredients=[
            '250g self-raising flour',
            '25g cocoa powder',
            '175g light muscovado sugar',
            '85g unsalted butter , melted',
            '5 tbsp vegetable or sunflower oil',
            '150g pot fat-free natural yogurt',
            '1 tsp vanilla extract',
            '3 large eggs'
        ],
        purpose=None),
    IngredientGroup(
        ingredients=[
            '85g unsalted butter , softened',
            '1 tbsp milk',
            '½ tsp vanilla extract',
            '200g icing sugar , sifted',
            'food colourings (optional)',
            'sweets and sprinkles, to decorate'
        ],
        purpose='For the frosting and decorating')
]
```
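
A minimal sketch of consuming the groups follows. To keep it runnable without the library, it defines a local stand-in `IngredientGroup` dataclass with the two attributes described above (`ingredients` and `purpose`); the stand-in, the `flatten` helper and the abbreviated ingredient lists are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IngredientGroup:
    """Local stand-in mirroring the attributes described above."""
    ingredients: List[str]
    purpose: Optional[str] = None


# Abbreviated version of the output shown above.
groups = [
    IngredientGroup(ingredients=["250g self-raising flour", "25g cocoa powder"]),
    IngredientGroup(
        ingredients=["85g unsalted butter , softened", "1 tbsp milk"],
        purpose="For the frosting and decorating",
    ),
]


def flatten(groups: List[IngredientGroup]) -> List[str]:
    """Collect the ingredients of every group into one flat list."""
    return [item for group in groups for item in group.ingredients]


for group in groups:
    # A purpose of None means the group has no heading of its own.
    print(group.purpose or "Ingredients")
    for item in group.ingredients:
        print(f"  - {item}")
```

Flattening the groups in order gives a single list of ingredient strings, which is convenient when the grouping is not needed.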

### `instructions_list() -> List[str]`

Returns a `list` of instructions. This `list` is generated by splitting the output of
`instructions()` on newline characters.

```py
>>> scraper.instructions_list()
[
    'Heat oven to 190C/170C fan/gas 5 and line a 12-hole muffin tin with deep cake cases. Put all the cake ingredients into a large bowl and beat together with electric hand beaters until smooth. Spoon the mix into the cases, then bake for 20 mins until risen and a skewer inserted into the middle comes out dry. Cool completely on a rack. Can be made up to 3 days ahead and kept in an airtight container, or frozen for up to 1 month.',
    'For the frosting, work the butter, milk and vanilla into the icing sugar until creamy and pale. Colour with food colouring, if using, then create your own gruesome monster faces using sweets and sprinkles.'
]
```
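
The splitting described above can be sketched in plain Python. This is an approximation under stated assumptions: `split_steps` is a hypothetical helper, the library's own handling of blank lines and whitespace may differ, and the instruction text is shortened. The second case shows why a newline inside a single step cannot be told apart from a newline between steps.

```python
from typing import List


def split_steps(instructions: str) -> List[str]:
    """Split an instructions string on newlines, dropping empty lines."""
    return [line.strip() for line in instructions.split("\n") if line.strip()]


# Two steps separated by a newline (shortened from the example above).
two_steps = "Heat oven to 190C/170C fan/gas 5.\nFor the frosting, work the butter into the icing sugar."
print(len(split_steps(two_steps)))  # 2

# One step that happens to contain a newline is also split in two.
one_step_with_break = "Put all the cake ingredients into a large bowl\nand beat until smooth."
print(len(split_steps(one_step_with_break)))  # 2
```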

### `language() -> str`

Returns the language of the recipe page, as defined within the page's HTML.
This is typically a two-letter BCP 47 language code, such as 'en' for English or 'de' for German,
but may also include the dialect or variation, such as 'en-US' for American English.

For a comprehensive list of BCP 47 language codes, refer to this GitHub Gist:
<https://gist.github.com/typpo/b2b828a35e683b9bf8db91b5404f1bd1>

```py
>>> scraper.language()
'en'
```
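
Since the returned code may or may not include a dialect or variation, code that only cares about the base language can take the part before the first hyphen. A minimal sketch; `primary_language` is a hypothetical helper and not part of the library.

```python
def primary_language(code: str) -> str:
    """Return the primary language subtag, e.g. 'en-US' -> 'en'."""
    return code.split("-")[0].lower()


print(primary_language("en"))     # en
print(primary_language("en-US"))  # en
print(primary_language("de"))     # de
```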

### `links() -> List[Dict[str, str]]`

Returns a `list` of all links defined by `<a>` elements in the page HTML. For each link, the
attributes of the HTML element are returned as a `dict`.

```py
>>> scraper.links()
[
    {
        'class': ['popup-toggle-button'],
        'aria-label': 'Toggle main-navigation popup',
        'aria-haspopup': 'true',
        'href': '#main-navigation-popup'
    },
    ... # etc.
]
```
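
Because each link is a plain `dict` of attributes, the result can be filtered with ordinary Python. The sketch below keeps only absolute `http(s)` links. The first entry is taken from the output above; the other two entries and the `absolute_hrefs` helper are hypothetical.

```python
from typing import Any, Dict, List

links: List[Dict[str, Any]] = [
    {
        "class": ["popup-toggle-button"],
        "aria-label": "Toggle main-navigation popup",
        "aria-haspopup": "true",
        "href": "#main-navigation-popup",
    },
    {"href": "https://www.bbcgoodfood.com/recipes"},  # hypothetical entry
    {"class": ["no-target"]},                         # hypothetical entry without href
]


def absolute_hrefs(links: List[Dict[str, Any]]) -> List[str]:
    """Return the href of every link that points to an absolute http(s) URL."""
    return [
        link["href"]
        for link in links
        if str(link.get("href", "")).startswith(("http://", "https://"))
    ]


print(absolute_hrefs(links))  # ['https://www.bbcgoodfood.com/recipes']
```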

### `site_name() -> str | None`

Returns the website name, as defined in the page's HTML. If the page does not define this, this
function returns `None`.

```py
>>> scraper.site_name()
None
```

### `to_json() -> dict`

Returns the output of all functions implemented by this scraper as a `dict`.

```py
>>> scraper.to_json()
{
    'author': 'Good Food team',
    'canonical_url': 'https://www.bbcgoodfood.com/recipes/monster-cupcakes',
    'category': 'Treat',
    ... # etc.
}
```
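
As the name suggests, the returned `dict` is intended to be turned into JSON. A minimal sketch using the standard `json` module, assuming every value in the `dict` is JSON-serializable; only the three keys shown above are used here.

```python
import json

# The three entries visible in the output above; the real dict has more keys.
data = {
    "author": "Good Food team",
    "canonical_url": "https://www.bbcgoodfood.com/recipes/monster-cupcakes",
    "category": "Treat",
}

# ensure_ascii=False keeps characters such as '½' readable in the output.
text = json.dumps(data, indent=2, ensure_ascii=False)
print(text)

# Parsing the text again gives back an equal dict.
restored = json.loads(text)
```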

## Optional functions

### `category() -> str`

Semi-structured field that can contain a mix of cuisine type (for example, country names),
mealtime (breakfast/dinner/etc) and dietary properties (gluten-free, vegetarian). The value is
defined by the website, so it may overlap with other scraper functions (e.g. `cuisine()`).

```py
>>> scraper.category()
'Treat'
```

### `cook_time() -> int`

Returns the time to cook the recipe in minutes, excluding any time to prepare ingredients.

```py
>>> scraper.cook_time()
20
```

### `cuisine() -> str`

Returns the cuisine of the recipe.

```py
>>> scraper.cuisine()
# Not implemented!
```
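
Since optional functions are not available for every website, calling code usually needs a fallback. The sketch below shows one defensive pattern. It is an assumption-heavy illustration: this guide does not state which exception an unimplemented function raises, so the helper catches `Exception` broadly, and `StubScraper` is a hypothetical stand-in that uses `NotImplementedError` purely for demonstration.

```python
from typing import Any


class StubScraper:
    """Hypothetical stand-in for a scraper, for illustration only."""

    def category(self) -> str:
        return "Treat"

    def cuisine(self) -> str:
        # Stand-in for an optional function the website does not support.
        raise NotImplementedError("cuisine is not available for this website")


def call_optional(scraper: Any, name: str, default: Any = None) -> Any:
    """Call scraper.<name>() and return `default` if it is missing or fails."""
    method = getattr(scraper, name, None)
    if method is None:
        return default
    try:
        return method()
    except Exception:
        return default


stub = StubScraper()
print(call_optional(stub, "category"))           # Treat
print(call_optional(stub, "cuisine", "unknown"))  # unknown
```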

### `nutrients() -> Dict[str, str]`

Returns a `dict` of nutrition information. Each nutrition item is usually given per unit of yield,
e.g. per serving or per item. The keys of the `dict` are the nutrients (including calories) and the
values are the amount of that nutrient, including the unit.

```py
>>> scraper.nutrients()
{
    'calories': '389 calories',
    'fatContent': '19 grams fat',
    'saturatedFatContent': '9 grams saturated fat',
    'carbohydrateContent': '53 grams carbohydrates',
    'sugarContent': '36 grams sugar',
    'fiberContent': '1 grams fiber',
    'proteinContent': '5 grams protein',
    'sodiumContent': '0.3 milligram of sodium'
}
```
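
Because each value is a `str` that combines the amount and its unit, numeric work requires extracting the number first. A minimal sketch, assuming each value starts with a number as in the output above; `leading_number` is a hypothetical helper and not part of the library.

```python
import re
from typing import Dict, Optional


def leading_number(value: str) -> Optional[float]:
    """Return the number at the start of a string such as '19 grams fat'."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)", value)
    return float(match.group(1)) if match else None


# A subset of the output shown above.
nutrients: Dict[str, str] = {
    "calories": "389 calories",
    "fatContent": "19 grams fat",
    "sodiumContent": "0.3 milligram of sodium",
}

amounts = {name: leading_number(value) for name, value in nutrients.items()}
print(amounts)
# {'calories': 389.0, 'fatContent': 19.0, 'sodiumContent': 0.3}
```

The unit is deliberately left in the original string, since different websites may express the same nutrient in different units.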

### `prep_time() -> int`

Returns the time to prepare the ingredients for the recipe in minutes.

```py
>>> scraper.prep_time()
30
```

### `ratings() -> float`

Returns the recipe rating. When available, this is usually the average of all the ratings given to
the recipe on the website.

```py
>>> scraper.ratings()
# Not implemented!
```

### `ratings_count() -> int`

Returns the total number of ratings that contribute to the recipe's rating.

```py
>>> scraper.ratings_count()
# Not implemented!
```

### `equipment() -> List[str] | None`

Returns a `list` of the cooking equipment needed for the recipe.

```py
>>> scraper.equipment()
['Mixing Bowl', 'Whisk', 'Baking Tray']
```

### `cooking_method() -> str`

Returns the method of cooking the recipe.

```py
>>> scraper.cooking_method()
'Stovetop'
```

### `keywords() -> list`

Returns a `list` of keywords associated with a recipe.

```py
>>> scraper.keywords()
['easy', 'quick', 'dinner']
```

### `dietary_restrictions() -> List[str] | None`

Returns the dietary restrictions specified by the recipe.

```py
>>> scraper.dietary_restrictions()
['Vegan Diet', 'Vegetarian Diet']
```