File: how-to-develop-a-scraper.md

# How To Develop a New Scraper

!!! warning "Under Construction"
    This section is being updated. Some information may be outdated or inaccurate.


## Find a website

First, check if the website is already supported:

- Check the [Supported Sites](../getting-started/supported-sites.md)
- Or verify programmatically:

```python
from recipe_scrapers import SCRAPERS
# Check if site is supported
print(SCRAPERS.get("bbcgoodfood.com"))
```

!!! note "Track Your Progress"
    Create an [issue](https://github.com/hhursev/recipe-scrapers/issues/new/choose) to track
    your work.

## Setup Repository

Fork the [recipe-scrapers repository](https://github.com/hhursev/recipe-scrapers) on GitHub and
follow these steps:

!!! tip "Quick Setup"
    ```sh
    # Clone your fork
    git clone https://github.com/YOUR-USERNAME/recipe-scrapers.git
    cd recipe-scrapers

    # Set up Python environment
    python -m venv .venv
    source .venv/bin/activate  # On Windows use: .venv\Scripts\activate
    python -m pip install --upgrade pip
    pip install -e ".[all]"
    ```

Create a new branch:

```sh
git checkout -b site/website-name
```

!!! tip "Run Tests"
    ```sh
    python -m unittest

    # Optional: Parallel testing
    pip install unittest-parallel
    unittest-parallel --level test
    ```

## Generate Scraper Files

### 1. Select Recipe URL

!!! tip "Recipe Selection"
    Choose a recipe with multiple instructions when possible. Single-instruction recipes may
    indicate parsing errors, unless [explicitly handled](https://github.com/hhursev/recipe-scrapers/blob/98ead6fc6e9653805b01539a3f46fbfb4e096136/tests/test_allrecipes.py#L147-L150).

### 2. Check Schema Support

Test if the site uses [Recipe Schema](https://schema.org/Recipe):

```python
from urllib.request import urlopen
from recipe_scrapers import scrape_html

url = "https://example.com/your-recipe"
html = urlopen(url).read().decode("utf-8")

scraper = scrape_html(html, url, supported_only=False)  # wild_mode=True is the deprecated equivalent
print(scraper.schema.data)  # Empty dict if the site has no Recipe schema
```

### 3. Generate Files

```sh
python generate.py <ClassName> <URL>
```

`<URL>` should be the recipe page you selected in the first step. The script
downloads this recipe and uses it to create the initial test data.

This creates:

- Scraper file in `recipe_scrapers/`
- Test files in `tests/test_data/<host>/`

## Implementation

=== "With Recipe Schema"
    ```python
    from recipe_scrapers import scrape_html

    scraper = scrape_html(html, url)
    print(scraper.title())
    print(scraper.ingredients())
    ```

=== "Without Recipe Schema"
    ```python
    def title(self):
        return self.soup.find('h1').get_text()
    ```

!!! info "Resources"
    - [Scraper Functions Guide](in-depth-guide-scraper-functions.md)
    - [HTML Scraping Guide](in-depth-guide-html-scraping.md)
    - [BeautifulSoup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

## Testing

### 1. Update Test Data

Edit `tests/test_data/<host>/test.json`:
```json
{
    "host": "<host>",
    "canonical_url": "...",
    "site_name": "...",
    "author": "...",
    "language": "...",
    "title": "...",
    "ingredients": "...",
    "instructions_list": "...",
    "total_time": "...",
    "yields": "...",
    "image": "...",
    "description": "..."
}
```
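Filled in, the file might look like this (all values are illustrative, not taken from a real site):

```json
{
    "host": "examplesite.com",
    "canonical_url": "https://examplesite.com/recipes/example-soup",
    "site_name": "Example Site",
    "author": "Jane Doe",
    "language": "en",
    "title": "Example Soup",
    "ingredients": [
        "2 carrots, chopped",
        "1 litre vegetable stock"
    ],
    "instructions_list": [
        "Chop the vegetables.",
        "Simmer for 20 minutes."
    ],
    "total_time": 30,
    "yields": "4 servings",
    "image": "https://examplesite.com/images/example-soup.jpg",
    "description": "A simple example soup."
}
```

Note that `ingredients` and `instructions_list` are JSON arrays, and `total_time` is the number of minutes as an integer.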
### Test Data Population Help

The HTML file saved by `generate.py` can help you fill in the required fields in the test JSON file:

```python
from pathlib import Path
from recipe_scrapers import scrape_html
import json

html = Path("tests/test_data/<host>/<TestFileName>.testhtml").read_text(encoding="utf-8")
scraper = scrape_html(html, "<URL>")
print(json.dumps(scraper.to_json(), indent=2, ensure_ascii=False))
```

This prints the scraper's output to your terminal for reference.

### 2. Run Tests

```sh
python -m unittest -k <ClassName.lower()>
```

!!! warning "Edge Cases"
    Test with multiple recipes to catch potential edge cases.

## Submit Changes

1. Commit your work:

    ```sh
    git add -p  # Review changes
    git commit -m "Add scraper for example.com"
    git push origin site/website-name
    ```

2. Create a pull request at [recipe-scrapers](https://github.com/hhursev/recipe-scrapers/pulls)