# How To Develop a New Scraper
!!! warning "Under Construction"
    This section is being updated. Some information may be outdated or inaccurate.
## Find a website
First, check whether the website is already supported:

- Check the [Supported Sites](../getting-started/supported-sites.md) list
- Or verify programmatically:

```python
from recipe_scrapers import SCRAPERS

# Prints the scraper class if the site is supported, None otherwise
print(SCRAPERS.get("bbcgoodfood.com"))
```
!!! note "Track Your Progress"
    Create an [issue](https://github.com/hhursev/recipe-scrapers/issues/new/choose) to track
    your work.
## Setup Repository
Fork the [recipe-scrapers repository](https://github.com/hhursev/recipe-scrapers) on GitHub and
follow these steps:
!!! tip "Quick Setup"
    ```sh
    # Clone your fork
    git clone https://github.com/YOUR-USERNAME/recipe-scrapers.git
    cd recipe-scrapers

    # Set up a Python virtual environment
    python -m venv .venv
    source .venv/bin/activate  # On Windows use: .venv\Scripts\activate
    python -m pip install --upgrade pip
    pip install -e ".[all]"
    ```
Create a new branch:
```sh
git checkout -b site/website-name
```
!!! tip "Run Tests"
    ```sh
    python -m unittest

    # Optional: run the suite in parallel
    pip install unittest-parallel
    unittest-parallel --level test
    ```
## Generate Scraper Files
### 1. Select Recipe URL
!!! tip "Recipe Selection"
    Choose a recipe with multiple instructions when possible. A single-instruction result often
    indicates a parsing error, unless the site is [explicitly handled](https://github.com/hhursev/recipe-scrapers/blob/98ead6fc6e9653805b01539a3f46fbfb4e096136/tests/test_allrecipes.py#L147-L150).
### 2. Check Schema Support
Test if the site uses [Recipe Schema](https://schema.org/Recipe):
```python
from urllib.request import urlopen

from recipe_scrapers import scrape_html

url = "https://example.com/your-recipe"
html = urlopen(url).read().decode("utf-8")

scraper = scrape_html(html, url, wild_mode=True)
print(scraper.schema.data)  # Empty dict if the site does not provide Recipe Schema
```
### 3. Generate Files
```sh
python generate.py <ClassName> <URL>
```
`<URL>` should be the recipe page you selected in the first step. The script
downloads this recipe and uses it to create the initial test data.
This creates:
- Scraper file in `recipe_scrapers/`
- Test files in `tests/test_data/<host>/`
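For example, with a hypothetical class name `Example` and a recipe hosted on `example.com` (both placeholders), the run and the files it produces might look like this:

```sh
# Hypothetical invocation; class name and URL are placeholders
python generate.py Example https://example.com/your-recipe

# Files to expect afterwards (paths illustrative):
#   recipe_scrapers/example.py        - scraper class skeleton
#   tests/test_data/example.com/      - downloaded HTML plus test JSON
```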
## Implementation
=== "With Recipe Schema"
```python
from recipe_scrapers import scrape_html
scraper = scrape_html(html, url)
print(scraper.title())
print(scraper.ingredients())
```
=== "Without Recipe Schema"
```python
def title(self):
return self.soup.find('h1').get_text()
```
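For orientation, a complete scraper file combines both ideas: a class with a `host()` classmethod plus the field methods. Below is a minimal sketch, assuming a schema-backed site; the class name and host are placeholders, and the delegation to `self.schema` mirrors the pattern used by existing files in `recipe_scrapers/`:

```python
# Hypothetical recipe_scrapers/example.py; class name and host are placeholders
from ._abstract import AbstractScraper


class Example(AbstractScraper):
    @classmethod
    def host(cls):
        return "example.com"

    def title(self):
        return self.schema.title()

    def ingredients(self):
        return self.schema.ingredients()

    def instructions(self):
        return self.schema.instructions()
```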
!!! info "Resources"
    - [Scraper Functions Guide](in-depth-guide-scraper-functions.md)
    - [HTML Scraping Guide](in-depth-guide-html-scraping.md)
    - [BeautifulSoup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
## Testing
### 1. Update Test Data
Edit `tests/test_data/<host>/test.json`:
```json
{
    "host": "<host>",
    "canonical_url": "...",
    "site_name": "...",
    "author": "...",
    "language": "...",
    "title": "...",
    "ingredients": "...",
    "instructions_list": "...",
    "total_time": "...",
    "yields": "...",
    "image": "...",
    "description": "..."
}
```
### Test Data Population Help
The HTML file generated by `generate.py` can help you fill in the required fields of the test JSON file:
```python
import json
from pathlib import Path

from recipe_scrapers import scrape_html

html = Path("tests/test_data/<host>/<TestFileName>.testhtml").read_text(encoding="utf-8")
scraper = scrape_html(html, "<URL>")
print(json.dumps(scraper.to_json(), indent=2, ensure_ascii=False))
```
This prints the scraper's output to your terminal for reference.
### 2. Run Tests
```sh
# Use your scraper's class name in lowercase
python -m unittest -k <classname>
```
!!! warning "Edge Cases"
    Test with multiple recipes from the site to catch potential edge cases.
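One lightweight way to sanity-check an additional page is to reuse the helper pattern from above; the file name and URL here are illustrative:

```python
from pathlib import Path

from recipe_scrapers import scrape_html

# Hypothetical second saved page for the same host
html = Path("tests/test_data/example.com/second_recipe.testhtml").read_text(encoding="utf-8")
scraper = scrape_html(html, "https://example.com/another-recipe")

# Spot-check a few fields before writing the expected JSON
print(scraper.title(), scraper.total_time(), scraper.yields())
```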
## Submit Changes
1. Commit your work:

    ```sh
    git add -p  # Review changes before staging
    git commit -m "Add scraper for example.com"
    git push origin site/website-name
    ```
2. Create a pull request at [recipe-scrapers](https://github.com/hhursev/recipe-scrapers/pulls)