"""
scrapy.linkextractors
This package contains a collection of Link Extractors.
For more info see docs/topics/link-extractors.rst
"""
from __future__ import annotations
from typing import TYPE_CHECKING
if TYPE_CHECKING:
from collections.abc import Iterable
from re import Pattern
# common file extensions that are not followed if they occur in links
IGNORED_EXTENSIONS = [
    # archives
    "7z",
    "7zip",
    "bz2",
    "rar",
    "tar",
    "tar.gz",
    "xz",
    "zip",
    # images
    "mng",
    "pct",
    "bmp",
    "gif",
    "jpg",
    "jpeg",
    "png",
    "pst",
    "psp",
    "tif",
    "tiff",
    "ai",
    "drw",
    "dxf",
    "eps",
    "ps",
    "svg",
    "cdr",
    "ico",
    "webp",
    # audio
    "mp3",
    "wma",
    "ogg",
    "wav",
    "ra",
    "aac",
    "mid",
    "au",
    "aiff",
    # video
    "3gp",
    "asf",
    "asx",
    "avi",
    "mov",
    "mp4",
    "mpg",
    "qt",
    "rm",
    "swf",
    "wmv",
    "m4a",
    "m4v",
    "flv",
    "webm",
    # office suites
    "xls",
    "xlsm",
    "xlsx",
    "xltm",
    "xltx",
    "potm",
    "potx",
    "ppt",
    "pptm",
    "pptx",
    "pps",
    "doc",
    "docb",
    "docm",
    "docx",
    "dotm",
    "dotx",
    "odt",
    "ods",
    "odg",
    "odp",
    # other
    "css",
    "pdf",
    "exe",
    "bin",
    "rss",
    "dmg",
    "iso",
    "apk",
    "jar",
    "sh",
    "rb",
    "js",
    "hta",
    "bat",
    "cpl",
    "msi",
    "msp",
    "py",
]

def _matches(url: str, regexs: Iterable[Pattern[str]]) -> bool:
    """Return True if any of the compiled patterns matches the URL."""
    return any(r.search(url) for r in regexs)


def _is_valid_url(url: str) -> bool:
    """Return True for URLs whose scheme link extraction should follow."""
    return url.split("://", 1)[0] in {"http", "https", "file", "ftp"}


# Top-level imports
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor as LinkExtractor

__all__ = [
    "IGNORED_EXTENSIONS",
    "LinkExtractor",
]