#!/usr/bin/python
# -*- coding: utf-8 -*-
# Check for broken links in HTML data files.
#
# Usage:
#
# links_check [DIR]
#
# If DIR is specified, only look for files under DIR; otherwise, look for
# all the files under the current directory.
#
# The script makes a HEAD HTTP request for each link found, and prints a
# message in case of an error.
# Copyright (C) 2016 Guillaume Chereau
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin Street, Suite 500, Boston, MA
# 02110-1335, USA.
import fnmatch
import os
import sys

import requests
from bs4 import BeautifulSoup

urls = set()
sources = {}  # Maps each url to the set of files it was found in.

# Get the list of all files matching '*.utf8'.
base = "./"
if len(sys.argv) == 2:
    base = sys.argv[1] + "/"
files = []
for root, dirnames, filenames in os.walk(base):
    for filename in fnmatch.filter(filenames, '*.utf8'):
        files.append(os.path.join(root, filename))
print("Got %d files" % len(files))

# Retrieve all the http links.
for f in files:
    with open(f) as fp:
        soup = BeautifulSoup(fp, 'html.parser')
    for link in soup.find_all('a'):
        url = link.get('href')
        if not url or url.startswith('mailto') or url.startswith('#'):
            continue
        urls.add(url)
        sources.setdefault(url, set()).add(f)
print("Got %d urls" % len(urls))
print()

# Test each link one by one.
for url in urls:
    try:
        # We ignore the 301 return code, even though it would be nice to
        # replace the links with the new url (after the redirects are
        # followed, r.url holds the final address).
        # The timeout avoids hanging forever on an unresponsive server.
        r = requests.head(url, allow_redirects=True, timeout=10)
        s = r.status_code
        if s == 200:
            continue
    except Exception:
        s = "err"
    print(url)
    print(s)
    print("Found in:")
    for f in sources[url]:
        print(f)
    print()
    sys.stdout.flush()