#!/usr/bin/env python3
# SPDX-License-Identifier: MIT
# SPDX-FileCopyrightText: © 2004 Tristan Seligmann and Jonathan Jacobs
# SPDX-FileCopyrightText: © 2012 Bastian Kleineidam
# SPDX-FileCopyrightText: © 2015 Tobias Gruetzmacher
"""
Script to get ComicFury comics and save the info in a JSON file for further
processing.
"""
import sys
from urllib.parse import urlsplit
import scriptutil


class ComicFuryUpdater(scriptutil.ComicListUpdater):
    # Absolute minimum number of pages a comic may have (restrict search space)
    MIN_COMICS = 90

    dup_templates = ('ComicSherpa/%s', 'Creators/%s', 'GoComics/%s',
                     'KeenSpot/%s', 'Arcamax/%s')

    langmap = {
        'german': 'de',
        'spanish': 'es',
        'italian': 'it',
        'japanese': 'ja',
        'french': 'fr',
        'portuguese': 'pt',
    }

    # names of comics to exclude
    excluded_comics = (
        # unsuitable navigation
        "AlfdisAndGunnora",
        "AnAmericanNerdInAnimatedTokyo",
        "AngryAlien",
        "BoozerAndStoner",
        "Bonejangles",
        "ConradStory",
        "Crossing",
        "ChristianHumberReloaded",
        "CorkAndBlotto",
        "Democomix",
        "ErraticBeatComics",
        "EnergyWielders",
        "EvilBearorg",
        "Fiascos",
        "FateOfTheBlueStar",
        "FPK",
        "Fanartgyle",
        "FrigginRandom",
        "GoodbyeKitty",
        "GoodSirICannotDraw",
        "HighlyExperiMental",
        "IfAndCanBeFlowers",
        "JournalismStory",
        "JohnsonSuperior",
        "Keel",
        "JudgeDredBasset",
        "LomeathAndHuilii",
        "MNPB",
        "LucidsDream",
        "MadDog",
        "Minebreakers",
        "MoonlightValley",
        "MyImmortalFool",
        "NATO",
        "NothingFits",
        "OptimisticFishermenAndPessimisticFishermen",
        "Old2G",
        "NothingFitsArtBlog",
        "OutToLunchTheStingRayWhoreStory",
        "Pandemonium",
        "Pewfell",
        "ProjectX",
        "Ratantia",
        "RealLifeTrips",
        "Sandgate",
        "Secondpuberty",
        "Seconds",
        "SlightlyEccentricOrigins",
        "StardustTheCat",
        "StrangerThanFiction",
        "TalamakGreatAdventure",
        "TheBattalion",
        "TheBends",
        "TheDailyProblem",
        "TheMansionOfE",
        "ThePainter",
        "TheSeekers",
        "TheTrialsOfKlahadOfTheAbyss",
        "TheStickmen",
        "ThornsInOurSide",
        "TopHeavyVeryBustyPinUpsForAdults",
        "USBUnlimitedSimulatedBody",
        "TylerHumanRecycler",
        "UAF",
        "WhenPigsFly",
        "YeOldeLegotimeTheatre",

        # no content
        "Angst",
        "TheDevonLegacyPrologue",

        # images gone
        "BaseballCapsAndTiaras",
        "BiMorphon",
        "CROSSWORLDSNEXUS",
        "DevilSpy",
        "Fathead",
        "GOODBYEREPTILIANS",
        "KevinZombie",
        "KindergardenCrisIs",
        "NoSongsForTheDead",
        "RequiemShadowbornPariah",
        "SandboxDrama",
        "STICKFODDER",
        "TezzleAndZeek",
        "TheRealmOfKaerwyn",

        # broken HTML
        "CrossingOver",

        # unique html
        "IKilledTheHero",
        "PowerOfPower",
        "Schizmatic",
        "WakeTheSleepers",
        "WeightOfEternity",

        # moved
        "OopsComicAdventure",
    )

    def handle_url(self, url):
        """Parse one search result page."""
        data = self.get_url(url)

        for comicdiv in self.xpath(data, '//div[d:class("webcomic-result")]'):
            comiclink = self.xpath(comicdiv, './div[d:class("webcomic-result-title")]/a')[0]
            comicurl = comiclink.attrib['href']
            name = comiclink.text

            info = self.xpath(comicdiv, './/span[d:class("stat-value")]')
            # find out how many images this comic has
            count = int(info[0].text.strip())

            self.add_comic(name, comicurl, count)

        nextlink = self.xpath(data, '//div[d:class("search-next-page")]/a')
        if nextlink:
            return nextlink[0].attrib['href']
        else:
            return None

    def collect_results(self):
        """Parse all search result pages."""
        # Sort by page count, so we can abort when we get under some threshold.
        url = ('https://comicfury.com/search.php?query=&lastupdate=0&' +
               'completed=1&fn=2&fv=2&fs=2&fl=2&sort=0')

        print("Parsing search result pages...", file=sys.stderr)
        while url:
            url = self.handle_url(url)

    def get_entry(self, name, entry):
        """Format one comic as a cls(...) line, using the first label of its URL's hostname."""
        url = entry
        sub = urlsplit(url).hostname.split('.', 1)[0]
        return f"cls('{name}', '{sub}'),"


if __name__ == '__main__':
    ComicFuryUpdater(__file__).run()