1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106
|
#! /usr/bin/perl
################################################################################
#
# File: mkconc.pl
# RCS: $Header: /home/matthew/cvs/bible-kjv-4.10/makeconc.pl,v 2.0 2003/01/08 15:29:52 matthew Exp $
# Description: make Bible concordance: translation of Chip Chapin's ksh script
# Author: Chris Eich, SRSD
# Created: Wed Dec 23 11:00:18 1992
# Modified: Wed Dec 23 15:49:23 1992 (Chip Chapin) chip@hpclbis
# Language: perl
# Status: Experimental (Do Not Distribute)
#
################################################################################
#
# Revisions:
#
# Wed Dec 23 15:19:45 1992 (Chip Chapin) chip@hpclbis
# Received from Chris Eich, replaces "makeconcordance" script.
# Made use of stopwords conditional.
###############################################################################
use IO::Handle
# Putting . on PATH ensures that the bible program will be found.
$ENV{'PATH'} =~ s/^:*/.:/;
$PROG = 'bible';
# Read a list of stop words, if any, one per line.
if (open(STOP, "$ARGV[0]")) {
print "Excluding stopwords ($ARGV[0]) from concordance.\n";
while (<STOP>) {
# Ignore comments, mark stop word if one is found.
$stopword{$&}++ if !/^#/ && /[a-z]+/;
}
close(STOP);
} else {
print "All words will be included in concordance (no stopwords).\n";
}
# Generate plain text file, one "record" (e.g. bible verse) per line.
# Fill %lines and $count tables, which are keyed by words.
open(BIBLE, "bible.rawtext");
<BIBLE>; #discard the header line
while (<BIBLE>) {
s/^\S+\s+//; # Cut off the record reference that starts each line.
tr/A-Z/a-z/; # Downcase.
tr/a-z/ /c; # Turn non-alpha into space.
%seenonthisline = ();
for $word (split(' ')) {
next if $stopword{$word};
$count{$word}++; # Move below next line to count per-line.
next if $seenonthisline{$word}++;
#the header line discard still leaves $. 1 higher than we want
$lines{$word} .= " " . ($. - 1);
}
}
die $! if BIBLE->error();
# Create raw concordance, listing the lines where each word occurs.
open(RAWCONC, "> $PROG.rawconcordance") || die "$PROG.rawconcordance: $!\n";
for $word (sort keys %lines) {
print RAWCONC $word, $lines{$word}, "\n";
}
close(RAWCONC);
# Also create a wordcounts file, which gives the number of lines in
# which each word occurs. Note that we ARE counting cases where the
# same word is used several times in the same record. See the comment
# above for "$count{$word}++" to change this to per-record.
open(COUNTS, "| sort -nrk 2 > $PROG.wordcounts");
while (($word, $count) = each %count) {
print COUNTS $word, "\t", $count, "\n";
}
close(COUNTS);
__END__
# Next ... create a binary form of the raw concordance.
# This is handled by "makeconcfile", a program invoked from the
# BRS makefile.
# so we're all done now.
# Interesting statistic: 89198 chars in all the words in the Bible,
# 617371 word-verse occurrances
# from...
# awk '{chars += length($1); counts += $2}
# END {print "chars=" chars " counts=" counts}' bible.wordcounts
# end
###############################################################################
# Gnu Emacs variables...
#
# Local Variables:
# mode: perl
# eval: (auto-fill-mode 0)
# default-header-comment-character: ?#
# header-prefix: "#! /usr/bin/perl"
# header-suffix: "#"
# header-comment-character: ?#
# end:
|