## File: abstract.plaintex

package info (click to toggle)
sgb 1:20030623-3
 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247 % EXTENDED ABSTRACT DESCRIBING THE STANFORD GRAPHBASE \magnification\magstep1 \advance\vsize by 1.5\baselineskip \parskip3pt plus 1pt \font\sc=cmcsc10 \def\disleft#1:#2:#3\par{\par\hangindent#1\noindent \hbox to #1{#2 \hfill \hskip .1em}\ignorespaces#3\par} \def\TeX{T\hbox{\hskip-.1667em\lower.424ex\hbox{E}\hskip-.125em X}} \def\biba{\par\parindent 40pt\hangindent 60pt} \centerline{\bf The Stanford GraphBase: A Platform for Combinatorial Computing} \smallskip \centerline{\sl Donald E. Knuth, Stanford University} \noindent A highly portable collection of programs and data is now available to researchers who study combinatorial algorithms and data structures. All files are in the public domain and usable with only one restriction: They must not be changed! A~change file'' mechanism allows local customization while the master files stay intact. The programs are intended to be interesting in themselves as examples of literate programming.'' Thus, the Stanford GraphBase can also be regarded as a collection of approximately 30 essays for programmers to enjoy reading, whether or not they are doing algorithmic research. The programs are written in {\tt CWEB}, a~combination of \TeX\ and~C that is easy to use by anyone who knows those languages and easy to read by anyone familiar with the rudiments of~C. (The {\tt CWEB} system is itself portable and in the public domain.) Four program modules constitute the {\it kernel\/} of the GraphBase: { \biba {\sc gb\_$\,$flip} is a portable random number generator; \biba {\sc gb\_$\,$graph} defines standard data structures for graphs and includes routines for storage allocation; \biba {\sc gb\_$\,$io} reads data files and makes sure they are uncorrupted; \biba {\sc gb\_$\,$sort} is a portable sorting routine for 32-bit keys in linked lists of nodes. } \noindent All of the other programs rely on {\sc gb\_$\,$graph} and some subset of the other three parts of the kernel. A dozen or so {\it generator modules\/} construct graphs that are of special interest in algorithmic studies. For example, {\sc gb\_$\,$basic} contains 12~subroutines to produce standard graphs, such as the graphs of queen moves on $d$-dimensional rectangular boards with wrap-around'' on selected coordinates. Another generator module, {\sc gb\_$\,$rand}, produces several varieties of random graphs. Each graph has a unique identifier that allows researchers all over the world to work with exactly the same graphs, even when those graphs are random.'' Repeatable experiments and standard benchmarks will therefore be possible and widely available. Most of the generator modules make use of {\it data sets}, which the author has been collecting for 20~years in an attempt to provide interesting and instructive examples for some forthcoming books on combinatorial algorithms ({\sl The Art of Computer Programming}, Volumes 4A, 4B, and~4C). For example, one of the data sets is {\tt words.dat}, a~collection of 5-letter words of English that the author believes is complete'' from his own reading experience. Each word is accompanied by frequency counts in various standard corpuses of text, so that the most common terms can be singled out if desired. {\sc gb\_$\,$words} makes a subset of words into a graph by saying that two words are adjacent when they agree in~4 out of~5 positions. Thus, we can get from {\tt words} to {\tt graph} in seven steps: \disleft 30pt:: {\tt words, wolds, golds, goads, grads, grade, grape, graph.} \noindent This is in fact the shortest such chain obtainable from {\tt words.dat}. A dozen or so {\it demonstration modules\/} are also provided, as illustrations of how the generated graphs can be used. For example, the {\sc ladders} module is an interactive program to construct chains of 5-letter words like the one just exhibited, using arbitrary subsets of the data. If we insist on restricting our choices to the 2000 most common words, instead of using the entire collection of about 5700, the shortest path from {\tt words} to {\tt graph} turns out to have length~20: \disleft 30pt:: {\tt words, lords, loads, leads, leaps, leapt, least,} \vskip-5pt \disleft 30pt:: {\tt lease, cease, chase, chose, chore, shore, shone,} \vskip-5pt \disleft 30pt:: {\tt phone, prone, prove, grove, grave, grape, graph.} Several variations on this theme have also been implemented. If we consider the distance between adjacent words to be alphabetic distance, for example, the shortest path from {\tt words} to {\tt graph} turns out to be \disleft 30pt:: {\tt words} (3) {\tt woods} (16) {\tt goods} (14) {\tt goads} (3) {\tt grads} (14) {\tt grade} (12) {\tt grape} (3) {\tt graph}, \noindent total length 65. The {\tt LADDERS} module makes use of another GraphBase module called {\sc gb\_$\,$dijk}, which carries out Dijkstra's algorithm for shortest paths and allows the user to plug in arbitrary implementations of priority queues so that the performance of different queuing methods can be compared. The graphs produced by {\sc gb\_$\,$words} are undirected. Other generator modules, like {\sc gb\_$\,$roget}, produce directed graphs. Roget's famous {\sl Thesaurus\/} of 1882 classified all concepts into 1022 categories, which we can call the vertices of a graph; an arc goes from~$u$ to~$v$ when category~$u$ contains a cross reference to category~$v$ in Roget's book. A~demonstration module called {\sc roget\_$\,$components} determines the strong components of graphs generated by {\sc gb\_$\,$roget}. This program is an exposition of Tarjan's algorithm for strong components and topological sorting of directed graphs. Similarly, world literature leads to further interesting families of undirected graphs via the {\sc gb\_$\,$books} module. Five data sets {\tt anna.dat}, {\tt david.dat}, {\tt homer.dat}, {\tt huck.dat}, and {\tt jean.dat} give information about {\sl Anna Karenina}, {\sl David Copperfield}, {\sl The Iliad}, {\sl Huckleberry Finn}, and {\sl Les Mis\'erables}. As you might expect, the characters of each work become the vertices of a graph. Two vertices are adjacent if the corresponding characters encounter each other, in selected chapters of the book. A~demonstration program called {\sc book\_$\,$components} finds the blocks (i.e., biconnected components) of these graphs using the elegant algorithm of Hopcroft and Tarjan. Another module, {\sc gb\_$\,$games}, generates graphs based on college football scores. All the games from the 1990 season between America's leading 120 teams are recorded in {\tt games.dat}; this data leads to cliquey'' graphs, because most of the teams belong to leagues and they play every other team in their league. The overall graph is, however, connected. A~demonstration module called {\sc football} finds long chains of scores, to prove for instance that Stanford might have trounced Harvard by more than 2000 points if the two teams had met---because Stanford beat Notre Dame by~5, and Notre Dame beat Air Force by~30, and Air Force beat Hawaii by~24, and \dots~, and Yale beat Harvard by~15. (Conversely, a~similar proof'' also ranks Harvard over Stanford by more than 2000 points.) No good algorithm is known for finding the optimum solution to problems like this, so the data provides an opportunity for researchers to exhibit better and better solutions with better and better techniques as algorithmic progress is made. The {\sc gb\_$\,$econ} module generates directed graphs based on the flow of money between industries in the US economy. A~variety of graphs can be obtained, as the economy can be divided into any number of sectors from~2 to~79 in this model. A~demonstration program {\sc econ\_$\,$order} attempts to rank the sectors in order from suppliers'' to consumers,'' namely to permute rows and columns of a matrix so as to minimize the sum of entries above the diagonal. A reasonably efficient algorithm for this problem is known, but it is very complicated. Two heuristics are implemented for comparison, one greedy'' and the other cautious.'' Greed appears to be victorious, at least in the economic sphere. The highway mileage between 128 North American cities appears in {\tt miles.dat}, and the {\sc gb\_$\,$miles} module generates a variety of graphs from~it. Of special interest is a demonstration module called {\sc miles\_$\,$span}, which computes the minimum spanning trees of graphs output by {\sc gb\_$\,$miles}. Four algorithms for minimum spanning trees are implemented and compared, including some that are theoretically appealing but do not seem to fare so well in practice. An approach to comparison of algorithms called mem counting'' is shown in this demonstration to be an easily implemented machine-independent measure of efficiency that gives a reasonably fair comparison between competing techniques. A generator module called {\sc gb\_$\,$raman} produces Ramanujan graphs,'' which are important because of their role as expander graphs, useful for communication. A~demonstration module called {\sc girth} computes the shortest circuit and the diameter of Ramanujan graphs. Notice that some graphs, like those produced by {\sc gb\_$\,$basic} or {\sc gb\_$\,$raman}, have a rigid mathematical structure; others, like those produced by {\sc gb\_$\,$roget} or {\sc gb\_$\,$miles}, are more organic'' in nature. It is interesting and important to test algorithms on both kinds of graphs, in order to see if there is any significant difference in performance. A generator module called {\sc gb\_$\,$gates} produces graphs of logic circuits. One such family of graphs is equivalent to a simple RISC chip, a~programmable microcomputer with a variable number of registers. Using such a meta-network'' of gates, algorithms for design automation can be tested for a range of varying parameters. A~demonstration module {\sc take\_$\,$risc} simulates the execution of the chip on a sample program. Another meta-network of gates will perform parallel multiplication of $m$-bit numbers by $n$-bit numbers or by an $n$-bit constant; the {\sc multiply} module demonstrates these circuits. Planar graphs are generated by {\sc gb\_$\,$plane}, which includes among other things an implementation of the best currently known algorithm for Delaunay triangulation. Pixel data can lead to interesting bipartite graphs. Leonardo's {\sl Gioconda\/} is represented by {\tt lisa.dat}, an array of pixels that is converted into graphs of different kinds by {\sc gb\_$\,$lisa}. A~demonstration routine {\sc assign\_$\,$lisa} solves the assignment problem by choosing one pixel in each row and in each column so that the total brightness of selected pixels is maximized. Although the assignment problem being solved here has no relevance to art criticism or art appreciation, it does have great pedagogical value, because there is probably no better way to understand the characteristics of a large array of numbers than to perceive the array as an image. A~module called {\sc gb\_$\,$save} converts GraphBase graphs to and from an ASCII format that readily interfaces with other systems for graph manipulation. For further information see {\sl The Stanford GraphBase}, published by ACM Press in 1993. The book could also be called Fun and games with the Stanford GraphBase,'' because the demonstration programs are great toys to play with. Indeed, the author firmly believes that the best serious work is also good fun. We needn't apologize if we enjoy doing research.\looseness-1 \bye