(documentation from: <url:http://www.geocities.com/mparker762/clawk.html>)
CLAWK Package
This package implements nearly all the features provided by the Unix AWK
language, albeit with a pretty lisp-y flavor. In addition, it provides a large
set of macros for doing more complicated processing beyond that provided by
AWK.
To give you an idea of the flavor of CLAWK programming, here's an example from
Chapter 1 of The AWK Programming Language that calculates the number of
employees, the total pay for all employees, and the average pay.
    { pay = pay + $2 * $3 }
END { print NR, "employees"
      print "total pay is", pay
      print "average pay is", pay/NR
    }
The book gives the text file "emp.data" as example input.
Beth 4.00 0
Dan 3.75 0
Kathy 4.00 10
Mark 5.00 20
Mary 5.50 22
Suzie 4.25 18
When the AWK program (called, say, SAMPLE.AWK) is executed with the command
$ awk -f sample.awk emp.data
it produces the following output:
6 employees
total pay is 337.5
average pay is 56.25
Here's a straightforward translation of SAMPLE.AWK using the CLAWK package:
(defawk sample (&aux (pay 0))
  (t (incf pay ($* $2 $3)))
  (END ($print *NR* "employees")
       ($print "total pay is" pay)
       ($print "average pay is" (/ pay *NR*))))
When this function is executed with (sample "emp.data"), it produces the same
output:
6 employees
total pay is 337.5
average pay is 56.25
NIL
(the NIL is the return value from the SAMPLE function).
That's nearly as concise as the original AWK. Here's a version that doesn't use
the DEFAWK macro, and tries to be a bit more Lispish.
(defun sample (filename &aux (pay 0))
  (do-file-fields (filename (name payrate hrsworked))
    (declare (ignore name))
    (incf pay (* (num payrate) (num hrsworked))))
  (format t "~%~D employees" *NR*)
  (format t "~%total pay is ~F" pay)
  (format t "~%average pay is ~F" (/ pay *NR*)))
... which prints out the same thing as above.
Here's another example from The AWK Programming Language that shows off the
regular expressions a little better. This program performs some basic validity
checks on a Unix password file.
BEGIN {
    FS = ":" }
NF != 7 {
    printf("line %d, does not have 7 fields: %s\n", NR, $0)}
$1 ~ /[^A-Za-z0-9]/ {
    printf("line %d, nonalphanumeric user id: %s\n", NR, $0)}
$2 == "" {
    printf("line %d, no password: %s\n", NR, $0)}
$3 ~ /[^0-9]/ {
    printf("line %d, nonnumeric user id: %s\n", NR, $0)}
$4 ~ /[^0-9]/ {
    printf("line %d, nonnumeric group id: %s\n", NR, $0)}
$6 !~ /^\// {
    printf("line %d, invalid login directory: %s\n", NR, $0)}
And this is the same program using CLAWK:
(defawk checkpass ()
  (BEGIN
    (setq *FS* ":"))
  ((/= *NF* 7)
   (format t "~%line ~D, does not have 7 fields: ~S" *NR* $0))
  ((~ $1 #/[^A-Za-z0-9]/)
   (format t "~%line ~D, nonalphanumeric user id: ~S" *NR* $0))
  (($== $2 "")
   (format t "~%line ~D, no password: ~S" *NR* $0))
  ((~ $3 #/[^0-9]/)
   (format t "~%line ~D, nonnumeric user id: ~S" *NR* $0))
  ((~ $4 #/[^0-9]/)
   (format t "~%line ~D, nonnumeric group id: ~S" *NR* $0))
  ((!~ $6 #/^\//)
   (format t "~%line ~D, invalid login directory: ~S" *NR* $0)))
You must admit that this is awfully close to the original AWK. Since it uses
the #/ readmacro, you must evaluate (clawk:install-regex-syntax) before
evaluating this sample.
CLAWK summary
The INSTALL-REGEX-SYNTAX function sets up a #/ readmacro that reads up to the
next unescaped #\/ and interns the resulting string as a symbol in the current
package. This (a) gives a nice AWK-ish syntax, (b) results in an object that
can be printed readably and interacts well in the listener even without
Dynamic Windows, and (c) gives CLAWK a convenient place to hook the compiled
matcher and some associated information, speeding things up nicely. You can
still specify these patterns using the |...| syntax, but it's not quite as
nice, since #\| is a common character in regex patterns.
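Because #/ is reader syntax, it has to be installed before the Lisp reader
meets the first pattern. The sketch below shows one way to arrange that in a
source file. It assumes the CLAWK package is loaded and that
INSTALL-REGEX-SYNTAX and ~ are external symbols of it, as the qualified call
in the sample above suggests. SUSPICIOUS-USER-ID-P is a hypothetical helper
name, not part of CLAWK.

```lisp
;; Install the #/.../ syntax at compile time as well as load time, so the
;; reader already knows it when it reaches the pattern further down.
(eval-when (:compile-toplevel :load-toplevel :execute)
  (clawk:install-regex-syntax))

;; Hypothetical helper (not part of CLAWK): true when USER-ID contains a
;; character other than a letter or digit, as in the CHECKPASS sample.
(defun suspicious-user-id-p (user-id)
  (clawk:~ user-id #/[^A-Za-z0-9]/))
```

At the listener the EVAL-WHEN wrapper is unnecessary. Evaluating
(clawk:install-regex-syntax) once before typing any #/.../ pattern is enough.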
The INSTALL-CMD-SYNTAX function sets up a #` readmacro that reads up to the next
unescaped ` character, and expands into a call to
CLAWK::CALL-SYSTEM-CATCHING-OUTPUT. This function sends the command to the
shell and returns an input stream containing the output of the command. This
stream can then be used by CLAWK:FOR-STREAM-LINES or any of the CL stream
functions. The readmacro is defined for all non-Symbolics systems, although
currently only LispWorks is supported by CALL-SYSTEM-CATCHING-OUTPUT.
The various special variables *FS*, *NR*, *NF*, *RSTART*, *RLENGTH*, etc. are
fairly obvious if you've ever used AWK, as are the SUB, GSUB, SPLIT, INDEX,
MATCH, SUBSTR, and ~ and !~ functions. $SUB, $GSUB, $SPLIT, $INDEX, $MATCH, and
$SUBSTR do the same, but they attempt to coerce the various parameters to the
right type (so you can pass in a string for a parameter that expects a number).
WITH-PATTERNS is a macro that cleans up the syntax for using
dynamically-generated regex patterns. A future extension is for this to use the
fast regex compiler instead of the closure-based regex compiler when the
compiler is available and the pattern is a string literal.
WITH-FIELDS is a macro for doing line-splitting into variables. WITH-FIELDS
takes an optional field list (in destructuring-bind syntax) and an optional
string (default is the current line -- see DO-LINES) and field-separator
(default is *FS*). If you don't give a set of variables to use, it will use the
$ variables. It splits the line using SPLIT, and executes the body forms in an
environment containing the variables. The $n variables are locally bound
special, so they will revert on exit.
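As a rough illustration, here is how WITH-FIELDS might be used on one line of
emp.data. The grouping of the field list and the string into a leading
argument list is an assumption, inferred from the description above and from
the DO-FILE-FIELDS example earlier. It has not been checked against the
source. The sketch also assumes the CLAWK symbols are accessible in the
current package.

```lisp
;; Assumed argument shape: ((field-list) string). The separator is left
;; to default to *FS*.
(with-fields ((name payrate hrsworked) "Kathy 4.00 10")
  ;; The fields arrive as strings, so coerce them with NUM before doing
  ;; arithmetic, as the second SAMPLE definition does.
  (format t "~%~A earned ~F" name (* (num payrate) (num hrsworked))))
```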
WITH-SUBMATCHES is a macro for processing register submatches. It takes a list
of variables and a set of body forms, and binds the variables to the strings
corresponding to the register submatches (or nil if the register didn't match),
and evaluates the body forms in this environment. This is particularly handy
for use in the consequent clauses of match-case.
MATCH-CASE is similar to CASE except that the value must be a string, and the
cases are strings or regex symbols. Handy for simulating the implicit outer
match structure of an AWK program, without being limited to just the toplevel.
I tend to find that my serious AWK programs pretty quickly devolve into a BEGIN
clause that calls a bunch of functions to do the real work, and this macro
(along with the next few) really makes this cleaner, allowing me to reuse that
nifty auto-looping / splitting / matching mechanism wherever I need it.
MATCH-WHEN takes a set of forms similar to the toplevel CLAWK forms (with
BEGIN, END, and other pattern clauses), but unlike the toplevel it does no
looping. It executes all the BEGIN clauses first, then conditionally executes
each of the pattern clauses, then executes all the END clauses. It winds up
looking a lot like COND, except that it recognizes several special flavors of
test expression, and is basically equivalent to
(progn (when test1 . consequent1)
       (when test2 . consequent2)
       ...)
As with real AWK, the default action is to print the current line.
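The difference from COND can be seen in plain Common Lisp, without CLAWK
loaded. The sketch below only mirrors the PROGN/WHEN expansion described
above. MATCHING-TAGS and its three tests are hypothetical, chosen for
illustration.

```lisp
;; Plain Common Lisp. Every clause whose test succeeds runs, in order,
;; whereas COND would stop at the first successful test.
(defun matching-tags (line)
  (let ((tags '()))
    (progn
      (when (zerop (length line))
        (push :empty tags))
      (when (find #\: line)
        (push :has-colon tags))
      (when (and (plusp (length line))
                 (char= (char line 0) #\#))
        (push :comment tags)))
    (nreverse tags)))

(matching-tags "# a: b")   ; => (:HAS-COLON :COMMENT)
```

Two clauses fire for the same line here. That is the behaviour an AWK program
has when several patterns match one record.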
DO-FILE-LINES is a macro that takes a filename and a set of body forms. It
opens the file in read mode, and loops over the lines in the file binding the
line variable to the current line and executing the body forms. It doesn't do
any line splitting, though it maintains the *NR*, *FNR*, and $0 variables.
Executing (NEXT) within the body will restart the body at the next input line.
DO-STREAM-LINES is a closely related version that takes a stream (or t for the
standard input stream).
DO-FILE-FIELDS is a macro that takes a filename, an optional field-list, and a
set of body forms. It opens the file in read mode, and loops over the lines in
the file splitting them into the specified field variables (or the $n variables
if no field list was given), and executes the body forms in this new
environment. Because of the splitting overhead, it's not as snappy as
FOR-FILE-LINES. Executing (NEXT) within the body will restart the processing at
the next input line.
DO-STREAM-FIELDS is a closely related version that takes a stream (or t for the
standard input stream).
WHEN-STREAM-FIELDS and WHEN-FILE-FIELDS implement most of the CLAWK toplevel as
a reusable special form. They take an input stream (or file), an optional
field list, and a set of MATCH-WHEN clauses. The BEGIN clauses will be executed
before the file is opened, the pattern clauses will be executed as with
MATCH-WHEN, and the END clauses will be executed after the file is closed. One
limitation is that they don't currently support range patterns (it's pretty
simple to add, I just haven't gotten around to it).
DEFAWK is the highest-level macro, and one that closely mimics the awk
toplevel. It takes a name, a parameter list consisting of &key and &aux
parameters, and a set of MATCH-WHEN clauses. It binds the parameter list to the
variable ARGS, and evaluates the BEGIN clauses. The BEGIN clauses are free to
modify ARGS. Then it loops through the (possibly modified) ARGS, executing the
pattern clauses for each stream and pathname in the list. It handles docstrings
correctly.
The handling of the parameter list is perhaps the oddest thing about the DEFAWK
macro. It is intended to closely match the behavior of a command-line AWK. A
pleasant difference is that keyword parameters are automatically parsed out,
instead of requiring you to do it manually like in real AWK.
The $+, $-, $*, $/, $REM, $EXPT, $++, $==, $<, $>, $<=, $>=, $/=, $MIN, $MAX,
$ZEROP, $LENGTH, $PRINT functions can take numbers and strings and do the AWK
thing, coercing the arguments back and forth as necessary. $++ is the
concatenation operator (in AWK this is indicated by space, which clearly
doesn't work in the Lisp world). $LENGTH is overloaded on associative arrays
(i.e. hashtables) to return the HASH-TABLE-COUNT.
The generic functions INT, STR, and NUM handle the common coercions. AWK does
provide an INT function, but string coercions are done by concatenating the
value with the empty string, and numeric coercions are done by adding 0 to the
value. While these kludges also work in CLAWK, the NUM and STR functions are to
be preferred.
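To make the idea of these coercions concrete, here is a small stand-in
written in plain Common Lisp. AWK-STYLE-NUM and AWK-STYLE-STR are hypothetical
names. They are not CLAWK's NUM and STR and may differ from them in detail.
For instance, real AWK accepts a numeric prefix such as "3abc", which this
simplified version treats as 0.

```lisp
;; Hypothetical stand-in for an AWK-style numeric coercion (not CLAWK's NUM).
(defun awk-style-num (x)
  (etypecase x
    (number x)
    (string
     ;; Read the string as Lisp data with read-time evaluation disabled.
     ;; Anything that fails to read, or reads as a non-number, counts as 0.
     ;; Note that reading a word such as "Beth" interns a symbol.
     (let ((value (ignore-errors
                    (let ((*read-eval* nil))
                      (read-from-string x)))))
       (if (numberp value) value 0)))))

;; Hypothetical stand-in for a string coercion (not CLAWK's STR).
(defun awk-style-str (x)
  (if (stringp x) x (princ-to-string x)))

(* (awk-style-num "4.00") (awk-style-num "10"))   ; => 40.0
(awk-style-num "Beth")                            ; => 0
(awk-style-str 22)                                ; => "22"
```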
The $ARRAY, $AREF, $FOR, $IN, and $DELETE functions (and macros) implement
AWK-like associative arrays (including the amusing SUBSEP behavior for multiple
indices), except that unlike AWK there's nothing to keep you from nesting these
"arrays".
$0, $1, ..., $20 are symbol-macros that access the fields of the current line.
$#0, $#1, ..., $#20 do also, but they attempt to interpret the value as a
number.
%0, %1, ..., %20 are symbol-macros that index into the recent submatches. %#0,
%#1, ..., %#20 do also, but they attempt to interpret the submatch value as a
number.
This system also defines a clawk-user package.