(documentation from: <url:http://www.geocities.com/mparker762/clawk.html>)

CLAWK Package

This package implements nearly all the features provided by the Unix AWK
language, albeit with a pretty lisp-y flavor. In addition, it provides a large
set of macros for doing more complicated processing beyond that provided by
AWK.

To give you an idea of the flavor of CLAWK programming, here's an example from
Ch 1 of The Awk Programming Language that calculates the number of employees,
total pay for all employees, and average pay.

        { pay = pay + $2 * $3 }
    END { print NR, "employees"
          print "total pay is", pay
          print "average pay is", pay/NR
        }

The book gives the text file "emp.data" as example input.

Beth    4.00    0
Dan     3.75    0
Kathy   4.00    10
Mark    5.00    20
Mary    5.50    22
Suzie   4.25    18

When the AWK program (called, say, SAMPLE.AWK) is executed with the command

$ awk -f sample.awk emp.data

it produces the following output:

6 employees
total pay is 337.5
average pay is 56.25


Here's a straightforward translation of SAMPLE.AWK using the CLAWK package:

    (defawk sample (&aux (pay 0))
      (t   (incf pay ($* $2 $3)))
      (END ($print *NR* "employees")
           ($print "total pay is" pay)
           ($print "average pay is" (/ pay *NR*))) )

When this function is executed with (sample "emp.data"), it produces the same
output:

6 employees 
total pay is 337.5 
average pay is 56.25 
NIL

(the NIL is the return value from the SAMPLE function).

That's nearly as concise as the original AWK. Here's a version that doesn't use
the DEFAWK macro, and tries to be a bit more Lispish.

    (defun sample (filename &aux (pay 0))
      (do-file-fields (filename (name payrate hrsworked))
        (declare (ignore name))
        (incf pay (* (num payrate) (num hrsworked))))
      (format t "~%~D employees" *NR*)
      (format t "~%total pay is ~F" pay)
      (format t "~%average pay is ~F" (/ pay *NR*)) )

... which prints out the same thing as above.

Here's another example from The AWK Programming Language that shows off the
regular expressions a little better. This program performs some basic validity
checks on a Unix password file.

    BEGIN {
        FS = ":" }
    NF != 7 {
        printf("line %d, does not have 7 fields: %s\n", NR, $0)}
    $1 ~ /[^A-Za-z0-9]/ {
        printf("line %d, nonalphanumeric user id: %s\n", NR, $0)}
    $2 == "" {
        printf("line %d, no password: %s\n", NR, $0)}
    $3 ~ /[^0-9]/ {
        printf("line %d, nonnumeric user id: %s\n", NR, $0)}
    $4 ~ /[^0-9]/ {
        printf("line %d, nonnumeric group id: %s\n", NR, $0)}
    $6 !~ /^\// {
        printf("line %d, invalid login directory: %s\n", NR, $0)}

And this is the same program using CLAWK:

    (defawk checkpass ()
      (BEGIN
       (setq *FS* ":"))
      ((/= *NF* 7)
       (format t "~%line ~D, does not have 7 fields: ~S" *NR* $0))
      ((~ $1 #/[^A-Za-z0-9]/)
       (format t "~%line ~D, nonalphanumeric user id: ~S" *NR* $0))
      (($== $2 "")
       (format t "~%line ~D, no password: ~S" *NR* $0))
      ((~ $3 #/[^0-9]/)
       (format t "~%line ~D, nonnumeric user id: ~S" *NR* $0))
      ((~ $4 #/[^0-9]/)
       (format t "~%line ~D, nonnumeric group id: ~S" *NR* $0))
      ((!~ $6 #/^\//)
       (format t "~%line ~D, invalid login directory: ~S" *NR* $0)) )

Which you must admit is awfully close to the original AWK. Since this uses the
#/ readmacro you must evaluate (clawk:install-regex-syntax) before evaluating
this sample.

CLAWK summary

The INSTALL-REGEX-SYNTAX function sets up a readmacro on #/ that reads up to
the next unescaped / and interns the resulting string as a symbol in the
current package. This
(a) gives a nice AWK-ish syntax, (b) results in an object that can be printed
readably and interacts well in the listener even without Dynamic Windows, and
(c) gives CLAWK a convenient place to hook the compiled matcher and some
associated information, speeding things up nicely. You can still specify these
things using the |...| syntax, but it's not quite as nice, since #\| is a
common character in regex patterns.

The INSTALL-CMD-SYNTAX function sets up a readmacro on #` that reads up to the
next unescaped ` character and expands into a call to
CLAWK::CALL-SYSTEM-CATCHING-OUTPUT. This function sends the command to the
shell and returns an input stream containing the output of the command. This
stream can then be used by CLAWK:FOR-STREAM-LINES or any of the CL stream
functions. The readmacro is defined for all non-Symbolics systems, although
currently only LispWorks is supported by CALL-SYSTEM-CATCHING-OUTPUT.

The various special variables *FS*, *NR*, *NF*, *RSTART*, *RLENGTH*, etc. are
fairly obvious if you've ever used AWK, as are the SUB, GSUB, SPLIT, INDEX,
MATCH, SUBSTR, and ~ and !~ functions. $SUB, $GSUB, $SPLIT, $INDEX, $MATCH, and
$SUBSTR do the same, but they attempt to coerce the various parameters to the
right type (so you can pass in a string for a parameter that expects a number).

WITH-PATTERNS is a macro that cleans up the syntax for using
dynamically-generated regex patterns. A future extension is for this to use the
fast regex compiler instead of the closure-based regex compiler when the
compiler is available and the pattern is a string literal.

WITH-FIELDS is a macro for doing line-splitting into variables. WITH-FIELDS
takes an optional field list (in destructuring-bind syntax) and an optional
string (default is the current line -- see DO-LINES) and field-separator
(default is *FS*). If you don't give a set of variables to use, it will use the
$ variables. It splits the line using SPLIT, and executes the body forms in an
environment containing the variables. The $n variables are locally bound
special, so they will revert on exit.
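
As a rough illustration of the idea behind WITH-FIELDS (split a line, then bind
the pieces to variables), here is a sketch in plain Common Lisp. It does not
use CLAWK at all: SPLIT-WHITESPACE and PAY-FOR-LINE are hypothetical helpers
made up for this example, and the real macro's splitting rules may well differ.

```lisp
;; Plain Common Lisp sketch; SPLIT-WHITESPACE is a hypothetical helper,
;; not CLAWK's SPLIT.
(defun split-whitespace (line)
  "Return the space- or tab-separated fields of LINE as a list of strings."
  (let ((fields '())
        (start nil))
    (dotimes (i (length line))
      (if (member (char line i) '(#\Space #\Tab))
          ;; A separator ends the field in progress, if there is one.
          (when start
            (push (subseq line start i) fields)
            (setf start nil))
          ;; Any other character starts a field unless one is already open.
          (unless start (setf start i))))
    (when start (push (subseq line start) fields))
    (nreverse fields)))

(defun pay-for-line (line)
  "Bind the three fields of LINE to variables and return rate * hours."
  (destructuring-bind (name payrate hrsworked) (split-whitespace line)
    (declare (ignore name))
    ;; Disable #. evaluation before reading numbers from input text.
    (let ((*read-eval* nil))
      (* (read-from-string payrate) (read-from-string hrsworked)))))
```

With the emp.data line for Kathy, (pay-for-line "Kathy   4.00    10") binds
the fields "Kathy", "4.00" and "10" and returns 40.0.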

WITH-SUBMATCHES is a macro for processing register submatches. It takes a list
of variables and a set of body forms, and binds the variables to the strings
corresponding to the register submatches (or nil if the register didn't match),
and evaluates the body forms in this environment. This is particularly handy
for use in the consequent clauses of match-case.

MATCH-CASE is similar to CASE except that the value must be a string, and the
cases are strings or regex symbols. Handy for simulating the implicit outer
match structure of an AWK program, without being limited to just the toplevel.
I tend to find that my serious AWK programs pretty quickly devolve into a BEGIN
clause that calls a bunch of functions to do the real work, and this macro
(along with the next few) really makes this cleaner, allowing me to reuse that
nifty auto-looping / splitting / matching mechanism wherever I need it.

MATCH-WHEN takes a set of forms similar to the toplevel CLAWK forms (with
BEGIN, END, and other pattern clauses), but unlike the toplevel it does no
looping. It executes all the BEGIN clauses first, then conditionally executes
each of the pattern clauses, then executes all the END clauses. It winds up
looking a lot like COND, except that it recognizes several special flavors of
test expression, and is basically equivalent to

    (progn (when test1 . consequent1)
           (when test2 . consequent2)
           ...)

As with real AWK, the default action is to print the current line.
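
To make the difference from COND concrete, here is a small sketch in plain
Common Lisp (no CLAWK involved) of the PROGN/WHEN shape shown above. The
function names and tests are invented for the example. Every clause whose test
succeeds runs, whereas COND stops at the first one.

```lisp
(defun classify-all (n)
  "Run every clause whose test holds, in order, collecting a tag for each."
  (let ((hits '()))
    (progn
      (when (evenp n) (push :even hits))
      (when (plusp n) (push :positive hits))
      (when (> n 100) (push :large hits)))
    (nreverse hits)))

(defun classify-first (n)
  "COND version: only the first clause whose test holds is used."
  (cond ((evenp n) :even)
        ((plusp n) :positive)
        ((> n 100) :large)))
```

For n = 200, CLASSIFY-ALL returns (:EVEN :POSITIVE :LARGE) while CLASSIFY-FIRST
returns just :EVEN.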

DO-FILE-LINES is a macro that takes a filename and a set of body forms. It
opens the file in read mode, and loops over the lines in the file binding the
line variable to the current line and executing the body forms. It doesn't do
any line splitting, though it maintains the *NR*, *FNR*, and $0 variables.
Executing (NEXT) within the body will restart the body at the next input line.

DO-STREAM-LINES is a closely related version that takes a stream (or t for the
standard input stream).
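
The general shape of such a line loop (a record counter plus a way to skip
ahead to the next line) can be sketched in plain Common Lisp. This is only an
illustration of the pattern, not how CLAWK implements it: NUMBERED-NONBLANK-LINES,
the local NEXT function, and the NR variable are all made up for the example.

```lisp
(defun numbered-nonblank-lines (stream)
  "Read STREAM line by line, counting every line, and collect (NR . LINE)
for the non-empty ones. Returns the collected list and the final count."
  (let ((nr 0)
        (kept '()))
    (loop for line = (read-line stream nil nil)
          while line
          do (incf nr)
             (block body
               ;; NEXT abandons the rest of the body for this line only.
               (flet ((next () (return-from body)))
                 (when (string= line "") (next))
                 (push (cons nr line) kept))))
    (values (nreverse kept) nr)))
```

For example, reading the three lines "a", "" and "b" through
WITH-INPUT-FROM-STRING yields ((1 . "a") (3 . "b")) and a count of 3.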

DO-FILE-FIELDS is a macro that takes a filename, an optional field-list, and a
set of body forms. It opens the file in read mode, and loops over the lines in
the file splitting them into the specified field variables (or the $n variables
if no field list was given), and executes the body forms in this new
environment. Because of the splitting overhead, it's not as snappy as
FOR-FILE-LINES. Executing (NEXT) within the body will restart the processing at
the next input line.

DO-STREAM-FIELDS is a closely related version that takes a stream (or t for the
standard input stream).

WHEN-STREAM-FIELDS and WHEN-FILE-FIELDS implement most of the CLAWK toplevel as
reusable special forms. Each takes an input stream (or file), an optional
field list, and a set of MATCH-WHEN clauses. The BEGIN clauses will be executed
before the file is opened, the pattern clauses will be executed as with
MATCH-WHEN, and the END clauses will be executed after the file is closed. One
limitation is that they don't currently support range patterns (it's pretty
simple to add, I just haven't gotten around to it).

DEFAWK is the highest-level macro, and one that closely mimics the awk
toplevel. It takes a name, a parameter list consisting of &key and &aux
parameters, and a set of MATCH-WHEN clauses. It binds the parameter list to the
variable ARGS, and evaluates the BEGIN clauses. The BEGIN clauses are free to
modify ARGS. Then it loops through the (possibly modified) ARGS, executing the
pattern clauses for each stream and pathname in the list. It handles docstrings
correctly.

The handling of the parameter list is perhaps the oddest thing about the DEFAWK
macro. It is intended to closely match the behavior of a command-line AWK. A
pleasant difference is that keyword parameters are automatically parsed out,
instead of requiring you to do it manually like in real AWK.

The $+, $-, $*, $/, $REM, $EXPT, $++, $==, $<, $>, $<=, $>=, $/=, $MIN, $MAX,
$ZEROP, $LENGTH, $PRINT functions can take numbers and strings and do the AWK
thing, coercing the arguments back and forth as necessary. $++ is the
concatenation operator (in AWK this is indicated by space, which clearly
doesn't work in the Lisp world). $LENGTH is overloaded on associative arrays
(i.e. hashtables) to return the HASH-TABLE-COUNT.

The generic functions INT, STR, and NUM handle the common coercions. AWK does
provide an INT function, but string coercions are done by concatenating the
value with the empty string, and numeric coercions are done by adding 0 to the
value. While these kludges also work in CLAWK, the NUM and STR functions are to
be preferred.
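
For readers who haven't used AWK, the coercions being imitated look roughly
like the following sketch in plain Common Lisp. It is not CLAWK's code:
AWK-NUM, AWK-STR, AWK-CONCAT and AWK+ are hypothetical stand-ins written for
this example, and they only handle an optional sign, digits, and a decimal
fraction (no exponents).

```lisp
(defun awk-num (value)
  "Numbers pass through; a string yields its leading numeric prefix, or 0."
  (if (numberp value)
      value
      (let* ((s (string-left-trim '(#\Space #\Tab) value))
             (end (length s))
             (pos 0) (sign 1) (int 0) (frac 0) (scale 1) (fractional nil))
        ;; Optional leading sign.
        (when (and (< pos end) (member (char s pos) '(#\+ #\-)))
          (when (char= (char s pos) #\-) (setf sign -1))
          (incf pos))
        ;; Integer digits.
        (loop while (and (< pos end) (digit-char-p (char s pos)))
              do (setf int (+ (* int 10) (digit-char-p (char s pos))))
                 (incf pos))
        ;; Optional fraction; anything after it is ignored.
        (when (and (< pos end) (char= (char s pos) #\.))
          (incf pos)
          (loop while (and (< pos end) (digit-char-p (char s pos)))
                do (setf fractional t
                         frac (+ (* frac 10) (digit-char-p (char s pos)))
                         scale (* scale 10))
                   (incf pos)))
        (if fractional
            (* sign (float (+ int (/ frac scale))))
            (* sign int)))))

(defun awk-str (value)
  "Strings pass through; anything else is printed to a string."
  (if (stringp value) value (princ-to-string value)))

(defun awk-concat (&rest values)
  "Concatenate VALUES after coercing each one to a string."
  (apply #'concatenate 'string (mapcar #'awk-str values)))

(defun awk+ (&rest values)
  "Add VALUES after coercing each one to a number."
  (reduce #'+ values :key #'awk-num :initial-value 0))
```

Under these assumptions (awk+ "4.00" 10) gives 14.0, (awk-num "abc") gives 0,
and (awk-concat 1 "a" 2) gives "1a2".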

The $ARRAY, $AREF, $FOR, $IN, and $DELETE functions (and macros) implement
AWK-like associative arrays (including the amusing SUBSEP behavior for multiple
indices), except that unlike AWK there's nothing to keep you from nesting these
"arrays".

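The SUBSEP behavior mentioned above comes from AWK itself: a multi-index
subscript is joined into one string key with SUBSEP (by default the character
with code 28, written "\034" in AWK) between the parts. The following plain
Common Lisp sketch shows that idea with an EQUAL hash table. It is not CLAWK's
implementation, and *DEMO-SUBSEP*, JOIN-SUBSCRIPTS and DEMO-AREF are names
invented for the example.

```lisp
(defparameter *demo-subsep* (string (code-char 28))
  "Stand-in for AWK's SUBSEP; character code 28 is AWK's default.")

(defun join-subscripts (subscripts)
  "Print each subscript and join them into one string key."
  (with-output-to-string (out)
    (loop for (sub . more) on subscripts
          do (princ sub out)
             (when more (write-string *demo-subsep* out)))))

(defun demo-aref (table &rest subscripts)
  "Look up the entry stored under the joined SUBSCRIPTS."
  (gethash (join-subscripts subscripts) table))

(defun (setf demo-aref) (value table &rest subscripts)
  "Store VALUE under the joined SUBSCRIPTS."
  (setf (gethash (join-subscripts subscripts) table) value))

;; The two-index key ("x", 1) and the single pre-joined string are the
;; same key, so the table ends up with one entry.
(defun demo-subsep-collision ()
  (let ((table (make-hash-table :test #'equal)))
    (setf (demo-aref table "x" 1) :first)
    (setf (demo-aref table (format nil "x~C1" (code-char 28))) :second)
    (values (demo-aref table "x" 1) (hash-table-count table))))
```
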
$0, $1, ..., $20 are symbol-macros that access the fields of the current line.
$#0, $#1, ..., $#20 do also, but they attempt to interpret the value as a
number.

%0, %1, ..., %20 are symbol-macros that index into the recent submatches. %#0,
%#1, ..., %#20 do also, but they attempt to interpret the submatch value as a
number.

This system also defines a clawk-user package.