= Nex =

Nex is a lexer similar to Lex/Flex that:

- generates Go code instead of C code
- integrates with Go's yacc instead of YACC/Bison
- supports UTF-8
- supports nested _structural regular expressions_.

See http://doc.cat-v.org/bell_labs/structural_regexps/se.pdf[Structural
Regular Expressions] by Rob Pike. I wrote this code to get acquainted with Go
and also to explore some of the ideas in the paper. Also, I've always been
meaning to implement algorithms I learned from a compilers course I took many
years ago. Back then, we never coded them; merely understanding the theory was
enough to pass the exam.

Go's standard library has a less general http://golang.org/pkg/scanner/[scanner
package], which is tailored to tokenizing Go source code.

== Installation ==

  $ export GOPATH=/tmp/go
  $ go get github.com/blynn/nex

== Example ==

http://flex.sourceforge.net/manual/Simple-Examples.html[One simple example in
the Flex manual] is a scanner that counts characters and lines. The program is
similar in Nex:

------------------------------------------
/\n/{ nLines++; nChars++ }
/./{ nChars++ }
//
package main
import ("fmt";"os")
func main() {
  var nLines, nChars int
  NN_FUN(NewLexer(os.Stdin))
  fmt.Printf("%d %d\n", nLines, nChars)
}
------------------------------------------

The syntax resembles Awk more than Flex: each regex must be delimited. An empty
regex terminates the rules section and signifies the presence of user code,
which is printed on standard output with `NN_FUN` replaced by the generated
scanner.

Name the above example `lc.nex`. Then compile and run it by typing:

 $ nex -r -s lc.nex

The program runs on standard input and output. For example:

 $ nex -r -s lc.nex < /usr/share/dict/words
 99171 938587

To generate Go code for a scanner without compiling and running it, type:

 $ nex -s < lc.nex  # Prints code on standard output.

or:

 $ nex -s lc.nex  # Writes code to lc.nn.go

The `NN_FUN` macro is primitive, but I was unable to think of another way to
achieve an Awk-esque feel. Purists unable to tolerate text substitution will
need more code:

------------------------------------------
/\n/{ lval.l++; lval.c++ }
/./{ lval.c++ }
//
package main
import ("fmt";"os")
type yySymType struct { l, c int }
func main() {
  v := new(yySymType)
  NewLexer(os.Stdin).Lex(v)
  fmt.Printf("%d %d\n", v.l, v.c)
}
------------------------------------------

and must run `nex` without the `-s` option:

 $ nex lc.nex

We could avoid defining a struct by using globals instead, but even then we
would need a throwaway definition of `yySymType`, as shown in the sketch below.
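
A minimal sketch of that variant (my own, not from the original example;
compiled with `nex lc.nex`, without `-s`) might be:

------------------------------------------
/\n/{ nLines++; nChars++ }
/./{ nChars++ }
//
package main
import ("fmt";"os")
type yySymType struct{} // throwaway: Lex's signature requires it
var nLines, nChars int
func main() {
  NewLexer(os.Stdin).Lex(new(yySymType))
  fmt.Printf("%d %d\n", nLines, nChars)
}
------------------------------------------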

The `yy` prefix can be changed with the `-p` option. When using yacc, it must be given the same prefix:

 $ nex -p YY lc.nex && go tool yacc -p YY && go run lc.nn.go y.go

== Toy Pascal ==

The Flex manual also exhibits a http://flex.sourceforge.net/manual/Simple-Examples.html[scanner for a toy Pascal-like language],
though last I checked, its comment regex was a little buggy. Here is a
modified Nex version, without string-to-number conversions:

------------------------------------------
/[0-9]+/          { println("An integer:", txt()) }
/[0-9]+\.[0-9]*/  { println("A float:", txt()) }
/if|then|begin|end|procedure|function/
                  { println( "A keyword:", txt()) }
/[a-z][a-z0-9]*/  { println("An identifier:", txt()) }
/\+|-|\*|\//      { println("An operator:", txt()) }
/[ \t\n]+/        { /* eat up whitespace */ }
/./               { println("Unrecognized character:", txt()) }
/{[^\{\}\n]*}/    { /* eat up one-line comments */ }
//
package main
import "os"
func main() {
  lex := NewLexer(os.Stdin)
  txt := func() string { return lex.Text() }
  NN_FUN(lex)
}
------------------------------------------

Enough simple examples! Let us see what nesting can do.

== Peter into silicon ==

In ``Structural Regular Expressions'', Pike imagines a newline-agnostic Awk
that operates on matched text, rather than on the whole line containing a
match, and writes code converting an input array of characters into
descriptions of rectangles. For example, given an input such as:

------------------------------------------
    #######
   #########
  ####  #####
 ####    ####   #
 ####      #####
####        ###
########   #####
#### #########
#### #  # ####
## #  ###   ##
###    #  ###
###    ##
 ##   #
  #   ####
  # #
##   #   ##
------------------------------------------

we wish to produce something like:

------------------------------------------
rect 5 12 1 2
rect 4 13 2 3
rect 3 7 3 4
rect 9 14 3 4
...
rect 10 12 16 17
------------------------------------------

With Nex, we don't have to imagine: such programs are real. Below are practical
Nex programs that strongly resemble their theoretical counterparts.
The one-character-at-a-time variant:

------------------------------------------
/ /{ x++ }
/#/{ println("rect", x, x+1, y, y+1); x++ }
/\n/{ x=1; y++ }
//
package main
import "os"
func main() {
  x, y := 1, 1
  NN_FUN(NewLexer(os.Stdin))
}
------------------------------------------

The one-run-at-a-time variant:

------------------------------------------
/ +/{ x+=len(txt()) }
/#+/{ println("rect", x, x+len(txt()), y, y+1); x+=len(txt()) }
/\n/{ x=1; y++ }
//
package main
import "os"
func main() {
  x, y := 1, 1
  lex := NewLexer(os.Stdin)
  txt := func() string { return lex.Text() }
  NN_FUN(lex)
}
------------------------------------------

The programs are more verbose than Awk because Go is the backend.
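
As a quick check, saving the run-at-a-time version as `rect.nex` (a name
chosen for this example) and feeding it a tiny input might give:

 $ printf '##\n #\n' | nex -r -s rect.nex
 rect 1 3 1 2
 rect 2 3 2 3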

== Rob but not robot ==

Pike demonstrates how nesting structural expressions leads to a few simple text
editor commands to print all lines containing "rob" but not "robot". Though Nex
fails to separate looping from matching, a corresponding program is bearable:

------------------------------------------
/[^\n]*\n/ < { isrobot = false; isrob = false }
  /robot/    { isrobot = true }
  /rob/      { isrob = true }
>            { if isrob && !isrobot { fmt.print(lex.Text()) } }
//
package main
import ("fmt";"os")
func main() {
  var isrobot, isrob bool
  lex := NewLexer(os.Stdin)
  NN_FUN(lex)
}
------------------------------------------

The "<" and ">" delimit nested expressions, and work as follows.
On reading a line, we find it matches the first regex, so we execute the code
immediately following the opening "<".

Then it's as if we run Nex again, except we focus only on the patterns and
actions up to the closing ">", with the matched line as the entire input. Thus
we look for occurrences of "rob" and "robot" in just the matched line and set
flags accordingly.

After the line ends, we execute the code following the closing ">" and return
to our original state, scanning for more lines.
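
For instance, assuming the program above is saved as `rob.nex` (a hypothetical
name), a run might look like:

 $ printf 'robot overlords\nrob the bank\n' | nex -r -s rob.nex
 rob the bank

The first line is dropped because "robot", being the longer match, consumes
the only occurrence of "rob", so only `isrobot` is set.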

== Word count ==

We can simultaneously count lines, words, and characters with Nex thanks to
nesting:

------------------------------------------
/[^\n]*\n/ < {}
  /[^ \t\r\n]*/ < {}
    /./  { nChars++ }
  >      { nWords++ }
  /./    { nChars++ }
>        { nLines++ }
//
package main
import ("fmt";"os")
func main() {
  var nLines, nWords, nChars int
  NN_FUN(NewLexer(os.Stdin))
  fmt.Printf("%d %d %d\n", nLines, nWords, nChars)
}
------------------------------------------

The first regex matches entire lines: each line is passed to the first level
of nested regexes. Within this level, the first regex matches words in the
line: each word is passed to the second level of nested regexes. Within
the second level, a regex causes every character of the word to be counted.

Lastly, we also count whitespace characters, a task performed by the second
regex of the first level of nested regexes. We could remove this rule to
count only non-whitespace characters.

== UTF-8 ==

The following Nex program converts Eastern Arabic numerals to the digits used
in the Western world, and also converts Chinese number phrases (the analogue of
something like "one hundred and fifty-three") into digits.

------------------------------------------
/[零一二三四五六七八九十百千]+/ { fmt.Print(zhToInt(txt())) }
/[٠-٩]/ {
  // The above character class might show up right-to-left in a browser.
  // The equivalent of 0 should be on the left, and the equivalent of 9 should
  // be on the right.
  //
  // The Eastern Arabic numerals are ٠١٢٣٤٥٦٧٨٩.
  fmt.Print([]rune(txt())[0] - rune('٠'))
}
/./ { fmt.Print(txt()) }
//
package main
import ("fmt";"os")
func zhToInt(s string) int {
  n := 0
  prev := 0  // most recently seen digit, awaiting its place value
  f := func(m int) {
    if 0 == prev { prev = 1 }  // a bare 十/百/千 means one of that place
    n += m * prev
    prev = 0
  }
  for _, c := range s {
    // The digits 一..九 map to 1..9.
    for m, v := range []rune("一二三四五六七八九") {
      if v == c {
        prev = m + 1
        goto continue2
      }
    }
    // A place value multiplies the pending digit.
    switch c {
    case '零':
    case '十': f(10)
    case '百': f(100)
    case '千': f(1000)
    }
continue2:
  }
  n += prev  // a trailing digit is the ones place
  return n
}
func main() {
  lex := NewLexer(os.Stdin)
  txt := func() string { return lex.Text() }
  NN_FUN(lex)
}
------------------------------------------
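
Assuming the program is saved as `zh.nex` (a hypothetical name), a run might
look like:

 $ echo '一百五十三' | nex -r -s zh.nex
 153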

== nex and Go's yacc ==

The parser generated by `go tool yacc` exports so little that it's easiest to
keep the lexer and the parser in the same package.

Here's a yacc file based on the
http://dinosaur.compilertools.net/bison/bison_5.html[reverse-Polish-notation
calculator example from the Bison manual]:

------------------------------------------
%{
package main
import "fmt"
%}

%union {
  n int
}

%token NUM
%%
input:    /* empty */
       | input line
;

line:     '\n'
       | exp '\n'      { fmt.Println($1.n); }
;

exp:     NUM           { $$.n = $1.n;        }
       | exp exp '+'   { $$.n = $1.n + $2.n; }
       | exp exp '-'   { $$.n = $1.n - $2.n; }
       | exp exp '*'   { $$.n = $1.n * $2.n; }
       | exp exp '/'   { $$.n = $1.n / $2.n; }
	/* Unary minus    */
       | exp 'n'       { $$.n = -$1.n;       }
;
%%
------------------------------------------

We must import `fmt` even when our own code doesn't use it, since the code
generated by yacc needs it. Also, the `%union` declaration is mandatory; it
generates `yySymType`.

Call the above `rp.y`. Then a suitable lexer, say `rp.nex`, might be:

------------------------------------------
/[ \t]/  { /* Skip blanks and tabs. */ }
/[0-9]*/ { lval.n, _ = strconv.Atoi(yylex.Text()); return NUM }
/./ { return int(yylex.Text()[0]) }
//
package main
import ("os";"strconv")
func main() {
  yyParse(NewLexer(os.Stdin))
}
------------------------------------------

Compile the two with:

 $ nex rp.nex && go tool yacc rp.y && go build y.go rp.nn.go

For brevity, we work in the `main` package. In a larger project we might want
to write a package that exports a function wrapped around `yyParse()`. This is
fine, provided the parser and the lexer are both in the same package.
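
A minimal sketch of such a wrapper (the package and function names are my own
invention) might be:

------------------------------------------
package calc

import "io"

// Parse runs the generated parser on r and reports success.
// NewLexer and yyParse are the functions emitted by nex and yacc;
// both generated files must live in this same package.
func Parse(r io.Reader) bool {
  return yyParse(NewLexer(r)) == 0
}
------------------------------------------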

Alternatively, we could use yacc's `-p` option to change the prefix from `yy`
to one that begins with an uppercase letter.

== Matching the beginning and end of input ==

We can simulate Awk's BEGIN and END blocks with a regex that matches the entire
input:

------------------------------------------
/.*/ < { println("BEGIN") }
  /a/  { println("a") }
>      { println("END") }
//
package main
import "os"
func main() {
  NN_FUN(NewLexer(os.Stdin))
}
------------------------------------------

However, this causes Nex to read the entire input into memory. To solve
this problem, Nex supports the following syntax:

------------------------------------------
<      { println("BEGIN") }
  /a/  { println("a") }
>      { println("END") }
package main
import "os"
func main() {
  NN_FUN(NewLexer(os.Stdin))
}
------------------------------------------

In other words, if a bare '<' appears as the first pattern, then its action is
executed before reading the input. The last pattern must be a bare '>', and its
action is executed on end of input.

Additionally, no empty regex is needed to mark the beginning of the Go program.
(Fortunately, an empty regex is also a Go comment, so there's no harm done if
present.)
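
For instance, saving the second program as `ab.nex` (a hypothetical name) and
running it might print:

 $ printf 'aa\n' | nex -r -s ab.nex
 BEGIN
 a
 a
 END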

== Matching Nuances ==

Among rules in the same scope, the longest matching pattern takes precedence.
In the event of a tie, the first pattern wins.
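
To illustrate with a sketch of my own (not from the original manual), consider:

------------------------------------------
/if/      { println("keyword") }
/[a-z]+/  { println("identifier") }
//
package main
import "os"
func main() { NN_FUN(NewLexer(os.Stdin)) }
------------------------------------------

Feeding it `if` should print `keyword`: both patterns match two characters, and
the tie goes to the first rule. Feeding it `iffy` should print `identifier`,
because the four-character match beats the two-character one.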

Unanchored patterns never match the empty string. For example,

  /(foo)*/ {}

matches "foo" and "foofoo", but not "".

Anchored patterns can match the empty string at most once; after the match, the
start or end null strings are "used up" and will not match again.

Internally, this is implemented by omitting the very first check to see if the
current state is accepted when running the DFA corresponding to the regex. An
alternative would be to simply ignore matches of length 0, but I chose to allow
anchored empty matches just in case there turn out to be applications for them.
I'm open to changing this behaviour.

== Contributing and Testing ==

Check out this repo (or a clone) into a directory with the following structure:

  mkdir -p nex/src
  cd nex/src
  git clone https://github.com/blynn/nex.git

The Makefile will put the binary into, e.g., `nex/bin`; see the sketch below.
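
Assuming the default Makefile target performs the build (I have not verified
the target names), the steps from `nex/src` might be:

  cd nex
  make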

== Reference ==

  // NewLexer creates a new Lexer that reads from in.
  func NewLexer(in io.Reader) *Lexer

  // NewLexerWithInit creates a new Lexer object, runs the given callback on it,
  // then returns it.
  func NewLexerWithInit(in io.Reader, initFun func(*Lexer)) *Lexer

  // Lex runs the lexer. Always returns 0.
  // When the -s option is given, this function is not generated;
  // instead, the NN_FUN macro runs the lexer.
  func (yylex *Lexer) Lex(lval *yySymType) int

  // Text returns the matched text.
  func (yylex *Lexer) Text() string

  // Line returns the current line number.
  // The first line is 0.
  func (yylex *Lexer) Line() int

  // Column returns the current column number.
  // The first column is 0.
  func (yylex *Lexer) Column() int