#
#	INTRO.txt - INTRO to the CRM114 DISCRIMINATOR
#
# Copyright 2000-2009 William S. Yerazunis.
# This file is under GPLv3, as described in COPYING.
#

 		    INTRO to the CRM114 DISCRIMINATOR

		 Copyright (c) W.S.Yerazunis, 2000-2009

		    Last update - 2 March 2009

---------------------------------------------------------------------------

	DANGER, WILL ROBINSON!!  TAKE COVER, DR. SMITH!!!!!!!!!!

	CRM114 IS STILL UNDER DEVELOPMENT AND EXPANSION.  YOU MAY
	FIND THAT THE LANGUAGE CHANGES OUT FROM UNDER YOU.  BUGS,
	MISFEATURES, OR EVEN EXPLOITS MAY LURK WITHIN THIS CODE.

		IT IS SUPPLIED "AS-IS", WITH NO WARRANTY!
 		   SEE THE GPL LICENSE FOR DETAILS.
		----------------------------------------

This document is the programmer's introduction to CRM114 Discriminator.

If you are reading this to get information on how to install
CRM114 as a mailfilter, you have the _wrong_ document.

But fear not, we _do_ have the document you want.  To learn how to
install CRM114 as a mailfilter, read:

    CRM114_Mailfilter_HOWTO.txt

which will tell you everything you need to know about how to install,
activate, and train the CRM114 mailfilter.



-------------------------------------------------------------------------

	      Before We Begin In Earnest, A Few Choice Quotes:


     "It's not ugly like PERL.  It's a whole different _kind_ of ugly."
		-John Bowker, on hearing the design details.

                        ------------------

  "The CRM-114 Discriminator is designed not to receive at _all_.  That
  is, not unless the message is preceded by the proper 3-letter code
  group."
      - George C. Scott, as General Buck Turgidson, _Dr. Strangelove_

                        ------------------

    C views the entire world as if your only tool is a hammer.  CRM114
	views the world as if your only good tools are a set of
	scissors and a roll of sticky splicing tape.

                        ------------------

      "What is this?  Some kind of grep bitten by a radioactive spider?"
			-me



CRM114 is a language designed to write filters in.  It caters to
filtering email, system log streams, html, and other marginally
human-readable ASCII that may have occasion to grace your computer.

CRM114's unique strengths are the data structure (everything is
a string and a string can overlap another string), its ability
to work on truly infinitely long input streams, its ability to
use extremely advanced classifiers to sort text, and the ability
to do approximate regular expressions (that is, regexes that
don't quite match) via the TRE regex library.

CRM114 also sports a very powerful subprocess control facility, and
a unique syntax and program structure that puts the fun back in
programming (OK, you can run away screaming now).  The syntax is
declensional rather than positional; the type of quote marks around
an argument determines what that argument will be used for.

The typical CRM114 program uses regex operations more often
than addition (in fact, math was only added to CRM114 in the waning
days of 2003, well after CRM114 had been in daily use for over
a year and a half).

In other words, crm114 is a very VERY powerful mutagenic filter that
happens to be a programming language as well.

The filtering style of the CRM-114 discriminator is based on the fact
that most spam, normal log file messages, or other uninteresting data
is easily categorized by a few characteristic patterns (such as
"Mortgage leads", "advertise on the internet", and "mail-order toner
cartridges").  CRM114 may also be useful to folks who are on multiple
interlocking mailing lists.

In a bow to Unix-style flexibility, by default CRM114 reads its
input from standard input, and by default sends its output to
standard output.  Note that the default action has a zero-length
output.  Redirection and use of other input or output files is
possible, as well as the use of windowing, either delimiter-based or
time-based, for real-time continuous applications.

CRM114 can be used for other than mail filtering; consider it to be
a version of GREP with super powers.  If Perl is a seventy-bladed Swiss
army knife, CRM114 is a razor-sharp katana that can talk.


----- How CRM114 Is Different From ...  -----

CRM114 is different from procmail in that:

* CRM114 code is readable by the uninitiated, while procmail code
   looks like modem noise.

* CRM114 allows looping

* CRM114 allows gotos

* CRM114 allows nested statements in a useful way

* CRM114 can learn, if you want.

* CRM114 uses per-match control flags, rather than procmail's
   per-recipe control flags, and the control flags are words, not
   cryptocharacters.

* CRM114 separates mail processing from mail delivery, rather than
  conflating the two.


-----

CRM114 is different from awk / gawk / perl / grep in that:

* CRM114 is entity-oriented, and views the entire input as a
  single structured entity (structure is imposed during processing,
  rather than from within, as in XML); there is no concept of "lines",
  "words", "stanzas" or "records" unless you choose to put them there.

* CRM114 tries to avoid the bizarre syntax, mind-reading, and
  action-at-a-distance of Perl.

* CRM114 can learn, if you want.


CRM114 is unique in that:

* CRM114 can use a swept window to manage the amount of data
  retained in each analysis pass; highly useful on log files and
  packet traces.

* CRM114 can learn.


Oh, just for completeness- yes, CRM114 is Turing-complete, as it can
emulate (to within the limits of available memory) a single-tape
Turing machine.  To do this requires an interesting initialization
of the input tape, which is left as an exercise for the reader (backwards
hint: each symbol on the tape has two parts - the logic state, and
a unique identifier; the identifier is used as a marker so that
tape motion "to the left" and "to the right" can be performed).



-----  Anything Else ? -----

Lastly, this guide is just an _introduction_ to CRM114.  It doesn't
explain all of the statements, nor does it fully explain all of the
statements that it does cover.  The QUICKREF quick reference card
makes a much better attempt at covering every capability, at the
expense of a terse format.

If you want the big manual, we have that too- it's on the web page
(but not part of this download; it's big).

And again, CRM114 is GPLed software and a community effort - if you
have an improvement, a bugfix, or even just a bug, please do report
it back on the crm114 mailing list.  You can get on the mailing list
(a closed list, so it won't spam you) via a link on:

   crm114.sourceforge.net



-----  Getting and Installing CRM114  ---------

You should already have the source code.  If you don't, you can fetch
the full kit from Sourceforge.  CRM114 is GPLed; you can use it freely
without asking anyone for permission or paying any licensing
fees.

Open any browser, and go to:

     http://crm114.sourceforge.net

Read the webpage- it will usually have direct clickable links to pull
down both the most recent cutting-edge version of CRM114 (usually for
developers and testers), and the "Recommended for Users" version.

Click on the version you want, and downloading will commence.

Once you have the .gz file(s), you will need to unpack them.  Type:

	tar -zxvf crm114-whateverversion.tar.gz

and the full source directory will be built in your current directory.

Now, cd down into that source directory, become root, and type:

    make install

to build and install the executables and utilities.  If the make
complains of not being able to find the TRE approximate-regex library,
you can either:

    Plan A) install TRE libraries from your distribution.  This is
       recommended, and how to do so varies with your OS. For Ubuntu,
       it is installed with:

       	  sudo apt-get install libtre-dev

or you can:

    Plan B) install TRE libraries manually. Obtain the TRE source directory
       from http://www.laurikari.net/tre/, and compile it statically.

          zcat tre-0.7.5.tar.gz | tar xvf -
	  cd tre-0.7.5
	  ./configure --enable-static
	  make
	  make install

Then try to build CRM114 again.

You can then execute the executable with:

    ./crm [<arg> [<arg> [<arg> [....]]]] .

To install crm114 as a systemwide utility, type "make install" to
install it as /usr/bin/crm so anyone can use it.

Now would be a _good_ time to read the CRM114 QUICK REFERENCE CARD,
which is one of the files you already have.  A lot of it won't make
sense... yet.  But it will, soon enough.


-----  Getting Started -----

Crm114 is a filter, like "grep" or "wc".  It reads from standard
input, and outputs to standard output (mostly- these can be overridden).

By default, crm114 runs your program in the following steps:

   1) it reads your program in
   2) it runs a preprocessor over your program
   3) it runs an incremental microcompiler over your program
   4) it reads standard input until either it hits EOF (^D
      on the keyboard), or until it exhausts the data window size
      (which you can change with the -w parameter; the default at
      version 2003-02-19 is sixteen megabytes).
   5) Then the crm114 runtime system actually runs your program.

Program execution is in a line-by-line, JIT-compiled style.  To speed
things up and detect some errors, CRM114 does a microcompile to
convert your program into a VHL representation which is then
interpreted.  This is not a full compile; since many arguments can
only be evaluated in the dynamic context of a partially-executed
program, a full compilation is not possible in any case.

Put only one statement on a line, if possible (this is the recommended
style).  If you can't, separate the statements with semicolons.

Here's a VERY simple program.

        output /Hello, world! \n/

which accepts an arbitrary input (just hit ^D for now), then outputs

	Hello, world!


Some mechanics- assuming you want to run these programs as
standalones, make sure the first line of your program is a line that
looks like this:

	#! /usr/bin/crm

If you put this at the start of each file, the shell will know your
program is a CRM114 program and will automagically load CRM114 to run
your program.  You will also need to do a "chmod +x yourfilename" to
enable the file as an executable.

If you don't want to do both of these things, you can still run
a bare crm114 program as a command-line argument:

	crm filename

If you just want to dash off a one-liner, you can put the whole
program onto the command line between curly braces (the quotes are so
the shell will pass on your program text without doing any
substitutions.)

   crm '-{ output /Hello, world! \n/ ; }'

Here's another version of the same "Hello World" program:

   crm '-{ output /Hello, world! :*:_nl:/ ; }'


Note the ':*:_nl:' at the end of the output line.  It contains two
parts: the value name :_nl:, which is initialized by crm114 to a
newline (to C programmers, it's a '\n' ).  Putting a ":*" on the front
of a value name means "put my value in here instead of my name".  So,
:*:_nl: turns into a newline character when the output statement is
executed.  (Nota bene: the ':*:' does this name-to-value translation
only once.  So, if you had a value named :foo: with the value ":*:bar:",
and :bar: had the value "FOOLED YOU", :*:foo: would evaluate to
":*:bar:", not to "FOOLED YOU".  If you want this multiple-value
resubstitution, you have to ask for it explicitly by using the :+:
indirection operator instead of the :*: evaluation operator.)

Why does CRM114 evaluate variables only once?  It's so that
you can embed any string you want and know what it will evaluate
to.  Notice in the README that there are : vars for several "tricky"
characters.
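
To make the evaluate-once rule concrete, here is a minimal sketch in
Python (not CRM114 code, and not how the CRM114 runtime is actually
implemented).  The name expand_once and the dictionary of values are
made up for illustration, and the sketch assumes that a reference to
an unknown name is simply left alone, which this document does not
specify.

```python
import re

# Matches a ":*:name:" reference and captures the ":name:" part.
VAR_REF = re.compile(r":\*(:[^:\s]+:)")

def expand_once(text, values):
    """Replace each :*:name: with its value in one left-to-right pass.

    The substituted text is never rescanned, so a value that itself
    looks like a reference stays exactly as it is.
    """
    def lookup(found):
        # Assumption of this sketch: unknown names are left untouched.
        return values.get(found.group(1), found.group(0))
    return VAR_REF.sub(lookup, text)

values = {":_nl:": "\n", ":foo:": ":*:bar:", ":bar:": "FOOLED YOU"}
print(repr(expand_once("Hello, world! :*:_nl:", values)))
print(repr(expand_once(":*:foo:", values)))
```

Running a second pass over the result is what it would take to reach
"FOOLED YOU" in this model; a single pass stops at ":*:bar:".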

Note that I said "value name", not variable.  In truth, crm114 _has_
_no_ _variables_; all data storage can be viewed as start/length pairs
indicating ranges of character strings existing on a few huge strings.

The default string (called the default input window buffer) is filled with
stdin (until EOF) during program startup.  Another string is
initialized with a few standard values and is available for scratch
use as needed.  (Well, _by default_ the input window buffer is filled
from standard input; this can be overridden easily.)

All variables are really captured values - these are just start/length
indices into these big strings.  The power of this is that these
captured values can overlap and so the view of the input data as a
contiguous whole is not disrupted.

These overlapping values retain any hierarchical structure you choose
to impose.  For instance, a multipart message can be easily split and
manipulated, as can the hierarchy of an XML file.

If you need to, you _can_ create temporary, isolated variables - they
are just other sections of a big string buffer that don't happen to be
part of the input buffer (see ISOLATE, below).
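
The start/length idea can be sketched in a few lines of Python.  This
is only a model of the description above, under the assumption that a
captured value is nothing more than a (start, length) pair; the names
window, captures, and value_of are hypothetical and do not come from
the CRM114 sources.

```python
# One big string stands in for the input window buffer.
window = "apple\nfoo\nbanana\n"

# Hypothetical table of captured values: name -> (start, length).
# Nothing is copied; both entries point into the same string.
captures = {
    ":whole_input:": (0, len(window)),
    ":a_foo:": (6, 3),      # lies inside (overlaps) :whole_input:
}

def value_of(name):
    """Read a captured value by slicing the shared window."""
    start, length = captures[name]
    return window[start:start + length]

print(repr(value_of(":a_foo:")))
print(repr(value_of(":whole_input:")))
```

Because both names index the same string, the "foo" seen through
:a_foo: is the very same "foo" seen through :whole_input:.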

Instead of addition and subtraction, the basic operations in crm114
are the matching of one string against another, the capturing of a
value, and the destructive replacement of one value with another.


----- Matching -----


Here's a simple example of a CRM114 program that does string matching.

	#! /usr/bin/crm
	{
		match /foo/
		output /Hey, there's a foo in the input \n/
	}

Try this program.  Give it any input you want (remember to hit ^D to
signal end-of-file if you are typing input from a keyboard).  The
result will be that the program will either do nothing at all, or it
may print out "Hey, there's a foo in the input".

Note that there's no "if" statement here (or, for that matter, in
_any_ crm program).  The MATCH statement is itself an IF statement.
If the match succeeds, execution continues with the next statement.
If the match fails, then execution skips to the end of the { } block.
This "skip to end of block" is called a FAIL in CRM114 slang.

By the way, if you should ever want to force a fail, there is a "fail"
statement just for that.
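
For readers who think in other languages, the match-then-FAIL control
flow can be modelled roughly as follows.  This is a Python sketch, not
CRM114; the Fail exception, match, and run_block are invented names,
and using an exception to leave the block is just one convenient way
to imitate "skip to the end of the { } block".

```python
import re

class Fail(Exception):
    """Hypothetical stand-in for a CRM114 FAIL."""

def match(regex, window):
    """Succeed quietly, or FAIL out of the enclosing block."""
    found = re.search(regex, window, re.DOTALL)
    if found is None:
        raise Fail()
    return found

def run_block(window):
    output = []
    try:                    # plays the role of the opening '{'
        match(r"foo", window)
        # Reached only if the match above succeeded.
        output.append("Hey, there's a foo in the input \n")
    except Fail:            # plays the role of the closing '}'
        pass
    return "".join(output)

print(repr(run_block("seafood\n")))
print(repr(run_block("bar\n")))
```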

Crm114 statements have a general structure that looks like this:

	commands <flags> (vars) [restrictions] /regexes/

You'll find crm114 uses a standardized pattern of commands, then flags
in <>, then vars in (), then substr restrictions in [], then regexes
in // and block structures in {}.  The only required order is that
the command action must come first in a statement (and even that may be
relaxed in the future.)

But, back to programming.  We can change the program just a little, to
look for input files that contain any arbitrary regex-able string.  We
can also change the program to either reject the entire input (and
output nothing - this is the default), or to ACCEPT the entire input
as it currently exists.

As an example, this little program looks for zebras.  If the input
file contains at least one "zebra", it outputs the entire input file.
If it doesn't contain at least one zebra, it outputs nothing.

This program also uses the "accept" statement.  ACCEPT means "take
whatever the current data window is, and write it to standard output."
Many "go/nogo" filters will use ACCEPT as an easy way to ... well,
accept their input as good.

	#! /usr/bin/crm
	{
		match /zebra/
		accept
	}


You don't have to be limited to fixed strings in the match.  You can
use the full POSIX Extended match syntax.  (type 'man 7 regex' to see
more, or look in the QUICKREF.txt file).  You can use backreferences,
such as accepting only files that contain a four-letter palindromic
sequence:

	#! /usr/bin/crm
	{
		match /(.)(.)\2\1/
		accept
	}

You can even use approximate matching, such as accepting any file that
contains a string that can be converted to "Niagara Falls" in no more
than three inserts, deletes, or substitutions:

	#! /usr/bin/crm
	{
		match /(Niagara Falls){~3}/
		accept
	}


CRM114 is built with the TRE REGEX library as you no
doubt read above, and uses the REG_EXTENDED mode of operation
exclusively.  One (current) limitation of TRE is that if you use
approximate regex matches, you can't use backreferences and vice
versa.  Instead of REG_BASIC, TRE offers the <literal> mode, where
no character has special meaning.

Building CRM114 with the GNU regex library is no longer supported.
GNU regex doesn't support approximate regexes, nor <literal> mode,
and back-references like \1 never seem to work right for me, so it is
no longer included in the source code.

As in most POSIX libraries, the first match possible in a string is
the one found, and given that starting point, the longest match
possible with that starting point is used.  Sub-matches (enclosed in
parentheses) are similarly located and extended (first found, then
longest with that starting point).  By default, matches can span
lines; the regex /.*/ with no flags will match the full input window.

Some handy POSIX-extended regexes are:

  ^          as first char of a match, matches only at the start
	     of the matchable block (that is, the first character of
	     the string for most matches, and the first character of
	     a line for <nomultiline> matches).

  $          as last char of a match, matches at the end of the matchable
	     block (that is, the last character of the string, and the
	     last character of the line for <nomultiline> matches).

  .   (a period) matches any _single_ character (except start-of-line or
	    end of line "virtual characters", but it does match a newline).

The following are other POSIX expressions, which mostly do what you'd
guess they'd do from their names.

  [[:alnum:]]
  [[:alpha:]]
  [[:blank:]]
  [[:cntrl:]]
  [[:digit:]]
  [[:lower:]]
  [[:upper:]]
  [[:graph:]]  <-- any character that puts ink on paper or lights a pixel
  [[:print:]]  <-- any character that moves the "print head" or cursor.
  [[:punct:]]
  [[:space:]]
  [[:xdigit:]]

Additionally, a '*' means "repeat preceding zero or more times", a
'+' means "repeat one or more times", and a '?' means "repeat zero or one
time".  *?, +?, and ?? are the same, but match the _shortest_ match that
fits, rather than the longest.

You can specify repeat-counts as well.  {N} means match N copies,
{N,M} means any number of copies between N and M inclusive, and {N,}
means match at least N copies.  (N and M are sadly limited to 255 or
less by POSIX.)

TRE extends POSIX with approximate matching - {~N} means with no more
than N insertions, deletions, and substitutions, and {~} means "closest
match, no matter how many errors".  Note that a string of length
Z can be subjected to Z deletions and therefore "match" the empty
string; watch out for this quaint (but mathematically correct)
behavior if you use {~} matches.  You can also specify some relative
costing between insertions, deletions, and substitutions;  QUICKREF.txt
contains some further examples.
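
If you'd like a feel for what "no more than N insertions, deletions,
and substitutions" means, here is a small Python sketch of approximate
substring matching using a textbook edit-distance table.  It is not
TRE's algorithm (TRE is far more capable and handles full regexes);
approx_contains is a made-up name, and the sketch handles only a plain
literal pattern with every kind of edit costing 1.

```python
def approx_contains(pattern, text, max_errors):
    """True if some substring of text is within max_errors edits of pattern."""
    # column[i] = fewest edits that turn pattern[:i] into a substring
    # of text ending at the current position (possibly an empty one).
    column = list(range(len(pattern) + 1))
    if column[-1] <= max_errors:
        return True     # enough deletions reduce the pattern to ""
    for ch in text:
        new_column = [0]    # a match may start anywhere in text
        for i in range(1, len(pattern) + 1):
            substitute = column[i - 1] + (pattern[i - 1] != ch)
            extra_text_char = column[i] + 1
            missing_pattern_char = new_column[i - 1] + 1
            new_column.append(
                min(substitute, extra_text_char, missing_pattern_char))
        column = new_column
        if column[-1] <= max_errors:
            return True
    return False

print(approx_contains("Niagara Falls", "Visit Niagra Fals today", 3))
print(approx_contains("Niagara Falls", "zzzzzzzz", 3))
```

Note the early return: a pattern no longer than max_errors "matches"
anything, even empty input, which is the same quaint behavior warned
about above.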


-----  Comments -----

Comments in a CRM114 program start with a '#' sign and continue until
either a newline or a "\#".  Note that a ';' (a semicolon) does NOT end
a comment (the reason it doesn't is because the semicolon is too often
found _in_ a comment, whereas \# is pretty rare).

It's a good idea to use "block comments" throughout your CRM114 programs;
even though comments can be deceiving, it's usually better to have them
than not to.


----- Capturing a value from a match -----

We can capture the values matched by the extended regex or even
subparts of the extended regex; any variable name(s) enclosed in
parentheses in the match statement will be attached to successive
parenthesized subexpressions (note- the first variable name, if it
exists, is always bound to the _entire_ matched stream).
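
The binding rule (first name gets the whole match, later names get
the parenthesized subexpressions in order) can be sketched in Python.
This is an illustration only; bind_captures is a hypothetical helper,
and Python's re module stands in for TRE.

```python
import re

def bind_captures(names, regex, text):
    """Bind names to (start, length) pairs, whole match first."""
    found = re.search(regex, text)
    if found is None:
        return None                     # the match FAILed
    bound = {}
    for index, name in enumerate(names):
        if index > found.re.groups:
            break                       # more names than subexpressions
        start, end = found.span(index)  # index 0 is the entire match
        if start >= 0:                  # skip groups that did not take part
            bound[name] = (start, end - start)
    return bound

print(bind_captures([":all:", ":first:", ":second:"],
                    r"(.)(.)\2\1", "xxabbayy"))
```

Here :all: covers "abba", while :first: and :second: cover the "a"
and "b" that the two parenthesized dots matched.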

One additional bit before our next example program: crm114 lets you
see the command line inputs.  These are some of the special temporary
values; they appear as :_arg0: through :_argN:, and "positional"
arguments (those _not_ of the form "--name=value") also appear as :_pos0:
through :_posN: .  By looking at these arguments, we can change our
program's behavior from the command line.

Let's re-write a basic grep then:

	#! /usr/bin/crm
	{
		match (:result:) /(:*:_arg2:)/
		output /:*:result:/
	}

which indeed does function pretty much like grep, except it outputs
only the matching string.  This tells us the string was indeed
present in the input stream, but doesn't give us any context.

We can modify the program to work just like grep, by requiring the entire
match to be satisfied on a single line, and by outputting the
entire line found.

To do this, we use a "modifier flag" on the match statement.  Here,
we want the match statement to be restricted to a single line, so
we use the <nomultiline> modifier flag on the match statement.

Since the match is now limited to just the line that contained the
input pattern, we can put a .* both in front and in back of the
actual :*:_arg2: pattern.  (The pattern ".*" matches the longest string
possible without caring what it's matching.  It's a wildcard string.)

Here's the modified program:

	#! /usr/bin/crm
	{
		match < nomultiline > (:result:) /(.*:*:_arg2:.*)/
		output /:*:result:/
	}

This works reasonably well, except it only shows us the first match.
We can fix that with two more pieces:

  -- the "fromend" flag, which tells the match to start looking for a
     match at the end of the previous match,

and

  -- the LIAF statement, which tells program execution to go back to
     the start of the most recent program { } block and run again.

(By the way, you can redirect any particular OUTPUT command to a file
by supplying the file name (or a variable with the right value) in
[square_brackets] before the /output values/.  To append to a file,
put the <append> flag in the OUTPUT statement; otherwise you will
overwrite the contents of the file.)

The 'liaf' statement is the reverse of "fail".  LIAF tells the
execution to skip UPWARDS in the program, back to the _start_ of the
enclosing { } block.  You can remember that "liaf" is "fail" spelled
backwards, or you can pretend it stands for Loop Iterate Awaiting
Failure; either works as a mnemonic.

Here's the program with the flags and liaf in place; we also put
in a newline in the output so each separate line appears on a new line:

	#! /usr/bin/crm
	{
		match < nomultiline fromend> (:result:) /(.*:*:_arg2:.*)/
		output /:*:result:\n/
		liaf
	}

and sure enough, it acts like grep (without some of the flags that grep
has), but this version of grep can now do approximate matching.

As long as the MATCH succeeds, execution continues through the OUTPUT
statement and hits the LIAF.  The LIAF statement bounces execution up
to the open '{' statement and execution continues from there, down
onto the MATCH statement again.
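
The same loop, written out in Python as a rough model (not CRM114
code): grep_like is an invented name, the pos variable plays the part
of <fromend>, and Python's default "dot does not match newline"
behavior stands in for <nomultiline>.  The guard against zero-length
matches is a safety measure of this sketch, not something taken from
CRM114.

```python
import re

def grep_like(pattern, text):
    """Collect every line of text that contains a match for pattern."""
    line_re = re.compile(".*(?:" + pattern + ").*")
    results = []
    pos = 0                                  # where the previous match ended
    while pos <= len(text):                  # the { ... liaf } loop
        found = line_re.search(text, pos)    # match <nomultiline fromend>
        if found is None:
            break                            # FAIL: leave the block
        results.append(found.group(0))       # output /:*:result:\n/
        # Sketch-only guard: step past an empty match so we cannot spin.
        pos = found.end() + 1 if found.end() == found.start() else found.end()
    return results

sample = "a zebra here\nno stripes\nzebra two\n"
for line in grep_like("zebra", sample):
    print(line)
```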


[ note: If you use this program very much, you'll find that the
pattern in arg2 is used as a regex.  It's not a literal match, but a
match that allows wildcards.  If you wanted to not allow wildcards,
you'd need to specify <literal> as well as <nomultiline> and < fromend>,
or you can use the \Q directive to specify verbatim quoting; \Q.*\E
specifies the string of a dot followed by a star exactly. ]


-----  ALTERing values  ------

In the "like a grep" program above, it was perfectly fine to keep the
result of the match in the captured value :result: (which remained
part of the input buffer).  Let's see what happens if we surgically
alter that value.

The ALTER statement alters the contents of a captured value by
inserting or deleting characters at the start of the variable till the
variable is the same length as the new value, then overwriting the old
characters with new characters.  The length of the captured value
changes; so do the starts and lengths of any variable that overlaps
the captured variable or that would have been affected by the
insertions or deletions.

Here's an example. This program surgically alters the input, by replacing
the first 'foo' with "IT'S A BAR NOW".

	#! /usr/bin/crm
	{
		match (:whole_input:) /.*/
		output / The whole input file before ALTERing: \n/
		output /:*:whole_input:/
		output /\n/

		match (:a_foo:) /foo/
		alter (:a_foo:) /IT'S A BAR NOW/

		match (:whole_input:) /.*/
		output / The whole input file after ALTERing: \n/
		output /:*:whole_input:/
	}

Give this program the input:

 apple
 foo
 banana

and the second listing (the one after ALTERing) will read

 apple
 IT'S A BAR NOW
 banana

As you can see, we've destructively altered the value of :a_foo: to
"IT'S A BAR NOW", and this change is reflected in the entire
input buffer.  (Note to students- we really didn't need to match
:whole_input: a second time, but we wanted to drive home the fact that
this really was a surgical operation on the main text body, not on
some copy somewhere.)
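
Here is one way to picture the bookkeeping behind an ALTER, as a
Python sketch.  It is a simplified model, not the CRM114
implementation: alter is a hypothetical function, captures are plain
(start, length) pairs, and the sketch only adjusts captures that fully
contain the altered value or lie entirely after it.  Partially
overlapping captures are left unchanged here, which is a
simplification of this sketch.

```python
def alter(window, captures, name, new_text):
    """Splice new_text over a captured value and fix up the other captures."""
    start, length = captures[name]
    delta = len(new_text) - length
    new_window = window[:start] + new_text + window[start + length:]
    updated = {}
    for other, (s, n) in captures.items():
        if other == name:
            updated[other] = (start, len(new_text))
        elif s >= start + length:
            updated[other] = (s + delta, n)     # entirely after: slides
        elif s <= start and s + n >= start + length:
            updated[other] = (s, n + delta)     # contains it: grows or shrinks
        else:
            updated[other] = (s, n)             # before (or partial overlap)
    return new_window, updated

window = "apple\nfoo\nbanana\n"
captures = {":whole_input:": (0, 17), ":a_foo:": (6, 3), ":fruit:": (10, 6)}
window, captures = alter(window, captures, ":a_foo:", "IT'S A BAR NOW")
print(repr(window))
print(captures)
```

After the call, :whole_input: has grown to cover the longer text and
:fruit: has slid to the right, but it still reads "banana".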

Aside: this program changes only the first foo.  To make it change
_every_ foo, use the LIAF-loop technique above on the match/alter
in the middle.  We also need to initialize our search at the beginning of
the input but not use up any characters; the "match //" statement
does that.  The program crux would now look like:

	...
	   match //
	   {
		match <fromend> (:a_foo:) /foo/
		alter (:a_foo:) /IT'S A BAR NOW/
		liaf
	   }
	...


----- ISOLATE and Isolated Variables -----

The power to surgically alter the input is fine and dandy if we know
precisely what alterations we want to make, but what if we don't want
to mutilate the input, just want to do some specialized searching or
produce a tentative value?  We can do this by ISOLATEing any variable
we want to preserve as separate from the input buffer, and then
putting the desired values into that variable with the ALTER command.

Note that the special ISOLATEd behavior of a variable only lasts as
long as it's not re-assigned by a MATCH.  This is intentional but can
be the source of some misunderstandings because you can ALTER an
ISOLATEd value and you can use its value with :*: and it stays
ISOLATEd, but if you should bind it in a match, its ISOLATEd
property is lost.

An ISOLATEd variable is initialized with the value of a zero-length
string, in case you wondered.  Try this:

  crm '-{ isolate (:foo:) ; output /a:*:foo:z/; }'

(remember to hit ^D so your program doesn't wait for an input that
will never arrive).  You'll get back the result "az", showing that
the value of a freshly isolated variable is a string of length zero.

If you want to set an initial value on an isolated variable, put the
value in /slashes/.  Example:

  crm '-{ isolate (:foo:) / Hi there! / ; output /a:*:foo:z/; }'

which results in:

   a Hi there! z

Lastly, if you ISOLATE a variable that already has a value, the result
is that you make a new copy of the variable.  This is not destructive
of the old copy... it's still there and intact, in case any other variables
happen to be using the same strings.
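
As a rough sketch of the last two points (assuming the input contains at
least one 'foo'; the output labels are only for illustration), the first
ALTER below should change only the ISOLATEd copy, while the second MATCH
rebinds :a_foo: into the input buffer, so the second ALTER changes the
input itself:

	#! /usr/bin/crm
	{
		match (:a_foo:) /foo/
		#  make a private copy; the input buffer is left alone
		isolate (:a_foo:)
		alter (:a_foo:) /bar/
		output / copy: :*:a_foo: \n/
		output / input: :*:_dw: \n/
		#  rebinding with MATCH drops the ISOLATEd property
		match (:a_foo:) /foo/
		alter (:a_foo:) /bar/
		output / input now: :*:_dw: \n/
	}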

It is important to remember that setting a captured value
with a MATCH statement really just changes the start and length of
that variable's pointers, it doesn't change any actual strings in
memory.  Setting a captured value with an ALTER statement actually
_does_ change the string in memory.  More precisely, an ALTER leaves
the start location at the same place, but the old string is deleted,
and the new string is inserted.  Other captured variables may
change as well during an ALTER; it depends on how they overlap
the ALTERed variable.

Here's an example.  This demo file expects you to give it the input string
"abcdefghijklmnop", so type that in as soon as the program starts
(there is no prompt; just type it in, followed by EOF, usually control-D):

   #! /usr/bin/crm
   {
	match <> (:big:) /.*/
	output /----- Whole file -----\n/
	output /:*:big:/
	output /----------------------/
	match <> (:1:) /abcde/
	match <> (:2:) /cde+fg/
	match <> (:3:) /fghij/
	output /\n 1: :*:1:, 2: :*:2:, 3: :*:3: \n/
	output / ---altering--- \n/
	alter (:2:) /CDEEEFG/
	output / 1: :*:1:, 2: :*:2:, 3: :*:3: \n/
	output /----- Whole file -----\n/
	output /:*:big:/
	output /----------------------\n/
	match <> (:big:) /.*/
	output /----- Rematched Whole file -----\n/
	output /:*:big:/
	output /----------------------\n/
   }

Notice how any captured variable that overlapped the ALTERed variable
also changed?  That's both very powerful and rather dangerous- be
careful how you ALTER anything that isn't ISOLATEd.

Input is possible other than via the input window; the 'input'
statement reads a line of input from stdin and puts it into a captured
variable.  The variable is changed just as if it had been ALTERed.  If you
don't want to modify something important, you should ISOLATE this variable
until you have checked that the input is something you want (if the variable
hasn't been captured or ISOLATEd before use, the value is ISOLATEd).

Example:

	#! /usr/bin/crm
	window
	{
	        output /\n ------INPUT TEST ---/
	        input (:x:)
	        output /\n Got: \n:*:x: \n/
	        match [:x:] /foo/
	        output /\n it had a foo/
	}

This little program reads one line of input, outputs the line, and then
searches it for a foo.  If the foo is found, the program confirms this, and
then exits.

Note that match uses [:x:] to specify the input being matched against,
while it uses (:x:) to specify the output of the resulting match.
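
Both forms can appear on the same MATCH.  A minimal sketch, assuming a
hypothetical variable :found: to hold the result; the search is
restricted to :x:, and the matched text is captured in :found: :

	#! /usr/bin/crm
	window
	{
		input (:x:)
		#  search only inside :x:, capture the match in :found:
		match (:found:) [:x:] /foo.*/
		output /\n matched text: :*:found: \n/
	}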


----- WINDOWing through an infinitely long Input -----

You can control the rate and style of input into the input
window with the WINDOW statement.  By default, crm114 reads input
till the first EOF, and then never reads again.  With WINDOW, you
can read as many times as you want, controlling the input buffer
size as well. (this is _very_ handy when you're writing a filter
to monitor an ever-growing syslog file, or sitting on a logging
port that never EOFs).

The WINDOW statement takes one of three flags (see next paragraph),
and two regex patterns.  It deletes characters in the input window
buffer up to and including the first regex, then reads standard input
until it finds the second regex, appending that to the end of the
input buffer.  Using WINDOW in a loop lets your program inch its
way through an infinitely long file (and yes, we do mean
"infinitely"; the program will process the infinitely long input
file one window's worth at a time).

Since regex-matching is slightly expensive in terms of CPU, WINDOW
has three flags that tell it how often to check for the 'got new
input completed' regex.  Those flags are bychar, bychunk, and byeof.
With bychar, the regex is checked on every incoming character (assuming
your input tty is already set to unbuffered operation), bychunk
checks on every input "block" where a "block" is a conveniently large
chunk of I/O, and byeof checks only when an EOF is read.
(don't worry if your input stream is buffered, characters after the
regex are NOT thrown away but saved for the next execution of a
'window' statement.)

One last bit on WINDOWing - if a WINDOW statement is the first
statement in your program that can affect the input window buffer, the
normal crm114 behavior of reading the entire standard input till EOF
is suppressed and your window statement takes over.  If your window
statement doesn't have any arguments, then no input is done, and your
program starts running without waiting for any input at all. Yes, this
is slightly hackish; live with it or come tell me a better way.

Here's an example of a WINDOW - keep reading input, even past EOF,
and look for occurrences of either 'apple' or 'banana'; if either is
found, print a message.  Note that you can't do this with grep because
grep can't re-read past the first EOF, nor can grep mutilate the
output.

	#! /usr/bin/crm
	{
		window <bychar> /\n/ /\n/
		{
			match (:my_fruit:) /apple|banana/
			output /Found fruit: :*:my_fruit: ... good! \n/
		}
		liaf
	}

Now, why would you ever use this?  How about for parsing a syslog file
for security alerts like failed root logins, or attempts to open port
421 ?  :-) Note the liaf-loop above- this is the "recommended" style
to write an infinite loop, or a program that's supposed to run nearly
forever.


----- Matching inside variables -----

We can restrict matching to be inside a particular value
(the value can be isolated).  For example, here's a simple program
that accepts only input files that contain 'apple' in the first
string found that begins with 'START' and ends with 'END'.

	#! /usr/bin/crm
	{
		match (:my_string:) /START.*END/
		match [:my_string:] /apple/
		accept
	}

The bracketed parameters '[:your_variable:]' tell the match statement
to restrict matching to inside the variable mentioned.

One issue: the above example does two things strangely.  First, it's
case-sensitive ("START apple END" works, but "start apple end" doesn't).
Second, after it finds the first 'START whatever END', it commits
to using that one, even if a second one exists.

We can fix the first problem by using the "nocase" flag on both
matches, and fix the second problem with a liaf loop.  But, remember
that a liaf-loop runs until one of the toplevel matches fails,
so we need an escape out of the inner match/accept on 'apple'.
Here's the code:

	#! /usr/bin/crm
	{
		match <nocase fromend> (:my_string:) /START.*END/
		{
			match <nocase> [:my_string:] /apple/
			accept
			exit
		}
		liaf
	}

----- Getting INPUT from other places -----

You can do explicit INPUT of information with the INPUT statement;
the INPUT statement works as follows:

  1) if you don't specify an input filename in square brackets like
  this

      [ myfile.txt ]

  then input will read from stdin (a clearerr() is done first, so if
  you've already hit EOF on stdin, you will be able to read past
  that EOF should more input be available.)

  2) if you specify <byline>, only the first line of the input file
  is read.
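
Putting these two rules together, a sketch that reads just the first
line of a file might look like this (myfile.txt and :first_line: are
placeholder names, and the bare WINDOW is only there so the program
doesn't wait on stdin):

	#! /usr/bin/crm
	window
	{
		isolate (:first_line:)
		#  <byline> stops after one line; [ ] names the file to read
		input <byline> (:first_line:) [ myfile.txt ]
		output /First line: :*:first_line: \n/
	}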


----- Getting a quick hashcode -----

At some point, you may want to take a captured value and make some
hashcode or digest.  The HASH statement does this conveniently; HASH
is like ALTER but instead of surgically altering the variable to the
expanded /slashed value/, it expands the slashed value and then takes
a hash of that.  The hash is a 32-bit hash, expressed as an
eight-character hexadecimal string.  You should use HASH in cases
where you need a short index to a long string (for efficiency or
database access), or where you need to provide a hard-to-invert
password check.  (Note: because this is only a 32-bit hash, it's
not particularly secure and should be viewed as a "picket fence",
rather than as a "bank vault door".  Adding a "salt value" to the
/slash pattern/ will greatly increase resistance to dictionary
attacks.  Putting a randomly chosen dictionary word and number
in front of the hashed value and another randomly chosen
number after the hashed value will also help, as will
using a pair of HASHes with different salt values.)

For example:

	#!/usr/bin/crm
	hash (:_dw:) /:*:_dw:/
	accept

will generate a quick-and-dirty hashcode of the input file.

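Note that this hash is NOT cryptographically secure; it can be
broken in a few minutes of CPU time on any modern desktop computer.
If you need security, use a real cryptographic hash (MD5 was the
usual suggestion, but it has since been broken; SHA-256 is a common
choice).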
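
To illustrate the salt idea, here is a sketch that hashes the input with
a salt string in front of it.  The salt text and the variable name
:digest: are made up for the example; pick your own:

	#!/usr/bin/crm
	{
		isolate (:digest:)
		#  the salt is simply part of the /slashed value/ being hashed
		hash (:digest:) /my_secret_salt_:*:_dw:/
		output /:*:digest: \n/
	}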


-----  LEARNing and CLASSIFYing -----

The next two statements in crm114 are the hardest to understand,
because they are the 'learn' and 'classify' statements.  These
statements attempt to identify types of inputs based on word and
phrase similarity.  As of build 20020501, all phrases of up to four
words are weighted equally in the classifier, and as of build
20031215, a better weighting (Bayesian/Markov Modeling) is used to get
improved accuracy.  Builds past 20040101 use chains of five words
for yet more accuracy.

The details of all this are explained in the file
"classify_details.txt", but you don't need to understand them to
use the classifiers.

The LEARN statement updates a file of hashed phrase structures with
the contents of the specified [ ] variable.  If you don't specify an
input variable, the default data window :_dw: is used as the input
buffer.  You will have to specify the classname you want to learn, and a
regex that defines what a "word" is.  For English text, a good regex
is [[:graph:]]+ , which matches any run of printable, nonblank
characters.  The LEARN statement creates a file
with the same name as the classname to be learned, so watch out and
don't clobber a file you want to keep.

The CLASSIFY statement uses two or more of these classname files from
LEARN to classify an input buffer into types.  As with LEARN, the
CLASSIFY statement accepts a [ ] input variable containing the text to
classify.  If you don't specify an input variable, the default data
window :_dw: is used.  You specify any number of classes (each one
must have a preexisting hashed phrase file) and a regex to define a
word (again, [[:graph:]]+ is a good place to start).

CLASSIFY then compares the input window against each of the classes in
turn.  If the class that best matches the input window occurs _before_
the '|' marker in the list of hashed phrase filenames, 'classify'
succeeds and execution of your program continues with the next line.
If the class that best matches the input window occurs after the '|',
then the classify statement fails to match, and execution skips to the
end of the { } block (just like a match statement).

CLASSIFY can take a second variable (in parens (:here:) like that)
which will be ALTERed to contain a text-formatted set of matching
statistics.  This can be useful if you want to do some sort of
mathematical comparison or checking.
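
A sketch of how these pieces might fit together, assuming two
hypothetical class files named good.css and bad.css.  First, a training
program that LEARNs whatever is on standard input into one class:

	#! /usr/bin/crm
	{
		learn (good.css) /[[:graph:]]+/
	}

A second copy with bad.css in place of good.css would train the other
class.  Then, a program that CLASSIFYs its input against both, keeping
the statistics in a variable we'll call :stats: :

	#! /usr/bin/crm
	{
		isolate (:stats:)
		{
			classify (good.css | bad.css) (:stats:) /[[:graph:]]+/
			output / input looks like the good class \n/
		}
		#  reached whether or not the CLASSIFY failed
		output /:*:stats: \n/
	}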

----- IF-THEN-ELSE without IF, THEN, or ELSE -----

MATCH and CLASSIFY can act as IF-statements, but what about
IF-THEN-ELSE situations?  For that matter, how can we implement CASE
statements, where we want one (and only one) of N different
alternatives to execute?

The ALIUS statement provides this functionality.  "Alius" is Latin for
"other" or "another" (or, more literally "the other man").

An ALIUS statement looks at the most recently completed bracket-block
of code.  If _that_ bracket-block failed (exited because a MATCH or
CLASSIFY failed, or because of a FAIL statement), then ALIUS is a
no-op and execution continues with the next statement.  If the most
recently completed bracket-block completed successfully (didn't
exit due to a MATCH fail, CLASSIFY fail, or FAIL statement), then
ALIUS skips to the end of the current (outer) bracket-block.  This
is a skip, not a FAIL, so an ALIUS that follows the outer
bracket-block will not see that block as having failed.


Here's an example of ALIUS used for a 3-way case statement:

 #! /usr/bin/crm
 #   test the alius statement
 {
	{
		output /checking for a foo.../
		match /foo/
		output /Found a foo \n/
	}
	alius
	{
		output /no foo... checking for bar.../
		match /bar/
		output /Found a bar.  \n/
	}
	alius
	{
		output /neither foo nor bar \n/
	}
 }
 output / That's all, folks! /


When you run this, you'll see that each MATCH test is applied in
sequence, and as soon as a MATCH succeeds (and so has a bracket-block
complete successfully) the remaining alternatives are skipped and
execution resumes after the outer bracket-block.
You _can_ program this with a lot of goto's, but it's much easier
to use ALIUS.

If ALIUS still confuses you, pretend that ALIUS really means

  "IF THAT WORKED, SKIP THE REST OF THIS BLOCK,

      OTHERWISE

   TRY THIS NEXT BIT OF CODE AND SEE IF IT WORKS OR NOT"

which is pretty much what it does.

----- Minion Processes and Syscalls -----

CRM114 has a fairly powerful mechanism for creating and communicating
with subprocesses, called "minion processes".

You can have an unbounded number of minion processes, and minion
processes can run in parallel with CRM114, repeatedly receiving
input from CRM114 and outputting to CRM114.  The minion processes
can also do other things besides talking to CRM114.

Here's an example program that runs some minion processes; the first
one runs "ls" (and gets a file listing), the second runs 'bc', and
uses bc to calculate 1 + 2 + 3.  We then play some games, running "ls
-la", cat-ting into a file, and using asynchronous input to accommodate
slow programs (or those with HUGE outputs).  This program also uses
the 'window' statement by itself to inhibit any reading of standard
input, so this program just goes off and runs without waiting for any
input.

#! /usr/bin/crm
window
{
	isolate (:lsout:)
	output /\n ----- executing an ls -----\n/
	syscall ( ) (:lsout:) /ls/
	output /:*:lsout:/

	isolate (:calcout:)
	output /\n ----- calculating sum of 1 + 2 + 3 using bc -----\n/
	syscall ( 1 + 2 + 3 \n ) (:calcout:) /bc/
	output /:*:calcout:/

	isolate (:lslaout:)
	output /\n ----- executing an ls -la -----\n/
	syscall ( ) (:lslaout:) /ls -la/
	output /:*:lslaout:/

	isolate (:catout:)
	output /\n ----- outputting to a file using cat -----\n/
	syscall ( This is a cat out \n) (:catout:) /cat > e1.out/
	output /:*:catout:/
	#  note that we expect :catout: to be null

	isolate (:c1: :proc:)
	output /\n ----- keeping a process around ----  \n/
	output /\n preparing... :*:proc:/
	syscall <keep> ( a one \n ) ( ) (:proc:) /cat > e2.out/
	output /\n did one... :*:proc:/
	syscall <keep > ( and a two \n ) () (:proc:) //
	output /\n did it again...:*:proc:/
	syscall ( and a three \n) () (:proc:) //
	output /\n and done ...:*:proc: \n/

	output /\n ----- doing asynchronous reads from a minion-----\n/
	isolate (:lslaout:)
	syscall <keep async> () (:lslaout:) (:proc:) /ls -la \/dev /
	output /--- got this immediate : \n :*:lslaout: \n ---end-----/
	:async_test_sleeploop:
	output /--- sleeping 1 seconds ---/
	syscall <> () () /sleep 1/
	syscall <keep async> () (:lslaout:) (:proc:) //
	output /--- and got this async : \n :*:lslaout: \n ---end-----/
	{
		###  if we got at least three chars, we should look for more.
		match [:lslaout:] /.../
		goto :async_test_sleeploop:
	}
	syscall <> () (:lslaout:) (:proc:) //
	output /--- and synch : \n :*:lslaout: \n ---end-----/
}


----- INSERTing a file verbatim ------

At some point, you may desire to call a second crm114 program from
the current program.  There are two ways you can do this: either SYSCALL
it (as above), or you can INSERT the program text verbatim into your
current program.  Either works; syscalling keeps the variables and
data windows of the two programs separate, while INSERT actually makes
one big program file.

One issue on INSERT - all INSERTs happen at the very start of program
setup, during preprocessing, and way before micro-compilation and
execution, even before the data window gets loaded from standard
input.  This means that the only variable filenames you can INSERT
into your program are those that are defined via command line
arguments; you can't compute :filename: and then INSERT :*:filename:
in your program (the compiler would get very sick if you tried!).
But you _can_ SYSCALL if you really need this functionality.
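
For the simple case of a fixed filename, a sketch looks like this
(common_setup.crm is a made-up name for a file of CRM114 statements you
want pulled in):

	#! /usr/bin/crm
	#  the text of common_setup.crm replaces the next line
	#  before the program is compiled
	insert common_setup.crm
	{
		output /main program starts here \n/
	}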


----- Doing Math and EVAL -----

At some point, you may need to do math, or evaluate a mathematical
expression.  The EVAL statement does this.

EVAL is like ALTER, but instead of evaluating its arguments left to
right once, it repeatedly evaluates the arguments until they stop
changing (EVAL does do a little bit of smart caching so that it can
catch arguments that loop).  EVAL actually keeps a log of the hashes
of each intermediate state and checks this log on each pass of
expansion.  The default as of version 20040210 is 4096 states in the
statelog, and if your program tries to EVAL a string that keeps
changing for more than that number of passes, it's a nonfatal error.

EVAL also defaults to allowing extended var-expansion; in
extended var-expansion the string expansion operator :*: is
retained, but two new ones are added:

	  :#:var:        - returns the number of characters in var

	  :@:math_expr:  - evaluates math_expr and returns the numeric
			   result as a string.

The mathematical expression evaluator can work either in
algebraic notation (with left-to-right precedence, overridden only
by parentheses), or in RPN notation (like an HP calculator).

If you use a relational mathematics operator like >, =, or <,
then EVAL itself will evaluate the truth status of that operator,
putting a 1 or 0 in for true or false, respectively.
After completing the mathematical evaluation and ALTERing the
result variable (if there is one), EVAL will then do one of
the following:

    - if no relational mathematical operator was used, execution
      continues with the next statement.

    - if a relational mathematical operator was used, and the
      relation result was TRUE, execution continues with the
      next statement.

    - if a relational mathematical operator was used, and the
       relation result was FALSE, then EVAL does a FAIL to
       the end of the bracket-block (and an ALIUS statement
       will see this as a FAIL).
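
A sketch of a relational EVAL used as a test, with ALIUS picking up the
FAIL (the comparison and the messages are arbitrary examples):

	#!/usr/bin/crm
	{
		window
		isolate (:z:)
		{
			#  a true relation lets execution continue
			eval (:z:) /:@: (2 * 3) > 5 :/
			output / the relation held, z is :*:z: \n/
		}
		alius
		{
			#  a false relation FAILs to here
			output / the relation did not hold \n/
		}
	}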

Here's an example:

       #!/usr/bin/crm
       {
		window
		isolate (:z:)
		eval (:z:) / The length of 'foo' is :#:foo: letters /
		output /:*:z: \n/
		eval (:z:) / and (2 * 3) + (4 * 5) is :@: (2 * 3) + (4 * 5):/
		output /:*:z: \n/
	}

which gives you:

  The length of 'foo' is 3 letters
  and (2 * 3) + (4 * 5) is 26

which is as you would expect.

----- FAULT and TRAP -----

CRM114 programs can encounter errors during execution; an error can
often be "fixed up" and execution continued, or at least the program can
clean up and exit gracefully.

Whenever an error occurs, it creates a string that describes the
problem.  This string is normally printed out as the error message.
However, it can be used by the program itself to attempt to fix the
problem before the program itself fails.

The TRAP statement is how a program can catch an error before the
program fails.  A TRAP will "catch" almost any program error for
which all of these conditions are true:

     - the error occurs inside the bracket-block that holds the TRAP
       statement,
     - the error occurs above the TRAP statement,
     - and the error message describing the error is matched by the
       TRAP statement's regex.

If the TRAP statement's regex doesn't match the error message,
then the next TRAP outward will be activated, and the process repeats.

If no TRAP can handle the error, then your program will exit if the
error was fatal, or print out the error and continue if the error
was just a warning.

If you need to create your own "errors" during a program run, such
as if you find a file is missing or important data is not properly
formatted, you can force an error with the error message of your
choice with the FAULT statement.  The FAULT statement creates the
fault string you describe, which is still matched against the REGEX
in each enclosing TRAP.

If you have two TRAPs in series, the first TRAP gets first try at
matching the FAULT string, then the second one.
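
Here is a sketch of a FAULT being caught by a TRAP in the same
bracket-block.  The fault text and the variable name :reason: are just
examples, and we're assuming (per the QUICKREF) that the paren variable
on the TRAP receives the fault string:

	#! /usr/bin/crm
	window
	{
		output / about to fault \n/
		fault /important data is missing/
		output / this line is skipped \n/
		#  the fault above jumps here because the regex matches
		trap (:reason:) /missing/
		output / caught: :*:reason: \n/
	}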

Note that there is no "return from TRAP" - once a trap occurs,
the trap code must GOTO or otherwise properly resume execution
in an appropriate place.  The reason for this is that many TRAPs
really aren't "fixable" in the complete sense; the most that can
be done is to issue an error message and exit gracefully.

Additionally, there are some errors that simply aren't recoverable
in a TRAP.  For example, a fault that occurs during preprocessing
or inside the microcompiler can't be caught by a TRAP, because the
TRAP hasn't been compiled yet.  It's also possible to create a
FAULT situation where attempting to read the fault string itself
causes an error.  In this case, TRAP itself can't function and
the error just forces a sad error message and CRM114 will terminate
without grace or honor.


----- In Conclusion -----

This is the end of the Introduction to CRM114.  There are quite a
few statements and options in the QUICKREF that aren't discussed
in this document.

Feel free to explore.  If you come up with a good introduction to the
use of a statement or technique, send it to me and I'll put
it here!


That's it.... a basic introduction to CRM114.  Have fun and don't
break anything.


-----  Appendix 1 - Useful Idioms -----

       A Few Useful Idioms:

* - LIAF-looping - Use the liaf (Loop Iterate Awaiting Failure)
statement to iterate your way through the entire input window.  For example:
	...
	{
		match <fromnext> (:what_you_seek:) /a_regex/
		... # your code goes here
		liaf
	}
	...

will execute your code ONCE for each occurrence of the regex
in the input window.


* - null-WINDOWing: The WINDOW statement normally causes the data window
to be updated.  The exception is the "nonsense" WINDOW statement that
contains neither a cut-to-here regex nor a fill-to-here regex: when it's
the first executable statement of your program, it tells the compiler to
_skip_ all data window input until you specify it later in the program with
a second WINDOW statement (or skip it entirely, if there is no second
WINDOW statement).  Example:

	#!/usr/bin/crm
	{
		window
		output /Hello, world! \n/
	}

doesn't read any input at all.  It just prints out "Hello, world!"


* - file-CATting: to get input from a file rather than from stdin.  The
easiest way to read in an entire file (of reasonable length) is to
"cat" the file into an isolated variable.  E.g.:

	...
	isolate (:my_data:)
	syscall () (:my_data:) /cat < whatever_file_I_want.txt /

If the file is truly huge (larger than fits in an I/O buffer), you
can use the <keep> flag to get only as much as will conveniently fit,
e.g.:

	...
	isolate (:some_data: :my_proc:)
     :loop_here:
	syscall <keep> () (:some_data:) (:my_proc:) /cat \/var\/log\/messages/
	#
	# do something useful here.
	#
	goto :loop_here:


If the result can take a long time to produce (say, because it's going
out over the network to a slow server), then the <async> flag reads only
what is available and returns with that, without waiting for an EOF.

	...
	isolate (:some_data: :my_proc:)
     :loop_here:
	syscall <async> () (:some_data:) (:my_proc:) / cat \/var\/log\/messages /
	#
	# do something useful here.
	#
	goto :loop_here:



* - Processes that return more than 256K of text, possibly infinite
amounts...

Here's a way to cope with processes that return more than 256K of text (the
limit for dynamically allocated heap in some kernels is 256K, so that's why
this artificial limit exists).

This example does an ls -la on /dev, which is usually more than 256K long
(typically around 350K as of Linux kernel 2.4.18).  Note that "do the
work" here is to ACCEPT the contents of the data window; we could do
anything else we wanted instead.

	window
	isolate (:p:)
	{
		syscall <keep> () (:_dw:) (:p:)  /ls -la  \/dev /
		#
		# do the work here...
		{
			accept
		}
		match /.+/
		liaf
	}


The important bits of code here are the syscall to launch the process
(notice it's with the KEEP flag), and the subsequent MATCH /.+/ to check
for more output.  If there is more output, the MATCH passes and the LIAF
kicks us back to the start of the { } block.  If the match fails, the LIAF
is skipped and the program exits.  Cute, eh?

Note that this program will fail if the SYSCALLed program simply is
waiting for a slow network, etc.  Since there's no way to determine
whether a program is just doing a long computation or is truly wedged
(it's a nasty version of the halting problem,
proven by Alan Turing himself to be unsolvable), you'll have to use
some artifice to determine that on a case-specific basis.

Two good things you can try are:

    1) do a SYSCALL to ps(1) with the PID and examine the returned
       string;

    2) do a SYSCALL to sleep(1) for a few seconds and thereby do
       whatever timeout you desire.



* - ALIUS-nesting.  ALIUS checks to see if the most recently finished
bracket-block completed successfully or FAILed out- but ALIUS itself isn't
a FAIL.   So, you can nest ALIUSed conditionals, like this:

  A?
	A1
	or A2?
  B?
	B1
	or B2?


which would look like this:

  {
     {
	match /A/
	{
	   {
 	      match /A1/
	      ...
	   }
           alius
	   {
	      match /A2/
	      ...
	   }
        }
     }
     alius
     {
	match /B/
	{
	   {
 	      match /B1/
	      ...
	   }
           alius
	   {
	      match /B2/
	      ...
	   }
        }
     }
  }

Note how each ALIUS looks at the most recently exited bracket-block,
so nested IF statements don't get confusing (think about how you
would write this in C to see the contrast)


-----

Anyone else have any handy idioms they want to publish?


-----  Things I'd like help on ----

1) if anyone has strong bison-fu, and could give me a hand coming up with
a real parser (not the handcarved crock that's in the current microcompiler)
that would be great.

2) a few programs (like a spamkiller) would be nice... I have one but
it's tailored to *me* .  Suggestions, anyone?  (Yes, there's one in
the distro now; read the README on it!  It's about 99.95 per cent
accurate as it stands, on my personal spam mix; for comparison,
SpamAssassin is only around 90% accurate.)

	-Bill Yerazunis