File: plm

package info (click to toggle)
ucblogo 5.5-2
  • links: PTS
  • area: main
  • in suites: etch, etch-m68k
  • size: 6,048 kB
  • ctags: 3,560
  • sloc: ansic: 15,086; lisp: 5,588; makefile: 175
file content (1745 lines) | stat: -rw-r--r-- 70,369 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
1161
1162
1163
1164
1165
1166
1167
1168
1169
1170
1171
1172
1173
1174
1175
1176
1177
1178
1179
1180
1181
1182
1183
1184
1185
1186
1187
1188
1189
1190
1191
1192
1193
1194
1195
1196
1197
1198
1199
1200
1201
1202
1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
1213
1214
1215
1216
1217
1218
1219
1220
1221
1222
1223
1224
1225
1226
1227
1228
1229
1230
1231
1232
1233
1234
1235
1236
1237
1238
1239
1240
1241
1242
1243
1244
1245
1246
1247
1248
1249
1250
1251
1252
1253
1254
1255
1256
1257
1258
1259
1260
1261
1262
1263
1264
1265
1266
1267
1268
1269
1270
1271
1272
1273
1274
1275
1276
1277
1278
1279
1280
1281
1282
1283
1284
1285
1286
1287
1288
1289
1290
1291
1292
1293
1294
1295
1296
1297
1298
1299
1300
1301
1302
1303
1304
1305
1306
1307
1308
1309
1310
1311
1312
1313
1314
1315
1316
1317
1318
1319
1320
1321
1322
1323
1324
1325
1326
1327
1328
1329
1330
1331
1332
1333
1334
1335
1336
1337
1338
1339
1340
1341
1342
1343
1344
1345
1346
1347
1348
1349
1350
1351
1352
1353
1354
1355
1356
1357
1358
1359
1360
1361
1362
1363
1364
1365
1366
1367
1368
1369
1370
1371
1372
1373
1374
1375
1376
1377
1378
1379
1380
1381
1382
1383
1384
1385
1386
1387
1388
1389
1390
1391
1392
1393
1394
1395
1396
1397
1398
1399
1400
1401
1402
1403
1404
1405
1406
1407
1408
1409
1410
1411
1412
1413
1414
1415
1416
1417
1418
1419
1420
1421
1422
1423
1424
1425
1426
1427
1428
1429
1430
1431
1432
1433
1434
1435
1436
1437
1438
1439
1440
1441
1442
1443
1444
1445
1446
1447
1448
1449
1450
1451
1452
1453
1454
1455
1456
1457
1458
1459
1460
1461
1462
1463
1464
1465
1466
1467
1468
1469
1470
1471
1472
1473
1474
1475
1476
1477
1478
1479
1480
1481
1482
1483
1484
1485
1486
1487
1488
1489
1490
1491
1492
1493
1494
1495
1496
1497
1498
1499
1500
1501
1502
1503
1504
1505
1506
1507
1508
1509
1510
1511
1512
1513
1514
1515
1516
1517
1518
1519
1520
1521
1522
1523
1524
1525
1526
1527
1528
1529
1530
1531
1532
1533
1534
1535
1536
1537
1538
1539
1540
1541
1542
1543
1544
1545
1546
1547
1548
1549
1550
1551
1552
1553
1554
1555
1556
1557
1558
1559
1560
1561
1562
1563
1564
1565
1566
1567
1568
1569
1570
1571
1572
1573
1574
1575
1576
1577
1578
1579
1580
1581
1582
1583
1584
1585
1586
1587
1588
1589
1590
1591
1592
1593
1594
1595
1596
1597
1598
1599
1600
1601
1602
1603
1604
1605
1606
1607
1608
1609
1610
1611
1612
1613
1614
1615
1616
1617
1618
1619
1620
1621
1622
1623
1624
1625
1626
1627
1628
1629
1630
1631
1632
1633
1634
1635
1636
1637
1638
1639
1640
1641
1642
1643
1644
1645
1646
1647
1648
1649
1650
1651
1652
1653
1654
1655
1656
1657
1658
1659
1660
1661
1662
1663
1664
1665
1666
1667
1668
1669
1670
1671
1672
1673
1674
1675
1676
1677
1678
1679
1680
1681
1682
1683
1684
1685
1686
1687
1688
1689
1690
1691
1692
1693
1694
1695
1696
1697
1698
1699
1700
1701
1702
1703
1704
1705
1706
1707
1708
1709
1710
1711
1712
1713
1714
1715
1716
1717
1718
1719
1720
1721
1722
1723
1724
1725
1726
1727
1728
1729
1730
1731
1732
1733
1734
1735
1736
1737
1738
1739
1740
1741
1742
1743
1744
1745
Understanding the UCBLogo evaluator

Prerequisite: understanding the metacircular evaluator (MCE, SICP 4.1)
Corequisite: understanding the explicit control evaluator (ECE, SICP 5.4)


Contents
--------
1. Review of metacircular evaluator
    1.1 Recursive structure of the evaluator
    1.2 Tail call elimination in the MCE
2. The Explicit-Control Evaluator
    2.1 Typical C-style procedure calling convention
    2.2 Procedure calling convention in the ECE
    2.3 The evaluator uses surprisingly few registers
    2.4 The two-screenful version of the explicit control evaluator
    2.5 Implementing explicit control in C
3. Complicating Issues in UCBLogo
    3.1 Commands vs. operations
    3.2 Error messages and tail call elimination
	3.2a Tailforms
	3.2b Tailforms handled as embedded calls
	3.2c Caller-status arguments to eval-sequence and eval-dispatch
	3.2d Detecting errors before they happen
    3.3 Macros
	3.3a Why the MCE uses special forms
	3.3b Using rewriting instead of special forms
	3.3c Logo continuations
    3.4 Dynamic scope
	3.4a Reasons for lexical scope
	3.4b Reasons for dynamic scope
	3.4c Shallow binding
    3.5 Nonlocal exit
    3.6 Lack of parentheses delimiting procedure calls
	3.6a Tokenization
	3.6b Monadic and dyadic minus
	3.6c More on runparsing
	3.6d Treeification
	3.6e Procedure bodies
	3.6f Changing a procedure's arity
4. Data structures and types
    4.1 Pair-based types
    4.2 Numeric types
    4.3 String types
	4.3a Internal structure of a string
	4.3b How to create a string node
	4.3c Including punctuation characters in words
    4.4 Symbols
    4.5 Quote and colon types
    4.6 Arrays
    4.7 Primitive procedures
5. Garbage collector
    5.1 Node allocation
    5.2 The mark phase
    5.3 GCTWA
    5.4 The sweep phase
    5.5 Per-node gc data


1. Review of metacircular evaluator
-----------------------------------

1.1 Recursive structure of the evaluator

In Logo, as in Scheme, evaluating an expression involves recursively
evaluating other expressions, for two reasons:

	1.  Visible subexpressions, as in (+ (* 3 4))

	2.  Expressions in the body of a procedure invoked by this expression

#1 gives rise to the mutual recursion

	eval -> list-of-values -> eval

while #2 gives rise to

	eval -> apply -> eval-sequence -> eval

As usual with recursive problems, the resulting program is remarkably short,
but it gives rise to complicated processes.

In what follows, we mostly focus on recursive path #2.


1.2 Tail call elimination in the MCE

Since recursion is the main built-in means for expression iterations
(iterative tools such as MAP are written using recursion; for the moment we
ignore Scheme's primitive DO and Logo's primitive REPEAT), it's important that
implementations be able to evaluate tail calls without creating extra frames
for each call.  The proper handling of tail calls in the MCE depends on proper
tail calling in the underlying Scheme, and also on the fact that the
evaluation of the last expression in a sequence is done through a tail call to
EVAL:

	(define (eval-sequence exps env)	; SICP p. 367
	  (cond ((last-exp? exps) (EVAL (FIRST-EXP EXPS) ENV))
		(else (eval (first-exp exps) env)
		      (eval-sequence (rest-exps exps) env))))

To emphasize the point, the evaluator would not handle tail calls properly if
eval-sequence were written this way:

	(define (eval-sequence exps env)	; wrong!
	  (let ((result (eval (first-exp exps) env)))
	    (if (null? (rest-exps exps))
		result
		(eval-sequence (rest-exps exps) env))))

This might seem more elegant because the call to EVAL appears only once in the
definition, but that call isn't a tail call.


2. The Explicit-Control Evaluator
----------------------------------

2.1 Typical C-style procedure calling convention

In typical machine-language programming (including compiled C programs) it
is the responsibility of called procedures to allocate space and save whatever
might otherwise be clobbered by a recursive call.  So the calling sequence (in
the calling procedure) looks like

	<put arguments in registers>
	<procedure-call instruction>

where the second line represents a single machine language instruction that
saves the return address in a register and jumps to the procedure.  The called
procedure looks like this:

	<allocate stack space>
	<save registers we'll use on the stack, including return address>
	... <do the actual work> ...
	<restore saved values>
	<deallocate stack space>
	<jump to return address in some register>


2.2 Procedure calling convention in the ECE

But this won't work if we want a tail-call to avoid allocating new stack
space.  It's the caller, not the callee, who knows that this is a tail call.
Therefore the ECE makes saving the caller's job:

	<allocate space>
	<save registers>
	<put arguments in registers>
	<call procedure>
	<restore saved registers>
	<deallocate space>

When the caller is making a tail call rather than an embedded call, it can
leave out the saving and restoring steps:

	<put arguments in registers>
	<jump to procedure, without saving a return address>

That last line works because the called procedure inherits its return address
from the caller.  In other words, it returns directly to the caller's caller.


2.3 The evaluator uses surprisingly few registers

	exp	  an expression to be evaluated (argument to EVAL)
	env	  the current environment (argument to EVAL and EVAL-SEQUENCE)
	val	  the value of the expression (returned from EVAL)
	continue  the current return address (set by procedure call instruction
		    in most computers, but assigned explicitly here)
	proc	  the procedure being invoked (argument to APPLY)
	argl	  the actual argument values (argument to APPLY)
	unev	  a list of expressions (argument to EVAL-SEQUENCE)

In addition, the SAVE and RESTORE operations use an implicit stack pointer
register, which is never explicitly saved or restored.


2.4 The two-screenful version of the explicit control evaluator

eval-dispatch:	[Corresponds to EVAL in MCE]
	exp is self-evaluating -> val=exp, return to caller
	exp is variable -> val=(lookup exp env), return to caller
	exp is special form -> each one is different, much like MCE
	exp is application:
		SAVE STUFF
		EXP=(CAR EXP)
		CALL EVAL-DISPATCH	;operator
		RESTORE STUFF
		proc=val
		argl='()
		unev=(cdr exp)		;actual argument expressions
		ev-appl-operand-loop:
			unev is empty -> goto apply-dispatch
			exp=(car unev)
			SAVE STUFF
			CALL EVAL-DISPATCH
			RESTORE STUFF
			argl=(cons val argl)
			unev=(cdr unev)
			goto ev-appl-operand-loop

apply-dispatch:	[Corresponds to APPLY in MCE]
	proc is primitive -> do magic, return to caller
compound-apply:
	unev=(procedure-parameters proc)
	env=(procedure-environment proc)
	env=(extend-environment unev argl env)
	unev=(procedure-body proc)
	goto ev-sequence

ev-sequence:	[Corresponds to EVAL-SEQUENCE in MCE]
	exp=(car unev)
	(cdr unev) is empty -> ev-sequence-last-exp
	SAVE STUFF
	CALL EVAL-DISPATCH
	RESTORE STUFF
	unev=(cdr unev)
	goto ev-sequence

ev-sequence-last-exp:
	GOTO EVAL-DISPATCH

The calls to EVAL-DISPATCH are capitalized in the above.  There are four of
them; the first three (operator, operand, body expression) are surrounded by
save and restore operations, but the fourth (final body expression) is just a
goto, without saving registers.  (SICP pp 556-557.)

The version above simplifies the actual ECE by leaving out an optimization in
evaluating the last actual argument expression in a procedure application.

It's important that special forms that evaluate subexpressions, such as IF,
are written to tail-call EVAL-DISPATCH for the last subexpression, so that
	(if (= 2 3) (this-isnt-evaluated) (this-is-tail-called))
works correctly.


2.5 Implementing explicit control in C

The C language provides labels and goto, but the labels are not first-class;
you can't put a label (i.e., a memory address in your code) into a register
or onto the stack.

UCBLogo gets around this restriction by using a kind of pun:  C has separate
namespaces for labels and members of enums -- the same symbol can be used for
both.  So in logo.h there is a declaration of an "enum labels" whose members
have the same names as labels in the evaluator (e.g., begin_seq).  The
newcont macro that's used in the evaluator to store a new continuation (in
the ECE sense) on the stack actually stores this enum (i.e., a small
nonnegative integer).  This stored value is translated into an actual label
at fetch_cont, which says

	switch (x) {
	    begin_seq: goto begin_seq;
	    accumulate_arg: goto accumulate_arg;
	    ...
	}

(The actual order of labels is determined in logo.h; I've tried to put the
more commonly used ones first.)


3. Complicating Issues in UCBLogo
---------------------------------

The SICP evaluators leave out one important complexity of a full Scheme
implementation (the nonlocal transfer of control allowed by CALL/CC, which is
somewhat like the issue raised in Logo by CATCH and THROW).  Also, Logo is
more complicated to interpret than Scheme, for these reasons:

	Commands vs. operations (not every procedure returns a value)

	The desire for error messages that hide tail call elimination

	Macros

	Dynamic scope

	Lack of parentheses delimiting procedure calls

These topics are discussed in the following sections.


3.1 Commands vs. operations

The UCBLogo interpreter contains an object named UNBOUND that can be used as
the return value from a Logo procedure to indicate that, above the abstraction
barrier, there is no return value.

Consider the following Logo procedure:

	to bfn :n :lst
	if :n=0 [output :lst]
	output bfn :n-1 butfirst :lst
	end

The call to BFN in the last line of the body should be a tail call.  This
means that we have to treat the word OUTPUT specially in the interpreter.  To
the user, OUTPUT is a primitive procedure that takes one input.  But we can't
evaluate the input expression and then call OUTPUT, because that would make
the call to BFN an embedded recursion.  (In other words, the call to
EVAL-DISPATCH in 2.4 above that evaluates argument expressions saves and
restores registers.)

Instead, OUTPUT must be treated internally as a special form.  When OUTPUT is
seen, even if there are more unevaluated expressions in the body, its input is
evaluated with a tail call (a goto) to EVAL-DISPATCH.

Similarly, STOP must be treated specially.  EVAL-SEQUENCE looks to see if the
expression *after* the current one is the word STOP; if so, the current
expression is tail-called.  We must catch STOP early to handle things like

	to foo
	IF condition [tail-call-this STOP]
	...more expressions...
	end

Under some circumstances the STOP is seen as the current expression, rather
than as the expression after the current one; in this case, EVAL-SEQUENCE just
returns to its caller.  This situation is exemplified by

	to foo
	IF condition [STOP]
	...more expressions...
	end

In eval.c, OUTPUT and STOP are referred to as "tailforms"; there is a
procedure is_tailform(exp) that returns nonzero (true) if the expression is an
OUTPUT or STOP expression.

Note on terminology:  In above-the-line documentation we distinguish between
*expressions*, which return values, and *instructions*, which don't.  But
internally they're all expressions.

A third tailform is .MAYBEOUTPUT; this tail-calls its input expression, like
OUTPUT; the only difference is that it's not considered an error if the input
expression returns UNBOUND instead of a value.  This brings us to the next
topic:


3.2 Error messages and tail call elimination

First of all, in order to give error messages at all, the interpreter must
maintain some extra information, mainly the names of procedures, in additional
registers.  The main ones are:

	fun		the name of the current procedure
	ufun		the name of the current user-defined procedure
	this_line	the line in ufun's body being evaluated

FUN and UFUN will be unequal if we're evaluating a call to a primitive
procedure.

Logo gives error messages if a return value is expected and not provided
(X DIDN'T OUTPUT TO Y) or vice versa (YOU DON'T SAY WHAT TO DO WITH Z).

These error messages would be straightforward were it not for tail call
elimination.  The obvious place to check is right after a call to
EVAL-DISPATCH; the caller knows whether it wants a value or not, so it
compares the value returned with UNBOUND, giving an error message when
appropriate.  The call to EVAL-DISPATCH for evaluating an argument would
require a return value other than UNBOUND, whereas the one in EVAL-SEQUENCE
for expressions other than the last would require UNBOUND.


3.2a Tailforms

One complicating factor is tailforms.  Consider this Logo program:

	to foo
	output print 3
	end

	to baz
	show foo
	end

When we call BAZ, we should get the error message

	print didn't output to output  in foo

But we don't call PRINT and then call OUTPUT; we just tail-call PRINT directly
from FOO.  So if we're not careful we'll get an incorrect message such as

	print didn't output to foo
	print didn't output to show
	foo didn't output to show

On the other hand, consider this program:

	to foo
	print 3
	end

	to baz
	show foo
	end

This time, when we call BAZ, the message should be

	foo didn't output to show  in baz

and not anything starting "print didn't output..."

To avoid these incorrect error messages, it's not enough to remember "the
current procedure".  We maintain additional information specifically for the
DIDN'T OUTPUT message:

	didnt_get_output	the context (fun, ufun, line) wanting output
	didnt_output_name 	the procedure that should be blamed if no o/p

These should be thought of as arguments to EVAL-DISPATCH.  There are three
special values of didnt_get_output:

	UNBOUND		we don't want an output
	NIL		either output or no output is okay (from .MAYBEOUTPUT)
	(NIL . NIL)	this is a macro, and must output, but not for anyone
			in particular


3.2b Tailforms handled as embedded calls

Sometimes it's not good enough to handle STOP and OUTPUT as tail calls.
Consider the Logo instruction

	repeat 5 [... if :x<0 [output 3] ...]

This is legal inside a procedure body; the OUTPUT outputs from the enclosing
user-defined procedure.  But the evaluation of the expression that is the
second input to REPEAT isn't a tail evaluation, since after it's done the
repeat count must be updated and tested.  So when :X is negative we can't just
tail-eval the 3; that would return 3 as the value of the entire repeated
expression sequence.  This OUTPUT is a nonlocal exit, comparable to a THROW.
So in this case we actually do call a primitive procedure named OUTPUT.
(Never mind for now what OUTPUT does in this case; we'll get back to nonlocal
exit later.)


3.2c Caller-status arguments to eval-sequence and eval-dispatch

To make all this work, we need an extra argument to EVAL-SEQUENCE and an
extra argument to EVAL-DISPATCH; these must be saved and restored with the
other evaluator registers.  The extra argument to EVAL-DISPATCH is
didnt_get_output, mentioned earlier.  The extra argument to EVAL-SEQUENCE is
named val_status and is an integer containing some combination of the
following flag bits:

#define VALUE_OK	 1	/* [instr instr instr exp] */
#define NO_VALUE_OK	 2	/* [instr instr instr instr] */
#define OUTPUT_OK	 4	/* [instr instr OUTPUT exp ...] */
#define STOP_OK		 8	/* [instr instr STOP ...] */
#define OUTPUT_TAIL	16	/* not [repeat n [... output ...]] */
#define STOP_TAIL	32	/* not [repeat n [... stop ...]] */

Here are all the ways in which EVAL-SEQUENCE is called, with the corresponding
values of val_status:

entry to evaluator (from Logo prompt)	NO_VALUE_OK

body of procedure when output wanted	OUTPUT_OK | OUTPUT_TAIL

body of procedure when no o/p wanted	NO_VALUE_OK | STOP_OK | STOP_TAIL

body of procedure for .MAYBEOUTPUT	NO_VALUE_OK | STOP_OK | STOP_TAIL |
					    OUTPUT_OK | OUTPUT_TAIL

RUNRESULT [sequence]			VALUE_OK | NO_VALUE_OK |
					    OUTPUT_OK[**] | STOP_OK[**]

REPEAT n [sequence]			NO_VALUE_OK | 
					    OUTPUT_OK[*] | STOP_OK[*]

APPLY [sequence] args, output wanted	VALUE_OK

APPLY [sequence] args, no o/p wanted 	NO_VALUE_OK |
					    OUTPUT_OK[*] | STOP_OK[*]

[*] means that these flags are on only if they were on for the enclosing
expression sequence.

[**] means that OUTPUT and STOP are actually not okay inside a RUNRESULT,
but the RUNRESULT handler lets STOP or OUTPUT be called and then detects
and reports the error afterwards.

You may wonder why RUNRESULT and REPEAT are special cases here, but not RUN,
IF, and IFELSE.  We'll discuss that shortly.


3.2d Detecting errors before they happen

Sometimes it's not good enough to detect an error after evaluation.
These situations are obscure.  For example:

	to blorp :x
	repeat 5 [pr :x  if :x<0 [op 3]  make "x :x-1]
	end

	? blorp 2
	2
	1
	0
	-1
	You don't say what to do with 3

The OP in BLORP would be legal if the top-level instruction had been
PRINT BLORP 2.  When we see the OP, we know that no output is wanted,
but we have to evaluate the input expression (3) anyway, just in case
it prints something or has some other side effect.  Because the OP is
nested inside a REPEAT, it turns out, we can't just let the error be
detected downstream; we have to flag the error without letting the REPEAT
handler continue.  So we non-tail-evaluate the 3, then after the evaluation
returns, we flag the error.  (Look at label op_want_stop in eval.c to see
how this works.)  This code is used for all cases of OUTPUT when STOP was
wanted, but it's only really necessary in the REPEAT case.


3.3 Macros

In the SICP evaluators, IF is a special form.  This means that the evaluation
of the consequent or alternative of an IF expression happens through a string
of tail calls within the evaluator:

	eval -> eval-if -> eval (of consequent or alternative)

Thus, if the entire IF expression is in tail position, the consequent or
alternative is also tail-evaluated.  (Of course the predicate argument to IF
can't be tail-evaluated, because the job of IF isn't finished when that
argument has been evaluated; the result must still be tested for true or
false, and either the consequent or the alternative must be evaluated.)

SICP handles COND differently; instead of directly tail-evaluating the actions
of the successful COND clause, the entire expression is rewritten as an IF
expression, which is then evaluated:

	(define (eval exp env)		; SICP page 365
	  (cond ...
		((cond? exp) (EVAL (COND->IF EXP) ENV))
		...))

COND->IF is a rewriting procedure that returns a new expression, which is used
as argument to another (tail-recursive) call to EVAL.

They could have handled IF as a sort of rewriting, also, except that the
rewriting procedure would need access to the environment:

	(define (if->exp exp env)	; compare with EVAL-IF, SICP p. 367
	  (if (true? (eval (if-predicate exp) env))
	      (if-consequent exp)
	      (if-alternative exp)))

The general name for such a rewriting procedure is a macro.


3.3a Why the MCE uses special forms

Special forms in the MCE have two purposes.  First, they delay (IF) or
prevent (QUOTE) evaluation of subexpressions; second, they have access to the
caller's environment (DEFINE, SET!).  Neither of these issues comes up in
quite the same way in a Logo interpreter.

Delayed evaluation in Logo is done by quoting (including " and [] notations);
this is why we say

	ifelse :x < 0 [print "negative] [print "positive]

with brackets around the PRINT instructions, but not around the :X < 0
expression.  [In Scheme, parentheses would be used for all three
subexpressions:

	(if (< x 0) (display 'negative) (display 'positive))

and it's the status of IF as a special form that prevents the early evaluation
of the last two subexpressions.]

Access to the caller's environment isn't a problem for Logo primitives because
of dynamic scope; the evaluator's ENV variable, which is a global variable in
the C program, contains the correct environment when any primitive is used.
So Logo's equivalent to DEFINE and SET!, namely MAKE, is an ordinary
primitive.

As a result, Logo has only two special forms: TO, because of its strange
notation, and .MAYBEOUTPUT, because its "input" expression need not return a
value.  (This is the above-the-line view.)  In principle there's nothing
special about RUN, IF, IFELSE, REPEAT, or RUNRESULT.

This simple view of the world doesn't quite hold below the line, though,
because of tail call elimination.  We want

	to foo
	if ... [output tail-call-this]
	...
	end

to be a tail call, so we can't recursively call the evaluator from inside the
IF primitive.


3.3b Using rewriting instead of special forms

Instead we make IF a macro -- that is, a rewriting procedure.  If the
condition is satisfied, it outputs its second input; if not, it outputs the
empty list.  (IFELSE outputs either its second or its third input; RUN just
outputs its input.)  The evaluator recognizes (in APPLY-DISPATCH) that a macro
is being invoked; instead of just returning what the procedure returns, it
splices the return value into the sequence of unevaluated expressions.
What "splices" means, more precisely, is

	unev=(append val unev)

[If you're reading carefully you'll notice that I'm having APPLY do something
to a sequence of expressions, whereas such a sequence exists only in
EVAL-SEQUENCE.  Actually APPLY pushes on the control stack a return address
of macro_return (so that every call to a macro is a non-tail call, by the
way); when we get there, the C instruction

	stopping_flag = MACRO_RETURN;

is executed.  Back in EVAL-SEQUENCE, this flag is noted, and that's where the
splicing is done.  A further complication is that if the macro was in fact
called from tail position, we won't return to EVAL-SEQUENCE, which tail-called
EVAL-DISPATCH, so instead macro_return has to go to begin_seq (which goes to
eval_sequence) itself.  If the macro is called when an argument was expected,
it must return exactly one expression; the MACRO_RETURN flag is caught at
accumulate_arg, and the returned expression is sent to eval_dispatch.]

Why not just treat RUN, IF, and IFELSE as special forms internally (that is,
handle them as special cases within EVAL-DISPATCH) even though we tell users
they're ordinary procedures?  Two reasons:

	1.  The UCBLogo approach allows these to be separate C procedures,
	    outside of eval.c, instead of making the core evaluator know
	    about them.

	2.  Once we have the macro capability, we can fairly easily make it
	    available to users, through the .MACRO primitive.

(Note: As of R5RS, Scheme has a standard rewriting capability available to
users, more complicated than ours, but with the same purposes.)


3.3c Logo continuations

This is the entire story for RUN, IF, and IFELSE.  As we've seen before,
REPEAT and RUNRESULT are more complicated, because they both have more work to
do after their expression-sequence arguments are evaluated.  REPEAT must count
down its repetition count and test for zero.  RUNRESULT modifies the result of
the evaluation, returning [] instead of UNBOUND and [FOO] instead of FOO.

REPEAT and RUNRESULT are still considered macros, but they don't return
rewritten expressions.  Instead they return a special internal data type,
called a "continuation."  (The name is inspired by Scheme's continuations, and
it's a reasonably appropriate name because it does embody "what to do next,"
but it's not an exact analog of Scheme continuations.)  A continuation is a
pair whose car is a (representation of a) label within the evaluator, and
whose cdr is whatever data the code at that label expects (this generally
means the inputs to the macro).  The code at macro_return recognizes that the
return value was a continuation; it sets VAL to the cdr of the continuation
and jumps to the label specified in the car.

Since most of the implementation of these Logo primitives is buried inside the
core evaluator, they are similar to the special forms of the SICP evaluator.
The difference is that they are recognized through the same table lookup used
for ordinary procedures, rather than by special tests within EVAL-DISPATCH.

GOTO and CATCH are also macros with continuations.  GOTO doesn't take an
expression sequence as an input; the sense in which it's a rewriting procedure
is that it "rewrites" its input into the expression sequence consisting of the
part of the current user procedure's body starting at the corresponding TAG
and continuing until the end.  In this case the resulting expression sequence
replaces the previous value of UNEV instead of being spliced into it.

We'll discuss CATCH shortly, under the heading of nonlocal exit, but we're not
quite ready for that.


3.4 Dynamic scope

It's rather a shame, I think, that the second edition of SICP no longer even
mentions dynamic scope, so I have to explain what it means before discussing
how it's implemented.  See also _Computer Science Logo Style_ vol 1 page 49
on dynamic scope, and CSLS vol 3 page 183 on lexical scope.

Within a procedure body, how do we handle references to variables that are not
local to that procedure itself?

In C, any variable reference within a procedure must be either to a variable
local to that procedure or to a global variable; one procedure can never refer
to another procedure's local variables.

In Scheme, one procedure can be defined inside the body of another; the inner
procedure can refer to local variables belonging to the outer one.  This is
called *lexical scope*.  [C is lexically scoped, too, but trivially so, since
there is no nesting of procedure definitions.]

In Logo, when one procedure *invokes* another, the invoked procedure can refer
to local variables belonging to the caller.  This is called *dynamic* scope.

The following would print DYNAMIC in a dynamically-scoped Scheme, but prints
LEXICAL in (the actual) lexically-scoped Scheme:

	(define (a x)
	  (define (b x) (c))
	  (define (c) x)
	  (b 'dynamic))

	(a 'lexical)

Focus on the reference to X in the body of procedure C.  Since C was invoked
by B, its *dynamic environment* is

	C -> B -> A -> global

whereas its *lexical environment* is

	C -> A -> global

because C was defined inside A, not inside B.


3.4a Reasons for lexical scope

Why does almost everyone prefer lexical scope?  (I prefer it, too, under
many circumstances, by the way.)  Three reasons:

	1.  A complier can produce faster compiled code for a lexically scoped
	    language, because the compiler knows which variable is meant by a
	    variable reference without actually running the program.  The
	    rules refer only to the physical position of the procedure's
	    definition within the program, not to who calls whom.  In a
	    dynamically scoped language, variable references can't be resolved
	    until the program is actually running; the same reference might
	    mean two different variables on two calls to the same procedure.

	2.  Lexical scope avoids "name capture" bugs, in which the programmer
	    accidentally uses the same name for two different purposes, so
	    that a procedure thinks it's getting one variable X when it really
	    ends up getting a different variable X because the name was reused
	    in a procedure that (directly or indirectly) invokes it.  This is
	    the reason most often cited.  For the most part I think people who
	    name variables X deserve what they get, but name capture does come
	    up as a problem when writing higher-order functions in Logo, if
	    the names of inputs to the higher-order function are common words
	    that might appear in a template.  See CSLS vol 2 page 204.

	3.  In Scheme, in which lexical scope is combined with first-class
	    procedures (i.e., LAMBDA), that combination allows for persistent
	    local state variables, and therefore for object-oriented
	    programming, without any OOP-specific language syntax.  See
	    SICP 3.1 and 3.2 for details.


3.4b Reasons for dynamic scope

Why, then, does Logo use dynamic scope?  Four reasons:

	1.  It allows for a simple notation for higher-order functions, using
	    first-class expressions rather than first-class procedures.  We
	    can say

		to add.suffix :suffix :words
		output map [word ? :suffix] :words
		end

	    The expression template [WORD ? :SUFFIX] is evaluated inside MAP,
	    not inside ADD.SUFFIX, and yet it has access to the latter's local
	    variable SUFFIX.  In Scheme we'd have to use LAMBDA, a more
	    intimidating notation.

	2.  Some of us think that dynamic scope is just easier for kids and/or
	    novice programmers to think about.  If a variable exists, it's
	    available for use, unless shadowed by a more recently created
	    variable of the same name.  You don't have to think about scope
	    at all, really.

	3.  Debugging support is more natural.  You can tell Logo to PAUSE
	    whenever there's an error.  You then have a Logo prompt, to which
	    you type ordinary Logo instructions, but all the relevant
	    variables are available for inspection.  This is important because
	    the actual error in your program might not be in the procedure
	    that died, but rather in the caller of the caller of the caller of
	    that procedure.  With dynamic scope, the variables of the
	    erroneous procedure are visible.  You don't have to learn special
	    debugging commands that mean "switch to a different environment."

	4.  It allows for a programming style using "semi-global" variables
	    that are local to a top-level procedure.  For example, a program
	    to draw a complicated picture might have a SIZE input, which is
	    then used by its subprocedures, sub-subprocedures, etc.


3.4c Shallow binding

There's an efficiency problem with dynamic scope and recursive procedures.
Consider the following modification of a common example:

	make "one 1

	to fact :n
	if :n = 0 [output :one]
	output :n * fact :n-1
	end

Suppose we ask for FACT 5.  This gives rise to an (embedded) recursive call
for FACT 4, which calls FACT 3, which calls FACT 2, which calls FACT 1, which
calls FACT 0, which says OUTPUT :ONE.

We have to look up the variable name ONE.

Is it local to the FACT 0 environment?  No.  So we look in the dynamically
enclosing environment, for the FACT 1 call.  Is it there?  No, so we look in
the FACT 2 environment, the FACT 3 environment, the FACT 4 environment, the
FACT 5 environment, and finally the global environment, which is where we find
it.  This is quite time-consuming, and would be even more so if we wanted the
factorial of 100.

Here's the solution.  Instead of putting new local bindings in a newly created
environment frame, and linking that to an enclosing environment, as in SICP
3.2 and 4.1.3, we instead keep the currently active bindings in a single
global table.  (Since this table is used for every variable reference, we work
hard at making it efficient, namely by using a hash table rather than the
linear lists of SICP environments.)  We still create new frames for procedure
calls, but what we put in the new frames are *saved* bindings, the ones we
replaced in the global table for our new local variables.  These frames are
used only when the called procedure returns; we restore the previous state of
the global table from the saved values.  This is called *shallow binding*.

The implication of this technique is that every variable reference, local or
global, takes constant time.  The cost is that returning from a procedure call
takes time proportional to the number of local variables in the procedure.

Since all symbols are looked up in the global symbol table, there is no ENV
register in the Logo evaluator.  Instead there are two registers pointing to
a stack of saved values.  var_stack points to the head of the stack, where
new entries are added; var points to the beginning of the current frame, so
that bindings between var and var_stack are the ones that should be restored
when an embedded call to eval_dispatch returns.


3.5 Nonlocal exit

Despite the proper handling of tail calls, a typical program still gives rise
to several nested calls to EVAL-DISPATCH.  Ordinarily these are "unwound" one
by one, as (Logo) procedures return values.  But sometimes we want to unwind
several such recursive evaluations at once.  The prototypical case is a call
to THROW that appears several recursive evaluations below the corresponding
CATCH.

Were it not for shallow binding, we could in fact leapfrog over several nested
evaluations at once, just throwing out their local information, sort of like
setjmp/longjmp in C.  But instead we must restore all of the saved variable
bindings, in the correct order, so we have to let each of the nested EVAL
calls return to its caller.

But we don't want those intermediate levels to keep computing; the whole point
of THROW is to avoid completing some partial evaluations of expression
sequences.  We need some way to tell EVAL-SEQUENCE to return right away,
without evaluating more expressions.

This is done using another evaluator register, called stopping_flag, which
should be viewed as part of the return value from EVAL.  It has the following
possible values:

	RUN		program is running
	STOP		non-tail STOP was seen
	OUTPUT		non-tail OUTPUT was seen
	THROWING	THROW was seen
	MACRO_RETURN	a macro is returning a rewritten expression (see 3.3b)

Logo.h defines shortcuts to test some of these states:

#define NOT_THROWING            (stopping_flag != THROWING)
#define RUNNING                 (stopping_flag == RUN)
#define STOPPING                (stopping_flag == STOP)

It's NOT_THROWING because that turns out to be the most commonly used test.

The very first thing EVAL-SEQUENCE does is

	if (!RUNNING) goto fetch_cont;

so that if we've seen a THROW (or a non-tail STOP or OUTPUT, as explained
in 3.2c) we just return without further evaluation of this sequence.

The continuation for CATCH calls EVAL-SEQUENCE for its second argument, then
if stopping_flag is THROWING and throw_node, a global variable, is equal to
its first argument, it sets stopping_flag back to RUN.  (THROW can take an
optional second input, which is a value for the corresponding CATCH to return.
This works by setting output_node, another global variable, to the thrown
value.  Non-tail OUTPUT also sets output_node.)

The variables throw_node and output_node don't have to be saved and restored
around embedded EVAL calls, because only one THROW or OUTPUT can be active at
a time.

If CATCH catches THROW, who catches non-tail STOP or OUTPUT?  EVAL-SEQUENCE
does, after restoring registers and before looping for the next expression.
And so does REPEAT, so that it won't try repeating its input any more.


3.6 Lack of parentheses delimiting procedure calls

The SICP evaluators see an expression sequence as a list in which each element
is one complete expression, because every non-atomic Scheme expression is
enclosed in parentheses.  So, for example, how many arguments are in this
procedure call?  One less than the number of elements in the list.  Easy.

Logo has a harder time, because expressions need not be parenthesized, and
also because of infix arithmetic notation.  The Logo evaluator sees a
procedure name, realizes that this is the beginning of a procedure call, and
must then look up the procedure in order to know how many input subexpressions
to evaluate.  Furthermore, some procedures take a variable number of inputs,
so the grouping may depend on the use of optional parentheses.

This grouping of tokens into subexpressions is slow, and it's wasteful to do
it repeatedly for expression sequences in the body of a procedure.  So UCBLogo
does it once, the first time the procedure is called, and saves the result.
In effect each Logo procedure is translated into a Scheme procedure, as if it
had been typed in with every subexpression parenthesized and no use of infix.


3.6a Tokenization

But I've skipped a step.  Even the division of the characters on a line into
tokens is problematic.  Here's the classic problem:  Should the hyphen
character automatically be a token by itself?  We'd like

	print 5-2

to work, and so we'd like the 5-2 to be understood as three tokens: 5, -, 2.
On the other hand, we'd like

	print first [555-1212 642-8311 868-9827]

to print 555-1212, not 555.

There are three ways to resolve this dilemma:

	"Old LCSI" style: the second example prints 555.  This is what Apple
	Logo, IBM Logo, and other early LCSI products do.

	"New LCSI" style: the first example gives the error

		I don't know how to 5-2

	This is what Logowriter and later LCSI products do.

	"Terrapin" style: both examples work as desired.

I apologize to non-Americans for naming these after the main USA Logo vendors,
but that's the history I know best.

UCBLogo uses Terrapin-style tokenization, which I think is clearly the best
choice.  To make it work, there are two sets of tokenization rules, which I
call PARSE and RUNPARSE.  These are in fact the names of UCBLogo primitives:

	? show parse "5-2
	[5-2]
	? show runparse "5-2
	[5 - 2]

When a line is read (from keyboard or file), it is parsed.  When and if the
line is evaluated, it's runparsed -- but even then, text within square
brackets is not runparsed.  (Such text will be runparsed if it is ever used as
input to RUN and friends.)

The precise rules are given in more detail in the "Tokenization" section of
the UCBLogo User Manual.


3.6b Monadic and dyadic minus

The minus sign (-) is particularly thorny for tokenization because it has
three different meanings:

(1)  If it begins a token, and the following character is a digit, then it is
part of a negative number token.

(2)  If it follows an open parenthesis or bracket, or if it follows a space
and is not followed by a space, then it is the monadic (one-operand) negation
operator.

(3)  Otherwise, it is the dyadic subtraction operator.

The part about spaces in (2) above comes from the problem that function
argument expressions are separated by spaces, not commas as in most
programming languages.  This, along with infix arithmetic, gives rise to
an ambiguity.  Does

	foo a - b

represent a call to a one-argument function with A minus B as the argument,
or a call to a two-argument function with A as the first argument and
negative B as the second?  The UCBLogo rule is this:

	foo a-b		; dyadic subtract
	foo a - b	; dyadic subtract
	foo a -b	; monadic negate

Once runparsing is done, the spacing information is lost.  So when a monadic
negate is seen during runparsing, it is replaced by the two tokens "0" and
"--":

	? show runparse [a -b]
	[a 0 -- b]

The double minus sign is an infix operator that represents subtraction but
has the highest infix binding strength, so that

	a * 0 -- b

means a*(0-b), not (a*0)-b.  It's an undocumented feature that users can
type |--| to get this tightly-binding subtraction operator; if they type "--"
without vertical bars, they get two separate "-" tokens.


3.6c More on runparsing

Once a list has been runparsed, we don't want to have to do it again, so we
want to replace the old list value with the new one.  But we can't quite do
that, because the non-runparsed version might still be needed:

	to foo :x
	repeat 4 :x
	show :x
	end

	? foo [print 5-2]
	3
	3
	3
	3
	[print 5-2]

We wouldn't want the last line to look like

	[print 5 - 2]

but we also don't want to runparse the list four times.

The solution is somewhat of a kludge.  In UCBLogo every "pair" (as created by
CONS) actually contains three components, not two: the car, the cdr, and the
"object."  At first the list [print 5-2] looks like this:

	Type:	PAIR
	car:	print
	cdr:	[5-2]	; of course this is itself a PAIR with car 5-2
	object:	NIL

After runparsing, it looks like this:

	Type:	RUNPARSE
	car:	print
	cdr:	[5-2]
	object:	[print 5 - 2]

The car and cdr of this runparsed list (the parts seen by ordinary procedures
such as SHOW) are unchanged, but the object is another list, the runparsed
version.  If we ask to runparse this list again, the runparse() procedure will
notice that its argument is of type RUNPARSE and just return it unchanged.

A defect of this approach is that if for some reason we want to runparse
the cdr of the list, we have to do it from scratch, but that rarely happens.

(Another defect, since all Logo nodes are the same size, is that even ordinary
pairs have to have room for an object field, even though most of these fields
are unused.  This is a big, although constant factor, waste of memory.)


3.6d Treeification

Now we're ready to return to the problem posed at the beginning of 3.6:  The
form in which we *really* want this instruction list is

	[print [- 5 2]]

That is, we want it fully parenthesized, with prefix operators.  (I'm using
the word "parenthesized" because that's traditional, but of course what we
really want has neither parentheses nor square brackets, but rather list
structure with sublists.)

Note, by the way, that the Logo instruction

	print [- 5 2]

doesn't subtract 2 from 5!  This instruction, in treeified form, looks like

	[print "[- 5 2]]

except that the quotation mark isn't really there, either; it's a node of type
QUOTED.  The important point is that unlike runparsing, treeification gives
rise to something that is *not* in Logo surface syntax, so we don't show it to
users; there is no TREEIFY primitive.

Treeification is implemented similarly to runparsing; the result is this node:

	type:	TREE
	car:	print
	cdr:	[5-2]
	object:	[print [- 5 2]]


3.6e Procedure bodies

Since a line can contain more than one instruction, and since an instruction
can be continued onto more than one line, why bother having the concept of
lines at all?  Why not, like Scheme, treat newlines the same as spaces?

That's what Microworlds does, so it's clearly possible.

The only reason to maintain the concept of lines as important is for the sake
of the user interface.  When an error occurs inside a procedure body, the
UCBLogo error message includes the line on which the error happened.  Also,
the STEP command causes a procedure to be run one line at a time.

So we might represent a procedure body as a list of lines, each of which is a
list of (treeified, eventually) expressions.  But then we couldn't use a
procedure body as argument to EVAL-SEQUENCE, which expects a list of
expressions, not a list of lists of expressions.

So a procedure body is a list of expressions, but some of the pairs making up
the spine of the list are of type LINE, namely, the first such pair of each
line of the body.  In these pairs, the object field points to the text of the
line (parsed, but not runparsed or treeified).  When EVAL-SEQUENCE notices a
LINE-type pair, it sets the evaluator register this_line to the pair's object.


3.6f Changing a procedure's arity

The *arity* of a procedure is the number of inputs it takes.  In UCBLogo, the
procedure's arity is actually three numbers: the minimum, default, and maximum
number of inputs.

In this section we focus on the default number.

Consider the following example:

	to a
	print (sentence b 1 2 3 4 5)
	end

	to b :x :y
	output :x + :y
	end

When we call A, it should print

	3 3 4 5

The first time we call A, its one body line is treeified, with this result:

	[print [sentence [b 1 2] 3 4 5]]

Now suppose we redefine B:

	to b :x :y :z
	output :x + :y
	end

We have change B's arity, and so the treeification of A is now incorrect!
It should be re-treeified to

	[print [sentence [b 1 2 3] 4 5]]

But if we check the arity of every procedure every time we call A, we've
lost the efficiency advantage we gained by treeification.

The perfect solution would be to maintain a data structure for each procedure
indicating which procedures depend on its arity.  Then, if the procedure is
redefined in a way that changes the arity, we could re-treeify exactly those
procedures that depend on it.  But this would be a lot of trouble; UCBLogo,
like several other Logo implementations, does something simpler.

The principle is that it's pretty rare to redefine a procedure in a way that
changes its arity, and so it's okay if doing that slows things down a lot.
What's important is to keep the common case -- no redefinition -- fast.

The global variable the_generation contains a pair.  Its contents are
unimportant; it's created with cons(NIL, NIL).  What's important is that it's
distinct from any other pair.  Whenever an existing procedure is redefined in 
a way that changes its arity, a new pair replaces the_generation.

Every LINE node in a procedure body contains a pointer to the pair that was in
the_generation at the time this line was treeified.  Whenever this line is
run, its generation pair is compared with the_generation.  If they're not
equal, it means that some procedure has changed arity since the line was
treeified, and so the line is re-treeified.

This means that every line of every procedure must be reparsed after an arity
change!  This produces a noticeable slowdown.  But it doesn't happen often.

(Defining a new procedure doesn't require changing the_generation, because a
procedure can't be called until all its subprocedures are defined.  Remember
that the body is treeified when the procedure is first called, not when it's
defined.  Logo won't save the treeification of a procedure if it has undefined
subprocedures.)


4. Data structures and types
----------------------------

The basic unit of storage in Logo is the node.  Every user-visible data type
(word, number, list, array) and many internal types (runparse, tree,
primitive, line, etc.) are stored in nodes.  A node contains the following
fields:

	type flags			node_type
	info for garbage collector	my_gen, gen_age, mark_gc, next,
					oldyoung_next
	other fields by type

The type field is a collection of bits, not just an enum, so that tests can
easily be made for categories.  For example, many types are special-purpose
pairs, so they all have the NT_LIST bit set.  Numbers, strings, quoted
strings, etc., all have the NT_WORD bit set.  The bit NT_AGGR stands for
"aggregate," which includes lists and arrays.  The types are defined in
logo.h.


4.1 Pair-based types

The ordinary pair is of type CONS.  The other pair types are RUN_PARSE,
TREE, LINE, and CONT (continuation), all of which were discussed in part 3.

The empty list is represented by a zero pointer, but nodetype(0) returns
the type PNIL.

The constructor for pairs is cons(x,y).  The selectors are car, cdr, caar,
cadr, cdar, and cddr.  (Combined forms more than two deep are not provided
in logo.h.)  Ordinary pairs don't use the object field; special pairs are
generally first made with cons and then mutated with setobject(pair,val).


4.2 Numeric types

Numbers are a kind of word (because you can examine their digits with FIRST
etc.), so the numeric type codes include the NT_WORD bit.

The numeric types are INT (32-bit integer) and FLOATT (sic, because FLOAT is
defined in some C header file or other; 64-bit floating point).

Someday I'll get around to bignums (arbitrary-precision integers).

The constructors are make_intnode and make_floatnode; the selectors (to get
back the underlying numeric data) are getint and getfloat.

Note that a string containing all digits counts as a number above the line!
Such strings of digits are created by operations such as

	butfirst "x123

There is a procedure cnv_node_to_numnode that takes a node as its argument
and, if possible, returns an equivalent INT or FLOATT node.  (The argument
may already be a numeric type, in which case it is returned unchanged.)  If
the argument can't be construed as a number, the procedure flags a BAD_DATA
error, so after calling it, you should check if (NOT_THROWING) {...} for
whatever computation uses the number.


4.3 String types

There are three string types: STRING, BACKSLASH_STRING, and VBAR_STRING.


4.3a Internal structure of a string

The string format is based on the goal of making FIRST, LAST, BUTFIRST,
and BUTLAST quick for words.  The characters themselves are not in the string
node, because nodes have a fixed size and strings can be of any length.
Instead, the characters are in a block allocated with malloc() in the
following format:

struct string_block {
    unsigned FIXNUM str_refcnt;
    char str_str[1];	    /* This array will be of variable length really */
};

The string_block needs a reference count because different word nodes may
point to substrings of it.  When a word node is garbage collected, the
reference count of its string_block is decremented; when the count reaches
zero, the block is freed.

The word node itself has three fields:

#define getstrptr(node)         ((node)->n_str)
#define getstrlen(node)         ((node)->n_len)
#define getstrhead(node)        ((node)->n_head)

(These lines are from logo.h, which has getters and setters for all the node
types.)  The strptr is a pointer to the first character of the word (not
necessarily the first character of the string_block).  The strlen is the
number of characters in the word.  The strhead is a pointer to the
string_block, as returned by malloc().

There is a single empty word node, pointed to by the variable Null_Word.
Logo takes pains not to create any other empty word nodes, so that all empty
words will be EQ? and not merely EQUAL?, to simplify comparisons.


4.3b How to create a string node

Words are usually created with

NODE *make_strnode(char *strptr, struct string_block *strhead, int len,
		   NODETYPES typ, char *(*copy_routine)())

This constructor is used in two different ways.  If you already have a
string_block containing the characters you need, and just want to select
a substring of it, you say

	make_strnode(strptr, strhead, len, STRING, <anything>)

because the last argument isn't used.  If you have the characters in a
temporary buffer and have to allocate a string_block for them, you say

	make_strnode(charptr, NULL, len, STRING, copy_routine)

The copy routine is usually strnzcpy:

char *strnzcpy(char *s1, char *s2, int n) {
    strncpy(s1, s2, n);
    s1[n] = '\0';
    return(s1);
}

This is in logodata.c, along with various special-purpose ones with names
like mend_strnzcpy that modify the copied characters in some way.  For
example, one version strips out comments and line continuation characters
because in UCBLogo you can say

	? make "foo "abc;comment
	~ def
	? print :foo
	abcdef


4.3c Including punctuation characters in words

Sometimes users want to include in a word what would ordinarily be
word-delimiting punctuation characters.  There are two ways to do this:

	abc\ def		backslash before special character
	|abc def|		string of characters in vertical bars

Unfortunately both the visible semantics and the underlying implementation
of these has changed over time, so many names of procedures are now wrong.
Originally there was only the backslash form, and it meant what vertical
bars now mean.

First, the above-the-line semantics:  Backslashed characters are treated as
non-punctuation when first seen, but if the word containing a backslashed
character is runparsed, the character has its usual special meaning.  By
contrast, characters within vertical bars are never interpreted specially,
even when runparsed.

Originally, backslashed characters were special forever, and were specially
marked to indicate this.  The UCBLogo primitive BACKSLASHEDP checks for this
special marking.  Alas, backslashed characters are no longer specially marked;
the backslash just gets them into the same word as their neighbors when the
line is first parsed.  This special marking now applies only to punctuation
characters that were entered within vertical bars, so the name should be
BARREDP or something like that.

Now, the ugly below-the-line history.  Originally, a punctuation character
that was intended to be forever special was marked by having its parity bit
set (0200, 0x80).  This works in the USA because all the ASCII characters
have codes less than 128.  Therefore, UCBLogo uses functions

	setparity(char) -> same char with parity bit on
	clearparity(char) -> same char with parity bit off
	getparity(char) -> nonzero if char has parity bit set

to manipulate these marked characters.  Then European Logo users complained
that Logo wasn't properly handling letters with accents, which, it turns out,
are represented using ASCII codes greater than 128.

The new approach uses otherwise-unused ASCII control characters, with codes
less than 32 (040, 0x20, ASCII space).  Some of the control characters are
"format effectors": tab, backspace, formfeed, newline, return.  But the others
are pretty useless, and so UCBLogo uses them to represent "backslashed"
(really vertical barred) punctuation characters.  There are only barely enough
of these unused codes for the Logo punctuation characters, though:

	space, tab, newline, ()[]+-*/=<>"\:;|{}~

(Don't be confused; tab is itself a control code, but backslashed-tab is a
different control code!)

The special node types BACKSLASH_STRING and VBAR_STRING are used if there are
special characters within the word.  In fact most of the interpreter doesn't
worry about these types; the only important use is that compare_nodes() strips
the "parity" from the characters in the strings before comparing them if
either word is of either of these types.


4.4 Symbols

Scheme distinguishes between a string (an array of characters) and a symbol
(an indivisible atom that happens to have a printable name).  Strings are for
text manipulation; symbols are to name variables and the like.

Logo, above the line, uses a single Word type that subsumes both of these
purposes.  Nevertheless, it's useful to maintain a distinction internally.
When you want to find a procedure or variable in the symbol table, you don't
want to spend time comparing strings letter by letter to find it.

The internal equivalent of a Scheme symbol is called an "object" -- an
unfortunate choice since we plan to add object-oriented programming to Logo.
But for now, let's understand that the word "object" refers to this internal
Logo data structure.

An object is a list with the following elements:

	[canonical procedure value plist flags caseobj1 caseobj2 ...]

The "canonical" element is a caseobj, which will be explained later.

The procedure, value, and plist elements are the ones named by this symbol
(the procedure FOO, the variable FOO, and the property list FOO, for example).
These elements are UNDEFINED, UNBOUND, and NIL, respectively, if the symbol
does not name a procedure, a variable, or a property list.

The flags element is an INT node containing the following flags:

#define PROC_BURIED	01
#define VAL_BURIED	02
#define PLIST_BURIED	04
#define PROC_TRACED	010
#define VAL_TRACED	020
#define PLIST_TRACED	040
#define PROC_STEPPED	0100
#define VAL_STEPPED	0200
#define PLIST_STEPPED	0400
#define PROC_MACRO	01000
#define PERMANENT	02000

The meaning of BURIED, TRACED, and STEPPED is explained in the user
documentation of the Logo primitives BURY, TRACE, and STEP.  (It is
meaningless to step a property list, but the flag exists anyway because
various primitives know that these flags come in groups of three.)
PROC_MACRO means that the symbol's procedure is a macro; PERMANENT means that
this symbol should not be deleted during garbage collection, even if it has no
procedure, variable, or property list.  (This feature is used, for example,
for the special variables that users can set to control Logo's behavior, such
as ERRACT and CASEIGNOREDP.  Pointers to these symbols are kept in global
variables so that Logo doesn't have to look them up in the symbol table each
time they're used, so it's important that the symbol not be deleted and
recreated.)

The remaining elements are nodes of type CASEOBJ, for "case object."
In Logo, names of procedures, variables, and property lists are case
insensitive, so FOO and Foo and foo all name the same procedure.  When a
program is runparsed, all the words that might name procedures, etc., are
turned from strings into symbols, so that when the program is run, the named
procedures, etc., can be found quickly, by dereferencing pointers in the
program, without string manipulation.  But we want to remember the exact
way the user typed the name, in case we print it.

This may be clearer with an example.  Consider these instructions:

	make "Foo 3
	print "Foo

In the first instruction, we really don't care that the user said Foo rather
than FOO or some other capitalization, but in the second instruction, we want
to print the word exactly as typed.  But the parser doesn't know anything
about the semantics of MAKE or PRINT, let alone some user-defined procedure
that might do both.  So we want to turn these strings into symbols, but we
also want to remember the original strings.

A case object is a pair whose car is a string and whose cdr is an object.
Given a case object, we can dereference its car to print it, or dereference
its cdr to find the procedure, variable, or property list that it names.

Each object includes all of its case objects as elements, so that when we
delete the object we can delete the CASEOBJs also.  (This is therefore a
circular structure, since the object points to the caseobj and the caseobj
points to the object.)  Each object also includes a "canonical caseobj,"
which is the one with all lower case letters; a canonical caseobj is created
for every object even if the user never types the name that way.  The
canonical caseobj is used for case-independent comparisons.

When an object is created, it is also entered into Logo's global environment,
which is a hash table.  The hash function is based on the canonical string.
This process -- creating the object, creating the caseobj as spelled by the
user, creating the canonical caseobj, and entering the object into the hash
table -- is called *interning* the symbol.  The procedure intern(word) takes a
word of any type (number, string, caseobj, etc.) and returns the caseobj with
the spelling used in the argument.  If the word has already been interned,
intern() finds the existing object in the hash table.  If not, one is created.
If the word has been interned, but not with this spelling (that is, not with
this upper-lower-case pattern), a new caseobj is created that points to the
existing object.

Note that a caseobj is a pair, but it has NT_WORD on and NT_LIST off in its
type identifier, because Logo primitives should treat it as a word, not as a
list.


4.5 Quote and colon types

When a line containing the notations "FOO or :FOO is parsed, these strings
are converted to nodes of type QUOTE and type COLON, respectively.  These
nodes are pairs in which only the car is meaningful; it points to a caseobj
of the word without the leading quote or colon.  Like the CASEOBJ type, the
QUOTE and COLON types are words above the line but pairs internally.


4.6 Arrays

Like strings, arrays are represented as nodes that point to an external block
of memory.  Here are the selectors for array nodes:

#define getarrdim(node)		((node)->n_dim)
#define getarrorg(node)		((node)->n_org)
#define getarrptr(node)		((node)->n_array)

The array dimension is the number of elements in it.  The array origin is the
index used for the first element of the array (1 by default, often 0, but can
be anything).  The array pointer points to the actual data, which is a block
of pointers to nodes.

There is no need for a reference count in array blocks, because there are no
primitives that extract a subarray; if you want one, you must allocate a new
array and copy the desired elements manually.


4.7 Primitive procedures

There are several node type codes for primitive procedures: PRIM, MACRO,
TAILFORM, and INFIX.  Macros are described in section 3.3; tailforms are in
section 3.1; the infix operators are the usual + - * / =.  All other
primitives are of type PRIM.

The fields of a primitive are

#define getprimfun(node)        ((node)->n_pfun)
#define getprimmin(node)        ((node)->n_pmin)
#define getprimmax(node)        ((node)->n_pmax)
#define getprimdflt(node)       ((node)->n_pdef)
#define getprimpri(node)        ((node)->n_ppri)

Fun is a pointer to the C procedure that implements the primitive; min is the
minimum number of inputs; max is the maximum number; dflt is the default
number; pri is the priority of this primitive, used in treeification of
expressions that involve infix operators.  Multiplication and division are
highest priority (tightest binding); then addition and subtraction; then
comparisons; and lowest of all are prefix operators.

The four numbers are stored as short (16-bit) integers, so that all five
fields fit in no more space than the three pointers of a cons node.

All primitive procedures are of C type

	NODE *foo(NODE *args)

They are given a list of actual argument values and must return a (pointer to
a) node, or UNBOUND if the Logo primitive should not return a value.


5. Garbage collector
--------------------

This section is not a primer on garbage collection.  A good general survey
on this subject is

   Paul R. Wilson. Uniprocessor garbage collection techniques. In Proc of
   International Workshop on Memory Management in the Springer-Verlag Lecture
   Notes in Computer Science series., St. Malo, France, September 1992.
   http://www.cs.utexas.edu/users/oops/papers.html#gcsurvey

Even before reading Wilson, read SICP 5.3 if you're a gc novice.

UCBLogo uses a generational, mark-sweep garbage collector.  When the
interpreter needs to allocate a node, it calls newnode(), which looks for an
available node.  If no nodes are free, newnode() calls do_gc(), which disables
pause and stop interrupts and calls gc().  But the main purpose of do_gc is
that it allocates a lot of local variables, so that any register variables in
its caller will be forced onto the C stack.

Do_gc() and gc() take a Boolean argument, which is FALSE for a generational
garbage collection and TRUE for a full garbage collection.  Automatically
triggered gc is generational; a full GC can be requested by the user (see the
user documentation of the Logo GC command) and is implied by the use of the
NODES operation, so that the in-use count shown to the user will be correct.


5.1 Node allocation

All Logo data types are stored in fixed-size blocks of type (struct node),
which is given the typedef name NODE.  Logo allocates many of these nodes at a
time; the number allocated at once can be set by the user (with the
.SETSEGSIZE command), but is initially 2000 for DOS, 4000 for 68000 Mac,
and 16,000 for other platforms.  The global variable segment_list points to a
linked list of segments; a "segment" is a contiguous block of nodes.  The
global variable free_list points to a linked list of unused nodes.  There is
only one free list, not one per segment.

Segments are never deallocated, even if every node in the segment is free.
A new segment is allocated only if a gc fails to free up enough nodes, so that
Logo's total memory use doesn't grow unnecessarily.


5.2 The mark phase

Marking a node can lead to several other nodes that should be marked.  To keep
track of these pending tasks, Logo uses a fixed-size stack, of size 4000 for
DOS, 8000 for 68000 Mac, or 16,000 for other platforms.  Long lists should not
cause overruns of this stack; it's the depth of lists that matters.  If the
stack overflows, garbage collection stops, and the user is advised to save
data and quit.  (Logo maintains a small "reserve tank" of nodes so that this
saving will be possible.)

All node-valued global variables in the interpreter are marked first.  This
includes evaluator registers, some of which are not in individual C variables
but in a (struct registers) block.  The global variable Regs_Node is a node
that points to this register block; the node itself is unused except by the
garbage collector.

Among the C globals are stack, which is the evaluator's data stack (a list of
nodes, not a contiguous block), and var_stack, which is the stack of local
procedure-call frames (containing shadowed bindings as explained in 3.4c
above).

Then the entries in Logo's global symbol table (the hash table) are marked.

Finally, the C stack is examined.  Since we don't know which entries in the
stack are pointers to nodes, every value on the stack is examined to see if it
could possibly be a pointer to a node.  This is done by procedure
valid_pointer(), which first compares the address with the position and size
of each segment.  If the value is within a segment, the procedure then sees
whether it would address the beginning of a node within that segment.  If so,
the garbage collector assumes that the value is indeed a node pointer, and
marks the node.  This procedure is one of the places where optimizing C
compilers tend to produce code that doesn't work.

Some machines (DOS and 68000 Mac) have 32-bit addresses but require only
16-bit alignment.  The garbage collector therefore, on those platforms, looks
at overlapping 32-bit values.  Also, stacks grow upward in some systems and
downward in others; Logo tries to figure this out.  (Almost the first thing
main() does is to set the global variable bottom_stack to the address of its
local variable exec_list; gc() compares this with the address of its own stack
frame to determine the direction of growth.)  This is a good place to look for
problems when porting Berkeley Logo to a new processor or operating system.

All of the above categories of marking are done only for nodes in the current
generation.  The garbage collector then looks for nodes in older generations
that have been modified to point to newer nodes.  To make this check possible,
the list mutators (the C procedures setcar, setcdr, and setobject) check for
old-to-young pointing as they do the mutation, and keep lists of these
old-to-young-pointing nodes.


5.3 GCTWA

If a single-generation gc doesn't yield enough free space, older generations
are examined.  When a full gc is required, Logo also performs a GCTWA (Garbage
Collect Truly Worthless Atoms).  Here's what that means:

Ordinarily, every word read in a Logo instruction is entered into the symbol
table, just in case it will become the name of a procedure, variable, or
property list.  Since the symbol table itself is marked, and since the symbol
table points to all these symbols, the symbols are never found to be free in
the mark phase.  A GCTWA goes through the symbol table looking for entries
that are not, in fact, the names of a procedure, variable, or property list.
(A subtlety is that a symbol's value might be UNBOUND even though it is in
fact a global variable, because the global binding might have been shadowed by
a LOCAL command, creating a local variable that doesn't yet have a value.  But
this isn't a problem because in that case var_stack points to the symbol table
entry, so it will be marked.)  Some symbols have a PERMANENT flag associated
with them, so that the entries won't be removed in GCTWA even if the names are
unused; these are mainly for Logo special variables such as CaseIgnoredP.

The first step in a gctwa is that all caseobj nodes are flagged; in the mark
phase, mark() will not actually mark a flagged caseobj until it is called
twice for the same caseobj.  (The first call to mark() for a flagged caseobj
just clears the flag; the second call does the normal mark operation.)  This
is because every caseobj is pointed to (directly or indirectly) by the hash
table, and we want to know whether this caseobj is pointed to by something
else, usually by appearing in the body of a procedure.

After this special mark phase, the symbol table is examined for unused
symbols.  A symbol is used if it names a variable, procedure, or property
list, or if one of its caseobj nodes has been marked.  Unused symbols are
removed from the symbol table.  Finally, the mark phase is repeated so that
non-caseobj nodes that are part of stale symbols won't be marked.


5.4 The sweep phase

In addition to finding and freeing unused nodes, the sweep phase also promotes
nodes that have survived a certain number of garbage collections to the next
older generation.  This promotion requires checking whether the node points to
other nodes that are now younger than it is.

If the sweep phase doesn't free enough nodes, it tries another generation.
When all generations have been collected, if there still are not enough free
nodes, Logo asks the operating system for another segment.  If that fails, the
user is notified.

"Enough" is one of several tuning values defined in mem.c.  Besides the size
of a segment, there are three more:

/* Number of times to collect at the current GC state before going to
   the next state. Basically the number of times a given generation is
   collected before its members are moved to an older generation */
#define gc_age_threshold 4

/* A new segment of nodes is added if fewer than freed_threshold nodes are
   freed in one GC run */
#define freed_threshold ((long int)(seg_size * 0.4))

/* The number of generations */
#define NUM_GENS 4


5.5 Per-node gc data

Each node contains five words of information to support gc:

    int my_gen; /* Nodes's Generation */ /*GC*/
    int gen_age; /* How many times to GC at this generation */
    long int mark_gc;	/* when marked */
    struct logo_node *next; /* Link together nodes of the same age */ /*GC*/
    struct logo_node *oldyoung_next;

my_gen is a number in the range [0, NUM_GENS-1].  It is zero for newly
allocated nodes, and increases as the node survives multiple GCs.

gen_age is in the range [1, gc_age_threshold].  It is initially
gc_age_threshold, and is decreased by 1 at each gc.  When it hits zero, the
node is promoted to a higher (older) generation, and gen_age is reset to
gc_age_threshold.

mark_gc is initially zero.  To mark a node, it is set equal to current_gc, a
counter that's incremented at each garbage collection.  (If Logo runs long
enough for four billion garbage collections, this won't work.  :-)

next is a pointer to another node in the same generation as this node; the
generations are maintained as linked lists.

oldyoung_next is NIL except for old nodes that point to younger nodes.  Such
nodes are linked together using this field.

That's five words of gc information in a node that has four words of non-gc
information!  There has to be a better way, a project for the list of undone
projects.