Snowball 3.0.1 (2025-05-09)
===========================

Python
------

* The __init__.py in 3.0.0 was incorrectly generated due to a missing
  build dependency and the list of algorithms was empty.  First reported by
  laymonage.  Thanks to Dmitry Shachnev, Henry Schreiner and Adam Turner for
  diagnosing and fixing.  (#229, #230, #231)

* Add trove classifiers for Armenian and Yiddish which have now been registered
  with PyPI.  Thanks to Henry Schreiner and Dmitry Shachnev.  (#228)

* Update documented details of Python 2 support in old versions.

Snowball 3.0.0 (2025-05-08)
===========================

Ada
---

* Bug fixes:

  + Fix invalid Ada code generated for Snowball `loop` (it was partly Pascal!).
    None of the stemmers shipped in previous releases triggered this bug, but
    the Turkish stemmer now does.

  + The Ada runtime was not tracking the current length of the string
    but instead used the current limit value or some other substitute, which
    manifested as various incorrect behaviours for code inside of `setlimit`.

  + `size` was incorrectly returning the difference between the limit and the
    backwards limit.

  + `lenof` or `sizeof` on a string variable generated Ada code that didn't
    even compile.

  + Fix incorrect preconditions on some methods in the runtime.

  + Fix bug in runtime code used by `attach`, `insert`, `<-` and string
    variable assignment when a (sub)string was replaced with a larger string.
    This bug was triggered by code in the Kraaij-Pohlmann Dutch stemmer
    implementation (which was previously not enabled by default but is now the
    standard Dutch stemmer).

  + Fix invalid code generated for `insert`, `<-` and string variable
    assignment.  This bug was triggered by code in the Kraaij-Pohlmann
    Dutch stemmer implementation (which was previously not enabled by default
    but is now the standard Dutch stemmer).

  + Generate valid code for programs which don't use `among`.  This didn't
    affect code generation for any algorithms we currently ship.

  + If the end of a routine was unreachable code the Snowball compiler
    would think the start of the next routine was also unreachable and would
    not generate it.  This didn't affect code generation for any algorithms we
    currently ship.

* Code quality:

  + Only declare variables A and C when each is needed.

  + Fix indentation of generated declarations.

  + Drop extra blank line before `Result := True`.

C/C++
-----

* Bug fixes:

  + Fix potential NULL dereference in runtime code if we failed to allocate
    memory for the p or S member for a Snowball program which uses one or more
    string variables.  Problem was introduced in Snowball 2.0.0.  Fixes #206,
    reported by Maxim Korotkov.

  + Fix invalid C code generated when a failure is handled in a context with
    the opposite direction to where it happened, for example:

        externals (stem)
        define stem as ( try backwards 'x' )

    This was fixed by changing the C generator to work like all the other
    generators and pre-generate the code to handle failure.

  + Eliminate assumptions that NULL has all-zero bit pattern.  We don't know
    of any current platforms where this assumption fails, but the C standard
    doesn't require an all-zero bit pattern for NULL.  Fixes #207.

* Optimisations:

  + Store index delta for among substring_i field.  This makes trying
    substrings after a failed match slightly faster because we can just add
    the offset to the pointer we already have to the current element.
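
    A minimal sketch of the idea in Python rather than C, using list indices
    in place of pointers.  The table layout, the helper names and the use of
    a delta of 0 to mean "no substring entry" are assumptions made for
    illustration; they are not taken from the generated C code.

```python
# Hypothetical among table: (string, absolute index of longest substring entry).
# -1 means there is no shorter entry to fall back to.
AMONG = [("a", -1), ("ab", 0), ("abc", 1), ("b", -1)]


def to_deltas(table):
    """Convert absolute substring indices to offsets relative to each entry."""
    return [(s, 0 if sub < 0 else sub - i) for i, (s, sub) in enumerate(table)]


def chain_absolute(table, i):
    """Walk the fallback chain by looking up absolute indices."""
    out = []
    while i >= 0:
        out.append(table[i][0])
        i = table[i][1]
    return out


def chain_delta(table, i):
    """Walk the fallback chain by adding the stored offset to the position."""
    out = []
    while True:
        s, delta = table[i]
        out.append(s)
        if delta == 0:  # assumed sentinel: nothing shorter to try
            return out
        i += delta
```

    Both walks visit the same entries; the delta form only needs the current
    position and the stored offset to reach the next candidate.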

* Code quality:

  + Improve formatting of generated code.

C#
--

* Bug fixes:

  + Add missing runtime support for testing for a string var at the current
    position when working forwards.  This situation isn't exercised by any of
    the stemming algorithms we currently ship.

  + Adjust generated code to work around a code flow analysis bug in the `mcs`
    C# compiler.

* Code quality:

  + Prune unused `using System.Text;`.

  + Generate C# with UTF-8 source encoding.  This makes the generated code
    easier to follow, which helps during development.  It's also a bit smaller.
    For now codepoints U+0590 and above are still emitted as escape sequences
    to avoid confusing source code rendering when RTL scripts are involved.

Go
--

* Optimisations:

  + Drop some unneeded Go code generated for string `$`.  None of the shipped
    stemmers use string `$`, though the Schinke Latin stemmer algorithm on the
    website does.

* Code quality:

  + Dispatch among result with `switch` instead of an `if` ... `else if` chain
    (which it looks like we used because the Go generator evolved from the
    Python generator, and Python didn't use to have a switch-like construct).
    This doesn't make a measurable speed difference so it seems the Go compiler
    is optimising both to equivalent code, but using a switch here seems
    clearer, a better match for the intent, and is a bit simpler to generate.

  + Generate Go with UTF-8 source encoding.  This makes the generated code
    easier to follow, which helps during development.  It's also a bit smaller.
    For now codepoints U+0590 and above are still emitted as escape sequences
    to avoid confusing source code rendering when RTL scripts are involved.

Java
----

* The Java code generated by Snowball now requires Java >= 7.  Java 7
  was released in 2011, and Java 6's EOL was 2013 so we don't expect this
  to be a problematic requirement.  See #195.

* Optimisations:

  + We now store the current string in a `char[]` rather than using a
    `StringBuilder` to reduce overheads.  The `getCurrent()` method continues
    to return a Java `String`, but the `char[]` can be accessed using the new
    `getCurrentBuffer()` and `getCurrentBufferLength()` methods.  Patch from
    Robert Muir (#195).

  + Use a more efficient mechanism for calling `among` functions.  Patch from
    Robert Muir (#195).

* Code quality:

  + Consistently put `[]` right after element type for array types, which seems
    the most used style.

  + Fix javac warnings in SnowballProgram.java.

  + Improve formatting of generated code.

Javascript
----------

* Bug fixes:

  + Use base class specified by `-p` in string `$` rather than hard-coding
    `BaseStemmer` (which is the default if you don't specify `-p`).  None of
    the shipped stemmers use string `$`, though the Schinke Latin stemmer
    algorithm on the website does.

* Code quality:

  + Modernise the generated code a bit.  Loosely based on changes proposed in
    #123 by Emily Marigold Klassen.

* Other changes:

  + The Javascript runner is now specified by make variable `JSRUN` instead
    of `NODE` (since node is just one JS implementation).  The default value
    is now `node` instead of `nodejs` (older Debian and Ubuntu packages used
    `/usr/bin/nodejs` because `/usr/bin/node` was already in use by a
    completely different package, but that has since changed).

Pascal
------

* Bug fixes:

  + Add missing semicolons to code generated in some cases for a function which
    always succeeds or always fails.  The new dutch.sbl was triggering this
    bug.

  + If the end of a routine was unreachable code the Snowball compiler
    would think the start of the next routine was also unreachable and would
    not generate it.  This didn't affect code generation for any algorithms we
    currently ship.

* Code quality:

  + Eliminate commented out code generated for string `$`.  None of the shipped
    stemmers use string `$`, though the Schinke Latin stemmer algorithm on the
    website does.

* Other changes:

  + Enable warnings, etc from fpc.

  + Select GNU-style diagnostic format.

Python
------

* Optimisations:

  + Use Python set for grouping checks.  This speeds up running the Python
    testsuite by about 4%.

  + Routines used in `among` are now referenced by name directly in the
    generated code, rather than using a string containing the name.  This
    avoids a `getattr()` call each time an among wants to call a routine.  This
    doesn't seem to make a measurable speed difference, but it's cleaner and
    avoids problems with name mangling.  Suggested by David Corbett in #217.

  + Simplify code generated for `loop`.  If the iteration count is constant and
    at most 4 then iterate over a tuple, which microbenchmarking shows is
    faster.  The only current uses of `loop` in the shipped stemmers are
    `loop 2`, so they benefit from this.  Otherwise we now use `range(AE)`
    instead of `range(AE, 0, -1)` (the actual value of the loop variable is
    never used so only the number of iterations matters).
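
    A rough sketch of the shapes described in the first and third items
    above.  The names `G_V`, `in_grouping`, `repeat_twice` and `repeat_n`
    are hypothetical, and the exact code the generator emits may differ.

```python
# Grouping membership as a set lookup (hypothetical vowel grouping).
G_V = frozenset("aeiouy")


def in_grouping(ch):
    """Return True if ch belongs to the grouping."""
    return ch in G_V


def repeat_twice(action):
    """Small constant count: iterate over a tuple with two elements."""
    for _ in 0, 0:
        action()


def repeat_n(action, count):
    """General case: only the number of iterations matters."""
    for _ in range(count):
        action()
```

    The loop variable is unused in both forms, which is why counting up with
    `range(count)` is equivalent to counting down.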

* Bug fixes:

  + Correctly handle stemmer names with an underscore.

* Code quality:

  + Generate Python with UTF-8 source encoding.  This makes the generated code
    easier to follow, which helps during development.  It's also a bit smaller.
    For now codepoints U+0590 and above are still emitted as escape sequences
    to avoid confusing source code rendering when RTL scripts are involved.

* Other changes:

  + Set python_requires to indicate to install tools that the generated code
    won't work with Python 3.0.x, 3.1.x and 3.2.x (due to use of `u"foo"`
    string literals).  Closes #192 and #191, opened by Andreas Maier.

  + Add classifiers to indicate support for Python 3.3 and for 3.8 to 3.13.
    Fixes #158, reported by Dmitry Shachnev.

  + Stop marking the wheel as universal, which had started to give a warning
    message.  Patch from Dmitry Shachnev (#210).

  + Stop calling `setup.py` directly which is deprecated and now produces a
    warning - use the `build` module instead.  Patch from Dmitry Shachnev
    (#210).

Rust
----

* Optimisations:

  + Shortcut unnecessary calls to find_among, porting an optimization from the
    C generator.  In some stemming benchmarks this improves the performance
    of the Rust English stemmer by about 27%.  Patch from jedav (#202).

* Code quality:

  + Suppress unused_parens warning, for example triggered by the code generated
    for `$x = x*x` (where `x` is an integer).

  + Dispatch `among` result with `match` instead of an `if` ... `else if` chain
    (which it looks like we used because the Rust generator evolved from the
    Python generator, and Python didn't use to have a switch-like construct).
    This results in a 3% speed-up for an unoptimised Rust compile but doesn't
    seem to make a measurable difference when optimising so it seems the Rust
    compiler is optimising both to equivalent code.  However using a `match`
    here seems clearer, a better match for the intent, and is a bit simpler to
    generate.

  + Generate Rust with UTF-8 source encoding.  This makes the generated code
    easier to follow, which helps during development.  It's also a bit smaller.
    For now codepoints U+0590 and above are still emitted as escape sequences
    to avoid confusing source code rendering when RTL scripts are involved.

New stemming algorithms
-----------------------

* Add Esperanto stemmer from David Corbett (#185).

* Add Estonian algorithm from Linda Freienthal (#108).

Behavioural changes to existing algorithms
------------------------------------------

* Dutch: Switch to Kraaij-Pohlmann as the default for Dutch.  In case you
  want Martin Porter's Dutch stemming algorithm for compatibility, this is now
  available as `dutch_porter`.  Fixes #1, reported by gboer.

* Dutch (Kraaij-Pohlmann): Fix differences between the Snowball implementation
  and the original C implementation.

* Dutch (Kraaij-Pohlmann): Add a small number of exceptions to the Snowball
  implementation to avoid unwanted conflations.  This addresses all cases so
  far identified which Martin's Dutch stemmer handled better.  Fixes #208.

* Dutch (Porter): The "at least 3 characters" part of the R1 definition was
  actually implemented such that when working in UTF-8 it was "at least 3
  bytes".  We stripped accents normally found in Dutch except for `è` before
  setting R1, and no Dutch words starting `è` seem to stem differently
  depending on encoding, but proper nouns and other words of foreign origin may
  contain other accented characters and it seems better for the stemmer to
  handle such words the same way regardless of the encoding in use.

* English: Replace '-ogist' with '-og' to conflate "geologist" and "geology", etc.
  Suggested by Marc Schipperheijn on snowball-discuss.

* English: Add extra condition to undoubling.  We no longer undouble if the
  double consonant is preceded by exactly "a", "e" or "o" to avoid conflating
  "add"/"ad", "egg"/"eg", "off"/"of", etc.  Fixes #182, reported by Ed Page.

* English: Avoid conflating 'emerge' and 'emergency'.  Reported by Frederick Ross
  on snowball-discuss.

* English: Avoid conflating 'evening' and 'even'.  Reported by Ann B on
  snowball-discuss.

* English: Avoid conflating 'lateral' and 'later'.  Reported by Steve Tolkin on
  snowball-discuss.

* English: Avoid conflating 'organ', 'organic' and 'organize'.

* English: Avoid conflating 'past' and 'paste'.  Reported by Sonny on
  snowball-discuss.

* English: Avoid conflating 'universe', 'universal' and 'university'.  Reported
  by Clem Wang on snowball-discuss.

* English: Handle -eed and -ing exceptions in their respective rules.
  This avoids the overhead of checking for them for the majority of
  words which don't end -eed or -ing.  It also allows us to easily handle
  vying->vie and hying->hie at basically no extra cost.  Reduces the time to
  stem all words in our English word list by nearly 2%.

* French: Remove elisions as first step.  See #187.  Originally reported by
  Paul Rudin and kelson42.

* French: Remove -aise and -aises so for example, "française" and "françaises"
  are now conflated with "français".  Fixes #209.  Originally reported by
  ririsoft and Fred Fung.

* French: Avoid incorrect conflation of `mauvais` (bad) with `mauve` (mauve,
  mallow or seagull); avoid conflating `mal` with `malais`, `pal` with
  `palais`, etc.

* French: Avoid conflating `ni` (neither/nor) with `niais`
  (inexperienced/silly) and `nie`/`nié`/`nier`/`nierais`/`nierons` (to deny).

* French: -oux -> -ou.  Fixes #91, reported by merwok.

* German: Replace with the "german2" variant.  This normalises umlauts ("ä" to
  "ae", "ö" to "oe", "ü" to "ue") which is presumably much less common in
  newly created text than it once was as modern computer systems generally
  don't have the limitations which motivated this, but there will still be
  large amounts of legacy text which it seems helpful for the stemmer to
  handle without having to know to select a variant.

  On our sample German vocabulary which contains 35033 words, 77 words give
  different stems.  A significant proportion of these are foreign words, and
  some are proper nouns.  Some cases definitely seem improved, and quite a few
  are just different but effectively just change the stem for a word or group
  of words to a stem that isn't otherwise generated.  There don't seem to be
  any changes that are clearly worse, though there are some changes that have
  both good and bad aspects to them.

  Fixes #92, reported by jrabensc.

* German: Don't remove -em if preceded by -syst to avoid overstemming words
  ending -system.  This change means we now conflate e.g. "system" and
  "systemen".  Partly addresses #161, reported by Olga Gusenikova.

* German: Remove -erin and -erinnen suffixes which conflates singular and
  plural female versions of nouns with the male versions.  Fixes #85 and
  partly addresses #161, reported by Olga Gusenikova.

* German: Replace -ln and -lns with -l.  This improves 82 cases in the current
  sample data without making anything worse.  Tests on a larger word list look
  good too.  Partly addresses #161, reported by Olga Gusenikova.

* German: Remove -et suffix when we safely can.  Fixes #200, reported by Robert
  Frunzke.

* Greek: Fix "faulty slice operation" for input `ισαισα`.  The fix changes
  `ισα` to stem to `ισ` instead of the empty string, which seems better (and to
  be what the second paper actually says to do if read carefully).  Fixes #204,
  reported by subnix.

* Italian: Address overstemming of "divano" (sofa) which previously stemmed to
  "div", which is the stem for 'diva' (diva).  Now it is stemmed to 'divan',
  which is what its plural form 'divani' already stemmed to.  Fixes #49,
  reported by francesco.

* Norwegian: Improve stemming of words ending -ers.  Fixes #175, reported by
  Karianne Berg.

* Norwegian: Include more accented vowels - treating "ê", "ò", "ó" and "ô"
  as vowels improves the stemming of a fairly small number of words, but
  there's basically no cost to having extra vowels in the grouping, and some
  of these words are commonly used.  Fixes #218, reported by András Jankovics.

* Romanian: Fix to work with Romanian text encoded using the correct Unicode
  characters.  Romanian uses a "comma below" diacritic on letters "s" and "t"
  ("ș" and "ț").  Before Unicode these weren't easily available so Romanian
  text was written using the visually similar "cedilla" diacritic on these
  letters instead ("ş" and "ţ").  Previously our stemmer only recognised the
  latter.  Now it maps the cedilla forms to "comma below" as a first step.
  Patch from Robert Muir.

* Spanish: Handle -acion like -ación and -ucion like -ución.  It's apparently
  common to miss off accents in Spanish, and there are examples in our test
  vocabulary that this change helps.  Proposed by Damian Janowski.

* Swedish: Replace suffix "öst" with "ös" when preceded by any of 'iklnprtuv'
  rather than just 'l'.  The new rule only requires the "öst" to be in R1
  whereas previously we required all of "löst" to be.  This second tweak
  doesn't seem to affect any words ending "löst" but it conflates a few extra
  cases when combined with the expanded list of preceding letters, and seems
  more logical linguistically (since "ös" is akin to "ous" in English).  Fixes
  #152, reported by znakeeye.

* Swedish: Remove -et/-ets in cases where it helps.  Removing -et can't be done
  unconditionally because many words end in -et where this isn't a suffix.
  However it's a very common suffix so it seems worth crafting a more complex
  condition under which to remove.  Fixes #47.

* Turkish: Remove proper noun suffixes.  For example, `Türkiye'dir` ("it is
  Turkey") is now conflated with `Türkiye` ("Turkey").  Fixes #188.

* Yiddish: Avoid generating empty stem for input "גע" (not a valid word, but
  it's better to avoid an empty stem for any non-empty input).
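
The Romanian change above amounts to a character mapping applied before
stemming.  A minimal sketch in Python, assuming lowercase input; the names
`CEDILLA_TO_COMMA_BELOW` and `normalise_romanian` are hypothetical and this
is not the stemmer's actual code:

```python
# Map the legacy cedilla forms to the "comma below" forms.
CEDILLA_TO_COMMA_BELOW = str.maketrans({
    "\u015f": "\u0219",  # s with cedilla -> s with comma below
    "\u0163": "\u021b",  # t with cedilla -> t with comma below
})


def normalise_romanian(word):
    """Return word with cedilla forms replaced by comma-below forms."""
    return word.translate(CEDILLA_TO_COMMA_BELOW)
```

Text which already uses the comma-below characters passes through unchanged,
so both spellings of a word reach the rest of the algorithm in the same form.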

Optimisations to existing algorithms
------------------------------------

* General change: Use `gopast` everywhere to establish R1 and R2 as it is a
  little more efficient to do so.

* Basque: Use an empty action rather than replacing the suffix with itself
  which seems clearer and is a little more efficient.

* Dutch (Porter): Optimise prelude routine.

* English: Remove unnecessary exception for `skis` as the algorithm stems
  `skis` to `ski` by itself (`skies` and `sky` do still need a special case to
  avoid conflation with `ski` though).

* Hungarian: We no longer take digraphs into account when determining where R1
  starts.  This can only make a difference to the stemming if we removed a
  suffix that started with the last character of the digraph (or with "zs" in
  the case of "dzs"), and that doesn't happen for any of the suffixes we remove
  for any valid Hungarian words.  This simplification speeds up stemming by
  ~2% on the current sample vocabulary list.  See #216.  Thanks to András
  Jankovics for confirming no Hungarian words are affected by this change.

* Lithuanian: Remove redundant R1 check.

* Nepali: Eliminate redundant check_category_2 routine.

* Tamil: Optimise by using `among` instead of long `or` chains.  The generated
  C version now takes 43% less time to process the test vocabulary.

* Tamil: Remove many cases which can't be triggered due to being handled by
  another case.

* Tamil: Clean up some uses of `test`.

* Tamil: Make `fix_va_start` simpler and faster.

* Tamil: Localise use of `found_a_match` flag.

* Tamil: Eliminate pointless flag changes.

* Turkish: Minor optimisations.
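
For readers unfamiliar with how `gopast` establishes R1, here is a rough
Python sketch of the usual definition (R1 is the region after the first
non-vowel following a vowel).  The vowel set and the function names are
assumptions for illustration; each algorithm defines its own vowel grouping.

```python
VOWELS = frozenset("aeiouy")  # hypothetical grouping; varies by language


def gopast(word, pos, predicate):
    """Move just past the first character at or after pos satisfying predicate.

    Returns the new position, or None if no such character exists.
    """
    for i in range(pos, len(word)):
        if predicate(word[i]):
            return i + 1
    return None


def r1_start(word):
    """Return the index where R1 starts (len(word) if R1 is empty)."""
    pos = gopast(word, 0, lambda ch: ch in VOWELS)
    if pos is None:
        return len(word)
    pos = gopast(word, pos, lambda ch: ch not in VOWELS)
    return len(word) if pos is None else pos
```

With this vowel set, `"beautiful"[r1_start("beautiful"):]` is `"iful"`.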

Code clarity improvements to existing algorithms
------------------------------------------------

* Stop noting dates changes were made in comments in the code - we now maintain
  a changelog in each algorithm's description page on the website (and the
  version control history provides a finer grained view).

* Always use `insert` instead of `<+` as the named command seems clearer.

* English: Add comments documenting motivating examples for all exceptional
  cases.

* Lithuanian: Change to recommended latin stringdef codes.  Using common codes
  makes it easier to work across algorithms, but they are more mnemonic so also
  seem clearer when just considering this one algorithm.

* Serbian: Change to recommended latin stringdef codes.  Using common codes
  makes it easier to work across algorithms, but they are more mnemonic so also
  seem clearer when just considering this one algorithm.

* Turkish: Use `{sc}` for s-cedilla and `{i}` for dotless-i to match other
  uses.

Compiler
--------

* Generic code generation improvements:

  + Show Snowball source leafname in "generated" comment at start of files.

  + Add generic reachability tracking machinery.  This facilitates various new
    optimisations, so far the following have been implemented:

    - Tail-calling
    - Simpler code for calling routines which always give the same signal
    - Simpler code when a routine ends in an integer test (this also allows
      eliminating an Ada-specific codegen optimisation which did something
      similar but only for routines which consisted *entirely* of a single
      integer test).
    - Dead code reporting and removal (only in simple cases currently)

    Currently this overlaps in functionality with the existing reachability
    tracking which is implemented on a per-language basis, and only for some
    languages.  This reachability tracking was originally added for Java
    where some unreachable code is invalid and results in a compile time error,
    but then seems to have been copied for some other newer languages which
    may or may not actually need it.  The approach it uses unfortunately
    relies on correctly updating the reachability flag anywhere in the
    generator code where reachability can change which has proved to be a
    source of bugs, some unfixed.  This new approach seems better and with some
    more work should allow us to eliminate the older code.  Fixes #83.

  + Omit check for `among` failing in generated code when we can tell at
    compile time that it can't fail.

  + Optimise `goto`/`gopast` applied to a grouping or inverted grouping (which
    is by far the most common way to use `goto`/`gopast`) for all target
    languages (new for Go, Java, Javascript, Pascal and Rust).

  + We never need to restore the cursor after `not`.  If `not` turns signal `f`
    into `t` then it sets `c` back to its old position; otherwise, `not`
    signals `f` and `c` will get reset by whatever ultimately handles this `f`
    (or the program exits and the position of `c` no longer matters).  This
    slightly improves the generated code for the `english` and `porter`
    stemmers.

  + Don't generate code for undefined or unused routines.

  + Avoid generating variable names and then not actually using them.  This
    eliminates mysterious gaps in the numbering of variables in the generated
    code.

  + Eliminate `!`/`not` from integer test code by generating the inverse
    comparison operator instead for all languages, e.g. for Python we now
    generate

      if self.I_p1 >= self.I_x:

    instead of

      if not self.I_p1 < self.I_x:

    This isn't going to be faster in compiled languages with an optimiser but
    for scripting languages it may be faster, and even if not, it makes for a
    little less work when loading the script.

  + Canonicalise `hop 1` to `next` as the generated code for `next` can be
    slightly more efficient.  This will also apply to `hop` followed by a
    constant expression which Snowball can reduce to `1`.

  + Avoid trailing whitespace in generated files.

  + Fix problems with --comments option:

    - When generating C code we would segfault for code containing `atleast`,
      `hop` or integer tests.
    - Fix missing comments for some commands in some target languages.
    - Fix inconsistent formatting of comments in some target languages.
    - Comments in C are now always on their own line - previously some were
      at the end of the line and some on their own line which made them
      harder to follow.
    - Emit comments before `among` and before routine/external definitions.

  + Simplify more cases of numeric expressions (e.g. `x * 1` to `x`).
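
    The inverse-operator rewrite for integer tests described above can be
    sketched as a lookup table.  This is an illustration only: `INVERSE`,
    `OPS` and `negated_test` are hypothetical names, not part of the
    Snowball compiler (which is written in C).

```python
import operator

# Each comparison operator paired with the operator that negates it.
INVERSE = {
    "<": ">=", ">=": "<",
    ">": "<=", "<=": ">",
    "==": "!=", "!=": "==",
}

# Python implementations, used only to check the table is consistent.
OPS = {
    "<": operator.lt, ">=": operator.ge,
    ">": operator.gt, "<=": operator.le,
    "==": operator.eq, "!=": operator.ne,
}


def negated_test(lhs, op, rhs):
    """Return source text testing the negation of `lhs op rhs` without `not`."""
    return f"{lhs} {INVERSE[op]} {rhs}"
```

    The rewrite is safe here because the operands are integers, for which
    each operator and its inverse always give opposite results.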

* Improve --help output.

* Division by zero during constant folding now gives an error.

* For `hop` followed by an unexpected token (e.g. `hop hop`) we were
  already emitting a suitable error but would then segfault.

* Emit error for redefinition of a grouping.

* Improve errors for `define` of an undeclared name.  We already peek at the
  next token to decide whether to try to parse as a routine or grouping.
  Previously we parsed as a routine if it was `as`, and a grouping otherwise,
  but routine definitions are more common and a grouping can only start with
  a literal string or a name, so now we assume a routine definition with a
  missing `as` if the next token isn't valid for either.

* Suppress duplicate (or even triplicate) "unexpected" errors for the same
  token when the compiler tried to recover from the error by adjusting the
  parse state and marking the token to be reparsed, but the same token then
  failed to parse in the new state.

* Fix NULL pointer dereference if an undefined grouping is used in the
  definition of another grouping.

* Fix mangled error for `set` or `unset` on a non-boolean:

  test.sbl:2: nameInvalid type 98 in name_of_type()

* Emit warning if `=>` is used.  The documentation of how it works doesn't
  match the implementation, and it seems it has only ever been used in the
  Schinke stemmer implementation (which assumes the implemented behaviour).
  We've updated the Schinke implementation to avoid it.  If you're using it
  in your own Snowball code please let us know.

* Improve errors for unterminated string literals.

* Fix NULL pointer dereference on invalid code such as `$x = $y`.

* If malloc fails while compiling the compiler will now report the failure
  and exit.  Previously the NULL return from malloc wasn't checked for so
  we'd typically segfault.

* `lenof` and `sizeof` applied to a string variable now mark the variable
  as used, which avoids a bogus error followed by a confusing additional
  message if this is the only use of that variable:

  lenofsizeofbug.sbl:3: warning: string 's' is set but never used
  Unhandled type of dead assignment via sizeof

  This situation is unlikely to occur in real world code.

* The reported line number for "string not terminated" error was one too high
  in the case where we were in a stringdef (but correct if we weren't).

* Eliminate special handling for among starter.  We now convert the starter
  to be a command before the among, adding an explicit substring if there
  isn't one.

* We now warn if the body of a `repeat` or `atleast` loop always signals
  `t` (meaning it will loop forever which is very undesirable for a stemming
  algorithm) or always signals `f` (meaning it will never loop, which seems
  unlikely to be what was intended).

* Release memory in compiler before exit.  The OS will free all allocated
  memory when a process exits, so this memory isn't actually leaked, but it can
  be annoying when using snowball as part of a larger build process with
  some leak-finding tools.  Patch from jsteemann in #166.

* Store textual data more efficiently in memory during Snowball compilation.
  Previously almost all textual data was stored as 16 bit values, but most
  such data only uses 8 bit character values.  Doubling the memory usage
  isn't really an issue as Snowball programs are tiny, but this also
  complicated code handling such data.  Now only literal strings use the
  16 bit values.

* Fix clang -Wunused-but-set-variable warning in compiler code.

* Fix a few -Wshadow warnings in compiler and enable this warning by default.

* Tighten parsing of `writef()` format strings.  We now error out on
  unrecognised escape codes or if a numbered escape is used with too high a
  number or a non-digit.  This change reveals that the Go and Rust generators
  were using the invalid escape ~A.  The old writef() code substituted this
  with just A, which is what is wanted, so this case was harmless.  However,
  being lenient here could hide bugs, especially when copying code between
  generators as they don't all support the same set of format codes.

Build system
------------

* Turn on Java warnings and make them errors.

* Compile C code with -g by default.  This makes debugging easier, and
  matches the default for at least some other build systems (e.g. autotools).

* Fix "make clean" to remove all built Ada files.

* Clean `stemtest` too.  Patch from Stefano Rivera.

* Add missing `COMMON_FILES` dependency to dist targets.

* GNUmakefile: Tidy up and make more consistent.

* GNUmakefile: Make use of $* to improve speed and readability.

* Use $(patsubst ...) instead of sed in .java.class rule which gives cleaner
  make output and is a bit more efficient.

* Add `WERROR` make variable to provide a way to add `-Werror` to existing
  CFLAGS.

libstemmer
----------

Testsuite
---------

* Give a clear error if snowball-data isn't found.  Fixes #196, reported by
  Andrea Maccis.

* Handle not thinning testdata better.  If THIN_FACTOR is set to 1 we no longer
  run gzipped test data through awk.  We also now handle THIN_FACTOR being set
  empty as equivalent to 1 for convenience.

* csharp_stemwords: Correctly handle a stemmer name containing an underscore.

* csharp_stemwords: Make `-i` option optional and read from stdin if omitted,
  like the C version does.

* csharp_stemwords: Process the input line by line which is more helpful for
  interactive testing, and also a little faster.

* Fix Java TestApp to allow a single argument.  The documented command line
  syntax is that you only need to specify the language and there was already
  code to read from stdin if no input file was specified, but at least two
  command line options were required.

* Fix deprecation warning in TestApp.java.

* Optimise TestApp.java by creating fewer objects.  Patch from Robert Muir.

* stemwords.py: We no longer create an empty output file if we fail to open the
  input file.

* stemwords: Improve error message to say "Out of memory or internal error"
  rather than just "Out of memory".

Documentation
-------------

* Include "what is stemming" section in each README.

* Include section on threads in each README.  Based on patch for Python from
  dbcerigo.

* Document that input should be lowercase with composed accents.  See #186,
  reported by 1993fpale.

* Add README section on building, including notes on cross-compiling.  Fixes
  #205, reported by sin-ack.

* CONTRIBUTING.rst: Clarify which charsets to list.

* CONTRIBUTING.rst: Add general advice section.  In particular, note to use
  spaces-only for indentation in most cases.  Thanks to Dmitry Shachnev for
  raising this point.

* CONTRIBUTING.rst: Note that UTF-8 is OK in comments.  Thanks to Dmitry
  Shachnev for asking.

* Fix some typos.  Patch from Josh Soref.

* Document that our CI now uses GitHub Actions.

* Update link to Greek stemmer PDF.  Patch from Michael Bissett (#33).

Snowball 2.2.0 (2021-11-10)
===========================

New Code Generators
-------------------

* Add Ada generator from Stephane Carrez (#135).

Javascript
----------

* Fix generated code to use integer division rather than floating point
  division.

  Noted by David Corbett.

Pascal
------

* Fix code generated for division.  Previously real division was used and the
  generated code would fail to compile with an "Incompatible types" error.

  Noted by David Corbett.

* Fix code generated for Snowball's `minint` and `maxint` constants.

Python
------

* Python 2 is no longer actively supported, as proposed on the mailing list:
  https://lists.tartarus.org/pipermail/snowball-discuss/2021-August/001721.html

* Fix code generated for division.  Previously the Python code we generated
  used integer division but rounded negative fractions towards negative
  infinity rather than zero under Python 2, and under Python 3 used floating
  point division.

  Noted by David Corbett.
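
  To illustrate the difference, here is a minimal sketch of integer division
  that rounds towards zero, compared with Python's built-in operators.  This
  is not the code the generator emits, and the helper name `div_towards_zero`
  is hypothetical:

```python
def div_towards_zero(a: int, b: int) -> int:
    """Integer division which rounds the quotient towards zero."""
    q = abs(a) // abs(b)
    # The quotient is negative only when the operands have different signs.
    return q if (a < 0) == (b < 0) else -q


print(-7 // 2)                  # floor division rounds towards negative infinity
print(-7 / 2)                   # true division gives a float under Python 3
print(div_towards_zero(-7, 2))  # rounds towards zero
```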

Code Quality Improvements
-------------------------

* C/C++: Generate INT_MIN and INT_MAX directly, including <limits.h> from
  the generated C file if necessary, and remove the MAXINT and MININT macros
  from runtime/header.h.

* C#: An `among` without functions is now generated as `static` and groupings
  are now generated as constant.  Patches from James Turner in #146 and #147.

Code generation improvements
----------------------------

* General:

  + Constant numeric subexpressions and constant numeric tests are now
    evaluated at Snowball compile time.

  + Simplify the following degenerate `loop` and `atleast` constructs where
    N is a compile-time constant:

    - loop N C where N <= 0 is a no-op.

    - loop N C where N == 1 is just C.

    - atleast N C where N <= 0 is just repeat C.

    If the value of N doesn't depend on the current target language, platform
    or Unicode settings then we also issue a warning.

Behavioural changes to existing algorithms
------------------------------------------

* german2: Fix handling of `qu` to match algorithm description.  Previously
  the implementation erroneously did `skip 2` after `qu`.  We suspect this was
  intended to skip the `qu` but that's already been done by the substring/among
  matching, so it actually skips an extra two characters.

  The implementation has always differed in this way, but there's no good
  reason to skip two extra characters here so overall it seems best to change
  the code to match the description.  This change only affects the stemming of
  a single word in the sample vocabulary - `quae` which seems to actually be
  Latin rather than German.

Optimisations to existing algorithms
------------------------------------

* arabic: Handle exception cases in the among they're exceptions to.

* greek: Remove unused slice setting, handle exception cases in the among
  they're exceptions to, and turn `substring ... among ...  or substring ...
  among ...` into a single `substring ... among ...` in cases where it is
  trivial to do so.

* hindi: Eliminate the need for variable `p`.

* irish: Minor optimisation in setting `pV` and `p1`.

* yiddish: Make use of `among` more.

Compiler
--------

* Fix handling of `len` and `lenof` being declared as names.

  For compatibility with programs written for older Snowball versions
  len and lenof stop being tokens if declared as names.  However this
  code didn't work correctly if the tokeniser's name buffer needed to
  be enlarged to hold the token name (i.e. 3 or 5 elements respectively).

* Report a clearer error if `=` is used instead of `==` in an integer test.

* Replace a single entry command list with its contents in the internal syntax
  tree.  This puts things in a more canonical form, which helps subsequent
  optimisations.

Build system
------------

* Support building on Microsoft Windows (using mingw+msys or a similar
  Unix-like environment).  Patch from Jannick in #129.

* Split out INCLUDES from CPPFLAGS so that CPPFLAGS can now be overridden by
  the user if required.  Fixes #148, reported by Dominique Leuenberger.

* Regenerate algorithms.mk only when needed rather than on every `make` run.

libstemmer
----------

* The libstemmer static library now has a `.a` extension, rather than `.o`.
  Patch from Michal Vasilek in #150.

Testsuite
---------

* stemtest: Test that numbers and numeric codes aren't damaged by any of the
  algorithms.  Regression test for #66.  Fixes #81.

* ada: Fix ada tests to fail if output differs.  There was an extra `| head
  -300` compared to other languages, which meant that the exit code of `diff`
  was ignored.  It seems more helpful (and is more consistent) not to limit how
  many differences are shown so just drop this addition.

* go: Stop thinning testdata.  It looks like we were only thinning because the
  test harness code was based on that for rust, which was based on that for
  javascript, which was only thinning because it was reading everything into
  memory and the larger vocabulary lists were resulting in out of memory
  issues.

* javascript: Speed up stemwords.js.  Process input line-by-line rather than
  reading the whole file into memory, splitting, iterating, and creating an
  array with all the output, joining and writing out a single huge string.
  This also means we can stop thinning the test data for javascript, which we
  were only doing because the huge arabic test data file was causing out of
  memory errors.  Also drop the -p option, which isn't useful here and
  complicates the code.

* rust: Turn on optimisation in the makefile rather than the CI config.  This
  makes the tests run in about 1/5 of the time and there's really no reason to
  be thinning the testdata for rust.

Documentation
-------------

* CONTRIBUTING.rst: Improve documentation for adding a new stemming algorithm.

* Improve wording of Python docs.

Snowball 2.1.0 (2021-01-21)
===========================

C/C++
-----

* Fix decoding of 4-byte UTF-8 sequences in `grouping` checks.  This bug
  affected Unicode codepoints U+40000 to U+7FFFF and U+C0000 to U+FFFFF and
  doesn't affect any of the stemming algorithms we currently ship (#138,
  reported by Stephane Carrez).
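
  For reference, a 4-byte UTF-8 sequence carries 3 payload bits in its lead
  byte and 6 in each continuation byte.  The sketch below shows a correct
  decode in Python; it is only an illustration, not the C runtime code, and
  the function name `decode_utf8_4byte` is hypothetical:

```python
def decode_utf8_4byte(seq: bytes) -> int:
    """Decode one 4-byte UTF-8 sequence to its Unicode codepoint."""
    if len(seq) != 4 or (seq[0] & 0xF8) != 0xF0:
        raise ValueError("not a 4-byte UTF-8 sequence")
    if any((b & 0xC0) != 0x80 for b in seq[1:]):
        raise ValueError("invalid continuation byte")
    # 3 bits from the lead byte, then 6 bits from each continuation byte.
    return (
        ((seq[0] & 0x07) << 18)
        | ((seq[1] & 0x3F) << 12)
        | ((seq[2] & 0x3F) << 6)
        | (seq[3] & 0x3F)
    )


# Check the boundaries of the ranges mentioned above.
for cp in (0x40000, 0x7FFFF, 0xC0000, 0xFFFFF):
    assert decode_utf8_4byte(chr(cp).encode("utf-8")) == cp
```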

Python
------

* Fix snowballstemmer.algorithms() method (#132, reported by kkaiser).

* Update code to generate trove language classifiers for PyPI.  All the
  natural languages we previously had stemmers for have now been added to
  PyPI's list, but Armenian and Yiddish aren't on it.  Patch from Dmitry
  Shachnev.

Code Quality Improvements
-------------------------

* Suppress GCC warning in compiler code.

* Use `const` pointers more in C runtime.

* Only use spaces for indentation in javascript code.  Change proposed by Emily
  Marigold Klassen in #123, and seems to be the modern Javascript norm.

New Snowball Language Features
------------------------------

* `lenof` and `sizeof` can now be applied to a literal string, which can be
  useful if you want to do calculations on cursor values.

  This change actually simplifies the language a little, since you can now use
  a literal string in any read-only context which accepts a string variable.

Code generation improvements
----------------------------

* General:

  + Fix bugs in the code generated to handle failure of `goto`, `gopast` or
    `try` inside `setlimit` or string-`$`.  This affected all languages (though
    the issue with `try` wasn't present for C).  These bugs don't affect any of
    the stemming algorithms we currently ship.  Reported by Stefan Petkovic on
    snowball-discuss.

  + Change `hop` with a negative argument to work as documented.  The manual
    says a negative argument to hop will raise signal f, but the implementation
    for all languages was actually to move the cursor in the opposite direction
    to `hop` with a positive argument.  The implemented behaviour is
    problematic as it allows invalidating implicitly saved cursor values by
    modifying the string outside the current region, so we've decided it's best
    to fix the implementation to match the documentation.

    The only Snowball code we're aware of which relies on this was the original
    version of the new Yiddish stemming algorithm, which has been updated not
    to rely on this.

    The compiler now issues a warning for `hop` with a constant negative
    argument (internally now converted to `false`), and for `hop` with a
    constant zero argument (internally now converted to `true`).
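
    The documented behaviour can be modelled roughly as follows for the
    forward direction.  This is a simplified sketch rather than any
    generator's runtime code: it counts positions instead of handling UTF-8,
    and uses `None` to stand for signal `f`.  The function name and parameters
    are hypothetical:

```python
from typing import Optional


def hop(cursor: int, limit: int, n: int) -> Optional[int]:
    """Return the new cursor position, or None to represent signal f."""
    if n < 0:
        return None  # a negative argument now signals f
    if cursor + n > limit:
        return None  # not enough room before the limit
    return cursor + n


print(hop(0, 5, 3))   # moves the cursor forward
print(hop(3, 5, -1))  # signals f rather than moving backwards
```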

  + Canonicalise `among` actions equivalent to `()` such as `(true)` which
    previously resulted in an extra case in the among, and for Python
    we'd generate invalid Python code (`if` or `elif` with an empty body).
    Bug revealed by Assaf Urieli's Yiddish stemmer in #137.

  + Eliminate variables whose values are never used - they no longer have
    corresponding member variables, etc, and no code is generated for any
    assignments to them.

  + Don't generate anything for an unused `grouping`.

  + Stop warning "grouping X defined but not used" for a `grouping` which is
    only used to define another `grouping`.

* C/C++:

  + Store booleans in same array as integers.  This means each boolean is
    stored as an int instead of an unsigned char which means 4 bytes instead of
    1, but we save a pointer (4 or 8 bytes) in struct SN_env which is a win for
    all the current stemmers.  For an algorithm which uses both integers and
    booleans, we also save the overhead of allocating a block on the heap, and
    potentially improve data locality.

  + Eliminate duplicate generated C comment for sliceto.

* Pascal:

  + Avoid generating unused variables.  The Pascal code generated for the
    stemmers we ship is now warning free (tested with fpc 3.2.0).

  + Don't emit empty `private` sections.  Cosmetic, but makes the generated
    code a bit easier to follow.

* Python:

  + End `if`-chain with `else` where possible, avoiding a redundant test
    of the variable being switched on.  This optimisation kicks in for an
    `among` where all cases have commands.  This change seems to speed up `make
    check_python_arabic` by a few percent.

New stemming algorithms
-----------------------

* Add Serbian stemmer from stef4np (#113).

* Add Yiddish stemmer from Assaf Urieli (#137).

* Add Armenian stemmer from Astghik Mkrtchyan.  It's been on the website for
  over a decade, and included in Xapian for over 9 years without any negative
  feedback.

Optimisations to existing algorithms
------------------------------------

* kraaij_pohlmann: Use `$v = limit` instead of `do (tolimit setmark v)` since
  this generates simpler code, and also matches the code other algorithm
  implementations use.

  Probably for languages like C with optimising compilers the compiler
  will generate equivalent code anyway, but e.g. for Python this should be
  an improvement.

Code clarity improvements to existing algorithms
------------------------------------------------

* hindi.sbl: Fix comment typo.

Compiler
--------

* Don't count `$x = x + 1` as initialising or using `x`, so it's now handled
  like `$x += 1` already is.

* Comments are now only included in the generated code if command line option
  -comments is specified.

  The comments in the generated code are useful if you're trying to debug the
  compiler, and perhaps also if you are trying to debug your Snowball code, but
  for everyone else they just bloat the code which as the number of languages
  we support grows becomes more of an issue.

* `-parentclassname` is not only for java and csharp so don't disable it if
  those backends are disabled.

* `-syntax` now reports the value for each numeric literal.

* Report location for excessive get nesting error.

* Internally the compiler now represents negated literal numbers as a simple
  `c_number` rather than `c_neg` applied to a `c_number` with a positive value.
  This simplifies optimisations that want to check for a constant numeric
  expression.

Build system
------------

* Link binaries with LDFLAGS if it's set, which is needed for some platforms
  (e.g. OpenEmbedded).  Patch from Andreas Müller (#120).

* Add missing dependencies of algorithms.go rule.

Testsuite
---------

* C: Add stemtest for low-level regression tests.

Documentation
-------------

* Document a C99 compiler as a requirement for building the snowball compiler
  (but the C code it generates should still work with any ISO C compiler).

  A few declarations mixed with code crept in some time ago (which nobody's
  complained about), so this is really just formally documenting a requirement
  which already existed.

* README: Explain what Snowball is and what Stemming is (#131, reported by Sean
  Kelly).

* CONTRIBUTING.rst: Expand section on adding a new generator.

* For Python snowballstemmer module include global NEWS instead of
  Python-specific CHANGES.rst and use README.rst as the long description.
  Patch from Dmitry Shachnev (#119).

* COPYING: Update and incorporate Python backend licensing information which
  was previously in a separate file.

Snowball 2.0.0 (2019-10-02)
===========================

C/C++
-----

* Fully handle 4-byte UTF-8 sequences.  Previously `hop` and `next` handled
  sequences of any length, but commands which look at the character value only
  handled sequences up to length 3.  Fixes #89.

* Fix handling of a 3-byte UTF-8 sequence in a grouping in `backwardmode`.

Java
----

* TestApp.java:

  - Always use UTF-8 for I/O.  Patch from David Corbett (#80).

  - Allow reading input from stdin.

  - Remove rather pointless "stem n times" feature.

  - Only lower case ASCII to match stemwords.c.

  - Stem empty lines too to match stemwords.c.

Code Quality Improvements
-------------------------

* Fix various warnings from newer compilers.

* Improve use of `const`.

* Share common functions between compiler backends rather than having multiple
  copies of the same code.

* Assorted code clean-up.

* Initialise line_labelled member of struct generator to 0.  Previously we were
  invoking undefined behaviour, though in practice it'll be zero initialised on
  most platforms.

New Code Generators
-------------------

* Add Python generator (#24).  Originally written by Yoshiki Shibukawa, with
  additional updates by Dmitry Shachnev.

* Add Javascript generator.  Based on JSX generator (#26) written by Yoshiki
  Shibukawa.

* Add Rust generator from Jakob Demler (#51).

* Add Go generator from Marty Schoch (#57).

* Add C# generator.  Based on patch from Cesar Souza (#16, #17).

* Add Pascal generator.  Based on Delphi backend from stemming.zip file on old
  website (#75).

New Snowball Language Features
------------------------------

* Add `len` and `lenof` to measure Unicode length.  These are similar to `size`
  and `sizeof` (respectively), but `size` and `sizeof` return the length in
  bytes under `-utf8`, whereas these new commands give the same result whether
  using `-utf8`, `-widechars` or neither (but under `-utf8` they are O(n) in
  the length of the string).  For compatibility with existing code which might
  use these as variable or function names, they stop being treated as tokens if
  declared to be a variable or function.

* New `{U+1234}` stringdef notation for Unicode codepoints.

* More versatile integer tests.  Now you can compare any two arithmetic
  expressions with a relational operator in parentheses after the `$`, so for
  example `$(len > 3)` can now be used when previously a temporary variable was
  required: `$tmp = len $tmp > 3`

Code generation improvements
----------------------------

* General:

  + Avoid unnecessary saving and restoring of the cursor for more commands -
    `atlimit`, `do`, `set` and `unset` all leave the cursor alone or always
    restore its value, and for C `booltest` (which other languages already
    handled).

  + Special case handling for `setlimit tomark AE`.  All uses of setlimit in
    the current stemmers we ship follow this pattern, and by special-casing we
    can avoid having to save and restore the cursor (#74).

  + Merge duplicate actions in the same `among`.  This reduces the size of the
    switch/if-chain in the generated code which dispatches the among for many
    of the stemmers.

  + Generate simpler code for `among`.  We always check for a zero return value
    when we call the among, so there's no point also checking for that in the
    switch/if-chain.  We can also avoid the switch/if-chain entirely when
    there's only one possible outcome (besides the zero return).

  + Optimise code generated for `do <function call>`.  This speeds up "make
    check_python" by about 2%, and should speed up other interpreted languages
    too (#110).

  + Generate more and better comments referencing snowball source.

  + Add homepage URL and compiler version as comments in generated files.

* C/C++:

  + Fix `size` and `sizeof` to not report one too high (reported by Assem
    Chelli in #32).

  + If signal `f` from a function call would lead to return from the current
    function then handle both this and bailing out on an error with a
    simple `if (ret <= 0) return ret;`

  + Inline testing for single character literals.

  + Avoid generating `|| 0` in corner case - this can result in a compiler
    warning when building the generated code.

  + Implement `insert_v()` in terms of `insert_s()`.

  + Add conditional `extern "C"` so `runtime/api.h` can be included from C++
    code.  Closes #90, reported by vvarma.

* Java:

  + Fix functions in `among` to work in Java.  We seem to need to make the
    methods called from among `public` instead of `private`, and to call them
    on `this` instead of the `methodObject` (which is cleaner anyway).  No
    revision in version control seems to generate working code for this case,
    but Richard says it definitely used to work - possibly older JVMs failed to
    correctly enforce the access controls when methods were invoked by
    reflection.

  + Code after handling `f` by returning from the current function is
    unreachable too.

  + Previously we incorrectly decided that code after an `or` was
    unreachable in certain cases.  None of the current stemmers in the
    distribution triggered this, but Martin Porter's snowball version
    of the Schinke Latin stemmer does.  Fixes #58, reported by Alexander
    Myltsev.

  + The reachability logic was failing to consider reachability from
    the final command in an `or`.  Fixes #82, reported by David Corbett.

  + Fix `maxint` and `minint`.  Patch from David Corbett in #31.

  + Fix `$` on strings.  The previous generated code was just wrong.  This
    doesn't affect any of the included algorithms, but for example breaks
    Martin Porter's snowball implementation of Schinke's Latin Stemmer.
    Issue noted by Jakob Demler while working on the Rust backend in #51,
    and reported in the Schinke's Latin Stemmer by Alexander Myltsev
    in #58.

  + Make SnowballProgram objects serializable.  Patch from Oleg Smirnov in #43.

  + Eliminate range-check implementation for groupings.  This was removed from
    the C generator 10 years earlier, isn't used for any of the existing
    algorithms, and it doesn't seem likely it would be - the grouping would
    have to consist entirely of a contiguous block of Unicode code-points.

  + Simplify code generated for `repeat` and `atleast`.

  + Eliminate unused return values and variables from runtime functions.

  + Only import the `among` and `SnowballProgram` classes if they're actually
    used.

  + Only generate `copy_from()` method if it's used.

  + Merge runtime functions `eq_s` and `eq_v`.

  + Java arrays know their own length so stop storing it separately.

  + Escape char 127 (DEL) in generated Java code.  It's unlikely that this
    character would actually be used in a real stemmer, so this was more of a
    theoretical bug.

  + Drop unused import of InvocationTargetException from SnowballStemmer.
    Reported by GerritDeMeulder in #72.

  + Fix lint check issues in generated Java code.  The stemmer classes are only
    referenced in the example app via reflection, so add
    @SuppressWarnings("unused") for them.  The stemmer classes override
    equals() and hashCode() methods from the standard java Object class, so
    mark these with @Override.  Both suggested by GerritDeMeulder in #72.

  + Declare Java variables at point of use in generated code.  Putting all
    declarations at the top of the function was adding unnecessary complexity
    to the Java generator code for no benefit.

  + Improve formatting of generated code.

New stemming algorithms
-----------------------

* Add Tamil stemmer from Damodharan Rajalingam (#2, #3).

* Add Arabic stemmer from Assem Chelli (#32, #50).

* Add Irish stemmer from Jim O'Regan (#48).

* Add Nepali stemmer from Arthur Zakirov (#70).

* Add Indonesian stemmer from Olly Betts (#71).

* Add Hindi stemmer from Olly Betts (#73). Thanks to David Corbett for review.

* Add Lithuanian stemmer from Dainius Jocas (#22, #76).

* Add Greek stemmer from Oleg Smirnov (#44).

* Add Catalan and Basque stemmers from Israel Olalla (#104).

Behavioural changes to existing algorithms
------------------------------------------

* Portuguese:

  + Replace incorrect Spanish suffixes by Portuguese suffixes (#1).

* French:

  + The MSDOS CP850 version of the French algorithm was missing changes present
    in the ISO8859-1 and Unicode versions.  There's now a single version of
    each algorithm which was based on the Unicode version.

  + Recognize French suffixes even when they begin with diaereses.  Patch from
    David Corbett in #78.

* Russian:

  + We now normalise 'ё' to 'е' before stemming.  The documentation has long
    said "we assume ['ё'] is mapped into ['е']" but it's more convenient for
    the stemmer to actually perform this normalisation.  This change has no
    effect if the caller is already normalising as we recommend.  It's a change
    in behaviour if they aren't, but 'ё' occurs rarely (there are currently no
    instances in our test vocabulary) and this improves behaviour when it does
    occur.  Patch from Eugene Mirotin (#65, #68).

* Finnish:

  + Adjust the Finnish algorithm not to mangle numbers.  This change also
    means it tends to leave foreign words alone.  Fixes #66.

* Danish:

  + Adjust Danish algorithm not to mangle alphanumeric codes. In particular
    alphanumeric codes ending in a double digit (e.g. 0x0e00, hal9000,
    space1999) are no longer mangled.  See #81.

Optimisations to existing algorithms
------------------------------------

* Turkish:

  + Simplify uses of `test` in stemmer code.

  + Check for 'ad' or 'soyad' more efficiently, and without needing the
    strlen variable.  This speeds up "make check_utf8_turkish" by 11%
    on x86 Linux.

* Kraaij-Pohlmann:

  + Eliminate variable x - `$p1 <= cursor` is simpler and a little more
    efficient than `setmark x $x >= p1`.

Code clarity improvements to existing algorithms
------------------------------------------------

* Turkish:

  + Use , for cedilla to match the conventions used in other stemmers.

* Kraaij-Pohlmann:

  + Avoid cryptic `[among ( (])` ... `)` construct - instead use the same
    `[substring] among (` ... `)` construct we do in other stemmers.

Compiler
--------

* Support conventional --help and --version options.

* Warn if -r or -ep used with backend other than C/C++.

* Warn if encoding command line options are specified when generating code in a
  language with a fixed encoding.

* The default classname is now set based on the output filename, so `-n` is now
  often no longer needed.  Fixes #64.

* Avoid potential one byte buffer over-read when parsing snowball code.

* Avoid comparing with uninitialised array element during compilation.

* Improve `-syntax` output for `setlimit L for C`.

* Optimise away double negation so generators don't have to worry about
  generating `--` (decrement operator in many languages).  Fixes #52, reported
  by David Corbett.

* Improved compiler error and warning messages:

  - We now report FILE:LINE: before each diagnostic message.

  - Improve warnings for unused declarations/definitions.

  - Warn for variables which are used, but either never initialised
    or never read.

  - Flag non-ASCII literal strings.  This is an error for wide Unicode, but
    only a warning for single-byte and UTF-8 which work so long as the source
    encoding matches the encoding used in the generated stemmer code.

  - Improve error recovery after an undeclared `define`.  We now sniff the
    token after the identifier and if it is `as` we parse as a routine,
    otherwise we parse as a grouping.  Previously we always just assumed it was
    a routine, which gave a confusing second error if it was a grouping.

  - Improve error recovery after an unexpected token in `among`.  Previously
    we acted as if the unexpected token closed the `among` (this probably
    wasn't intended but just a missing `break;` in a switch statement).  Now we
    issue an error and try the next token.
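
The recovery rule for an undeclared `define` described above can be sketched
roughly as follows.  This is only an illustration in Python over a
pre-split token list; `classify_undeclared_define` is a hypothetical helper,
not a function in the compiler.

```python
def classify_undeclared_define(tokens, pos):
    """Guess what an undeclared `define NAME ...` was meant to be.

    `tokens` is a list of token strings and `pos` is the index of the
    `define` keyword.  Hypothetical helper, for illustration only.
    """
    # tokens[pos] is `define` and tokens[pos + 1] is the identifier, so the
    # token we sniff is at pos + 2 (if there is one).
    following = tokens[pos + 2] if pos + 2 < len(tokens) else None
    return "routine" if following == "as" else "grouping"


routine_tokens = ["define", "R1", "as", "$p1", "<=", "cursor"]
grouping_tokens = ["define", "v", "'aeiou'"]

print(classify_undeclared_define(routine_tokens, 0))   # routine
print(classify_undeclared_define(grouping_tokens, 0))  # grouping
```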

* Report error instead of silently truncating character values (e.g. `hex 123`
  previously silently became byte 0x23 which is `#` rather than a
  g-with-cedilla).
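
The arithmetic behind the `hex 123` example can be checked with a few lines of
Python.  The `check_char_value` function is a hypothetical stand-in for the
kind of range check now performed, not the compiler's own code.

```python
import unicodedata

value = 0x123  # what `hex 123` asks for

truncated = value & 0xFF  # keep only the low byte, as silent truncation would
print(hex(truncated), chr(truncated))   # 0x23 #
print(unicodedata.name(chr(value)))     # LATIN SMALL LETTER G WITH CEDILLA


def check_char_value(value, max_value):
    """Hypothetical range check: raise instead of silently truncating."""
    if value > max_value:
        raise ValueError(f"character value {value:#x} exceeds {max_value:#x}")
    return value
```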

* Enlarge the initial input buffer size to 8192 bytes and double it each time we
  hit the end.  Snowball programs are typically a few KB in size (with the
  current largest we ship being the Greek stemmer at 27KB) so the previous
  approach of starting with a 10 byte input buffer and increasing its size by
  50% plus 40 bytes each time it filled was inefficient, needing up to 15
  reallocations to load greek.sbl.
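
The reallocation counts can be reproduced with a short simulation.  This
Python sketch assumes integer rounding for the "50% plus 40 bytes" step and
takes 27KB as 27 * 1024 bytes; both are assumptions made for the estimate, and
the function names are hypothetical.

```python
def count_reallocations(target, initial, grow):
    """Count how many times a buffer must grow to hold `target` bytes."""
    size, count = initial, 0
    while size < target:
        size = grow(size)
        count += 1
    return count


def old_growth(size):
    # 50% larger plus 40 bytes; the integer rounding here is an assumption.
    return size + size // 2 + 40


def new_growth(size):
    # Double the buffer each time it fills.
    return size * 2


greek_size = 27 * 1024  # "27KB", taken as 27 * 1024 bytes for this estimate

print(count_reallocations(greek_size, 10, old_growth))    # 15
print(count_reallocations(greek_size, 8192, new_growth))  # 2
```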

* Identify variables only used by one `routine`/`external`.  This information
  isn't yet used, but such variables which are also always written to before
  being read can be emitted as local variables in most target languages.

* We now allow multiple source files on the command line, and allow them to be
  after (or even interspersed with) options to better match modern Unix
  conventions.  Support for multiple source files allows specifying a single
  byte character set mapping via a source file of `stringdef`.

* Avoid infinite recursion in compiler when optimising a recursive snowball
  function.  Recursive functions aren't typical in snowball programs, but
  the compiler shouldn't crash for any input, especially not a valid one.
  We now simply limit how deep the compiler will recurse and make the
  pessimistic assumption in the unlikely event we hit this limit.
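
A depth-limited analysis of this general shape might look like the Python
sketch below.  The property being analysed ("can this routine signal
failure?"), the `MAX_DEPTH` value and the `routines` table layout are all
assumptions made for illustration; they are not taken from the compiler.

```python
MAX_DEPTH = 100  # hypothetical limit; the real compiler's value is not stated


def can_fail(routines, name, depth=0):
    """Return True if calling routine `name` might signal failure.

    `routines` maps a routine name to (fails_directly, [called names]).
    """
    if depth > MAX_DEPTH:
        return True  # pessimistic assumption: give up and assume it can fail
    fails_directly, calls = routines[name]
    if fails_directly:
        return True
    return any(can_fail(routines, callee, depth + 1) for callee in calls)


routines = {
    "leaf": (False, []),
    "caller": (False, ["leaf"]),
    "loop": (False, ["loop"]),  # calls itself
}

print(can_fail(routines, "caller"))  # False
print(can_fail(routines, "loop"))    # True, via the depth limit
```

Without the depth check, analysing `loop` would recurse until the stack ran
out; with it, the analysis terminates and simply forgoes the optimisation.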

Build system
------------

* `make clean` in C libstemmer_c distribution now removes `examples/*.o`.
  (#59)

* Change all the places which previously needed their own list of stemmers to
  either work dynamically or be generated, so now only modules.txt needs
  updating to add a new stemmer.

* Add check_java make target which runs tests for java.

* Support gzipped test data (the uncompressed arabic test data is too big for
  github).

* GNUmakefile: Drop useless `-eprefix` and `-r` options from snowball
  invocations for Java - these are only meaningful when generating C code.

* Pass CFLAGS when linking which matches convention (e.g. automake does it) and
  facilitates use of tools such as ASan.  Fixes #84, reported by Thomas
  Pointhuber.

* Add CI builds with -std=c90 to check that the compiler and generated code are
  valid C90 (#54).

libstemmer
----------

* Split out CPPFLAGS from CFLAGS and use CFLAGS when linking stemwords.

* Add -O2 to CFLAGS.

* Make generated tables of encodings and modules const.

* Fix clang static analyzer memory leak warning (in practice this code path
  can never actually be taken).  Patch from Patrick O. Perry (#56)

Documentation
-------------

* Added copyright and licensing details (#10).

* Document that libstemmer supports ISO_8859_2 encoding.  Currently hungarian
  and romanian are available in ISO_8859_2.

* Remove documentation falsely claiming that libstemmer supports CP850
  encoding.

* CONTRIBUTING.rst: Add guidance for contributing new stemming algorithms and
  new language backends.

* Overhaul libstemmer_python_README.  Most notably, replace the benchmark data
  which was very out of date.