Version 3.9 2024-08-18
* Avoid need for pickled models, resolves security vulnerability CVE-2024-39705
* No longer sort WordNet synsets and relations (sort in calling function when required)
* Add Python 3.12 support
* Many other minor fixes
Thanks to the following contributors to 3.8.2:
Tom Aarsen, Cat Lee Ball, Veralara Bernhard, Carlos Brandt, Konstantin Chernyshev, Michael Higgins,
Eric Kafe, Vivek Kalyan, David Lukes, Rob Malouf, purificant, Alex Rudnick, Liling Tan, Akihiro Yamazaki.
Version 3.8.1 2023-01-02
* Resolve RCE vulnerability in localhost WordNet Browser (#3100)
* Remove unused tool scripts (#3099)
* Resolve XSS vulnerability in localhost WordNet Browser (#3096)
* Add Python 3.11 support (#3090)
Thanks to the following contributors to 3.8.1:
Francis Bond, John Vandenberg, Tom Aarsen
Version 3.8 2022-12-12
* Refactor dispersion plot (#3082)
* Provide type hints for LazyCorpusLoader variables (#3081)
* Throw warning when LanguageModel is initialized with incorrect vocabulary (#3080)
* Fix WordNet's all_synsets() function (#3078)
* Resolve TreebankWordDetokenizer inconsistency with end-of-string contractions (#3070)
* Support both iso639-3 codes and BCP-47 language tags (#3060)
* Avoid DeprecationWarning in Regexp tokenizer (#3055)
* Fix many doctests, add doctests to CI (#3054, #3050, #3048)
* Fix bool field not being read in VerbNet (#3044)
* Greatly improve time efficiency of SyllableTokenizer when tokenizing numbers (#3042)
* Fix encodings of Polish udhr corpus reader (#3038)
* Allow TweetTokenizer to tokenize emoji flag sequences (#3034)
* Prevent LazyModule from increasing the size of nltk.__dict__ (#3033)
* Fix CoreNLPServer non-default port issue (#3031)
* Add "acion" suffix to the Spanish SnowballStemmer (#3030)
* Allow loading WordNet without OMW (#3026)
* Use input() in nltk.chat.chatbot() for Jupyter support (#3022)
* Fix edit_distance_align() in distance.py (#3017)
* Tackle performance and accuracy regression of sentence tokenizer since NLTK 3.6.6 (#3014)
* Add the Iota operator to semantic logic (#3010)
* Resolve critical errors in WordNet app (#3008)
* Resolve critical error in CHILDES Corpus (#2998)
* Make WordNet information_content() accept adjective satellites (#2995)
* Add "strict=True" parameter to CoreNLP (#2993, #3043)
* Resolve issue with WordNet's synset_from_sense_key (#2988)
* Handle WordNet synsets that were lost in mapping (#2985)
* Resolve TypeError in Boxer (#2979)
* Add function to retrieve WordNet synonyms (#2978)
* Warn about nonexistent OMW offsets instead of raising an error (#2974)
* Fix missing ic argument in res, jcn and lin similarity functions of WordNet (#2970)
* Add support for the extended OMW (#2946)
* Fix LC cutoff policy of text tiling (#2936)
* Optimize ConditionalFreqDist.__add__ performance (#2939)
* Add Markdown corpus reader (#2902)
Thanks to the following contributors to 3.8:
Alexandre Perez-Lebel, David Lukes, Eric Kafe, Fernando Carranza, Heungson Lee,
Hoyeol Kim, James Huang, Jelle Zijlstra, Louis-Justin Tallot, M.K. Pawelkiewicz,
Jan Lennartz, Malinda Dilhara, Martin Kondratzky, Rob Malouf, Saud Kadiri,
Siddhesh Mhadnak, Stephan Hasler, Steve Smith, Tom Aarsen, Tyler Sheaffer,
Yue Zhao, cestwc, elespike, purificant, richardyy1188
Version 3.7 2022-02-09
* Improve and update the NLTK team page on nltk.org (#2855, #2941)
* Drop support for Python 3.6, support Python 3.10 (#2920)
Thanks to the following contributors to 3.7:
Tom Aarsen
Version 3.6.7 2021-12-28
* Resolve IndexError in `sent_tokenize` and `word_tokenize` (#2922)
Thanks to the following contributors to 3.6.7:
Tom Aarsen
Version 3.6.6 2021-12-21
* Refactor `gensim.doctest` to work for gensim 4.0.0 and up (#2914)
* Add Precision, Recall, F-measure, Confusion Matrix to Taggers (#2862)
* Added warnings if .zip files exist without any corresponding .csv files. (#2908)
* Fix `FileNotFoundError` when the `download_dir` is a non-existing nested folder (#2910)
* Rename omw to omw-1.4 (#2907)
* Resolve ReDoS opportunity by fixing incorrectly specified regex (#2906)
* Support OMW 1.4 (#2899)
* Deprecate Tree get and set node methods (#2900)
* Fix broken inaugural test case (#2903)
* Use Multilingual Wordnet Data from OMW with newer Wordnet versions (#2889)
* Keep NLTK's "tokenize" module working with pathlib (#2896)
* Make prettyprinter more readable (#2893)
* Update links to the nltk book (#2895)
* Add `CITATION.cff` to nltk (#2880)
* Resolve serious ReDoS in PunktSentenceTokenizer (#2869)
* Delete old CI config files (#2881)
* Improve Tokenize documentation + add TokenizerI as superclass for TweetTokenizer (#2878)
* Fix expected value for BLEU score doctest after changes from #2572
* Add multi Bleu functionality and tests (#2793)
* Deprecate 'return_str' parameter in NLTKWordTokenizer and TreebankWordTokenizer (#2883)
* Allow empty string in CFG's + more (#2888)
* Partition `tree.py` module into `tree` package + pickle fix (#2863)
* Fix several TreebankWordTokenizer and NLTKWordTokenizer bugs (#2877)
* Rewind Wordnet data file after each lookup (#2868)
* Correct __init__ call for SyntaxCorpusReader subclasses (#2872)
* Documentation fixes (#2873)
* Fix Levenshtein distance for duplicated letters (#2849)
* Support alternative Wordnet versions (#2860)
* Remove hundreds of formatting warnings for nltk.org (#2859)
* Modernize `nltk.org/howto` pages (#2856)
* Fix Bleu Score smoothing function from taking log(0) (#2839)
* Update third-party tools to newer versions and remove MaltParser fixed version (#2832)
* Fix TypeError: _pretty() takes 1 positional argument but 2 were given in sem/drt.py (#2854)
* Replace `http` with `https` in most URLs (#2852)
Thanks to the following contributors to 3.6.6:
Adam Hawley, BatMrE, Danny Sepler, Eric Kafe, Gavish Poddar, Panagiotis Simakis,
RnDevelover, Robby Horvath, Tom Aarsen, Yuta Nakamura, Mohaned Mashaly
Version 3.6.5 2021-10-11
* modernised nltk.org website
* addressed LGTM.com issues
* support ZWJ sequences emoji and skin tone modifier emoji in TweetTokenizer
* METEOR evaluation now requires pre-tokenized input
* Code linting and type hinting
* implement get_refs function for DrtLambdaExpression
* Enable automated CoreNLP, Senna, Prover9/Mace4, Megam, MaltParser CI tests
* specify minimum regex version that supports regex.Pattern
* avoid re.Pattern and regex.Pattern which fail for Python 3.6, 3.7
Thanks to the following contributors to 3.6.5:
Tom Aarsen, Saibo Geng, Mohaned Mashaly, Dimitri Papadopoulos, Danny Sepler,
Ahmet Yildirim, RnDevelover, yutanakamura
Version 3.6.4 2021-10-01
* deprecate `nltk.usage(obj)` in favor of `help(obj)`
* resolve ReDoS vulnerability in Corpus Reader
* solidify performance tests
* improve phone number recognition in tweet tokenizer
* refactored CISTEM stemmer for German
* identify NLTK Team as the author
* replace travis badge with github actions badge
* add SECURITY.md
Thanks to the following contributors to 3.6.4:
Tom Aarsen, Mohaned Mashaly, Dimitri Papadopoulos Orfanos, purificant, Danny Sepler
Version 3.6.3 2021-09-19
* Dropped support for Python 3.5
* Run CI tests on Windows, too
* Moved from Travis CI to GitHub Actions
* Code and comment cleanups
* Visualize WordNet relation graphs using Graphviz
* Fixed large error in METEOR score
* Apply isort, pyupgrade, black, added as pre-commit hooks
* Prevent debug_decisions in Punkt from throwing IndexError
* Resolved ZeroDivisionError in RIBES with dissimilar sentences
* Initialize WordNet IC total counts with smoothing value
* Fixed AttributeError for Arabic ARLSTem2 stemmer
* Many fixes and improvements to lm language model package
* Fix bug in nltk.metrics.aline, C_skip = -10
* Improvements to TweetTokenizer
* Optional show arg for FreqDist.plot, ConditionalFreqDist.plot
* edit_distance now computes Damerau-Levenshtein edit-distance
Thanks to the following contributors to 3.6.3:
Tom Aarsen, Abhijnan Bajpai, Michael Wayne Goodman, Michał Górny, Maarten ter Huurne,
Manu Joseph, Eric Kafe, Ilia Kurenkov, Daniel Loney, Rob Malouf, Mohaned Mashaly,
purificant, Danny Sepler, Anthony Sottile
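
The Damerau-Levenshtein entry above means that swapping two adjacent characters counts as a single edit instead of two. The sketch below illustrates the idea in plain Python using the restricted form (optimal string alignment); `osa_distance` is a hypothetical stand-alone helper written for this note, not NLTK's implementation, and the unrestricted distance can be smaller for some inputs.

```python
def osa_distance(s1: str, s2: str) -> int:
    """Edit distance where insert, delete, substitute and adjacent swap each cost 1.

    Hypothetical illustration (optimal string alignment variant), not NLTK code.
    """
    rows, cols = len(s1) + 1, len(s2) + 1
    # dist[i][j] holds the distance between s1[:i] and s2[:j].
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete every character of s1[:i]
    for j in range(cols):
        dist[0][j] = j  # insert every character of s2[:j]

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition: the last two characters are swapped.
            if (
                i > 1
                and j > 1
                and s1[i - 1] == s2[j - 2]
                and s1[i - 2] == s2[j - 1]
            ):
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]
```

With this helper, `osa_distance("ab", "ba")` is 1, whereas a plain Levenshtein distance would report 2 for the same pair.
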
Version 3.6.2 2021-04-20
* move test code to nltk/test
* clean up some doctests
* fix bug in NgramAssocMeasures (order preserving fix)
* fixes for compatibility with Pypy 7.3.4
Thanks to the following contributors to 3.6.2:
Ruben Cartuyvels, Rob Malouf, Dalton Pearson, Danny Sepler
Version 3.6 2021-04-07
* add support for Python 3.9
* add Tree.fromlist
* compute Minimum Spanning Tree of unweighted graph using BFS
* fix bug with infinite loop in Wordnet closure and tree
* fix bug in calculating BLEU using smoothing method 4
* Wordnet synset similarities work for all pos
* new Arabic light stemmer (ARLSTem2)
* new syllable tokenizer (LegalitySyllableTokenizer)
* remove nose in favor of pytest
* misc bug fixes, code cleanups, test cleanups, efficiency improvements
Thanks to the following contributors to 3.6:
Tom Aarsen, K Abainia, Akshita Bhagia, Andrew Bird, Thomas Bird,
Tom Conroy, Christopher Hench, Andrew Jorgensen, Eric Kafe,
Ilia Kurenkov, Yeting Li, Joseph Manu, Marius Mather, Denali Molitor,
Jacob Moorman, Philippe Ombredanne, Vassilis Palassopoulos, Ram Rachum,
Danny Sepler, Or Sharir, Brad Solomon, Hiroki Teranishi, Constantin Weisser,
Pratap Yadav, Louis Yang
Version 3.5 2020-04-13
* add support for Python 3.8
* drop support for Python 2
* create NLTK's own Tokenizer class distinct from the Treebank reference tokeniser
* update Vader sentiment analyser
* fix JSON serialization of some PoS taggers
* minor improvements in grammar.CFG, Vader, pl196x corpus reader, StringTokenizer
* change implementation of <= and >= for FreqDist so they are partial orders
* make FreqDist iterable
* correctly handle Penn Treebank trees with an unlabeled branching top node.
Thanks to the following contributors to 3.5:
Nicolas Darr, Gerhard Kremer, Liling Tan, Christopher Hench, Alexandre Dias, Hervé Nicol,
Pierpaolo Pantone, Bonifacio de Oliveira, Maciej Gawinecki, BLKSerene, hoefling, alvations,
pyfisch, srhrshr
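
The partial-order behaviour of <= and >= noted for 3.5 can be illustrated without NLTK. FreqDist builds on `collections.Counter`, so this sketch uses plain `Counter` objects; `is_sub_distribution` is a hypothetical helper written for this note and only approximates the comparison for positive counts.

```python
from collections import Counter


def is_sub_distribution(a: Counter, b: Counter) -> bool:
    """True if every sample in `a` occurs at most as often in `b`.

    Hypothetical helper for illustration; not NLTK's implementation.
    """
    return all(count <= b[sample] for sample, count in a.items())


x = Counter("aab")    # a: 2, b: 1
y = Counter("aabbc")  # a: 2, b: 2, c: 1
z = Counter("ccc")    # c: 3

x_below_y = is_sub_distribution(x, y)  # every count in x fits inside y
y_below_x = is_sub_distribution(y, x)  # y has samples and counts x lacks
# x and z share no samples, so neither is "less than or equal to" the other.
x_z_comparable = is_sub_distribution(x, z) or is_sub_distribution(z, x)
```

Under a partial order some pairs are simply incomparable, which is why `x_z_comparable` is false here even though `x` and `z` are different distributions.
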
Version 3.4.5 2019-08-20
* Fixed security bug in downloader: Zip slip vulnerability - for the unlikely
situation where a user configures their downloader to use a compromised server
(https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-14751)
Thanks to the following contributors to 3.4.5:
Mike Salvatore
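
For readers unfamiliar with the Zip slip issue fixed in 3.4.5: an archive member whose name contains `..` can be written outside the intended directory when extracted naively. The sketch below shows one common way to detect such members using only the standard library. It is an assumption-laden illustration, not the patch applied to NLTK's downloader; `unsafe_members` and the `nltk_data` target directory are hypothetical names.

```python
import io
import os
import zipfile


def unsafe_members(archive: zipfile.ZipFile, target_dir: str) -> list[str]:
    """Return member names that would resolve to a path outside target_dir.

    Hypothetical helper for illustration; not NLTK's downloader code.
    """
    root = os.path.realpath(target_dir)
    flagged = []
    for name in archive.namelist():
        dest = os.path.realpath(os.path.join(root, name))
        # A safe destination is the root itself or lies underneath it.
        if os.path.commonpath([root, dest]) != root:
            flagged.append(name)
    return flagged


# Build an in-memory archive with one benign and one malicious member.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("corpora/readme.txt", "ok")
    zf.writestr("../escape.txt", "bad")

with zipfile.ZipFile(buffer) as zf:
    flagged = unsafe_members(zf, "nltk_data")
```

Here only `../escape.txt` is flagged; a caller would refuse to extract the archive (or skip the flagged members) before writing anything to disk.
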
Version 3.4.4 2019-07-04
* fix bug in plot function (probability.py)
* add improved PanLex Swadesh corpus reader
Thanks to the following contributors to 3.4.4:
Devashish Lal, Liling Tan
Version 3.4.3 2019-06-07
* add Text.generate()
* add QuadgramAssocMeasures
* add SSP to tokenizers
* return confidence of best tag from AveragedPerceptron
* make plot methods return Axes objects
* don't require list arguments to PositiveNaiveBayesClassifier.train
* fix Tree classes to work with native Python copy library
* fix inconsistency for NomBank
* fix random seeding in LanguageModel.generate
* fix ConditionalFreqDist mutation on tabulate/plot call
* fix broken links in documentation
* fix misc Wordnet issues
* update installation instructions
Thanks to the following contributors to 3.4.3:
alvations, Bharat123rox, cifkao, drewmiller, free-variation, henchc,
irisxzhou, nick-ulle, ppartarr, simonepri, yigitsever, zhaoyanpeng
Version 3.4.1 2019-04-17
* add chomsky_normal_form for CFGs
* add meteor score
* add minimum edit/Levenshtein distance based alignment function
* allow access to collocation list via text.collocation_list()
* support corenlp server options
* drop support for Python 3.4
* other minor fixes
Thanks to the following contributors to 3.4.1:
Adrian Ellis, Andrew Martin, Ayush Kaushal, BLKSerene, Bharat
Raghunathan, Franklin Chen, KMiNT21 Kevin Brown, Liling Tan,
Matan Rak, Nat Quayle Nelson, Osman Zubair, Purificant,
Uday Krishna, Viresh Gupta
Version 3.4 2018-11-17
* Support Python 3.7
* Language Modeling incl Kneser-Ney, Witten-Bell, Good-Turing
* Cistem Stemmer for German
* Support Russian National Corpus incl POS tag model
* Decouple sentiment and twitter packages
* Minor extensions for WordNet
* K-alpha
* Fix warning messages for corenlp
* Comprehensive code cleanups
* Many other minor fixes
* Switch continuous integration from Jenkins to Travis
Special thanks to Ilia Kurenkov (Language Model package), Liling Tan (Python 3.7, Travis-CI),
and purificant (code cleanups). Thanks also to: Afshin Sadeghi, Ales Tamchyna, Alok Debnath,
aquatiko, Coykto, Denis Kataev, dnc1994, Fabian Howard, Frankie Robertson, Iaroslav Tymchenko,
Jayakrishna Sahit, LBenzahia, Leonie Weißweiler, Linghao Zhang, Rohit Kumar, sahitpj,
Tim Gianitsos, vagrant, 53X
Version 3.3 2018-05-06
* Support Python 3.6
* New interface to CoreNLP
* Support synset retrieval by sense key
* Minor fixes to CoNLL Corpus Reader, AlignedSent
* Fixed minor inconsistencies in APIs and API documentation
* Better conformance to PEP8
* Drop moses.py (incompatible license)
Special thanks to Liling Tan for leading our transition to Python 3.6.
Thanks to other contributors listed here: https://github.com/nltk/nltk/blob/develop/AUTHORS.md
Version 3.2.5 2017-09-24
* Arabic stemmers (ARLSTem, Snowball)
* NIST MT evaluation metric and added NIST international_tokenize
* Moses tokenizer
* Document Russian tagger
* Fix to Stanford segmenter
* Improve treebank detokenizer, VerbNet, Vader
* Misc code and documentation cleanups
* Implement fixes suggested by LGTM
Thanks to the following contributors to 3.2.5:
Ali Abdullah, Lakhdar Benzahia, Henry Elder, Campion Fellin,
Tsolak Ghukasyan, Thanh Ha, Jean Helie, Nelson Liu,
Nathan Schneider, Chintan Shah, Fábio Silva, Liling Tan,
Ziyao Wei, Zicheng Xu, Albert Au Yeung, AbdealiJK,
porqupine, sbagan, xprogramer
Version 3.2.4 2017-05-21
* remove load-time dependency on Python requests library
* add support for Arabic in StanfordSegmenter
* fix MosesDetokenizer on irregular quote tokens
Thanks to the following contributors to 3.2.4:
Alex Constantin, Hatem Nassrat, Liling Tan
Version 3.2.3 2017-05-16
* new interface to Stanford CoreNLP Web API
* improved Lancaster stemmer with customizable rules from Whoosh
* improved Treebank tokenizer
* improved support for GLEU score
* adopt new Abstract base class style
* support custom tab files for extending WordNet
* make synset_from_pos_and_offset a public method
* make non-English WordNet lemma lookups case-insensitive
* speed up TnT tagger
* speed up FreqDist and ConditionalFreqDist
* support additional quotes in TreebankWordTokenizer
* clean up Tk's postscript output
* drop explicit support for corpora not distributed with NLTK to streamline testing
* allow iterator in perceptron tagger training
* allow for curly bracket quantifiers in chunk.regexp.CHUNK_TAG_PATTERN
* new corpus reader for MWA subset of PPDB
* improved testing framework
Thanks to the following contributors to 3.2.3:
Mark Amery, Carl Bolz, Abdelhak Bougouffa, Matt Chaput, Michael Goodman,
Jaehoon Hwang, Naoya Kanai, Jackson Lee, Christian Meyer, Dmitrijs Milajevs,
Adam Nelson, Pierpaolo Pantone, Liling Tan, Vilhjalmur Thorsteinsson,
Arthur Tilley, jmhutch, Yorwba, eromoe and others
Version 3.2.2 2016-12-31
* added Kondrak's Aline algorithm
* added ChrF and GLEU MT evaluation metrics
* added Russian pos tagger model
* added Moses detokenizer
* rewrite Porter Stemmer
* rewrite FrameNet corpus reader
(adds frame parameter to fes(), lus(), exemplars()
see https://www.nltk.org/howto/framenet.html)
* updated FrameNet Corpus to version 1.7
* fixes to stanford_segmenter.py, SentiText, CoNLL Corpus Reader
* fixes to BLEU, naivebayes, Krippendorff's alpha, Punkt
* fixes to tests for TransitionParser, Senna, edit distance
* fixes to Moses Tokenizer and Detokenizer
* improved TweetTokenizer
* strip trailing whitespace when splitting sentences
* handle inverted exclamation mark in ToktokTokenizer
* resolved some issues with Python 3.5 support
* improvements to testing framework
* clean up dependencies
Thanks to the following contributors to 3.2.2:
Prasasto Adi, Mark Amery, Geoff Bacon, George Berry, Colin Carroll, Alexis Dimitriadis,
Nicholas Fabina, German Ferrero, Tsolak Ghukasyan, Hyuckin David Lim, Naoya Kanai,
Greg Kondrak, Igor Korolev, Tim Leslie, Rob Malouf, Heguang Miao, Dmitrijs Milajevs,
Adam Nelson, Dennis O'Brien, Qi Liu, Pierpaolo Pantone, Andy Reagan, Mike Recachinas,
Nathan Schneider, Jānis Šlapiņš, Richard Snape, Liling Tan, Marcus Uneson,
Linghao Zhang, drevicko, SaintNazaire
Version 3.2.1 2016-04-09
* Support for CCG semantics, Stanford segmenter, VADER lexicon
* Fixes to BLEU score calculation, CHILDES corpus reader
* Other miscellaneous fixes
Thanks to the following contributors to 3.2.1:
Andrew Giel, Casper Lehmann-Strøm, David Madl, Tanin Na Nakorn,
Guilherme Nardari, Philippe Ombredanne, Nathan Schneider, Liling Tan,
Josiah Wang, venticello
Version 3.2 2016-03-03
* Fixes for Python 3.5
* Code cleanups now that Python 2.6 is no longer supported
* Improvements to documentation
* Comprehensive use of os.path for platform-specific path handling
* Support for PanLex
* Support for third party download locations for NLTK data
* Fix bugs in IBM method 3 smoothing and BLEU calculation
* Support smoothing for BLEU score and corpus-level BLEU
* Support RIBES score
* Improvements to TweetTokenizer
* Updates for Stanford API
* Add mathematical operators to ConditionalFreqDist
* Fix bug in sentiwordnet for adjectives
* Merged internal implementations of Trie
Thanks to the following contributors to 3.2:
Santiago Castro, Jihun Choi, Graham Christensen, Andrew Drozdov, Long
Duong, Kyriakos Georgiou, Michael Wayne Goodman, Clark Grubb, Tah Wei
Hoon, David Kamholz, Ewan Klein, Reed Loden, Rob Malouf, Philippe
Ombredanne, Josh Owen, Pierpaolo Pantone, Mike Recachinas, Elijah
Rippeth, Thomas Stieglmaier, Liling Tan, Philip Tzou, Pratap Vardhan.
Version 3.1 2015-10-15
* Fixes for Python 3.5 (drop support for capturing groups in regexp tokenizer)
* Drop support for Python 2.6
* Adopt perceptron tagger for new default POS tagger nltk.pos_tag
* Stanford Neural Dependency Parser wrapper
* Sentiment analysis package incl VADER
* Improvements to twitter package
* Multi word expression tokenizer
* Support for everygram and skipgram
* consistent evaluation metric interfaces, putting reference before hypothesis
* new nltk.translate module, incorporating the old align module
* implement stack decoder
* clean up Alignment interface
* CorpusReader method to support access to license and citation
* Multext East Corpus and MTECorpusReader
* include six module to streamline installation on MS Windows
Thanks to the following contributors to 3.1:
Le Tuan Anh, Petra Barancikova, Alexander Böhm, Francis Bond,
Long Duong, Anna Garbar, Matthew Honnibal, Tah Wei Hoon, Ewan Klein,
Rob Malouf, Dmitrijs Milajevs, Will Monroe, Sergio Oller, Pierpaolo
Pantone, Jacob Perkins, Lorenzo Rubio, Thomas Stieglmaier, Liling Tan,
Pratap Vardhan
Version 3.0.5 2015-09-05
* rewritten IBM models, and new IBM Model 4 and 5 implementations
* new Twitter package
* stabilized MaltParser API
* improved regex tagger
* improved documentation on contributing
* minor improvements to documentation and testing
Thanks to the following contributors to 3.0.5:
Álvaro Justen, Dmitrijs Milajevs, Ewan Klein, Heran Lin, Justin Hammar,
Liling Tan, Long Duong, Lorenzo Rubio, Pierpaolo Pantone, Tah Wei Hoon
Version 3.0.4 2015-07-13
* minor bug fixes and enhancements
Thanks to the following contributors to 3.0.4:
Nicola Bova, Santiago Castro, Len Remmerswaal, Keith Suderman, kabayan55,
pln-fing-udelar (NLP Group, Instituto de Computación, Facultad de Ingeniería, Universidad de la República, Uruguay).
Version 3.0.3 2015-06-12
* bug fixes (Stanford NER, Boxer, Snowball, treebank tokenizer,
dependency graph, KneserNey, BLEU)
* code clean-ups
* default POS tagger permits tagset to be specified
* gensim illustration
* tgrep implementation
* added PanLex Swadesh corpora
* visualisation for aligned bitext
* support for Google App Engine
* POSTagger renamed StanfordPOSTagger, NERTagger renamed StanfordNERTagger
Thanks to the following contributors to 3.0.3:
Long Duong, Pedro Fialho, Dan Garrette, Helder, Saimadhav Heblikar,
Chris Inskip, David Kamholz, Dmitrijs Milajevs, Smitha Milli,
Tom Mortimer-Jones, Avital Pekker, Jonathan Pool, Sam Raker,
Will Roberts, Dmitry Sadovnychyi, Nathan Schneider, Anirudh W
Version 3.0.2 2015-03-13
* make pretty-printing method names consistent
* improvements to Portuguese stemmer
* transition-based dependency parsers
* dependency graph visualisation for ipython notebook
* interfaces for Senna, BLLIP, python-crfsuite
* NKJP corpus reader
* code clean ups, minor bug fixes
Thanks to the following contributors to 3.0.2:
Long Duong, Saimadhav Heblikar, Helder, Mikhail Korobov, Denis Krusko,
Alex Louden, Felipe Madrigal, David McClosky, Dmitrijs Milajevs,
Ondrej Platek, Nathan Schneider, Dávid Márk Nemeskey, 0ssifrage, ducki13, kiwipi.
Version 3.0.1 2015-01-12
* fix setup.py for new version of setuptools
Version 3.0.0 2014-09-07
* minor bugfixes
* added phrase extraction code by Liling Tan and Fredrik Hedman
Thanks to the following contributors to 3.0.0:
Mark Amery, Ivan Barria, Ingolf Becker, Francis Bond, Lars
Buitinck, Cristian Capdevila, Arthur Darcet, Michelle Fullwood,
Dan Garrette, Dougal Graham, Lauri
Hallila, Tyler Hartley, Fredrik Hedman, Ofer Helman, Bruce Hill,
Marcus Huderle, Nancy Ide, Nick Johnson, Angelos Katharopoulos,
Ewan Klein, Mikhail Korobov, Chris Liechti, Peter Ljunglof,
Joseph Lynch, Haejoong Lee, Peter Ljunglöf, Dean Malmgren, Rob
Malouf, Thorsten Marek, Dmitrijs Milajevs, Shari A’aidil
Nasruddin, Lance Nathan, Joel Nothman, Alireza Nourian, Alexander
Oleynikov, Ted Pedersen, Jacob Perkins, Will Roberts, Alex
Rudnick, Nathan Schneider, Geraldine Sim Wei Ying, Lynn Soe,
Liling Tan, Louis Tiao, Marcus Uneson, Yu Usami, Steven Xu, Zhe
Wang, Chuck Wooters, lade, isnowfy, onesandzeros, pquentin, wvanlint
Version 3.0b2 2014-08-21
* minor bugfixes and clean-ups
* renamed remaining parse_ methods to read_ or load_, cf issue #656
* added Paice's method of evaluating stemming algorithms
Thanks to the following contributors to 3.0.0b2: Lars Buitinck,
Cristian Capdevila, Lauri Hallila, Ofer Helman, Dmitrijs Milajevs,
lade, Liling Tan, Steven Xu
Version 3.0.0b1 2014-07-11
* Added SentiWordNet corpus and corpus reader
* Fixed support for 10-column dependency file format
* Changed Tree initialization to use fromstring
Thanks to the following contributors to 3.0b1: Mark Amery, Ivan
Barria, Ingolf Becker, Francis Bond, Lars Buitinck, Arthur Darcet,
Michelle Fullwood, Dan Garrette, Dougal Graham, Tyler Hartley,
Ofer Helman, Bruce Hill, Marcus Huderle, Nancy
Ide, Nick Johnson, Angelos Katharopoulos, Ewan Klein, Mikhail Korobov,
Chris Liechti, Peter Ljunglof, Joseph Lynch, Haejoong Lee, Peter
Ljunglöf, Dean Malmgren, Rob Malouf, Thorsten Marek, Dmitrijs
Milajevs, Shari A’aidil Nasruddin, Lance Nathan, Joel Nothman, Alireza
Nourian, Alexander Oleynikov, Ted Pedersen, Jacob Perkins, Will
Roberts, Alex Rudnick, Nathan Schneider, Geraldine Sim Wei Ying, Lynn
Soe, Liling Tan, Louis Tiao, Marcus Uneson, Yu Usami, Steven Xu, Zhe
Wang, Chuck Wooters, isnowfy, onesandzeros, pquentin, wvanlint
Version 3.0a4 2014-05-25
* IBM Models 1-3, BLEU, Gale-Church aligner
* Lesk algorithm for WSD
* Open Multilingual WordNet
* New implementation of Brill Tagger
* Extend BNCCorpusReader to parse the whole BNC
* MASC Tagged Corpus and corpus reader
* Interface to Stanford Parser
* Code speed-ups and clean-ups
* API standardisation, including fromstring method for many objects
* Improved regression testing setup
* Removed PyYAML dependency
Thanks to the following contributors to 3.0a4:
Ivan Barria, Ingolf Becker, Francis Bond, Arthur Darcet, Dan Garrette,
Ofer Helman, Dougal Graham, Nancy Ide, Ewan Klein, Mikhail Korobov,
Chris Liechti, Peter Ljunglof, Joseph Lynch, Rob Malouf, Thorsten Marek,
Dmitrijs Milajevs, Shari A’aidil Nasruddin, Lance Nathan, Joel Nothman,
Jacob Perkins, Lynn Soe, Liling Tan, Louis Tiao, Marcus Uneson, Steven Xu,
Geraldine Sim Wei Ying
Version 3.0a3 2013-11-02
* support for FrameNet contributed by Chuck Wooters
* support for Universal Declaration of Human Rights Corpus (udhr2)
* major API changes:
- Tree.node -> Tree.label() / Tree.set_label()
- Chunk parser: top_node -> root_label; chunk_node -> chunk_label
- WordNet properties are now access methods, e.g. Synset.definition -> Synset.definition()
- relextract: show_raw_rtuple() -> rtuple(), show_clause() -> clause()
* bugfix in texttiling
* replaced simplify_tags with support for universal tagset (simplify_tags=True -> tagset='universal')
* Punkt default behavior changed to realign sentence boundaries after trailing parenthesis and quotes
* deprecated classify.svm (use scikit-learn instead)
* various efficiency improvements
Thanks to the following contributors to 3.0a3:
Lars Buitinck, Marcus Huderle, Nick Johnson, Dougal Graham, Ewan Klein,
Mikhail Korobov, Haejoong Lee, Peter Ljunglöf, Dean Malmgren, Lance Nathan,
Alexander Oleynikov, Nathan Schneider, Chuck Wooters, Yu Usami, Steven Xu,
pquentin, wvanlint
Version 3.0a2 2013-07-12
* speed improvements in word_tokenize, GAAClusterer, TnT tagger, Baum Welch, HMM tagger
* small improvements in collocation finders, probability, modelling, Porter Stemmer
* bugfix in lowest common hypernym calculation (used in path similarity measures)
* code cleanups, docstring cleanups, demo fixes
Thanks to the following contributors to 3.0a2:
Mark Amery, Lars Buitinck, Michelle Fullwood, Dan Garrette, Dougal Graham,
Tyler Hartley, Bruce Hill, Angelos Katharopoulos, Mikhail Korobov,
Rob Malouf, Joel Nothman, Ted Pedersen, Will Roberts, Alex Rudnick,
Steven Xu, isnowfy, onesandzeros
Version 3.0a1 2013-02-14
* reinstated tkinter support (Haejoong Lee)
Version 3.0a0 2013-01-14
* alpha release of first version to support Python 2.6, 2.7, and 3.
Version 2.0.4 2012-11-07
* minor bugfix (removed numpy dependency)
Version 2.0.3 2012-09-24
* fixed corpus/reader/util.py to support Python 2.5
* make MaltParser safe to use in parallel
* fixed bug in inter-annotator agreement
* updates to various doctests (nltk/test)
* minor bugfixes
Thanks to the following contributors to 2.0.3:
Robin Cooper, Pablo Duboue, Christian Federmann, Dan Garrette, Ewan Klein,
Pierre-François Laquerre, Max Leonov, Peter Ljunglöf, Nitin Madnani, Ceri Stagg
Version 2.0.2 2012-07-05
* improvements to PropBank, NomBank, and SemCor corpus readers
* interface to full Penn Treebank Corpus V3 (corpus.ptb)
* made wordnet.lemmas case-insensitive
* more flexible padding in model.ngram
* minor bugfixes and documentation enhancements
* better support for automated testing
Thanks to the following contributors to 2.0.2:
Daniel Blanchard, Mikhail Korobov, Nitin Madnani, Duncan McGreggor,
Morten Neergaard, Nathan Schneider, Rico Sennrich.
Version 2.0.1 2012-05-15
* moved NLTK to GitHub: https://github.com/nltk
* set up integration testing: https://jenkins.shiningpanda.com/nltk/ (Morten Neergaard)
* converted documentation to Sphinx format: https://www.nltk.org/api/nltk.html
* dozens of minor enhancements and bugfixes: https://github.com/nltk/nltk/commits/
* dozens of fixes for conformance with PEP-8
* dozens of fixes to ensure operation with Python 2.5
* added interface to Lin's Dependency Thesaurus (Dan Blanchard)
* added interface to scikit-learn classifiers (Lars Buitinck)
* added segmentation evaluation measures (David Doukhan)
Thanks to the following contributors to 2.0.1 (since 2.0b9, July 2010):
Rami Al-Rfou', Yonatan Becker, Steven Bethard, Daniel Blanchard, Lars
Buitinck, David Coles, Lucas Cooper, David Doukhan, Dan Garrette,
Masato Hagiwara, Michael Hansen, Michael Heilman, Rebecca Ingram,
Sudharshan Kaushik, Mikhail Korobov, Peter Ljunglof, Nitin Madnani,
Rob Malouf, Tomonori Nagano, Morten Neergaard, David Nemeskey,
Joel Nothman, Jacob Perkins, Alessandro Presta, Alex Rudnick,
Nathan Schneider, Stefano Lattarini, Peter Stahl, Jason Yoder
Version 2.0.1 (rc1) 2011-04-11
NLTK:
* added interface to the Stanford POS Tagger
* updates to sem.Boxer, sem.drt.DRS
* allow unicode strings in grammars
* allow non-string features in classifiers
* modifications to HunposTagger
* fixed issues with DRS printing
* fixed bigram collocation finder for window_size > 2
* doctest paths no longer presume unix-style pathname separators
* fixed issue with NLTK's tokenize module colliding with the Python tokenize module
* fixed issue with stemming Unicode strings
* changed ViterbiParser.nbest_parse to parse
* ChaSen and KNBC Japanese corpus readers
* preserve case in concordance display
* fixed bug in simplification of Brown tags
* a version of IBM Model 1 as described in Koehn 2010
* new class AlignedSent for aligned sentence data and evaluation metrics
* new nltk.util.set_proxy to allow easy configuration of HTTP proxy
* improvements to downloader user interface to catch URL and HTTP errors
* added CHILDES corpus reader
* created special exception hierarchy for Prover9 errors
* significant changes to the underlying code of the boxer interface
* path-based wordnet similarity metrics use a fake root node for verbs, following the Perl version
* added ability to handle multi-sentence discourses in Boxer
* added the 'english' Snowball stemmer
* simplifications and corrections of Earley Chart Parser rules
* several changes to the feature chart parsers for correct unification
* bugfixes: FreqDist.plot, FreqDist.max, NgramModel.entropy, CategorizedCorpusReader, DecisionTreeClassifier
* removal of Python >2.4 language features for 2.4 compatibility
* removal of deprecated functions and associated warnings
* added semantic domains to wordnet corpus reader
* changed wordnet similarity functions to include instance hyponyms
* updated to use latest version of Boxer
Data:
* JEITA Public Morphologically Tagged Corpus (in ChaSen format)
* KNB Annotated corpus of Japanese blog posts
* Fixed some minor bugs in alvey.fcfg, and added number of parse trees in alvey_sentences.txt
* added more comtrans data
Documentation:
* minor fixes to documentation
* NLTK Japanese book (chapter 12) by Masato Hagiwara
NLTK-Contrib:
* Viethen and Dale referring expression algorithms
Version 2.0b9 2010-07-25
NLTK:
* many code and documentation cleanups
* Added port of Snowball stemmers
* Fixed loading of pickled tokenizers (issue 556)
* DecisionTreeClassifier now handles unknown features (issue 570)
* Added error messages to LogicParser
* Replaced max_models with end_size to prevent Mace from hanging
* Added interface to Boxer
* Added nltk.corpus.semcor to give access to SemCor 3.0 corpus (issue 530)
* Added support for integer- and float-valued features in maxent classifiers
* Permit NgramModels to be pickled
* Added Sourced Strings (see test/sourcedstring.doctest for details)
* Fixed bugs in Good-Turing and Simple Good-Turing Estimation (issue 26)
* Add support for span tokenization, aka standoff annotation of segmentation (incl Punkt)
* allow unicode nodes in Tree.productions()
* Fixed WordNet's morphy to be consistent with the original implementation,
taking the shortest returned form instead of an arbitrary one (issues 427, 487)
* Fixed bug in MaxentClassifier
* Accepted bugfixes for YCOE corpus reader (issue 435)
* Added test to _cumulative_frequencies() to correctly handle the case when no arguments are supplied
* Added a TaggerI interface to the HunPos open-source tagger
* Return 0, not None, when no count is present for a lemma in WordNet
* fixed pretty-printing of unicode leaves
* More efficient calculation of the leftcorner relation for left corner parsers
* Added two functions for graph calculations: transitive closure and inversion.
* FreqDist.pop() and FreqDist.popitems() now invalidate the caches (issue 511)
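The two graph calculations mentioned in this release, transitive closure and inversion, can be illustrated as follows. This is a minimal sketch that assumes a graph stored as a dict mapping each node to a set of successor nodes; the function names and signatures are chosen for this illustration and may differ from the NLTK versions.

```python
def transitive_closure(graph):
    """Map each node to the set of all nodes reachable from it."""
    closure = {}
    for start in graph:
        seen = set()
        stack = list(graph[start])
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                # Nodes that only appear as targets have no outgoing edges.
                stack.extend(graph.get(node, ()))
        closure[start] = seen
    return closure


def invert_graph(graph):
    """Reverse every edge, so that a -> b becomes b -> a."""
    inverted = {}
    for source, targets in graph.items():
        for target in targets:
            inverted.setdefault(target, set()).add(source)
    return inverted


edges = {"a": {"b"}, "b": {"c"}, "c": set()}
closure = transitive_closure(edges)
inverted = invert_graph(edges)
print(closure)
print(inverted)
```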
Data:
* Added SemCor 3.0 corpus (Brown Corpus tagged with WordNet synsets)
* Added LanguageID corpus (trigram counts for 451 languages)
* Added grammar for a^n b^n c^n
NLTK-Contrib:
* minor updates
Thanks to the following contributors to 2.0b9:
Steven Bethard, Francis Bond, Dmitry Chichkov, Liang Dong, Dan Garrette,
Simon Greenhill, Bjorn Maeland, Rob Malouf, Joel Nothman, Jacob Perkins,
Alberto Planas, Alex Rudnick, Geoffrey Sampson, Kevin Scannell, Richard Sproat
Version 2.0b8 2010-02-05
NLTK:
* fixed copyright and license statements
* removed PyYAML, and added dependency to installers and download instructions
* updates to LogicParser, DRT (Dan Garrette)
* WordNet similarity metrics return None instead of -1 when
they fail to find a path (Steve Bethard)
* shortest_path_distance uses instance hypernyms (Jordan Boyd-Graber)
* clean_html improved (Bjorn Maeland)
* batch_parse, batch_interpret and batch_evaluate functions allow
grammar or grammar filename as argument
* more Portuguese examples (portuguese_en.doctest, examples/pt.py)
NLTK-Contrib:
* Aligner implementations (Christopher Crowner, Torsten Marek)
* ScriptTranscriber package (Richard Sproat and Kristy Hollingshead)
Book:
* updates for second printing, correcting errata
https://nltk.googlecode.com/svn/trunk/nltk/doc/book/errata.txt
Data:
* added Europarl sample, with 10 docs for each of 11 langs (Nitin Madnani)
* added SMULTRON sample corpus (Torsten Marek, Martin Volk)
Version 2.0b7 2009-11-09
NLTK:
* minor bugfixes and enhancements: data loader, inference package, FreqDist, Punkt
* added Portuguese example module, similar to nltk.book for English (examples/pt.py)
* added all_lemma_names() method to WordNet corpus reader
* added update() and __add__() extensions to FreqDist (enhances alignment with Python 3.0 counters)
* reimplemented clean_html
* added test-suite runner for automatic/manual regression testing
NLTK-Data:
* updated Punkt models for sentence segmentation
* added corpus of the works of Machado de Assis (Brazilian Portuguese)
Book:
* Added translation of preface into Portuguese, contributed by Tiago Tresoldi.
Version 2.0b6 2009-09-20
NLTK:
* minor fixes for Python 2.4 compatibility
* added words() method to XML corpus reader
* minor bugfixes and code clean-ups
* fixed downloader to put data in %APPDATA% on Windows
Data:
* Updated Punkt models
* Fixed utf8 encoding issues with UDHR and Stopwords Corpora
* Renamed CoNLL "cat" files to "esp" (different language)
* Added Alvey NLT feature-based grammar
* Added Polish PL196x corpus
Version 2.0b5 2009-07-19
NLTK:
* minor bugfixes (incl FreqDist, Python eggs)
* added reader for Europarl Corpora (contributed by Nitin Madnani)
* added reader for IPI PAN Polish Corpus (contributed by Konrad Goluchowski)
* fixed data.py so that it doesn't generate a warning for Windows Python 2.6
NLTK-Contrib:
* updated Praat reader (contributed by Margaret Mitchell)
Version 2.0b4 2009-07-10
NLTK:
* switched to Apache License, Version 2.0
* minor bugfixes in semantics and inference packages
* support for Python eggs
* fixed stale regression tests
Data:
* added NomBank 1.0
* uppercased feature names in some grammars
Version 2.0b3 2009-06-25
NLTK:
* several bugfixes
* added nombank corpus reader (Paul Bedaride)
Version 2.0b2 2009-06-15
NLTK:
* minor bugfixes and optimizations for parsers, updated some doctests
* added bottom-up filtered left corner parsers,
LeftCornerChartParser and IncrementalLeftCornerChartParser.
* fixed dispersion plot bug which prevented empty plots
Version 2.0b1 2009-06-09
NLTK:
* major refactor of chart parser code and improved API (Peter Ljunglöf)
* added new bottom-up left-corner chart parser strategy
* misc bugfixes (ChunkScore, chart rules, chatbots, jcn-similarity)
* improved efficiency of "import nltk" using lazy module imports
* moved CCG package and ISRI Arabic stemmer from NLTK-Contrib into core NLTK
* misc code cleanups
Contrib:
* moved out of the main NLTK distribution into a separate distribution
Book:
* Ongoing polishing ahead of print publication
Version 0.9.9 2009-05-06
NLTK:
* Finalized API for NLTK 2.0 and the book, incl dozens of small fixes
* Names of the form nltk.foo.Bar now available as nltk.Bar
for significant functionality; in some cases the name was modified
(using old names will produce a deprecation warning)
* Bugfixes in downloader, WordNet
* Expanded functionality in DecisionTree
* Bigram collocations extended for discontiguous bigrams
* Translation toy nltk.misc.babelfish
* New module nltk.help giving access to tagset documentation
* Fix imports so that NLTK builds without Tkinter (Bjorn Maeland)
Data:
* new maxent NE chunker model
* updated grammar packages for the book
* data for new tagsets collection, documenting several tagsets
* added lolcat translation to the Genesis collection
Contrib (work in progress):
* Updates to coreference package (Joseph Frazee)
* New ISRI Arabic stemmer (Hosam Algasaier)
* Updates to Toolbox package (Greg Aumann)
Book:
* Substantial editorial corrections ahead of final submission
Version 0.9.8 2009-02-18
NLTK:
* New off-the-shelf tokenizer, POS tagger, and named-entity tagger
* New metrics package with inter-annotator agreement scores,
distance metrics, rank correlation
* New collocations package (Joel Nothman)
* Many clean-ups to WordNet package (Steven Bethard, Jordan Boyd-Graber)
* Moved old pywordnet-based WordNet package to nltk_contrib
* WordNet browser (Paul Bone)
* New interface to dependency treebank corpora
* Moved MinimalSet class into nltk.misc package
* Put NLTK applications in new nltk.app package
* Many other improvements incl semantics package, toolbox, MaltParser
* Misc changes to many API names in preparation for 2.0, old names deprecated
* Most classes now available in the top-level namespace
* Work on Python egg distribution (Brandon Rhodes)
* Removed deprecated code remaining from 0.8.* versions
* Fixes for Python 2.4 compatibility
Data:
* Corrected identifiers in Dependency Treebank corpus
* Basque and Catalan Dependency Treebanks (CoNLL 2007)
* PE08 Parser Evaluation data
* New models for POS tagger and named-entity tagger
Book:
* Substantial editorial corrections
Version 0.9.7 2008-12-19
NLTK:
* fixed problems with accessing zipped corpora
* improved design and efficiency of grammars and chart parsers
including new bottom-up combine strategy and a redesigned
Earley strategy (Peter Ljunglof)
* fixed bugs in smoothed probability distributions and added
regression tests (Peter Ljunglof)
* improvements to Punkt (Joel Nothman)
* improvements to text classifiers
* simple word-overlap RTE classifier
Data:
* A new package of large grammars (Peter Ljunglof)
* A small gazetteer corpus and corpus reader
* Organized example grammars into separate packages
* Children's stories added to gutenberg package
Contrib (work in progress):
* fixes and demonstration for named-entity feature extractors in nltk_contrib.coref
Book:
* extensive changes throughout, including new chapter 5 on classification
and substantially revised chapter 11 on managing linguistic data
Version 0.9.6 2008-12-07
NLTK:
* new WordNet corpus reader (contributed by Steven Bethard)
* incorporated dependency parsers into NLTK (was NLTK-Contrib) (contributed by Jason Narad)
* moved nltk/cfg.py to nltk/grammar.py and incorporated dependency grammars
* improved efficiency of unification algorithm
* various enhancements to the semantics package
* added plot() and tabulate() methods to FreqDist and ConditionalFreqDist
* FreqDist.keys() and list(FreqDist) provide keys reverse-sorted by value,
to avoid the confusion caused by FreqDist.sorted()
* new downloader module to support interactive data download: nltk.download()
run using "python -m nltk.downloader all"
* fixed WordNet bug that caused min_depth() to sometimes give incorrect result
* added nltk.util.Index as a wrapper around defaultdict(list) plus
a functional-style initializer
* fixed bug in Earley chart parser that caused it to break
* added basic TnT tagger nltk.tag.tnt
* new corpus reader for CoNLL dependency format (contributed by Kepa Sarasola and Iker Manterola)
* misc other bugfixes
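The description of nltk.util.Index above ("a wrapper around defaultdict(list) plus a functional-style initializer") can be pictured with the sketch below. It is an illustration written from that one-line description, not a copy of the NLTK source; the sample word list is invented.

```python
from collections import defaultdict


class Index(defaultdict):
    """A defaultdict(list) that can be filled directly from (key, value) pairs."""

    def __init__(self, pairs=()):
        super().__init__(list)
        for key, value in pairs:
            self[key].append(value)


words = ["apple", "avocado", "banana", "blueberry", "cherry"]

# Functional-style initialisation: build the whole index from one generator.
by_initial = Index((word[0], word) for word in words)
print(by_initial["a"])
print(by_initial["b"])
```

Compared with creating a defaultdict(list) and appending in an explicit loop, the initializer lets the index be built in a single expression.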
Contrib (work in progress):
* TIGERSearch implementation by Torsten Marek
* extensions to hole and glue semantics modules by Dan Garrette
* new coreference package by Joseph Frazee
* MapReduce interface by Xinfan Meng
Data:
* Corpora are stored in compressed format if this will not compromise speed of access
* Swadesh Corpus of comparative wordlists in 23 languages
* Split grammar collection into separate packages
* New Basque and Spanish grammar samples (contributed by Kepa Sarasola and Iker Manterola)
* Brown Corpus sections now have meaningful names (e.g. 'a' is now 'news')
* Fixed bug that forced users to manually unzip the WordNet corpus
* New dependency-parsed version of Treebank corpus sample
* Added movie script "Monty Python and the Holy Grail" to webtext corpus
* Replaced words corpus data with a much larger list of English words
* New URL for list of available NLTK corpora
https://nltk.googlecode.com/svn/trunk/nltk_data/index.xml
Book:
* complete rewrite of first three chapters to make the book accessible
to a wider audience
* new chapter on data-intensive language processing
* extensive reworking of most chapters
* Dropped subsection numbering; moved exercises to end of chapters
Distributions:
* created Portfile to support Mac installation
Version 0.9.5 2008-08-27
NLTK:
* text module with support for concordancing, text generation, plotting
* book module
* Major reworking of the automated theorem proving modules (Dan Garrette)
* draw.dispersion now uses pylab
* draw.concordance GUI tool
* nltk.data support for reading corpora and other data files from within zipfiles
* trees can be constructed from strings with Tree(s) (cf Tree.parse(s))
Contrib (work in progress):
* many updates to student projects
- nltk_contrib.agreement (Thomas Lippincott)
- nltk_contrib.coref (Joseph Frazee)
- nltk_contrib.depparser (Jason Narad)
- nltk_contrib.fuf (Petro Verkhogliad)
- nltk_contrib.hadoop (Xinfan Meng)
* clean-ups: deleted stale files; moved some packages to misc
Data:
* Cleaned up Gutenberg text corpora
* added Moby Dick; removed redundant copy of Blake songs.
* more tagger models
* renamed to nltk_data to facilitate installation
* stored each corpus as a zip file for quicker installation
and access, and to solve a problem with the Propbank
corpus including a file with an illegal name for MSWindows
(con.xml).
Book:
* changed filenames to chNN format
* reworked opening chapters (work in progress)
Distributions:
* fixed problem with mac installer that arose when Python binary
couldn't be found
* removed dependency of NLTK on nltk_data so that NLTK code can be
installed before the data
Version 0.9.4 2008-08-01
NLTK:
- Expanded semantics package for first order logic, linear logic,
glue semantics, DRT, LFG (Dan Garrette)
- new WordSense class in wordnet.synset supporting access to synsets
from sense keys and accessing sense counts (Joel Nothman)
- interface to Mallet's linear chain CRF implementation (nltk.tag.crf)
- misc bugfixes incl Punkt, synsets, maxent
- improved support for chunkers incl flexible chunk corpus reader,
new rule type: ChunkRuleWithContext
- new GUI for pos-tagged concordancing nltk.draw.pos_concordance
- new GUI for developing regexp chunkers nltk.draw.rechunkparser
- added bio_sents() and bio_words() methods to ConllChunkCorpusReader in conll.py
to allow reading (word, tag, chunk_typ) tuples off of CoNLL-2000 corpus. Also
modified ConllChunkCorpusView to support these changes.
- feature structures support values with custom unification methods
- new flag on tagged corpus readers to use simplified tagsets
- new package for ngram language modeling with Katz backoff nltk.model
- added classes for single-parented and multi-parented trees that
automatically maintain parent pointers (nltk.tree.ParentedTree and
nltk.tree.MultiParentedTree)
- new WordNet browser GUI (Jussi Salmela, Paul Bone)
- improved support for lazy sequences
- added generate() method to probability distributions
- more flexible parser for converting bracketed strings to trees
- made fixes to docstrings to improve API documentation
Contrib (work in progress):
- new NLG package, FUF/SURGE (Petro Verkhogliad)
- new dependency parser package (Jason Narad)
- new Coreference package, incl support for
ACE-2, MUC-6 and MUC-7 corpora (Joseph Frazee)
- CCG Parser (Graeme Gange)
- first order resolution theorem prover (Dan Garrette)
Data:
- New NPS Chat Corpus and corpus reader (nltk.corpus.nps_chat)
- ConllCorpusReader can now be used to read CoNLL 2004 and 2005 corpora.
- Implemented HMM-based Treebank POS tagger and phrase chunker for
nltk_contrib.coref in api.py. Pickled versions of these objects are checked
in to data/taggers and data/chunkers.
Book:
- misc corrections in response to feedback from readers
Version 0.9.3 2008-06-03
NLTK:
- modified WordNet similarity code to use pre-built information content files
- new classifier-based tagger, BNC corpus reader
- improved unicode support for corpus readers
- improved interfaces to Weka, Prover9/Mace4
- new support for using MEGAM and SciPy to train maxent classifiers
- rewrite of Punkt sentence segmenter (Joel Nothman)
- bugfixes for WordNet information content module (Jordan Boyd-Graber)
- code clean-ups throughout
Book:
- miscellaneous fixes in response to feedback from readers
Contrib:
- implementation of incremental algorithm for generating
referring expressions (contributed by Margaret Mitchell)
- refactoring WordNet browser (Paul Bone)
Corpora:
- included WordNet information content files
Version 0.9.2 2008-03-04
NLTK:
- new theorem-prover and model-checker module nltk.inference,
including interface to Prover9/Mace4 (Dan Garrette, Ewan Klein)
- bugfix in Reuters corpus reader that causes Python
to complain about too many open files
- VerbNet and PropBank corpus readers
Data:
- VerbNet Corpus version 2.1: hierarchical verb lexicon linked to WordNet
- PropBank Corpus: predicate-argument structures, as stand-off annotation of Penn Treebank
Contrib:
- New work on WordNet browser, incorporating a client-server model (Jussi Salmela)
Distributions:
- Mac OS 10.5 distribution
Version 0.9.1 2008-01-24
NLTK:
- new interface for text categorization corpora
- new corpus readers: RTE, Movie Reviews, Question Classification, Brown Corpus
- bugfix in ConcatenatedCorpusView that caused iteration to fail if it didn't start from the beginning of the corpus
Data:
- Question classification data, included with permission of Li & Roth
- Reuters 21578 Corpus, ApteMod version, from CPAN
- Movie Reviews corpus (sentiment polarity), included with permission of Lillian Lee
- Corpus for Recognising Textual Entailment (RTE) Challenges 1, 2 and 3
- Brown Corpus (reverted to original file structure: ca01-cr09)
- Penn Treebank corpus sample (simplified implementation, new readers treebank_raw and treebank_chunk)
- Minor redesign of corpus readers, to use filenames instead of "items" to identify parts of a corpus
Contrib:
- theorem_prover: Prover9, tableau, MaltParser, Mace4, glue semantics, docs (Dan Garrette, Ewan Klein)
- drt: improved drawing, conversion to FOL (Dan Garrette)
- gluesemantics: GUI demonstration, abstracted LFG code, documentation (Dan Garrette)
- readability: various text readability scores (Thomas Jakobsen, Thomas Skardal)
- toolbox: code to normalize toolbox databases (Greg Aumann)
Book:
- many improvements in early chapters in response to reader feedback
- updates for revised corpus readers
- moved unicode section to chapter 3
- work on engineering.txt (not included in 0.9.1)
Distributions:
- Fixed installation for Mac OS 10.5 (Joshua Ritterman)
- Generalize doctest_driver to work with doc_contrib
Version 0.9 2007-10-12
NLTK:
- New naming of packages and modules, and more functions imported into
top-level nltk namespace, e.g. nltk.chunk.Regexp -> nltk.RegexpParser,
nltk.tokenize.Line -> nltk.LineTokenizer, nltk.stem.Porter -> nltk.PorterStemmer,
nltk.parse.ShiftReduce -> nltk.ShiftReduceParser
- processing class names changed from verbs to nouns, e.g.
StemI -> StemmerI, ParseI -> ParserI, ChunkParseI -> ChunkParserI, ClassifyI -> ClassifierI
- all tokenizers are now available as subclasses of TokenizeI,
selected tokenizers are also available as functions, e.g. wordpunct_tokenize()
- rewritten ngram tagger code, collapsed lookup tagger with unigram tagger
- improved tagger API, permitting training in the initializer
- new system for deprecating code so that users are notified of name changes.
- support for reading feature cfgs to parallel reading cfgs (parse_featcfg())
- text classifier package, maxent (GIS, IIS), naive Bayes, decision trees, weka support
- more consistent tree printing
- wordnet's morphy stemmer now accessible via stemmer package
- RSLP Portuguese stemmer (originally developed by Viviane Moreira Orengo, reimplemented by Tiago Tresoldi)
- promoted ieer_rels.py to the sem package
- improvements to WordNet package (Jussi Salmela)
- more regression tests, and support for checking coverage of tests
- miscellaneous bugfixes
- remove numpy dependency
Data:
- new corpus reader implementation, refactored syntax corpus readers
- new data package: corpora, grammars, tokenizers, stemmers, samples
- CESS-ESP Spanish Treebank and corpus reader
- CESS-CAT Catalan Treebank and corpus reader
- Alpino Dutch Treebank and corpus reader
- MacMorpho POS-tagged Brazilian Portuguese news text and corpus reader
- trained model for Portuguese sentence segmenter
- Floresta Portuguese Treebank version 7.4 and corpus reader
- TIMIT player audio support
Contrib:
- BioReader (contributed by Carlos Rodriguez)
- TnT tagger (contributed by Sam Huston)
- wordnet browser (contributed by Jussi Salmela, requires wxpython)
- lpath interpreter (contributed by Haejoong Lee)
- timex -- regular expression-based temporal expression tagger
Book:
- polishing of early chapters
- introductions to parts 1, 2, 3
- improvements in book processing software (xrefs, avm & gloss formatting, javascript clipboard)
- updates to book organization, chapter contents
- corrections throughout suggested by readers (acknowledged in preface)
- more consistent use of US spelling throughout
- all examples redone to work with single import statement: "import nltk"
- reordered chapters: 5->7->8->9->11->12->5
* language engineering in part 1 to broaden the appeal
of the earlier part of the book and to talk more about
evaluation and baselines at an earlier stage
* concentrate the partial and full parsing material in part 2,
and remove the specialized feature-grammar material into part 3
Distributions:
- streamlined mac installation (Joshua Ritterman)
- included mac distribution with ISO image
Version 0.8 2007-07-01
Code:
- changed nltk.__init__ imports to explicitly import names from top-level modules
- changed corpus.util to use the 'rb' flag for opening files, to fix problems
reading corpora under MSWindows
- updated stale examples in engineering.txt
- extended feature structure interface to permit chained features, e.g. fs['F','G']
- further misc improvements to test code plus some bugfixes
Tutorials:
- rewritten opening section of tagging chapter
- reorganized some exercises
Version 0.8b2 2007-06-26
Code (major):
- new corpus package, obsoleting old corpora package
- supports caching, slicing, corpus search path
- more flexible API
- global updates so all NLTK modules use new corpus package
- moved nltk/contrib to separate top-level package nltk_contrib
- changed wordpunct tokenizer to use \w instead of a-zA-Z0-9
as this will be more robust for languages other than English,
with implications for many corpus readers that use it
- known bug: certain re-entrant structures in featstruct
- known bug: when the LHS of an edge contains an ApplicationExpression,
variable values in the RHS bindings aren't copied over when the
fundamental rule applies
- known bug: HMM tagger is broken
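The effect of the wordpunct tokenizer change can be seen by comparing the two character classes on text with accented letters. The patterns below are an assumption reconstructed from the wording of the entry above (an ASCII-only class versus \w); they are not quoted from the NLTK source. Note that \w also matches the underscore, which the ASCII class here does not.

```python
import re

# Assumed older behaviour: only ASCII letters and digits count as word characters.
ASCII_PATTERN = re.compile(r"[a-zA-Z0-9]+|[^a-zA-Z0-9\s]+")

# Assumed newer behaviour: \w covers letters and digits from any script.
WORD_PATTERN = re.compile(r"\w+|[^\w\s]+")

text = "caf\u00e9 au lait"  # "café au lait", with a precomposed e-acute

ascii_tokens = ASCII_PATTERN.findall(text)
word_tokens = WORD_PATTERN.findall(text)

print(ascii_tokens)  # the accented letter is split off as if it were punctuation
print(word_tokens)   # the accented word stays in one piece
```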
Tutorials:
- global updates to NLTK and docs
- ongoing polishing
Corpora:
- treebank sample reverted to published multi-file structure
Contrib:
- DRT and Glue Semantics code (nltk_contrib.drt, nltk_contrib.gluesemantics, by Dan Garrette)
Version 0.8b1 2007-06-18
Code (major):
- changed package name to nltk
- import all top-level modules into nltk, reducing need for import statements
- reorganization of sub-package structures to simplify imports
- new featstruct module, unifying old featurelite and featurestructure modules
- FreqDist now inherits from dict, fd.count(sample) becomes fd[sample]
- FreqDist initializer permits: fd = FreqDist(len(token) for token in text)
- made numpy optional
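The two FreqDist changes above (dict inheritance and initialisation from an iterable) can be pictured with a toy class. This is a deliberately minimal sketch, not the NLTK implementation: the real FreqDist offers many more methods, and the sample sentence is invented.

```python
class FreqDist(dict):
    """Toy frequency distribution: a dict mapping each sample to its count."""

    def __init__(self, samples=()):
        super().__init__()
        for sample in samples:
            self[sample] = self.get(sample, 0) + 1


text = ["the", "cat", "sat", "on", "the", "mat"]

# The initializer accepts any iterable, including a generator expression.
fd = FreqDist(len(token) for token in text)

# Counts are read with ordinary dict indexing, fd[sample], not fd.count(sample).
print(fd[3])
print(fd[2])
```

Because the class is a dict, the usual dict operations (iteration, `in`, `len`, `items()`) work without extra code.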
Code (minor):
- changed GrammarFile initializer to accept filename
- consistent tree display format
- fixed loading process for WordNet and TIMIT that prevented code installation if data not installed
- taken more care with unicode types
- incorporated pcfg code into cfg module
- moved cfg, tree, featstruct to top level
- new filebroker module to make handling of example grammar files more transparent
- more corpus readers (webtext, abc)
- added cfg.covers() to check that a grammar covers a sentence
- simple text-based wordnet browser
- known bug: parse/featurechart.py uses incorrect apply() function
Corpora:
- csv data file to document NLTK corpora
Contrib:
- added Glue semantics code (contrib.glue, by Dan Garrette)
- Punkt sentence segmenter port (contrib.punkt, by Willy)
- added LPath interpreter (contrib.lpath, by Haejoong Lee)
- extensive work on classifiers (contrib.classifier*, Sumukh Ghodke)
Tutorials:
- polishing on parts I, II
- more illustrations, data plots, summaries, exercises
- continuing to make prose more accessible to non-linguistic audience
- new default import that all chapters presume: from nltk.book import *
Distributions:
- updated to latest version of numpy
- removed WordNet installation instructions as WordNet is now included in corpus distribution
- added pylab (matplotlib)
Version 0.7.5 2007-05-16
Code:
- improved WordNet and WordNet-Similarity interface
- the Lancaster Stemmer (contributed by Steven Tomcavage)
Corpora:
- Web text samples
- BioCreAtIvE-PPI - a corpus for protein-protein interactions
- Switchboard Telephone Speech Corpus Sample (via Talkbank)
- CMU Problem Reports Corpus sample
- CONLL2002 POS+NER data
- Patient Information Leaflet corpus
- WordNet 3.0 data files
- English wordlists: basic English, frequent words
Tutorials:
- more improvements to text and images
Version 0.7.4 2007-05-01
Code:
- Indian POS tagged corpus reader: corpora.indian
- Sinica Treebank corpus reader: corpora.sinica_treebank
- new web corpus reader corpora.web
- tag package now supports pickling
- added function to utilities.py to guess character encoding
Corpora:
- Rotokas texts from Stuart Robinson
- POS-tagged corpora for several Indian languages (Bangla, Hindi, Marathi, Telugu) from A Kumaran
Tutorials:
- Substantial work on Part II of book on structured programming, parsing and grammar
- More bibliographic citations
- Improvements in typesetting, cross references
- Redimensioned images and tables for better use of page space
- Moved project list to wiki
Contrib:
- validation of toolbox entries using chunking
- improved classifiers
Distribution:
- updated for Python 2.5.1, Numpy 1.0.2
Version 0.7.3 2007-04-02
* Code:
- made chunk.Regexp.parse() more flexible about its input
- developed new syntax for PCFG grammars, e.g. A -> B C [0.3] | D [0.7]
- fixed CFG parser to support grammars with slash categories
- moved beta classify package from main NLTK to contrib
- Brill taggers loaded correctly
- misc bugfixes
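The PCFG syntax shown above, A -> B C [0.3] | D [0.7], places a probability in square brackets after each alternative. The sketch below shows how such a line might be read; `parse_pcfg_line` is a hypothetical helper written for this illustration and is not NLTK's grammar parser. It assumes one rule per line and symbols that contain no whitespace, "|" or "[".

```python
import re


def parse_pcfg_line(line):
    """Split 'LHS -> alt [p] | alt [p]' into (lhs, rhs_symbols, probability) triples."""
    lhs, rhs_part = line.split("->", 1)
    productions = []
    for alternative in rhs_part.split("|"):
        match = re.fullmatch(r"\s*(.*?)\s*\[([0-9.]+)\]\s*", alternative)
        if match is None:
            raise ValueError(f"missing probability in {alternative!r}")
        symbols = tuple(match.group(1).split())
        productions.append((lhs.strip(), symbols, float(match.group(2))))
    return productions


rules = parse_pcfg_line("A -> B C [0.3] | D [0.7]")
total = sum(probability for _, _, probability in rules)

for rule in rules:
    print(rule)
print("probabilities sum to", total)
```

In a well-formed PCFG the probabilities of all alternatives for one left-hand side sum to 1, which is a useful sanity check when reading grammars in this format.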
* Corpora:
- Shakespeare XML corpus sample and corpus reader
* Tutorials:
- improvements to prose, exercises, plots, images
- expanded and reorganized tutorial on structured programming
- formatting improvements for Python listings
- improved plots (using pylab)
- categorization of problems by difficulty
* Contrib:
- more work on kimmo lexicon and grammar
- more work on classifiers
Version 0.7.2 2007-03-01
* Code:
- simple feature detectors (detect module)
- fixed problem when token generators are passed to a parser (parse package)
- fixed bug in Grammar.productions() (identified by Lucas Champollion and Mitch Marcus)
- fixed import bug in category.GrammarFile.earley_parser
- added utilities.OrderedDict
- initial port of old NLTK classifier package (by Sam Huston)
- UDHR corpus reader
* Corpora:
- added UDHR corpus (Universal Declaration of Human Rights)
with 10k text samples in 300+ languages
* Tutorials:
- improved images
- improved book formatting, including new support for:
- javascript to copy program examples to clipboard in HTML version,
- bibliography, chapter cross-references, colorization, index, table-of-contents
* Contrib:
- new Kimmo system: contrib.mit.six863.kimmo (Rob Speer)
- fixes for: contrib.fsa (Rob Speer)
- demonstration of text classifiers trained on UDHR corpus for
language identification: contrib.langid (Sam Huston)
- new Lambek calculus system: contrib.lambek
- new tree implementation based on elementtree: contrib.tree
Version 0.7.1 2007-01-14
* Code:
- bugfixes (HMM, WordNet)
Version 0.7 2006-12-22
* Code:
- bugfixes, including fixed bug in Brown corpus reader
- cleaned up wordnet 2.1 interface code and similarity measures
- support for full Penn treebank format contributed by Yoav Goldberg
* Tutorials:
- expanded tutorials on advanced parsing and structured programming
- checked all doctest code
- improved images for chart parsing
Version 0.7b1 2006-12-06
* Code:
- expanded semantic interpretation package
- new high-level chunking interface, with cascaded chunking
- split chunking code into new chunk package
- updated wordnet package to support version 2.1 of WordNet
- prototyped basic wordnet similarity measures
(path distance, Wu + Palmer and Leacock + Chodorow, Resnik similarity measures.)
- bugfixes (tag.Window, tag.ngram)
- more doctests
* Contrib:
- toolbox language settings module
* Tutorials:
- rewrite of chunking chapter, switched from Treebank to CoNLL format as main focus,
simplified evaluation framework, added ngram chunking section
- substantial updates throughout (especially the programming and semantics chapters)
* Corpora:
- Chat-80 Prolog data files provided as corpora, plus corpus reader
Version 0.7a2 2006-11-13
* Code:
- more doctests
- code to read Chat-80 data
- HMM bugfix
* Tutorials:
- continued updates and polishing
* Corpora:
- toolbox MDF sample data
Version 0.7a1 2006-10-29
* Code:
- new toolbox module (Greg Aumann)
- new semantics package (Ewan Klein)
- bugfixes
* Tutorials:
- substantial revision, especially in preface, introduction, words,
and semantics chapters.
Version 0.6.6 2006-10-06
* Code:
- bugfixes (probability, shoebox, draw)
* Contrib:
- new work on shoebox package (Stuart Robinson)
* Tutorials:
- continual expansion and revision, especially of the introduction to
programming, advanced programming, and feature-based grammar chapters.
Version 0.6.5 2006-07-09
* Code:
- improvements to shoebox module (Stuart Robinson, Greg Aumann)
- incorporated feature-based parsing into core NLTK-Lite
- corpus reader for Sinica treebank sample
- new stemmer package
* Contrib:
- hole semantics implementation (Peter Wang)
- incorporating yaml
- new work on feature structures, unification, lambda calculus
- new work on shoebox package (Stuart Robinson, Greg Aumann)
* Corpora:
- Sinica treebank sample
* Tutorials:
- expanded discussion throughout, including: left-recursion, trees, grammars,
feature-based grammar, agreement, unification, PCFGs,
baseline performance, exercises, improved display of trees
Version 0.6.4 2006-04-20
* Code:
- corpus readers for Senseval 2 and TIMIT
- clusterer (ported from old NLTK)
- support for cascaded chunkers
- bugfix suggested by Brent Payne
- new SortedDict class for regression testing
* Contrib:
- CombinedTagger tagger and marshalling taggers, contributed by Tiago Tresoldi
* Corpora:
- new: Senseval 2, TIMIT sample
* Tutorials:
- major revisions to programming, words, tagging, chunking, and parsing tutorials
- many new exercises
- formatting improvements, including colorized program examples
- fixed problem with testing on training data, reported by Jason Baldridge
Version 0.6.3 2006-03-09
* switch to new style classes
* repair FSA model sufficiently for Kimmo module to work
* port of MIT Kimmo morphological analyzer; still needs lots of code clean-up and inline docs
* expanded support for shoebox format, developed with Stuart Robinson
* fixed bug in indexing CFG productions, for empty right-hand-sides
* efficiency improvements, suggested by Martin Ranang
* replaced classeq with isinstance, for efficiency improvement, as suggested by Martin Ranang
* bugfixes in chunk eval
* simplified call to draw_trees
* names, stopwords corpora
Version 0.6.2 2006-01-29
* Peter Spiller's concordancer
* Will Hardy's implementation of Penton's paradigm visualization system
* corpus readers for presidential speeches
* removed NLTK dependency
* generalized CFG terminals to permit full range of characters
* used fully qualified names in demo code, for portability
* bugfixes from Yoav Goldberg, Eduardo Pereira Habkost
* fixed obscure quoting bug in tree displays and conversions
* simplified demo code, fixed import bug