File: README.txt

TEST CORPORA
============

The corpora are intended for testing and evaluating the functionality in the Pattern module. They are not the original corpora, but samples that have been reduced in size and/or balanced. The original corpora can be found by following the links below.

The corpora are meant for personal use; they are not covered by the module's BSD license.

1) Through the Looking-Glass, written by Lewis Carroll
- carroll-lookingglass.docx
- http://www.gutenberg.org/
- Chapter 1 of Through the Looking-Glass in Office Open XML format.

2) Alice in Wonderland, written by Lewis Carroll
- carroll-wonderland.pdf
- http://www.gutenberg.org/
- Full text of Alice in Wonderland in PDF format.

3) Clough & Stevenson's plagiarism corpus
- plagiarism-clough&stevenson.csv
- http://ir.shef.ac.uk/cloughie/resources/plagiarism_corpus.html
- 100 texts labeled as authentic (0), heavy revision (1), light revision (2) or cut & paste (3).

4) Amazon.de German book reviews
- polarity-de-amazon.csv
- http://www.amazon.de/gp/bestsellers/books/
- 100 "positive" and 100 "negative" book reviews.

5) Amazon.fr French book reviews
- polarity-fr-amazon.csv
- http://www.amazon.fr/
- 750 "positive" and 750 "negative" book reviews.

6) Pang & Lee's sentence polarity dataset v1.0
- polarity-en-pang&lee1.csv
- http://www.cs.cornell.edu/people/pabo/movie-review-data/
- 2000 "positive" and 2000 "negative" sentences.

7) Pang & Lee's polarity dataset v2.0
- polarity-en-pang&lee2.csv
- http://www.cs.cornell.edu/people/pabo/movie-review-data/
- 750 "positive" and 750 "negative" movie reviews.

8) Bol.com Dutch book reviews
- polarity-nl-bol.com.csv
- http://www.bol.com/nl/m/nederlandse-boeken/literatuur/
- 1500 "positive" and 1500 "negative" book reviews.

9) German portion of Tiger Treebank (Brants et al.)
- tagged-de-tiger.txt
- http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora
- 250 German sentences with STTS part-of-speech tags.

10) English portion of Open American National Corpus (Ide et al.)
- tagged-en-oanc.txt
- http://www.anc.org/data/oanc/
- 1000 English sentences with Penn Treebank part-of-speech tags.

11) English portion of Penn Treebank (Marcus et al.)
- tagged-en-wsj.txt
- http://www.cis.upenn.edu/~treebank/home.html
- 1000 English sentences with Penn Treebank part-of-speech tags.

12) Spanish portion of Wikicorpus v.1.0 (Reese & Boleda et al.)
- tagged-es-wikicorpus.txt
- http://www.lsi.upc.edu/~nlp/wikicorpus/
- 1000 Spanish sentences with Parole part-of-speech tags.

13) Italian portion of WaCKy Corpus (Baroni et al.)
- tagged-it-wacky.txt
- http://wacky.sslmit.unibo.it/doku.php?id=corpora
- 1000 Italian sentences with Penn Treebank II part-of-speech tags.

14) Dutch portion of Twente Nieuws Corpus (Ordelman et al.)
- tagged-nl-twnc.txt
- http://hmi.ewi.utwente.nl/TwNC
- 1000 Dutch sentences with Wotan part-of-speech tags.

15) Apache SpamAssassin public mail corpus
- spam-apache.csv
- http://spamassassin.apache.org/publiccorpus/
- 125 "spam" and 125 (mostly technical) "ham" messages.

16) Birkbeck spelling error corpus
- spelling-birkbeck.csv
- http://www.ota.ox.ac.uk/headers/0643.xml
- 500 words and how they are commonly misspelled.

17) CoNLL 2010 Shared Task 1 - Wikipedia uncertainty
- uncertainty-conll2010.csv
- http://www.inf.u-szeged.hu/rgai/conll2010st/tasks.html#task1
- 1500 "certain" and 1500 "uncertain" Wikipedia sentences.

18) Celex 2.5 German word forms
- wordforms-de-celex.csv
- http://celex.mpi.nl/
- 250 singular nouns and their plural form.
- 250 predicative adjectives and their attributive form.

19) Celex 2.5 English word forms
- wordforms-en-celex.csv
- http://celex.mpi.nl/
- 4000 singular nouns and their plural form.

20) Celex 2.5 Dutch word forms
- wordforms-nl-celex.csv
- http://celex.mpi.nl/
- 1000 singular nouns and their plural form.
- 1000 predicative adjectives and their attributive form.

21) Davies Corpus del Español word forms
- wordforms-es-davies.csv
- http://www.wordfrequency.info/files/spanish/spanish_lemmas20k.txt
- 3000 word forms with lemma, part-of-speech and frequency.

22) Wiktionary Italian word forms
- wordforms-it-wiktionary.csv
- https://en.wiktionary.org/wiki/Category:Italian_language
- 2000 word forms with lemma, part-of-speech and gender.

23) Lexique 3 French word forms
- wordforms-fr-lexique.csv
- http://www.lexique.org/
- 2000 word forms with lemma and part-of-speech.
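
Several of the corpora above are described as balanced (for example, an equal number of "positive" and "negative" reviews). The following is a minimal sketch of how that balance could be checked with Python's standard csv module. The column layout of the files is not documented here, so the assumption that the label sits in the last column, the helper name count_labels and the sample rows are all hypothetical.

```python
import csv
import io
from collections import Counter

def count_labels(handle, label_column=-1):
    """Tally the values found in one column of a CSV stream.

    label_column is an assumption: the README does not state
    which column of each corpus file holds the label.
    """
    counts = Counter()
    for row in csv.reader(handle):
        if row:  # skip blank lines
            counts[row[label_column].strip()] += 1
    return counts

# Hypothetical two-column sample (text, label); the real files may differ.
sample = io.StringIO('"A fine read.",positive\n"Dull and slow.",negative\n')
print(count_labels(sample))

# With a real corpus file (the encoding is also an assumption):
# with open("polarity-nl-bol.com.csv", encoding="utf-8", newline="") as f:
#     print(count_labels(f))
```

If the label turns out to be stored in another column, pass its index as label_column.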