# prose
`prose` is a Go library for text processing (primarily English at the moment) that supports tokenization, part-of-speech tagging, named-entity extraction, and more. The library's functionality is split into subpackages designed for modular use.
See the [GoDoc documentation](https://godoc.org/github.com/jdkato/prose) for more information.
## Install
```console
$ go get github.com/jdkato/prose/...
```
> **NOTE**: When using some vendoring tools, such as `govendor`, you may need to include the `github.com/jdkato/prose/internal/` package in addition to the core package(s). See [#14](https://github.com/jdkato/prose/issues/14) for more information.
## Usage
### Contents
* [Tokenizing](#tokenizing-godoc)
* [Tagging](#tagging-godoc)
* [Transforming](#transforming-godoc)
* [Summarizing](#summarizing-godoc)
* [Chunking](#chunking-godoc)
* [License](#license)
### Tokenizing ([GoDoc](https://godoc.org/github.com/jdkato/prose/tokenize))
Word, sentence, and regexp tokenizers are available. Every tokenizer implements the [same interface](https://godoc.org/github.com/jdkato/prose/tokenize#ProseTokenizer), which makes it easy to customize tokenization in other parts of the library.
```go
package main

import (
	"fmt"

	"github.com/jdkato/prose/tokenize"
)

func main() {
	text := "They'll save and invest more."
	tokenizer := tokenize.NewTreebankWordTokenizer()
	// Tokens: [They 'll save and invest more .]
	for _, word := range tokenizer.Tokenize(text) {
		fmt.Println(word) // prints one token per line
	}
}
```
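Because the tokenizers share one interface, code elsewhere can accept any of them interchangeably. The following is a minimal, standard-library-only sketch of that idea, assuming the interface boils down to a single `Tokenize(text string) []string` method. The names `Tokenizer`, `WhitespaceTokenizer`, `PatternTokenizer`, and `countTokens` are hypothetical and are not part of `prose`.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Tokenizer is a local stand-in for the shared interface (assumed shape).
type Tokenizer interface {
	Tokenize(text string) []string
}

// WhitespaceTokenizer splits text on runs of whitespace.
type WhitespaceTokenizer struct{}

func (WhitespaceTokenizer) Tokenize(text string) []string {
	return strings.Fields(text)
}

// PatternTokenizer returns every non-overlapping match of a regular expression.
type PatternTokenizer struct {
	re *regexp.Regexp
}

// NewPatternTokenizer compiles pattern and panics if it is invalid.
func NewPatternTokenizer(pattern string) PatternTokenizer {
	return PatternTokenizer{re: regexp.MustCompile(pattern)}
}

func (p PatternTokenizer) Tokenize(text string) []string {
	return p.re.FindAllString(text, -1)
}

// countTokens works with any Tokenizer, which is the point of the shared interface.
func countTokens(t Tokenizer, text string) int {
	return len(t.Tokenize(text))
}

func main() {
	text := "They'll save and invest more."
	fmt.Println(countTokens(WhitespaceTokenizer{}, text))                   // 5
	fmt.Println(countTokens(NewPatternTokenizer(`\w+|[^\w\s]+`), text)) // 8
}
```

The two tokenizers disagree on the count because the pattern-based one splits off the apostrophe and the final period, which is the kind of difference a swappable tokenizer lets you control.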
### Tagging ([GoDoc](https://godoc.org/github.com/jdkato/prose/tag))
The `tag` package includes a port of TextBlob's ["fast and accurate" POS tagger](https://github.com/sloria/textblob-aptagger). Below is a comparison of its performance against [NLTK](http://www.nltk.org/)'s implementation of the same tagger on the Treebank corpus:
| Library | Accuracy | 5-Run Average (sec) |
|:--------|---------:|--------------------:|
| NLTK | 0.893 | 7.224 |
| `prose` | 0.961 | 2.538 |
(See [`scripts/test_model.py`](https://github.com/jdkato/aptag/blob/master/scripts/test_model.py) for more information.)
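For readers unfamiliar with the metric, tagging accuracy is usually the fraction of tokens whose predicted tag matches the gold-standard tag. The sketch below shows that calculation under the assumption that the table uses this token-level definition; the `accuracy` helper and the sample tags are illustrative and are not taken from the benchmark script.

```go
package main

import "fmt"

// accuracy returns the fraction of positions where predicted and gold tags agree.
// It returns 0 when the slices are empty or differ in length.
func accuracy(predicted, gold []string) float64 {
	if len(gold) == 0 || len(predicted) != len(gold) {
		return 0
	}
	correct := 0
	for i := range gold {
		if predicted[i] == gold[i] {
			correct++
		}
	}
	return float64(correct) / float64(len(gold))
}

func main() {
	// Made-up tags for illustration only.
	gold := []string{"DT", "JJ", "NN", "VBZ"}
	predicted := []string{"DT", "NN", "NN", "VBZ"}
	fmt.Printf("%.2f\n", accuracy(predicted, gold)) // 0.75
}
```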
```go
package main

import (
	"fmt"

	"github.com/jdkato/prose/tag"
	"github.com/jdkato/prose/tokenize"
)

func main() {
	text := "A fast and accurate part-of-speech tagger for Golang."
	words := tokenize.NewTreebankWordTokenizer().Tokenize(text)

	tagger := tag.NewPerceptronTagger()
	for _, tok := range tagger.Tag(words) {
		fmt.Println(tok.Text, tok.Tag)
	}
}
```
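A common next step is to filter the tagged tokens by tag. The sketch below keeps only nouns, relying on the fact that Penn Treebank noun tags (`NN`, `NNS`, `NNP`, `NNPS`) all start with `NN`. To stay self-contained it uses a hypothetical local `Token` type with the same `Text` and `Tag` fields as in the example above, and hand-written tags rather than real tagger output.

```go
package main

import (
	"fmt"
	"strings"
)

// Token is a hypothetical stand-in with the Text and Tag fields used above.
type Token struct {
	Text string
	Tag  string
}

// nouns returns the text of every token whose tag starts with "NN".
func nouns(tokens []Token) []string {
	var out []string
	for _, tok := range tokens {
		if strings.HasPrefix(tok.Tag, "NN") {
			out = append(out, tok.Text)
		}
	}
	return out
}

func main() {
	// Hand-tagged input for illustration; a real tagger may assign different tags.
	tagged := []Token{
		{"A", "DT"}, {"fast", "JJ"}, {"tagger", "NN"},
		{"for", "IN"}, {"Golang", "NNP"}, {".", "."},
	}
	fmt.Println(nouns(tagged)) // [tagger Golang]
}
```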
### Transforming ([GoDoc](https://godoc.org/github.com/jdkato/prose/transform))
The `transform` package implements a number of functions for changing the case of strings, including `Title`, `Snake`, `Pascal`, and `Camel`.
Additionally, unlike `strings.Title`, `transform.Title` adheres to common title-casing guidelines, including styles for both the [AP Stylebook](https://www.apstylebook.com/) and [The Chicago Manual of Style](http://www.chicagomanualofstyle.org/home.html). You can also add your own custom style by defining an [`IgnoreFunc`](https://godoc.org/github.com/jdkato/prose/transform#IgnoreFunc) callback.
Inspiration and test data taken from [python-titlecase](https://github.com/ppannuto/python-titlecase) and [to-title-case](https://github.com/gouch/to-title-case).
```go
package main

import (
	"fmt"
	"strings"

	"github.com/jdkato/prose/transform"
)

func main() {
	text := "the last of the mohicans"
	tc := transform.NewTitleConverter(transform.APStyle)
	fmt.Println(strings.Title(text)) // The Last Of The Mohicans
	fmt.Println(tc.Title(text))      // The Last of the Mohicans
}
```
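To illustrate how a callback can drive a custom title style, here is a simplified, standard-library-only sketch. It assumes the callback receives a word plus a flag saying whether the word is first or last in the title, and returns `true` when the word should stay lowercase. The `ignoreFunc` type, `skipSmallWords`, `titleCase`, and the word list are hypothetical; they do not reproduce the library's AP or Chicago rules.

```go
package main

import (
	"fmt"
	"strings"
)

// ignoreFunc reports whether a word should be left lowercase (hypothetical type).
type ignoreFunc func(word string, firstOrLast bool) bool

// smallWords is a tiny illustrative list, not a complete style guide.
var smallWords = map[string]bool{
	"a": true, "an": true, "and": true, "in": true,
	"of": true, "on": true, "the": true, "to": true,
}

// skipSmallWords keeps small words lowercase unless they open or close the title.
func skipSmallWords(word string, firstOrLast bool) bool {
	return !firstOrLast && smallWords[strings.ToLower(word)]
}

// titleCase capitalizes each whitespace-separated word unless ignore says otherwise.
func titleCase(text string, ignore ignoreFunc) string {
	words := strings.Fields(text)
	for i, w := range words {
		firstOrLast := i == 0 || i == len(words)-1
		if ignore(w, firstOrLast) {
			words[i] = strings.ToLower(w)
			continue
		}
		r := []rune(w) // strings.Fields never returns empty strings
		words[i] = strings.ToUpper(string(r[0])) + strings.ToLower(string(r[1:]))
	}
	return strings.Join(words, " ")
}

func main() {
	fmt.Println(titleCase("the last of the mohicans", skipSmallWords))
	// The Last of the Mohicans
}
```

Passing a different callback, for example one that never ignores anything, changes the style without touching `titleCase` itself.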
### Summarizing ([GoDoc](https://godoc.org/github.com/jdkato/prose/summarize))
The `summarize` package includes functions for computing standard readability and usage statistics. It's among the most accurate implementations available due to its reliance on legitimate tokenizers (whereas others, like [readability-score](https://github.com/DaveChild/Text-Statistics/blob/master/src/DaveChild/TextStatistics/Text.php#L308), rely on naive regular expressions).
It also includes a TL;DR algorithm for condensing text into a user-indicated number of paragraphs.
```go
package main

import (
	"fmt"

	"github.com/jdkato/prose/summarize"
)

func main() {
	doc := summarize.NewDocument("This is some interesting text.")
	fmt.Println(doc.SMOG(), doc.FleschKincaid())
}
```
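For context, both scores above are simple functions of word, sentence, and syllable counts, which is why tokenization quality matters so much. The sketch below applies the standard published formulas for the Flesch-Kincaid grade level and the SMOG grade to counts supplied by the caller. The function names are hypothetical and this is not necessarily how `summarize` computes its values.

```go
package main

import (
	"fmt"
	"math"
)

// fleschKincaidGrade applies the standard Flesch-Kincaid grade-level formula:
// 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59.
func fleschKincaidGrade(words, sentences, syllables int) float64 {
	if words == 0 || sentences == 0 {
		return 0
	}
	w, s, y := float64(words), float64(sentences), float64(syllables)
	return 0.39*(w/s) + 11.8*(y/w) - 15.59
}

// smogGrade applies the standard SMOG formula:
// 1.0430*sqrt(polysyllables*30/sentences) + 3.1291,
// where polysyllables counts words of three or more syllables.
func smogGrade(polysyllables, sentences int) float64 {
	if sentences == 0 {
		return 0
	}
	return 1.0430*math.Sqrt(float64(polysyllables)*30/float64(sentences)) + 3.1291
}

func main() {
	// Counts are made up for illustration.
	fmt.Printf("%.2f\n", fleschKincaidGrade(100, 5, 150)) // 9.91
	fmt.Printf("%.4f\n", smogGrade(12, 10))               // 9.3871
}
```

Since the formulas only see counts, two implementations can disagree purely because they split sentences or count syllables differently.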
### Chunking ([GoDoc](https://godoc.org/github.com/jdkato/prose/chunk))
The `chunk` package implements named-entity extraction over pre-tagged input, using a regular expression that describes the chunks you're looking for.
```go
package main

import (
	"fmt"

	"github.com/jdkato/prose/chunk"
	"github.com/jdkato/prose/tag"
	"github.com/jdkato/prose/tokenize"
)

func main() {
	words := tokenize.TextToWords("Go is an open source programming language created at Google.")
	regex := chunk.TreebankNamedEntities

	tagger := tag.NewPerceptronTagger()
	// Entities: [Go Google]
	for _, entity := range chunk.Chunk(tagger.Tag(words), regex) {
		fmt.Println(entity) // prints one entity per line
	}
}
```
## License
If not otherwise specified (see below), the source files are distributed under the MIT License found in the [LICENSE](https://github.com/jdkato/prose/blob/master/LICENSE) file.
Additionally, the following files contain their own license information:
- [`tag/aptag.go`](https://github.com/jdkato/prose/blob/master/tag/aptag.go): MIT © Matthew Honnibal.
- [`tokenize/punkt.go`](https://github.com/jdkato/prose/blob/master/tokenize/punkt.go): MIT © Eric Bower.
- [`tokenize/pragmatic.go`](https://github.com/jdkato/prose/blob/master/tokenize/pragmatic.go): MIT © Kevin S. Dias.