# Natural Language Toolkit: Interface to TnT
#
# Author: Dan Garrette <dhgarrette@gmail.com>
#
# URL: <http://www.nltk.org/>
# For license information, see LICENSE.TXT
import os
import tempfile
from nltk import tokenize
from nltk.internals import find_binary
_tnt_bin = None
def config_tnt(bin=None, verbose=False):
    """
    Configure the location of the TnT executable.

    @param bin: Path to the TnT executable
    @type bin: C{str}
    @param verbose: Print diagnostic messages while searching
    @type verbose: C{bool}
    """
    # Cache the result in the module-level _tnt_bin so the search
    # only runs once.  (The original try/except UnboundLocalError
    # never updated the global; 'global' is required for that.)
    global _tnt_bin
    if _tnt_bin is not None:
        return _tnt_bin

    # Find the tnt binary.
    _tnt_bin = find_binary('tnt', bin,
                           searchpath=tnt_search, env_vars=['TNTHOME'],
                           url='http://www.coli.uni-saarland.de/~thorsten/tnt/',
                           verbose=verbose)
    return _tnt_bin
tnt_search = ['.',
              '/usr/lib/tnt',
              '/usr/local/bin',
              '/usr/local/bin/tnt']
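# The lookup above delegates to nltk.internals.find_binary.  As a rough
# illustration of what a search-path lookup does (a simplified sketch
# under assumed behavior, not NLTK's actual implementation;
# find_in_searchpath is a hypothetical name):

```python
import os

def find_in_searchpath(name, searchpath):
    """Return the first existing executable file named `name` found in
    the directories of `searchpath`, or None if no match exists.
    (Hypothetical helper, for illustration only.)"""
    for directory in searchpath:
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None
```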
def pos_tag(sentence, model_path=None, verbose=False):
    """
    Use TnT to tag a sentence.

    @param sentence: Input sentence to tag
    @type sentence: C{str}
    @param model_path: Path to a trained TnT model (defaults to the
        'wsj' model shipped alongside the binary)
    @type model_path: C{str}
    @return: C{list} of C{(word, tag)} tuples for the sentence
    """
    tnt_bin = config_tnt(verbose=verbose)
    if not model_path:
        # tnt_bin ends in '/tnt'; stripping the last 4 characters
        # yields the installation directory.
        model_path = '%s/models/wsj' % tnt_bin[:-4]
    # Scratch files go in the temp directory (the original wrote the
    # input file beside the binary, which is often not writable).
    input_file = '%s/tnt_in.txt' % tempfile.gettempdir()
    output_file = '%s/tnt_out.txt' % tempfile.gettempdir()

    execute_string = '%s %s %s > %s'
    if not verbose:
        execute_string += ' 2> %s/tnt.out' % tempfile.gettempdir()

    tagged_words = []
    f = None
    try:
        if verbose:
            print('Begin input file creation')
            print('input_file=%s' % input_file)

        # TnT expects one token per line, with a blank line marking the
        # end of the sentence.
        f = open(input_file, 'w')
        words = tokenize.WhitespaceTokenizer().tokenize(sentence)
        for word in words:
            f.write('%s\n' % word)
        f.write('\n')
        f.close()

        if verbose:
            print('End input file creation')
            print('tnt_bin=%s' % tnt_bin)
            print('model_path=%s' % model_path)
            print('output_file=%s' % output_file)

        execute_string = execute_string % (tnt_bin, model_path, input_file, output_file)
        if verbose:
            print('execute_string=%s' % execute_string)
            print('Begin tagging')
        tnt_exit = os.system(execute_string)
        if verbose:
            print('End tagging (exit code=%s)' % tnt_exit)

        f = open(output_file, 'r')
        lines = f.readlines()
        f.close()

        # TnT writes two whitespace-separated columns (word, tag);
        # lines starting with '%%' are comments.
        tokenizer = tokenize.WhitespaceTokenizer()
        for line in lines:
            if not line.startswith('%%'):
                tokens = tokenizer.tokenize(line.strip())
                if len(tokens) == 2:
                    tagged_words.append((tokens[0], tokens[1]))

        if verbose:
            for tag in tagged_words:
                print(tag)
    finally:
        if f:
            f.close()

    return tagged_words
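# The output-parsing loop above can be exercised on its own.  A minimal,
# standalone sketch of the same two-column format (parse_tnt_output is a
# hypothetical name; TnT's real output can carry extra columns when
# ambiguity output is enabled):

```python
def parse_tnt_output(lines):
    """Convert TnT's two-column output into (word, tag) pairs.

    Lines beginning with '%%' are comments and are skipped, as are
    blank lines and any line that does not split into exactly two
    whitespace-separated fields."""
    tagged = []
    for line in lines:
        if line.startswith('%%'):
            continue
        tokens = line.split()
        if len(tokens) == 2:
            tagged.append((tokens[0], tokens[1]))
    return tagged
```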
if __name__ == '__main__':
    # train(True)
    pos_tag('John sees Mary', verbose=True)