.TH POCKETSPHINX 1 "2022-09-27"
.SH NAME
pocketsphinx \- Run speech recognition on audio data
.SH SYNOPSIS
.B pocketsphinx
[ \fIoptions\fR... ]
[ \fBlive\fR |
\fBsingle\fR |
\fBalign\fR |
\fBconfig\fR |
\fBhelp\fR |
\fBsoxflags\fR ]
\fIINPUTS\fR...
.SH DESCRIPTION
.PP
The ‘\f[CR]pocketsphinx\fP’ command-line program reads single-channel
16-bit PCM audio from one or more input files (or ‘\f[CR]-\fP’ to read from
standard input), and attempts to recognize speech in it using the
default acoustic and language model. The input files can be raw audio,
WAV, or NIST Sphere files, though some of these may not be recognized
properly.  It accepts a large number of options which you probably
don't care about, and a \fIcommand\fP which defaults to
‘\f[CR]live\fP’. The commands are as follows:
.TP
.B help
Print usage and help text, including a long list of those options you
don't care about.
.TP
.B config
Dump configuration as JSON to standard output (can be loaded with the
‘\f[CR]-config\fP’ option). 
.TP
.B live
Detect speech segments in input files, run recognition on them (using
those options you don't care about), and write the results to standard
output in line-delimited JSON. I realize this isn't the prettiest
format, but it sure beats XML. Each line contains a JSON object with
these fields, which have short names to make the lines more readable:
.IP
"b": Start time in seconds, from the beginning of the stream
.IP
"d": Duration in seconds
.IP
"p": Estimated probability of the recognition result, i.e. a number between
0 and 1 which may be used as a confidence score
.IP
"t": Full text of recognition result
.IP
"w": List of segments (usually words), each of which in turn contains the
‘\f[CR]b\fP’, ‘\f[CR]d\fP’, ‘\f[CR]p\fP’, and ‘\f[CR]t\fP’ fields, for
start, duration, probability, and the text of the word. When results are
hierarchical (for instance the phone alignments produced by the
\fBalign\fP command), a segment also contains its own ‘\f[CR]w\fP’ field.
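.IP
As a rough sketch of how a script might consume this output, the Python
below parses one line and turns each segment's start and duration into
a start and end time. The sample values and the helper name
\fIword_spans\fP are made up for illustration and are not part of
pocketsphinx; only the field names come from the description above.
.EX
```python
import json

# Made-up result, shaped like the fields described above.
result = {
    "b": 0.5, "d": 1.2, "p": 0.9, "t": "go forward",
    "w": [
        {"b": 0.5, "d": 0.4, "p": 0.95, "t": "go"},
        {"b": 0.9, "d": 0.8, "p": 0.92, "t": "forward"},
    ],
}
# One JSON object per line, as in line-delimited JSON.
sample_lines = [json.dumps(result)]


def word_spans(line):
    """Return (text, start, end) for each segment in one output line."""
    obj = json.loads(line)
    # "d" is a duration, so the end time is start plus duration.
    return [(w["t"], w["b"], w["b"] + w["d"]) for w in obj.get("w", [])]


for line in sample_lines:
    for text, start, end in word_spans(line):
        print(f"{start:.2f}-{end:.2f} {text}")
```
.EE
.IP
In real use the loop would presumably iterate over standard input
instead of the sample list, with the recognizer's output piped in.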
.TP
.B single
Recognize the input as a single utterance, and write a JSON object in the same format described above.
.TP
.B align
Align a single input file (or ‘\f[CR]-\fP’ for standard input) to a word
sequence, and write a JSON object in the same format described above.
The first positional argument is the input, and all subsequent ones
are concatenated to make the text, to avoid surprises if you forget to
quote it.  You are responsible for normalizing the text to remove
punctuation, uppercase, centipedes, etc. For example:

.EX
    pocketsphinx align goforward.wav "go forward ten meters"
.EE

By default, only word-level alignment is done.  To get phone
alignments, pass ‘\f[CR]-phone_align yes\fP’ in the flags, e.g.:

.EX    
    pocketsphinx -phone_align yes align audio.wav $text
.EE        

This produces output that is not particularly readable, but you can use
.B jq
(https://stedolan.github.io/jq/) to clean it up.  For example,
you can get just the word names and start times like this:

.EX    
    pocketsphinx align audio.wav $text | jq '.w[]|[.t,.b]'
.EE        

Or you could get the phone names and durations like this:

.EX    
    pocketsphinx -phone_align yes align audio.wav $text | jq '.w[]|.w[]|[.t,.d]'
.EE        

There are many, many other possibilities, of course.
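.IP
If jq is not at hand, roughly the same extraction can be done in
Python. This is only a sketch: the sample object, the phone labels and
the helper name \fIphone_durations\fP are assumptions made for
illustration, with the nesting inferred from the jq filter shown above.
.EX
```python
import json

# Made-up alignment result with phone segments nested under each word.
aligned = {
    "b": 0.0, "d": 0.5, "p": 1.0, "t": "go",
    "w": [
        {"b": 0.0, "d": 0.5, "p": 1.0, "t": "go",
         "w": [
             {"b": 0.0, "d": 0.2, "p": 1.0, "t": "G"},
             {"b": 0.2, "d": 0.3, "p": 1.0, "t": "OW"},
         ]},
    ],
}


def phone_durations(line):
    """Return [text, duration] pairs for every phone in one JSON line."""
    obj = json.loads(line)
    return [
        [phone["t"], phone["d"]]
        for word in obj.get("w", [])
        for phone in word.get("w", [])
    ]


print(phone_durations(json.dumps(aligned)))
```
.EE
.IP
Words without a nested list simply contribute nothing, so the same
helper returns an empty list for word-level output.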
.TP
.B soxflags
Print the arguments to ‘\f[CR]sox\fP’ which will create the appropriate
input format. Note that because the ‘\f[CR]sox\fP’ command line is
slightly quirky, these must always come \fIafter\fP the filename or
‘\f[CR]-d\fP’ (which tells ‘\f[CR]sox\fP’ to read from the
microphone). You can run live recognition like this:

.EX
    sox -d $(pocketsphinx soxflags) | pocketsphinx -
.EE

or decode from a file named "audio.mp3" like this:

.EX
    sox audio.mp3 $(pocketsphinx soxflags) | pocketsphinx -
.EE
.PP
By default only errors are printed to standard error, but if you want
more information you can pass ‘\f[CR]-loglevel INFO\fP’. Partial
results are not printed; maybe they will be in the future, but don't
hold your breath. Forced alignment is available through the
\fBalign\fP command described above.
.SH OPTIONS
.\" ### ARGUMENTS ###
.SH AUTHOR
Written by numerous people at CMU from 1994 onwards.  This manual page
by David Huggins-Daines <dhdaines@gmail.com>.
.SH COPYRIGHT
Copyright \(co 1994-2016 Carnegie Mellon University.  See the file
\fILICENSE\fR included with this package for more information.
.br
.SH "SEE ALSO"
.BR pocketsphinx_batch (1),
.BR sphinx_fe (1).
.br