.TH POCKETSPHINX 1 "2022-09-27"
.SH NAME
pocketsphinx \- Run speech recognition on audio data
.SH SYNOPSIS
.B pocketsphinx
[ \fIoptions\fR... ]
[ \fBlive\fR |
\fBsingle\fR |
\fBalign\fR |
\fBconfig\fR |
\fBhelp\fR |
\fBsoxflags\fR ]
\fIINPUTS\fR...
.SH DESCRIPTION
.PP
The ‘\f[CR]pocketsphinx\fP’ command-line program reads single-channel
16-bit PCM audio from one or more input files (or ‘\f[CR]-\fP’ to read from
standard input), and attempts to recognize speech in it using the
default acoustic and language model. The input files can be raw audio,
WAV, or NIST Sphere files, though some of these may not be recognized
properly. It accepts a large number of options which you probably
don't care about, and a \fIcommand\fP which defaults to
‘\f[CR]live\fP’. The commands are as follows:
.TP
.B help
Print a long list of those options you don't care about.
.TP
.B config
Dump configuration as JSON to standard output (can be loaded with the
‘\f[CR]-config\fP’ option).
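.IP
As a rough sketch of that round trip (the file names
‘\f[CR]config.json\fP’ and ‘\f[CR]audio.wav\fP’ are placeholders, not
files shipped with the package), you could save the configuration and
then reuse it for a later run:
.EX
pocketsphinx config > config.json
pocketsphinx -config config.json single audio.wav
.EE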
.TP
.B live
Detect speech segments in input files, run recognition on them (using
those options you don't care about), and write the results to standard
output in line-delimited JSON. I realize this isn't the prettiest
format, but it sure beats XML. Each line contains a JSON object with
these fields, which have short names to make the lines more readable:
.IP
"b": Start time in seconds, from the beginning of the stream
.IP
"d": Duration in seconds
.IP
"p": Estimated probability of the recognition result, i.e. a number between
0 and 1 which may be used as a confidence score
.IP
"t": Full text of recognition result
.IP
"w": List of segments (usually words), each of which in turn contains the
‘\f[CR]b\fP’, ‘\f[CR]d\fP’, ‘\f[CR]p\fP’, and ‘\f[CR]t\fP’ fields, for
start, duration, probability, and the text of the word. In the future we
may also support hierarchical results, in which case ‘\f[CR]w\fP’ could
also be present inside each segment.
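.IP
To give an idea of how these fields might be used, here is a sketch
that assumes
.B jq
is installed and that ‘\f[CR]audio.wav\fP’ is a placeholder input file;
the 0.5 threshold is arbitrary. It prints the start time, duration,
and text of each result whose estimated probability is above the
threshold:
.EX
pocketsphinx live audio.wav | jq -c 'select(.p > 0.5) | [.b, .d, .t]'
.EE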
.TP
.B single
Recognize the input as a single utterance, and write a JSON object in the same format described above.
.TP
.B align
Align a single input file (or ‘\f[CR]-\fP’ for standard input) to a word
sequence, and write a JSON object in the same format described above.
The first positional argument is the input, and all subsequent ones
are concatenated to make the text, to avoid surprises if you forget to
quote it. You are responsible for normalizing the text to remove
punctuation, uppercase, centipedes, etc. For example:
.EX
pocketsphinx align goforward.wav "go forward ten meters"
.EE
By default, only word-level alignment is done. To get phone
alignments, pass ‘\f[CR]-phone_align yes\fP’ in the flags, e.g.:
.EX
pocketsphinx -phone_align yes align audio.wav $text
.EE
This produces output that is not particularly readable, but you can use
.B jq
(https://stedolan.github.io/jq/) to clean it up. For example,
you can get just the word names and start times like this:
.EX
pocketsphinx align audio.wav $text | jq '.w[]|[.t,.b]'
.EE
Or you could get the phone names and durations like this:
.EX
pocketsphinx -phone_align yes align audio.wav $text | jq '.w[]|.w[]|[.t,.d]'
.EE
There are many, many other possibilities, of course.
.TP
.B help
Print a usage and help text with a list of possible arguments.
.TP
.B soxflags
Return arguments to ‘\f[CR]sox\fP’ which will create the appropriate
input format. Note that because the ‘\f[CR]sox\fP’ command-line is
slightly quirky these must always come \fIafter\fP the filename or
‘\f[CR]-d\fP’ (which tells ‘\f[CR]sox\fP’ to read from the
microphone). You can run live recognition like this:
.EX
sox -d $(pocketsphinx soxflags) | pocketsphinx -
.EE
or decode from a file named "audio.mp3" like this:
.EX
sox audio.mp3 $(pocketsphinx soxflags) | pocketsphinx -
.EE
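Since the results are line-delimited JSON, a sketch like the following
(assuming
.B jq
is installed) should print only the recognized text of each segment
when recognizing from the microphone:
.EX
sox -d $(pocketsphinx soxflags) | pocketsphinx - | jq -r '.t'
.EE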
.PP
By default only errors are printed to standard error, but if you want
more information you can pass ‘\f[CR]-loglevel INFO\fP’. Partial
results are not printed; maybe they will be in the future, but don't
hold your breath. Force-alignment is available through the
‘\f[CR]align\fP’ command.
.SH OPTIONS
.\" ### ARGUMENTS ###
.SH AUTHOR
Written by numerous people at CMU from 1994 onwards. This manual page
by David Huggins-Daines <dhdaines@gmail.com>.
.SH COPYRIGHT
Copyright \(co 1994-2016 Carnegie Mellon University. See the file
\fILICENSE\fR included with this package for more information.
.br
.SH "SEE ALSO"
.BR pocketsphinx_batch (1),
.BR sphinx_fe (1).
.br