Sphinx-4 is a state-of-the-art speech recognition system written entirely in the Java(TM) programming language. It was created via a joint collaboration between the Sphinx group at Carnegie Mellon University, Sun Microsystems Laboratories, Mitsubishi Electric Research Labs (MERL), and Hewlett Packard (HP), with contributions from the University of California at Santa Cruz (UCSC) and the Massachusetts Institute of Technology (MIT).
Sphinx-4 started out as a port of Sphinx-3 to the Java programming
language, but evolved into a recognizer designed to be much more flexible than
Sphinx-3, thus becoming an excellent platform for speech research.
Live mode and batch mode speech recognizers, capable of recognizing discrete and continuous speech.
Generalized pluggable front end architecture. Includes pluggable implementations of preemphasis, Hamming window, FFT, Mel frequency filter bank, discrete cosine transform, cepstral mean normalization, and feature extraction of cepstra, delta cepstra, double delta cepstra features.
Generalized pluggable language model architecture. Includes pluggable language model support for ASCII and binary versions of unigram, bigram, trigram, Java Speech API Grammar Format (JSGF), and ARPA-format FST grammars.
Generalized acoustic model architecture. Includes pluggable support for Sphinx-3 acoustic models.
Generalized search management. Includes pluggable support for breadth first and word pruning searches.
Utilities for post-processing recognition results, including obtaining confidence scores, generating lattices and embedding ECMAScript into JSGF tags.
Standalone tools. Includes tools for displaying waveforms and spectrograms and generating features from audio.
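As an illustration of two of the front end stages listed above, here is a minimal Java sketch of preemphasis and Hamming windowing. The class and method names are hypothetical and are not the Sphinx-4 front end API. The formulas are the conventional textbook ones, and the preemphasis factor is a caller-supplied parameter, not a value taken from a Sphinx-4 configuration.

```java
/** Illustrative front end stages; not the Sphinx-4 implementation. */
final class FrontEndSketch {

    /** Preemphasis: y[0] = x[0], y[n] = x[n] - factor * x[n - 1]. */
    static double[] preemphasize(double[] x, double factor) {
        double[] y = new double[x.length];
        for (int n = 0; n < x.length; n++) {
            y[n] = (n == 0) ? x[0] : x[n] - factor * x[n - 1];
        }
        return y;
    }

    /** Applies a Hamming window, w[i] = 0.54 - 0.46 * cos(2 * pi * i / (N - 1)). */
    static double[] hamming(double[] frame) {
        int n = frame.length;
        double[] out = new double[n];
        for (int i = 0; i < n; i++) {
            // A single-sample frame is passed through unchanged.
            double w = (n == 1) ? 1.0 : 0.54 - 0.46 * Math.cos(2.0 * Math.PI * i / (n - 1));
            out[i] = frame[i] * w;
        }
        return out;
    }
}
```

In a pluggable design, each stage would sit behind a common interface so that stages can be chained or swapped; the two static methods here only show the arithmetic.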
Sphinx-4 is a very flexible system capable of performing many different types of recognition tasks. As such, it is difficult to characterize the performance and accuracy of Sphinx-4 with just a few simple numbers such as speed and accuracy. Instead, we regularly run regression tests on Sphinx-4 to determine how it performs under a variety of tasks. These tasks and their latest results are as follows (each task is progressively more difficult than the previous task):
The following table compares the performance of Sphinx 3.3 with Sphinx-4.
Test | S3.3 WER | S4 WER | S3.3 RT | S4 RT (1) | S4 RT (2) | Vocabulary Size | Language Model |
TI46 | 1.217 | 0.168 | 0.14 | 0.03 | 0.02 | 11 | isolated digits recognition |
TIDIGITS | 0.661 | 0.549 | 0.16 | 0.07 | 0.05 | 11 | continuous digits |
AN4 | 1.300 | 1.192 | 0.38 | 0.25 | 0.20 | 79 | trigram |
RM1 | 2.746 | 2.88 | 0.50 | 0.50 | 0.41 | 1,000 | trigram |
WSJ5K | 7.323 | 6.97 | 1.36 | 1.22 | 0.96 | 5,000 | trigram |
HUB4 | 18.845 | 18.756 | 3.06 | ~4.4 | 3.95 | 60,000 | trigram |
Note that performance work on the HUB4 test is not complete.
This data was collected on a dual CPU UltraSPARC(R)-III running at 1015 MHz with 2G of memory.
Sphinx-4 has been built and tested on the Solaris(TM) Operating Environment, Mac OS X, Linux and Win32 operating systems. Running, building, and testing Sphinx-4 requires additional software. Before you start, you will need the following software available on your machine.
Sphinx-4 has two packages available for download:
See this FAQ question to help determine whether you should get the binary or the source distribution.
After you have downloaded the distribution, unjar the ZIP files using the jar command, which is in the bin directory of your Java installation:
jar xvf sphinx4-{version}-bin.zip
jar xvf sphinx4-{version}-src.zip
For both downloads, a directory called "sphinx4-{version}" will be created.
There are also the RM1 acoustic model, and HUB4 acoustic and language models, available for download at the same location on SourceForge. Download them only if you want to run the regression tests for RM1 and HUB4.
If you want to be able to get the latest updates from the CVS source tree, you should retrieve the code from the CVS source tree on SourceForge. The Sphinx-4 code is located at sourceforge.net as open source. Please follow the instructions below to retrieve it.
% export CVS_RSH=ssh
% cvs -z3 -d:ext:developername@cvs.sourceforge.net:/cvsroot/cmusphinx co sphinx4
where developername is your SourceForge developer name.
% cvs -d:pserver:anonymous@cvs.sourceforge.net:/cvsroot/cmusphinx login
% cvs -z3 -d:pserver:anonymous@cvs.sourceforge.net:/cvsroot/cmusphinx co sphinx4
Since the sphinx4-{version}-bin.zip distribution does not contain the source code, you must download sphinx4-{version}-src.zip, or retrieve the code from SourceForge using CVS, in order to build from the sources. The software required for building Sphinx-4 is listed in the Required Software section.
Set up JSAPI 1.0
Before you build Sphinx-4, it is important to set up your environment to support the Java Speech API (JSAPI), because a number of tests and demos rely on having JSAPI installed.
To build Sphinx-4, at the command prompt change to the directory where you
installed Sphinx-4 (usually, a simple "cd sphinx4" will do). Set your
and PATH
environment variables as described above. Then type the following:
ant
This executes the Apache Ant command
to build the Sphinx-4 classes under the bld
directory, the jar
files under the lib
directory, and the demo jar files under the
To delete all the output from the build to give you a fresh start:
ant clean
The javadocs have already been built if you downloaded the sphinx4-{version}-bin.zip. In order to build the javadocs yourself, you must download the sphinx4-{version}-src.zip distribution instead. To build the javadocs, go to the top level directory ("sphinx4-{version}"), and type:
ant javadoc
This will build javadocs from public classes, displaying only the public methods and fields. In general, this is all the information you will need. If you need more details, such as private or protected classes, you can generate the corresponding javadoc by doing, for example:
ant -Daccess=private javadoc
Sphinx-4 contains a number of demo programs. If you downloaded the binary distribution (sphinx4-{version}-bin.zip), the JAR files of the demos are already built, so you can just run them directly. However, if you downloaded the source distribution (sphinx4-{version}-src.zip or via CVS), you need to build the demos. Click on the links below for instructions on how to build and run the demos.
There is also a live-mode test program (this link only works if you downloaded the source distribution). It is available in the sphinx4-{version}-src.zip file but not in the sphinx4-{version}-bin.zip file.
The AudioTool is a visual tool that records and displays the waveform and spectrogram of an audio signal. It is available in both the binary and source releases.
The document Sphinx-4 Configuration Management describes, in detail, how to configure a Sphinx-4 system.
The document Sphinx-4
Instrumentation describes, in detail, how to use the instrumentation
facilities of the Sphinx-4 system.
Sphinx-4 contains a number of regression tests using common speech databases. Again, you have to download the source distribution or retrieve the source tree using CVS in order to get the regression tests directory. The regression tests we have are:
Before you run any of the tests, make sure that you have built Sphinx-4 already. To do so, go to the top level and type:
ant
You also need to make sure you have the appropriate acoustic model(s) installed. More details below.
The Sphinx-4 regression tests have different directories for the different tasks. The directory sphinx4/tests/performance contains directories named ti46, tidigits, an4, rm1, hub4, and some other tests. Each of these directories contains a build.xml with targets specific to the particular task. The build.xml allows you to run a number of different tests. Type:
ant -projecthelp
to list a help text with the possible targets.
The TIDIGITS models are already included as part of the distribution. Therefore, you do not need to download them separately. You must have the TI46 test data, available from the LDC TI46 website.
You need to edit the batch file called ti46.batch, located in the tests/performance/ti46 directory, so that it matches where you stored the TI46 test files. Refer to the section Batch Files for details about the format of batch files.
To run the tests:
% cd sphinx4/tests/performance/ti46
% ant -projecthelp # to see a list of possible targets
% ant ti46_wordlist
The TIDIGITS models are already included as part of the distribution. Therefore, you do not need to download them separately.
You must have the TIDIGITS test data, available from the LDC TIDIGITS website.
You need to edit the batch file called tidigits.batch, located in the tests/performance/tidigits directory, so that it matches where you stored the TIDIGITS test files. Refer to the section Batch Files for details about the format of batch files.
To run the tests:
% cd sphinx4/tests/performance/tidigits
% ant -projecthelp # to see a list of possible targets
% ant tidigits_flat_unigram
The Wall Street Journal (WSJ) models are already included as part of the distribution. Therefore, you do not need to download them separately.
Download the big endian raw audio format of the AN4 Database. Unpack it at a directory of your choice:
% gunzip an4_raw.bigendian.tar.gz
% tar -xvf an4_raw.bigendian.tar
Then update the following batch files (located in the tests/performance/an4 directory), so that they match up with where you unpacked the AN4 data. You probably just need to replace all instances of the string "/lab/speech/sphinx4/data" inside these batch files. Please refer to the Batch Files section for details about batch files:
After you have updated the batch files, you can run the tests by:
% cd sphinx4/tests/performance/an4
% ant -projecthelp # to see a list of possible targets
% ant an4_words_unigram
Make sure that you have downloaded the binary RM1 model file, called RM1_13dCep_16k_40mel_130Hz_6800Hz.jar, located in the sphinx4 package on the downloads page. Then, in the build file for the RM1 tests, change the corresponding property to point to the location of your RM1_13dCep_16k_40mel_130Hz_6800Hz.jar.
You must have the RM1 test data, available from the LDC RM1 website.
You also need to prepare a batch file called rm1.batch by following the instructions in the Batch Files section. There is already one in the RM1 test directory, but it will not work for you, since the paths to test files will not match your setup.
To run the tests:
% cd sphinx4/tests/performance/rm1
% ant -projecthelp # to see a list of possible targets
% ant rm1_bigram
You must have the HUB4 test data, available from the LDC HUB4 website.
You must download the binary HUB4 model file, called HUB4_8gau_13dCep_16k_40mel_133Hz_6855Hz.jar, and the binary HUB4 trigram language model, called HUB4_trigram_lm.zip, both located in the sphinx4 package on the downloads page. Unpack the trigram language model file by:
jar xvf HUB4_trigram_lm.zip
This extracts the trigram model file. Then, in the build file for the HUB4 tests, sphinx4/tests/performance/hub4/build.xml, change the corresponding property to point to the location of your HUB4_8gau_13dCep_16k_40mel_133Hz_6855Hz.jar. In the configuration file, tests/performance/hub4/hub4.config.xml, change the 'location' of the 'trigramModel' component to where your trigram model file is located.
You also need to prepare a batch file, under the name currently used in the build.xml file, by following the instructions in the Batch Files section.
To run the test:
% cd sphinx4/tests/performance/hub4
% ant -projecthelp # to see a list of possible targets
% ant hub4_trigram
Each batch mode regression test consists of the following components:
To learn about how to setup a regression test, take a look at the walkthrough of setting up the AN4 tests.
Batch files are used in batch mode regression tests. A batch file is a text file that lists the files to be processed, with the transcription for each file. The format is as shown below: one line for each file, where the first element in a line is the file name, which can be an absolute or relative path and includes the file extension, followed by the words that make up the transcription for the audio. Sphinx-4 uses the transcription provided here to compute the system's accuracy after each sentence is processed. Processing an utterance produces a hypothesis for what was said. This hypothesis is aligned against the reference transcription, and a summary of the results is reported.
/lab/speech/sphinx4/data/tidigits/test/raw16k/man/man.ah.24z982za.raw two four zero nine eight two zero
/lab/speech/sphinx4/data/tidigits/test/raw16k/man/man.ah.25896o4a.raw two five eight nine six oh four
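The line format above can be read with a few lines of Java. This is a minimal sketch, not the Sphinx-4 batch reader: the BatchLine class is hypothetical, and it assumes the file name contains no whitespace, so that the first whitespace-delimited token is the path and everything after it is the transcription.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative parser for one batch-file line; not part of the Sphinx-4 API. */
final class BatchLine {
    final String audioPath;         // first whitespace-delimited token
    final List<String> transcript;  // remaining tokens, possibly empty

    BatchLine(String audioPath, List<String> transcript) {
        this.audioPath = audioPath;
        this.transcript = transcript;
    }

    /** Splits a line into the audio file name and the reference words. */
    static BatchLine parse(String line) {
        String[] tokens = line.trim().split("\\s+");
        if (tokens[0].isEmpty()) {
            throw new IllegalArgumentException("empty batch line");
        }
        List<String> words =
                new ArrayList<>(Arrays.asList(tokens).subList(1, tokens.length));
        return new BatchLine(tokens[0], words);
    }
}
```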
An example batch file is tidigits.batch
(this link only works if you downloaded the source distribution).
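To make the accuracy computation concrete, the following sketch scores a hypothesis against a reference transcription using a word-level edit distance, the usual basis for word error rate (WER). It is an assumption that this matches the spirit of the Sphinx-4 aligner; the class name is hypothetical and the actual Sphinx-4 implementation may differ in detail.

```java
/** Illustrative word error rate computation; not the Sphinx-4 aligner. */
final class WordErrorRate {

    /** Minimum number of substitutions, deletions, and insertions turning ref into hyp. */
    static int editDistance(String[] ref, String[] hyp) {
        int[] prev = new int[hyp.length + 1];
        int[] cur = new int[hyp.length + 1];
        for (int j = 0; j <= hyp.length; j++) {
            prev[j] = j; // aligning an empty reference needs j insertions
        }
        for (int i = 1; i <= ref.length; i++) {
            cur[0] = i; // aligning an empty hypothesis needs i deletions
            for (int j = 1; j <= hyp.length; j++) {
                int sub = prev[j - 1] + (ref[i - 1].equals(hyp[j - 1]) ? 0 : 1);
                int del = prev[j] + 1;
                int ins = cur[j - 1] + 1;
                cur[j] = Math.min(sub, Math.min(del, ins));
            }
            int[] swap = prev;
            prev = cur;
            cur = swap;
        }
        return prev[hyp.length];
    }

    /** WER in percent: errors divided by the number of reference words. */
    static double wer(String reference, String hypothesis) {
        String[] ref = words(reference);
        String[] hyp = words(hypothesis);
        if (ref.length == 0) {
            throw new IllegalArgumentException("empty reference");
        }
        return 100.0 * editDistance(ref, hyp) / ref.length;
    }

    private static String[] words(String s) {
        String t = s.trim();
        return t.isEmpty() ? new String[0] : t.split("\\s+");
    }
}
```

For example, scoring the hypothesis "two four oh nine eight zero" against the reference "two four zero nine eight two zero" counts one substitution and one deletion over seven reference words.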
The audio files used by Sphinx-4 can contain raw audio or cepstra, which is a form of encoded speech. The Java platform has support for other data formats, such as MS WAV or Sun's au, but, provided as is, Sphinx-4 can handle only raw data.
The audio defaults to 2 bytes/sample, at 16000 samples per second. The files are expected to be binary files without a header. The Java platform always assumes big endian byte order. These defaults can be changed; for example, the byte order or the sampling rate can be changed.
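As a sketch of what these defaults mean in practice, the Java below turns headerless bytes into signed 16-bit samples using the standard java.nio classes. The RawAudio class is hypothetical and is not how Sphinx-4 reads audio; it only illustrates the sample size, byte order, and sampling rate described above.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Illustrative decoder for headerless 16-bit audio; not the Sphinx-4 data source. */
final class RawAudio {

    /** Interprets raw bytes as signed 2-byte samples in the given byte order. */
    static short[] decode(byte[] raw, ByteOrder order) {
        if (raw.length % 2 != 0) {
            throw new IllegalArgumentException("odd byte count for 2 bytes/sample audio");
        }
        short[] samples = new short[raw.length / 2];
        ByteBuffer.wrap(raw).order(order).asShortBuffer().get(samples);
        return samples;
    }

    /** Duration in seconds, e.g. at the default rate of 16000 samples per second. */
    static double durationSeconds(int sampleCount, int sampleRate) {
        return (double) sampleCount / sampleRate;
    }
}
```

Passing ByteOrder.BIG_ENDIAN corresponds to the default described above; ByteOrder.LITTLE_ENDIAN covers data recorded in the other byte order.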
The input can also be cepstra. The cepstral file starts with a 4 byte integer containing the number of floats that follow. The floats that follow are concatenated 13 dimensional vectors. Since the first piece of information is the number of floats, the expected total file size can be computed. If a comparison with the actual size fails, either the byte order has to be reversed or the file is corrupted. Importantly, this means that the byte order can be detected automatically.
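The size check described above can be sketched in Java as follows. The CepstraFile class is hypothetical, and it assumes 4-byte floats and that the whole file has been read into memory; it is not the Sphinx-4 loader.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Illustrative cepstra reader with byte order detection; not the Sphinx-4 loader. */
final class CepstraFile {
    static final int DIMENSION = 13; // floats per cepstral vector

    /** Returns the byte order under which the header count matches the file size. */
    static ByteOrder detectOrder(byte[] file) {
        if (file.length < 4) {
            throw new IllegalArgumentException("missing 4 byte header");
        }
        long payloadBytes = file.length - 4L;
        int bigEndianCount = ByteBuffer.wrap(file).order(ByteOrder.BIG_ENDIAN).getInt(0);
        if (bigEndianCount >= 0 && 4L * bigEndianCount == payloadBytes) {
            return ByteOrder.BIG_ENDIAN;
        }
        // Reverse the header bytes and compare again.
        int littleEndianCount = Integer.reverseBytes(bigEndianCount);
        if (littleEndianCount >= 0 && 4L * littleEndianCount == payloadBytes) {
            return ByteOrder.LITTLE_ENDIAN;
        }
        throw new IllegalArgumentException("size mismatch in both byte orders: corrupted file");
    }

    /** Reads all vectors, one row of DIMENSION floats per frame. */
    static float[][] read(byte[] file) {
        ByteBuffer buf = ByteBuffer.wrap(file).order(detectOrder(file));
        int count = buf.getInt();
        if (count % DIMENSION != 0) {
            throw new IllegalArgumentException("float count is not a multiple of " + DIMENSION);
        }
        float[][] frames = new float[count / DIMENSION][DIMENSION];
        for (float[] frame : frames) {
            for (int d = 0; d < DIMENSION; d++) {
                frame[d] = buf.getFloat();
            }
        }
        return frames;
    }
}
```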
Walkthrough of Setting up the AN4 Tests
To illustrate the process of setting up a regression test, let's use AN4, an existing test, as an example. Use the following steps to create the AN4 tests.
For example, the AN4 tests reside in tests/performance/an4. Since the AN4 test data already comes in raw audio format, no conversion is necessary. However, other test databases might require conversion to raw audio. For example, the TIDIGITS test files are in SPHERE format, so they must be converted to raw audio format before they can be read by the Sphinx-4 front end. This is usually accomplished by using the program sox.
looks like:
/lab/speech/sphinx4/data/an4/an4_clstk/fash/an251-fash-b.raw yes
/lab/speech/sphinx4/data/an4/an4_clstk/fash/an253-fash-b.raw go
/lab/speech/sphinx4/data/an4/an4_clstk/fash/an254-fash-b.raw yes
/lab/speech/sphinx4/data/an4/an4_clstk/fash/an255-fash-b.raw u m n y h six
...
All batch files should reside in the test directory, in this case
Typing ant at the top level directory will create the JAR file for the WSJ model. The JAR file should be included in the classpath of the application you are deploying. In this case, the WSJ JAR file (lib/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.jar) is included in the java command line inside the build.xml run file. We also need to specify in the config file (see the next item below) the acoustic model class we are using, which in this case is edu.cmu.sphinx.model.acoustic.WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz. The dictionary is also specified in the config file using the resource mechanism of Sphinx-4.
, please take a look at it.
This file describes how the batch-mode recognizer and its various
sub-components should be configured. Note that this file also contains
configurations for the live-mode recognizer, which is not the subject of
interest of this walkthrough. In the following we will refer to components
in the config file using highlights
In an4.config.xml, the batch-mode recognizer is the component named "batch". It uses a Recognizer component, which contains the decoder as well as various monitors that keep track of recognition accuracy, speed, and memory. The decoder contains the search manager, which in turn contains the linguist, the pruner, the scorer, and the active list. Refer to the Javadoc (go to bottom of the page) for a description of each of these components.
The linguist used is the flatLinguist
, and the grammar of the
is either the wordListGrammar
, which
is a file with a list of words, e.g.,
(i.e., N-gram language model), or
(i.e., finite state transducer grammar). The
uses a language model file (text-based for AN4)
generated by the CMU
Statistical Language Modeling (SLM) Toolkit. The
also specifies the acoustic model used, and in
this case it is the WSJ models. The location and format of the WSJ model, as
well as the location of the various files in the model, are also specified.
The scorer
contains the front end, which is called
since it produces MFCC features.
is necessary to run Ant. This file is the Ant version
of the Makefile in Make. All Ant targets are listed in this file. For
details on how to write this file, refer to the documentation at http://ant.apache.org/. Let's use the first
Ant target, an4_words_wordlist
, as an example. This Ant target
invokes the java
command on the class
. This class
takes a configuration file (an4.config.xml
) and a batch file
) as arguments. This class looks for the
component named batch
in the configuration file. The
configuration manager will create this component (and its subcomponents).
Therefore, the component
should always be
named "batch"
in the config.xml file. Other AN4 Ant targets are
created similarly.
The two main acoustic models that are used by Sphinx-4, TIDIGITS and Wall
Street Journal, are already included in the "lib"
directory of
the binary distribution. For the source distribution, you will build them when
you type ant
at the top level directory. Our regression tests
also use the RM1 and HUB4 models, which are available for download separately
on the download page. Sphinx-4 can handle model packages provided as a jar
Each acoustic model implements the AcousticModel
interface. For example, the WSJ models are wrapped by a class called
which implements the AcousticModel interface. This implementation class is in
the JAR file of the models, together with the actual data files of the model.
This way, two simple steps are needed to use a particular acoustic model:
You can find out the model implementation class of a JAR file using the
java -jar
command. For example, you can find out the model class
of the WSJ model by:
sphinx4>java -jar lib/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.jar
Wall Street Journal acoustic models
Class: edu.cmu.sphinx.model.acoustic.WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz
Is Binary: true
Sparse Form: false
Filters: 40
Vector Length: 39
Gaussians: 8
Model Definition: etc/WSJ_clean_13dCep_16k_40mel_130Hz_6800Hz.4000.mdef
Data Location: cd_continuous_8gau
Feature Type: cepstra_delta_doubledelta
Sample Rate: 16000
Description: Wall Street Journal acoustic models
Number Fft Points: 512
Max Freq: 6800
Min Freq.: 130
The printout also includes details about how the model was trained, but this is not important for the average user.
The language model used by Sphinx-4 follows the ARPA format. Language models provided with the acoustic model packages were created with the Carnegie Mellon University Statistical Language Modeling toolkit (CMU SLM toolkit), available at CMU. A manual is available there.
The language model is created from a list of transcriptions. Given a file with training transcriptions, the following script creates a list of the words that appear in the transcriptions, then creates bigram and trigram LM files in the ARPA format. The file with extension ccs contains the context cues, usually a list of words used as markers, such as the beginning or end of speech.
set task = RM
# Location of the CMU SLM toolkit
set bindir = ~/src/CMU-SLM_Toolkit_v2/bin
cat $task.transcript | $bindir/text2wfreq | $bindir/wfreq2vocab > $task.vocab
set mode = "-absolute"
# Create bigram
cat $task.transcript | $bindir/text2idngram -n 2 -vocab $task.vocab | \
$bindir/idngram2lm $mode -context $task.ccs -n 2 -vocab $task.vocab \
-idngram - -arpa $task.bigram.arpa
# Create trigram
cat $task.transcript | $bindir/text2idngram -n 3 -vocab $task.vocab | \
$bindir/idngram2lm $mode -context $task.ccs -n 3 -vocab $task.vocab \
-idngram - -arpa $task.trigram.arpa
Sphinx-4 uses the Java Speech API Grammar Format (JSGF) to perform speech recognition using a BNF-style grammar. Currently, you can only use JSGF grammars with the FlatLinguist. To specify JSGF grammars, set the following in the configuration file:
<component name="flatLinguist" type="edu.cmu.sphinx.linguist.flat.FlatLinguist">
    <property name="grammar" value="jsgfGrammar"/>
    <!-- ... other properties ... -->
</component>
<component name="jsgfGrammar" type="edu.cmu.sphinx.jsapi.JSGFGrammar">
    <property name="grammarLocation" value="...URL of grammar directory"/>
</component>
For information on how to write JSGF grammars, and how to specify the
location of your JSGF grammar file(s), and the limitations of the current
implementation of JSGF grammar, please refer to the Javadocs
for JSGFGrammar.
The Sphinx-4 API can be found in the javadoc documentation.
If the previous link is broken, please build the javadocs using the instructions in Creating Javadocs. In fact, rebuilding the javadocs is something you should do every time you change code in Sphinx-4.
In this section, we will provide an overview of Sphinx-4, starting with an
introduction of HMM-based recognizers. We will highlight in
Sphinx-4 is an HMM-based speech recognizer.
During speech recognition, features are derived from the incoming speech
(we will use "speech" to mean the same thing as "audio") in the same way as in
the training process. The component of the recognizer that generates these
features is called the
The process of speech recognition is to find the best possible sequence of
words (or units) that will fit the given input speech. It is a
Constructing the above graph requires knowledge from various sources. It
requires a
Usually, the search graph also has information about how likely certain
words will occur. This information is supplied by the
Once this graph is constructed, the sequence of parametrized speech signals
(i.e., the features) is matched against different paths through the graph to
find the best fit. The best fit is usually the least cost or highest scoring
path, depending on the implementation. In Sphinx-4, the task of searching
through the graph for the best path is done by the
As you can see from the above graph, a lot of the nodes have self
transitions. This can lead to a very large number of possible paths through
the graph. As a result, finding the best possible path can take a very long
time. The purpose of the
As we described earlier, the input speech signal is transformed into a
sequence of feature vectors. After the last feature vector is decoded, we look
at all the paths that have reached the final exit node (the red node). The
path with the highest score is the best fit, and a
In this section, we describe the main components of Sphinx-4, and how they work together during the recognition process. First of all, let's look at the architecture diagram of Sphinx-4. It contains almost all the concepts (the words in red) that were introduced in the previous section. There are a few additional concepts in the diagram, which we will explain promptly.
When the recognizer starts up, it constructs the front end (which generates features from speech), the decoder, and the linguist (which generates the search graph) according to the configuration specified by the user. These components will in turn construct their own subcomponents. For example, the linguist will construct the acoustic model, the dictionary, and the language model. It will use the knowledge from these three components to construct a search graph that is appropriate for the task. The decoder will construct the search manager, which in turn constructs the scorer, the pruner, and the active list.
Most of these components represent interfaces. The search manager,
linguist, acoustic model, dictionary, language model, active list, scorer,
pruner, and search graph are all Java interfaces. There can be different
implementations of these interfaces. For example, there are two different
implementations of the search manager. Then, how does the system know which
implementation to use? It is specified by the user via the configuration file,
an XML-based file that is loaded by the
When the application asks the recognizer to perform recognition, the search manager will ask the scorer to score each token in the active list against the next feature vector obtained from the front end. This gives a new score for each of the active paths. The pruner will then prune the tokens (i.e., active paths) using certain heuristics. Each surviving path will then be expanded to the next states, where a new token will be created for each next state. The process repeats itself until no more feature vectors can be obtained from the front end for scoring. This usually means that there is no more input speech data. At that point, we look at all paths that have reached the final exit state, and return the highest scoring path as the result to the application.
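The score, prune, and expand loop described above can be sketched as a toy token-passing search in Java. Everything here is illustrative and hypothetical: the ToyDecoder class, its Scorer interface, and the integer state graph are not Sphinx-4 types. The sketch also makes simplifying assumptions that the text does not state: transitions carry no cost, pruning is a fixed-width beam relative to the best score, the exit state is non-emitting, and only the best token per state is kept after expansion.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy token-passing search; illustrative only, not the Sphinx-4 search manager. */
final class ToyDecoder {

    /** Hypothetical scorer: log-score of a feature vector in an emitting state. */
    interface Scorer {
        double logScore(int state, double[] feature);
    }

    /** One active path: its current state, accumulated score, and predecessor. */
    static final class Token {
        final int state;
        final double score;
        final Token prev;

        Token(int state, double score, Token prev) {
            this.state = state;
            this.score = score;
            this.prev = prev;
        }
    }

    /**
     * successors[s] lists the states reachable from s; list s itself for a self
     * transition. Returns the best token in the exit state, or null if none got there.
     */
    static Token decode(int[][] successors, int start, int exit,
                        List<double[]> features, Scorer scorer, double beam) {
        List<Token> active = new ArrayList<>();
        active.add(new Token(start, 0.0, null));
        for (double[] feature : features) {
            // Score every active token against the next feature vector.
            List<Token> scored = new ArrayList<>();
            double best = Double.NEGATIVE_INFINITY;
            for (Token t : active) {
                if (t.state == exit) {
                    continue; // this path ended before the input did
                }
                double s = t.score + scorer.logScore(t.state, feature);
                scored.add(new Token(t.state, s, t.prev));
                best = Math.max(best, s);
            }
            // Prune tokens that fall outside the beam, then expand the survivors.
            Map<Integer, Token> next = new HashMap<>();
            for (Token t : scored) {
                if (t.score < best - beam) {
                    continue;
                }
                for (int n : successors[t.state]) {
                    Token old = next.get(n);
                    if (old == null || old.score < t.score) {
                        next.put(n, new Token(n, t.score, t));
                    }
                }
            }
            active = new ArrayList<>(next.values());
        }
        // No more features: pick the highest scoring path that reached the exit state.
        Token result = null;
        for (Token t : active) {
            if (t.state == exit && (result == null || t.score > result.score)) {
                result = t;
            }
        }
        return result;
    }

    /** State sequence of a path, from the first frame to the exit state. */
    static List<Integer> path(Token last) {
        List<Integer> states = new ArrayList<>();
        for (Token t = last; t != null; t = t.prev) {
            states.add(0, t.state);
        }
        return states;
    }
}
```

A narrower beam discards more tokens per frame, which speeds up the search at the risk of dropping the path that would have scored best at the end.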
The performance of Sphinx-4 critically depends on your task and how you configured Sphinx-4 to suit your task. For example, a large vocabulary task needs a different linguist than a small vocabulary task. Your system has to be configured differently for the two tasks. This section will not tell you the exact configuration for different tasks, which will be dealt with later. Instead, this section will introduce you to the configuration mechanism of Sphinx-4, which is via an XML-based configuration file. Please click on the document Sphinx-4 Configuration Management to learn how to do this. It is important that you read this document before you proceed.