- Newsgroups: comp.speech
- Path: sparky!uunet!charon.amdahl.com!pacbell.com!ames!saimiri.primate.wisc.edu!zaphod.mps.ohio-state.edu!darwin.sura.net!sgiblab!munnari.oz.au!metro!waves!andrewh
- From: andrewh@ee.su.OZ.AU (Andrew Hunt)
- Subject: FAQ - First Draft
- Message-ID: <1992Nov18.012628.15394@ucc.su.OZ.AU>
- Sender: news@ucc.su.OZ.AU
- Nntp-Posting-Host: waves.ee.su.oz.au
- Reply-To: andrewh@ee.su.OZ.AU (Andrew Hunt)
- Organization: University of Sydney, Australia
- Date: Wed, 18 Nov 1992 01:26:28 GMT
- Lines: 918
-
-
- comp.speech
-
- Frequently Asked Questions
- ==========================
-
- (FAQ = "Frequently Asked Questions")
-
- Compiled: 18-Nov-92
-
- This document is an attempt to answer commonly asked questions and to
- reduce the bandwidth taken up by these posts and their associated replies.
- If you have a question, please check this file before you post.
-
- The FAQ is not meant to discuss any topic exhaustively. It will hopefully
- provide readers with pointers on where to find useful information. It also
- tries to list as much as possible of the useful information and material
- available elsewhere on the Internet.
-
- If you have not already read the Usenet introductory material posted to
- "news.announce.newusers", please do. For help with FTP (file transfer
- protocol) look for a regular posting of "Anonymous FTP List - FAQ" in
- comp.misc, comp.archives.admin and news.answers amongst others.
-
-
- The document is an EARLY DRAFT. There are many unanswered questions and
- some answers are not comprehensive. I hope that it will improve over the next
- few months. If you have any comments, suggestions for inclusions, or answers
- then please post or email.
-
- In particular, the sections listing software/hardware for speech
- synthesis, recognition and NLP have lots of gaps.
-
-
- My apologies to those who have been awaiting the first posting. I have
- been on holidays for most of the last month.
-
-
- Andrew Hunt
- Speech Technology Research Group email: andrewh@ee.su.oz.au
- Department of Electrical Engineering Ph: 61-2-692 4509
- University of Sydney, Australia. Fax: 61-2-692 3847
-
-
- ========================== Acknowledgements ===========================
-
- Thanks to the following for their comments and contributions.
-
- Oliver Jakobs <jakobs@ldv01.Uni-Trier.de>
- Tony Robinson <ajr@eng.cam.ac.uk>
-
-
- ============================ Contents =================================
-
- PART 1 - General
-
- Q1.1: What is comp.speech?
- Q1.2: Where are the comp.speech archives?
- Q1.3: Common abbreviations and jargon.
- Q1.4: What are related newsgroups and mailing lists?
- Q1.5: What are related journals and conferences?
- Q1.6: What speech databases are available?
- Q1.7: Speech File Formats, Conversion and Playing.
-
- PART 2 - Signal Processing for Speech
-
- Q2.1: What speech sampling and signal processing hardware can I use?
- Q2.2: Signal processing techniques for speech technology.
- Q2.3: How do I convert to/from mu-law format?
-
- PART 3 - Speech Coding and Compression
-
- Q3.1: Speech compression techniques.
- Q3.2: What are some good references/books on coding/compression?
- Q3.3: What software is available?
-
- PART 4 - Speech Synthesis
-
- Q4.1: What is speech synthesis?
- Q4.2: How can speech synthesis be performed?
- Q4.3: What are some good references/books on synthesis?
- Q4.4: What software/hardware is available?
-
- PART 5 - Speech Recognition
-
- Q5.1: What is speech recognition?
- Q5.2: How can I build a very simple speech recogniser?
- Q5.3: What does speaker dependent/adaptive/independent mean?
- Q5.4: What does small/medium/large/very-large vocabulary mean?
- Q5.5: What does continuous speech or isolated-word mean?
- Q5.6: How is speech recognition done?
- Q5.7: What are some good references/books on recognition?
- Q5.8: What packages are available?
-
- PART 6 - Speaker Recognition/Verification
-
- Q6.1: What is speaker recognition/verification?
- Q6.2: Where is speaker recognition used?
- Q6.3: What are techniques for speaker recognition?
- Q6.4: How good is speaker recognition?
- Q6.5: What are some good references/books on speaker recognition?
- Q6.6: What packages are available?
-
- PART 7 - Natural Language Processing
-
- Q7.1: What is NLP?
- Q7.2: What are some good references/books on NLP?
- Q7.3: What software is available?
-
- =======================================================================
-
- PART 1 - General
-
- Q1.1: What is comp.speech?
-
- comp.speech is a newsgroup for discussion of speech technology and
- speech science. It covers a wide range of issues from application of
- speech technology, to research, to products and lots more.
-
- By nature speech technology is an inter-disciplinary field and the
- newsgroup reflects this. However, computer application should be the
- basic theme of the group.
-
- The following is a list of topics but does not cover all matters related
- to the field - no order of importance is implied.
-
- [1] Speech Recognition - discussion of methodologies, training, techniques,
- results and applications. This should cover the application of techniques
- including HMMs, neural-nets and so on to the field.
-
- [2] Speech Synthesis - discussion concerning theoretical and practical
- issues associated with the design of speech synthesis systems.
-
- [3] Speech Coding and Compression - both research and application matters.
-
- [4] Phonetic/Linguistic Issues - coverage of linguistic and phonetic issues
- which are relevant to speech technology applications. Could cover parsing,
- natural language processing, phonology and prosodic work.
-
- [5] Speech System Design - issues relating to the application of speech
- technology to real-world problems. Includes the design of user interfaces,
- the building of real-time systems and so on.
-
- [6] Other matters - relevant conferences, books, public domain software,
- hardware and related products.
-
- ------------------------------------------------------------------------
-
- Q1.2: Where are the comp.speech archives?
-
- comp.speech is being archived for anonymous ftp.
-
- ftp site: svr-ftp.eng.cam.ac.uk (or 129.169.24.20).
- directory: comp.speech/archive
-
- comp.speech/archive contains the articles as they arrive. Batches of 100
- articles are grouped into a shar file, along with an associated file of
- Subject lines.
-
- Other useful information is also available in comp.speech/info.
-
- ------------------------------------------------------------------------
-
- Q1.3: Common abbreviations and jargon.
-
- ANN - Artificial Neural Network.
- ASR - Automatic Speech Recognition.
- CELP - Code excited linear prediction.
- DTW - Dynamic time warping.
- FAQ - Frequently asked questions.
- HMM - Hidden Markov model.
- LPC - Linear predictive coding.
- LVQ - Learning vector quantisation.
- NLP - Natural Language Processing.
- NN - Neural Network.
- TTS - Text-To-Speech (i.e. synthesis).
-
- ------------------------------------------------------------------------
-
- Q1.4: What are related newsgroups and mailing lists?
-
-
- NEWSGROUPS
-
- comp.ai - Artificial Intelligence newsgroup.
- Postings on general AI issues, language processing and AI techniques.
- Has a good FAQ including NLP, NN and other AI information.
-
- comp.ai.nlang-know-rep - Natural Language Knowledge Representation
- Moderated group covering Natural Language.
-
- comp.ai.neural-nets - discussion of Neural Networks and related issues.
- There are often postings on speech-related matters - phonetic recognition,
- connectionist grammars and so on.
-
- comp.compression - occasional articles on compression of speech.
- FAQ for comp.compression has some info on audio compression standards.
-
- comp.dcom.telecom - Telecommunications newsgroup.
- Has occasional articles on voice products.
-
- comp.dsp - discussion of signal processing - hardware and algorithms and more.
- Has a good FAQ posting.
- Has a regular posting of a comprehensive list of Audio File Formats.
-
- comp.multimedia - Multi-Media discussion group.
- Has occasional articles on voice I/O.
-
- sci.lang - Language.
- Discussion about phonetics, phonology, grammar, etymology and lots more.
-
-
- MAILING LISTS
-
- ECTL - Electronic Communal Temporal Lobe
- Founder & Moderator: David Leip
- Moderated mailing list for researchers with interests in computer speech
- interfaces. This list serves a broad community including persons from
- signal processing, AI, linguistics and human factors.
-
- To subscribe, send the following information to:
- ectl-request@snowhite.cis.uoguelph.ca
- name, institute, department, daytime phone & e-mail address
-
- To access the archive, ftp snowhite.cis.uoguelph.ca, login as anonymous,
- and supply your local userid as a password. All the ECTL things can be
- found in pub/ectl.
-
- Prosody Mailing List
- Unmoderated mailing list for discussion of prosody. The aim is
- to facilitate the spread of information relating to the research
- of prosody by creating a network of researchers in the field.
- If you want to participate, send the following one-line
- message to "listserv@purccvm.bitnet" :-
-
- subscribe prosody Your Name
-
- Digital Mobile Radio
- Covers lots of areas, including some speech topics such as speech
- coding and speech compression.
- Mail Peter Decker (dec@dfv.rwth-aachen.de) to subscribe.
-
- ------------------------------------------------------------------------
-
- Q1.5: What are related journals and conferences?
-
- Try the following commercially oriented magazines...
-
- Speech Technology
-
- Try the following technical journals...
-
- Computational Linguistics
- Computer Speech and Language
- Journal of the Acoustical Society of America
- Transactions of IEEE ASSP
-
- Try the following conferences...
-
- IEEE Intl. Conf. Acoustics, Speech and Signal Processing
- Intl Conf on Spoken Language Processing
-
- ------------------------------------------------------------------------
-
- Q1.6: What speech databases are available?
-
- A wide range of speech databases have been collected. These databases
- are primarily for the development of speech synthesis/recognition and for
- linguistic research. Unfortunately, almost all the information listed
- here refers to the English language.
-
- Some databases are free but most appear to be available for a small cost.
- The databases normally require lots of storage space - do not expect to be
- able to ftp all the data you want.
-
- [There are too many to list here in detail - perhaps someone would like to
- set up a special posting on speech databases?]
-
-
- LINGUISTIC DATA CONSORTIUM (LDC)
-
- Information about the Linguistic Data Consortium is available via
- anonymous ftp from: ftp.cis.upenn.edu (130.91.6.8)
- in the directory: /pub/ldc
-
- Here are some excerpts from the files in that directory:
-
- Briefly stated, the LDC has been established to broaden the collection
- and distribution of speech and natural language data bases for the
- purposes of research and technology development in automatic speech
- recognition, natural language processing and other areas where large
- amounts of linguistic data are needed.
-
- Here is the brief list of corpora:
-
- * The TIMIT and NTIMIT speech corpora
- * The Resource Management speech corpus (RM1, RM2)
- * The Air Travel Information System (ATIS0) speech corpus
- * The Association for Computational Linguistics - Data Collection
- Initiative text corpus (ACL-DCI)
- * The TI Connected Digits speech corpus (TIDIGITS)
- * The TI 46-word Isolated Word speech corpus (TI-46)
- * The Road Rally conversational speech corpora (including "Stonehenge"
- and "Waterloo" corpora)
- * The Tipster Information Retrieval Test Collection
- * The Switchboard speech corpus ("Credit Card" excerpts and portions
- of the complete Switchboard collection)
-
- Further resources to be made available within the first year (or two):
-
- * The Machine-Readable Spoken English speech corpus (MARSEC)
- * The Edinburgh Map Task speech corpus
- * The Message Understanding Conference (MUC) text corpus of FBI
- terrorist reports
- * The Continuous Speech Recognition - Wall Street Journal speech
- corpus (WSJ-CSR)
- * The Penn Treebank parsed/tagged text corpus
- * The Multi-site ATIS speech corpus (ATIS2)
- * The Air Traffic Control (ATC) speech corpus
- * The Hansard English/French parallel text corpus
- * The European Corpus Initiative multi-language text corpus (ECI)
- * The Int'l Labor Organization/Int'l Trade Union multi-language
- text corpus (ILO/ITU)
- * Machine-readable dictionaries/lexical data bases (COMLEX, CELEX)
-
- The files in the directory include more detailed information on the
- individual databases. For further information contact
-
- Elizabeth Hodas
- 441 Williams Hall
- University of Pennsylvania
- Philadelphia, PA 19104-6305
- Phone: (215) 898-0464
- Fax: (215) 573-2175
- e-mail: ehodas@walnut.ling.upenn.edu
-
-
- Center for Spoken Language Understanding (CSLU)
-
- 1. The ISOLET speech database of spoken letters of the English alphabet.
- The speech is high quality (16 kHz with a noise cancelling microphone).
- 150 speakers x 26 letters of the English alphabet twice in random order.
- The "ISOLET" data base can be purchased for $100 by sending an email request
- to vincew@cse.ogi.edu. (This covers handling, shipping and medium costs).
- The data base comes with a technical report describing the data.
-
- 2. CSLU has a telephone speech corpus of 1000 English alphabets. Callers
- recite the alphabet with brief pauses between letters. This database is
- available to not-for-profit institutions for $100. The data base is described
- in the proceedings of the International Conference on Spoken Language
- Processing. Contact vincew@cse.ogi.edu if interested.
-
- ------------------------------------------------------------------------
-
- Q1.7: Speech File Formats, Conversion and Playing.
-
- Section 2 of this FAQ has information on mu-law coding.
-
- A very good and very comprehensive list of audio file formats is prepared
- by Guido van Rossum. The list is posted regularly to comp.dsp and
- alt.binaries.sounds.misc, amongst others. It includes information on
- sampling rates, hardware, compression techniques, file format definitions,
- format conversion, standards, programming hints and lots more.
-
- It is also available by ftp
- from: ftp.cwi.nl
- directory: /pub
- file: AudioFormats<version>
-
-
-
- =======================================================================
-
- PART 2 - Signal Processing for Speech
-
- Q2.1: What speech sampling and signal processing hardware can I use?
-
- In addition to the following information, have a look at the Audio File
- format document prepared by Guido van Rossum (referred to above).
-
-
- Product: Sun standard audio port (SPARC 1 & 2)
- Input: 1 channel, 8 bit mu-law encoded (telephone quality)
- Output: 1 channel, 8 bit mu-law encoded (telephone quality)
-
- Product: Ariel
- Platform: Sun + others?
- Input: 2 channels, 16bit linear, sample rate 8-96kHz (inc 32, 44.1, 48kHz).
- Output: 2 channels, 16bit linear, sample rate 8-50kHz (inc 32, 44.1, 48kHz).
- Contact:
-
- Can anyone provide information on Soundblaster, Mac, NeXT and other hardware?
-
- [Help is needed to source more info. How about the following format?]
-
- Product: xxx
- Platform: PC, Mac, Sun, ...
- Rough Cost (pref $US):
- Input: e.g. 16bit linear, 8,10,16,32kHz.
- Output: e.g. 16bit linear, 8,10,16,32kHz.
- DSP: signal processing support
- Other:
- Contact:
-
- ------------------------------------------------------------------------
-
- Q2.2: Signal processing techniques for speech technology.
-
- Would anyone like to write a short intro, list some useful refs, and list
- public domain signal processing software?
-
- ------------------------------------------------------------------------
-
- Q2.3: How do I convert to/from mu-law format?
-
- Mu-law coding is a form of compression for audio signals including speech.
- It is widely used in the telecommunications field because it improves the
- signal-to-noise ratio without increasing the amount of data. Typically,
- mu-law compressed speech is carried in 8-bit samples. It is a companding
- technique: it carries more information about the smaller signals than
- about larger signals. Mu-law coding is provided as standard for the
- audio input and output of Sun SPARCstations 1 & 2 (SPARCstation 10s
- are linear).
-
- Here is some sample conversion code in C.
-
- # include <stdio.h>
-
- unsigned char linear2ulaw(/* int */);
- int ulaw2linear(/* unsigned char */);
-
- /*
- ** This routine converts from linear to ulaw.
- **
- ** Craig Reese: IDA/Supercomputing Research Center
- ** Joe Campbell: Department of Defense
- ** 29 September 1989
- **
- ** References:
- ** 1) CCITT Recommendation G.711 (very difficult to follow)
- ** 2) "A New Digital Technique for Implementation of Any
- ** Continuous PCM Companding Law," Villeret, Michel,
- ** et al. 1973 IEEE Int. Conf. on Communications, Vol 1,
- ** 1973, pg. 11.12-11.17
- ** 3) MIL-STD-188-113,"Interoperability and Performance Standards
- ** for Analog-to_Digital Conversion Techniques,"
- ** 17 February 1987
- **
- ** Input: Signed 16 bit linear sample
- ** Output: 8 bit ulaw sample
- */
-
- #define ZEROTRAP /* turn on the trap as per the MIL-STD */
- #undef ZEROTRAP
- #define BIAS 0x84 /* define the add-in bias for 16 bit samples */
- #define CLIP 32635
-
- unsigned char linear2ulaw(sample) int sample; {
- static int exp_lut[256] = {0,0,1,1,2,2,2,2,3,3,3,3,3,3,3,3,
- 4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
- 5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,
- 5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,
- 6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
- 6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
- 6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
- 6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
- 7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
- 7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
- 7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
- 7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
- 7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
- 7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
- 7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
- 7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7};
- int sign, exponent, mantissa;
- unsigned char ulawbyte;
-
- /* Get the sample into sign-magnitude. */
- sign = (sample >> 8) & 0x80; /* set aside the sign */
- if(sign != 0) sample = -sample; /* get magnitude */
- if(sample > CLIP) sample = CLIP; /* clip the magnitude */
-
- /* Convert from 16 bit linear to ulaw. */
- sample = sample + BIAS;
- exponent = exp_lut[( sample >> 7 ) & 0xFF];
- mantissa = (sample >> (exponent + 3)) & 0x0F;
- ulawbyte = ~(sign | (exponent << 4) | mantissa);
- #ifdef ZEROTRAP
- if (ulawbyte == 0) ulawbyte = 0x02; /* optional CCITT trap */
- #endif
-
- return(ulawbyte);
- }
-
- /*
- ** This routine converts from ulaw to 16 bit linear.
- **
- ** Craig Reese: IDA/Supercomputing Research Center
- ** 29 September 1989
- **
- ** References:
- ** 1) CCITT Recommendation G.711 (very difficult to follow)
- ** 2) MIL-STD-188-113,"Interoperability and Performance Standards
- ** for Analog-to_Digital Conversion Techniques,"
- ** 17 February 1987
- **
- ** Input: 8 bit ulaw sample
- ** Output: signed 16 bit linear sample
- */
-
- int ulaw2linear(ulawbyte) unsigned char ulawbyte; {
- static int exp_lut[8] = { 0, 132, 396, 924, 1980, 4092, 8316, 16764 };
- int sign, exponent, mantissa, sample;
-
- ulawbyte = ~ulawbyte;
- sign = (ulawbyte & 0x80);
- exponent = (ulawbyte >> 4) & 0x07;
- mantissa = ulawbyte & 0x0F;
- sample = exp_lut[exponent] + (mantissa << (exponent + 3));
- if(sign != 0) sample = -sample;
-
- return(sample);
- }
-
-
-
- =======================================================================
-
- PART 3 - Speech Coding and Compression
-
- Q3.1: Speech compression techniques.
-
- [The FAQ for comp.compression includes a few questions and answers
- on the compression of speech.]
-
- ------------------------------------------------------------------------
-
- Q3.2: What are some good references/books on coding/compression?
-
- ------------------------------------------------------------------------
-
- Q3.3: What software is available?
-
- A lossless compressor for speech signals is available by anonymous ftp from
- svr-ftp.eng.cam.ac.uk in directory misc and file shorten-0.4.shar. It will
- compile and run on UNIX workstations and will cope with a wide variety of
- formats. Compression is typically 50% for 16bit clean speech sampled at
- 16kHz.
-
- [ADPCM, CELP and GSM source appear to be available by ftp - can someone
- provide details?]
-
-
-
- =======================================================================
-
- PART 4 - Speech Synthesis
-
- Q4.1: What is speech synthesis?
-
- Speech synthesis is the task of transforming written input to spoken output.
- The input can be provided either in graphemic/orthographic or in phonemic
- form, depending on its source.
-
- ------------------------------------------------------------------------
-
- Q4.2: How can speech synthesis be performed?
-
- There are several algorithms. The choice depends on the task they're used
- for. The easiest way is to just record the voice of a person speaking the
- desired phrases. This is useful if only a restricted volume of phrases and
- sentences is used, e.g. messages in a train station, or schedule information
- via phone. The quality depends on the way recording is done.
-
- More sophisticated, but lower in quality, are algorithms which split the
- speech into smaller pieces. The smaller those units are, the fewer they
- are in number, but the quality also decreases. An often-used unit is the
- phoneme, the smallest linguistic unit. Depending on the language there are
- about 35-50 phonemes in western European languages, i.e. there are 35-50
- single recordings. The problem is that combining them into fluent speech
- requires fluent transitions between the elements. The intelligibility is
- therefore lower, but the memory required is small.
-
- A solution to this dilemma is using diphones. Instead of splitting at the
- transitions, the cut is done at the center of the phonemes, leaving the
- transitions themselves intact. This gives about 400 elements (20*20) and
- the quality increases.
-
- The longer the units become, the more elements there are, but the quality
- increases along with the memory required. Other units which are widely used
- are half-syllables, syllables, words, or combinations of them, e.g. word stems
- and inflectional endings.
-
- ------------------------------------------------------------------------
-
- Q4.3: What are some good references/books on synthesis?
-
- MITalk (from which DECtalk was derived) is one of the most successful
- speech synthesis systems around. It is described in:
-
- John Allen, Sharon Hunnicut and Dennis H. Klatt, "From Text to Speech:
- The MITalk System", Cambridge University Press, 1987.
-
- ------------------------------------------------------------------------
-
- Q4.4: What software/hardware is available?
-
- There appears to be very little Public Domain or Shareware speech synthesis
- software available for FTP. However, the following are available.
- Strictly speaking, not all of the following are speech synthesisers, but
- all are speech output systems.
-
-
- SIMTEL-20
- The following is a list of speech related software available from SIMTEL-20
- and its mirror sites for PCs.
-
- Directory PD1:<MSDOS.VOICE>
- Filename Type Length Date Description
- ==============================================
- AUTOTALK.ARC B 23618 881216 Digitized speech for the PC
- CVOICE.ARC B 21335 891113 Tells time via voice response on PC
- HEARTYPE.ARC B 10112 880422 Hear what you are typing, crude voice synth.
- HELPME2.ARC B 8031 871130 Voice cries out 'Help Me!' from PC speaker
- SAY.ARC B 20224 860330 Computer Speech - using phonemes
- SPEECH98.ZIP B 41003 910628 Build speech (voice) on PC using 98 phonemes
- TALK.ARC B 8576 861109 BASIC program to demo talking on a PC speaker
- TRAN.ARC B 39766 890715 Repeats typed text in digital voice
- VDIGIT.ZIP B 196284 901223 Toolkit: Add digitized voice to your programs
- VGREET.ARC B 45281 900117 Voice says good morning/afternoon/evening
-
-
- Other Sources
-
- Package: Text to phoneme program
- Platform: unknown
- Description: Text to phoneme program. Possibly based on Naval Research Lab's
- set of text to phoneme rules.
- Availability: By FTP from "shark.cse.fau.edu" (131.91.80.13) in the directory
- /pub/src/phon.tar.Z
-
- Package: Text to speech program
- Platform: Sun SPARC
- Description: Text to speech program based on concatenation of pre-recorded
- speech segments.
- Hardware: SPARC audio I/O
- Availability: by FTP from "wilma.cs.brown.edu" as /pub/speak.tar.Z
-
-
- Package: xxx
- Platform: (PC, Mac, Sun, NeXt etc)
- Rough Cost: (if appropriate)
- Description: (keep it brief)
- Hardware: (requirement list)
- Availability: (ftp info, email contact or company contact)
-
-
- Can anyone provide information on the following packages?
-
- MacIntalk (Mac) - formant based speech synthesis
- Narrator (Amiga) - formant based synthesis
- Bliss software
- CSRE software
- JSRU software
- Klatt Software
- Sensimetrics products
-
- Can anyone provide information on speech synthesis chip sets?
-
- Please email or post suitable information for this list. Commercial and
- research packages are both appropriate.
-
- [This list may be large enough to justify a separate posting]
-
-
- =======================================================================
-
- PART 5 - Speech Recognition
-
- Q5.1: What is speech recognition?
-
- Speech recognition is the process by which a computer converts a speech
- signal into a sequence of words or other linguistic units.
-
- [Better definitions are welcome - please post or email.]
-
- ------------------------------------------------------------------------
-
- Q5.2: How can I build a very simple speech recogniser?
-
- Doug Danforth provides a detailed account in article 253 in the comp.speech
- archives - also available as file info/DIY_Speech_Recognition.
-
- The first part is reproduced here.
-
- QUICKY RECOGNIZER sketch:
-
- Here is a simple recognizer that should give you 85%+ recognition
- accuracy. The accuracy is a function of WHAT words you have in
- your vocabulary. Long distinct words are easy. Short similar
- words are hard. You can get 98+% on the digits with this recognizer.
-
- Overview:
- (1) Find the beginning and end of the utterance.
- (2) Filter the raw signal into frequency bands.
- (3) Cut the utterance into a fixed number of segments.
- (4) Average data for each band in each segment.
- (5) Store this pattern with its name.
- (6) Collect training set of about 3 repetitions of each pattern (word).
- (7) Recognize unknown by comparing its pattern against all patterns
- in the training set and returning the name of the pattern closest
- to the unknown.
-
- Many variations upon the theme can be made to improve the performance.
- Try different filtering of the raw signal and different processing methods.
-
- ------------------------------------------------------------------------
-
- Q5.3: What does speaker dependent/adaptive/independent mean?
-
- A speaker dependent system is developed (trained) to operate for a single
- speaker. These systems are usually easier to develop, cheaper to buy and
- more accurate, but are not as flexible as speaker adaptive or speaker
- independent systems.
-
- A speaker independent system is developed (trained) to operate for any
- speaker or speakers of a particular type (e.g. male/female, American/English).
- These systems are the most difficult to develop, most expensive and currently
- accuracy is not as good. They are the most flexible.
-
- A speaker adaptive system is developed to adapt its operation to new
- speakers that it encounters, usually based on a general model of speaker
- characteristics. It lies somewhere between speaker dependent and speaker
- independent systems.
-
- Each type of system is suited to different applications and domains.
-
- ------------------------------------------------------------------------
-
- Q5.4: What does small/medium/large/very-large vocabulary mean?
-
- The size of vocabulary of a speech recognition system affects the complexity,
- processing requirements and the accuracy of the system. Some applications
- only require a few words (e.g. numbers only), others require very large
- dictionaries (e.g. dictation machines).
-
- There are no established definitions but the following may be a helpful guide.
-
- small vocabulary - tens of words
- medium vocabulary - hundreds of words
- large vocabulary - thousands of words
- very-large vocabulary - tens of thousands of words.
-
- [Does anyone have a more precise definition?]
-
- ------------------------------------------------------------------------
-
- Q5.5: What does continuous speech or isolated-word mean?
-
- An isolated-word system operates on single words at a time - requiring a
- pause between saying each word. This is the simplest form of recognition
- to perform, because the pronunciations of the words tend not to affect
- each other. Because the occurrences of each particular word are similar,
- they are easier to recognise.
-
- A continuous speech system operates on speech in which words are connected
- together, i.e. not separated by pauses. Continuous speech is more difficult
- to handle because of a variety of effects. One effect is "coarticulation" -
- the production of each phoneme is affected by the production of surrounding
- phonemes, and so the start and end of words are affected by the preceding
- and following words. The recognition of continuous speech is also affected
- by the rate of speech (fast speech tends to be harder).
-
- ------------------------------------------------------------------------
-
- Q5.6: How is speech recognition done?
-
- A wide variety of techniques is used to perform speech recognition.
- There are many types of speech recognition, and many levels of speech
- processing and understanding.
-
- Typically speech recognition starts with the digital sampling of speech.
- The next stage would be acoustic signal processing. Common techniques
- include a variety of spectral analyses, LPC analysis, the cepstral transform,
- cochlea modelling and many, many more.
-
- The next stage will typically try to recognise phonemes, groups of phonemes
- or words. This stage can be achieved by many processes such as DTW (Dynamic
- Time Warping), HMM (hidden Markov modelling), NNs (Neural Networks), and
- sometimes expert systems. In crude terms, all these processes attempt to
- recognise
- the patterns of speech. The most advanced systems are statistically
- motivated.
-
- Some systems utilise knowledge of grammar to help with the recognition
- process.
-
- Some systems attempt to utilise prosody (pitch, stress, rhythm etc) to
- process the speech input.
-
- Some systems try to "understand" speech. That is, they try to convert the
- words into a representation of what the speaker intended to mean or achieve
- by what they said.
-
- ------------------------------------------------------------------------
-
- Q5.7: What are some good references/books on recognition?
-
- Suggestions?
-
- ------------------------------------------------------------------------
-
- Q5.8: What packages are available?
-
- Package Name: xxx
- Platform: PC, Mac, UNIX, ....
- Description: (e.g. isolated word, speaker independent...)
- Rough Cost: (if applicable)
- Requirements: (hardware/software needs - if applicable)
- Misc:
- Contact: (email, ftp or address)
-
-
- Can anyone provide info on
-
- DragonDictate
- SayIt (from Qualix)
- HTK (HMM Toolkit)
- Voice Navigator (from Articulate Systems)
- IN3 Voice Command
- Votan
-
-
- I would like information on any software/hardware/packages that you know about.
- Commercial, public domain and research packages are all appropriate.
-
- [If there is enough information a separate posting could be started.]
-
-
- =======================================================================
-
- PART 6 - Speaker Recognition/Verification
-
- Q6.1: What is speaker recognition/verification?
-
- ------------------------------------------------------------------------
-
- Q6.2: Where is speaker recognition used?
-
- ------------------------------------------------------------------------
-
- Q6.3: What are techniques for speaker recognition?
-
- ------------------------------------------------------------------------
-
- Q6.4: How good is speaker recognition?
-
- ------------------------------------------------------------------------
-
- Q6.5: What are some good references/books on speaker recognition?
-
- ------------------------------------------------------------------------
-
- Q6.6: What packages are available?
-
-
- =======================================================================
-
- PART 7 - Natural Language Processing
-
- There is a lot of useful information on the following questions in the
- FAQ for comp.ai. The FAQ lists available software and useful references.
- Included is a substantial list of software, documentation and other info
- which is available by ftp.
-
- ------------------------------------------------------------------------
-
- Q7.1: What is NLP?
-
- Natural Language Processing is a field of great breadth. It covers everything
- from syntactic and semantic analysis of text, to methods of "understanding"
- texts, to methods of generating text from abstract representations, to
- language translation and more.
-
- ------------------------------------------------------------------------
-
- Q7.2: What are some good references/books on NLP?
-
- Any recommendations? A few references/books for each area such as parsing,
- translation, knowledge representation etc, would be suitable.
-
- The FAQ for the "comp.ai" newsgroup includes some useful refs also.
-
- ------------------------------------------------------------------------
-
- Q7.3: What software is available?
-
- The FAQ for the "comp.ai" newsgroup lists a variety of language processing
- software that is available. That FAQ is posted monthly.
-
- Natural Language Software Registry
- The Natural Language Software Registry is available from the German Research
- Institute for Artificial Intelligence (DFKI) in Saarbrucken.
-
- The current version details
- + speech signal processors, e.g. Computerized Speech Lab (Kay Electronics)
- + morphological analyzers, e.g. PC-KIMMO (Summer Institute for Linguistics)
- + parsers, e.g. Alveytools (University of Edinburgh)
- + knowledge representation systems, e.g. Rhet (University of Rochester)
- + multicomponent systems, such as ELU (ISSCO), PENMAN (ISI), Pundit (UNISYS),
- SNePS (SUNY Buffalo),
- + applications programs (misc.)
-
- This document is available on-line via anonymous ftp to ftp.dfki.uni-sb.de
- (directory: registry; or: tira.uchicago.edu, IP 128.135.96.31), or by email
- to registry@dfki.uni-sb.de.
-
- If you have developed a piece of software for natural language
- processing that other researchers might find useful, you can include
- it by returning a description form, available from the same source.
-
-
-
-