- A few notes on the SBRECOG speech recognition demo
-
-
- With SBRECOG, I am presenting a speaker-dependent
- speech recognizer that works on DOS machines with
- Sound Blaster compatible sound cards. The recognition
- can be quite good if the conditions are optimal:
- -- sufficiently distinct test sets that consist of
- words of two or more syllables
- -- good recording conditions
- Sets that work fine for me are 4-6 element sets
- consisting of Italian numbers or the aviation alphabet.
-
-
- The program is based on the paper "Untersuchungen zur Verteilung von
- Nulldurchgangsabstaenden in Sprachsignalen" (a study on the distribution
- of zero crossing distances in speech signals) by Michael Kirstein, published
- in IKP-Forschungsberichte II/62, Hamburg 1977.
-
- Under the next two headings, I try to summarize the paper, though of course
- only as far as I have understood it and consider it relevant to the program.
-
-
- I. Related works
-
- A number of works published since the late 1940s give reason to assume
- that zero crossings of a speech signal contain sufficient information to
- allow the discrimination of phonemes, or at least words:
-
- -- Licklider & Pollack (1948) show that clipped speech remains
- understandable. In SBRECOG the amplitudes of the individual samples are
- clipped at the value of |1|, i.e. the signal is reduced to 1 bit.
- -- Chang, Pihl & Essigmann (1951) examine how the densities of zero
- crossings and extrema (rho0 and rho0') are related to the first and
- second formant in voiced sounds
- -- Peterson (1951) shows that the formant values in the spectrum of vowels
- are proportional to rho0 and rho0'
- -- Chang, Pihl & Wiren (1952) introduce the "intervalgram", a graphical
- representation of intervals between zero crossings
- -- Kirstein (1971) talks about "Kumulanten" ("cumulants"), characteristic
- concentrations of intervals (horizontal lines in the intervalgram)
-
- Kirstein also quotes the rather pessimistic Burghard & Hess (1971), who
- conclude that zero crossing interval distributions do not allow the
- discrimination of vowels.
-
-
- II. Problem and method
-
- Windows of common sizes such as 10 or 20 ms are too narrow to give a
- stable "view" of a speech signal; the distributions found are not
- significant. That is why whole word utterances are chosen as the subject
- of study.
-
- -- the signal s(t) is clipped to a square signal _s_(t)=c*sgn(s(t))
- -- the zero crossing intervals are collected
- -- their distribution is examined, i.e. one counts how many intervals
- have the size i, how many the size i*2, and so on
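-
- The three steps above can be sketched in C roughly as follows. This is
- a minimal illustration, not the SBRECOG source: the names clip,
- count_intervals and MAX_INTERVAL are made up for the example, intervals
- are measured in whole samples, and both the positive and the negative
- half-waves are counted (Kirstein, as described below, watched the
- positive part only).
-

```c
#include <assert.h>
#include <stddef.h>

#define MAX_INTERVAL 200  /* longest interval counted, in samples (assumed) */

/* Clip one sample to its sign: +1 for s >= 0, -1 otherwise (1 bit). */
static int clip(int s)
{
    return s >= 0 ? 1 : -1;
}

/*
 * Count zero-crossing intervals in samples[0..n-1].
 * hist[k] receives the number of intervals that lasted k samples
 * (1 <= k <= MAX_INTERVAL); hist must have MAX_INTERVAL + 1 elements.
 * Longer intervals are dropped. Returns the number of intervals counted.
 */
size_t count_intervals(const int *samples, size_t n, unsigned hist[])
{
    size_t counted = 0, run = 0, i;
    int seen_crossing = 0;

    assert(samples != NULL && hist != NULL);
    for (i = 0; i <= MAX_INTERVAL; i++)
        hist[i] = 0;

    for (i = 1; i < n; i++) {
        run++;
        if (clip(samples[i]) != clip(samples[i - 1])) {
            /* the stretch before the first crossing is not a full interval */
            if (seen_crossing && run <= MAX_INTERVAL) {
                hist[run]++;
                counted++;
            }
            seen_crossing = 1;
            run = 0;
        }
    }
    return counted;
}
```

- The stretch before the first crossing and the one after the last
- crossing are ignored, because their true length is unknown.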
-
- Kirstein has his PDP-15 examine the signal in real time; to reduce the
- necessary computation he watches the positive part of the signal only.
- This way he reaches a sampling frequency of 32 kHz. He admits that the
- speech signal is "not at all symmetrical to the zero line", but thinks the
- results are usable anyway.
-
- The smallest interval that can be measured (at the resulting time resolution)
- is 31.6 mu-s; the biggest that gets counted is 6.3 ms. Thus there are
- 200 possible intervals; taking each interval as half a period, they
- stretch over a frequency range of 79..15,823 Hz.
- These 200 intervals are classified into 16 classes; the idea is that
- one class stretches over the bandwidth of about one formant.
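-
- The quoted frequency range can be reproduced if an interval of n time
- steps is read as half a period, i.e. f = 1 / (2 * n * 31.6 mu-s). The
- small C sketch below does this arithmetic; TICK_US and interval_to_hz
- are names invented for the example.
-

```c
#include <assert.h>

#define TICK_US 31.6  /* smallest measurable interval in microseconds */

/* Frequency in Hz of a tone whose half period lasts `ticks` time steps. */
double interval_to_hz(int ticks)
{
    assert(ticks > 0);
    return 1e6 / (2.0 * ticks * TICK_US);
}
```

- With these numbers, interval_to_hz(1) comes out at roughly 15.8 kHz and
- interval_to_hz(200) at roughly 79 Hz, the two ends of the range above.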
-
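- Here is how Kirstein assigned intervals to the 16 classes (the mu-s
- values, where given, are the lower bounds of the classes in microseconds):
-
-   class   intervals   mu-s
-     1        1          31
-     2        2
-     3        3
-     4        4
-     5        5         158
-     6        6
-     7        7
-     8        8
-     9       9-10       284
-    10      11-12       347
-    11      13-15       410
-    12      16-19       505
-    13      20-25       632
-    14      26-38       821
-    15      39-78      1232
-    16      79-200     2496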
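-
- Kirstein's assignment can be written as a small lookup table. The C
- sketch below uses the class boundaries from the table above; the
- function names interval_class and fold_into_classes are invented for
- this example and do not come from SBRECOG (which, as explained in
- section III, uses 64 interval sizes and its own classification).
-

```c
#include <assert.h>
#include <stddef.h>

#define NCLASSES 16

/* Upper bound (inclusive, in time steps) of each of Kirstein's classes. */
static const int class_upper[NCLASSES] = {
    1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 15, 19, 25, 38, 78, 200
};

/* Return the class 1..16 of an interval, or 0 if it is out of range. */
int interval_class(int interval)
{
    int c;

    if (interval < 1)
        return 0;
    for (c = 0; c < NCLASSES; c++)
        if (interval <= class_upper[c])
            return c + 1;
    return 0;
}

/* Fold a histogram hist[1..200] into counts per class, vec[0..15]. */
void fold_into_classes(const unsigned hist[], unsigned vec[NCLASSES])
{
    int k;

    assert(hist != NULL && vec != NULL);
    for (k = 0; k < NCLASSES; k++)
        vec[k] = 0;
    for (k = 1; k <= class_upper[NCLASSES - 1]; k++)
        vec[interval_class(k) - 1] += hist[k];
}
```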
-
- The signal durations turned out to vary significantly between speakers
- and even between different productions by one speaker. As the number of
- zero crossings of course varies with the length of a signal, relative
- frequencies must be calculated.
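-
- Turning class counts into relative frequencies is a simple
- normalisation. A minimal C sketch (relative_frequencies is a
- hypothetical name, and returning -1 for an empty count vector is a
- choice made only for this example):
-

```c
#include <assert.h>
#include <stddef.h>

/*
 * Convert counts[0..n-1] to relative frequencies in rel[0..n-1].
 * Returns 0 on success, -1 if there were no counts (rel is zeroed).
 */
int relative_frequencies(const unsigned counts[], double rel[], size_t n)
{
    unsigned long total = 0;
    size_t i;

    assert(counts != NULL && rel != NULL);
    for (i = 0; i < n; i++)
        total += counts[i];
    for (i = 0; i < n; i++)
        rel[i] = total ? (double)counts[i] / (double)total : 0.0;
    return total ? 0 : -1;
}
```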
-
- What we have at this point is a 16-dimensional vector representing each
- word. Kirstein examines a number of statistical methods for measuring
- the similarity of two vectors. The one yielding the best results in his
- study is based on a contingency matrix. The method is similar to that
- employed by information theorists to calculate the "information
- transmission rate" or "Transinformation" (Meyer-Eppler, 1969).
-
- The formula combines input entropy + output entropy - overall entropy
- to calculate a measure for the transmitted information:
-
-   T = Sum_{i=1..r} Sum_{j=1..c} p_ij * log_2( p_ij / (p_i. * p_.j) )
-
- where r and c are the numbers of rows and columns of the matrix (r: number
- of vectors to be compared, c: dimension of the vectors), p_ij are the
- matrix cells, p_i. the row sums and p_.j the column sums.
-
- Kirstein decides to smooth out the vectors (by averaging each element
- with its weighted nearest neighbours). This turned out to be disastrous
- in my implementation, so I left out the smoothing.
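-
- For readers who want to experiment with the smoothing anyway, a
- three-point weighted average might look like the C sketch below. The
- weights 1/4, 1/2, 1/4, the edge handling and the name smooth3 are all
- assumptions made for this example; Kirstein's actual weights are not
- given here.
-

```c
#include <assert.h>
#include <stddef.h>

/*
 * Smooth in[0..n-1] into out[0..n-1] with a three-point weighted average.
 * The weights 1/4, 1/2, 1/4 are an assumption. At the edges the missing
 * neighbour is replaced by the element itself. in and out must not overlap.
 */
void smooth3(const double in[], double out[], size_t n)
{
    size_t i;

    assert(in != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        double left  = (i > 0)     ? in[i - 1] : in[i];
        double right = (i + 1 < n) ? in[i + 1] : in[i];
        out[i] = 0.25 * left + 0.5 * in[i] + 0.25 * right;
    }
}
```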
-
-
- III. About my implementation
-
- My main interest was voice recognition in the telephone network, thus
- I had to make do with a smaller bandwidth and a sampling rate of around
- 11 kHz. This is why the number of possible interval sizes is reduced
- to 64 instead of Kirstein's 200 (see the related comments in the code).
- Although the classification of these intervals, which must eventually
- yield the 16-dimensional vector, is quite crucial for the performance
- of this method, I must admit I did it quite ad hoc: I printed out a
- couple of matrices and decided that they looked characteristic enough...
-
- The performance of my program of course changes considerably with
- different CPU speeds, as the sampling frequency is not constant across
- machines. If you do not achieve satisfactory results, try changing the
- #define value of CPUSPEED to the clock rate of your machine, or lower.
- I didn't test the program on machines other than 286s, so given the
- quite different CPU designs you may have to set CPUSPEED to a value that
- doesn't match that of your computer at all... The playback rate (which
- you observe during the training of words) is no clue here, as the
- recognition of course depends only on the recording speed. Just fiddle
- around with these things a bit.
-
- The "user interface" of the program is so primitive that you will master
- it without my explaining it here. Just note that there are basically
- two ways of improving the recognition of a test set:
- -- You can have multiple dictionary entries for different realisations
- of one word. You may want to attach different ID strings to the
- dictionary entries (like "bravo_1", "bravo_2", "bravo_fast",
- "bravo_slow"...), so that you can see how often each of the entries
- is picked by the program.
- -- Or you can have the parameter vectors in the dictionary calculated
- as the average of two or more realisations (the program supports
- two only). This is what the program means by asking "Would you like
- another test set to be averaged with the set entered".
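-
- Averaging two realisations amounts to an element-wise mean of their
- parameter vectors. A minimal C sketch, with average_vectors as a
- hypothetical name that does not appear in the program:
-

```c
#include <assert.h>
#include <stddef.h>

/* Element-wise mean of two parameter vectors a and b of length n. */
void average_vectors(const double a[], const double b[],
                     double avg[], size_t n)
{
    size_t i;

    assert(a != NULL && b != NULL && avg != NULL);
    for (i = 0; i < n; i++)
        avg[i] = 0.5 * (a[i] + b[i]);
}
```

- If both inputs are vectors of relative frequencies, the result is
- again a vector of relative frequencies, since its elements still sum
- to 1.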
-
- The Sound Blaster interface "direct.obj" was written by Joel Lucsy of
- Vroom Diggy Diggy Software and is part of a freeware package, "Blast".
- I am including only the Blast files necessary to compile my demo. If
- you want to use the package for your own programs, I suggest you let
- Archie search your favourite FTP servers for it.
-
- Why am I publishing this demo program? I would like to see people
- starting further experiments inspired by the ideas presented here.
- The material is free to use and share. I hope you may feel somewhat
- obliged to make your enhancements and applications free software, too.
-
- If you have any further questions or comments, you can contact me
- by electronic mail at
- kiehl@ldv01.uni-trier.de
- or by conventional mail at
-     until 06-31-1993         from 07-01-1993
-     Johannes Kiehl           Johannes Kiehl
-     Postfach 2441            Postfach 2441
-     D - W 5500 Trier         D - 54214 Trier
-