Speech Enhancement Tutorial - Enhancement Methods

3.14 Speech Model Enhancement

An early technique in the speech enhancement class of de-reverberation methods, which has attracted little recent interest, was described in Oppenheim et al. [1968] and Oppenheim and Schafer [1975]. The authors observe that simple echoes produce distinct peaks in the cepstrum of the speech signal. Consequently, they use a peak-picking algorithm to identify such peaks and attenuate them. They also consider an alternative approach in which a low-pass weighting function is applied to the cepstrum, on the assumption that most of the speech energy lies at the lower quefrencies. However, this approach was not found suitable for more complex reverberation models.
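The cepstral idea can be illustrated for the simplest case of a single echo. The sketch below is not the original algorithm: it assumes the echo delay is already known (Oppenheim et al. locate the corresponding cepstral peak with a peak-picking step), notches out that peak in the real cepstrum, and resynthesises the signal with the observed phase.

```python
import numpy as np

def lifter_out_echo(x, echo_delay, notch_width=2):
    # Real cepstrum of the observed signal.
    X = np.fft.rfft(x)
    ceps = np.fft.irfft(np.log(np.abs(X) + 1e-12), n=len(x))
    # Notch out the echo peak and its symmetric partner.
    for q in (echo_delay, len(x) - echo_delay):
        ceps[q - notch_width:q + notch_width + 1] = 0.0
    # Back to a log-magnitude spectrum; keep the observed phase.
    liftered = np.fft.rfft(ceps).real
    return np.fft.irfft(np.exp(liftered) * np.exp(1j * np.angle(X)), n=len(x))

def real_cepstrum(sig):
    return np.fft.irfft(np.log(np.abs(np.fft.rfft(sig)) + 1e-12), n=len(sig))

# Demo: a single echo at 100 samples (made circular for a clean illustration).
rng = np.random.default_rng(0)
s = rng.standard_normal(1024)
delay = 100
x = s + 0.5 * np.roll(s, delay)
y = lifter_out_echo(x, delay)
```

The echo contributes a ripple to the log spectrum whose cepstral peak sits at the echo delay; zeroing a small neighbourhood around it removes that ripple while leaving the rest of the spectral envelope intact.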

A class of techniques emerged from the observation that the residual signal following linear prediction analysis contains peaks corresponding to the excitation events in voiced speech, together with additional peaks due to the reverberant channel. Several methods for processing the linear prediction (LPC) residual have subsequently been developed. These aim to suppress the additional peaks due to reverberation without degrading the original components of the residual, so that de-reverberated speech can be synthesised from the processed residual and the all-pole filter obtained by prediction analysis of the reverberant speech.
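The analysis/synthesis structure these methods share can be sketched as follows, using the autocorrelation method of linear prediction on a synthetic all-pole signal. The helper functions and parameters here are illustrative, not taken from any of the cited papers.

```python
import numpy as np

def lpc(x, order):
    # Autocorrelation-method linear prediction: solve the Yule-Walker
    # normal equations for the predictor coefficients.
    r = np.correlate(x, x, mode='full')[len(x) - 1:len(x) + order]
    idx = np.arange(order)
    R = r[np.abs(idx[:, None] - idx[None, :])]   # Toeplitz autocorrelation matrix
    return np.concatenate(([1.0], np.linalg.solve(R, -r[1:])))

def allpole(a, e):
    # Direct-form all-pole synthesis filter 1/A(z).
    y = np.zeros_like(e)
    for n in range(len(e)):
        y[n] = e[n] - sum(a[k] * y[n - k] for k in range(1, len(a)) if n >= k)
    return y

# Synthetic "speech": white excitation through a fixed all-pole filter.
rng = np.random.default_rng(1)
excitation = rng.standard_normal(4000)
speech = allpole(np.array([1.0, -0.7, 0.2]), excitation)

a = lpc(speech, order=12)
residual = np.convolve(speech, a)[:len(speech)]   # inverse (analysis) filter A(z)
resynth = allpole(a, residual)                    # all-pole synthesis 1/A(z)
```

The methods above operate on `residual` before the synthesis step; passing it through the all-pole filter unchanged reconstructs the analysed signal exactly, which is what makes residual-domain processing attractive.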

It is assumed that the effect of reverberation on the autoregressive (AR) coefficients is negligible (Gaubitch et al. [2006]). Yegnanarayana and Satyanarayana [2000] provided a comprehensive study of the prediction residual of reverberant speech. They demonstrate that reverberation affects the prediction residual differently in different speech segments, depending on the energy in the signal and on whether a segment is voiced or unvoiced. Motivated by these observations, the authors proposed a regional weighting function based on the signal-to-reverberant ratio (SRR) in each region, together with a global weighting function derived from the short-term signal energy. The SRR-based weightings were derived using the entropy function and the normalised error.
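As a rough illustration of a global weighting of this kind, the sketch below derives a frame-wise gain from short-term energy alone and applies it to the residual. This is a simplified stand-in: the actual weights of Yegnanarayana and Satyanarayana [2000] are derived from the entropy function and the normalised error, not from raw energy.

```python
import numpy as np

def energy_weight(residual, frame_len=160, floor=0.2):
    # Frame-wise short-term energy, normalised to [0, 1].
    n_frames = len(residual) // frame_len
    frames = residual[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.mean(frames ** 2, axis=1)
    w = energy / (energy.max() + 1e-12)
    # A floor keeps low-energy (reverberation-dominated) regions audible
    # rather than muting them outright.
    return np.repeat(np.maximum(w, floor), frame_len)

# Demo: a loud frame followed by a quiet, reverberation-dominated one.
res = np.concatenate([2.0 * np.ones(160), 0.1 * np.ones(160)])
w = energy_weight(res)
weighted = w * res[:len(w)]
```

High-energy frames, where the direct sound dominates, pass nearly unchanged; low-energy frames, where reverberant tails dominate, are attenuated towards the floor.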

An adaptive algorithm using a kurtosis-maximising sub-band adaptive filter was proposed in Gillespie et al. [2001]. The authors demonstrate that the kurtosis of the prediction residual decreases with increasing reverberation, as was also suggested in Yegnanarayana and Satyanarayana [2000]. They use this observation to derive an adaptive filter that maximises the kurtosis of the prediction residual.
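The underlying observation is easy to reproduce: a peaky, impulse-like residual has high kurtosis, and convolution with a diffuse reverberant tail drives it towards a Gaussian-like, low-kurtosis signal. A minimal demonstration, with an artificial impulse train and a synthetic decaying noise tail standing in for the residual and the room response (the adaptive filter itself, which climbs the kurtosis gradient, is not implemented here):

```python
import numpy as np

def excess_kurtosis(x):
    # Fourth moment over squared variance, minus 3 (zero for a Gaussian).
    x = x - np.mean(x)
    return np.mean(x ** 4) / (np.mean(x ** 2) ** 2 + 1e-12) - 3.0

rng = np.random.default_rng(2)
# Peaky stand-in for a clean voiced residual: one pulse per 80 samples.
clean = np.zeros(8000)
clean[::80] = 1.0
# Diffuse reverberation: an exponentially decaying noise tail smears the
# pulses and drives the signal towards Gaussianity.
rir = rng.standard_normal(2000) * np.exp(-np.arange(2000) / 400.0)
reverberant = np.convolve(clean, rir)[:len(clean)]
```

Because the clean residual maximises this statistic, an adaptive filter that increases the kurtosis of its output is pushed towards undoing the reverberant smearing.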

The filter is applied directly to the observed signal rather than to the prediction residual and thus avoids the need for an LPC synthesis stage. The adaptive filter is implemented in a multi-channel sub-band framework for increased efficiency. This approach is an example of a hybrid method between speech enhancement and blind system identification and inversion. An extension to this method was presented in Wu and Wang [2006], where it was combined with spectral subtraction to remove residual reverberation due to late reflections.
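A hedged sketch of the spectral subtraction stage for late reflections: the late-reverberant magnitude in each short-time Fourier transform (STFT) frame is modelled here as an attenuated copy of the observed magnitude a few frames earlier. Both the model and its parameters are illustrative and not taken from Wu and Wang [2006].

```python
import numpy as np

def subtract_late_reverb(mag, delay_frames=5, atten=0.4, floor=0.1):
    # Late-reverb magnitude modelled as an attenuated copy of the
    # observed magnitude delay_frames earlier (illustrative model).
    est = np.zeros_like(mag)
    est[delay_frames:] = atten * mag[:-delay_frames]
    # Subtract, with a spectral floor to limit musical noise.
    return np.maximum(mag - est, floor * mag)

# Demo on a random non-negative "STFT magnitude" array (frames x bins).
rng = np.random.default_rng(4)
mag = np.abs(rng.standard_normal((100, 257)))
out = subtract_late_reverb(mag)
```

The spectral floor is the usual spectral subtraction safeguard: it bounds the attenuation per time-frequency bin so that over-subtraction does not produce isolated residual peaks.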

In Gaubitch and Naylor [2007] a spatio-temporal averaging method is described. The peaks in the LPC residual due to glottal closure instants are identified, and enhanced glottal cycles are obtained by temporal averaging of neighbouring cycles. Each processed glottal cycle is used to update an equalisation filter that is applied during unvoiced speech regions. The LPC-based methods described above all achieve moderate reverberation reduction when a single microphone is used.
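The temporal averaging step can be sketched under the simplifying assumption of a constant pitch period, so that cycles can be stacked directly; Gaubitch and Naylor segment at detected glottal closure instants and handle varying cycle lengths.

```python
import numpy as np

def average_cycles(residual, period):
    # Stack fixed-length cycles and replace each one by the mean of
    # itself and its immediate neighbours.
    n = len(residual) // period
    cycles = residual[:n * period].reshape(n, period)
    out = np.empty_like(cycles)
    for i in range(n):
        out[i] = cycles[max(0, i - 1):min(n, i + 2)].mean(axis=0)
    return out.reshape(-1)

# Demo: a periodic pulse train (the repeating glottal cycle) corrupted by
# additive noise standing in for uncorrelated reverberant peaks.
rng = np.random.default_rng(3)
period = 80
pulse = np.zeros(period)
pulse[0] = 1.0
clean = np.tile(pulse, 50)
noisy = clean + 0.1 * rng.standard_normal(len(clean))
smoothed = average_cycles(noisy, period)
```

Since neighbouring glottal cycles are highly correlated while the reverberant disturbances are not, averaging preserves the cycle shape and suppresses the uncorrelated peaks.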

A related method, proposed by Nakatani et al. [2005], assumes a sinusoidal speech model. First, the fundamental frequency of the speech signal is identified from the reverberant observations; the remaining sinusoidal components are then identified. Using the magnitudes and phases of these sinusoids, an enhanced speech signal is synthesised. Subsequently, the reverberant and de-reverberated speech signals are used to derive an equivalent equalisation filter. The processing is performed in short frames and the inverse filter is updated in each frame. It is shown that this inverse filter tends towards the room impulse response (RIR) equalisation filter; however, the filter is very long and requires over an hour of training.
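The harmonic analysis/synthesis at the core of this method can be sketched as a least-squares fit of harmonic amplitudes and phases, assuming the fundamental frequency is already known (Nakatani et al. estimate it from the reverberant observations, and their subsequent equalisation-filter derivation is not shown here).

```python
import numpy as np

def harmonic_fit(x, f0, fs, n_harm):
    # Least-squares fit of the amplitudes/phases of f0 and its harmonics,
    # then resynthesis from the fitted sinusoids.
    t = np.arange(len(x)) / fs
    basis = []
    for k in range(1, n_harm + 1):
        basis.append(np.cos(2 * np.pi * k * f0 * t))
        basis.append(np.sin(2 * np.pi * k * f0 * t))
    A = np.stack(basis, axis=1)
    coef, *_ = np.linalg.lstsq(A, x, rcond=None)
    return A @ coef

# Demo: two harmonics of a 100 Hz fundamental sampled at 8 kHz.
fs = 8000
t = np.arange(800) / fs
x = 0.8 * np.cos(2 * np.pi * 100 * t + 0.3) + 0.3 * np.cos(2 * np.pi * 200 * t - 1.0)
y = harmonic_fit(x, 100.0, fs, n_harm=3)
```

For a frame that really is harmonic, the fit recovers the sinusoidal components exactly; on reverberant speech the fitted sinusoids form the enhanced frame from which the equalisation filter is derived.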
