### Speech Enhancement Tutorial - Enhancement Methods

### 3.2 Frame Based Processing

Enhancement methods that operate in the frequency domain normally require the sampled input signal x(n) to be decomposed into overlapping frames, with the *i*th frame given by

$$x_i(n) = w(n)\,x(iM + n)$$

for n = 0, ..., N − 1, where w(n) is a windowing function with finite support and M ≤ N is the time increment between successive frames (in samples). The window length, N, is a compromise between frequency and time resolution and is typically chosen in the range 10–30 ms for speech signals, resulting in a frequency resolution of around 50 Hz (the reciprocal of a 20 ms window). The frame increment, M, is most commonly set to N/2 although, as noted below, there are theoretical reasons for using M = N/4 despite its higher computational cost.
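The framing step can be sketched in a few lines of Python. This is only an illustration, assuming the usual framing x_i(n) = w(n) x(iM + n); the 16 kHz sample rate, the periodic Hann window, the test tone, and the helper names `hann` and `split_into_frames` are assumptions made for the example and do not come from the tutorial.

```python
import math

def hann(N):
    """Periodic Hann window of length N (one possible choice of w(n))."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / N) for n in range(N)]

def split_into_frames(x, w, M):
    """Return the frames x_i(n) = w(n) * x(i*M + n) for n = 0..N-1, with N = len(w).

    Only complete frames are returned; trailing samples that do not fill
    a whole frame are ignored.
    """
    N = len(w)
    num_frames = 1 + (len(x) - N) // M if len(x) >= N else 0
    return [[w[n] * x[i * M + n] for n in range(N)] for i in range(num_frames)]

fs = 16000                  # assumed sample rate in Hz
N = fs * 20 // 1000         # 20 ms window expressed in samples
M = N // 2                  # half-overlap frame increment, M = N/2
resolution_hz = fs / N      # reciprocal of the window duration

# 100 ms of a 440 Hz tone as a stand-in for a speech signal
x = [math.sin(2 * math.pi * 440 * n / fs) for n in range(fs // 10)]
frames = split_into_frames(x, hann(N), M)

print(f"N = {N} samples, M = {M} samples, resolution = {resolution_hz} Hz")
print(f"{len(frames)} frames of {len(frames[0])} samples each")
```

Choosing M = N // 4 instead gives roughly twice as many frames for the same signal, which is the higher computational cost mentioned above.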

A Fourier transform is normally performed on each frame to obtain the Short-Time Fourier Transform (STFT). If no processing is done on the frame spectra, the original time-domain signal can be reconstructed exactly by overlap-add. However, when frequency-domain processing is performed on the frames, distortion artifacts may be introduced due to signal discontinuities at frame boundaries and aliasing of rapidly changing spectral coefficients. The reconstruction properties can be controlled by the choice of the windowing function and the ratio M : N.
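One narrow aspect of how the window and the ratio M : N affect reconstruction can be checked numerically: whether shifted copies of the window sum to a constant, a property often called constant overlap-add. The sketch below is an assumption-laden illustration rather than anything taken from the tutorial. It uses periodic Hann and Hamming windows, applies no synthesis window, and the helpers `overlap_add_gain` and `ripple` are hypothetical names.

```python
import math

def hann(N):
    """Periodic Hann window."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / N) for n in range(N)]

def hamming(N):
    """Periodic Hamming window."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / N) for n in range(N)]

def overlap_add_gain(w, M):
    """Sum of all window copies shifted by multiples of M, over one period of M samples."""
    N = len(w)
    return [sum(w[m] for m in range(n, N, M)) for n in range(M)]

def ripple(w, M):
    """Peak-to-peak variation of the overlap-add gain, relative to its mean.

    A value of (numerically) zero means plain overlap-add of unprocessed
    frames returns the input scaled by a constant.
    """
    g = overlap_add_gain(w, M)
    mean = sum(g) / len(g)
    return (max(g) - min(g)) / mean

N = 32
for name, w in (("hann", hann(N)), ("hamming", hamming(N))):
    for M in (N // 2, N // 4, 3 * N // 4):
        print(f"{name:8s} M = {M:2d}  relative ripple = {ripple(w, M):.3f}")
```

A ripple close to zero only says that unprocessed frames add back to a scaled copy of the input. It says nothing about the aliasing of modified spectral coefficients, which is a separate consideration.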

Martin and Cox [1999] suggest the use of half-overlapping (M = N/2) square-root Hanning windows for both analysis and synthesis in order to provide perfect reconstruction in the absence of any processing and, at the same time, to attenuate any frame-boundary discontinuities. An extensive discussion of these issues is given in Allen [1977] and Allen and Rabiner [1977] where it is shown that, for a Hamming analysis window, a three-quarters overlap (M = N/4) is needed in order to ensure that the spectral coefficients are sampled frequently enough to avoid aliasing.
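The half-overlapping square-root Hanning scheme can be tried out with a small round-trip test. The following is a minimal sketch and not the cited authors' code. It assumes a periodic Hann window, uses a deliberately naive DFT so that it needs only the standard library, and the function names and the `process` hook are hypothetical.

```python
import cmath
import math

def sqrt_hann(N):
    """Square root of a periodic Hann window, used for both analysis and synthesis."""
    return [math.sqrt(0.5 - 0.5 * math.cos(2 * math.pi * n / N)) for n in range(N)]

def dft(frame):
    """Naive O(N^2) discrete Fourier transform; adequate for small N."""
    N = len(frame)
    return [sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(spec):
    """Naive inverse DFT that returns only the real part."""
    N = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N)).real / N
            for n in range(N)]

def analyse_synthesise(x, N, process=lambda spec: spec):
    """Window, transform, optionally process, inverse transform, window, overlap-add."""
    M = N // 2                                   # half overlap, M = N/2
    w = sqrt_hann(N)
    y = [0.0] * len(x)
    for start in range(0, len(x) - N + 1, M):
        frame = [w[n] * x[start + n] for n in range(N)]   # analysis window
        out = idft(process(dft(frame)))                   # identity unless `process` is given
        for n in range(N):
            y[start + n] += w[n] * out[n]                 # synthesis window and overlap-add
    return y

N = 16
M = N // 2
x = [math.sin(0.3 * n) + 0.5 * math.cos(1.1 * n) for n in range(64)]
y = analyse_synthesise(x, N)

# Compare only samples covered by two frames
max_error = max(abs(y[n] - x[n]) for n in range(M, len(x) - M))
print(f"maximum interior reconstruction error: {max_error:.2e}")
```

The first and last M samples are covered by a single frame and therefore come out tapered, which is why the comparison is restricted to the interior. Because the synthesis window is applied after the inverse transform, any discontinuity that processing introduces at the frame edges is scaled towards zero there.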
