Auditory Signal Processing in Hardware

BIB
Brucke, Matthias and Schulz, Arne and Nebel, Wolfgang
A digital hardware implementation of an algorithm modeling the effective signal processing of the human auditory system is presented in this paper. The Medical Physics Group at the University of Oldenburg has been working in the field of psychoacoustical modeling, speech perception and processing, audiological diagnostics and digital hearing aids for several years. One particular aspect of this work is the development of a psychoacoustical preprocessing model and the demonstration of its applicability as a preprocessing algorithm for speech, for example in automatic speech recognition, objective speech quality measurement, noise reduction and digital hearing aids. The model describes the effective signal processing in the human auditory system and provides the appropriate internal representation of acoustic signals. It combines several stages of processing, simulating spectral properties of the human ear (spectral masking, frequency-dependent bandwidth of auditory filters) as well as dynamical effects (nonlinear compression of stationary or dynamic signals and temporal masking). A gammatone filter bank represents the first processing stage and models the frequency-place transformation on the basilar membrane. It is built up from 30 bandpass filters with center frequencies from 73 Hz to 6.7 kHz, equidistant on the ERB scale. The bandwidth of the filters grows with increasing center frequency. The output of each channel of the gammatone filter bank is halfwave rectified and lowpass filtered at 1 kHz to preserve the envelope of the signal for high carrier frequencies, reflecting the limited phase locking of auditory nerve fibers at higher frequencies. The output of this lowpass filter is then fed to the adaptation loops to account for dynamical effects such as nonlinear adaptive compression and temporal masking.
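The filter bank layout described above (30 channels from 73 Hz to 6.7 kHz, equidistant on the ERB scale, with bandwidth growing with center frequency) can be sketched in Python. The ERB-rate formulas, the 16 kHz sample rate, and the first-order lowpass are illustrative assumptions, not the paper's exact filter design:

```python
import numpy as np

# --- ERB-scale channel layout (standard ERB-rate formulas assumed) ---

def erb_rate(f_hz):
    """Map a frequency in Hz onto the ERB-rate scale."""
    return 21.4 * np.log10(4.37e-3 * f_hz + 1.0)

def inv_erb_rate(e):
    """Inverse mapping: ERB-rate back to Hz."""
    return (10.0 ** (e / 21.4) - 1.0) / 4.37e-3

def center_frequencies(n=30, f_lo=73.0, f_hi=6700.0):
    """n channels equidistant on the ERB scale, as in the paper."""
    return inv_erb_rate(np.linspace(erb_rate(f_lo), erb_rate(f_hi), n))

def erb_bandwidth(f_hz):
    """Equivalent rectangular bandwidth; grows with center frequency."""
    return 24.7 * (4.37e-3 * f_hz + 1.0)

# --- Per-channel envelope stage: halfwave rectify, then lowpass ---

def halfwave_rectify(x):
    return np.maximum(x, 0.0)

def lowpass(x, cutoff_hz=1000.0, fs=16000.0):
    """First-order IIR lowpass (simplified stand-in for the model's filter)."""
    a = np.exp(-2.0 * np.pi * cutoff_hz / fs)
    y = np.empty_like(x)
    acc = 0.0
    for i, v in enumerate(x):
        acc = a * acc + (1.0 - a) * v
        y[i] = acc
    return y
```

Laying the channels out on the ERB scale rather than linearly in Hz is what gives the narrow low-frequency filters and progressively wider high-frequency filters mentioned in the abstract.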
Due to the complexity of the model, it was partitioned and will be implemented on two chips: the binaural (stereo) gammatone filter bank, the halfwave rectification and the lowpass filter on chip1, and the monaural (mono) adaptation loops and another lowpass filter on chip2. The VLSI group at the computer science department of the University of Oldenburg is currently working on the implementation of chip1. Chip2 is being implemented by the IMA group at the computer science department of the University of Hamburg. A target environment will consist of three chips: one chip1 working binaurally and two chip2 working monaurally. A direct hardware implementation of the floating point software version of the model is not possible due to limitations of area and power consumption. The main problem when converting floating point arithmetic to fixed point arithmetic is determining the necessary numerical precision, which implies the wordlength of the internal number representation. The necessary internal wordlength for the gammatone filter bank can be assessed in a straightforward way because the filters are linear time-invariant systems, to which classical numerical measures such as SNR can be applied. The choice of a certain maximal square error leads directly to the necessary wordlength. For the realization of the adaptation stage this procedure is not applicable because the system is nonlinear and has a large possible dynamic range. One application of the model (objective speech quality measurement) was therefore used to determine the necessary internal precision. Observing the degradation of performance with decreasing internal precision showed that the internal wordlength can be reduced while performance stays almost the same. To validate the approach, a prototype of the design for chip1 was implemented on a Xilinx FPGA XC4062XL-2, including the input interface, the gammatone filter bank, halfwave rectification, lowpass filter and the output interface.
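The SNR-based wordlength assessment for the linear filter stage can be illustrated with a small sketch. The quantizer, the test signal, and the candidate wordlengths here are hypothetical; the point is only how quantization SNR scales with the number of fixed-point bits:

```python
import numpy as np

def quantize(x, bits, full_scale=1.0):
    """Round-to-nearest fixed-point quantization with saturation
    (hypothetical quantizer, not the chips' actual number format)."""
    step = full_scale / (2 ** (bits - 1))
    q = np.round(x / step) * step
    return np.clip(q, -full_scale, full_scale - step)

def quantization_snr_db(x, bits):
    """SNR of the quantized signal relative to the full-precision one."""
    err = x - quantize(x, bits)
    return 10.0 * np.log10(np.sum(x ** 2) / np.sum(err ** 2))

# Illustrative test signal within the fixed-point full scale.
rng = np.random.default_rng(0)
x = np.clip(0.5 * rng.standard_normal(10_000), -1.0, 1.0 - 2.0 ** -15)

snr_by_bits = {b: quantization_snr_db(x, b) for b in (8, 12, 16)}
```

Each additional bit buys roughly 6 dB of quantization SNR, so a chosen maximal square error translates directly into a minimum wordlength, which matches the straightforward procedure described for the linear gammatone stage.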
The design runs on an evaluation board containing the FPGA, a standard PCI interface and 2 MB of SRAM. Real-time sampling of the stereo input data is done by a sound card; the data is then transferred to the SRAM by DMA burst over the PCI bus. The output data of the filter bank (running on the FPGA) is stored in the SRAM again and transferred to the host CPU for further processing. More sophisticated speech processing tasks become possible because the FPGA executes a large part of the computation.
01 / 1999
inproceedings
FPL 1999, Field-Programmable Logic and Applications: Proceedings of the 9th International Workshop