TechTalks from event: IEEE Workshop on Applications of Signal Processing to Audio and Acoustics

Source Separation and Localization

  • Binaural Detection of Speech Sources in Complex Acoustic Scenes Authors: Tobias May, Steven van de Par, University of Oldenburg, and Armin Kohlrausch, Eindhoven University of Technology and Philips Research
    In this paper, we present a novel system that simultaneously localizes and detects a predefined number of speech sources in complex acoustic scenes based on binaural signals. The system operates in two steps: first, a binaural front-end analyzes the acoustic scene and detects relevant sound source activity; second, a speech detection module selects, from a set of candidate positions, those source positions most likely to correspond to speech. The proposed method is evaluated in simulated multi-source scenarios consisting of two speech sources, three interfering noise sources, and reverberation. (A minimal binaural-cue sketch appears after this list.)
  • Supervised Source Localization using Diffusion Kernels Authors: Ronen Talmon, Israel Cohen, Technion - Israel Institute of Technology, and Sharon Gannot, Bar-Ilan University
    Recently, we introduced a method to recover the controlling parameters of linear systems using diffusion kernels. In this paper, we apply our approach to the problem of source localization in a reverberant room using measurements from a single microphone. Prior recordings of signals from various known locations in the room are required for training and calibration. The proposed algorithm relies on the computation of a diffusion kernel with a specially tailored distance measure. Experimental results in a real reverberant environment demonstrate accurate recovery of the source location. (A generic diffusion-map sketch appears after this list.)
  • Optimal 3-D HOA Encoding with Applications in Improving Close-Spaced Source Localization Authors: Haohai Sun and U. Peter Svensson, Norwegian University of Science and Technology
    In this paper, an optimal three-dimensional higher-order Ambisonics (3-D HOA) encoding method is introduced that makes it possible to impose spatial stop-bands in the directivity patterns of all the spherical harmonics while keeping the transformed audio channels compatible with the 3-D HOA reproduction format. This might be useful as a post-processing technique for suppressing interfering signals from specific directions in a 3-D HOA recording. The method is adapted from recent work on the optimization of spherical microphone array beamforming. The new spherical harmonics decomposition approach also improves the resolution of source localization for closely spaced sources. Numerical simulations and experimental results are used to evaluate the proposed method. (A plain HOA-encoding sketch appears after this list.)
  • Gaussian modeling of mixtures of non-stationary signals in the time-frequency domain (HR-NMF) Authors: Roland Badeau, Telecom ParisTech
    Nonnegative Matrix Factorization (NMF) is a powerful tool for decomposing mixtures of non-stationary signals in the Time-Frequency (TF) domain. However, unlike the High Resolution (HR) methods dedicated to mixtures of exponentials, its spectral resolution is limited by that of the underlying TF representation. In this paper, we propose a unified probabilistic model called HR-NMF that overcomes this limit by taking both phases and local correlations in each frequency band into account. The model is estimated with a recursive implementation of the EM algorithm and is successfully applied to source separation and audio inpainting. (A plain-NMF sketch appears after this list.)
  • Informed source separation: source coding meets source separation Authors: Alexey Ozerov, INRIA, Centre de Rennes - Bretagne Atlantique, Antoine Liutkus, Roland Badeau, and Gaël Richard, Telecom ParisTech
    We consider the informed source separation (ISS) problem where, given the sources and the mixtures, any kind of side-information can be computed during a so-called encoding stage. This side-information is then used to assist source separation, given the mixtures only, at the so-called decoding stage. State-of-the-art ISS approaches do not treat ISS as a coding problem; they rely on purely separation-inspired strategies, so their performance can at best reach that of oracle estimators. On the other hand, classical source coding strategies are not optimal either, since they do not benefit from the availability of the mixture. We introduce a general probabilistic framework called coding-based ISS (CISS) that quantizes the sources using a posterior source distribution of the kind usually employed in probabilistic model-based source separation. CISS thus benefits from both source coding, thanks to the source quantization, and source separation, thanks to the posterior distribution that depends on the mixture. Our experiments show that, at all rates, CISS based on a particular model considerably outperforms both the conventional ISS approach and the source coding approach based on the same model. (A toy posterior-coding sketch appears after this list.)
  • On the Disjointness of Sources in Music using Different Time-Frequency Representations Authors: Dimitrios Giannoulis, Daniele Barchiesi, Anssi Klapuri, and Mark Plumbley, Queen Mary University of London
    This paper studies the disjointness of the time-frequency representations of simultaneously playing musical instruments. As a measure of disjointness, we use the approximate W-disjoint orthogonality proposed by Yilmaz and Rickard [1], which (loosely speaking) measures the degree of overlap between different sources in the time-frequency domain. The motivation for this study is to find a maximally disjoint representation in order to facilitate the separation and recognition of musical instruments in mixture signals. The transforms investigated in this paper include the short-time Fourier transform (STFT), the constant-Q transform, the modified discrete cosine transform (MDCT), and pitch-synchronous lapped orthogonal transforms. Simulation results are reported for a database of polyphonic music for which the multitrack data (the instrument signals before mixing) were available. Absolute performance varies with the instrument source in question, but on average the MDCT with a 93 ms frame size performed best. (A minimal W-disjoint-orthogonality computation appears after this list.)
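
For the binaural detection talk above, a minimal sketch of one standard binaural cue: the interaural time difference (ITD) estimated with a PHAT-weighted generalized cross-correlation. It only illustrates the kind of cue a binaural front-end can build on; the function name and the 1 ms ITD search range are assumptions, not the authors' actual front-end.

    import numpy as np

    def gcc_phat_itd(left, right, fs, max_itd=1e-3):
        # Interaural time difference between left/right ear signals via
        # PHAT-weighted generalized cross-correlation (illustrative only).
        n = 2 * max(len(left), len(right))
        spec = np.fft.rfft(left, n) * np.conj(np.fft.rfft(right, n))
        spec /= np.abs(spec) + 1e-12                    # PHAT weighting
        cc = np.fft.irfft(spec, n)
        max_lag = int(fs * max_itd)                     # restrict to plausible ITDs
        cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
        return (np.argmax(np.abs(cc)) - max_lag) / fs   # ITD in seconds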
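
For the diffusion-kernel localization talk, a generic diffusion-map embedding with a Gaussian kernel over plain Euclidean distances. The talk's contribution lies in a specially tailored distance measure and in the supervised mapping from embedding coordinates to source location, neither of which is reproduced here; the feature matrix and kernel bandwidth are assumptions.

    import numpy as np

    def diffusion_coordinates(X, epsilon, n_coords=2):
        # X: (n_samples, n_features) feature vectors extracted from recordings.
        sq = np.sum(X ** 2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)  # squared Euclidean distances
        K = np.exp(-np.maximum(d2, 0.0) / epsilon)        # Gaussian diffusion kernel
        P = K / K.sum(axis=1, keepdims=True)              # row-stochastic Markov matrix
        vals, vecs = np.linalg.eig(P)
        order = np.argsort(-vals.real)[1:n_coords + 1]    # skip the trivial eigenvector
        return (vecs[:, order] * vals[order]).real        # diffusion coordinates

In a supervised setting such as the one described, recordings from known positions would be embedded this way, and a new recording's coordinates matched to their nearest labeled neighbors.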
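
For the 3-D HOA encoding talk, a sketch of plain (non-optimized) HOA encoding of a single plane-wave source with complex spherical harmonics, i.e. the baseline representation on which direction-dependent stop-bands would be imposed. The conjugate complex-SH convention used below is only one of several in use (real-SH N3D or SN3D formats differ by fixed scalings), and nothing here implements the talk's optimization.

    import numpy as np
    from scipy.special import sph_harm

    def hoa_encode(signal, azimuth, colatitude, order):
        # Weight the source signal by the conjugate spherical harmonics at the
        # source direction; yields (order + 1)**2 ambisonic channels.
        channels = []
        for n in range(order + 1):
            for m in range(-n, n + 1):
                # SciPy convention: sph_harm(m, n, azimuth, colatitude)
                channels.append(np.conj(sph_harm(m, n, azimuth, colatitude)) * signal)
        return np.stack(channels)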
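
For the HR-NMF talk, a sketch of the standard NMF baseline it extends: multiplicative updates for the generalized Kullback-Leibler divergence applied to a magnitude spectrogram. HR-NMF additionally models phases and local correlations and is estimated with an EM algorithm, none of which appears below; the component count and iteration budget are assumptions.

    import numpy as np

    def nmf_kl(V, n_components, n_iter=200, eps=1e-10, seed=0):
        # V: nonnegative magnitude spectrogram (freq x time); W @ H approximates V,
        # with columns of W as spectral templates and rows of H as activations.
        rng = np.random.default_rng(seed)
        F, T = V.shape
        W = rng.random((F, n_components)) + eps
        H = rng.random((n_components, T)) + eps
        ones = np.ones_like(V)
        for _ in range(n_iter):
            W *= ((V / (W @ H + eps)) @ H.T) / (ones @ H.T)
            H *= (W.T @ (V / (W @ H + eps))) / (W.T @ ones)
        return W, H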
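
For the informed source separation talk, a toy scalar-Gaussian illustration of the central idea, namely coding a source relative to a posterior that depends on the mixture instead of coding it blindly. The Wiener-style posterior mean, the uniform quantizer, and the fixed variances are assumptions made for illustration; the talk's actual source models, rate allocation, and entropy coding are not reproduced.

    import numpy as np

    def encode_residual(source, mixture, var_s, var_other, step):
        # Posterior mean of the source given the mixture under a Gaussian model
        # (Wiener gain), then uniform quantization of the deviation from it.
        posterior_mean = (var_s / (var_s + var_other)) * mixture
        return np.round((source - posterior_mean) / step).astype(int)

    def decode_residual(indices, mixture, var_s, var_other, step):
        # The decoder knows the mixture, so it can rebuild the same posterior
        # mean and add back the dequantized residual.
        posterior_mean = (var_s / (var_s + var_other)) * mixture
        return posterior_mean + indices * step

Because the residual around the posterior mean has a much smaller variance than the source itself, it can be coded with fewer bits at the same distortion, which is the benefit the framework draws from knowing the mixture.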
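
For the disjointness study, a minimal computation of approximate W-disjoint orthogonality in the spirit of Yilmaz and Rickard [1], applied to a target source and the sum of its interferers in any time-frequency transform. The 0 dB binary mask and the normalization follow the common definition WDO = PSR - PSR / SIR; treat the details as an assumption rather than the paper's exact measurement code.

    import numpy as np

    def approximate_wdo(S, Y):
        # S: time-frequency coefficients of the target source, Y: of the summed
        # interferers (same transform, same shape). Returns (WDO, PSR, SIR).
        mask = np.abs(S) > np.abs(Y)                    # 0 dB binary mask
        kept_target = np.sum(np.abs(mask * S) ** 2)
        kept_interf = np.sum(np.abs(mask * Y) ** 2)
        psr = kept_target / np.sum(np.abs(S) ** 2)      # preserved-signal ratio
        sir = kept_target / max(kept_interf, 1e-12)     # signal-to-interference ratio
        return psr - psr / sir, psr, sir                # WDO = PSR - PSR / SIR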