TechTalks from event: IEEE Workshop on Applications of Signal Processing to Audio and Acoustics

Music Signal Analysis

  • Large-Scale Cover Song Recognition using Hashed Chroma Landmarks Authors: Thierry Bertin-Mahieux and Daniel Ellis, Columbia University
    Cover song recognition, also known as version identification, can only be solved by exposing the underlying tonal content of music. Apart from obvious applications in copyright enforcement, techniques for cover identification can also be used to find patterns and structure in music datasets too large for any musicologist to listen to even once. Much progress has been made on cover song recognition, but work to date has been reported on datasets of at most a few thousand songs, using algorithms that simply do not scale beyond the capacity of a small portable music player. In this paper, we consider the problem of finding covers in a database of a million songs, restricting ourselves to algorithms that can deal with such data. Using a fingerprinting-inspired model, we present the first results of cover song recognition on the Million Song Dataset. The availability of industrial-scale datasets to the research community presents a new frontier for version identification, and this work is intended to be the first step toward a practical solution. (An illustrative sketch of the landmark-hashing idea follows this session's listing.)
  • Optimizing the Mapping from a Symbolic to an Audio Representation for Music-to-Score Alignment Authors: Cyril Joder, Slim Essid, and Gaël Richard, Telecom ParisTech
    A key processing step in music-to-score alignment systems is the estimation of the instantaneous match between an audio observation and the score. We here propose a general formulation of this matching measure, using a linear transformation from the symbolic domain to any time-frequency representation of the audio. We investigate the learning of this mapping for several common audio representations, based on a best-fit criterion. We evaluate the effectiveness of our mapping approach with two different alignment systems, on a large database of popular and classical polyphonic music. The results show that the learning procedure significantly improves the precision of the alignments obtained, compared to common heuristic templates used in the literature. (A least-squares caricature of the learned mapping follows this session's listing.)
  • Polyphonic Pitch Tracking by Example Authors: Paris Smaragdis, University of Illinois / Adobe Systems
    We introduce a novel approach for pitch tracking of multiple sources in mixture signals. Unlike traditional approaches to pitch tracking, which explicitly attempt to detect periodicities, this approach uses a learning framework, drawing on previously pitch-tagged recordings as training data to learn spectrum/pitch associations. We show that the mixture case of this task is a nearest-subspace search problem, which can be solved efficiently by transforming it into an overcomplete sparse coding formulation. We demonstrate the use of this algorithm on real mixtures ranging from solo recordings up to quintets. (A toy dictionary-based version follows this session's listing.)
  • Scale-Invariant Probabilistic Latent Component Analysis Authors: Romain Hennequin, Roland Badeau, and Bertrand David, Telecom ParisTech
    In this paper, we present a new method for decomposing musical spectrograms. This method is similar to shift-invariant Probabilistic Latent Component Analysis, but whereas the latter works on constant-Q spectrograms (i.e. with a logarithmic frequency resolution), our technique is designed to decompose standard short-time Fourier transform spectrograms (i.e. with a linear frequency resolution). This makes it possible to easily reconstruct the latent signals (which can be useful for source separation).
  • A Temporally-constrained Convolutive Probabilistic Model for Pitch Detection Authors: Emmanouil Benetos and Simon Dixon, Queen Mary University of London
    A method for pitch detection which models the temporal evolution of musical sounds is presented in this paper. The proposed model is based on shift-invariant probabilistic latent component analysis, constrained by a hidden Markov model. The time-frequency representation of a produced musical note can be expressed by the model as a temporal sequence of spectral templates which can also be shifted over log-frequency. Thus, this approach can be effectively used for pitch detection in music signals that contain amplitude and frequency modulations. Experiments were performed using extracted sequences of spectral templates on monophonic music excerpts, where the proposed model outperforms a non-temporally constrained convolutive model for pitch detection. Finally, future directions are given for multipitch extensions of the proposed model.
  • Probabilistic Latent Tensor Factorization Framework for Audio Modeling Authors: Ali Taylan Cemgil, Umut Simsekli, and Yusuf Cem Subakan, Bogazici University
    This paper introduces probabilistic latent tensor factorization (PLTF) as a general framework for hierarchical modeling of audio. This framework combines the practical aspects of graphical modeling in machine learning with tensor factorization models. Once a model is constructed in the PLTF framework, the estimation algorithm is immediately available. We illustrate our approach using several popular models such as NMF or NMF2D and provide extensions, with simulation results on real data for key audio processing tasks such as restoration and source separation. (A textbook NMF instance, the simplest model in this family, is sketched after this session's listing.)
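
The sketches below are illustrative stand-ins, not reproductions of the authors' systems; all parameters and helper names in them are assumptions. First, the hashed chroma landmarks of Bertin-Mahieux and Ellis: a minimal fingerprinting-style pipeline picks prominent chroma peaks, pairs nearby peaks into transposition-invariant integer hashes, and matches songs through an inverted index. The peak threshold, pairing window, and hash layout here are illustrative choices, not the paper's.

```python
import numpy as np
from collections import defaultdict

def chroma_landmarks(chroma, threshold=0.8):
    """Pick prominent local maxima ("landmarks") in a 12 x T chroma matrix."""
    peaks = []
    for t in range(1, chroma.shape[1] - 1):
        for pc in range(12):
            v = chroma[pc, t]
            if v >= threshold and v >= chroma[pc, t - 1] and v > chroma[pc, t + 1]:
                peaks.append((t, pc))
    return peaks  # sorted by frame index

def landmark_hashes(peaks, max_dt=8):
    """Pair nearby landmarks into compact integer hashes.

    Hashing the pitch-class interval (pc2 - pc1) mod 12 instead of the
    absolute pitch classes makes the code transposition-invariant,
    which matters for covers played in a different key.
    """
    hashes = []
    for i, (t1, pc1) in enumerate(peaks):
        for t2, pc2 in peaks[i + 1:]:
            dt = t2 - t1
            if dt > max_dt:
                break  # peaks are time-ordered, so no later pair can fit
            hashes.append(((pc2 - pc1) % 12) * (max_dt + 1) + dt)
    return hashes

def build_index(songs):
    """songs: dict song_id -> chroma matrix. Returns hash -> set of song ids."""
    index = defaultdict(set)
    for sid, chroma in songs.items():
        for h in landmark_hashes(chroma_landmarks(chroma)):
            index[h].add(sid)
    return index

def query(index, chroma):
    """Rank database songs by the number of shared landmark hashes."""
    votes = defaultdict(int)
    for h in set(landmark_hashes(chroma_landmarks(chroma))):
        for sid in index[h]:
            votes[sid] += 1
    return sorted(votes.items(), key=lambda kv: -kv[1])
```

A production system would also keep the landmark times so candidate matches can be checked for temporal consistency, and would bucket the hash table for million-song scale.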
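
For the symbolic-to-audio mapping of Joder, Essid, and Richard, the idea of learning a linear transformation under a best-fit criterion can be mimicked with ordinary least squares. The paper studies several audio representations and fit criteria; the plain Frobenius-norm fit below is a stand-in, and the binary piano-roll encoding of the score is an assumption.

```python
import numpy as np

def learn_mapping(S, V):
    """Fit a linear map M (freq_bins x n_pitches) so that V ~ M @ S.

    S: n_pitches x n_frames binary pitch activations from score-aligned
       training data; V: freq_bins x n_frames audio features (e.g.
       magnitude spectra) for the same frames.
    """
    M, *_ = np.linalg.lstsq(S.T, V.T, rcond=None)  # solves S.T @ M.T ~ V.T
    return M.T

def frame_match(M, s, v):
    """Instantaneous match between a score frame s and an audio frame v;
    an alignment algorithm (e.g. DTW) maximizes the sum of these scores."""
    return -np.linalg.norm(v - M @ s)
```

Each column of the learned M can be read as the spectral template of one pitch, which is what replaces the heuristic templates mentioned in the abstract.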
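
The pitch-tracking-by-example idea of Smaragdis can be caricatured in a few lines: stack pitch-tagged training spectra into an overcomplete dictionary and decompose each mixture frame over it; the active pitches are the labels that receive significant weight. Plain non-negative least squares stands in here for the paper's overcomplete sparse coding, and the 0.1 threshold is arbitrary.

```python
import numpy as np
from scipy.optimize import nnls

def pitch_activations(D, labels, frame, n_pitches=128, thresh=0.1):
    """D: freq_bins x n_atoms dictionary of pitch-tagged training spectra;
    labels: MIDI pitch of each atom; frame: mixture magnitude spectrum.
    Returns a boolean vector of detected pitches for this frame.
    """
    w, _ = nnls(D, frame)            # non-negative weights over the atoms
    act = np.zeros(n_pitches)
    for pitch, weight in zip(labels, w):
        act[pitch] += weight         # pool the weight per pitch label
    return act > thresh * (act.max() + 1e-12)
```

Running this per frame and smoothing the activations over time gives a rough multi-source pitch track.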
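
Finally, the PLTF framework of Cemgil, Simsekli, and Subakan generalizes factorization models such as NMF and NMF2D. As a concrete instance, here is textbook KL-divergence NMF with multiplicative updates, the simplest model expressible in that framework; nothing below is the PLTF machinery itself.

```python
import numpy as np

def kl_nmf(X, K, n_iter=200, eps=1e-9):
    """Factor a non-negative spectrogram X (F x T) as X ~ W @ H, with
    W (F x K) spectral templates and H (K x T) activations, minimizing
    the generalized KL divergence via multiplicative updates."""
    F, T = X.shape
    rng = np.random.default_rng(0)
    W = rng.random((F, K)) + eps
    H = rng.random((K, T)) + eps
    for _ in range(n_iter):
        V = W @ H + eps
        W *= ((X / V) @ H.T) / H.sum(axis=1)           # update templates
        V = W @ H + eps
        H *= (W.T @ (X / V)) / W.sum(axis=0)[:, None]  # update activations
    return W, H
```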

Microphone Arrays

  • Diffuseness Estimation With High Temporal Resolution Via Spatial Coherence Between Virtual First-Order Microphones Authors: Oliver Thiergart, International Audio Laboratories Erlangen, Giovanni Del Galdo, Fraunhofer Institute for Integrated Circuits IIS, and Emanuël Habets, International Audio Laboratories Erlangen
    The diffuseness of sound can be estimated with practical microphone setups by considering the spatial coherence between two microphone signals. In applications where small arrays of omnidirectional microphones are preferred, diffuseness estimation is impaired by the high signal coherence of diffuse fields at lower frequencies, which is particularly problematic when carrying out the estimation with high temporal resolution. We therefore propose to exploit the spatial coherence between two virtual first-order microphones derived from the omnidirectional array. This provides a flexible method to accurately estimate the diffuseness in high-SNR regions at lower frequencies with high temporal resolution. (A toy version of this estimator follows this session's listing.)
  • Spatial Soundfield Recording Over a Large Area using Distributed Higher Order Microphones Authors: Prasanga Samarasinghe, Thushara Abhayapala, The Australian National University, and Mark Poletti, Industrial Research Limited, New Zealand
    Recording and reproduction of spatial sound fields over a large area is an unresolved problem in acoustic signal processing, due to the inherent restriction on recording higher-order harmonic components with practically realizable microphone arrays. As the frequency increases and the region of interest grows, the number of microphones needed for effective recording increases beyond practicality. In this paper, we show how to use higher order microphones, distributed over a large area, to record and accurately reconstruct spatial sound fields. We use sound field coefficient translation between origins to combine the distributed recordings into a single sound field over the entire region. We use simulation examples of (i) interior and (ii) exterior fields to corroborate our design. (The 2D analogue of the coefficient translation is sketched after this session's listing.)
  • Compressed sensing for acoustic response reconstruction: interpolation of the early part Authors: Rémi Mignot, Laurent Daudet, ESPCI ParisTech, and François Ollivier, Institut Jean Le Rond d'Alembert, UPMC
    The goal of this paper is to interpolate Room Impulse Responses (RIRs) within a whole volume from a few measurements. We focus here on the early reflections, which have the key property of being sparse in the time domain; this can be exploited in a framework of model-based Compressed Sensing. Starting from a set of RIRs randomly sampled in space by a 3D microphone array, we use a modified Matching Pursuit algorithm to estimate the positions of a small set of virtual sources. The reconstruction of the RIRs at interpolated positions is then performed using a projection onto a basis of monopoles. This approach is validated by both numerical simulations and experimental measurements using a 120-microphone 3D array. (A greedy-pursuit caricature of the source-localization step follows this session's listing.)
  • Block-wise Incremental Adaptation Algorithm for Maximum Kurtosis Beamforming Authors: Kenichi Kumatani, Disney Research, John McDonough, and Bhiksha Raj, Carnegie Mellon University
    In prior work, the current authors investigated beamforming algorithms that exploit the non-Gaussianity of human speech. The beamformers we proposed were designed to maximize the kurtosis or negentropy of the subband output subject to the distortionless constraint for the direction of interest. Such techniques are able to suppress interference signals as well as reverberation effects without signal cancellation. However, multiple passes of processing were required for each utterance in order to estimate the active weight vector. Hence, they were unsuitable for online implementation. In this work, we propose an online implementation of the maximum kurtosis beamformer. In a set of distant speech recognition experiments on far-field data, we demonstrate the effectiveness of the proposed technique. Compared to a single channel of the array, the proposed algorithm reduced word error rate from 15.4% to 6.5%. (A simplified gradient step in generalized sidelobe canceller form is sketched after this session's listing.)
  • Decorrelation for Adaptive Beamforming Applied to Arbitrarily Sampled Spherical Arrays Authors: Ines Hafizovic, University of Oslo and Squarehead Technology AS, Carl-Inge Colombo Nilsen, and Sverre Holm, University of Oslo
    Correlated signals lead to signal cancellation in adaptive beamformers applied to microphone arrays. This is commonly counteracted by spatial smoothing. Unfortunately, spatial smoothing can only be used with array geometries consisting of identical, shifted subarrays, making it unsuitable for spherical arrays in general. We suggest a transformation that makes spatial smoothing applicable to any well-sampled spherical array, and we show results for the case of Minimum Variance Distortionless Response (MVDR) beamforming. (The standard smoothing-plus-MVDR stage is sketched after this session's listing.)
  • Robust Beamforming and Steering of Arbitrary Beam Patterns using Spherical Arrays Authors: Joshua Atkins, Johns Hopkins University
    Spherical microphone and loudspeaker arrays present a compact method for the analysis and synthesis of arbitrary three-dimensional sound fields. Issues such as sensor self-noise, sensor placement errors, and mismatch require robustness constraints in beamformer design. We present a method for designing robust beam patterns with an arbitrary shape, together with an efficient method for steering the resulting patterns in three dimensions. This technique is used for two applications: reproducing spherical microphone array recordings over loudspeaker arrays, and binaurally over headphones with head tracking.
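
As with the previous session, the sketches below are simplified stand-ins with assumed names and parameters. For the Thiergart, Del Galdo, and Habets estimator: from two closely spaced omnis one can form a virtual pressure signal and a finite-difference gradient signal, combine them into two opposite-facing first-order microphones, and track their short-time coherence recursively for high temporal resolution. The linear coherence-to-diffuseness map used here is a crude interpolation between the plane-wave and diffuse-field extremes; the paper derives the proper estimator, and gamma_diff is a placeholder for the diffuse-field coherence of the chosen virtual patterns.

```python
import numpy as np

def diffuseness(X1, X2, freqs, d, c=343.0, gamma_diff=0.5, alpha=0.8):
    """Toy diffuseness estimate from STFTs X1, X2 (F x T) of two
    omnidirectional microphones spaced d metres apart (skip the DC bin)."""
    k = 2.0 * np.pi * freqs / c                      # wavenumber per bin
    p = 0.5 * (X1 + X2)                              # virtual pressure
    g = (X1 - X2) / (1j * k[:, None] * d + 1e-12)    # finite-difference gradient
    c1, c2 = p + g, p - g                            # opposite first-order mics
    F, T = X1.shape
    s11 = np.full(F, 1e-12)
    s22 = np.full(F, 1e-12)
    s12 = np.zeros(F, dtype=complex)
    psi = np.zeros((F, T))
    for t in range(T):                               # recursive averaging
        s11 = alpha * s11 + (1 - alpha) * np.abs(c1[:, t]) ** 2
        s22 = alpha * s22 + (1 - alpha) * np.abs(c2[:, t]) ** 2
        s12 = alpha * s12 + (1 - alpha) * c1[:, t] * np.conj(c2[:, t])
        coh = np.abs(s12) / np.sqrt(s11 * s22)
        # coherence ~ 1 for a plane wave, ~ gamma_diff for a diffuse field
        psi[:, t] = np.clip((1.0 - coh) / (1.0 - gamma_diff), 0.0, 1.0)
    return psi
```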
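
The key tool in Samarasinghe, Abhayapala, and Poletti is translating sound field coefficients between expansion origins. The paper works with 3D spherical harmonics; the sketch below is the simpler 2D (cylindrical-harmonic) analogue, where the translation operator follows from Graf's addition theorem for Bessel functions.

```python
import numpy as np
from scipy.special import jv  # Bessel function of the first kind

def translate_coeffs(alpha, k, d, phi_d, M_out):
    """Re-expand a 2D interior sound field about a shifted origin.

    alpha: coefficients alpha_n, n = -N..N, of the field
           p(r, phi) = sum_n alpha_n J_n(k r) exp(i n phi).
    (d, phi_d): polar position of the new origin in the old frame.
    Returns alpha'_m, m = -M_out..M_out, with
           alpha'_m = sum_n alpha_n J_{n-m}(k d) exp(i (n-m) phi_d).
    """
    N = (len(alpha) - 1) // 2
    n = np.arange(-N, N + 1)
    out = np.empty(2 * M_out + 1, dtype=complex)
    for i, m in enumerate(range(-M_out, M_out + 1)):
        out[i] = np.sum(alpha * jv(n - m, k * d) * np.exp(1j * (n - m) * phi_d))
    return out
```

This translation relation is what links each local higher-order recording to a single global expansion, which can then be estimated over the whole region, e.g. by least squares across all distributed arrays.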
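
In Mignot, Daudet, and Ollivier, the early RIR is modeled as a few virtual (image) sources, i.e. monopoles, which a greedy pursuit can localize from spatially scattered measurements. The grid search, integer-sample delays, and pure 1/r gains below are simplifications; the paper's modified Matching Pursuit is more careful, and interpolation then projects onto a basis of monopoles at the estimated positions.

```python
import numpy as np

def mp_virtual_sources(H, mics, grid, fs, c=343.0, n_src=10):
    """Greedy pursuit of virtual-source positions from early RIRs.

    H: (n_mics, n_samples) early parts of the measured RIRs;
    mics: (n_mics, 3) microphone positions; grid: (n_cand, 3) candidate
    source positions. Assumes every delay falls inside the window.
    """
    R = H.copy().astype(float)
    m_idx = np.arange(len(mics))
    dist = np.linalg.norm(grid[:, None, :] - mics[None, :, :], axis=2)
    delay = np.round(dist / c * fs).astype(int)   # (n_cand, n_mics)
    gain = 1.0 / dist                             # monopole 1/r attenuation
    sources = []
    for _ in range(n_src):
        # correlate the residual with each candidate's sparse atom
        score = np.array([gain[s] @ R[m_idx, delay[s]] for s in range(len(grid))])
        best = int(np.argmax(np.abs(score)))
        amp = score[best] / (gain[best] @ gain[best])   # least-squares amplitude
        sources.append((grid[best], amp))
        R[m_idx, delay[best]] -= amp * gain[best]       # subtract and iterate
    return sources
```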
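
The maximum kurtosis beamformer of Kumatani, McDonough, and Raj can be pictured in generalized sidelobe canceller form: a quiescent vector enforces the distortionless constraint, a blocking matrix spans its null space, and the active weights are adapted to maximize the kurtosis of the subband output. Below is a simplified block-gradient step using the complex kurtosis J = E|Y|^4 - 2(E|Y|^2)^2; the step size and the update itself are illustrative, not the authors' exact recursion.

```python
import numpy as np

def gsc_matrices(d):
    """Quiescent weights and blocking matrix for a steering vector d (M,)."""
    wq = d / (d.conj() @ d)            # w_q^H d = 1: distortionless
    U, _, _ = np.linalg.svd(d[:, None])
    B = U[:, 1:]                       # columns span the null space of d^H
    return wq, B

def kurtosis_step(X, wq, B, wa, mu=0.01):
    """One block-wise ascent step of the active weights wa (M-1,) on
    subband snapshots X (M x N); the output is Y = (wq - B wa)^H x."""
    y = (wq - B @ wa).conj() @ X               # beamformer outputs (N,)
    p = np.abs(y) ** 2
    bx = B.conj().T @ X                        # blocked signals (M-1, N)
    # Wirtinger gradient of J with respect to conj(wa)
    grad = (-2.0 * (bx * (p * y.conj())).mean(axis=1)
            + 4.0 * p.mean() * (bx * y.conj()).mean(axis=1))
    return wa + mu * grad                      # ascend the kurtosis
```

Because every choice of the active weights leaves w^H d = 1 intact, the distortionless constraint holds throughout the incremental adaptation.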
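
Finally, in Hafizovic, Nilsen, and Holm, the spherical-array signals are first transformed so that they behave like identical shifted subarrays; that transformation is the paper's contribution and is not reproduced here. The sketch below covers only the standard downstream stage, forward spatial smoothing followed by MVDR, with an ad hoc diagonal loading factor.

```python
import numpy as np

def smoothed_mvdr(X, steer, L):
    """MVDR weights with forward spatial smoothing on a (virtual) ULA.

    X: (M, N) snapshots, here assumed to be the transformed array
    signals; steer: length-L steering vector of the subarrays.
    """
    M, N = X.shape
    R = np.zeros((L, L), dtype=complex)
    for i in range(M - L + 1):                 # average subarray covariances
        Xi = X[i:i + L]
        R += Xi @ Xi.conj().T / N
    R /= (M - L + 1)
    R += 1e-3 * np.trace(R).real / L * np.eye(L)   # diagonal loading
    w = np.linalg.solve(R, steer)                  # R^{-1} d
    return w / (steer.conj() @ w)                  # normalize: w^H d = 1
```

Averaging over shifted subarrays decorrelates coherent sources, which is exactly what prevents signal cancellation in the MVDR stage.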