TechTalks from event: ICML 2012 Workshop on Representation Learning

  • Representation Learning Authors: Prof. Yoshua Bengio, Department of Computer Science and Operations Research; Canada Research Chair in Statistical Learning Algorithms
  • Learning a selectivity-invariance-selectivity feature extraction architecture for images Authors: Michael U. Gutmann, Aapo Hyvarinen
    Selectivity and invariance are thought to be important ingredients in biological or artificial visual systems. A fundamental problem, however, is to know what the visual system should be selective to and what it should be invariant to. Building a statistical model of images, we learn here a three-layer feature extraction system where the selectivity and invariance emerge from the properties of the images.
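    The selectivity-invariance-selectivity structure can be pictured with a minimal sketch (not the paper's learned model): a bank of linear filters, energy-style pooling over fixed filter groups, and a linear readout. All weights and pool assignments below are random placeholders standing in for what the statistical image model would learn.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Layer 1 (selectivity): linear filters on flattened 16x16 patches.
    # W1 is a random placeholder; the paper learns it from natural images.
    W1 = rng.standard_normal((64, 256))

    # Layer 2 (invariance): pool squared responses within fixed groups
    # of filters, an energy-model style nonlinearity.
    groups = np.arange(64).reshape(16, 4)   # 16 pools of 4 filters each

    # Layer 3 (selectivity): linear readout on the pooled responses.
    W3 = rng.standard_normal((8, 16))

    def extract_features(patch):
        """patch: flattened 16x16 image patch, shape (256,)."""
        s1 = W1 @ patch                      # selective linear responses
        c1 = (s1 ** 2)[groups].sum(axis=1)   # invariant pooled energies
        return W3 @ np.log(c1 + 1e-6)        # selective third-layer output

    print(extract_features(rng.standard_normal(256)).shape)  # (8,)
    ```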
  • Deconvolutional Networks Authors: Rob Fergus
    We present a hierarchical model that learns image decompositions via alternating layers of convolutional sparse coding and max pooling. When trained on natural images, the layers of our model capture image information in a variety of forms: low-level edges, mid-level edge junctions, high-level object parts and complete objects. To build our model we rely on a novel inference scheme that ensures each layer reconstructs the input, rather than just the output of the layer directly beneath, as is common with existing hierarchical approaches. This makes it possible to learn multiple layers of representation and we show models with 4 layers, trained on images from the Caltech-101 and Caltech-256 datasets. Features extracted from these models, in combination with a standard classifier, outperform SIFT and representations from other feature learning approaches. Joint work with Matt Zeiler (NYU) and Graham Taylor (NYU).
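    A minimal single-layer sketch of the convolutional sparse coding step the abstract builds on: ISTA-style inference of sparse feature maps that must reconstruct the input image itself. The filter bank is assumed given here, whereas the actual model learns the filters and stacks such layers with max pooling, using its own inference scheme.

    ```python
    import numpy as np
    from scipy.signal import convolve2d

    def csc_infer(image, filters, lam=0.1, lr=0.05, steps=50):
        """ISTA-style inference for one convolutional sparse coding layer.

        Finds sparse feature maps z_k minimizing
            0.5 * ||image - sum_k filters[k] * z_k||^2 + lam * ||z||_1,
        so the maps must reconstruct the input itself -- the property the
        abstract emphasises for every layer of the model.
        """
        K, fh, fw = filters.shape
        H, W = image.shape
        z = np.zeros((K, H - fh + 1, W - fw + 1))
        for _ in range(steps):
            recon = sum(convolve2d(z[k], filters[k]) for k in range(K))
            resid = image - recon
            for k in range(K):
                # gradient step on the reconstruction term for map k
                z[k] += lr * convolve2d(resid, filters[k][::-1, ::-1], mode="valid")
            # soft-thresholding enforces the sparsity penalty
            z = np.sign(z) * np.maximum(np.abs(z) - lr * lam, 0.0)
        return z
    ```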
  • Challenges and Progress in Learning Representations Authors: Ryan Adams
    Representation is a central issue in machine learning. Indeed, with good representations even simple algorithms can work, but even the most sophisticated machine learning algorithms will find it difficult to make up for a poor representation. Ideally, we would like the data representation itself to be the subject of learning, to avoid these difficulties. I will discuss some of the challenges in learning useful representations and will give an overview of some of my work in discovering representations for image and text data, using methods ranging from connectionist deep learning to Bayesian nonparametrics.
  • Scaling Deep Learning Authors: Jeff Dean
    We have recently started investigating how to scale deep learning techniques to much larger models in an effort to improve the accuracy of such models in the domains of computer vision, speech recognition, and natural language processing. Our largest models to date have more than 1 billion parameters, and we utilize both supervised and unsupervised training in our work. In order to train models of this scale, we utilize clusters of thousands of machines, and exploit both model parallelism (by distributing computation within a single replica of the model across multiple cores and multiple machines) and data parallelism (by distributing computation across many replicas of these distributed models). In this talk I'll describe the progress we've made on building training systems for models of this scale, and also highlight a few results for using these models for tasks that are important to improving Google's products. This talk describes joint work with Kai Chen, Greg Corrado, Matthieu Devin, Quoc Le, Rajat Monga, Andrew Ng, Marc'Aurelio Ranzato, Paul Tucker, and Ke Yang.
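    A toy illustration of the data-parallel scheme described (not Google's actual training system): replicas pull parameters from a shared server, compute gradients on their own data shard, and push updates back. The linear model and sequential loop below are placeholders for real model replicas running in parallel.

    ```python
    import numpy as np

    class ParameterServer:
        """Toy parameter server: replicas pull weights and push gradients."""
        def __init__(self, dim):
            self.w = np.zeros(dim)
        def pull(self):
            return self.w.copy()
        def push(self, grad, lr=0.01):
            self.w -= lr * grad   # applied asynchronously in a real system

    def replica_step(server, x_batch, y_batch):
        w = server.pull()                     # fetch current parameters
        grad = x_batch.T @ (x_batch @ w - y_batch) / len(x_batch)
        server.push(grad)                     # send local gradient back

    rng = np.random.default_rng(0)
    server = ParameterServer(dim=5)
    for shard in range(4):   # sequential here; one iteration per model replica
        x = rng.standard_normal((32, 5))
        y = x @ np.arange(5.0)
        replica_step(server, x, y)
    ```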
  • Semi-supervised Deep Inference Machines for Multi-modal Perception Authors: Drew Bagnell
    In recent years, the popularity of new sensing modalities like ladar and depth cameras has provided a rich new source of data for computer perception to explore. Understanding such data in context is perhaps best cast as a problem of structured prediction. The traditional approach to structured prediction problems is to craft a graphical model structure, learn parameters for the model, and perform inference using an efficient and usually approximate inference approach, including, e.g., graph cut methods, belief propagation, and variational methods. Unfortunately, while remarkably powerful methods for inference have been developed and substantial theoretical insight has been achieved, especially for simple potentials, the combination of learning and approximate inference for graphical models is still poorly understood and limited in practice. Within computer vision, for instance, there is a common belief that more sophisticated representations and energy functions are necessary to achieve high performance, yet these are difficult to handle with theoretically sound inference/learning procedures. An alternate view is to consider approximate inference as a procedure: we can view an iterative procedure like belief propagation on a random field as a network of computational modules that take observations and other local computations on a graph (messages), and provide intermediate output messages and final output classifications over nodes in the random field. This approach has shown significant promise in the resulting quality of predictions on computer vision tasks, speed of inference and training, and theoretical understanding. The resulting network of predictive modules is often a tremendously deep one (up to ~10^6 computational modules) taking perceptual features to semantic predictions. We demonstrate that multi-modal data provides both new challenges and new advantages that are well addressed by inference machines. In particular, we show a particular structure that is appropriate for inference in such data. Further, we demonstrate that multi-modality enables very efficient use of unlabeled data to learn representations through co-regularization, which encourages predictions from each modality to agree wherever they overlap. We relate the resulting approaches to previous techniques, including CCA and graphical model approaches. Finally, we demonstrate performance on difficult problems in multi-modal scene understanding. This is joint work with Daniel Munoz and Martial Hebert.
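    The co-regularization idea can be sketched with a toy objective (the linear predictors and variable names are placeholders for the paper's inference-machine modules): each modality fits the labels on its own data, while an agreement penalty ties the two predictions together on unlabeled points observed by both sensors.

    ```python
    import numpy as np

    def coreg_objective(w_a, w_b, Xa, Xb, y, Xa_un, Xb_un, mu=1.0):
        """Two-modality co-regularization objective (toy version).

        (Xa, Xb, y): labeled features from each modality with shared labels.
        (Xa_un, Xb_un): unlabeled points observed by both sensors.
        mu weights the agreement penalty on the overlap.
        """
        sup = np.mean((Xa @ w_a - y) ** 2) + np.mean((Xb @ w_b - y) ** 2)
        agree = np.mean((Xa_un @ w_a - Xb_un @ w_b) ** 2)
        return sup + mu * agree
    ```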
  • Sequence Transduction with Recurrent Neural Networks Authors: Alex Graves
    Many machine learning tasks can be expressed as the transformation or transduction of input sequences into output sequences: speech recognition, machine translation, protein secondary structure prediction and text-to-speech, to name but a few. One of the key challenges in sequence transduction is learning to represent both the input and output sequences in a way that is invariant to sequential distortions such as shrinking, stretching and translating. Recurrent neural networks (RNNs) are a powerful sequence learning architecture that has proven capable of learning such representations. However, RNNs traditionally require a pre-defined alignment between the input and output sequences to perform transduction. This is a severe limitation since finding the alignment is the most difficult aspect of many sequence transduction problems. Indeed, even determining the length of the output sequence is often challenging. This paper introduces an end-to-end, probabilistic sequence transduction system, based entirely on RNNs, that returns a distribution over output sequences of all possible lengths and alignments for any input sequence. Experimental results are provided on the TIMIT speech corpus.
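    The distribution over all lengths and alignments can be computed with a forward pass over the transduction lattice. The sketch below assumes the per-step output log-probabilities have already been produced by the transcription and prediction networks; it follows the standard RNN transducer recursion but is an illustrative reimplementation, not the paper's code.

    ```python
    import numpy as np

    def transducer_log_prob(log_probs, labels, blank=0):
        """Sum over all alignments in the transduction lattice.

        log_probs[t, u, k] = log Pr(symbol k | input step t, u labels emitted),
        assumed precomputed from the transcription and prediction networks.
        Returns log Pr(labels | input).
        """
        T, _, _ = log_probs.shape
        U = len(labels)
        alpha = np.full((T, U + 1), -np.inf)
        alpha[0, 0] = 0.0
        for t in range(T):
            for u in range(U + 1):
                if t > 0:   # emit blank: advance one input step
                    alpha[t, u] = np.logaddexp(
                        alpha[t, u], alpha[t - 1, u] + log_probs[t - 1, u, blank])
                if u > 0:   # emit the next label: advance one output step
                    alpha[t, u] = np.logaddexp(
                        alpha[t, u],
                        alpha[t, u - 1] + log_probs[t, u - 1, labels[u - 1]])
        return alpha[T - 1, U] + log_probs[T - 1, U, blank]
    ```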
  • Some thoughts on learning representations for text, images, music and knowledge Authors: Jason Weston
    In this talk I will discuss the pros and cons of various methods for learning feature representations for audio (e.g., music recommendation), text (e.g., retrieval, and syntactic and semantic tagging), images (e.g., ranking and annotation) and knowledge (e.g., using the WordNet graph to help in the tasks above). Particular emphasis is put on methods that scale well to large data and have fast serving times so that they can be used in production. In particular, I have worked on a number of supervised feature embedding algorithms that work well on these tasks, which I will describe, as well as areas where I think these methods can be improved.
  • Exploiting compositionality to explore a large space of model structures Authors: Roger Grosse
    The recent proliferation of highly structured probabilistic models raises the question of how to automatically determine an appropriate model for a dataset. We focus on a space of matrix decomposition models which can express a variety of widely used models from unsupervised learning. To enable model selection, we organize these models into a context-free grammar which generates a wide variety of structures through the compositional application of a few simple rules. We use our grammar to generically and efficiently infer latent components and estimate predictive likelihood for nearly 2500 structures using a small toolbox of reusable algorithms. Using a greedy search over our grammar, we automatically choose the decomposition structure from raw data by evaluating only a tiny fraction of all models. The proposed method typically finds the correct structure for synthetic data and backs off gracefully to simpler models under heavy noise. It learns plausible structures for datasets as diverse as image patches, motion capture, 20 Questions, and U.S. Senate votes, all using exactly the same code.
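    The search procedure can be sketched as follows (the production rules and scoring function are stand-ins for the paper's grammar and predictive-likelihood evaluation): repeatedly expand the current structure by every applicable production, keep the best-scoring candidate, and back off when no expansion improves the score.

    ```python
    RULES = {   # hypothetical productions; the paper defines its own grammar
        "G": ["GG + G", "MG + G", "exp(G) . G"],
        "M": ["MG + G"],
    }

    def expansions(structure):
        """All structures reachable by applying one production to one symbol."""
        out = []
        for i, sym in enumerate(structure):
            for rhs in RULES.get(sym, []):
                out.append(structure[:i] + "(" + rhs + ")" + structure[i + 1:])
        return out

    def greedy_search(score, start="G", depth=3):
        """score(structure) -> held-out predictive likelihood (assumed given)."""
        best = start
        for _ in range(depth):
            candidates = expansions(best)
            if not candidates:
                break
            nxt = max(candidates, key=score)
            if score(nxt) <= score(best):   # back off to the simpler model
                break
            best = nxt
        return best
    ```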
  • On Linear Embeddings and Unsupervised Feature Learning Authors: Ryan Kiros
    The ability to train deep architectures has led to many developments in parametric, non-linear dimensionality reduction, but little attention has been given to algorithms based on convolutional feature extraction without backpropagation training. This paper aims to fill this gap in the context of supervised Mahalanobis metric learning. Modifying two existing approaches to model latent space similarities with a Student's t-distribution, we obtain competitive classification performance on CIFAR-10 and STL-10 with k-NN in a 50-dimensional space, compared with a linear SVM using significantly more features. Using simple modifications to existing feature extraction pipelines, we obtain an error of 0.40% on MNIST, the best reported result without appending distortions to the training set.
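    The core modification, modeling latent-space similarities with a heavy-tailed Student's t kernel (as in t-SNE), can be sketched as below; the function and its arguments are illustrative, not the paper's implementation.

    ```python
    import numpy as np

    def t_similarities(Z, dof=1.0):
        """Pairwise similarities under a Student's t kernel.

        Z: (n, d) points in the learned latent space (d = 50 in the paper).
        Heavy tails keep moderately distant pairs at nonzero similarity.
        """
        d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)   # squared distances
        q = (1.0 + d2 / dof) ** (-(dof + 1.0) / 2.0)
        np.fill_diagonal(q, 0.0)          # exclude self-similarities
        return q / q.sum()                # normalize to a joint distribution
    ```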