TechTalks from event: ICML 2011

Neural Networks and NLP

  • Parsing Natural Scenes and Natural Language with Recursive Neural Networks Authors: Richard Socher; Cliff Chiung-Yu Lin; Andrew Ng; Chris Manning
    Recursive structure is commonly found in the inputs of different modalities such as natural scene images or natural language sentences. Discovering this recursive structure helps us to not only identify the units that an image or sentence contains but also how they interact to form a whole. We introduce a max-margin structure prediction architecture based on recursive neural networks that can successfully recover such structure both in complex scene images as well as sentences. The same algorithm can be used both to provide a competitive syntactic parser for natural language sentences from the Penn Treebank and to outperform alternative approaches for semantic scene segmentation, annotation and classification. For segmentation and annotation our algorithm obtains a new level of state-of-the-art performance on the Stanford background dataset (78.1%). The features from the image parse tree outperform Gist descriptors for scene classification by 4%.
  • Domain Adaptation for Large-Scale Sentiment Classification: A Deep Learning Approach Authors: Xavier Glorot; Antoine Bordes; Yoshua Bengio
    The exponential increase in the availability of online reviews and recommendations makes sentiment classification an interesting topic in academic and industrial research. Reviews can span so many different domains that it is difficult to gather annotated training data for all of them. Hence, this paper studies the problem of domain adaptation for sentiment classifiers, hereby a system is trained on labeled reviews from one source domain but is meant to be deployed on another. We propose a deep learning approach which learns to extract a meaningful representation for each review in an unsupervised fashion. Sentiment classifiers trained with this high-level feature representation clearly outperform state-of-the-art methods on a benchmark composed of reviews of 4 types of Amazon products. Furthermore, this method scales well and allowed us to successfully perform domain adaptation on a larger industrial-strength dataset of 22 domains.
  • Large-Scale Learning of Embeddings with Reconstruction Sampling Authors: Yann Dauphin; Xavier Glorot; Yoshua Bengio
    In this paper, we present a novel method to speed up the learning of embeddings for large-scale learning tasks involving very sparse data, as is typically the case for Natural Language Processing tasks. Our speed-up method has been developed in the context of Denoising Auto-encoders, which are trained in a purely unsupervised way to capture the input distribution, and learn embeddings for words and text similar to earlier neural language models. The main contribution is a new method to approximate reconstruction error by a sampling procedure. We show how this approximation can be made to obtain an unbiased estimator of the training criterion, and we show how it can be leveraged to make learning much more computationally efficient. We demonstrate the effectiveness of this method on the Amazon and RCV1 NLP datasets. Instead of reducing vocabulary size to make learning practical, our method allows us to train using very large vocabularies. In particular, reconstruction sampling requires 22x less training time on the full Amazon dataset.
  • Generating Text with Recurrent Neural Networks Authors: Ilya Sutskever; James Martens; Geoffrey Hinton
    Recurrent Neural Networks (RNNs) are very powerful sequence models that do not enjoy widespread use because it is extremely difficult to train them properly. Fortunately, recent advances in Hessian-free optimization have been able to overcome the difficulties associated with training RNNs, making it possible to apply them successfully to challenging sequence problems. In this paper we demonstrate the power of RNNs trained with the new Hessian-Free optimizer (HF) by applying them to character-level language modeling tasks. The standard RNN architecture, while effective, is not ideally suited for such tasks, so we introduce a new RNN variant that uses multiplicative (or ``gated'') connections which allow the current input character to determine the transition matrix from one hidden state vector to the next. After training the multiplicative RNN with the HF optimizer for five days on 8 high-end Graphics Processing Units, we were able to surpass the performance of the best previous single method for character-level language modeling -- a hierarchical non-parametric sequence model. To our knowledge this represents the largest recurrent neural network application to date.
  • Contractive Auto-Encoders: Explicit Invariance During Feature Extraction Authors: Salah Rifai; Pascal Vincent; Xavier Muller; Xavier Glorot; Yoshua Bengio
    We present in this paper a novel approach for training deterministic auto-encoders. We show that by adding a well chosen penalty term to the classical reconstruction cost function, we can achieve results that equal or surpass those attained by other regularized auto-encoders as well as denoising auto-encoders on a range of datasets. This penalty term corresponds to the Frobenius norm of the Jacobian matrix of the encoder activations with respect to the input. We show that this penalty term results in a localized space contraction which in turn yields robust features on the activation layer. Furthermore, we show how this penalty term is related to both regularized auto-encoders and denoising auto-encoders and how it can be seen as a link between deterministic and non-deterministic auto-encoders. We find empirically that this penalty helps to carve a representation that better captures the local directions of variation dictated by the data, corresponding to a lower-dimensional non-linear manifold, while being more invariant to the vast majority of directions orthogonal to the manifold. Finally, we show that by using the learned features to initialize an MLP, we achieve state of the art classification error on a range of datasets, surpassing other methods of pre-training.

Probabilistic Models & MCMC

  • Probabilistic Matrix Addition Authors: Amrudin Agovic; Arindam Banerjee; Snigdhansu Chatterje
    We introduce Probabilistic Matrix Addition (PMA) for modeling real-valued data matrices by simultaneously capturing covariance structure among rows and among columns. PMA additively combines two latent matrices drawn from two Gaussian Processes respectively over rows and columns. The resulting joint distribution over the observed matrix does not factorize over entries, rows, or columns, and can thus capture intricate dependencies in the matrix. Exact inference in PMA is possible, but involves inversion of large matrices, and can be computationally prohibitive. Efficient approximate inference is possible due to the sparse dependency structure among latent variables. We propose two families of approximate inference algorithms for PMA based on Gibbs sampling and MAP inference. We demonstrate the effectiveness of PMA for missing value prediction and multi-label classification problems.
  • SampleRank: Training Factor Graphs with Atomic Gradients Authors: Michael Wick; Khashayar Rohanimanesh; Kedar Bellare; Aron Culotta; Andrew McCallum
    We present SampleRank, an alternative to contrastive divergence (CD) for estimating parameters in complex graphical models. SampleRank harnesses a user-provided loss function to distribute stochastic gradients across an MCMC chain. As a result, parameter updates can be computed between arbitrary MCMC states. SampleRank is not only faster than CD, but also achieves better accuracy in practice (up to 23\% error reduction on noun-phrase coreference).
  • A New Bayesian Rating System for Team Competitions Authors: Sergey Nikolenko; Alexander Sirotkin
    We present a novel probabilistic rating system for team competitions. Building upon TrueSkill(tm), we change the factor graph structure to cope with the problems of TrueSkill(tm), e.g., multiway ties and variable team size. We give detailed inference algorithms for the new structure. Experimental results show a significant improvement over TrueSkill(tm).
  • Bayesian Learning via Stochastic Gradient Langevin Dynamics Authors: Max Welling; Yee Whye Teh
    In this paper we propose a new framework for learning from large scale datasets based on iterative learning from small mini-batches. By adding the right amount of noise to a standard stochastic gradient optimization algorithm we show that the iterates will converge to samples from the true posterior distribution as we anneal the stepsize. This seamless transition between optimization and Bayesian posterior sampling provides an in-built protection against overfitting. We also propose a practical method for Monte Carlo estimates of posterior statistics which monitors a ``sampling threshold'' and collects samples after it has been surpassed. We apply the method to three models: a mixture of Gaussians, logistic regression and ICA with natural gradients.
  • ABC-EP: Expectation Propagation for Likelihood-free Bayesian Computation Authors: Simon Barthelmé; Nicolas Chopin
    Many statistical models of interest to the natural and social sciences have no tractable likelihood function. Until recently, Bayesian inference for such models was thought infeasible. Pritchard et al. (1999) introduced an algorithm known as ABC, for Approximate Bayesian Computation, that enables Bayesian computation in such models. Despite steady progress since this first breakthrough, such as the adaptation of MCMC and Sequential Monte Carlo techniques to likelihood-free inference, state-of-the art methods remain notoriously hard to use and require enormous computation times. Among other issues, one faces the difficult task of finding appropriate summary statistics for the model, and tuning the algorithm can be time-consuming when little prior information is available. We show that Expectation Propagation, a widely successful approximate inference technique, can be adapted to the likelihood-free context. The resulting algorithm does not require summary statistics, is an order of magnitude faster than existing techniques, and remains usable when prior information is vague.

Online Learning

  • Online AUC Maximization Authors: Peilin Zhao; Steven Hoi; Rong Jin; Tianbao Yang
    Most studies of online learning measure the performance of a learner by classification accuracy, which is inappropriate for applications where the data are unevenly distributed among different classes. We address this limitation by developing online learning algorithm for maximizing Area Under the ROC curve (AUC), a metric that is widely used for measuring the classification performance for imbalanced data distributions. The key challenge of online AUC maximization is that it needs to optimize the pairwise loss between two instances from different classes. This is in contrast to the classical setup of online learning where the overall loss is a sum of losses over individual training examples. We address this challenge by exploiting the reservoir sampling technique, and present two algorithms for online AUC maximization with theoretic performance guarantee. Extensive experimental studies confirm the effectiveness and the efficiency of the proposed algorithms for maximizing AUC.
  • Online Submodular Minimization for Combinatorial Structures Authors: Stefanie Jegelka; Jeff Bilmes
    Most results for online decision problems with structured concepts, such as trees or cuts, assume linear costs. In many settings, however, nonlinear costs are more realistic. Owing to their non-separability, these lead to much harder optimization problems. Going beyond linearity, we address online approximation algorithms for structured concepts that allow the cost to be submodular, i.e., nonseparable. In particular, we show regret bounds for three Hannan-consistent strategies that capture different settings. Our results also tighten a regret bound for unconstrained online submodular minimization.
  • Better Algorithms for Selective Sampling Authors: Francesco Orabona; Nicolò Cesa-Bianchi
    We study online algorithms for selective sampling that use regularized least squares (RLS) as base classifier. These algorithms typically perform well in practice, and some of them have formal guarantees on their mistake and query rates. We refine and extend these guarantees in various ways, proposing algorithmic variants that exhibit better empirical behavior while enjoying performance guarantees under much more general conditions. We also show a simple way of coupling a generic gradient-based classifier with a specific RLS-based selective sampler, obtaining hybrid algorithms with combined performance guarantees.
  • Learning Linear Functions with Quadratic and Linear Multiplicative Updates Authors: Tom Bylander
    We analyze variations of multiplicative updates for learning linear functions online. These can be described as substituting exponentiation in the Exponentiated Gradient (EG) algorithm with quadratic and linear functions. Both kinds of updates substitute exponentiation with simpler operations and reduce dependence on the parameter that specifies the sum of the weights during learning. In particular, the linear multiplicative update places no restrictions on the sum of the weights, and, under a wide range of conditions, achieves worst-case behavior close to the EG algorithm. We perform our analysis for square loss and absolute loss, and for regression and classification. We also describe some experiments showing that the performance of our algorithms are comparable to EG and the $p$-norm algorithm.
  • Optimal Distributed Online Prediction Authors: Ofer Dekel; Ran Gilad-Bachrach; Ohad Shamir; Lin Xiao
    Online prediction methods are typically studied as serial algorithms running on a single processor. In this paper, we present the distributed mini-batch (DMB) framework, a method of converting a serial gradient-based online algorithm into a distributed algorithm, and prove an asymptotically optimal regret bound for smooth convex loss functions and stochastic examples. Our analysis explicitly takes into account communication latencies between computing nodes in a network. We also present robust variants, which are resilient to failures and node heterogeneity in an synchronous distributed environment. Our method can also be used for distributed stochastic optimization, attaining an asymptotically linear speedup. Finally, we empirically demonstrate the merits of our approach on large-scale online prediction problems.