TechTalks from event: ICML 2011

Latent-Variable Models

  • On the Integration of Topic Modeling and Dictionary Learning Authors: Lingbo Li; Mingyuan Zhou; Guillermo Sapiro; Lawrence Carin
    A new nonparametric Bayesian model is developed to integrate dictionary learning and topic model into a unified framework. The model is employed to analyze partially annotated images, with the dictionary learning performed directly on image patches. Efficient inference is performed with a Gibbs-slice sampler, and encouraging results are reported on widely used datasets.
  • Beam Search based MAP Estimates for the Indian Buffet Process Authors: Piyush Rai; Hal Daume III
    Nonparametric latent feature models offer a flexible way to discover the latent features underlying the data, without having to a priori specify their number. The Indian Buffet Process (IBP) is a popular example of such a model. Inference in IBP based models, however, remains a challenge. Sampling techniques such as MCMC can be computationally expensive and can take a long time to converge to the stationary distribution. Variational techniques, although faster than sampling, can be difficult to design, and can still remain slow on large data. In many problems, however, we only seek a maximum a posteriori (MAP) estimate of the latent feature assignment matrix. For such cases, we show that techniques such as beam search can give fast, approximate MAP estimates in the IBP based models. If samples from the posterior are desired, these MAP estimates can also serve as sensible initializers for MCMC based algorithms. Experimental results on a variety of datasets suggest that our algorithms can be a computationally viable alternative to Gibbs sampling, the particle filter, and variational inference based approaches for the IBP, and also perform better than other heuristics such as greedy search.
  • Tree-Structured Infinite Sparse Factor Model Authors: XianXing Zhang; David Dunson; Lawrence Carin
    A new tree-structured multiplicative gamma process (TMGP) is developed, for inferring the depth of a tree-based factor-analysis model. This new model is coupled with the nested Chinese restaurant process, to nonparametrically infer the depth and width (structure) of the tree. In addition to developing the model, theoretical properties of the TMGP are addressed, and a novel MCMC sampler is developed. The structure of the inferred tree is used to learn relationships between high-dimensional data, and the model is also applied to compressive sensing and interpolation of incomplete images.
  • Sparse Additive Generative Models of Text Authors: Jacob Eisenstein; Amr Ahmed; Eric Xing
    Generative models of text typically associate a multinomial with every class label or topic. Even in simple models this requires the estimation of thousands of parameters; in multifaceted latent variable models, standard approaches require additional latent ``switching'' variables for every token, complicating inference. In this paper, we propose an alternative generative model for text. The central idea is that each class label or latent topic is endowed with a model of the deviation in log-frequency from a constant background distribution. This approach has two key advantages: we can enforce sparsity to prevent overfitting, and we can combine generative facets through simple addition in log space, avoiding the need for latent switching variables. We demonstrate the applicability of this idea to a range of scenarios: classification, topic modeling, and more complex multifaceted generative models.

Large-Scale Learning

  • Hashing with Graphs Authors: Wei Liu; Jun Wang; Sanjiv Kumar; Shih-Fu Chang
    Hashing is becoming increasingly popular for efficient nearest neighbor search in massive databases. However, learning short codes that yield good search performance is still a challenge. Moreover, in many cases real-world data lives on a low-dimensional manifold, which should be taken into account to capture meaningful nearest neighbors. In this paper, we propose a novel graph-based hashing method which automatically discovers the neighborhood structure inherent in the data to learn appropriate compact codes. To make such an approach computationally feasible, we utilize Anchor Graphs to obtain tractable low-rank adjacency matrices. Our formulation allows constant time hashing of a new data point by extrapolating graph Laplacian eigenvectors to eigenfunctions. Finally, we describe a hierarchical threshold learning procedure in which each eigenfunction yields multiple bits, leading to higher search accuracy. Experimental comparison with the other state-of-the-art methods on two large datasets demonstrates the efficacy of the proposed method.
  • Large Scale Text Classification using Semi-supervised Multinomial Naive Bayes Authors: Jiang Su; Jelber Sayyad Shirab; Stan Matwin
    Numerous semi-supervised learning methods have been proposed to augment Multinomial Naive Bayes (MNB) using unlabeled documents, but their use in practice is often limited due to implementation difficulty, inconsistent prediction performance, or high computational cost. In this paper, we propose a new, very simple semi-supervised extension of MNB, called Semi-supervised Frequency Estimate (SFE). Our experiments show that it consistently improves MNB with additional data (labeled or unlabeled) in terms of AUC and accuracy, which is not the case when combining MNB with Expectation Maximization (EM). We attribute this to the fact that SFE consistently produces better conditional log likelihood values than both EM+MNB and MNB in labeled training data.
  • Parallel Coordinate Descent for L1-Regularized Loss Minimization Authors: Joseph Bradley; Aapo Kyrola; Daniel Bickson; Carlos Guestrin
    We propose Shotgun, a parallel coordinate descent algorithm for minimizing L1-regularized losses. Though coordinate descent seems inherently sequential, we prove convergence bounds for Shotgun which predict linear speedups, up to a problem-dependent limit. We present a comprehensive empirical study of Shotgun for Lasso and sparse logistic regression. Our theoretical predictions on the potential for parallelism closely match behavior on real data. Shotgun outperforms other published solvers on a range of large problems, proving to be one of the most scalable algorithms for L1.
  • OptiML: An Implicitly Parallel Domain-Specific Language for Machine Learning Authors: Arvind Sujeeth; HyoukJoong Lee; Kevin Brown; Tiark Rompf; Hassan Chafi; Michael Wu; Anand Atreya; Martin Odersky; Kunle Olukotun
    As the size of datasets continues to grow, machine learning applications are becoming increasingly limited by the amount of available computational power. Taking advantage of modern hardware requires using multiple parallel programming models targeted at different devices (e.g. CPUs and GPUs). However, programming these devices to run efficiently and correctly is difficult, error-prone, and results in software that is harder to read and maintain. We present OptiML, a domain-specific language (DSL) for machine learning. OptiML is an implicitly parallel, expressive and high performance alternative to MATLAB and C++. OptiML performs domain-specific analyses and optimizations and automatically generates CUDA code for GPUs. We show that OptiML outperforms explicitly parallelized MATLAB code in nearly all cases.

Learning Theory

  • On the Necessity of Irrelevant Variables Authors: Dave Helmbold; Phil Long
    This work explores the effects of relevant and irrelevant boolean variables on the accuracy of classifiers. The analysis uses the assumption that the variables are conditionally independent given the class, and focuses on a natural family of learning algorithms for such sources when the relevant variables have a small advantage over random guessing. The main result is that algorithms relying predominately on irrelevant variables have error probabilities that quickly go to 0 in situations where algorithms that limit the use of irrelevant variables have errors bounded below by a positive constant. We also show that accurate learning is possible even when there are so few examples that one cannot determine with high confidence whether or not any individual variable is relevant.
  • A PAC-Bayes Sample-compression Approach to Kernel Methods Authors: Pascal Germain; Alexandre Lacoste; Francois Laviolette; Mario Marchand; Sara Shanian
    We propose a PAC-Bayes sample compression approach to kernel methods that can accommodate any bounded similarity function and show that the support vector machine (SVM) classifier is a particular case of a more general class of data-dependent classifiers known as majority votes of sample-compressed classifiers. We provide novel risk bounds for these majority votes and learning algorithms that minimize these bounds.
  • Simultaneous Learning and Covering with Adversarial Noise Authors: Andrew Guillory; Jeff Bilmes
    We study simultaneous learning and covering problems: submodular set cover problems that depend on the solution to an active (query) learning problem. The goal is to jointly minimize the cost of both learning and covering. We extend recent work in this setting to allow for a limited amount of adversarial noise. Certain noisy query learning problems are a special case of our problem. Crucial to our analysis is a lemma showing the logical OR of two submodular cover constraints can be reduced to a single submodular set cover constraint. Combined with known results, this new lemma allows for arbitrary monotone circuits of submodular cover constraints to be reduced to a single constraint. As an example practical application, we present a movie recommendation website that minimizes the total cost of learning what the user wants to watch and recommending a set of movies.
  • Risk-Based Generalizations of f-divergences Authors: Darío García-García; Ulrike von Luxburg; Raúl Santos-Rodríguez
    We derive a generalized notion of f-divergences, called (f,l)-divergences. We show that this generalization enjoys many of the nice properties of f-divergences, although it is a richer family. It also provides alternative definitions of standard divergences in terms of surrogate risks. As a first practical application of this theory, we derive a new estimator for the Kulback-Leibler divergence that we use for clustering sets of vectors.