TechTalks from event: ICML 2011

Neural Networks and Statistical Methods

  • Minimum Probability Flow Learning Authors: Jascha Sohl-Dickstein; Peter Battaglino; Michael DeWeese
    Fitting probabilistic models to data is often difficult, due to the general intractability of the partition function and its derivatives. Here we propose a new parameter estimation technique that does not require computing an intractable normalization factor or sampling from the equilibrium distribution of the model. This is achieved by establishing dynamics that would transform the observed data distribution into the model distribution, and then setting as the objective the minimization of the KL divergence between the data distribution and the distribution produced by running the dynamics for an infinitesimal time. Score matching, minimum velocity learning, and certain forms of contrastive divergence are shown to be special cases of this learning technique. We demonstrate parameter estimation in Ising models, deep belief networks and an independent component analysis model of natural scenes. In the Ising model case, current state of the art techniques are outperformed by at least an order of magnitude in learning time, with lower error in recovered coupling parameters.
  • The Importance of Encoding Versus Training with Sparse Coding and Vector Quantization Authors: Adam Coates; Andrew Ng
    While vector quantization (VQ) has been applied widely to generate features for visual recognition problems, much recent work has focused on more powerful methods. In particular, sparse coding has emerged as a strong alternative to traditional VQ approaches and has been shown to achieve consistently higher performance on benchmark datasets. Both approaches can be split into a training phase, where the system learns a dictionary of basis functions from unlabeled data, and an encoding phase, where the dictionary is used to extract features from new inputs. In this work, we investigate the reasons for the success of sparse coding over VQ by decoupling these phases, allowing us to separate out the contributions of the training and encoding in a controlled way. Through extensive experiments on CIFAR, NORB and Caltech 101 datasets, we compare sparse coding and several other training and encoding schemes, including a form of VQ paired with a soft threshold activation function. Our results show not only that we can use fast VQ algorithms for training without penalty, but that we can just as well use randomly chosen exemplars from the training set. Rather than spend resources on training, we find it is more important to choose a good encoder, which can often be as simple as a feed-forward non-linearity. Among our results, we demonstrate state-of-the-art performance on both CIFAR and NORB.
  • Learning Recurrent Neural Networks with Hessian-Free Optimization Authors: James Martens; Ilya Sutskever
    In this work we resolve the long-outstanding problem of how to effectively train recurrent neural networks (RNNs) on complex and difficult sequence modeling problems which may contain long-term data dependencies. Utilizing recent advances in the Hessian-free optimization approach of Martens, together with a novel damping scheme, we successfully train RNNs on two sets of challenging problems: first, a collection of pathological synthetic datasets which are known to be impossible for standard optimization approaches (due to their extremely long-term dependencies), and second, three natural and highly complex real-world sequence datasets, where we find that our method significantly outperforms the previous state-of-the-art method for training neural sequence models, namely the Long Short-Term Memory approach of Hochreiter and Schmidhuber. Additionally, we offer a new interpretation of the generalized Gauss-Newton matrix of Schraudolph, which is used within the HF approach of Martens.
  • On Random Weights and Unsupervised Feature Learning Authors: Andrew Saxe; Pang Wei Koh; Zhenghao Chen; Maneesh Bhand; Bipin Suresh; Andrew Ng
    Recently two anomalous results in the literature have shown that certain feature learning architectures can yield useful features for object recognition tasks even with untrained, random weights. In this paper we pose the question: why do random weights sometimes do so well? Our answer is that certain convolutional pooling architectures can be inherently frequency selective and translation invariant, even with random weights. Based on this we demonstrate the viability of extremely fast architecture search by using random weights to evaluate candidate architectures, thereby sidestepping the time-consuming learning process. We then show that a surprising fraction of the performance of certain state-of-the-art methods can be attributed to the architecture alone.
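
The Coates and Ng abstract above mentions two ingredients that are easy to picture in code: a dictionary made of randomly chosen training exemplars, and a feed-forward soft threshold encoder. The following is only a minimal sketch of that idea, not the authors' implementation. The function names, the unit-length normalization, and the threshold value `alpha=0.25` are assumptions made for illustration.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def random_exemplar_dictionary(data, k, seed=0):
    """Build a dictionary from k randomly chosen, normalized training examples.

    Assumption: no training beyond sampling and normalization is performed.
    """
    rng = random.Random(seed)
    return [normalize(v) for v in rng.sample(data, k)]


def soft_threshold_encode(x, dictionary, alpha=0.25):
    """Feed-forward encoder: f_j = max(0, <d_j, x> - alpha) for each atom d_j.

    alpha is a hypothetical threshold chosen only for this example.
    """
    return [max(0.0, sum(d * xi for d, xi in zip(atom, x)) - alpha)
            for atom in dictionary]


if __name__ == "__main__":
    data = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
    dictionary = random_exemplar_dictionary(data, k=2)
    print(soft_threshold_encode([1.0, 0.0], dictionary))
```

Encoding costs one dot product per dictionary atom, which is why such an encoder is cheap compared with solving a sparse coding problem for every input.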

Learning Theory

  • On the Necessity of Irrelevant Variables Authors: Dave Helmbold; Phil Long
    This work explores the effects of relevant and irrelevant boolean variables on the accuracy of classifiers. The analysis uses the assumption that the variables are conditionally independent given the class, and focuses on a natural family of learning algorithms for such sources when the relevant variables have a small advantage over random guessing. The main result is that algorithms relying predominately on irrelevant variables have error probabilities that quickly go to 0 in situations where algorithms that limit the use of irrelevant variables have errors bounded below by a positive constant. We also show that accurate learning is possible even when there are so few examples that one cannot determine with high confidence whether or not any individual variable is relevant.
  • A PAC-Bayes Sample-compression Approach to Kernel Methods Authors: Pascal Germain; Alexandre Lacoste; Francois Laviolette; Mario Marchand; Sara Shanian
    We propose a PAC-Bayes sample compression approach to kernel methods that can accommodate any bounded similarity function and show that the support vector machine (SVM) classifier is a particular case of a more general class of data-dependent classifiers known as majority votes of sample-compressed classifiers. We provide novel risk bounds for these majority votes and learning algorithms that minimize these bounds.
  • Simultaneous Learning and Covering with Adversarial Noise Authors: Andrew Guillory; Jeff Bilmes
    We study simultaneous learning and covering problems: submodular set cover problems that depend on the solution to an active (query) learning problem. The goal is to jointly minimize the cost of both learning and covering. We extend recent work in this setting to allow for a limited amount of adversarial noise. Certain noisy query learning problems are a special case of our problem. Crucial to our analysis is a lemma showing the logical OR of two submodular cover constraints can be reduced to a single submodular set cover constraint. Combined with known results, this new lemma allows for arbitrary monotone circuits of submodular cover constraints to be reduced to a single constraint. As an example practical application, we present a movie recommendation website that minimizes the total cost of learning what the user wants to watch and recommending a set of movies.
  • Risk-Based Generalizations of f-divergences Authors: Darío García-García; Ulrike von Luxburg; Raúl Santos-Rodríguez
    We derive a generalized notion of f-divergences, called (f,l)-divergences. We show that this generalization enjoys many of the nice properties of f-divergences, although it is a richer family. It also provides alternative definitions of standard divergences in terms of surrogate risks. As a first practical application of this theory, we derive a new estimator for the Kullback-Leibler divergence that we use for clustering sets of vectors.
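
For readers unfamiliar with the f-divergences that the last abstract above generalizes, here is a small sketch of the standard definition for discrete distributions, D_f(P || Q) = sum_i q_i f(p_i / q_i), together with a check that the generator f(t) = t log t recovers the Kullback-Leibler divergence. It does not implement the (f,l)-divergences or the estimator proposed in the paper. The function names are illustrative, and every q_i is assumed to be strictly positive.

```python
import math


def f_divergence(p, q, f):
    """D_f(P || Q) = sum_i q_i * f(p_i / q_i), assuming every q_i > 0."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))


def kl_generator(t):
    """Generator f(t) = t * log(t), with the convention f(0) = 0."""
    return t * math.log(t) if t > 0 else 0.0


def kl_divergence(p, q):
    """Direct definition: sum_i p_i * log(p_i / q_i), skipping terms with p_i = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    p = [0.5, 0.5]
    q = [0.25, 0.75]
    # The two values agree up to floating-point rounding.
    print(f_divergence(p, q, kl_generator), kl_divergence(p, q))
```

Other convex generators with f(1) = 0 can be passed in the same way to obtain other members of the family.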

Feature Selection, Dimensionality Reduction

  • Eigenvalue Sensitive Feature Selection Authors: Yi Jiang; Jiangtao Ren
    In recent years, several spectral feature selection methods have been proposed to choose the features that best preserve sample similarity. However, when the data contain many irrelevant or noisy features, the similarity matrix constructed from all the unweighted features may not be reliable, which misleads existing spectral feature selection methods into selecting 'wrong' features. To solve this problem, we propose that the importance of a feature be evaluated according to its impact on the similarity matrix, meaning that features with a high impact on the similarity matrix should be chosen as important ones. Since the graph Laplacian (von Luxburg, 2007) is defined on the similarity matrix, the impact of each feature on the similarity matrix is reflected in the change of the graph Laplacian, especially in its eigen-system. Based on this point of view, we propose an Eigenvalue Sensitive Criteria (EVSC) for feature selection, which seeks the features with a high impact on the graph Laplacian's eigenvalues. Empirical analysis demonstrates that our proposed method outperforms some traditional spectral feature selection methods.
  • Cauchy Graph Embedding Authors: Dijun Luo; Chris Ding; Feiping Nie; Heng Huang
    Laplacian embedding provides a low-dimensional representation for the nodes of a graph where the edge weights denote pairwise similarity among the node objects. It is commonly assumed that the Laplacian embedding results preserve the local topology of the original data on the low-dimensional projected subspaces, i.e., for any pair of graph nodes with large similarity, they should be embedded closely in the embedded space. However, in this paper, we will show that the Laplacian embedding often cannot preserve local topology as well as expected. To enhance the local topology preserving property in graph embedding, we propose a novel Cauchy graph embedding which preserves the similarity relationships of the original data in the embedded space via a new objective. Consequently, machine learning tasks (such as k Nearest Neighbor type classifications) can be easily conducted on the embedded data with better performance. The experimental results on both synthetic and real world benchmark data sets demonstrate the usefulness of this new type of embedding.
  • Tree preserving embedding Authors: Albert Shieh; Tatsunori Hashimoto; Edo Airoldi
    Visualization techniques for complex data are a workhorse of modern scientific pursuits. The goal of visualization is to embed high dimensional data in a low dimensional space, while preserving structure in the data relevant to exploratory data analysis, such as the existence of clusters. However, existing visualization methods often either entirely fail to preserve clusters in embeddings due to the crowding problem or can only preserve clusters at a single resolution. Here, we develop a new approach to visualization, tree preserving embedding (TPE). Our approach takes advantage of the topological notion of connectedness to provably preserve clusters at all resolutions. Our performance guarantee holds for finite samples, which makes TPE a robust method for applications. Our approach suggests new strategies for robust data visualization in practice.
  • Stochastic Low-Rank Kernel Learning for Regression Authors: Pierre Machart; Thomas Peel; Sandrine Anthoine; Liva Ralaivola; Hervé Glotin
    We present a novel approach to learn a kernel-based regression function. It is based on the use of conical combinations of data-based parameterized kernels and on a new stochastic convex optimization procedure for which we establish convergence guarantees. The overall learning procedure has the nice properties that a) the learned conical combination is automatically designed to perform the regression task at hand and b) the updates required by the optimization procedure are quite inexpensive. In order to shed light on the appositeness of our learning strategy, we present empirical results from experiments conducted on various benchmark datasets.
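
Two abstracts in this session build on the graph Laplacian of a similarity matrix. As background only, the sketch below constructs the unnormalized Laplacian L = D - W and checks the standard identity that the weighted sum of squared differences over ordered node pairs equals 2 x^T L x, which is the quantity a one-dimensional Laplacian embedding minimizes under normalization constraints. It does not implement the Cauchy objective or the EVSC criterion. The function names and the toy weight matrix are assumptions for illustration.

```python
def graph_laplacian(W):
    """Unnormalized Laplacian L = D - W for a symmetric weight matrix W."""
    n = len(W)
    degrees = [sum(row) for row in W]
    return [[(degrees[i] if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]


def pairwise_cost(W, x):
    """Sum over ordered pairs (i, j) of w_ij * (x_i - x_j)^2."""
    n = len(W)
    return sum(W[i][j] * (x[i] - x[j]) ** 2
               for i in range(n) for j in range(n))


def quadratic_form(L, x):
    """Compute x^T L x for a square matrix L and a vector x."""
    n = len(L)
    return sum(x[i] * L[i][j] * x[j] for i in range(n) for j in range(n))


if __name__ == "__main__":
    # Toy path graph on three nodes with edge weights 1 and 2.
    W = [[0.0, 1.0, 0.0],
         [1.0, 0.0, 2.0],
         [0.0, 2.0, 0.0]]
    x = [0.0, 1.0, 3.0]  # a candidate one-dimensional embedding
    L = graph_laplacian(W)
    print(pairwise_cost(W, x), 2 * quadratic_form(L, x))
```

Because strongly connected nodes contribute large penalties when placed far apart, minimizing this cost pulls similar nodes together in the embedding.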