ICML 2011
TechTalks from event: ICML 2011
Neural Networks and Deep Learning

Enhanced Gradient and Adaptive Learning Rate for Training Restricted Boltzmann MachinesBoltzmann machines are often used as building blocks in greedy learning of deep networks. However, training even a simplified model, known as restricted Boltzmann machine (RBM), can be extremely laborious: Traditional learning algorithms often converge only with the right choice of the learning rate scheduling and the scale of the initial weights. They are also sensitive to specific data representation: An equivalent RBM can be obtained by flipping some bits and changing the weights and biases accordingly, but traditional learning rules are not invariant to such transformations. Without careful tuning of these training settings, traditional algorithms can easily get stuck at plateaus or even diverge. In this work, we present an enhanced gradient which is derived such that it is invariant to bitflipping transformations. We also propose a way to automatically adjust the learning rate by maximizing a local likelihood estimate. Our experiments confirm that the proposed improvements yield more stable training of RBMs.

On optimization methods for deep learningThe predominant methodology in training deep learning advocates the use of stochastic gradient descent methods (SGDs). Despite its ease of implementation, SGDs are difficult to tune and parallelize. These problems make it challenging to develop, debug and scale up deep learning algorithms with SGDs. In this paper, we show that more sophisticated offtheshelf optimization methods such as Limited memory BFGS (LBFGS) and Conjugate gradient (CG) with linesearch can significantly simplify and speed up the process of pretraining deep algorithms. In our experiments, the difference between LBFGS/CG and SGDs are more pronounced if we consider algorithmic extensions (e.g., sparsity regularization) and hardware extensions (e.g., GPUs or computer clusters). Our experiments with distributed optimization support the use of LBFGS with locally connected networks and convolutional neural networks. Using LBFGS, our convolutional network model achieves 0.69\% on the standard MNIST dataset. This is a stateoftheart result on MNIST among algorithms that do not use distortions or pretraining.

The Hierarchical Beta Process for Convolutional Factor Analysis and Deep LearningA convolutional factoranalysis model is developed, with the number of filters (factors) inferred via the beta process (BP) and hierarchical BP, for singletask and multitask learning, respectively. The computation of the model parameters is implemented within a Bayesian setting, employing Gibbs sampling; we explicitly exploit the convolutional nature of the expansion to accelerate computations. The model is used in a multilevel (“deep”) analysis of general data, with specific results presented for imageprocessing data sets, e.g., classification.

Multimodal Deep LearningDeep networks have been successfully applied to unsupervised feature learning for single modalities (e.g., text, images or audio). In this work, we propose a novel application of deep networks to learn features over multiple modalities. We present a series of tasks for multimodal learning and show how to train deep networks that learn features to address these tasks. In particular, we demonstrate cross modality feature learning, where better features for one modality (e.g., video) can be learned if multiple modalities (e.g., audio and video) are present at feature learning time. Furthermore, we show how to learn a shared representation between modalities and evaluate it on a unique task, where the classifier is trained with audioonly data but tested with videoonly data and viceversa. Our methods are validated on the CUAVE and AVLetters datasets with an audiovisual speech classification task, demonstrating best published visual speech classification on AVLetters and effective shared representation learning.
Reinforcement Learning

MeanVariance Optimization in Markov Decision ProcessesWe consider finite horizon Markov decision processes under performance measures that involve both the mean and the variance of the cumulative reward. We show that either randomized or historybased policies can improve performance. We prove that the complexity of computing a policy that maximizes the mean reward under a variance constraint is NPhard for some cases, and strongly NPhard for others. We finally offer pseudopolynomial exact and approximation algorithms.

Incremental Basis Construction from Temporal Difference ErrorIn many reinforcementlearning (RL) systems, the value function is approximated as a linear combination of a fixed set of basis functions. Performance can be improved by adding to this set. Previous approaches construct a series of basis functions that in sufficient number can eventually represent the value function. In contrast, we show that there is a single, ideal basis function, which can directly represent the value function. Its addition to the set immediately reduces the error to zerowithout changing existing weights. Moreover, this ideal basis function is simply the value function that results from replacing the MDP's reward function with its Bellman error. This result suggests a novel method for improving valuefunction estimation: a primary reinforcement learner estimates its value function using its present basis functions; it then sends its TD error to a secondary learner, which interprets that error as a reward function and estimates the corresponding value function; the resulting value function then becomes the primary learner's new basis function. We present both batch and online versions in combination with incremental basis projection, and demonstrate that the performance is superior to existing methods, especially in the case of large discount factors.

Variational Inference for Policy Search in changing situationsMany policy search algorithms minimize the KullbackLeibler (KL) divergence to a certain target distribution in order to fit their policy. The commonly used KLdivergence forces the resulting policy to be 'rewardattracted'. The policy tries to reproduce all positively rewarded experience while negative experience is neglected. However, the KLdivergence is not symmetric and we can also minimize the the reversed KLdivergence, which is typically used in variational inference. The policy now becomes 'costaverse'. It tries to avoid reproducing any negativelyrewarded experience while maximizing exploration. Due to this 'costaverseness' of the policy, Variational Inference for Policy Search (VIP) has several interesting properties. It requires no kernelbandwith nor exploration rate, such settings are determined automatically by the inference. The algorithm meets the performance of stateoftheart methods while being applicable to simultaneously learning in multiple situations. We concentrate on using VIP for policy search in robotics. We apply our algorithm to learn dynamic counterbalancing of different kinds of pushes with a humanlike 4link robot.

FiniteSample Analysis of LassoTDIn this paper, we analyze the performance of LassoTD, a modification of LSTD in which the projection operator is defined as a Lasso problem. We first show that LassoTD is guaranteed to have a unique fixed point and its algorithmic implementation coincides with the recently presented LARSTD and LCTD methods. We then derive two bounds on the prediction error of LassoTD in the Markov design setting, i.e., when the performance is evaluated on the same states used by the method. The first bound makes no assumption, but has a slow rate w.r.t. the number of samples. The second bound is under an assumption on the empirical Gram matrix, called the compatibility condition, but has an improved rate and directly relates the prediction error to the sparsity of the value function in the feature space at hand.
Bayesian Inference and Probabilistic Models

Estimating the Bayes Point Using Linear Knapsack ProblemsA Bayes Point machine is a binary classifier that approximates the Bayesoptimal classifier by estimating the mean of the posterior distribution of classifier parameters. Past Bayes Point machines have overcome the intractability of this goal by using message passing techniques that approximate the posterior of the classifier parameters as a Gaussian distribution. In this paper, we investigate alternative message passing approaches that do not rely on Gaussian approximation. To make this possible, we introduce a new computational shortcut based on linear multiplechoice knapsack problems that reduces the complexity of computing Bayes Point belief propagation messages from exponential to linear in the number of data features. Empirical tests of our approach show significant improvement in linear classification over both softmargin SVMs and Expectation Propagation Bayes Point machines for several realworld UCI datasets.

Message Passing Algorithms for the Dirichlet Diffusion TreeWe demonstrate efficient approximate inference for the Dirichlet Diffusion Tree, a Bayesian nonparametric prior over tree structures. Although DDTs provide a powerful and elegant approach for modeling hierarchies they haven't seen much use to date. One problem is the computational cost of MCMC inference. We provide the first deterministic approximate inference methods for DDT models and show excellent performance compared to the MCMC alternative. We present message passing algorithms to approximate the Bayesian model evidence for a specific tree. This is used to drive sequential tree building and greedy search to find optimal tree structures, corresponding to hierarchical clusterings of the data. We demonstrate appropriate observation models for continuous and binary data. The empirical performance of our method is very close to the computationally expensive MCMC alternative on a density estimation problem, and significantly outperforms kernel density estimators.

Variational Inference for StickBreaking Beta Process PriorsWe present a variational Bayesian inference algorithm for the stickbreaking construction of the beta process. We derive an alternate representation of the beta process that is amenable to variational inference, and present a bound relating the truncated beta process to its infinite counterpart. We assess performance on two matrix factorization problems, using a nonnegative factorization model and a linearGaussian model.

Infinite Dynamic Bayesian NetworksWe present the infinite dynamic Bayesian network model (iDBN), a nonparametric, factored statespace model that generalizes dynamic Bayesian networks (DBNs). The iDBN can infer every aspect of a DBN: the number of hidden factors, the number of values each factor can take, and (arbitrarily complex) connections and conditionals between factors and observations. In this way, the iDBN generalizes other nonparametric state space models, which until now generally focused on binary hidden nodes and more restricted connection structures. We show how this new prior allows us to find interesting structure in benchmark tests and on two realworld datasets involving weather data and neural information flow networks.
 All Sessions
 Keynotes
 Bandits and Online Learning
 Structured Output
 Reinforcement Learning
 Graphical Models and Optimization
 Recommendation and Matrix Factorization
 Neural Networks and Statistical Methods
 LatentVariable Models
 LargeScale Learning
 Learning Theory
 Feature Selection, Dimensionality Reduction
 Invited CrossConference Track
 Neural Networks and Deep Learning
 LatentVariable Models
 Active and Online Learning
 Tutorial: Learning Kernels
 Tutorial : Collective Intelligence and Machine Learning
 Tutorial: Machine Learning in Ecological Science and Environmental Policy
 Tutorial: Machine Learning and Robotics
 Tutorial: Introduction to Bandits: Algorithms and Theory
 Ensemble Methods
 Tutorial: Machine Learning for Large Scale Recommender Systems
 TestofTime
 Best Paper
 Robotics and Reinforcement Learning
 Transfer Learning
 Kernel Methods
 Optimization
 Learning Theory
 Invited CrossConference Session
 Neural Networks and Deep Learning
 Reinforcement Learning
 Bayesian Inference and Probabilistic Models
 Supervised Learning
 Social Networks
 Evaluation Metrics
 statistical relational learning
 Outlier Detection
 Time Series
 Graphical Models and Bayesian Inference
 Sparsity and Compressed Sensing
 Clustering
 Game Theory and Planning and Control
 SemiSupervised Learning
 Kernel Methods and Optimization
 Neural Networks and NLP
 Probabilistic Models & MCMC
 Online Learning
 Ranking and Information Retrieval