ICML 2011
TechTalks from event: ICML 2011
Graphical Models and Bayesian Inference
-
Variational Heteroscedastic Gaussian Process Regression
Standard Gaussian processes (GPs) model observations' noise as constant throughout input space. This assumption is often too restrictive, but it is needed for GP inference to be tractable. In this work we present a non-standard variational approximation that allows accurate inference in heteroscedastic GPs (i.e., under input-dependent noise conditions). Computational cost is roughly twice that of the standard GP, and also scales as O(n^3). Accuracy is verified by comparing with gold-standard MCMC, and effectiveness is illustrated on several synthetic and real datasets of diverse characteristics. An application to volatility forecasting is also considered.
-
Predicting Legislative Roll Calls from Text
We develop several predictive models linking legislative sentiment to legislative text. Our models, which draw on ideas from ideal point estimation and topic models, predict voting patterns based on the contents of bills and infer the political leanings of legislators. With supervised topics, we provide an exploratory window into how the language of the law is correlated with political support. We also derive approximate posterior inference algorithms based on variational methods. Across 12 years of legislative data, we predict specific voting patterns with high accuracy.
-
Bounding the Partition Function using Hölder's Inequality
We describe an algorithm for approximate inference in graphical models based on Hölder's inequality that provides upper and lower bounds on common summation problems such as computing the partition function or probability of evidence in a graphical model. Our algorithm unifies and extends several existing approaches, including variable elimination techniques such as mini-bucket elimination and variational methods such as tree reweighted belief propagation and conditional entropy decomposition. We show that our method inherits benefits from each approach to provide significantly better bounds on sum-product tasks.
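As a rough illustration of the inequality this abstract builds on (not the authors' algorithm), the sketch below checks the upper bound sum_x prod_i f_i(x) <= prod_i (sum_x f_i(x)^(1/w_i))^(w_i) for positive weights w_i summing to 1. It assumes all factors are nonnegative and defined over one shared discrete variable; the function names and the toy factors are hypothetical.

```python
def exact_sum(factors):
    """Exact value of sum_x prod_i f_i(x) for factors over one shared variable."""
    total = 0.0
    for values in zip(*factors):
        prod = 1.0
        for v in values:
            prod *= v
        total += prod
    return total


def holder_upper_bound(factors, weights):
    """Hölder upper bound: prod_i (sum_x f_i(x)^(1/w_i))^(w_i).

    Assumes nonnegative factor values and positive weights that sum to 1.
    """
    bound = 1.0
    for f, w in zip(factors, weights):
        bound *= sum(v ** (1.0 / w) for v in f) ** w
    return bound


# Toy factors (illustrative values only).
f1 = [1.0, 2.0, 0.5, 3.0]
f2 = [2.0, 0.1, 4.0, 1.0]

exact = exact_sum([f1, f2])
bound = holder_upper_bound([f1, f2], [0.5, 0.5])
print(exact, bound)
```

With equal weights of 0.5 this reduces to the Cauchy-Schwarz inequality. Other weight choices give different bounds, and how tight each one is depends on the factors.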
-
On Bayesian PCA: Automatic Dimensionality Selection and Analytic Solution
In probabilistic PCA, the fully Bayesian estimation is computationally intractable. To cope with this problem, two types of approximation schemes were introduced: the partially Bayesian PCA (PB-PCA) where only the latent variables are integrated out, and the variational Bayesian PCA (VB-PCA) where the loading vectors are also integrated out. The VB-PCA was proposed as an improved variant of PB-PCA for enabling automatic dimensionality selection (ADS). In this paper, we investigate whether VB-PCA is really the best choice from the viewpoints of computational efficiency and ADS. We first show that ADS is not a unique feature of VB-PCA: PB-PCA is in fact also equipped with ADS. We further show that PB-PCA is more advantageous in computational efficiency than VB-PCA because the global solution of PB-PCA can be computed analytically. However, we also show the negative fact that PB-PCA results in a trivial solution in the empirical Bayesian framework. We next consider a simplified variant of VB-PCA, where the latent variables and loading vectors are assumed to be mutually independent (while the ordinary VB-PCA only requires matrix-wise independence). We show that this simplified VB-PCA is the most advantageous in practice because its empirical Bayes solution experimentally works as well as the original VB-PCA, and its global optimal solution can be computed efficiently in a closed form.
-
Bayesian CCA via Group Sparsity
Bayesian treatments of Canonical Correlation Analysis (CCA)-type latent variable models have been recently proposed for coping with overfitting in small sample sizes, as well as for producing factorizations of the data sources into correlated and non-shared effects. However, all of the current implementations of Bayesian CCA and its extensions are computationally inefficient for high-dimensional data and, as shown in this paper, break down completely for high-dimensional sources with low sample count. Furthermore, they cannot reliably separate the correlated effects from non-shared ones. We propose a new Bayesian CCA variant that is computationally efficient and works for high-dimensional data, while also learning the factorization more accurately. The improvements are gained by introducing a group sparsity assumption and an improved variational approximation. The method is demonstrated to work well on multi-label prediction tasks and in analyzing brain correlates of naturalistic audio stimulation.
Sparsity and Compressed Sensing
-
Efficient Sparse Modeling with Automatic Feature Grouping
The grouping of features is highly beneficial in learning with high-dimensional data. It reduces the variance in the estimation and improves the stability of feature selection, leading to improved generalization. Moreover, it can also help in data understanding and interpretation. OSCAR is a recent sparse modeling tool that achieves this by using an $\ell_1$-regularizer and a pairwise $\ell_\infty$-regularizer. However, its optimization is computationally expensive. In this paper, we propose an efficient solver based on the accelerated gradient methods. We show that its key projection step can be solved by a simple iterative group merging algorithm. It is highly efficient and reduces the empirical time complexity from $O(d^3 \sim d^5)$ for the existing solvers to just $O(d)$, where $d$ is the number of features. Experimental results on toy and real-world data sets demonstrate that OSCAR is a competitive sparse modeling approach with the added ability of automatic feature grouping.
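To make the regularizer concrete, here is a minimal sketch of the OSCAR penalty as it is usually written: an $\ell_1$ term plus a sum of pairwise maxima of absolute coefficients. The weights `lam1` and `lam2` and both function names are assumptions for illustration. This evaluates the penalty only, not the accelerated-gradient solver or the group-merging projection from the talk.

```python
def oscar_penalty(beta, lam1, lam2):
    """lam1 * ||beta||_1 + lam2 * sum_{i<j} max(|beta_i|, |beta_j|), computed pairwise."""
    abs_b = [abs(b) for b in beta]
    l1_term = sum(abs_b)
    pairwise_term = 0.0
    for i in range(len(abs_b)):
        for j in range(i + 1, len(abs_b)):
            pairwise_term += max(abs_b[i], abs_b[j])
    return lam1 * l1_term + lam2 * pairwise_term


def oscar_penalty_sorted(beta, lam1, lam2):
    """Same value via sorting: the k-th smallest |beta| (0-based k) is the max in exactly k pairs."""
    abs_sorted = sorted(abs(b) for b in beta)
    return sum((lam1 + lam2 * k) * v for k, v in enumerate(abs_sorted))


beta = [3.0, -1.0, 2.0]
print(oscar_penalty(beta, 1.0, 0.5))         # pairwise form, O(d^2)
print(oscar_penalty_sorted(beta, 1.0, 0.5))  # sorted form, O(d log d)
```

The sorted form shows that the penalty is a weighted $\ell_1$ norm whose weights grow with the rank of each coefficient's magnitude.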
-
Robust Matrix Completion and Corrupted Columns
This paper considers the problem of matrix completion when some number of the columns are arbitrarily corrupted. It is well-known that standard algorithms for matrix completion can return arbitrarily poor results if even a single column is corrupted. What can be done if a large number, or even a constant fraction, of columns are corrupted? In this paper, we study this very problem, and develop a robust and efficient algorithm for its solution. One direct application comes from robust collaborative filtering. Here, some number of users are so-called manipulators, and try to skew the predictions of the algorithm. Significantly, our results hold without any assumptions on the observed entries of the manipulated columns.
-
Clustering Partially Observed Graphs via Convex Optimization
This paper considers the problem of clustering a partially observed unweighted graph, i.e., one where for some node pairs we know there is an edge between them, for some others we know there is no edge, and for the remaining we do not know whether or not there is an edge. We want to organize the nodes into disjoint clusters so that there is relatively dense (observed) connectivity within clusters, and sparse across clusters. We take a novel yet natural approach to this problem, by focusing on finding the clustering that minimizes the number of "disagreements", i.e., the sum of the number of (observed) missing edges within clusters, and (observed) present edges across clusters. Our algorithm uses convex optimization; its basis is a reduction of disagreement minimization to the problem of recovering an (unknown) low-rank matrix and an (unknown) sparse matrix from their partially observed sum. We show that our algorithm succeeds under certain natural assumptions on the optimal clustering and its disagreements. Our results significantly strengthen existing matrix splitting results for the special case of our clustering problem. Our results directly enhance solutions to the problem of Correlation Clustering with partial observations.
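The "disagreements" objective in this abstract can be evaluated directly for any candidate clustering. The sketch below does only that. It is not the convex low-rank-plus-sparse recovery the authors propose. The data layout (a dict of observed pairs mapped to 1 for an edge and 0 for a known non-edge, with unobserved pairs simply absent) and the function name are assumptions.

```python
def count_disagreements(labels, observed):
    """Count observed non-edges inside clusters plus observed edges across clusters.

    labels:   dict mapping node -> cluster id
    observed: dict mapping (u, v) -> 1 (edge observed) or 0 (non-edge observed);
              pairs that were not observed are left out entirely
    """
    disagreements = 0
    for (u, v), has_edge in observed.items():
        same_cluster = labels[u] == labels[v]
        if same_cluster and not has_edge:
            disagreements += 1  # missing edge within a cluster
        elif not same_cluster and has_edge:
            disagreements += 1  # present edge across clusters
    return disagreements


labels = {"a": 0, "b": 0, "c": 1, "d": 1}
observed = {
    ("a", "b"): 1,  # edge within cluster 0: agrees
    ("c", "d"): 0,  # non-edge within cluster 1: disagrees
    ("a", "c"): 1,  # edge across clusters: disagrees
    ("b", "d"): 0,  # non-edge across clusters: agrees
}
print(count_disagreements(labels, observed))
```

Pairs such as ("a", "d") are unobserved here and therefore contribute nothing to the count.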
-
Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions
We analyze a class of estimators based on a convex relaxation for solving high-dimensional matrix decomposition problems. The observations are the noisy realizations of the sum of an (approximately) low rank matrix $\Theta^\star$ with a second matrix $\Gamma^\star$ endowed with a complementary form of low-dimensional structure. We derive a general theorem that gives upper bounds on the Frobenius norm error for an estimate of the pair $(\Theta^\star, \Gamma^\star)$ obtained by solving a convex optimization problem. We then specialize our general result to two cases that have been studied in the context of robust PCA: low rank plus sparse structure, and low rank plus a column sparse structure. Our theory yields Frobenius norm error bounds for both deterministic and stochastic noise matrices, and in the latter case, they are minimax optimal. The sharpness of our theoretical predictions is also confirmed in numerical simulations.
-
Submodular meets Spectral: Greedy Algorithms for Subset Selection, Sparse Approximation and Dictionary Selection
We study the problem of selecting a subset of $k$ random variables from a large set, in order to obtain the best linear prediction of another variable of interest. This problem can be viewed in the context of both feature selection and sparse approximation. We analyze the performance of widely used greedy heuristics, using insights from the maximization of submodular functions and spectral analysis. We introduce the submodularity ratio as a key quantity to help understand why greedy algorithms perform well even when the variables are highly correlated. Using our techniques, we obtain the strongest known approximation guarantees for this problem, both in terms of the submodularity ratio and the smallest $k$-sparse eigenvalue of the covariance matrix. We also analyze greedy algorithms for the dictionary selection problem, and significantly improve the previously known guarantees. Our theoretical analysis is complemented by experiments on real-world and synthetic data sets; the experiments show that the submodularity ratio is a stronger predictor of the performance of greedy algorithms than other spectral parameters.
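One of the greedy heuristics for this subset selection problem is forward selection: repeatedly add the variable that most increases the squared multiple correlation $R^2$ of the selected set with the target. The sketch below is a plain illustration of that heuristic, not the analysis from the talk. It assumes unit-variance variables, a covariance matrix `C` among the candidates, a vector `b` of covariances with the target, and non-singular selected submatrices. All names and the toy numbers are hypothetical.

```python
def solve(A, y):
    """Solve A z = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    M = [list(A[i]) + [y[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - tail) / M[i][i]
    return z


def r_squared(C, b, subset):
    """R^2 of the best linear predictor using `subset`: b_S^T C_S^{-1} b_S."""
    if not subset:
        return 0.0
    C_s = [[C[i][j] for j in subset] for i in subset]
    b_s = [b[i] for i in subset]
    z = solve(C_s, b_s)
    return sum(bi * zi for bi, zi in zip(b_s, z))


def greedy_forward_selection(C, b, k):
    """Pick k variables, each time adding the one that maximizes R^2."""
    selected = []
    for _ in range(k):
        candidates = [i for i in range(len(b)) if i not in selected]
        best = max(candidates, key=lambda i: r_squared(C, b, selected + [i]))
        selected.append(best)
    return selected


# Variables 0 and 1 are highly correlated; variable 2 is uncorrelated with both.
C = [[1.0, 0.9, 0.0],
     [0.9, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
b = [0.6, 0.55, 0.5]

selected = greedy_forward_selection(C, b, 2)
print(selected, r_squared(C, b, selected))
```

In this toy case the second pick skips variable 1 even though its individual covariance with the target is higher than that of variable 2, because variable 1 adds little beyond what variable 0 already explains.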
Clustering
-
On Information-Maximization Clustering: Tuning Parameter Selection and Analytic Solution
Information-maximization clustering learns a probabilistic classifier in an unsupervised manner so that mutual information between feature vectors and cluster assignments is maximized. A notable advantage of this approach is that it only involves continuous optimization of model parameters, which is substantially easier to solve than discrete optimization of cluster assignments. However, existing methods still involve non-convex optimization problems, and therefore finding a good local optimal solution is not straightforward in practice. In this paper, we propose an alternative information-maximization clustering method based on a squared-loss variant of mutual information. This novel approach gives a clustering solution analytically in a computationally efficient way via kernel eigenvalue decomposition. Furthermore, we provide a practical model selection procedure that allows us to objectively optimize tuning parameters included in the kernel function. Through experiments, we demonstrate the usefulness of the proposed approach.
-
Pruning nearest neighbor cluster trees
Nearest neighbor ($k$-NN) graphs are widely used in machine learning and data mining applications, and our aim is to better understand what they reveal about the cluster structure of the unknown underlying distribution of points. Moreover, is it possible to identify spurious structures that might arise due to sampling variability? Our first contribution is a statistical analysis that reveals how certain subgraphs of a $k$-NN graph form a consistent estimator of the cluster tree of the underlying distribution of points. Our second and perhaps most important contribution is the following finite sample guarantee. We carefully work out the tradeoff between aggressive and conservative pruning and are able to guarantee the removal of all spurious cluster structures while at the same time guaranteeing the recovery of salient clusters. This is the first such finite sample result in the context of clustering.
-
A Co-training Approach for Multi-view Spectral Clustering
We propose a spectral clustering algorithm for the multi-view setting where we have access to multiple views of the data, each of which can be independently used for clustering. Our spectral clustering algorithm has a flavor of co-training, which is already a widely used idea in semi-supervised learning. We work on the assumption that the true underlying clustering would assign a point to the same cluster irrespective of the view. Hence, we constrain our approach to only search for the clusterings that agree across the views. Our algorithm does not have any hyperparameters to set, which is a major advantage in unsupervised learning. We empirically compare with a number of baseline methods on synthetic and real-world datasets to show the efficacy of the proposed algorithm.
-
Clusterpath: an Algorithm for Clustering using Convex Fusion Penalties
We present a new clustering algorithm by proposing a convex relaxation of hierarchical clustering, which results in a family of objective functions with a natural geometric interpretation. We give efficient algorithms for calculating the continuous regularization path of solutions, and discuss relative advantages of the parameters. Our method experimentally gives state-of-the-art results similar to spectral clustering for non-convex clusters, and has the added benefit of learning a tree structure from the data.
-
A Unified Probabilistic Model for Global and Local Unsupervised Feature Selection
Existing algorithms for joint clustering and feature selection can be categorized as either global or local approaches. Global methods select a single cluster-independent subset of features, whereas local methods select cluster-specific subsets of features. In this paper, we present a unified probabilistic model that can perform both global and local feature selection for clustering. Our approach is based on a hierarchical beta-Bernoulli prior combined with a Dirichlet process mixture model. We obtain global or local feature selection by adjusting the variance of the beta prior. We provide a variational inference algorithm for our model. In addition to simultaneously learning the clusters and features, this Bayesian formulation allows us to learn both the number of clusters and the number of features to retain. Experiments on synthetic and real data show that our unified model can find global and local features and cluster data as well as competing methods of each type.