ICML 2011
TechTalks from event: ICML 2011
Supervised Learning

MultiLabel Classification on Tree and DAGStructured HierarchiesMany realworld applications involve multilabel classification, in which the labels are organized in the form of a tree or directed acyclic graph (DAG). However, current research efforts typically ignore the label dependencies or can only exploit the dependencies in treestructured hierarchies. In this paper, we present a novel hierarchical multilabel classification algorithm which can be used on both tree and DAGstructured hierarchies. The key idea is to formulate the search for the optimal consistent multilabel as the finding of the best subgraph in a tree/DAG. Using a simple greedy strategy, the proposed algorithm is computationally efficient, easy to implement, does not suffer from the problem of insufficient/skewed training data in classifier training, and can be readily used on large hierarchies. Theoretical results guarantee the optimality of the obtained solution. Experiments are performed on a large number of functional genomics data sets. The proposed method consistently outperforms the stateoftheart method on both tree and DAGstructured hierarchies.

Surrogate losses and regret bounds for costsensitive classification with exampledependent costsWe study surrogate losses in the context of costsensitive classification with exampledependent costs, a problem also known as regression level set estimation. We give sufficient conditions on the surrogate loss for the existence of a surrogate regret bound. Such bounds imply that as the surrogate risk tends to its optimal value, so too does the expected misclassification cost. Our sufficient conditions encompass exampledependent versions of the hinge, exponential, and other common losses. These results provide theoretical justification for some previously proposed surrogatebased algorithms, and suggests others that have not yet been developed.

Support Vector Machines as Probabilistic ModelsWe show how the SVM can be viewed as a maximum likelihood estimate of a class of probabilistic models. This model class can be viewed as a reparametrization of the SVM in a similar vein to the $ u$SVM reparametrizing the original SVM. It is not discriminative, but has a nonuniform marginal. We illustrate the benefits of this new view by rederiving and reinvestigating two established SVMrelated algorithms.

Locally Linear Support Vector MachinesLinear support vector machines ({sc svm}s) have become popular for solving classification tasks due to their fast and simple online application to large scale data sets. However, many problems are not linearly separable. For these problems kernelbased {sc svm}s are often used, but unlike their linear variant they suffer from various drawbacks in terms of computational and memory efficiency. Their response can be represented only as a function of the set of support vectors, which has been experimentally shown to grow linearly with the size of the training set. In this paper we propose a novel locally linear {sc svm} classifier with smooth decision boundary and bounded curvature. We show how the functions defining the classifier can be approximated using local codings and show how this model can be optimized in an online fashion by performing stochastic gradient descent with the same convergence guarantees as standard gradient descent method for linear {sc svm}. Our method achieves comparable performance to the stateoftheart whilst being significantly faster than competing kernel {sc svm}s. We generalise this model to locally finite dimensional kernel {sc svm}.
Social Networks

Uncovering the Temporal Dynamics of Diffusion NetworksTime plays an essential role in the diffusion of information, influence and disease over networks. In many cases we only observe when a node copies information, makes a decision or becomes infected  but the connectivity, transmission rates between nodes and transmission sources are unknown. Inferring the underlying dynamics is of outstanding interest since it enables forecasting, influencing and retarding infections, broadly construed. To this end, we model diffusion processes as discrete networks of continuous temporal processes occurring at different rates. Given cascade data  observed infection times of nodes  we infer the edges of the global diffusion network and estimate the transmission rates of each edge that best explain the observed data. The optimization problem is convex. The model naturally (without heuristics) imposes sparse solutions and requires no parameter tuning. The problem decouples into a collection of independent smaller problems, thus scaling easily to networks on the order of hundreds of thousands of nodes. Experiments on real and synthetic data show that our algorithm both recovers the edges of diffusion networks and accurately estimates their transmission rates from cascade data.

Dynamic Egocentric Models for Citation NetworksThe analysis of the formation and evolution of networks over time is of fundamental importance to social science, biology, and many other fields. While longitudinal network data sets are increasingly being recorded at the granularity of individual timestamped events, most studies only focus on collapsed crosssectional snapshots of the network. In this paper, we introduce a dynamic egocentric framework that models continuoustime network data using multivariate counting processes. For inference, an efficient partial likelihood approach is used, allowing our methods to scale to large networks. We apply our techniques to various citation networks and demonstrate the predictive power and interpretability of the learned statistical models.
Evaluation Metrics

Brier Curves: a New CostBased Visualisation of Classifier PerformanceIt is often necessary to evaluate classifier performance over a range of operating conditions, rather than as a point estimate. This is typically assessed through the construction of ‘curves’over a ‘space’, visualising how one or two performance metrics vary with the operating condition. For binary classifiers in particular, cost space is a natural way of showing this range of performance, visualising loss against operating condition. However, the curves which have been traditionally drawn in cost space, known as cost curves, show the optimal loss, and hence assume knowledge of the optimal decision threshold for a given operating condition. Clearly, this leads to an optimistic assessment of classifier performance. In this paper we propose a more natural way of visualising classifier performance in cost space, which is to plot probabilistic loss on the yaxis, i.e., the loss arising from the probability estimates. This new curve provides new ways of understanding classifier performance and new tools to compare classifiers. In addition, we show that the area under this curve is exactly the Brier score, one of the most popular performance metrics for probabilistic classifiers.

A Coherent Interpretation of AUC as a Measure of Aggregated Classification PerformanceThe area under the ROC curve (AUC), a wellknown measure of ranking performance, is also often used as a measure of classification performance, aggregating over decision thresholds as well as class and cost skews. However, David Hand has recently argued that AUC is fundamentally incoherent as a measure of aggregated classifier performance and proposed an alternative measure. Specifically, Hand derives a linear relationship between AUC and expected minimum loss, where the expectation is taken over a distribution of the misclassification cost parameter that depends on the model under consideration. Replacing this distribution with a Beta(2;2) distribution, Hand derives his alternative measure H. In this paper we offer an alternative, coherent interpretation of AUC as linearly related to expected loss. We use a distribution over cost parameter and a distribution over data points, both uniform and hence model independent. Should one wish to consider only optimal thresholds, we demonstrate that a simple and more intuitive alternative to Hand’s H measure is already available in the form of the area under the cost curve.
 All Sessions
 Keynotes
 Bandits and Online Learning
 Structured Output
 Reinforcement Learning
 Graphical Models and Optimization
 Recommendation and Matrix Factorization
 Neural Networks and Statistical Methods
 LatentVariable Models
 LargeScale Learning
 Learning Theory
 Feature Selection, Dimensionality Reduction
 Invited CrossConference Track
 Neural Networks and Deep Learning
 LatentVariable Models
 Active and Online Learning
 Ensemble Methods
 Tutorial: Introduction to Bandits: Algorithms and Theory
 Tutorial: Machine Learning for Large Scale Recommender Systems
 Tutorial: Learning Kernels
 Tutorial : Collective Intelligence and Machine Learning
 Tutorial: Machine Learning in Ecological Science and Environmental Policy
 Tutorial: Machine Learning and Robotics
 TestofTime
 Best Paper
 Robotics and Reinforcement Learning
 Transfer Learning
 Kernel Methods
 Optimization
 Learning Theory
 Invited CrossConference Session
 Neural Networks and Deep Learning
 Reinforcement Learning
 Bayesian Inference and Probabilistic Models
 Supervised Learning
 Social Networks
 Evaluation Metrics
 statistical relational learning
 Outlier Detection
 Time Series
 Graphical Models and Bayesian Inference
 Sparsity and Compressed Sensing
 Clustering
 Game Theory and Planning and Control
 SemiSupervised Learning
 Kernel Methods and Optimization
 Neural Networks and NLP
 Probabilistic Models & MCMC
 Online Learning
 Ranking and Information Retrieval