Conference on Learning Theory
TechTalks from event: Conference on Learning Theory

Competitive Classification and Closeness Testing
We study the problems of classification and closeness testing. A classifier associates a test sequence with the one of two training sequences that was generated by the same distribution. A closeness test determines whether two sequences were generated by the same distribution or by different distributions. For both problems, all natural algorithms are symmetric: they make the same decision under all symbol relabelings. With no assumptions on the distributions' support size or relative distance, we construct a classifier and a closeness test that require at most Õ(n^{3/2}) samples to attain the n-sample accuracy of the best symmetric classifier or closeness test designed with knowledge of the underlying distributions. Both algorithms run in time linear in the number of samples. Conversely, we also show that for any classifier or closeness test, there are distributions that require Ω(n^{7/6}) samples to achieve the n-sample accuracy of the best symmetric algorithm that knows the underlying distributions.
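As a concrete illustration of a symmetric algorithm, here is a minimal Python sketch (a standard l2-style closeness statistic, not the paper's competitive construction): it depends only on per-symbol counts, so relabeling symbols cannot change its decision. The threshold is a free parameter the user must calibrate.

```python
from collections import Counter

def closeness_statistic(x, y):
    """Label-invariant closeness statistic: an l2-style comparison of
    per-symbol counts between two sample sequences."""
    cx, cy = Counter(x), Counter(y)
    # (a - b)^2 - a - b is the standard collision-style term; its
    # expectation is (approximately) zero per symbol when the two
    # distributions are equal and the sample sizes match.
    return sum((cx[s] - cy[s]) ** 2 - cx[s] - cy[s] for s in set(cx) | set(cy))

def closeness_test(x, y, threshold):
    """Declare "same distribution" when the statistic is below the threshold."""
    return closeness_statistic(x, y) <= threshold
```

Because the statistic is a function of the counts alone, it is symmetric in the abstract's sense, and it runs in time linear in the number of samples.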

Kernels Based Tests with Non-asymptotic Bootstrap Approaches for Two-sample Problems
Considering either two independent i.i.d. samples, two independent samples generated from a heteroscedastic regression model, or two independent Poisson processes, we address the question of testing the equality of their respective distributions. We first propose single testing procedures based on a general symmetric kernel. The corresponding critical values are chosen from a wild or permutation bootstrap approach, and the obtained tests are exactly (and not just asymptotically) of the prescribed level. We then introduce an aggregation method, which makes it possible to overcome the difficulty of choosing a kernel and/or the parameters of the kernel. We derive non-asymptotic properties for the aggregated tests, proving that they may be optimal in a classical statistical sense.
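A minimal sketch of the kernel-plus-permutation-bootstrap idea, assuming a Gaussian kernel and the biased squared-MMD statistic as an illustrative choice rather than the authors' exact procedure: under the null the pooled sample is exchangeable, so the permutation quantile yields an exact, non-asymptotic level.

```python
import math
import random

def gaussian_kernel(a, b, bandwidth=1.0):
    return math.exp(-((a - b) ** 2) / (2 * bandwidth ** 2))

def mmd2(x, y, kernel=gaussian_kernel):
    """Biased estimate of the squared MMD between samples x and y."""
    kxx = sum(kernel(a, b) for a in x for b in x) / (len(x) ** 2)
    kyy = sum(kernel(a, b) for a in y for b in y) / (len(y) ** 2)
    kxy = sum(kernel(a, b) for a in x for b in y) / (len(x) * len(y))
    return kxx + kyy - 2 * kxy

def permutation_test(x, y, n_perm=200, seed=0):
    """Permutation bootstrap: recompute the statistic on relabeled pooled
    samples; exchangeability under the null makes the level exact."""
    rng = random.Random(seed)
    observed = mmd2(x, y)
    pooled = list(x) + list(y)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if mmd2(pooled[:len(x)], pooled[len(x):]) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)  # permutation p-value
```

Choosing the bandwidth (and the kernel itself) is exactly the difficulty the abstract's aggregation method addresses; the fixed default here is only for illustration.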

Differentially Private Online Learning
In this paper, we consider the problem of preserving privacy in the context of online learning. Online learning involves learning from data in real time, because of which the learned model, as well as its predictions, is continuously changing. This makes preserving the privacy of each data point significantly more challenging, as its effect on the learned model can be easily tracked by observing changes in the subsequent predictions. Furthermore, with more and more online systems (e.g., search engines like Bing and Google) trying to learn their customers' behavior by leveraging their access to sensitive customer data (through cookies, etc.), the problem of privacy-preserving online learning has become critical. We study the problem in the framework of online convex programming (OCP), a popular online learning setting with several theoretical and practical implications, while using differential privacy as the formal measure of privacy. For this problem, we provide a generic framework that can be used to convert any given OCP algorithm into a private OCP algorithm with provable privacy as well as regret guarantees (utility), provided that the given OCP algorithm satisfies the following two criteria: 1) linearly decreasing sensitivity, i.e., the effect of new data points on the learned model decreases linearly, and 2) sublinear regret. We then illustrate our approach by converting two popular OCP algorithms into corresponding differentially private algorithms while guaranteeing Õ(√T) regret for strongly convex functions. Next, we consider the practically important class of online linear regression problems, for which we generalize the approach of Dwork et al. (2010a) to provide a differentially private algorithm with just polylog regret. Finally, we show that our online learning framework can be used to provide differentially private algorithms for the offline learning problem as well. For the offline learning problem, our approach guarantees better error bounds and is more practical than the existing state-of-the-art methods (Chaudhuri et al., 2011; Rubinstein et al., 2009).
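To illustrate the two criteria (decreasing sensitivity and sublinear regret), here is a toy one-dimensional sketch, not the paper's construction: online gradient descent with clipped gradients and a decaying step size, where each released iterate is perturbed by Laplace noise scaled to that step's sensitivity. The noise calibration is purely illustrative and makes no formal privacy claim across rounds.

```python
import random

def private_ogd(gradients, lr=0.1, epsilon=1.0, clip=1.0, seed=0):
    """Toy sketch: noisy online gradient descent on a scalar parameter.
    The decaying step size means a single data point's influence on the
    iterate shrinks over time (the 'decreasing sensitivity' criterion)."""
    rng = random.Random(seed)
    w = 0.0
    released = []
    for t, g in enumerate(gradients, start=1):
        g = max(-clip, min(clip, g))     # bound each point's influence
        step = lr / t                    # decaying step size
        w -= step * g
        sensitivity = 2 * clip * step    # max shift of w from one point
        scale = sensitivity / epsilon    # Laplace scale b
        # Laplace(0, b) sampled as the difference of two Exp(1/b) draws
        noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
        released.append(w + noise)
    return released
```

With a large privacy budget epsilon the released trajectory tracks the true iterates closely; shrinking epsilon trades accuracy for stronger perturbation.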

Private Convex Empirical Risk Minimization and High-dimensional Regression
We consider differentially private algorithms for convex empirical risk minimization (ERM). Differential privacy (Dwork et al., 2006b) is a recently introduced notion of privacy which guarantees that an algorithm's output does not depend on the data of any individual in the dataset. This is crucial in fields that handle sensitive data, such as genomics, collaborative filtering, and economics. Our motivation is the design of private algorithms for sparse learning problems, in which one aims to find solutions (e.g., regression parameters) with few non-zero coefficients. To this end: (a) We significantly extend the analysis of the "objective perturbation" algorithm of Chaudhuri et al. (2011) for convex ERM problems. We show that their method can be modified to use less noise (be more accurate), and to apply to problems with hard constraints and non-differentiable regularizers. We also give a tighter, data-dependent analysis of the additional error introduced by their method. A key tool in our analysis is a new non-trivial limit theorem for differential privacy which is of independent interest: if a sequence of differentially private algorithms converges, in a weak sense, then the limit algorithm is also differentially private. In particular, our methods give the best known algorithms for differentially private linear regression. These methods work in settings where the number of parameters p is less than the number of samples n. (b) We give the first two private algorithms for sparse regression problems in high-dimensional settings, where p is much larger than n. We analyze their performance for linear regression: under standard assumptions on the data, our algorithms have vanishing empirical risk for n = poly(s, log p) when there exists a good regression vector with s non-zero coefficients. Our algorithms demonstrate that randomized algorithms for sparse regression problems can be both stable and accurate, a combination which is impossible for deterministic algorithms.
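The objective-perturbation idea can be sketched in one dimension for ridge regression: a random linear term b*w is added to the objective before minimizing, so the released minimizer is itself randomized. The noise scale below is illustrative only and does not reproduce the calibration of Chaudhuri et al. (2011).

```python
import random

def objective_perturbation_ridge(xs, ys, lam=1.0, epsilon=1.0, seed=0):
    """Toy 1-D sketch of objective perturbation: minimize
    (1/n) * sum (y - w*x)^2 + lam * w^2 + (b/n) * w
    for a random linear perturbation b, using the closed-form minimizer."""
    rng = random.Random(seed)
    n = len(xs)
    # Laplace-style draw (difference of two exponentials); scale is illustrative
    b = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    sxx = sum(x * x for x in xs) / n
    sxy = sum(x * y for x, y in zip(xs, ys)) / n
    # Setting the derivative to zero: -2*sxy + 2*w*(sxx + lam) + b/n = 0
    return (sxy - b / (2 * n)) / (sxx + lam)
```

Note the randomness enters through the objective rather than by adding noise to the output, which is what allows the tighter, data-dependent error analysis the abstract describes.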

Distributed Learning, Communication Complexity and Privacy
We consider the problem of PAC-learning from distributed data and analyze fundamental communication complexity questions involved. We provide general upper and lower bounds on the amount of communication needed to learn well, showing that in addition to VC-dimension and covering number, quantities such as the teaching-dimension and mistake-bound of a class play an important role. We also present tight results for a number of common concept classes, including conjunctions, parity functions, and decision lists. For linear separators, we show that for non-concentrated distributions, we can use a version of the Perceptron algorithm to learn with much less communication than the number of updates given by the usual margin bound. We also show how boosting can be performed in a generic manner in the distributed setting to achieve communication with only logarithmic dependence on 1/ε for any concept class, and demonstrate how recent work on agnostic learning from class-conditional queries can be used to achieve low communication in agnostic settings as well. We additionally present an analysis of privacy, considering both differential privacy and a notion of distributional privacy that is especially appealing in this context.
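The Perceptron observation can be sketched as follows: players holding separate datasets share one linear separator and broadcast only mistake-driven updates, so communication scales with the number of updates rather than with the total data size. This is a bare illustration of the communication pattern, not the paper's analysis for non-concentrated distributions.

```python
def distributed_perceptron(datasets, rounds=10):
    """Sketch: each player runs the Perceptron mistake-update rule on its
    own data and broadcasts only the updates, so 'messages' counts the
    total communication (one broadcast per mistake)."""
    dim = len(datasets[0][0][0])
    w = [0.0] * dim
    messages = 0
    for _ in range(rounds):
        for data in datasets:            # players take turns
            for x, y in data:            # x: feature tuple, y: label in {-1, +1}
                if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                    w = [wi + y * xi for wi, xi in zip(w, x)]
                    messages += 1        # broadcast this single update
    return w, messages
```

On separable data the Perceptron's mistake bound caps the number of broadcasts, which is the quantity the margin-based communication bound in the abstract improves upon.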

A Characterization of Scoring Rules for Linear Properties
We consider the design of proper scoring rules, equivalently proper losses, when the goal is to elicit some function, known as a property, of the underlying distribution. We provide a full characterization of the class of proper scoring rules when the property is linear as a function of the input distribution. A key conclusion is that any such scoring rule can be written in the form of a Bregman divergence for some convex function. We also apply our results to the design of prediction market mechanisms, showing a strong equivalence between scoring rules for linear properties and automated prediction market makers.
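A concrete instance of the Bregman-divergence construction, using the convex function G(p) = sum_i p_i^2, which generates the classic Brier score; a quick numeric check confirms properness (truthful reporting maximizes expected score). This is a standard textbook example, not taken from the paper itself.

```python
def brier_score(prediction, outcome_index):
    """Proper scoring rule from the Bregman construction with
    G(p) = sum_i p_i^2:  S(p, i) = G(p) + dG(p) . (e_i - p)
                                 = 2 * p_i - sum_j p_j^2  (Brier score)."""
    return 2 * prediction[outcome_index] - sum(q * q for q in prediction)

def expected_score(prediction, true_dist):
    """Expected score of a reported prediction when outcomes follow true_dist."""
    return sum(true_dist[i] * brier_score(prediction, i)
               for i in range(len(true_dist)))
```

Swapping in a different convex G (e.g., the negative Shannon entropy, which yields the log score) produces another proper scoring rule by the same recipe.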

Divergences and Risks for Multiclass Experiments
Csiszár's f-divergence is a way to measure the similarity of two probability distributions. We study the extension of the f-divergence to more than two distributions to measure their joint similarity. By exploiting classical results from the comparison-of-experiments literature, we prove that the resulting divergence satisfies all the same properties as the traditional binary one. Considering the multi-distribution case actually makes the proofs simpler. The key to these results is a formal bridge between these multi-distribution f-divergences and Bayes risks for multiclass classification problems.
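For reference, the binary (two-distribution) f-divergence the abstract starts from can be computed directly; the generators below for KL and total variation are standard examples, and the multi-distribution extension studied in the paper is not reproduced here.

```python
import math

def f_divergence(p, q, f):
    """Csiszar f-divergence D_f(P || Q) = sum_i q_i * f(p_i / q_i),
    assuming full support for both distributions."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

kl = lambda t: t * math.log(t)   # f(t) = t log t  gives KL divergence
tv = lambda t: 0.5 * abs(t - 1)  # f(t) = |t - 1| / 2  gives total variation
```

Any convex f with f(1) = 0 generates a valid divergence this way, which is the family whose multi-distribution generalization the paper studies.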

A Conjugate Property between Loss Functions and Uncertainty Sets in Classification Problems
In binary classification problems, two main approaches have been proposed: one is the loss-function approach and the other is the minimum-distance approach. The loss-function approach is applied in major learning algorithms such as the support vector machine (SVM) and boosting methods. The loss function represents the penalty of the decision function on the training samples. In the learning algorithm, the empirical mean of the loss function is minimized to obtain the classifier. Against the backdrop of the development of mathematical programming, learning algorithms based on loss functions are nowadays widely applied to real-world data analysis. In addition, the statistical properties of such learning algorithms are well understood thanks to a large body of theoretical work. On the other hand, some learning methods, such as the ν-SVM and the minimax probability machine (MPM), can be formulated as minimum-distance problems. In the minimum-distance approach, first, a so-called uncertainty set is defined for each binary label based on the training samples. Then, the best separating hyperplane between the two uncertainty sets is employed as the decision function. This is regarded as an extension of the maximum-margin approach. The minimum-distance approach is considered useful for constructing statistical models with an intuitive geometric interpretation, and that interpretation is helpful in developing learning algorithms. However, the statistical properties of the minimum-distance approach have not been intensively studied. In this paper, we consider the relation between the above two approaches. We point out that the uncertainty set in the minimum-distance approach is described by the level set of the conjugate of the loss function. Based on this relation, we study statistical properties of the minimum-distance approach.
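The conjugate relation can be probed numerically: the sketch below approximates the convex conjugate f*(y) = sup_x (x*y - f(x)) on a finite grid and evaluates it for the hinge loss, whose conjugate equals y on [-1, 0]. The grid-based supremum is an illustrative shortcut, not an analytic derivation, and the example does not reproduce the paper's uncertainty-set construction.

```python
def conjugate(f, y, grid):
    """Numerically approximate the convex conjugate
    f*(y) = sup_x (x * y - f(x)) over a finite grid of x values."""
    return max(x * y - f(x) for x in grid)

# The hinge loss used by the SVM; its conjugate is y on [-1, 0]
# and +infinity elsewhere.
hinge = lambda x: max(0.0, 1.0 - x)
```

The level sets of conjugates like this one are what, per the abstract, describe the uncertainty sets of the minimum-distance formulation.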

New Bounds for Learning Intervals with Implications for Semi-Supervised Learning
We study learning of initial intervals in the prediction model. We show that for each distribution D over the domain, there is an algorithm A_D whose probability of a mistake in round m is at most (1/2 + o(1))/m. We also show that the best possible bound that can be achieved when the same algorithm A must be applied for all distributions D is at least (1/√e - o(1))(1/m) > (3/5 - o(1))(1/m). Informally, "knowing" the distribution D enables an algorithm to reduce its error rate by a constant factor strictly greater than 1. As advocated by Ben-David et al. (2008), knowledge of D can be viewed as an idealized proxy for a large number of unlabeled examples.

Tight Bounds on Proper Equivalence Query Learning of DNF
We prove a new structural lemma for partial Boolean functions f, which we call the seed lemma for DNF. Using the lemma, we give the first subexponential algorithm for proper learning of poly(n)-term DNF in Angluin's Equivalence Query (EQ) model. The algorithm has time and query complexity 2^{Õ(√n)}, which is optimal. We also give a new result on certificates for DNF-size, a simple algorithm for properly PAC-learning DNF, and new results on EQ-learning log n-term DNF and decision trees.
 All Talks
 Unsupervised SVMs: On the Complexity of the Furthest Hyperplane Problem
 (weak) Calibration is Computationally Hard
 Learning Valuation Functions
 Unified Algorithms for Online Learning and Competitive Analysis
 Online Optimization with Gradual Variations
 The Optimality of Jeffreys Prior for Online Density Estimation and the Asymptotic Normality of Maximum Likelihood Estimators
 PAC-Bayesian Bound for Gaussian Process Regression and Multiple Kernel Additive Model
 Random Design Analysis of Ridge Regression
 Reconstruction from Anisotropic Random Measurements
 Toward a Non-commutative Arithmetic-geometric Mean Inequality: Conjectures, Case-studies, and Consequences
 L1 Covering Numbers for Uniformly Bounded Convex Functions
 Generalization Bounds for Online Learning Algorithms with Pairwise Loss Functions
 Attribute-Efficient Learning and Weight-Degree Tradeoffs for Polynomial Threshold Functions
 Learning Functions of Halfspaces Using Prefix Covers
 Computational Bounds on Statistical Query Learning
 Learning DNF Expressions from Fourier Spectrum
 Consistency of Nearest Neighbor Classification under Selective Sampling
 Active Learning Using Smooth Relative Regret Approximations with Applications
 Robust Interactive Learning
 Rare Probability Estimation under Regularly Varying Heavy Tails
 Competitive Classification and Closeness Testing
 Kernels Based Tests with Non-asymptotic Bootstrap Approaches for Two-sample Problems
 Differentially Private Online Learning
 Private Convex Empirical Risk Minimization and High-dimensional Regression
 Distributed Learning, Communication Complexity and Privacy
 A Characterization of Scoring Rules for Linear Properties
 Divergences and Risks for Multiclass Experiments
 A Conjugate Property between Loss Functions and Uncertainty Sets in Classification Problems
 New Bounds for Learning Intervals with Implications for Semi-Supervised Learning
 Tight Bounds on Proper Equivalence Query Learning of DNF
 Distance Preserving Embeddings for General n-Dimensional Manifolds
 A Method of Moments for Mixture Models and Hidden Markov Models
 A Correlation Clustering Approach to Link Classification in Signed Networks
 Spectral Clustering of Graphs with General Degrees in the Extended Planted Partition Model
 Toward Understanding Complex Spaces: Graph Laplacians on Manifolds with Singularities and Boundaries
 Exact Recovery of SparselyUsed Dictionaries
 NearOptimal Algorithms for Online Matrix Prediction
 Analysis of Thompson Sampling for the Multi-armed Bandit Problem
 Autonomous Exploration For Navigating In MDPs
 Towards Minimax Policies for Online Linear Optimization with Bandit Feedback
 The Best of Both Worlds: Stochastic and Adversarial Bandits
 Open Problem: Regret Bounds for Thompson Sampling
 Open Problem: Better Bounds for Online Logistic Regression
 Open Problem: Learning Dynamic Network Models from a Static Snapshot
 Open Problem: Does AdaBoost Always Cycle?
 Open Problem: Is Averaging Needed for Strongly Convex Stochastic Gradient Descent?