TechTalks from event: Conference on Learning Theory

  • Competitive Classification and Closeness Testing Authors: Jayadev Acharya, Hirakendu Das, Ashkan Jafarpour, Alon Orlitsky, Shengjun Pan and Ananda Suresh
    We study the problems of classification and closeness testing. A classifier associates a test sequence with the one of two training sequences that was generated by the same distribution. A closeness test determines whether two sequences were generated by the same or by different distributions. For both problems all natural algorithms are symmetric -- they make the same decision under all symbol relabelings. With no assumptions on the distributions' support size or relative distance, we construct a classifier and closeness test that require at most (n3/2) samples to attain the n-sample accuracy of the best symmetric classifier or closeness test designed with knowledge of the underlying distributions. Both algorithms run in time linear in the number of samples. Conversely we also show that for any classifier or closeness test, there are distributions that require ?(n7/6) samples to achieve the n-sample accuracy of the best symmetric algorithm that knows the underlying distributions.
  • Kernels Based Tests with Non-asymptotic Bootstrap Approaches for Two-sample Problems Authors: Magalie Fromont, Batrice Laurent, Matthieu Lerasle and Patricia Reynaud-Bouret
    Considering either two independent i.i.d. samples, or two independent samples generated from a heteroscedastic regression model, or two independent Poisson processes, we address the question of testing equality of their respective distributions. We first propose single testing procedures based on a general symmetric kernel. The corresponding critical values are chosen from a wild or permutation bootstrap approach, and the obtained tests are exactly (and not just asymptotically) of level. We then introduce an aggregation method, which enables to overcome the difficulty of choosing a kernel and/or the parameters of the kernel. We derive non-asymptotic properties for the aggregated tests, proving that they may be optimal in a classical statistical sense.
  • Differentially Private Online Learning Authors: Prateek Jain, Pravesh Kothari and Abhradeep Thakurta
    In this paper, we consider the problem of preserving privacy in the context of online learning. Online learning involves learning from data in real-time, due to which the learned model as well as its predictions are continuously changing. This makes preserving privacy of each data point significantly more challenging as its effect on the learned model can be easily tracked by observing changes in the subsequent predictions. Furthermore, with more and more online systems (e.g. search engines like Bing, Google etc.) trying to learn their customers' behavior by leveraging their access to sensitive customer data (through cookies etc.), the problem of privacy preserving online learning has become critical. We study the problem in the framework of online convex programming (OCP) -- a popular online learning setting with several theoretical and practical implications -- while using differential privacy as the formal measure of privacy. For this problem, we provide a generic framework that can be used to convert any given OCP algorithm into a private OCP algorithm with provable privacy as well as regret guarantees (utility), provided that the given OCP algorithm satisfies the following two criteria: 1) linearly decreasing sensitivity, i.e., the effect of the new data points on the learned model decreases linearly, 2) sub-linear regret. We then illustrate our approach by converting two popular OCP algorithms into corresponding differentially private algorithms while guaranteeing (?T) regret for strongly convex functions. Next, we consider the practically important class of online linear regression problems, for which we generalize the approach by Dwork et al. (2010a) to provide a differentially private algorithm with just poly-log regret. Finally, we show that our online learning framework can be used to provide differentially private algorithms for the offline learning problem as well. For the offline learning problem, our approach guarantees better error bounds and is more practical than the existing state-of-the-art methods (Chaudhuri et al., 2011; Rubinstein et al., 2009).
  • Private Convex Empirical Risk Minimization and High-dimensional Regression Authors: Daniel Kifer, Adam Smith and Abhradeep Thakurta
    We consider differentially private algorithms for convex empirical risk minimization (ERM). Differential privacy (Dwork et al., 2006b) is a recently introduced notion of privacy which guarantees that an algorithm's output does not depend on the data of any individual in the dataset. This is crucial in fields that handle sensitive data, such as genomics, collaborative filtering, and economics. Our motivation is the design of private algorithms for sparse learning problems, in which one aims to find solutions (e.g., regression parameters) with few non-zero coefficients. To this end: (a)We significantly extend the analysis of the "objective perturbation" algorithm of Chaudhuri et al. (2011) for convex ERM problems. We show that their method can be modified to use less noise (be more accurate), and to apply to problems with hard constraints and non-differentiable regularizers. We also give a tighter, data-dependent analysis of the additional error introduced by their method. A key tool in our analysis is a new nontrivial limit theorem for differential privacy which is of independent interest: if a sequence of differentially private algorithms converges, in a weak sense, then the limit algorithm is also differentially private. In particular, our methods give the best known algorithms for differentially private linear regression. These methods work in settings where the number of parameters p is less than the number of samples n. (b)We give the first two private algorithms for sparse regression problems in high-dimensional settings, where p is much larger than n. We analyze their performance for linear regression: under standard assumptions on the data, our algorithms have vanishing empirical risk for n = poly(s, log p) when there exists a good regression vector with s nonzero coefficients. Our algorithms demonstrate that randomized algorithms for sparse regression problems can be both stable and accurate - a combination which is impossible for deterministic algorithms.
  • Distributed Learning, Communication Complexity and Privacy Authors: Maria Florina Balcan, Avrim Blum, Shai Fine and Yishay Mansour
    We consider the problem of PAC-learning from distributed data and analyze fundamental communication complexity questions involved. We provide general upper and lower bounds on the amount of communication needed to learn well, showing that in addition to VC-dimension and covering number, quantities such as the teaching-dimension and mistake-bound of a class play an important role. We also present tight results for a number of common concept classes including conjunctions, parity functions, and decision lists. For linear separators, we show that for non-concentrated distributions, we can use a version of the Perceptron algorithm to learn with much less communication than the number of updates given by the usual margin bound. We also show how boosting can be performed in a generic manner in the distributed setting to achieve communication with only logarithmic dependence on 1/? for any concept class, and demonstrate how recent work on agnostic learning from class-conditional queries can be used to achieve low communication in agnostic settings as well. We additionally present an analysis of privacy, considering both differential privacy and a notion of distributional privacy that is especially appealing in this context.
  • A Characterization of Scoring Rules for Linear Properties Authors: Jacob D. Abernethy and Rafael M. Frongillo
    We consider the design of proper scoring rules, equivalently proper losses, when the goal is to elicit some function, known as a property, of the underlying distribution. We provide a full characterization of the class of proper scoring rules when the property is linear as a function of the input distribution. A key conclusion is that any such scoring rule can be written in the form of a Bregman divergence for some convex function. We also apply our results to the design of prediction market mechanisms, showing a strong equivalence between scoring rules for linear properties and automated prediction market makers.
  • Divergences and Risks for Multiclass Experiments Authors: Dario Garca Garca and Robert C. Williamson
    Csiszr's f-divergence is a way to measure the similarity of two probability distributions. We study the extension of f-divergence to more than two distributions to measure their joint similarity. By exploiting classical results from the comparison of experiments literature we prove the resulting divergence satisfies all the same properties as the traditional binary one. Considering the multidistribution case actually makes the proofs simpler. The key to these results is a formal bridge between these multidistribution f-divergences and Bayes risks for multiclass classification problems.
  • A Conjugate Property between Loss Functions and Uncertainty Sets in Classification Problems Authors: Takafumi Kanamori, Akiko Takeda and Taiji Suzuki
    In binary classification problems, mainly two approaches have been proposed; one is loss function approach and the other is minimum distance approach. The loss function approach is applied to major learning algorithms such as support vector machine (SVM) and boosting methods. The loss function represents the penalty of the decision function on the training samples. In the learning algorithm, the empirical mean of the loss function is minimized to obtain the classifier. Against a backdrop of the development of mathematical programming, nowadays learning algorithms based on loss functions are widely applied to real-world data analysis. In addition, statistical properties of such learning algorithms are well-understood based on a lots of theoretical works. On the other hand, some learning methods such as ?-SVM, mini-max probability machine (MPM) can be formulated as minimum distance problems. In the minimum distance approach, firstly, the so-called uncertainty set is defined for each binary label based on the training samples. Then, the best separating hyperplane between the two uncertainty sets is employed as the decision function. This is regarded as an extension of the maximum-margin approach. The minimum distance approach is considered to be useful to construct the statistical models with an intuitive geometric interpretation, and the interpretation is helpful to develop the learning algorithms. However, the statistical properties of the minimum distance approach have not been intensively studied. In this paper, we consider the relation between the above two approaches. We point out that the uncertainty set in the minimum distance approach is described by using the level set of the conjugate of the loss function. Based on such relation, we study statistical properties of the minimum distance approach.
  • New Bounds for Learning Intervals with Implications for Semi-Supervised Learning Authors: David P. Helmbold and Philip M. Long
    We study learning of initial intervals in the prediction model. We show that for each distribution D over the domain, there is an algorithm AD, whose probability of a mistake in round m is at most ( + o(1))/m. We also show that the best possible bound that can be achieved in the case in which the same algorithm A must be applied for all distributions D is at least (1??e - o(1))1?m > (3?5-o(1))1?m. Informally, "knowing" the distribution D enables an algorithm to reduce its error rate by a constant factor strictly greater than 1. As advocated by Ben-David et al. (2008), knowledge of D can be viewed as an idealized proxy for a large number of unlabeled examples.
  • Tight Bounds on Proper Equivalence Query Learning of DNF Authors: Lisa Hellerstein, Devorah Kletenik, Linda Sellie and Rocco Servedio
    We prove a new structural lemma for partial Boolean functions f, which we call the seed lemma for DNF. Using the lemma, we give the first subexponential algorithm for proper learning of poly(n)-term DNF in Angluin's Equivalence Query (EQ) model. The algorithm has time and query complexity 2(?n), which is optimal. We also give a new result on certificates for DNF-size, a simple algorithm for properly PAC-learning DNF, and new results on EQ-learning log n-term DNF and decision trees.