TechTalks from event: Conference on Learning Theory

  • L1 Covering Numbers for Uniformly Bounded Convex Functions Authors: Adityanand Guntuboyina and Bodhisattva Sen
    In this paper we study the covering numbers of the space of convex and uniformly bounded functions in multi-dimension. We find optimal upper and lower bounds for the ε-covering number M(C([a, b]^d, B), ε, L1) in terms of the relevant constants, where d > 1, a < b ∈ R, B > 0, and C([a, b]^d, B) denotes the set of all convex functions on [a, b]^d that are uniformly bounded by B. We summarize previously known results on covering numbers for convex functions and also provide alternate proofs of some known results. Our results have direct implications for the study of rates of convergence of empirical minimization procedures as well as for optimal convergence rates in numerous convexity-constrained function estimation problems. (The covering-number definition used here is recalled in a short note after this list.)
  • Generalization Bounds for Online Learning Algorithms with Pairwise Loss Functions Authors: Yuyang Wang, Roni Khardon, Dmitry Pechyony and Rosie Jones
    Efficient online learning with pairwise loss functions is a crucial component in building large-scale learning systems that maximize the area under the Receiver Operating Characteristic (ROC) curve. In this paper we investigate the generalization performance of online learning algorithms with pairwise loss functions. We show that the existing proof techniques for generalization bounds of online algorithms with a pointwise loss cannot be directly applied to pairwise losses. Using the Hoeffding-Azuma inequality and various proof techniques for risk bounds in batch learning, we derive data-dependent bounds for the average risk of the sequence of hypotheses generated by an arbitrary online learner in terms of an easily computable statistic, and show how to extract a low-risk hypothesis from the sequence. In addition, we analyze a natural extension of the perceptron algorithm for the bipartite ranking problem, providing a bound on its empirical pairwise loss. Combining these results, we obtain a complete risk analysis of the proposed algorithm. (A toy sketch of a pairwise perceptron update appears after this list.)
  • Attribute-Efficient Learning and Weight-Degree Tradeoffs for Polynomial Threshold Functions Authors: Rocco Servedio, Li-Yang Tan and Justin Thaler
    We study the challenging problem of learning decision lists attribute-efficiently, giving both positive and negative results. Our main positive result is a new tradeoff between the running time and mistake bound for learning length-k decision lists over n Boolean variables. When the allowed running time is relatively high, our new mistake bound improves significantly on the mistake bound of the best previous algorithm of Klivans and Servedio (2006). Our main negative result is a new lower bound on the weight of any degree-d polynomial threshold function (PTF) that computes a particular decision list over k variables (the "ODD-MAX-BIT" function; a toy implementation of this function appears after this list). The main result of Beigel (1994) is a weight lower bound of 2^Ω(k/d^2), which was shown to be essentially optimal for d ≤ k^(1/3) by Klivans and Servedio. Here we prove a 2^Ω(√(k/d)) lower bound, which improves on Beigel's lower bound for d > k^(1/3). This lower bound establishes strong limitations on the effectiveness of the Klivans and Servedio approach and suggests that it may be difficult to improve on our positive result. The main tool used in our lower bound is a new variant of Markov's classical inequality, which may be of independent interest; it provides a bound on the derivative of a univariate polynomial in terms of both its degree and the size of its coefficients.
  • Learning Functions of Halfspaces Using Prefix Covers Authors: Parikshit Gopalan, Adam R. Klivans and Raghu Meka
    We present a simple query algorithm for learning arbitrary functions of k halfspaces under any product distribution on the Boolean hypercube. Our algorithms learn any function of k halfspaces to within accuracy ε in time O((nk/ε)^(k+1)) under any product distribution on {0, 1}^n, using read-once branching programs as a hypothesis. This gives the first poly(n, 1/ε) algorithm for learning even the intersection of 2 halfspaces under the uniform distribution on {0, 1}^n; previously known algorithms had an exponential dependence either on the accuracy parameter ε or on the dimension n. To prove this result, we identify a new structural property of Boolean functions that yields learnability with queries: that of having a small prefix cover.
  • Computational Bounds on Statistical Query Learning Authors: Vitaly Feldman and Varun Kanade
    We study the complexity of learning in Kearns' well-known statistical query (SQ) learning model (Kearns, 1993). A number of previous works have addressed the definition and estimation of information-theoretic bounds on SQ learning complexity, in other words, bounds on the query complexity. Here we give the first strictly computational upper and lower bounds on the complexity of several types of learning in the SQ model. As was already observed, the known characterization of distribution-specific SQ learning (Blum et al., 1994) implies that for weak learning over a fixed distribution, the query complexity and computational complexity are essentially the same. In contrast, we show that for both distribution-specific and distribution-independent (strong) learning there exists a concept class of polynomial query complexity that is not efficiently learnable unless RP = NP. We then prove that our distribution-specific lower bound is essentially tight by showing that for every concept class C of polynomial query complexity there exists a polynomial-time algorithm that, given access to random points from any distribution D and an NP oracle, can SQ learn C over D. We also consider a restriction of the SQ model, the correlational statistical query (CSQ) model of learning (Bshouty and Feldman, 2001; Feldman, 2008), which is closely related to Valiant's model of evolvability (Valiant, 2007). We show a similar separation result for distribution-independent CSQ learning under a stronger assumption: there exists a concept class of polynomial CSQ complexity which is not efficiently learnable unless every problem in W[P] has a randomized fixed-parameter tractable algorithm. (A toy simulation of the SQ oracle from random examples appears after this list.)
  • Learning DNF Expressions from Fourier Spectrum Authors: Vitaly Feldman
    Since its introduction by Valiant in 1984, PAC learning of DNF expressions remains one of the central problems in learning theory. We consider this problem in the setting where the underlying distribution is uniform, or more generally, a product distribution. Kalai, Samorodnitsky, and Teng (2009b) showed that in this setting a DNF expression can be efficiently approximated from its "heavy" low-degree Fourier coefficients alone. This is in contrast to previous approaches where boosting was used and thus Fourier coefficients of the target function modified by various distributions were needed. This property is crucial for learning of DNF expressions over smoothed product distributions, a learning model introduced by Kalai et al. (2009b) and inspired by the seminal smoothed analysis model of Spielman and Teng (2004). We introduce a new approach to learning (or approximating) polynomial threshold functions which is based on creating a function with range [-1, 1] that approximately agrees with the unknown function on low-degree Fourier coefficients. We then describe conditions under which this is sufficient for learning polynomial threshold functions. Our approach yields a new, simple algorithm for approximating any polynomial-size DNF expression from its "heavy" low-degree Fourier coefficients alone. This algorithm greatly simplifies the proof of learnability of DNF expressions over smoothed product distributions and is simpler than all previous algorithms for PAC learning of DNF expressions using membership queries. We also describe an application of our algorithm to learning monotone DNF expressions over product distributions. Building on the work of Servedio (2004), we give an algorithm that runs in time poly((s · log(s/ε))^(log(s/ε)), n), where s is the size of the DNF expression and ε is the accuracy. This improves on the poly((s · log(ns/ε))^(log(s/ε) · log(1/ε)), n) bound of Servedio (2004). (A toy estimator of low-degree Fourier coefficients under the uniform distribution appears after this list.)
  • Consistency of Nearest Neighbor Classification under Selective Sampling Authors: Sanjoy Dasgupta
    This paper studies nearest neighbor classification in a model where unlabeled data points arrive in a stream, and the learner decides, for each one, whether to ask for its label. Are there generic ways to augment or modify any selective sampling strategy so as to ensure the consistency of the resulting nearest neighbor classifier?
  • Active Learning Using Smooth Relative Regret Approximations with Applications Authors: Nir Ailon, Ron Begleiter and Esther Ezra
    The disagreement coefficient of Hanneke has become a central concept in proving active learning rates. It has been shown in various ways that a concept class with low complexity, together with a bound on the disagreement coefficient at an optimal solution, allows active learning rates that are superior to passive learning ones. We present a different tool for pool-based active learning which follows from the existence of a certain uniform version of a low disagreement coefficient, but is not equivalent to it. In fact, we present two fundamental active learning problems of significant interest for which our approach allows nontrivial active learning bounds. However, any general-purpose method relying only on disagreement coefficient bounds fails to guarantee any useful bounds for these problems. The tool we use is based on the learner's ability to compute an estimator of the difference between the loss of any hypothesis and some fixed "pivotal" hypothesis, to within an absolute error of at most ε times the ℓ1 distance (the disagreement measure) between the two hypotheses. We prove that such an estimator implies the existence of a learning algorithm which, at each iteration, reduces its excess risk to within a constant factor. Each iteration replaces the current pivotal hypothesis with the minimizer of the estimated loss difference function with respect to the previous pivotal hypothesis. The label complexity essentially becomes that of computing this estimator. (A schematic rendering of this outer iteration appears after this list.) The two applications of interest are learning to rank from pairwise preferences, and clustering with side information (a.k.a. semi-supervised clustering). They are both fundamental, and have started receiving more attention from active learning theoreticians and practitioners. Keywords: active learning, learning to rank from pairwise preferences, semi-supervised clustering, clustering with side information, disagreement coefficient, smooth relative regret approximation.
  • Robust Interactive Learning Authors: Maria Florina Balcan and Steve Hanneke
    In this paper we propose and study a generalization of the standard active-learning model in which more general types of queries, including class conditional queries and mistake queries, are allowed. Such queries have been quite useful in applications, but have lacked theoretical understanding. In this work, we characterize the power of such queries under several well-known noise models. We give nearly tight upper and lower bounds on the number of queries needed to learn, both in the general agnostic setting and in the bounded noise model. We further show that our methods can be made adaptive to the (unknown) noise rate, with only negligible loss in query complexity.
  • Rare Probability Estimation under Regularly Varying Heavy Tails Authors: Mesrob I. Ohannessian and Munther A. Dahleh
    This paper studies the problem of estimating the probability of symbols that have occurred very rarely, in samples drawn independently from an unknown, possibly infinite, discrete distribution. In particular, we study the multiplicative consistency of estimators, defined as the ratio of the estimate to the true quantity converging to one. We first show that the classical Good-Turing estimator is not universally consistent in this sense, despite enjoying favorable additive properties. We then use Karamata's theory of regular variation to prove that regularly varying heavy tails are sufficient for consistency. At the core of this result is a multiplicative concentration result that we establish both by extending the McAllester-Ortiz additive concentration for the missing mass to all rare probabilities and by exploiting regular variation. We also derive a family of estimators which, in addition to being consistent, address some of the shortcomings of the Good-Turing estimator. For example, they perform smoothing implicitly and have the absolute discounting structure of many heuristic algorithms. This also establishes a discrete parallel to extreme value theory, and many of the techniques therein can be adapted to the framework that we set forth. (A small numerical illustration of the classical Good-Turing estimator appears after this list.)
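
For the covering-number result of Guntuboyina and Sen above, the following LaTeX display recalls the standard definition of the quantity M(C([a, b]^d, B), ε, L1) used in that abstract. It is the usual textbook definition, with the L1 distance taken with respect to Lebesgue measure on [a, b]^d (an assumption about the intended metric), not a statement from the paper.

    % Standard definition of the epsilon-covering number of a function class
    % under the L1 metric (Lebesgue measure on [a,b]^d is assumed here).
    \[
      M\bigl(\mathcal{C}([a,b]^d, B), \varepsilon, L_1\bigr)
      \;=\; \min\Bigl\{\, N :\ \exists\, f_1,\dots,f_N \ \text{such that}\
          \sup_{f \in \mathcal{C}([a,b]^d, B)} \ \min_{1 \le i \le N}
          \int_{[a,b]^d} \lvert f - f_i \rvert \, dx \;\le\; \varepsilon \Bigr\}
    \]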
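
For the Wang, Khardon, Pechyony, and Jones abstract, the sketch below shows a generic perceptron-style update for bipartite ranking with a pairwise 0-1 loss, just to make the object of study concrete. The buffering of past examples, the buffer size, and the exact update rule are illustrative assumptions, not the algorithm analyzed in the paper.

    import numpy as np

    def pairwise_perceptron(stream, dim, buffer_size=100):
        """Illustrative online learner for bipartite ranking with a pairwise loss.

        `stream` yields (x, y) with x a numpy array of length `dim` and y in {-1, +1}.
        Each new example is compared against buffered examples of the opposite
        class, and a perceptron-style update is made on every misranked pair.
        This is a generic sketch, not the algorithm analyzed in the paper.
        """
        w = np.zeros(dim)
        pos_buf, neg_buf = [], []          # small reservoirs of past examples
        mistakes = 0
        for x, y in stream:
            opposite = neg_buf if y > 0 else pos_buf
            for x_old in opposite:
                diff = (x - x_old) if y > 0 else (x_old - x)   # positive minus negative
                if np.dot(w, diff) <= 0:                       # misranked pair
                    w += diff
                    mistakes += 1
            buf = pos_buf if y > 0 else neg_buf
            if len(buf) < buffer_size:
                buf.append(x)
        return w, mistakes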
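
For the Servedio, Tan, and Thaler abstract, here is a toy implementation of the ODD-MAX-BIT decision list: it outputs +1 exactly when the highest-indexed coordinate set to 1 is at an odd position. This follows the common convention going back to Beigel (1994); the 1-based indexing and the treatment of the all-zeros input are assumptions.

    def odd_max_bit(x):
        """ODD-MAX-BIT on a 0/1 tuple x (1-indexed positions, an assumed convention):
        returns +1 if the highest position holding a 1 is odd, else -1.
        The all-zeros input is mapped to -1 here; conventions differ."""
        for i in range(len(x), 0, -1):          # scan from the highest index down
            if x[i - 1] == 1:
                return 1 if i % 2 == 1 else -1
        return -1

    # Example: the highest set bit of (1, 0, 1, 0) is position 3 (odd), so output +1.
    assert odd_max_bit((1, 0, 1, 0)) == 1
    assert odd_max_bit((0, 1, 0, 1)) == -1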
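
The statistical query model discussed by Feldman and Kanade replaces access to labelled examples with an oracle that, given a query function phi and a tolerance tau, returns the expectation of phi(x, c(x)) over the data distribution up to additive error tau. A minimal way to simulate such an oracle from random samples (a standard observation, not specific to the paper) is to average phi over enough samples; the sample size below comes from a Hoeffding bound, and the failure probability delta is an illustrative choice.

    import math
    import random

    def simulate_stat_oracle(sample_examples, phi, tau, delta=0.01):
        """Simulate a statistical query oracle from random labelled examples.

        `sample_examples(m)` should return m pairs (x, label) drawn i.i.d. from the
        distribution and labelled by the target; `phi(x, label)` must lie in [-1, 1].
        By Hoeffding's inequality, averaging over m = ceil(2 * ln(2/delta) / tau^2)
        samples gives an answer within tau of E[phi] with probability >= 1 - delta.
        """
        m = math.ceil(2 * math.log(2 / delta) / (tau ** 2))
        examples = sample_examples(m)
        return sum(phi(x, y) for x, y in examples) / m

    # Toy usage: uniform distribution on {0,1}^3, target c(x) = x[0],
    # query phi = correlation of the label with the third coordinate.
    def sampler(m):
        xs = [tuple(random.randint(0, 1) for _ in range(3)) for _ in range(m)]
        return [(x, x[0]) for x in xs]

    estimate = simulate_stat_oracle(sampler, lambda x, y: (2 * y - 1) * (2 * x[2] - 1), tau=0.1)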
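
The Feldman DNF abstract works with the low-degree Fourier spectrum of a Boolean function under the uniform distribution: for a set S of coordinates, the coefficient is the expectation of f(x) * (-1)^(sum of x_i over i in S), with f taking values in {-1, +1}. The sketch below estimates all coefficients up to a given degree by sampling and keeps the "heavy" ones above a threshold. Sampling-based estimation and thresholding are standard tools here, not the paper's algorithm; the sample size and threshold are illustrative choices.

    import itertools
    import random

    def estimate_low_degree_fourier(f, n, degree, num_samples=20000, threshold=0.05):
        """Estimate Fourier coefficients of f: {0,1}^n -> {-1,+1} under the uniform
        distribution for all sets S with |S| <= degree, keeping the 'heavy' ones.

        Standard sampling estimator: f_hat(S) = E_x[ f(x) * (-1)^{sum_{i in S} x_i} ].
        """
        xs = [tuple(random.randint(0, 1) for _ in range(n)) for _ in range(num_samples)]
        fs = [f(x) for x in xs]
        heavy = {}
        for d in range(degree + 1):
            for S in itertools.combinations(range(n), d):
                chi = [(-1) ** sum(x[i] for i in S) for x in xs]
                coeff = sum(fx * c for fx, c in zip(fs, chi)) / num_samples
                if abs(coeff) >= threshold:
                    heavy[S] = coeff
        return heavy

    # Toy usage: a 2-term DNF on 4 variables, mapped to the {-1,+1} range.
    dnf = lambda x: 1 if (x[0] and x[1]) or (x[2] and x[3]) else -1
    print(estimate_low_degree_fourier(dnf, n=4, degree=2))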
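
The Ailon, Begleiter, and Ezra abstract describes an iterative scheme: given an estimator of the loss difference between any hypothesis and the current "pivotal" hypothesis, each round replaces the pivot by the minimizer of the estimated difference. The loop below is only a schematic rendering of that outer iteration over a finite hypothesis class; the estimator is passed in as a black box, and the class representation and stopping rule are assumed simplifications, not the paper's full algorithm.

    def iterative_pivot_selection(hypotheses, estimate_loss_diff, num_rounds=10):
        """Schematic outer loop for pool-based active learning with a relative
        regret estimator (a simplified rendering, not the paper's algorithm).

        `hypotheses` is a finite list of candidate hypotheses.
        `estimate_loss_diff(h, pivot)` should return an estimate of
        loss(h) - loss(pivot), e.g. built from actively queried labels.
        Each round the pivot is replaced by the minimizer of the estimated
        loss difference relative to the previous pivot.
        """
        pivot = hypotheses[0]                      # arbitrary initial pivot
        for _ in range(num_rounds):
            best = min(hypotheses, key=lambda h: estimate_loss_diff(h, pivot))
            if best is pivot:                      # no estimated improvement; stop early
                break
            pivot = best
        return pivot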
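
Finally, the Ohannessian and Dahleh abstract concerns the multiplicative consistency of rare-probability estimators such as Good-Turing. As a point of reference, the classical Good-Turing estimate assigns to the set of symbols seen exactly r times a total probability of (r + 1) * n_{r+1} / N, where n_r is the number of distinct symbols observed r times and N is the sample size; the missing mass (r = 0) estimate is n_1 / N. The snippet below computes these standard quantities from a sample; it does not implement the paper's modified estimators.

    from collections import Counter

    def good_turing_rare_mass(sample, max_r=3):
        """Classical Good-Turing estimates of the total probability mass of symbols
        seen exactly r times, for r = 0..max_r (r = 0 is the missing mass).

        Uses the textbook formula M_r ~ (r + 1) * n_{r+1} / N, where n_r counts the
        distinct symbols with frequency r and N is the sample size. Smoothing of the
        frequency-of-frequencies (needed in practice for large r) is omitted here.
        """
        N = len(sample)
        freqs = Counter(sample)                      # symbol -> count
        freq_of_freq = Counter(freqs.values())       # r -> number of symbols seen r times
        return {r: (r + 1) * freq_of_freq.get(r + 1, 0) / N for r in range(max_r + 1)}

    # Toy usage: estimated missing mass and mass of the singleton symbols.
    print(good_turing_rare_mass(list("abracadabra")))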