Conference on Learning Theory
TechTalks from event: Conference on Learning Theory

L1 Covering Numbers for Uniformly Bounded Convex Functions
In this paper we study the covering numbers of the space of convex and uniformly bounded functions in multiple dimensions. We find optimal upper and lower bounds for the ε-covering number M(C([a, b]^d, B), ε, L1) in terms of the relevant constants, where d > 1, a < b ∈ R, B > 0, and C([a, b]^d, B) denotes the set of all convex functions on [a, b]^d that are uniformly bounded by B. We summarize previously known results on covering numbers for convex functions and also provide alternate proofs of some known results. Our results have direct implications for the study of rates of convergence of empirical minimization procedures, as well as for optimal convergence rates in numerous convexity-constrained function estimation problems.
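For context, the ε-covering number in L1 is the standard metric-entropy quantity; a minimal statement of the definition in the abstract's notation, with the L1 norm taken over [a, b]^d:

```latex
M\bigl(\mathcal{C}([a,b]^d, B), \epsilon, L_1\bigr)
  = \min\Bigl\{\, N : \exists\, f_1, \dots, f_N \ \text{such that}\
      \sup_{f \in \mathcal{C}([a,b]^d, B)} \ \min_{1 \le i \le N}
      \int_{[a,b]^d} \lvert f(x) - f_i(x) \rvert \, dx \le \epsilon \,\Bigr\}.
```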

Generalization Bounds for Online Learning Algorithms with Pairwise Loss Functions
Efficient online learning with pairwise loss functions is a crucial component in building large-scale learning systems that maximize the area under the Receiver Operating Characteristic (ROC) curve. In this paper we investigate the generalization performance of online learning algorithms with pairwise loss functions. We show that the existing proof techniques for generalization bounds of online algorithms with a pointwise loss cannot be directly applied to pairwise losses. Using the Hoeffding-Azuma inequality and various proof techniques for risk bounds in batch learning, we derive data-dependent bounds for the average risk of the sequence of hypotheses generated by an arbitrary online learner in terms of an easily computable statistic, and show how to extract a low-risk hypothesis from the sequence. In addition, we analyze a natural extension of the perceptron algorithm for the bipartite ranking problem, providing a bound on the empirical pairwise loss. Combining these results, we obtain a complete risk analysis of the proposed algorithm.
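For reference, the Hoeffding-Azuma inequality used in the analysis is the standard martingale concentration bound: if X_1, ..., X_n is a martingale difference sequence with |X_i| ≤ c_i almost surely, then

```latex
\Pr\Bigl[\, \Bigl| \sum_{i=1}^{n} X_i \Bigr| \ge t \,\Bigr]
  \;\le\; 2 \exp\!\Bigl( -\frac{t^2}{2 \sum_{i=1}^{n} c_i^2} \Bigr).
```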

Attribute-Efficient Learning and Weight-Degree Tradeoffs for Polynomial Threshold Functions
We study the challenging problem of learning decision lists attribute-efficiently, giving both positive and negative results. Our main positive result is a new tradeoff between the running time and mistake bound for learning length-k decision lists over n Boolean variables. When the allowed running time is relatively high, our new mistake bound improves significantly on the mistake bound of the best previous algorithm of Klivans and Servedio (2006). Our main negative result is a new lower bound on the weight of any degree-d polynomial threshold function (PTF) that computes a particular decision list over k variables (the "ODD-MAX-BIT" function). The main result of Beigel (1994) is a weight lower bound of 2^{Ω(k/d^2)}, which was shown to be essentially optimal for d ≤ k^{1/3} by Klivans and Servedio. Here we prove a 2^{Ω(√k/d)} lower bound, which improves on Beigel's lower bound for d > k^{1/3}. This lower bound establishes strong limitations on the effectiveness of the Klivans and Servedio approach, and suggests that it may be difficult to improve on our positive result. The main tool used in our lower bound is a new variant of Markov's classical inequality, which may be of independent interest; it provides a bound on the derivative of a univariate polynomial in terms of both its degree and the size of its coefficients.
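For reference, the classical Markov inequality that the paper's new variant refines bounds the derivative of a degree-d univariate polynomial p on [-1, 1] in terms of its degree alone:

```latex
\max_{x \in [-1,1]} \lvert p'(x) \rvert
  \;\le\; d^2 \max_{x \in [-1,1]} \lvert p(x) \rvert .
```

The variant described in the abstract additionally takes the size of the coefficients of p into account.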

Learning Functions of Halfspaces Using Prefix Covers
We present a simple query algorithm for learning arbitrary functions of k halfspaces under any product distribution on the Boolean hypercube. Our algorithm learns any function of k halfspaces to within accuracy ε in time O((nk/ε)^{k+1}) under any product distribution on {0, 1}^n, using read-once branching programs as a hypothesis. This gives the first poly(n, 1/ε) algorithm for learning even the intersection of 2 halfspaces under the uniform distribution on {0, 1}^n; previously known algorithms had an exponential dependence either on the accuracy parameter ε or on the dimension n. To prove this result, we identify a new structural property of Boolean functions that yields learnability with queries: that of having a small prefix cover.

Computational Bounds on Statistical Query Learning
We study the complexity of learning in Kearns' well-known statistical query (SQ) learning model (Kearns, 1993). A number of previous works have addressed the definition and estimation of information-theoretic bounds on SQ learning complexity, in other words, bounds on the query complexity. Here we give the first strictly computational upper and lower bounds on the complexity of several types of learning in the SQ model. As was already observed, the known characterization of distribution-specific SQ learning (Blum et al., 1994) implies that for weak learning over a fixed distribution, the query complexity and computational complexity are essentially the same. In contrast, we show that for both distribution-specific and distribution-independent (strong) learning there exists a concept class of polynomial query complexity that is not efficiently learnable unless RP = NP. We then prove that our distribution-specific lower bound is essentially tight, by showing that for every concept class C of polynomial query complexity there exists a polynomial-time algorithm that, given access to random points from any distribution D and an NP oracle, can SQ learn C over D. We also consider a restriction of the SQ model, the correlational statistical query (CSQ) model (Bshouty and Feldman, 2001; Feldman, 2008), which is closely related to Valiant's model of evolvability (Valiant, 2007). We show a similar separation result for distribution-independent CSQ learning under a stronger assumption: there exists a concept class of polynomial CSQ complexity which is not efficiently learnable unless every problem in W[P] has a randomized fixed-parameter tractable algorithm.
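To illustrate the SQ model itself (a generic simulation, not a construction from the paper; all names below are ours), an SQ oracle answers a query φ: X × {-1, 1} → [-1, 1] with E[φ(x, c(x))] up to an additive tolerance, and can be simulated from labeled samples by an empirical average:

```python
import random

def sq_oracle(query, tolerance, examples):
    """Simulate a statistical query oracle from labeled examples: answer
    E[query(x, y)] up to roughly the requested additive tolerance by
    averaging over O(1/tolerance^2) samples (Chernoff bounds make this
    accurate with high probability)."""
    n = max(1, int(4 / tolerance ** 2))
    sample = [random.choice(examples) for _ in range(n)]
    return sum(query(x, y) for x, y in sample) / n

# Toy target concept: majority vote over 3 Boolean (+/-1) variables.
random.seed(0)
examples = []
for _ in range(1000):
    x = tuple(random.choice((-1, 1)) for _ in range(3))
    y = 1 if sum(x) > 0 else -1
    examples.append((x, y))

# Query the correlation of the first coordinate with the label
# (its true value is 1/2 for the majority of 3 fair bits).
corr = sq_oracle(lambda x, y: x[0] * y, tolerance=0.1, examples=examples)
```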

Learning DNF Expressions from Fourier Spectrum
Since its introduction by Valiant in 1984, PAC learning of DNF expressions remains one of the central problems in learning theory. We consider this problem in the setting where the underlying distribution is uniform, or more generally, a product distribution. Kalai, Samorodnitsky, and Teng (2009b) showed that in this setting a DNF expression can be efficiently approximated from its "heavy" low-degree Fourier coefficients alone. This is in contrast to previous approaches, where boosting was used and thus Fourier coefficients of the target function modified by various distributions were needed. This property is crucial for learning DNF expressions over smoothed product distributions, a learning model introduced by Kalai et al. (2009b) and inspired by the seminal smoothed analysis model of Spielman and Teng (2004). We introduce a new approach to learning (or approximating) polynomial threshold functions which is based on creating a function with range [-1, 1] that approximately agrees with the unknown function on low-degree Fourier coefficients. We then describe conditions under which this is sufficient for learning polynomial threshold functions. Our approach yields a new, simple algorithm for approximating any polynomial-size DNF expression from its "heavy" low-degree Fourier coefficients alone. This algorithm greatly simplifies the proof of learnability of DNF expressions over smoothed product distributions and is simpler than all previous algorithms for PAC learning of DNF expressions using membership queries. We also describe an application of our algorithm to learning monotone DNF expressions over product distributions. Building on the work of Servedio (2004), we give an algorithm that runs in time poly((s·log(s/ε))^{log(s/ε)}, n), where s is the size of the DNF expression and ε is the accuracy. This improves on the poly((s·log(ns/ε))^{log(s/ε)·log(1/ε)}, n) bound of Servedio (2004).
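The low-degree Fourier coefficients at the heart of this approach are straightforward to estimate from uniform samples; a small illustrative sketch (our own, not the paper's algorithm) that empirically recovers all coefficients f̂(S) = E[f(x)χ_S(x)] up to a given degree:

```python
import itertools
import random

def chi(S, x):
    """Parity character chi_S(x) = product of x_i over i in S, x in {-1,1}^n."""
    p = 1
    for i in S:
        p *= x[i]
    return p

def empirical_fourier(f, n, degree, m=20000, seed=1):
    """Estimate every Fourier coefficient f_hat(S) = E[f(x) chi_S(x)] with
    |S| <= degree from m uniform samples of {-1,1}^n.  The 'heavy'
    low-degree part is the subset of these with large absolute value."""
    rng = random.Random(seed)
    xs = [tuple(rng.choice((-1, 1)) for _ in range(n)) for _ in range(m)]
    return {S: sum(f(x) * chi(S, x) for x in xs) / m
            for d in range(degree + 1)
            for S in itertools.combinations(range(n), d)}

# A one-term DNF (a single AND) as a +/-1 function: its exact expansion is
# f = -1/2 + x_0/2 + x_1/2 + x_0*x_1/2, so four coefficients equal +/-1/2.
f = lambda x: 1 if (x[0] == 1 and x[1] == 1) else -1
coeffs = empirical_fourier(f, n=4, degree=2)
```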

Consistency of Nearest Neighbor Classification under Selective Sampling
This paper studies nearest neighbor classification in a model where unlabeled data points arrive in a stream, and the learner decides, for each one, whether to ask for its label. Are there generic ways to augment or modify any selective sampling strategy so as to ensure the consistency of the resulting nearest neighbor classifier?
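As a concrete (and entirely hypothetical) instance of the model, here is one naive selective sampling strategy of our own devising: query a point's label only when no labeled point lies within a fixed radius, and otherwise predict with 1-NN. This is meant only to make the setting tangible, not to answer the paper's question:

```python
import math
import random

def selective_nn_stream(stream, radius):
    """Process a stream of (point, true_label) pairs.  Query the label only
    when the nearest labeled point is farther than `radius`; otherwise
    predict with 1-nearest-neighbor and skip the query."""
    labeled = []       # (point, label) pairs whose labels were queried
    predictions = []
    queries = 0
    for x, true_label in stream:
        if labeled:
            nearest = min(labeled, key=lambda p: math.dist(x, p[0]))
            gap = math.dist(x, nearest[0])
        else:
            nearest, gap = None, float("inf")
        if gap > radius:
            labeled.append((x, true_label))   # ask for the label
            queries += 1
            predictions.append(true_label)
        else:
            predictions.append(nearest[1])    # 1-NN prediction, no query
    return predictions, queries

# Two well-separated clusters on the line: few queries, perfect predictions.
random.seed(2)
stream = [((random.uniform(0, 1),), 0) for _ in range(50)] \
       + [((random.uniform(5, 6),), 1) for _ in range(50)]
random.shuffle(stream)
preds, queries = selective_nn_stream(stream, radius=0.5)
```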

Active Learning Using Smooth Relative Regret Approximations with Applications
The disagreement coefficient of Hanneke has become a central concept in proving active learning rates. It has been shown in various ways that a concept class with low complexity, together with a bound on the disagreement coefficient at an optimal solution, allows active learning rates that are superior to passive learning ones. We present a different tool for pool-based active learning which follows from the existence of a certain uniform version of a low disagreement coefficient, but is not equivalent to it. In fact, we present two fundamental active learning problems of significant interest for which our approach yields nontrivial active learning bounds, whereas any general-purpose method relying only on disagreement coefficient bounds fails to guarantee any useful bounds. The tool we use is based on the learner's ability to compute an estimator of the difference between the loss of any hypothesis and some fixed "pivotal" hypothesis, to within an absolute error of at most ε times the l1 distance (the disagreement measure) between the two hypotheses. We prove that such an estimator implies the existence of a learning algorithm which, at each iteration, reduces its excess risk to within a constant factor. Each iteration replaces the current pivotal hypothesis with the minimizer of the estimated loss difference function with respect to the previous pivotal hypothesis. The label complexity essentially becomes that of computing this estimator. The two applications of interest are learning to rank from pairwise preferences, and clustering with side information (a.k.a. semi-supervised clustering). Both are fundamental, and have started receiving more attention from active learning theoreticians and practitioners.
Keywords: active learning, learning to rank from pairwise preferences, semi-supervised clustering, clustering with side information, disagreement coefficient, smooth relative regret approximation.

Robust Interactive Learning
In this paper we propose and study a generalization of the standard active-learning model in which more general types of queries, including class conditional queries and mistake queries, are allowed. Such queries have been quite useful in applications, but have lacked theoretical understanding. In this work, we characterize the power of such queries under several well-known noise models. We give nearly tight upper and lower bounds on the number of queries needed to learn, both in the general agnostic setting and in the bounded noise model. We further show that our methods can be made adaptive to the (unknown) noise rate, with only a negligible loss in query complexity.

Rare Probability Estimation under Regularly Varying Heavy Tails
This paper studies the problem of estimating the probability of symbols that have occurred very rarely, in samples drawn independently from an unknown, possibly infinite, discrete distribution. In particular, we study the multiplicative consistency of estimators, defined as the ratio of the estimate to the true quantity converging to one. We first show that the classical Good-Turing estimator is not universally consistent in this sense, despite enjoying favorable additive properties. We then use Karamata's theory of regular variation to prove that regularly varying heavy tails are sufficient for consistency. At the core of this result is a multiplicative concentration inequality that we establish both by extending the McAllester-Ortiz additive concentration for the missing mass to all rare probabilities and by exploiting regular variation. We also derive a family of estimators which, in addition to being consistent, address some of the shortcomings of the Good-Turing estimator. For example, they perform smoothing implicitly and have the absolute discounting structure of many heuristic algorithms. This also establishes a discrete parallel to extreme value theory, and many of the techniques therein can be adapted to the framework that we set forth.
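The Good-Turing estimator discussed above has a one-line form: the total probability mass of all symbols appearing exactly r times in a sample of size n is estimated as (r+1)·N_{r+1}/n, where N_r counts distinct symbols with frequency r (r = 0 gives the familiar missing-mass estimate). A minimal sketch:

```python
from collections import Counter

def good_turing_mass(sample):
    """Good-Turing estimates M_r = (r+1) * N_{r+1} / n of the total
    probability mass of all symbols seen exactly r times, where N_r is the
    number of distinct symbols with count r and n = len(sample)."""
    n = len(sample)
    counts = Counter(sample)                  # symbol -> frequency
    freq_of_freqs = Counter(counts.values())  # r -> N_r
    return {r: (r + 1) * freq_of_freqs[r + 1] / n
            for r in range(max(freq_of_freqs))}

# 'abracadabra': 'a' appears 5 times, 'b' and 'r' twice, 'c' and 'd' once
# (n = 11), so the estimated missing mass M_0 is N_1 / n = 2/11.
mass = good_turing_mass(list("abracadabra"))
```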
 All Talks
 Unsupervised SVMs: On the Complexity of the Furthest Hyperplane Problem
 (weak) Calibration is Computationally Hard
 Learning Valuation Functions
 Unified Algorithms for Online Learning and Competitive Analysis
 Online Optimization with Gradual Variations
 The Optimality of Jeffreys Prior for Online Density Estimation and the Asymptotic Normality of Maximum Likelihood Estimators
 PAC-Bayesian Bound for Gaussian Process Regression and Multiple Kernel Additive Model
 Random Design Analysis of Ridge Regression
 Reconstruction from Anisotropic Random Measurements
 Toward a Noncommutative Arithmetic-geometric Mean Inequality: Conjectures, Case-studies, and Consequences
 L1 Covering Numbers for Uniformly Bounded Convex Functions
 Generalization Bounds for Online Learning Algorithms with Pairwise Loss Functions
 Attribute-Efficient Learning and Weight-Degree Tradeoffs for Polynomial Threshold Functions
 Learning Functions of Halfspaces Using Prefix Covers
 Computational Bounds on Statistical Query Learning
 Learning DNF Expressions from Fourier Spectrum
 Consistency of Nearest Neighbor Classification under Selective Sampling
 Active Learning Using Smooth Relative Regret Approximations with Applications
 Robust Interactive Learning
 Rare Probability Estimation under Regularly Varying Heavy Tails
 Competitive Classification and Closeness Testing
 Kernels Based Tests with Non-asymptotic Bootstrap Approaches for Two-sample Problems
 Differentially Private Online Learning
 Private Convex Empirical Risk Minimization and High-dimensional Regression
 Distributed Learning, Communication Complexity and Privacy
 A Characterization of Scoring Rules for Linear Properties
 Divergences and Risks for Multiclass Experiments
 A Conjugate Property between Loss Functions and Uncertainty Sets in Classification Problems
 New Bounds for Learning Intervals with Implications for Semi-Supervised Learning
 Tight Bounds on Proper Equivalence Query Learning of DNF
 Distance Preserving Embeddings for General n-Dimensional Manifolds
 A Method of Moments for Mixture Models and Hidden Markov Models
 A Correlation Clustering Approach to Link Classification in Signed Networks
 Spectral Clustering of Graphs with General Degrees in the Extended Planted Partition Model
 Toward Understanding Complex Spaces: Graph Laplacians on Manifolds with Singularities and Boundaries
 Exact Recovery of Sparsely-Used Dictionaries
 Near-Optimal Algorithms for Online Matrix Prediction
 Analysis of Thompson Sampling for the Multi-armed Bandit Problem
 Autonomous Exploration For Navigating In MDPs
 Towards Minimax Policies for Online Linear Optimization with Bandit Feedback
 The Best of Both Worlds: Stochastic and Adversarial Bandits
 Open Problem: Regret Bounds for Thompson Sampling
 Open Problem: Better Bounds for Online Logistic Regression
 Open Problem: Learning Dynamic Network Models from a Static Snapshot
 Open Problem: Does AdaBoost Always Cycle?
 Open Problem: Is Averaging Needed for Strongly Convex Stochastic Gradient Descent?