Please help transcribe this video using our simple transcription tool. You need to be logged in to do so.


For a broad class of practically relevant distribution properties, which includes entropy and support size, nearly all of the proposed estimators have an especially simple form. Given a set of independent samples from a discrete distribution, these estimators tally the vector of summary statistics---the number of species seen once, twice, etc. in the sample---and output the dot product between these summary statistics, and a fixed vector of coefficients. We term such estimators \emph{linear}. This historical proclivity towards linear estimators is slightly perplexing, since, despite many efforts over nearly 60 years, all proposed such estimators have significantly suboptimal convergence. Our main result, in some sense vindicating this insistence on linear estimators, is that for any property in this broad class, there exists a near-optimal linear estimator. Additionally, we give a practical and polynomial-time algorithm for constructing such estimators for any given parameters. While this result does not yield explicit bounds on the sample complexities of these estimation tasks, we leverage the insights provided by this result, to give explicit constructions of a linear estimators for three properties: entropy, $L_1$ distance to uniformity, and for pairs of distributions, $L_1$ distance.Our entropy estimator, when given $O(\frac{n}{\eps \log n})$ independent samples from a distribution of support at most $n,$ will estimate the entropy of the distribution to within accuracy $\epsilon$, with probability of failure $o(1/poly(n)).$ From recent lower bounds, this estimator is optimal, to constant factor, both in its dependence on $n$, and its dependence on $\epsilon.$ In particular, the inverse-linear convergence rate of this estimator resolves the main open question of [VV11], which left open the possibility that the error decreased only with the square root of the number of samples. Our distance to uniformity estimator, on given $O(\frac{m}{\eps^2\log m})$ independent samples from any distribution, returns an $\eps$-accurate estimate of the $L_1$ distance to the uniform distribution of support $m$. This is the first sublinear-sample estimator for this problem, and is constant-factor optimal, for constant $\epsilon$. Finally, our framework extends naturally to properties of pairs of distributions, including estimating the $L_1$ distance and KL-divergence between pairs of distributions. We give an explicit linear estimator for estimating $L_1$ distance to accuracy $\epsilon$ using $O(\frac{n}{\eps^2\log n})$ samples from each distribution, which is constant-factor optimal, for constant $\epsilon$.

Questions and Answers

You need to be logged in to be able to post here.