TechTalks from event: Big Learning: Algorithms, Systems, and Tools for Learning at Scale


Day 1 Afternoon Session

  • Tutorial: Vowpal Wabbit Authors: John Langford
  • Towards Human Behavior Understanding from Pervasive Data: Opportunities and Challenges Ahead Authors: Nuria Oliver
    We live in an increasingly digitized world where our physical and digital interactions leave digital footprints. It is through the analysis of these digital footprints that we can learn and model some of the many facets that characterize people, including their tastes, personalities, social network interactions, and mobility and communication patterns. In my talk, I will present a summary of our research efforts on transforming these massive amounts of user behavioral data into meaningful insights, in which machine learning and data mining techniques play a central role. The projects I will describe cover a broad set of areas, including smart cities and urban computing, psychographics, socioeconomic status prediction, and disease propagation. For each project, I will highlight the main results and point to the technical challenges still to be solved from a data analysis perspective.
  • Parallelizing Training of the Kinect Body Parts Labeling Algorithm Authors: Derek Murray
    We present the parallelized implementation of decision forest training as used in Kinect to train the body parts classification system. We describe the practical details of dealing with large training sets and deep trees, and show how to parallelize over multiple dimensions of the problem.
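
    A minimal Python sketch of one way to data-parallelize split finding for a single tree node, not the speakers' Kinect implementation: each worker holds a shard of the training examples and computes per-(feature, threshold) class histograms, which are then merged before scoring candidate splits. The toy data, candidate thresholds, and information-gain scoring are all illustrative assumptions.

        import numpy as np
        from concurrent.futures import ProcessPoolExecutor

        def shard_histograms(args):
            # Per-shard class counts for every (feature, threshold, side) candidate.
            X, y, thresholds, n_classes = args
            hist = np.zeros((X.shape[1], len(thresholds), 2, n_classes), dtype=np.int64)
            for f in range(X.shape[1]):
                for t, thr in enumerate(thresholds):
                    left = X[:, f] <= thr
                    for c in range(n_classes):
                        hist[f, t, 0, c] = np.count_nonzero(left & (y == c))
                        hist[f, t, 1, c] = np.count_nonzero(~left & (y == c))
            return hist

        def entropy(counts):
            # Shannon entropy along the class axis, guarding against empty nodes.
            p = counts / np.maximum(counts.sum(axis=-1, keepdims=True), 1)
            logp = np.log2(p, where=p > 0, out=np.zeros_like(p))
            return -(p * logp).sum(axis=-1)

        if __name__ == "__main__":
            # Illustrative toy data, not Kinect depth features.
            rng = np.random.default_rng(0)
            X, y = rng.normal(size=(10_000, 8)), rng.integers(0, 4, size=10_000)
            thresholds = np.linspace(-2, 2, 16)
            shards = [(X[i::4], y[i::4], thresholds, 4) for i in range(4)]
            with ProcessPoolExecutor(max_workers=4) as pool:
                hist = sum(pool.map(shard_histograms, shards))  # reduce: merge histograms
            side_n = hist.sum(axis=3)                           # examples per (f, t, side)
            total = np.maximum(side_n.sum(axis=2, keepdims=True), 1)
            weighted = ((side_n / total) * entropy(hist)).sum(axis=2)
            f, t = np.unravel_index(np.argmin(weighted), weighted.shape)
            print(f"best split: feature {f}, threshold {thresholds[t]:.2f}")

    A production trainer would also parallelize over the other dimensions the abstract mentions, for example over the nodes in the current tree frontier and over distributed shards of a training set too large for one machine.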
  • Big Machine Learning made Easy Authors: Miguel Araujo and Charles Parker
    While machine learning has made its way into certain industrial applications, there are many important real-world domains, especially domains with large-scale data, that remain unexplored. There are a number of reasons for this, and they occur at all levels of the technology stack. One concern is ease of use, so that practitioners with access to big data who are not necessarily machine learning experts are able to create models. Another is transparency: users are more likely to want models they can easily visualize and understand. A flexible API layer is required so users can integrate models into their business processes with a minimum of hassle. Finally, a robust back-end is required to parallelize machine learning algorithms and scale up or down as needed. In this talk, we discuss our attempt at building a system that satisfies all of these requirements. We will briefly demonstrate the functionality of the system and discuss major architectural concerns and future work.
  • Fast Cross-Validation via Sequential Analysis Authors: Tammo Krueger
    With the increasing size of today's data sets, finding the right parameter configuration via cross-validation can be an extremely time-consuming task. In this paper we propose an improved cross-validation procedure which uses non-parametric testing coupled with sequential analysis to determine the best parameter set on linearly increasing subsets of the data. By eliminating underperforming candidates quickly and keeping promising candidates as long as possible the method speeds up the computation while preserving the capability of the full cross-validation. The experimental evaluation shows that our method reduces the computation time by a factor of up to 70 compared to a full cross-validation with a negligible impact on the accuracy.
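
    A simplified Python sketch of the racing idea: a grid of SVM cost parameters is evaluated on linearly growing subsets, and configurations that test significantly worse than the current leader are dropped early. The Wilcoxon signed-rank elimination rule is a stand-in for the paper's non-parametric sequential test, and the candidate grid, subset schedule, and 0.05 threshold are illustrative.

        import numpy as np
        from scipy.stats import wilcoxon
        from sklearn.datasets import make_classification
        from sklearn.model_selection import cross_val_score
        from sklearn.svm import SVC

        X, y = make_classification(n_samples=4000, random_state=0)
        candidates = {C: SVC(C=C) for C in (0.01, 0.1, 1, 10, 100)}
        scores = {C: [] for C in candidates}

        for frac in np.linspace(0.2, 1.0, 5):        # linearly increasing subsets
            n = int(frac * len(y))
            for C, model in candidates.items():
                scores[C].extend(cross_val_score(model, X[:n], y[:n], cv=5))
            best = max(candidates, key=lambda C: np.mean(scores[C]))
            for C in list(candidates):               # eliminate underperformers early
                if C == best or len(scores[C]) < 10:
                    continue
                a, b = scores[C], scores[best][:len(scores[C])]
                if np.allclose(a, b):
                    continue
                # Stand-in test; the paper uses a different sequential procedure.
                _, p = wilcoxon(a, b)
                if p < 0.05 and np.mean(a) < np.mean(b):
                    del candidates[C]
            print(f"n={n}: surviving C values {sorted(candidates)}")

        print("selected C:", max(candidates, key=lambda C: np.mean(scores[C])))

    Configurations eliminated early never see the larger, more expensive subsets, which is where the reported speedups over full cross-validation come from.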
  • Machine Learning's Role in the Search for Fundamental Particles Authors: Daniel Whiteson
    High-energy physicists try to decompose matter into its most fundamental pieces by colliding particles at extreme energies. But extracting clues about the structure of matter from these collisions is not a trivial task, due to the incompleteness of the data we can gather about the collisions, the subtlety of the signals we seek, and the sheer rate and dimensionality of the data. These challenges are not unique to high-energy physics, and there is potential for great progress through collaboration between high-energy physicists and machine learning experts. I will describe the nature of the physics problem, the challenges we face in analyzing the data, the previous successes and failures of some ML techniques, and the open challenges.
  • Bootstrapping Big Data Authors: Ariel Kleiner
    The bootstrap provides a simple and powerful means of assessing the quality of estimators. However, in settings involving large datasets, the computation of bootstrap-based quantities can be prohibitively computationally demanding. As an alternative, we introduce the Bag of Little Bootstraps (BLB), a new procedure which incorporates features of both the bootstrap and subsampling to obtain a more computationally efficient, though still robust, means of quantifying the quality of estimators. BLB shares the generic applicability and statistical efficiency of the bootstrap and is furthermore well suited for application to very large datasets using modern distributed computing architectures, as it uses only small subsets of the observed data at any point during its execution. We provide both empirical and theoretical results which demonstrate the efficacy of BLB.
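
    A minimal Python sketch of BLB for a toy case, taking the estimator to be the sample mean and the quality measure to be its standard error. The subset size b = n^gamma follows the paper's recipe, but the particular gamma, s, and r values are illustrative; each size-n resample is represented by multinomial counts over a size-b subset, so only b distinct points are touched at any time.

        import numpy as np

        def blb_stderr(x, gamma=0.6, s=10, r=100, seed=0):
            rng = np.random.default_rng(seed)
            n = len(x)
            b = int(n ** gamma)                  # small subset size, b << n
            subset_quality = []
            for _ in range(s):
                subset = rng.choice(x, size=b, replace=False)
                estimates = []
                for _ in range(r):
                    # A size-n bootstrap resample expressed as multinomial
                    # weights over the b subset points, so no size-n array
                    # is ever materialized.
                    counts = rng.multinomial(n, np.full(b, 1.0 / b))
                    estimates.append(np.average(subset, weights=counts))
                subset_quality.append(np.std(estimates, ddof=1))
            return np.mean(subset_quality)       # average the s per-subset assessments

        # Illustrative data: for a N(5, 2) sample the analytic SE is known.
        x = np.random.default_rng(1).normal(loc=5.0, scale=2.0, size=100_000)
        print("BLB estimate of the SE of the mean:", blb_stderr(x))
        print("analytic SE for comparison:        ", 2.0 / np.sqrt(len(x)))

    Because each worker only ever holds b points plus a vector of counts, the per-machine footprint stays small, which is what makes the procedure well suited to the distributed architectures the abstract mentions.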