Intelligent perceptual tasks such as vision and audition require the construction of good internal representations. Machine Learning has been very successful for producing classifiers, but the next big challenge for ML, computer vision, and computational neuroscience is to devise learning algorithms that can learn features and internal representations automatically.
Theoretical and empirical evidence suggest that the perceptual world is best represented by a multi-stage hierarchy in which features in successive stages are increasingly global, invariant, and abstract. An important question is to devise “deep learning” methods for multi-stage architecture than can automatically learn invariant feature hierarchies from labeled and unlabeled data.
A number of unsupervised methods for learning invariant features will be described that are based on sparse coding and sparse auto-encoders: convolutional sparse auto-encoders, invariance through group sparsity, invariance through lateral inhibition, and invariance through temporal constancy. The methods are used to pre-train convolutional networks (ConvNets). ConvNets are biologically-inspired architectures consisting of multiple stages of filter banks, interspersed with non-linear operations, spatial pooling, and contrast normalization operations.
Several applications will be shown through videos and live demos, including a a pedestrian detector, a category-level object recognition system that can be trained on the fly, and a system that can label every pixel in an image with the category of the object it belongs to (scene parsing). Specialized hardware architecture that run these systems in real time will also be described.