TechTalks from event: CVPR 2014 Oral Talks

Orals 2D : Attribute-Based Recognition & Human Pose Estimation

  • DeepPose: Human Pose Estimation via Deep Neural Networks Authors: Alexander Toshev, Christian Szegedy
    We propose a method for human pose estimation based on Deep Neural Networks (DNNs). The pose estimation is formulated as a DNN-based regression problem towards body joints. We present a cascade of such DNN regressors which results in high precision pose estimates. The approach has the advantage of reasoning about pose in a holistic fashion and has a simple yet powerful formulation which capitalizes on recent advances in Deep Learning. We present a detailed empirical analysis with state-of-the-art or better performance on four academic benchmarks of diverse real-world images.
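    A minimal sketch of the cascade idea, using toy MLP stages and a placeholder `crop_around` (both assumptions here; the paper uses AlexNet-style convolutional networks):

```python
import torch
import torch.nn as nn

def make_regressor(in_dim, n_joints):
    # Each stage maps a flattened image crop to per-joint 2-D coordinates/offsets.
    return nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                         nn.Linear(128, 2 * n_joints))

def crop_around(image, joints, size):
    # Placeholder: stages s > 1 would re-crop sub-images around the previous
    # joint estimates; here we simply reuse the full image.
    return image

n_joints, in_dim = 14, 64 * 64
stages = [make_regressor(in_dim, n_joints) for _ in range(3)]

image = torch.rand(1, in_dim)                    # toy flattened input crop
joints = stages[0](image).view(1, n_joints, 2)   # stage 1: holistic regression
for stage in stages[1:]:                         # later stages refine the estimate
    crop = crop_around(image, joints, size=64)
    joints = joints + stage(crop).view(1, n_joints, 2)
print(joints.shape)                              # torch.Size([1, 14, 2])
```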
  • Iterated Second-Order Label Sensitive Pooling for 3D Human Pose Estimation Authors: Catalin Ionescu, Joao Carreira, Cristian Sminchisescu
    Recently, the emergence of Kinect systems has demonstrated the benefits of predicting an intermediate body part labeling for 3D human pose estimation, in conjunction with RGB-D imagery. The availability of depth information plays a critical role, so an important question is whether a similar representation can be developed with sufficient robustness in order to estimate 3D pose from RGB images. This paper provides evidence for a positive answer, by leveraging (a) 2D human body part labeling in images, (b) second-order label-sensitive pooling over dynamically computed regions resulting from a hierarchical decomposition of the body, and (c) iterative structured-output modeling to contextualize the process based on 3D pose estimates. For robustness and generalization, we take advantage of a recent large-scale 3D human motion capture dataset, Human3.6M [18], that also has human body part labeling annotations available with images. We provide extensive experimental studies where alternative intermediate representations are compared and report a substantial 33% error reduction over competitive discriminative baselines that regress 3D human pose against global HOG features.
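    The second-order pooling ingredient can be sketched as a label-weighted outer-product average over a region's local descriptors (toy data below; the paper pools over dynamically computed body-part regions):

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.standard_normal((500, 64))       # local descriptors inside one region
p = rng.random(500); p /= p.sum()        # per-descriptor body-part label weights

# Weighted second-order (outer-product) pooling over the region.
G = (D * p[:, None]).T @ D               # 64 x 64 pooled matrix
feat = G[np.triu_indices(64)]            # vectorize the upper triangle
print(feat.shape)                        # (2080,)
```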
  • 3D Pictorial Structures for Multiple Human Pose Estimation Authors: Vasileios Belagiannis, Sikandar Amin, Mykhaylo Andriluka, Bernt Schiele, Nassir Navab, Slobodan Ilic
    In this work, we address the problem of 3D pose estimation of multiple humans from multiple views. This is a more challenging problem than single human 3D pose estimation due to the much larger state space, partial occlusions, and across-view ambiguities when the identities of the humans are not known in advance. To address these problems, we first create a reduced state space by triangulation of corresponding body joints obtained from part detectors in pairs of camera views. In order to resolve the ambiguities of wrong and mixed body parts of multiple humans after triangulation and also those coming from false positive body part detections, we introduce a novel 3D pictorial structures (3DPS) model. Our model infers 3D human body configurations from our reduced state space. The 3DPS model is generic and applicable to both single and multiple human pose estimation. In order to compare to the state of the art, we first evaluate our method on single human 3D pose estimation on the HumanEva-I [22] and KTH Multiview Football Dataset II [8] datasets. Then, we introduce and evaluate our method on two datasets for multiple human 3D pose estimation.
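    The state-space reduction starts from standard two-view triangulation of detected joints; a minimal linear (DLT) triangulation sketch with assumed 3x4 projection matrices:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    # Linear DLT: each view contributes two rows constraining the 3-D point.
    A = np.vstack([x1[0] * P1[2] - P1[0],
                   x1[1] * P1[2] - P1[1],
                   x2[0] * P2[2] - P2[0],
                   x2[1] * P2[2] - P2[1]])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]                   # inhomogeneous 3-D joint hypothesis

P1 = np.hstack([np.eye(3), np.zeros((3, 1))])       # toy camera 1
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])  # toy camera 2
X_true = np.array([0.3, -0.2, 4.0, 1.0])
x1 = P1 @ X_true; x1 = x1[:2] / x1[2]               # joint detection in view 1
x2 = P2 @ X_true; x2 = x2[:2] / x2[2]               # joint detection in view 2
print(triangulate(P1, P2, x1, x2))                  # ~ [0.3, -0.2, 4.0]
```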
  • Decorrelating Semantic Visual Attributes by Resisting the Urge to Share Authors: Dinesh Jayaraman, Fei Sha, Kristen Grauman
    Existing methods to learn visual attributes are prone to learning the wrong thing---namely, properties that are correlated with the attribute of interest among training samples. Yet, many proposed applications of attributes rely on being able to learn the correct semantic concept corresponding to each attribute. We propose to resolve such confusions by jointly learning decorrelated, discriminative attribute models. Leveraging side information about semantic relatedness, we develop a multi-task learning approach that uses structured sparsity to encourage feature competition among unrelated attributes and feature sharing among related attributes. On three challenging datasets, we show that accounting for structure in the visual attribute space is key to learning attribute models that preserve semantics, yielding improved generalizability that helps in the recognition and discovery of unseen object categories.
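    A minimal sketch of the structured-sparsity mechanism: a group-lasso proximal step that zeroes out whole feature groups for an attribute, so unrelated attributes compete for features rather than share them (toy sizes assumed; the paper's regularizer also encourages sharing among related attributes):

```python
import numpy as np

def prox_group_l2(w, groups, lam):
    # Shrink each feature group's weights toward zero; drop weak groups entirely.
    w = w.copy()
    for g in groups:
        norm = np.linalg.norm(w[g])
        w[g] = 0.0 if norm <= lam else w[g] * (1 - lam / norm)
    return w

rng = np.random.default_rng(1)
w = rng.standard_normal(12)                        # one attribute's weight vector
groups = [slice(0, 4), slice(4, 8), slice(8, 12)]  # disjoint feature groups
print(prox_group_l2(w, groups, lam=2.0))           # some groups zeroed out
```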
  • PANDA: Pose Aligned Networks for Deep Attribute Modeling Authors: Ning Zhang, Manohar Paluri, Marc'Aurelio Ranzato, Trevor Darrell, Lubomir Bourdev
    We propose a method for inferring human attributes (such as gender, hair style, clothes style, expression, action) from images of people under large variation of viewpoint, pose, appearance, articulation and occlusion. Convolutional Neural Nets (CNN) have been shown to perform very well on large scale object recognition problems. In the context of attribute classification, however, the signal is often subtle and it may cover only a small part of the image, while the image is dominated by the effects of pose and viewpoint. Discounting for pose variation would require training on very large labeled datasets which are not presently available. Part-based models, such as poselets and DPM have been shown to perform well for this problem but they are limited by shallow low-level features. We propose a new method which combines part-based models and deep learning by training pose-normalized CNNs. We show substantial improvement vs. state-of-the-art methods on challenging attribute classification tasks in unconstrained settings. Experiments confirm that our method outperforms both the best part-based methods on this problem and conventional CNNs trained on the full bounding box of the person.
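    A toy sketch of pose normalization: run a small CNN on each pose-aligned part crop and classify attributes from the concatenated part features (tiny stand-in networks and made-up sizes, not the paper's poselet-based architecture):

```python
import torch
import torch.nn as nn

def part_net():
    # Tiny per-part feature extractor standing in for a pose-normalized CNN.
    return nn.Sequential(nn.Conv2d(3, 8, 5), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())

n_parts, n_attrs = 4, 9
nets = nn.ModuleList([part_net() for _ in range(n_parts)])
classifier = nn.Linear(n_parts * 8, n_attrs)     # joint attribute scores

crops = [torch.rand(1, 3, 32, 32) for _ in range(n_parts)]  # pose-aligned crops
feats = torch.cat([net(c) for net, c in zip(nets, crops)], dim=1)
print(classifier(feats).shape)                   # torch.Size([1, 9])
```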
  • Learning Scalable Discriminative Dictionary with Sample Relatedness Authors: Jiashi Feng, Stefanie Jegelka, Shuicheng Yan, Trevor Darrell
    Attributes are widely used as mid-level descriptors of object properties in object recognition and retrieval. Mostly, such attributes are manually pre-defined based on domain knowledge, and their number is fixed. However, pre-defined attributes may fail to adapt to the properties of the data at hand, may not necessarily be discriminative, and/or may not generalize well. In this work, we propose a dictionary learning framework that flexibly adapts to the complexity of the given data set and reliably discovers the inherent discriminative mid-level binary features in the data. We use sample relatedness information to improve the generalization of the learned dictionary. We demonstrate that our framework is applicable to both object recognition and complex image retrieval tasks even with few training examples. Moreover, the learned dictionary also helps classify novel object categories. Experimental results on the Animals with Attributes, ILSVRC2010 and PASCAL VOC2007 datasets indicate that using relatedness information leads to significant performance gains over established baselines.
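    The generic dictionary-learning loop underlying such frameworks alternates sparse coding with dictionary updates; a rough sketch on toy data (the paper's relatedness regularizer and binarization are omitted here):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 200))      # features x samples
D = rng.standard_normal((50, 20))       # dictionary atoms (columns)
D /= np.linalg.norm(D, axis=0)

for _ in range(10):
    # Sparse coding: one soft-thresholding step (a crude ISTA-style update).
    A = D.T @ X
    A = np.sign(A) * np.maximum(np.abs(A) - 0.5, 0.0)
    # Dictionary update: regularized least squares, then renormalize atoms.
    D = X @ A.T @ np.linalg.pinv(A @ A.T + 1e-6 * np.eye(20))
    D /= np.linalg.norm(D, axis=0) + 1e-12
print(np.linalg.norm(X - D @ A))        # reconstruction residual
```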

Orals 2F : Convolutional Neural Networks

  • Learning and Transferring Mid-Level Image Representations using Convolutional Neural Networks Authors: Maxime Oquab, Leon Bottou, Ivan Laptev, Josef Sivic
    Convolutional neural networks (CNN) have recently shown outstanding image classification performance in the large-scale visual recognition challenge (ILSVRC2012). The success of CNNs is attributed to their ability to learn rich mid-level image representations as opposed to hand-designed low-level features used in other image classification methods. Learning CNNs, however, amounts to estimating millions of parameters and requires a very large number of annotated image samples. This property currently prevents application of CNNs to problems with limited training data. In this work we show how image representations learned with CNNs on large-scale annotated datasets can be efficiently transferred to other visual recognition tasks with a limited amount of training data. We design a method to reuse layers trained on the ImageNet dataset to compute mid-level image representations for images in the PASCAL VOC dataset. We show that despite differences in image statistics and tasks in the two datasets, the transferred representation leads to significantly improved results for object and action classification, outperforming the current state of the art on PASCAL VOC 2007 and 2012 datasets. We also show promising results for object and action localization.
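    The transfer recipe can be sketched with a modern deep-learning library: freeze the convolutional layers of an ImageNet-pretrained network and learn a new classifier head for the target task (torchvision's AlexNet as a stand-in; the frozen/retrained split and sizes are illustrative, not the paper's exact setup):

```python
import torch
import torchvision

net = torchvision.models.alexnet(weights="IMAGENET1K_V1")  # ImageNet-pretrained
for p in net.features.parameters():      # freeze the transferred conv layers
    p.requires_grad = False

n_target_classes = 20                    # e.g. PASCAL VOC object categories
net.classifier[6] = torch.nn.Linear(4096, n_target_classes)  # new task head

x = torch.rand(1, 3, 224, 224)           # toy input image
print(net(x).shape)                      # torch.Size([1, 20])
```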
  • Large-scale Video Classification with Convolutional Neural Networks Authors: Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, Li Fei-Fei
    Convolutional Neural Networks (CNNs) have been established as a powerful class of models for image recognition problems. Encouraged by these results, we provide an extensive empirical evaluation of CNNs on large-scale video classification using a new dataset of 1 million YouTube videos belonging to 487 classes. We study multiple approaches for extending the connectivity of a CNN in the time domain to take advantage of local spatio-temporal information and suggest a multiresolution, foveated architecture as a promising way of speeding up the training. Our best spatio-temporal networks display significant performance improvements compared to strong feature-based baselines (55.3% to 63.9%), but only a surprisingly modest improvement compared to single-frame models (59.3% to 60.9%). We further study the generalization performance of our best model by retraining the top layers on the UCF-101 Action Recognition dataset and observe significant performance improvements compared to the UCF-101 baseline model (63.3% up from 43.9%).
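    A toy sketch of the foveated, multiresolution design: a context stream sees a downsampled frame while a fovea stream sees a full-resolution center crop, and their features are fused before classification (tiny stand-in streams, not the paper's networks):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def stream():
    # Tiny stand-in for one of the two processing streams.
    return nn.Sequential(nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())

context, fovea = stream(), stream()
head = nn.Linear(32, 487)                       # 487 sports classes

frame = torch.rand(1, 3, 178, 178)              # one 178x178 video frame
ctx_in = F.interpolate(frame, size=(89, 89))    # downsampled context view
h, w = frame.shape[-2:]
fov_in = frame[..., h//2-44:h//2+45, w//2-44:w//2+45]  # 89x89 center crop
feats = torch.cat([context(ctx_in), fovea(fov_in)], dim=1)
print(head(feats).shape)                        # torch.Size([1, 487])
```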
  • Convolutional Neural Networks for No-Reference Image Quality Assessment Authors: Le Kang, Peng Ye, Yi Li, David Doermann
    In this work we describe a Convolutional Neural Network (CNN) to accurately predict image quality without a reference image. Taking image patches as input, the CNN works in the spatial domain without using hand-crafted features that are employed by most previous methods. The network consists of one convolutional layer with max and min pooling, two fully connected layers and an output node. Within the network structure, feature learning and regression are integrated into one optimization process, which leads to a more effective model for estimating image quality. This approach achieves state-of-the-art performance on the LIVE dataset and shows excellent generalization ability in cross dataset experiments. Further experiments on images with local distortions demonstrate the local quality estimation ability of our CNN, which is rarely reported in previous literature.
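    The described architecture is small enough to sketch directly: one convolutional layer, max- and min-pooling over each feature map, two fully connected layers, and a scalar quality output (kernel count and layer widths here follow the abstract's description loosely and should be treated as assumptions):

```python
import torch
import torch.nn as nn

class IQANet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 50, 7)          # 50 kernels on grayscale patches
        self.fc = nn.Sequential(nn.Linear(100, 800), nn.ReLU(),
                                nn.Linear(800, 800), nn.ReLU(),
                                nn.Linear(800, 1))  # scalar quality score

    def forward(self, x):
        f = self.conv(x)
        mx = f.amax(dim=(2, 3))                  # max pool over each feature map
        mn = f.amin(dim=(2, 3))                  # min pool over each feature map
        return self.fc(torch.cat([mx, mn], dim=1))

print(IQANet()(torch.rand(4, 1, 32, 32)).shape)  # torch.Size([4, 1])
```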

Orals 3A : Physics-Based Vision & Shape-from-X

  • Multiview Shape and Reflectance from Natural Illumination Authors: Geoffrey Oxholm, Ko Nishino
    The world is full of objects with complex reflectances, situated in complex illumination environments. Past work on full 3D geometry recovery, however, has tried to handle this complexity by framing it into simplistic models of reflectance (Lambertian, mirrored, or diffuse plus specular) or illumination (one or more point light sources). Though there has been some recent progress in directly utilizing such complexities for recovering single-view geometry, it is not clear how such single-view methods can be extended to reconstruct the full geometry. To this end, we derive a probabilistic geometry estimation method that fully exploits the rich signal embedded in complex appearance. Though each observation provides partial and unreliable information, we show how to estimate the reflectance responsible for the diverse appearance, and unite the orientation cues embedded in each observation to reconstruct the underlying geometry. We demonstrate the effectiveness of our method on synthetic and real-world objects. The results show that our method performs accurately across a wide range of real-world environments and reflectances that lie between the extremes that have been the focus of past work.
  • Reflectance and Fluorescent Spectra Recovery based on Fluorescent Chromaticity Invariance under Varying Illumination Authors: Ying Fu, Antony Lam, Yasuyuki Kobashi, Imari Sato, Takahiro Okabe, Yoichi Sato
    In recent years, fluorescence analysis of scenes has received attention. Fluorescence can provide additional information about scenes, and has been used in applications such as camera spectral sensitivity estimation, 3D reconstruction, and color relighting. In particular, hyperspectral images of reflective-fluorescent scenes provide a rich amount of data. However, due to the complex nature of fluorescence, hyperspectral imaging methods rely on specialized equipment such as hyperspectral cameras and specialized illuminants. In this paper, we propose a more practical approach to hyperspectral imaging of reflective-fluorescent scenes using only a conventional RGB camera and varied colored illuminants. The key idea of our approach is to exploit a unique property of fluorescence: the chromaticity of fluorescence emissions are invariant under different illuminants. This allows us to robustly estimate spectral reflectance and fluorescence emission chromaticity. We then show that given the spectral reflectance and fluorescent chromaticity, the fluorescence absorption and emission spectra can also be estimated. We demonstrate in results that all scene spectra can be accurately estimated from RGB images. Finally, we show that our method can be used to accurately relight scenes under novel lighting.
  • What Camera Motion Reveals About Shape With Unknown BRDF Authors: Manmohan Chandraker
    Psychophysical studies show motion cues inform about shape even with unknown reflectance. Recent works in computer vision have considered shape recovery for an object of unknown BRDF using light source or object motions. This paper addresses the remaining problem of determining shape from the (small or differential) motion of the camera, for unknown isotropic BRDFs. Our theory derives a differential stereo relation that relates camera motion to depth of a surface with unknown isotropic BRDF, which generalizes traditional Lambertian assumptions. Under orthographic projection, we show shape may not be constrained in general, but two motions suffice to yield an invariant for several restricted (still unknown) BRDFs exhibited by common materials. For the perspective case, we show that three differential motions suffice to yield surface depth for unknown isotropic BRDF and unknown directional lighting, while additional constraints are obtained with restrictions on BRDF or lighting. The limits imposed by our theory are intrinsic to the shape recovery problem and independent of choice of reconstruction method. We outline with experiments how potential reconstruction methods may exploit our theory. We illustrate trends shared by theories on shape from motion of light, object or camera, relating reconstruction hardness to imaging complexity.
  • Photometric Stereo using Constrained Bivariate Regression for General Isotropic Surfaces Authors: Satoshi Ikehata, Kiyoharu Aizawa
    This paper presents a photometric stereo method that is purely pixelwise and handles general isotropic surfaces in a stable manner. Following the recently proposed sum-of-lobes representation of the isotropic reflectance function, we construct a constrained bivariate regression problem where the regression function is approximated by smooth, bivariate Bernstein polynomials. We separate the unknown normal vector from the unknown reflectance function by considering the inverse representation of the image formation process, and then accurately compute the unknown surface normals by solving a simple and efficient quadratic programming problem. Extensive evaluations on both synthetic and real-world images show state-of-the-art performance.
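    The regression ingredient can be illustrated in one dimension: fit observed intensities as a function of t = n·l with nonnegative Bernstein-polynomial coefficients (a univariate toy; the paper uses bivariate polynomials and a full quadratic program with additional constraints):

```python
import numpy as np
from math import comb
from scipy.optimize import nnls

def bernstein_basis(t, n=8):
    # Degree-n Bernstein basis functions evaluated at points t in [0, 1].
    t = np.asarray(t)
    return np.stack([comb(n, k) * t**k * (1 - t)**(n - k)
                     for k in range(n + 1)], axis=1)

t = np.linspace(0, 1, 40)                 # toy n·l samples
I = t**3 + 0.1 * t                        # toy isotropic reflectance curve
coef, _ = nnls(bernstein_basis(t), I)     # nonnegativity keeps the fit valid
print(np.abs(bernstein_basis(t) @ coef - I).max())  # small fitting error
```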
  • Robust Separation of Reflection from Multiple Images Authors: Xiaojie Guo, Xiaochun Cao, Yi Ma
    When one records a video/image sequence through a transparent medium (e.g. glass), the image is often a superposition of a transmitted layer (scene behind the medium) and a reflected layer. Recovering the two layers from such images seems to be a highly ill-posed problem since the number of unknowns to recover is twice as many as the given measurements. In this paper, we propose a robust method to separate these two layers from multiple images, which exploits the correlation of the transmitted layer across multiple images, and the sparsity and independence of the gradient fields of the two layers. A novel Augmented Lagrangian Multiplier based algorithm is designed to efficiently and effectively solve the decomposition problem. The experimental results on both simulated and real data demonstrate the superior performance of the proposed method over the state of the art, in terms of accuracy and simplicity.
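    A simplified analogue of the ALM decomposition is robust PCA: split the image stack into a low-rank part (the transmitted layer, correlated across frames) and a sparse part (reflections). The sketch below is the standard inexact-ALM robust PCA loop on toy data, not the paper's gradient-field model:

```python
import numpy as np

def shrink(X, tau):                        # elementwise soft-thresholding
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

rng = np.random.default_rng(3)
M = rng.standard_normal((100, 1)) @ rng.standard_normal((1, 12))  # rank-1 stack
M[rng.random(M.shape) < 0.05] += 5.0       # sparse "reflection" corruptions

lam, mu = 1 / np.sqrt(100), 1.0
L = np.zeros_like(M); S = np.zeros_like(M); Y = np.zeros_like(M)
for _ in range(200):
    U, s, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
    L = U @ np.diag(shrink(s, 1 / mu)) @ Vt    # singular-value thresholding
    S = shrink(M - L + Y / mu, lam / mu)       # sparse-layer update
    Y += mu * (M - L - S)                      # multiplier update
print(np.linalg.matrix_rank(L, tol=1e-6))      # recovers the rank-1 layer
```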
  • Surface-from-Gradients: An Approach Based on Discrete Geometry Processing Authors: Wuyuan Xie, Yunbo Zhang, Charlie C. L. Wang, Ronald C.-K. Chung
    In this paper, we propose an efficient method to reconstruct surface-from-gradients (SfG). Our method is formulated under the framework of discrete geometry processing. Unlike existing SfG approaches, we recast the continuous reconstruction problem in a discrete space and efficiently solve it via a sequence of least-squares optimization steps. Our discrete formulation brings three advantages: 1) the reconstruction preserves sharp features, 2) sparse or incomplete sets of gradients can be well handled, and 3) domains of computation can have irregular boundaries. Our formulation is direct and easy to implement, and comparisons with state-of-the-art methods show the effectiveness of our approach.
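    For context, the classical continuous baseline solves one global least-squares (Poisson-like) system for the surface from its target gradients; a minimal sketch on a regular grid (the paper's discrete geometry processing instead iterates local shape updates):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lsqr

H, W = 32, 32
y, x = np.mgrid[0:H, 0:W]
z_true = np.sin(x / 6.0) + np.cos(y / 8.0)   # toy ground-truth surface
p = np.diff(z_true, axis=1).ravel()          # target x-gradients
q = np.diff(z_true, axis=0).ravel()          # target y-gradients

# Sparse forward-difference operators for a row-major flattened grid.
d = lambda n: sp.diags([-np.ones(n - 1), np.ones(n - 1)], [0, 1],
                       shape=(n - 1, n))
Dx = sp.kron(sp.identity(H), d(W))
Dy = sp.kron(d(H), sp.identity(W))

A = sp.vstack([Dx, Dy]).tocsr()
z = lsqr(A, np.concatenate([p, q]))[0].reshape(H, W)
z += z_true.mean() - z.mean()                # fix the free constant offset
print(np.abs(z - z_true).max())              # small reconstruction error
```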