TechTalks from event: CVPR 2014 Video Spotlights

Posters 4A : Computational Photography, Motion & Tracking, Recognition

  • Aliasing Detection and Reduction in Plenoptic Imaging Authors: Zhaolin Xiao, Qing Wang, Guoqing Zhou, Jingyi Yu
    When using a plenoptic camera for digital refocusing, angular undersampling can cause severe (angular) aliasing artifacts. Previous approaches have focused on avoiding aliasing by pre-processing the acquired light field via prefiltering, demosaicing, reparameterization, etc. In this paper, we present a different solution that first detects and then removes aliasing at the light field refocusing stage. Different from previous frequency domain aliasing analysis, we carry out a spatial domain analysis to reveal whether aliasing would occur and where in the image it would occur. The spatial analysis also facilitates easy separation of the aliasing vs. non-aliasing regions and aliasing removal. Experiments on both synthetic scenes and real light field camera array data sets demonstrate that our approach has a number of advantages over the classical prefiltering and depth-dependent light field rendering techniques.
  • Image Pre-compensation: Balancing Contrast and Ringing Authors: Yu Ji, Jinwei Ye, Sing Bing Kang, Jingyi Yu
    The goal of image pre-compensation is to process an image such that, after being convolved with a known kernel, it appears close to the sharp reference image. In a practical setting, the pre-compensated image has significantly higher dynamic range than the latent image. As a result, some form of tone mapping is needed. In this paper, we show how global tone mapping functions affect contrast and ringing in image pre-compensation. In particular, we show that linear tone mapping eliminates ringing but incurs severe contrast loss, while non-linear tone mapping functions such as Gamma curves slightly enhance contrast but introduce ringing. To enable quantitative analysis, we design new metrics to measure the contrast of an image with ringing. Specifically, we set out to find its "equivalent ringing-free" image that matches its intensity histogram and use its contrast as the measure. We illustrate our approach on projector defocus compensation and visual acuity enhancement. Compared with the state-of-the-art, our approach significantly improves the contrast. We believe our technique is the first to analytically trade off between contrast and ringing.
  • Gyro-Based Multi-Image Deconvolution for Removing Handshake Blur Authors: Sung Hee Park, Marc Levoy
    Image deblurring to remove blur caused by camera shake has been intensively studied. Nevertheless, most methods are brittle and computationally expensive. In this paper we analyze multi-image approaches, which capture and combine multiple frames in order to make deblurring more robust and tractable. In particular, we compare the performance of two approaches: align-and-average and multi-image deconvolution. Our deconvolution is non-blind, using a blur model obtained from real camera motion as measured by a gyroscope. We show that in most situations such deconvolution outperforms align-and-average. We also show, perhaps surprisingly, that deconvolution does not benefit from increasing exposure time beyond a certain threshold. To demonstrate the effectiveness and efficiency of our method, we apply it to still-resolution imagery of natural scenes captured using a mobile camera with flexible camera control and an attached gyroscope.
  • Similarity-Aware Patchwork Assembly for Depth Image Super-Resolution Authors: Jing Li, Zhichao Lu, Gang Zeng, Rui Gan, Hongbin Zha
    This paper describes a patchwork assembly algorithm for depth image super-resolution. An input low resolution depth image is disassembled into parts by matching similar regions on a set of high resolution training images, and a super-resolution image is then assembled using these corresponding matched counterparts. We convert the super resolution problem into a Markov Random Field (MRF) labeling problem, and propose a unified formulation embedding (1) the consistency between the resolution enhanced image and the original input, (2) the similarity of disassembled parts with the corresponding regions on training images, (3) the depth smoothness in local neighborhoods, (4) the additional geometric constraints from self-similar structures in the scene, and (5) the boundary coincidence between the resolution enhanced depth image and an optional aligned high resolution intensity image. Experimental results on both synthetic and real-world data demonstrate that the proposed algorithm is capable of recovering high quality depth images with ×4 resolution enhancement along each coordinate direction, and that it outperforms state-of-the-art methods [14] in both qualitative and quantitative evaluations.
  • Deblurring Low-light Images with Light Streaks Authors: Zhe Hu, Sunghyun Cho, Jue Wang, Ming-Hsuan Yang
    Images taken in low-light conditions with handheld cameras are often blurry due to the required long exposure time. Although significant progress has been made recently on image deblurring, state-of-the-art approaches often fail on low-light images, as these images do not contain a sufficient number of salient features that deblurring methods rely on. On the other hand, light streaks are common phenomena in low-light images that contain rich blur information, but have not been extensively explored in previous approaches. In this work, we propose a new method that utilizes light streaks to help deblur low-light images. We introduce a non-linear blur model that explicitly models light streaks and their underlying light sources, and poses them as constraints for estimating the blur kernel in an optimization framework. Our method also automatically detects useful light streaks in the input image. Experimental results show that our approach obtains good results on challenging real-world examples that no other methods could achieve before.
  • Raw-to-Raw: Mapping between Image Sensor Color Responses Authors: Rang Nguyen, Dilip K. Prasad, Michael S. Brown
    Camera images saved in raw format are being adopted in computer vision tasks since raw values represent minimally processed sensor responses. Camera manufacturers, however, have yet to adopt a standard for raw images, and current raw-rgb values are device specific because sensors differ in their spectral sensitivities. This results in significantly different raw images for the same scene captured with different cameras. This paper focuses on estimating a mapping that can convert a raw image of an arbitrary scene and illumination from one camera's raw space to another. To this end, we examine various mapping strategies including linear and non-linear transformations applied both in a global and illumination-specific manner. We show that illumination-specific mappings give the best result, but at the expense of requiring a large number of transformations. To address this issue, we introduce an illumination-independent mapping approach that uses white-balancing to assist in reducing the number of required transformations. We show that this approach achieves state-of-the-art results on a range of consumer cameras and images of arbitrary scenes and illuminations.
  • Robust 3D Tracking with Descriptor Fields Authors: Alberto Crivellaro, Vincent Lepetit
    We introduce a method that can register challenging images from specular and poorly textured 3D environments, on which previous approaches fail. We assume that a small set of reference images of the environment and a partial 3D model are available. Like previous approaches, we register the input images by aligning them with one of the reference images using the 3D information. However, these approaches typically rely on the pixel intensities for the alignment, which is prone to fail in the presence of specularities or in the absence of texture. Our main contribution is an efficient novel local descriptor that we use to describe each image location. We show that we can rely on this descriptor in place of the intensities to significantly improve the alignment robustness at a minor increase in computational cost, and we analyze the reasons behind the success of our descriptor.
  • Better Feature Tracking Through Subspace Constraints Authors: Bryan Poling, Gilad Lerman, Arthur Szlam
    Feature tracking in video is a crucial task in computer vision. Usually, the tracking problem is handled one feature at a time, using a single-feature tracker like the Kanade-Lucas-Tomasi algorithm, or one of its derivatives. While this approach works quite well when dealing with high-quality video and "strong" features, it often falters when faced with dark and noisy video containing low-quality features. We present a framework for jointly tracking a set of features, which enables sharing information between the different features in the scene. We show that our method can be employed to track features for both rigid and non-rigid motions (possibly of a few moving bodies) even when some features are occluded. Furthermore, it can be used to significantly improve tracking results in poorly-lit scenes (where there is a mix of good and bad features). Our approach does not require direct modeling of the structure or the motion of the scene, and runs in real time on a single CPU core.
  • Region-based Particle Filter for Video Object Segmentation Authors: David Varas, Ferran Marques
    We present a video object segmentation approach that extends the particle filter to a region-based image representation. Image partition is considered part of the particle filter measurement, which enriches the available information and leads to a re-formulation of the particle filter. The prediction step uses a co-clustering between the previous image object partition and a partition of the current one, which allows us to tackle the evolution of non-rigid structures. Particles are defined as unions of regions in the current image partition and their propagation is computed through a single co-clustering. The proposed technique is assessed on the SegTrack dataset, leading to satisfactory perceptual results and obtaining very competitive pixel error rates compared with the state-of-the-art methods.
  • Visual Tracking via Probability Continuous Outlier Model Authors: Dong Wang, Huchuan Lu
    In this paper, we present a novel online visual tracking method based on linear representation. First, we present a novel probability continuous outlier model (PCOM) to depict the continuous outliers that occur in the linear representation model. In the proposed model, each element of the noisy observation sample can be either represented by a PCA subspace with small Gaussian noise or treated as an arbitrary value with a uniform prior, in which the spatial consistency prior is exploited by using a binary Markov random field model. Then, we derive the objective function of the PCOM method, the solution of which can be iteratively obtained by the outlier-free least squares and standard max-flow/min-cut steps. Finally, based on the proposed PCOM method, we design an effective observation likelihood function and a simple update scheme for visual tracking. Both qualitative and quantitative evaluations demonstrate that our tracker achieves very favorable performance in terms of both accuracy and speed.
  • Three Guidelines of Online Learning for Large-Scale Visual Recognition Authors: Yoshitaka Ushiku, Masatoshi Hidaka, Tatsuya Harada
    In this paper, we evaluate online learning algorithms for large-scale visual recognition using state-of-the-art features which are preselected and held fixed. Today, combinations of high-dimensional features and linear classifiers are widely used for large-scale visual recognition. Numerous so-called mid-level features have been developed and mutually compared on an experimental basis. Although various learning methods for linear classification have also been proposed in the machine learning and natural language processing literature, they have rarely been evaluated for visual recognition. Therefore, we give guidelines via investigations of state-of-the-art online learning methods for linear classifiers. Many of these methods have been evaluated only on toy data and natural language processing problems such as document classification. Consequently, we give those methods a unified interpretation from the viewpoint of visual recognition. Results of controlled comparisons indicate three guidelines that might change the pipeline for visual recognition.
  • Using k-Poselets for Detecting People and Localizing Their Keypoints Authors: Georgia Gkioxari, Bharath Hariharan, Ross Girshick, Jitendra Malik
    A k-poselet is a deformable part model (DPM) with k parts, where each of the parts is a poselet, aligned to a specific configuration of keypoints based on ground-truth annotations. A separate template is used to learn the appearance of each part. The parts are allowed to move with respect to each other with a deformation cost that is learned at training time. This model is richer than both the traditional version of poselets and DPMs. It enables a unified approach to person detection and keypoint prediction which, barring contemporaneous approaches based on CNN features, achieves state-of-the-art keypoint prediction while maintaining competitive detection performance.
  • Large-Scale Visual Font Recognition Authors: Guang Chen, Jianchao Yang, Hailin Jin, Jonathan Brandt, Eli Shechtman, Aseem Agarwala, Tony X. Han
    This paper addresses the large-scale visual font recognition (VFR) problem, which aims at automatic identification of the typeface, weight, and slope of the text in an image or photo without any knowledge of content. Although visual font recognition has many practical applications, it has largely been neglected by the vision community. To address the VFR problem, we construct a large-scale dataset containing 2,420 font classes, which easily exceeds the scale of most image categorization datasets in computer vision. As font recognition is inherently dynamic and open-ended, i.e., new classes and data for existing categories are constantly added to the database over time, we propose a scalable solution based on the nearest class mean classifier (NCM). The core algorithm is built on local feature embedding, local feature metric learning and max-margin template selection, which is naturally amenable to NCM and thus to such open-ended classification problems. The new algorithm can generalize to new classes and new data at little added cost. Extensive experiments demonstrate that our approach is very effective on our synthetic test images, and achieves promising results on real world test images.
  • Predicting User Annoyance Using Visual Attributes Authors: Gordon Christie, Amar Parkash, Ujwal Krothapalli, Devi Parikh
    Computer Vision algorithms make mistakes. In human-centric applications, some mistakes are more annoying to users than others. In order to design algorithms that minimize the annoyance to users, we need access to an annoyance or cost matrix that holds the annoyance of each type of mistake. Such matrices are not readily available, especially for a wide gamut of human-centric applications where annoyance is tied closely to human perception. To avoid having to conduct extensive user studies to gather the annoyance matrix for all possible mistakes, we propose predicting the annoyance of previously unseen mistakes by learning from example mistakes and their corresponding annoyance. We promote the use of attribute-based representations to transfer this knowledge of annoyance. Our experimental results with faces and scenes demonstrate that our approach can predict annoyance more accurately than baselines. We show that as a result, our approach makes less annoying mistakes in a real-world image retrieval application.
  • Transformation Pursuit for Image Classification Authors: Mattis Paulin, J
    A simple approach to learning invariances in image classification consists in augmenting the training set with transformed versions of the original images. However, given a large set of possible transformations, selecting a compact subset is challenging. Indeed, not all transformations are equally informative, and adding uninformative transformations increases training time with no gain in accuracy. We propose a principled algorithm, Image Transformation Pursuit (ITP), for the automatic selection of a compact set of transformations. ITP works in a greedy fashion, by selecting at each iteration the one that yields the highest accuracy gain. ITP also allows efficient exploration of complex transformations that combine basic transformations. We report results on two public benchmarks: the CUB dataset of bird images and the ImageNet 2010 challenge. Using Fisher Vector representations, we achieve an improvement from 28.2% to 45.2% in top-1 accuracy on CUB, and an improvement from 70.1% to 74.9% in top-5 accuracy on ImageNet. We also show significant improvements for deep convnet features: from 47.3% to 55.4% on CUB and from 77.9% to 81.4% on ImageNet.
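    The greedy selection described in this abstract can be illustrated with a generic forward-selection loop. This is only a minimal sketch, not the authors' implementation: `greedy_select`, `toy_score` and the `TOY_GAIN` numbers are hypothetical names and made-up values standing in for "train with this set of transformations and measure validation accuracy".

```python
from typing import Callable, FrozenSet, List, Sequence


def greedy_select(
    candidates: Sequence[str],
    score_fn: Callable[[FrozenSet[str]], float],
    max_size: int,
) -> List[str]:
    """Greedy forward selection: each round adds the candidate with the best score."""
    selected: List[str] = []
    best_score = score_fn(frozenset())
    while len(selected) < max_size:
        remaining = [c for c in candidates if c not in selected]
        if not remaining:
            break
        # Score every one-step extension of the current selection.
        scored = [(score_fn(frozenset(selected + [c])), c) for c in remaining]
        top_score, top_candidate = max(scored)
        if top_score <= best_score:
            break  # no remaining candidate improves the score
        selected.append(top_candidate)
        best_score = top_score
    return selected


# Hypothetical per-transformation gains (illustrative numbers, not from the paper).
TOY_GAIN = {"flip": 0.05, "crop": 0.03, "rotate": 0.01, "blur": 0.0}


def toy_score(subset: FrozenSet[str]) -> float:
    """Stand-in for validation accuracy: a base value plus additive gains."""
    return 0.5 + sum(TOY_GAIN[t] for t in subset)


chosen = greedy_select(list(TOY_GAIN), toy_score, max_size=3)
print(chosen)
```

    In a real pipeline the score function would be far more expensive than this toy, since each call implies training and evaluating a classifier, which is why a compact selected set matters.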
  • Incremental Learning of NCM Forests for Large-Scale Image Classification Authors: Marko Ristin, Matthieu Guillaumin, Juergen Gall, Luc Van Gool
    In recent years, large image data sets such as "ImageNet", "TinyImages" or ever-growing social networks like "Flickr" have emerged, posing new challenges to image classification that were not apparent in smaller image sets. In particular, the efficient handling of dynamically growing data sets, where not only the number of training images but also the number of classes increases over time, is a relatively unexplored problem. To remedy this, we introduce Nearest Class Mean Forests (NCMF), a variant of Random Forests where the decision nodes are based on nearest class mean (NCM) classification. NCMFs not only outperform conventional random forests, but are also well suited for integrating new classes. To this end, we propose and compare several approaches to incorporate data from new classes, so as to seamlessly extend the previously trained forest instead of re-training it from scratch. In our experiments, we show that NCMFs trained on small data sets with 10 classes can be extended to large data sets with 1000 classes without significant loss of accuracy compared to training from scratch on the full data.
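    To make the nearest class mean (NCM) idea concrete, here is a minimal sketch of a plain NCM classifier with running class means. It is not the NCM Forest from the paper (there are no trees or decision nodes here), and the class name `NearestClassMean`, its methods, and the toy data are all assumptions made for illustration. It shows why adding a class is cheap: only that class's mean has to be stored.

```python
import math
from typing import Dict, List, Sequence


class NearestClassMean:
    """Plain NCM classifier: one running mean per class, Euclidean distance."""

    def __init__(self) -> None:
        self._sums: Dict[str, List[float]] = {}
        self._counts: Dict[str, int] = {}

    def add_samples(self, label: str, samples: Sequence[Sequence[float]]) -> None:
        """Update (or create) the running sum and count for `label`."""
        for x in samples:
            if label not in self._sums:
                self._sums[label] = [0.0] * len(x)
                self._counts[label] = 0
            self._sums[label] = [s + v for s, v in zip(self._sums[label], x)]
            self._counts[label] += 1

    def mean(self, label: str) -> List[float]:
        n = self._counts[label]
        return [s / n for s in self._sums[label]]

    def predict(self, x: Sequence[float]) -> str:
        """Return the label whose class mean is closest to `x`."""
        return min(self._sums, key=lambda label: math.dist(x, self.mean(label)))


clf = NearestClassMean()
clf.add_samples("cat", [[0.0, 0.0], [0.0, 2.0]])   # mean (0, 1)
clf.add_samples("dog", [[4.0, 4.0], [6.0, 4.0]])   # mean (5, 4)
before = clf.predict([1.0, 1.0])

# A new class arrives later; existing class means are left untouched.
clf.add_samples("bird", [[10.0, 0.0], [10.0, 2.0]])  # mean (10, 1)
after_old = clf.predict([1.0, 1.0])
after_new = clf.predict([9.0, 1.0])
print(before, after_old, after_new)
```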
  • Discriminative Ferns Ensemble for Hand Pose Recognition Authors: Eyal Krupka, Alon Vinnikov, Ben Klein, Aharon Bar Hillel, Daniel Freedman, Simon Stachniak
    We present the Discriminative Ferns Ensemble (DFE) classifier for efficient visual object recognition. The classifier architecture is designed to optimize both classification speed and accuracy when a large training set is available. Speed is obtained using simple binary features and direct indexing into a set of tables, and accuracy by using a large capacity model and careful discriminative optimization. The proposed framework is applied to the problem of hand pose recognition in depth and infra-red images, using a very large training set. Both the accuracy and the classification time obtained are considerably superior to relevant competing methods, allowing one to reach accuracy targets with run times orders of magnitude faster than the competition. We show empirically that using DFE, we can significantly reduce classification time by increasing training sample size for a fixed target accuracy. Finally a DFE result is shown for the MNIST dataset, showing the method's merit extends beyond depth images.
  • Are Cars Just 3D Boxes? Jointly Estimating the 3D Shape of Multiple Objects Authors: Muhammad Zeeshan Zia, Michael Stark, Konrad Schindler
    Current systems for scene understanding typically represent objects as 2D or 3D bounding boxes. While these representations have proven robust in a variety of applications, they provide only coarse approximations to the true 2D and 3D extent of objects. As a result, object-object interactions, such as occlusions or ground-plane contact, can be represented only superficially. In this paper, we approach the problem of scene understanding from the perspective of 3D shape modeling, and design a 3D scene representation that reasons jointly about the 3D shape of multiple objects. This representation allows us to express 3D geometry and occlusion at the fine detail level of individual vertices of 3D wireframe models, and makes it possible to treat dependencies between objects, such as occlusion reasoning, in a deterministic way. In our experiments, we demonstrate the benefit of jointly estimating the 3D shape of multiple objects in a scene over working with coarse boxes, on the recently proposed KITTI dataset of realistic street scenes.
  • 2D Human Pose Estimation: New Benchmark and State of the Art Analysis Authors: Mykhaylo Andriluka, Leonid Pishchulin, Peter Gehler, Bernt Schiele
    Human pose estimation has made significant progress in recent years. However, current datasets are limited in their coverage of the overall pose estimation challenges. Still, these serve as the common sources to evaluate, train and compare different models on. In this paper we introduce a novel benchmark "MPII Human Pose" that makes a significant advance in terms of diversity and difficulty, a contribution that we feel is required for future developments in human body models. This comprehensive dataset was collected using an established taxonomy of over 800 human activities. The collected images cover a wider variety of human activities than previous datasets, including various recreational, occupational and householding activities, and capture people from a wider range of viewpoints. We provide a rich set of labels including positions of body joints, full 3D torso and head orientation, occlusion labels for joints and body parts, and activity labels. For each image we provide adjacent video frames to facilitate the use of motion information. Given these rich annotations, we perform a detailed analysis of leading human pose estimation approaches and gain insights into the successes and failures of these methods.
  • Orientational Pyramid Matching for Recognizing Indoor Scenes Authors: Lingxi Xie, Jingdong Wang, Baining Guo, Bo Zhang, Qi Tian
    Scene recognition is a basic task towards image understanding. Spatial Pyramid Matching (SPM) has been shown to be an efficient solution for spatial context modeling. In this paper, we introduce an alternative approach, Orientational Pyramid Matching (OPM), for orientational context modeling. Our approach is motivated by the observation that the 3D orientations of objects are a crucial factor in discriminating indoor scenes. The novelty lies in that OPM uses the 3D orientations to form the pyramid and produce the pooling regions, unlike SPM, which uses the spatial positions to form the pyramid. Experimental results on challenging scene classification tasks show that OPM achieves performance comparable to SPM, and that OPM and SPM make complementary contributions so that their combination gives state-of-the-art performance.
  • Time-Mapping Using Space-Time Saliency Authors: Feng Zhou, Sing Bing Kang, Michael F. Cohen
    We describe a new approach for generating regular-speed, low-frame-rate (LFR) video from a high-frame-rate (HFR) input while preserving the important moments in the original. We call this time-mapping, a time-based analogy to high dynamic range to low dynamic range spatial tone-mapping. Our approach makes these contributions: (1) a robust space-time saliency method for evaluating visual importance, (2) a re-timing technique to temporally resample based on frame importance, and (3) temporal filters to enhance the rendering of salient motion. Results of our space-time saliency method on a benchmark dataset show it is state-of-the-art. In addition, the benefits of our approach to HFR-to-LFR time-mapping over more direct methods are demonstrated in a user study.
  • Fast Edge-Preserving PatchMatch for Large Displacement Optical Flow Authors: Linchao Bao, Qingxiong Yang, Hailin Jin
    We present a fast optical flow algorithm that can handle large displacement motions. Our algorithm is inspired by recent successes of local methods in visual correspondence searching as well as approximate nearest neighbor field algorithms. The main novelty is a fast randomized edge-preserving approximate nearest neighbor field algorithm which propagates self-similarity patterns in addition to offsets. Experimental results on public optical flow benchmarks show that our method is significantly faster than state-of-the-art methods without compromising on quality, especially when scenes contain large motions.
  • Scalable 3D Tracking of Multiple Interacting Objects Authors: Nikolaos Kyriazis, Antonis Argyros
    We consider the problem of tracking multiple interacting objects in 3D, using RGBD input and by considering a hypothesize-and-test approach. Due to their interaction, objects to be tracked are expected to occlude each other in the field of view of the camera observing them. A naive approach would be to employ a Set of Independent Trackers (SIT) and to assign one tracker to each object. This approach scales well with the number of objects but fails as occlusions become stronger due to their disjoint consideration. The solution representing the current state of the art employs a single Joint Tracker (JT) that accounts for all objects simultaneously. This directly resolves ambiguities due to occlusions but has a computational complexity that grows geometrically with the number of tracked objects. We propose a middle ground, namely an Ensemble of Collaborative Trackers (ECT), that combines best traits from both worlds to deliver a practical and accurate solution to the multi-object 3D tracking problem. We present quantitative and qualitative experiments with several synthetic and real world sequences of diverse complexity. Experiments demonstrate that ECT manages to track far more complex scenes than JT at a computational time that is only slightly larger than that of SIT.
  • Evolutionary Quasi-random Search for Hand Articulations Tracking Authors: Iason Oikonomidis, Manolis I.A. Lourakis, Antonis A. Argyros
    We present a new method for tracking the 3D position, global orientation and full articulation of human hands. Following recent advances in model-based, hypothesize-and-test methods, the high-dimensional parameter space of hand configurations is explored with a novel evolutionary optimization technique specifically tailored to the problem. The proposed method capitalizes on the fact that samples from quasi-random sequences such as the Sobol sequence have low discrepancy and exhibit a more uniform coverage of the sampled space compared to random samples obtained from the uniform distribution. The method has been tested for the problems of tracking the articulation of a single hand (27D parameter space) and two hands (54D space). Extensive experiments have been carried out with synthetic and real data, in comparison with state-of-the-art methods. The quantitative evaluation shows that for cases of limited computational resources, the new approach achieves a speed-up of four (single hand tracking) and eight (two hands tracking) without compromising tracking accuracy. Interestingly, the proposed method is preferable to the state of the art either in the case of limited computational resources or in the case of more complex (i.e., higher dimensional) problems, thus improving the applicability of the method in a number of application domains.
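    The coverage property of quasi-random sequences mentioned in this abstract can be seen in a small one-dimensional experiment. The sketch below does not use Sobol and is unrelated to the authors' tracker: it swaps in the simpler van der Corput sequence (base 2), another low-discrepancy sequence, and compares how many of 16 equal bins are hit by 16 quasi-random points versus 16 pseudo-random points. The function names and the bin-counting measure are assumptions chosen for illustration.

```python
import random
from typing import List


def van_der_corput(index: int, base: int = 2) -> float:
    """Radical inverse of `index`: mirror its base-`base` digits about the radix point."""
    value, denom = 0.0, 1.0
    while index > 0:
        index, digit = divmod(index, base)
        denom *= base
        value += digit / denom
    return value


def occupied_bins(points: List[float], n_bins: int) -> int:
    """Count how many of `n_bins` equal-width bins on [0, 1) contain a point."""
    return len({min(int(p * n_bins), n_bins - 1) for p in points})


N = 16
quasi = [van_der_corput(i) for i in range(N)]

rng = random.Random(0)  # fixed seed so the run is repeatable
pseudo = [rng.random() for _ in range(N)]

quasi_bins = occupied_bins(quasi, N)
pseudo_bins = occupied_bins(pseudo, N)
print(f"quasi-random bins hit:  {quasi_bins} of {N}")
print(f"pseudo-random bins hit: {pseudo_bins} of {N}")
```

    The first 16 base-2 van der Corput points are exactly the multiples of 1/16, so every bin is hit once, whereas independent uniform samples usually leave some bins empty and double up in others.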