TechTalks from event: CVPR 2014 Video Spotlights

Posters 2A: Motion & Tracking, Optimization, Statistical Methods & Learning, Stereo & SFM

  • Multi-Cue Visual Tracking Using Robust Feature-Level Fusion Based on Joint Sparse Representation Authors: Xiangyuan Lan, Andy J. Ma, Pong C. Yuen
    The use of multiple features for tracking has proved to be an effective approach because the limitations of each feature can be compensated by the others. Since different types of variations such as illumination, occlusion and pose changes may occur in a video sequence, especially in long sequences, how to dynamically select the appropriate features is one of the key problems in this approach. To address this issue in multi-cue visual tracking, this paper proposes a new joint sparse representation model for robust feature-level fusion. The proposed method dynamically removes unreliable features from the fusion by exploiting the advantages of sparse representation. As a result, robust tracking performance is obtained. Experimental results on publicly available videos show that the proposed method outperforms both existing sparse-representation-based and fusion-based trackers.
  • Robust Online Multi-Object Tracking based on Tracklet Confidence and Online Discriminative Appearance Learning Authors: Seung-Hwan Bae, Kuk-Jin Yoon
    Online multi-object tracking aims at producing complete tracks of multiple objects using the information accumulated up to the present moment. It remains a difficult problem in complex scenes because of frequent occlusion by clutter or other objects, similar appearances of different objects, and other factors. In this paper, we propose a robust online multi-object tracking method that handles these difficulties effectively. We first define a tracklet confidence measure using the detectability and continuity of a tracklet, and formulate the multi-object tracking problem based on this confidence. The problem is then solved by associating tracklets in different ways according to their confidence values. Based on this strategy, tracklets grow sequentially with online-provided detections, and fragmented tracklets are linked with others without iterative and expensive association steps. For reliable association between tracklets and detections, we also propose a novel online learning method using incremental linear discriminant analysis to discriminate the appearances of objects. By exploiting the proposed learning method, tracklet association can succeed even under severe occlusion. Experiments with challenging public datasets show distinct performance improvement over other batch and online tracking methods.
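    To make the appearance-learning step concrete, here is a minimal sketch of the batch counterpart of incremental LDA, assuming precomputed appearance feature vectors per detection; the function names and the exponential affinity are our illustrative choices, not the paper's.
```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def appearance_affinity(patches, object_ids):
    """Discriminate object appearances with LDA (batch form).

    `patches` holds one appearance feature vector per detection and
    `object_ids` the identity labels. Distances in the LDA-projected
    space separate different objects while keeping each object compact,
    so they can serve as association affinities. The paper updates the
    LDA incrementally as detections stream in; this sketch refits from
    scratch for clarity.
    """
    lda = LinearDiscriminantAnalysis().fit(patches, object_ids)

    def affinity(a, b):
        # similarity of two appearance vectors in the projected space
        za, zb = lda.transform([a])[0], lda.transform([b])[0]
        return float(np.exp(-np.linalg.norm(za - zb)))

    return affinity
```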
  • Pyramid-based Visual Tracking Using Sparsity Represented Mean Transform Authors: Zhe Zhang, Kin Hong Wong
    In this paper, we propose a robust method for visual tracking that relies on mean shift, sparse coding and spatial pyramids. First, we extend the original mean shift approach to handle orientation space and scale space, and name this new method the mean transform. The mean transform estimates the motion, including location, orientation and scale, of the object window of interest simultaneously and effectively. Second, a pixel-wise dense patch sampling technique and a region-wise trivial template design scheme are introduced, which enable our approach to run accurately and efficiently. In addition, instead of using either a holistic or a local representation only, we apply spatial pyramids that combine the two representations to deal robustly with partial occlusion. Experimental results show that our approach outperforms state-of-the-art methods on many benchmark sequences.
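    For orientation, below is a minimal sketch of the classical mean-shift location update that the mean transform generalizes; it operates on a hypothetical per-pixel weight map (e.g. from histogram back-projection) and omits the paper's orientation and scale estimation.
```python
import numpy as np

def mean_shift_location(weights, start, win=15, iters=20, eps=1e-3):
    """Standard mean-shift location update over an HxW weight map.

    Repeatedly moves the window center to the weighted centroid of the
    target-likelihood weights inside the window until convergence.
    """
    y, x = np.asarray(start, dtype=float)
    for _ in range(iters):
        y0, y1 = int(max(y - win, 0)), int(min(y + win + 1, weights.shape[0]))
        x0, x1 = int(max(x - win, 0)), int(min(x + win + 1, weights.shape[1]))
        w = weights[y0:y1, x0:x1]
        if w.sum() <= 0:
            break
        ys, xs = np.mgrid[y0:y1, x0:x1]
        ny = (w * ys).sum() / w.sum()  # weighted centroid (row)
        nx = (w * xs).sum() / w.sum()  # weighted centroid (col)
        moved = np.hypot(ny - y, nx - x)
        y, x = ny, nx
        if moved < eps:  # converged
            break
    return y, x
```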
  • Tracklet Association with Online Target-Specific Metric Learning Authors: Bing Wang, Gang Wang, Kap Luk Chan, Li Wang
    This paper introduces online target-specific metric learning into track fragment (tracklet) association by network flow optimization for long-term multi-person tracking. Unlike other network flow formulations, each node in our network represents a tracklet, and each edge represents the likelihood of neighboring tracklets belonging to the same trajectory, as measured by our proposed affinity score. In our method, target-specific similarity metrics are learned, which give rise to the appearance-based models used in tracklet affinity estimation. Trajectory-based tracklets are refined using the learned metrics to account for appearance consistency and to identify reliable tracklets. The metrics are then re-learned using reliable tracklets for computing tracklet affinity scores. Long-term trajectories are then obtained through network flow optimization. Occlusions and missed detections are handled by a trajectory completion step. Our method is effective for long-term tracking even when targets are spatially close or completely occluded by others. We validate our framework on several public datasets and show that it outperforms several state-of-the-art methods.
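    As an illustration of the network flow step, here is a minimal sketch using networkx, assuming a precomputed tracklet affinity matrix (the paper obtains these scores from its learned target-specific metrics); the node layout and integer cost scaling are our choices.
```python
import networkx as nx
import numpy as np

def associate_tracklets(affinity, k):
    """Link tracklets into k trajectories by min-cost network flow.

    `affinity[i][j]` in (0, 1] is the likelihood that tracklet j follows
    tracklet i. Each tracklet becomes an in/out node pair with unit
    capacity; k units of flow from source S to sink T select k chains
    of tracklets with minimum total negative log-likelihood.
    """
    n = len(affinity)
    G = nx.DiGraph()
    G.add_node("S", demand=-k)  # k must be feasible (k <= n)
    G.add_node("T", demand=k)
    for i in range(n):
        G.add_edge("S", ("in", i), capacity=1, weight=0)
        G.add_edge(("in", i), ("out", i), capacity=1, weight=0)
        G.add_edge(("out", i), "T", capacity=1, weight=0)
        for j in range(n):
            if i != j and affinity[i][j] > 0:
                # negative log-likelihood, scaled to int for stability
                cost = int(-1000 * np.log(affinity[i][j]))
                G.add_edge(("out", i), ("in", j), capacity=1, weight=cost)
    flow = nx.min_cost_flow(G)
    return [(i, j) for i in range(n) for j in range(n)
            if i != j and flow[("out", i)].get(("in", j), 0) > 0]
```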
  • Subspace Tracking under Dynamic Dimensionality for Online Background Subtraction Authors: Matthew Berger, Lee M. Seversky
    Long-term modeling of background motion in videos is an important and challenging problem used in numerous applications such as segmentation and event recognition. A major challenge in modeling the background from point trajectories lies in dealing with the variable length duration of trajectories, which can be due to such factors as trajectories entering and leaving the frame or occlusion from different depth layers. This work proposes an online method for background modeling of dynamic point trajectories via tracking of a linear subspace describing the background motion. To cope with variability in trajectory durations, we cast subspace tracking as an instance of subspace estimation under missing data, using a least-absolute deviations formulation to robustly estimate the background in the presence of arbitrary foreground motion. Relative to previous works, our approach is very fast and scales to arbitrarily long videos as our method processes new frames sequentially as they arrive.
  • Multiple Target Tracking Based on Undirected Hierarchical Relation Hypergraph Authors: Longyin Wen, Wenbo Li, Junjie Yan, Zhen Lei, Dong Yi, Stan Z. Li
    Multi-target tracking is an interesting but challenging task in the field of computer vision. Most previous data association based methods merely consider the relationships (e.g. appearance and motion pattern similarities) between detections in a limited local temporal window, which makes it difficult to handle long-term occlusion and to distinguish spatially close targets with similar appearance in crowded scenes. In this paper, a novel data association approach based on an undirected hierarchical relation hypergraph is proposed, which formulates the tracking task as a hierarchical dense neighborhood searching problem on a dynamically constructed undirected affinity graph. The relationships between different detections across the spatio-temporal domain are considered in a high-order way, which makes the tracker robust to spatially close targets with similar appearance. Meanwhile, the hierarchical design of the optimization process makes the tracker more robust to long-term occlusion. Extensive experiments on various challenging datasets (e.g. PETS2009 and ParkingLot), including both low- and high-density sequences, demonstrate that the proposed method performs favorably against state-of-the-art methods.
  • A Probabilistic Framework for Multitarget Tracking with Mutual Occlusions Authors: Menglong Yang, Yiguang Liu, Longyin Wen, Zhisheng You, Stan Z. Li
    Mutual occlusions among targets can cause track loss or target position deviation, because the observation likelihood of an occluded target may vanish even when we have an estimate of the target's location. This paper presents a novel probabilistic framework for multitarget tracking with mutual occlusions. The primary contribution of this work is the introduction of a vectorial occlusion variable as part of the solution, which describes the occlusion states of the targets and forms the basis of the proposed framework. Further contributions are: 1) Likelihood: a new observation likelihood model, in which the likelihood of an occluded target is computed by referring to both the occluded and occluding targets. 2) Prior: a Markov random field (MRF) models the occlusion prior so that less likely "circular" or "cascading" types of occlusions receive lower prior probabilities; both the occlusion prior and the motion prior take the state of occlusion into account. 3) Optimization: a real-time RJMCMC-based algorithm with a new move type called "occlusion state update". Experimental results show that the proposed framework handles occlusions well, including long-duration full occlusions, which may cause tracking failures in traditional methods.
  • Occlusion Geodesics for Online Multi-Object Tracking Authors: Horst Possegger, Thomas Mauthner, Peter M. Roth, Horst Bischof
    Robust multi-object tracking-by-detection requires the correct assignment of noisy detection results to object trajectories. We address this problem by proposing an online approach based on the observation that object detectors primarily fail if objects are significantly occluded. In contrast to most existing work, we only rely on geometric information to efficiently overcome detection failures. In particular, we exploit the spatio-temporal evolution of occlusion regions, detector reliability, and target motion prediction to robustly handle missed detections. In combination with a conservative association scheme for visible objects, this allows for real-time tracking of multiple objects from a single static camera, even in complex scenarios. Our evaluations on publicly available multi-object tracking benchmark datasets demonstrate favorable performance compared to the state-of-the-art in online and offline multi-object tracking.
  • Efficient Nonlinear Markov Models for Human Motion Authors: Andreas M. Lehrmann, Peter V. Gehler, Sebastian Nowozin
    Dynamic Bayesian networks such as Hidden Markov Models (HMMs) are successfully used as probabilistic models for human motion. The use of hidden variables makes them expressive models, but inference is only approximate and requires procedures such as particle filters or Markov chain Monte Carlo methods. In this work we propose to instead use simple Markov models that only model observed quantities. We retain a highly expressive dynamic model by using interactions that are nonlinear and non-parametric. A reformulation of our approach in terms of latent variables shows that the cost of computing exact log-likelihoods grows only logarithmically in the number of latent states. We validate our model on human motion capture data and demonstrate state-of-the-art performance on action recognition and motion completion tasks.
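    A fully observed nonparametric Markov model of the kind advocated here can be sketched in a few lines; the Nadaraya-Watson predictor below is our illustrative stand-in for the paper's dynamics model, assuming training pairs of consecutive poses.
```python
import numpy as np

def nw_next_pose(prev_poses, next_poses, x_now, bandwidth=1.0):
    """Nonparametric first-order Markov motion model (Nadaraya-Watson).

    Predicts the next pose as a kernel-weighted average of the training
    transitions (x_t -> x_{t+1}). Only observed quantities are modeled,
    so evaluating the predictive density is exact, with no particle
    filtering or MCMC required.
    """
    d2 = np.sum((prev_poses - x_now) ** 2, axis=1)   # squared distances
    w = np.exp(-0.5 * d2 / bandwidth ** 2)           # Gaussian kernel
    return (w[:, None] * next_poses).sum(axis=0) / w.sum()
```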
  • Scanline Sampler without Detailed Balance: An Efficient MCMC for MRF Optimization Authors: Wonsik Kim, Kyoung Mu Lee
    Markov chain Monte Carlo (MCMC) is an elegant tool, widely used in a variety of areas. In computer vision, it has been used for inference on Markov random field (MRF) models. However, MCMC has received less attention than deterministic approaches, even though in theory it converges to the globally optimal solution. The major obstacle is its slow convergence. To develop a faster sampling method, we investigate two ideas: breaking detailed balance and updating multiple nodes at a time. Although detailed balance is considered an essential element of MCMC, it is in fact not necessary for convergence. In addition, exploiting the structure of the MRF, we introduce a new kernel that updates multiple nodes along a scanline rather than a single node. These two ideas are integrated in a novel way into an efficient method called the scanline sampler without detailed balance. In the experimental section, we apply our method to the OpenGM2 benchmark of MRF optimization and show that it achieves faster convergence than conventional approaches.
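    For contrast, here is a minimal sketch of the conventional single-site Gibbs sampler on a Potts MRF that such work builds on; the paper's scanline kernel and its departure from detailed balance are not reproduced here.
```python
import numpy as np

def gibbs_mrf(unary, n_labels, beta=1.0, sweeps=50, rng=None):
    """Baseline single-site Gibbs sampler for a Potts MRF.

    `unary[y, x, l]` is the data cost of label l at pixel (y, x); the
    Potts pairwise term adds `beta` for each disagreeing 4-neighbor.
    Each pixel is resampled from its exact conditional distribution,
    one node at a time — the slow-mixing baseline the paper improves on.
    """
    rng = rng or np.random.default_rng(0)
    H, W, _ = unary.shape
    labels = rng.integers(0, n_labels, size=(H, W))
    for _ in range(sweeps):
        for y in range(H):
            for x in range(W):
                energy = unary[y, x].astype(float)
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < H and 0 <= nx < W:
                        energy += beta * (np.arange(n_labels) != labels[ny, nx])
                p = np.exp(-(energy - energy.min()))  # Gibbs conditional
                labels[y, x] = rng.choice(n_labels, p=p / p.sum())
    return labels
```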
  • Higher-Order Clique Reduction Without Auxiliary Variables Authors: Hiroshi Ishikawa
    We introduce a method to reduce most higher-order terms of Markov Random Fields with binary labels into lower-order ones without introducing any new variables, while keeping the minimizer of the energy unchanged. While the method does not reduce all terms, it can be combined with existing techniques that transform arbitrary terms (by introducing auxiliary variables) to improve speed. The method eliminates a higher-order term in the polynomial representation of the energy by finding a value assignment to the variables involved that cannot be part of a global minimizer, and increasing the potential only when that particular combination occurs, by exactly the amount that makes the term lower order. We also introduce a faster approximation that foregoes the guarantee of exact equivalence of the minimizer in favor of speed. With experiments on the same Field of Experts dataset used in previous work, we show that the roof-dual algorithm after the reduction labels significantly more variables and the energy converges more rapidly.
  • Learning Fine-grained Image Similarity with Deep Ranking Authors: Jiang Wang, Yang Song, Thomas Leung, Chuck Rosenberg, Jingbin Wang, James Philbin, Bo Chen, Ying Wu
    Learning fine-grained image similarity is a challenging task. It needs to capture between-class and within-class image differences. This paper proposes a deep ranking model that employs deep learning techniques to learn a similarity metric directly from images, giving it higher learning capability than models based on hand-crafted features. A novel multiscale network structure is developed to describe the images effectively. An efficient triplet sampling algorithm is also proposed to learn the model with distributed asynchronous stochastic gradient descent. Extensive experiments show that the proposed algorithm outperforms models based on hand-crafted visual features as well as deep classification models.
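    The ranking objective can be sketched independently of the network architecture: a minimal triplet hinge loss over precomputed embeddings, with the margin value as an assumed hyperparameter.
```python
import numpy as np

def triplet_hinge_loss(f_q, f_pos, f_neg, margin=1.0):
    """Triplet ranking hinge loss on embedding vectors.

    Rows of f_q, f_pos, f_neg are embeddings of a query image, a
    more-similar image and a less-similar image. The loss is zero once
    the positive is closer than the negative by at least `margin`; the
    multiscale network producing the embeddings is not reproduced here.
    """
    d_pos = np.sum((f_q - f_pos) ** 2, axis=1)  # squared distances
    d_neg = np.sum((f_q - f_neg) ** 2, axis=1)
    return np.maximum(0.0, margin + d_pos - d_neg).mean()
```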
  • Instance-weighted Transfer Learning of Active Appearance Models Authors: Daniel Haase, Erik Rodner, Joachim Denzler
    There has been a lot of work on face modeling, analysis, and landmark detection, with Active Appearance Models being one of the most successful techniques. A major drawback of these models is the large number of detailed annotated training examples needed for learning. Therefore, we present a transfer learning method that is able to learn from related training data using an instance-weighted transfer technique. Our method is derived using a generalization of importance sampling and in contrast to previous work we explicitly try to tackle the transfer already during learning instead of adapting the fitting process. In our studied application of face landmark detection, we efficiently transfer facial expressions from other human individuals and are thus able to learn a precise face Active Appearance Model only from neutral faces of a single individual. Our approach is evaluated on two common face datasets and outperforms previous transfer methods.
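    The importance-weighting idea behind instance-weighted transfer can be sketched with a generic classifier standing in for the Active Appearance Model; the weights are assumed given, whereas the paper derives them from a generalization of importance sampling.
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_instance_weighted(X_src, y_src, X_tgt, y_tgt, src_weights):
    """Instance-weighted transfer in its simplest importance-weighting form.

    Source-domain examples enter training scaled by `src_weights` (their
    estimated relevance to the target domain) while target examples keep
    weight one, so related but mismatched data still helps learning.
    """
    X = np.vstack([X_src, X_tgt])
    y = np.concatenate([y_src, y_tgt])
    w = np.concatenate([src_weights, np.ones(len(y_tgt))])
    return LogisticRegression(max_iter=1000).fit(X, y, sample_weight=w)
```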
  • Scalable Multitask Representation Learning for Scene Classification Authors: Maksim Lapin, Bernt Schiele, Matthias Hein
    The underlying idea of multitask learning is that learning tasks jointly is better than learning each task individually. In particular, if only a few training examples are available for each task, sharing a jointly trained representation improves classification performance. In this paper, we propose a novel multitask learning method that learns a low-dimensional representation jointly with the corresponding classifiers, which are then able to profit from the latent inter-class correlations. Our method scales with respect to the original feature dimension and can be used with high-dimensional image descriptors such as the Fisher Vector. Furthermore, it consistently outperforms the current state of the art on the SUN397 scene classification benchmark with varying amounts of training data.
  • A Fast and Robust Algorithm to Count Topologically Persistent Holes in Noisy Clouds Authors: Vitaliy Kurlin
    Preprocessing a 2D image often produces a noisy cloud of interest points. We study the problem of counting holes in noisy clouds in the plane. The holes in a given cloud are quantified by the topological persistence of their boundary contours when the cloud is analyzed at all possible scales. We design an algorithm that counts the holes that are most persistent in the filtration of offsets (neighborhoods) around the given points. The input is a cloud of n points in the plane, without any user-defined parameters. The algorithm runs in near-linear time and uses linear space O(n). The output is an array of pairs (number of holes, relative persistence in the filtration). We prove theoretical guarantees on when the algorithm finds the correct number of holes (components of the complement) of an unknown shape approximated by the cloud.
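    To illustrate what "counting persistent holes" means, here is a sketch using the third-party ripser library as a generic persistent-homology backend; the paper's own near-linear algorithm and its automatic, parameter-free thresholding are not reproduced.
```python
import numpy as np
from ripser import ripser  # generic persistent-homology library

def count_persistent_holes(points, min_persistence):
    """Count 1D holes whose topological persistence exceeds a threshold.

    Computes the Vietoris-Rips persistence diagram of the cloud and
    keeps the H1 (loop) features whose lifetime death - birth is above
    `min_persistence`, i.e. the holes that survive over a wide range of
    scales rather than noise artifacts.
    """
    dgm_h1 = ripser(np.asarray(points), maxdim=1)["dgms"][1]
    persistence = dgm_h1[:, 1] - dgm_h1[:, 0]
    return int(np.sum(persistence > min_persistence))
```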
  • The Photometry of Intrinsic Images Authors: Marc Serra, Olivier Penacchio, Robert Benavente, Maria Vanrell, Dimitris Samaras
    Intrinsic characterization of scenes is often the best way to overcome the illumination variability artifacts that complicate most computer vision problems, from 3D reconstruction to object or material recognition. This paper examines the deficiency of existing intrinsic image models in accurately accounting for the effects of illuminant color and sensor characteristics in the estimation of intrinsic images, and presents a generic framework that incorporates insights from color constancy research into the intrinsic image decomposition problem. The proposed mathematical formulation includes information about the color of the illuminant and the effects of the camera sensors, both of which modify the observed color of the reflectance of the objects in the scene during the acquisition process. By modeling these effects, we get a "truly intrinsic" reflectance image, which we call absolute reflectance, which is invariant to changes of illuminant or camera sensors. This model allows us to represent a wide range of intrinsic image decompositions depending on the specific assumptions on the geometric properties of the scene configuration and the spectral properties of the light source and the acquisition system, thus unifying previous models in a single general framework. We demonstrate that even partial information about sensors significantly improves the estimated reflectance images, making our method applicable to a wide range of sensors. We validate our general intrinsic image framework experimentally with both synthetic data and natural images.
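    A toy version of the illuminant-aware formation model can be written as a per-pixel diagonal (von Kries) transform; the paper's full formulation additionally models camera sensor sensitivities, which this sketch omits.
```python
import numpy as np

def observed_color(reflectance, shading, illuminant_rgb):
    """Toy diagonal (von Kries) image formation: obs = s * diag(e) * r.

    `reflectance` is HxWx3, `shading` HxW, `illuminant_rgb` a 3-vector.
    """
    return shading[..., None] * illuminant_rgb[None, None, :] * reflectance

def absolute_reflectance(observed, shading, illuminant_rgb):
    """Invert the toy model: dividing out shading and illuminant color
    yields a reflectance estimate that no longer depends on the light."""
    return observed / (shading[..., None] * illuminant_rgb[None, None, :])
```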
  • PatchMatch Based Joint View Selection and Depthmap Estimation Authors: Enliang Zheng, Enrique Dunn, Vladimir Jojic, Jan-Michael Frahm
    We propose a multi-view depthmap estimation approach aimed at adaptively ascertaining the pixel-level data associations between a reference image and all the elements of a source image set. Namely, we address the question: what aggregation subset of the source image set should we use to estimate the depth of a particular pixel in the reference image? We pose the problem within a probabilistic framework that jointly models pixel-level view selection and depthmap estimation given the local pairwise image photoconsistency. The corresponding graphical model is solved by EM-based view selection probability inference and PatchMatch-like depth sampling and propagation. Experimental results on standard multi-view benchmarks convey the state-of-the-art estimation accuracy afforded by mitigating spurious pixel-level data associations. Additionally, experiments on large Internet crowd-sourced data demonstrate the robustness of our approach against unstructured and heterogeneous image capture characteristics. Moreover, the linear computational and storage requirements of our formulation, as well as its inherent parallelism, enable an efficient and scalable GPU-based implementation.
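    The depth sampling and propagation component can be sketched for a single image pair, assuming a caller-supplied photoconsistency function; the paper's EM-based per-pixel view selection over many source images is left out.
```python
import numpy as np

def patchmatch_depth(photo_cost, H, W, d_range, iters=3, rng=None):
    """PatchMatch-style depth sampling and propagation (single pair).

    `photo_cost(y, x, d)` scores a depth hypothesis by photoconsistency.
    Alternating forward/backward sweeps propagate good hypotheses from
    already-visited neighbors and test a random resample at each pixel.
    """
    rng = rng or np.random.default_rng(0)
    lo, hi = d_range
    depth = rng.uniform(lo, hi, size=(H, W))  # random initialization
    for it in range(iters):
        step = 1 if it % 2 == 0 else -1
        ys = range(H) if step == 1 else range(H - 1, -1, -1)
        for y in ys:
            xs = range(W) if step == 1 else range(W - 1, -1, -1)
            for x in xs:
                candidates = [depth[y, x], rng.uniform(lo, hi)]
                if 0 <= y - step < H:
                    candidates.append(depth[y - step, x])  # propagate down/up
                if 0 <= x - step < W:
                    candidates.append(depth[y, x - step])  # propagate across
                costs = [photo_cost(y, x, d) for d in candidates]
                depth[y, x] = candidates[int(np.argmin(costs))]
    return depth
```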
  • Light Field Stereo Matching Using Bilateral Statistics of Surface Cameras Authors: Can Chen, Haiting Lin, Zhan Yu, Sing Bing Kang, Jingyi Yu
    In this paper, we introduce a bilateral consistency metric on the surface camera (SCam) for light field stereo matching to handle significant occlusions. The concept of the SCam is used to model the angular radiance distribution with respect to a 3D point. Our bilateral consistency metric indicates the probability of occlusions by analyzing the SCams. We further show how to distinguish between on-surface and free space, textured and non-textured regions, and Lambertian and specular surfaces through bilateral SCam analysis. To speed up the matching process, we apply the edge-preserving guided filter on the consistency-disparity curves. Experimental results show that our technique outperforms both state-of-the-art and recent light field stereo matching methods, especially near occlusion boundaries.
  • Complex Non-Rigid Motion 3D Reconstruction by Union of Subspaces Authors: Yingying Zhu, Dong Huang, Fernando De La Torre, Simon Lucey
    The task of estimating complex non-rigid 3D motion through a monocular camera is of increasing interest to the wider scientific community. Assuming one has the 2D point tracks of the non-rigid object in question, the vision community refers to this problem as Non-Rigid Structure from Motion (NRSfM). In this paper we make two contributions. First, we demonstrate empirically that the current state-of-the-art approach to NRSfM (i.e. Dai et al. [5]) exhibits poor reconstruction performance on complex motion (i.e. motion composed of a sequence of primitive actions, such as a person walking, sitting and standing). Second, we propose that this limitation can be circumvented by modeling complex motion as a union of subspaces. This does not occur naturally in Dai et al.'s approach, which instead makes a less compact summation-of-subspaces assumption. Experiments on both synthetic and real videos illustrate the benefits of our approach for complex non-rigid motion analysis.
  • Robust Scale Estimation in Real-Time Monocular SFM for Autonomous Driving Authors: Shiyu Song, Manmohan Chandraker
    Scale drift is a crucial challenge for monocular autonomous driving to emulate the performance of stereo. This paper presents a real-time monocular SFM system that corrects for scale drift using a novel cue combination framework for ground plane estimation, yielding accuracy comparable to stereo over long driving sequences. Our ground plane estimation uses multiple cues like sparse features, dense inter-frame stereo and (when applicable) object detection. A data-driven mechanism is proposed to learn models from training data that relate observation covariances for each cue to error behavior of its underlying variables. During testing, this allows per-frame adaptation of observation covariances based on relative confidences inferred from visual data. Our framework significantly boosts not only the accuracy of monocular self-localization, but also that of applications like object localization that rely on the ground plane. Experiments on the KITTI dataset demonstrate the accuracy of our ground plane estimation, monocular SFM and object localization relative to ground truth, with detailed comparisons to prior art.
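    The basic ground-plane scale cue can be sketched in isolation: if the camera's true mounting height is known, the estimated plane fixes the metric scale. The cue combination with learned covariances is the paper's contribution and is not shown.
```python
import numpy as np

def scale_from_ground_plane(plane_normal, plane_dist, true_height):
    """Resolve monocular scale from an estimated ground plane.

    If monocular SFM places the ground plane at normal n and distance d
    from the camera (in the drifting SFM scale), while the camera's real
    height above the road is known, the ratio gives the global metric
    scale to apply to translations and structure.
    """
    n = np.asarray(plane_normal, dtype=float)
    d = float(plane_dist) / np.linalg.norm(n)  # normalize plane equation
    return true_height / d
```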
  • On the Quotient Representation for the Essential Manifold Authors: Roberto Tron, Kostas Daniilidis
    The essential matrix, which encodes the epipolar constraint between points in two projective views, is a cornerstone of modern computer vision. Previous works have proposed different characterizations of the space of essential matrices as a Riemannian manifold. However, they either do not consider the symmetric role played by the two views, or do not fully take into account the geometric peculiarities of the epipolar constraint. We address these limitations with a characterization as a quotient manifold which can be easily interpreted in terms of camera poses. While our main focus is on theoretical aspects, we include experiments in pose averaging, and show that the proposed formulation produces a meaningful distance between essential matrices.
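    For reference, the map from a relative camera pose to its essential matrix, around which the quotient representation is built, is a few lines of NumPy:
```python
import numpy as np

def essential_from_pose(R, t):
    """Build the essential matrix E = [t]_x R from a relative pose.

    Several (R, t) pairs, together with sign and scale choices, map to
    the same epipolar geometry — the ambiguity that motivates treating
    the space of essential matrices as a quotient manifold.
    """
    t = np.asarray(t, dtype=float)
    t = t / np.linalg.norm(t)  # scale of t is unobservable
    t_cross = np.array([[0.0, -t[2], t[1]],
                        [t[2], 0.0, -t[0]],
                        [-t[1], t[0], 0.0]])
    return t_cross @ R
```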
  • Cross-Scale Cost Aggregation for Stereo Matching Authors: Kang Zhang, Yuqiang Fang, Dongbo Min, Lifeng Sun, Shiqiang Yang, Shuicheng Yan, Qi Tian
    Human beings process stereoscopic correspondence across multiple scales. However, this bio-inspiration is ignored by state-of-the-art cost aggregation methods for dense stereo correspondence. In this paper, a generic cross-scale cost aggregation framework is proposed to allow multi-scale interaction in cost aggregation. We first reformulate cost aggregation from a unified optimization perspective and show that different cost aggregation methods essentially differ in their choice of similarity kernel. Then, an inter-scale regularizer is introduced into the optimization, and solving this new optimization problem leads to the proposed framework. Since the regularization term is independent of the similarity kernel, various cost aggregation methods can be integrated into the proposed general framework. We show that the cross-scale framework is important as it effectively and efficiently extends state-of-the-art cost aggregation methods and leads to significant improvements when evaluated on the Middlebury, KITTI and New Tsukuba datasets.
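    A simplified sketch of the idea follows, with a box filter standing in for an arbitrary aggregation kernel and a simple coarse-to-fine blend approximating the inter-scale regularizer; the paper instead solves the coupled scales exactly in closed form.
```python
import numpy as np
from scipy.ndimage import uniform_filter

def cross_scale_aggregate(cost_volume, n_scales=3, win=9, reg=0.3):
    """Simplified cross-scale cost aggregation on an HxWxD cost volume.

    Each scale box-filters a spatially downsampled copy of the costs
    (the per-scale aggregation step), then coarser aggregated costs are
    blended into finer ones on the way back up the pyramid.
    """
    vol = np.asarray(cost_volume, dtype=float)
    pyramid = []
    for _ in range(n_scales):
        pyramid.append(uniform_filter(vol, size=(win, win, 1)))
        vol = vol[::2, ::2]                       # next coarser scale
    fused = pyramid[-1]
    for s in range(n_scales - 2, -1, -1):         # coarse to fine
        Hs, Ws, _ = pyramid[s].shape
        up = np.repeat(np.repeat(fused, 2, axis=0), 2, axis=1)[:Hs, :Ws]
        fused = (pyramid[s] + reg * up) / (1.0 + reg)
    return fused                                  # full-resolution costs
```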
  • Fast and Reliable Two-View Translation Estimation Authors: Johan Fredriksson, Olof Enqvist, Fredrik Kahl
    It has long been recognized that one of the fundamental difficulties in the estimation of two-view epipolar geometry is the handling of outliers. In this paper, we develop a fast and tractable algorithm that maximizes the number of inliers under the assumption of a purely translating camera. Compared to classical random sampling methods, our approach is guaranteed to compute the optimal solution of a cost function based on reprojection errors, and it has better time complexity; its performance is in fact independent of the inlier/outlier ratio of the data. This opens the door to a more reliable approach to robust ego-motion estimation. Our basic translation estimator can be embedded into a system that computes the full camera rotation. We demonstrate its applicability in several difficult settings with large numbers of outliers. It turns out to be particularly well suited for small rotations and for rotations around a known axis (which is the case for cellular phones, where the gravitational axis can be measured). Experimental results show that, compared to standard RANSAC methods based on minimal solvers, our algorithm produces more accurate estimates in the presence of large outlier ratios.
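    Scoring a candidate translation against the pure-translation epipolar constraint looks as follows; note this plain algebraic residual check is only for illustration and lacks the paper's optimality guarantee over all translations.
```python
import numpy as np

def translation_inliers(x1, x2, t, thresh=1e-3):
    """Count correspondences consistent with a purely translating camera.

    With no rotation the essential matrix reduces to E = [t]_x, so the
    epipolar constraint for normalized homogeneous points (rows of x1
    and x2) is x2 . (t x x1) = 0. Correspondences whose residual falls
    under `thresh` are counted as inliers of the candidate t.
    """
    residuals = np.einsum("ij,ij->i", x2, np.cross(t, x1))
    return int(np.count_nonzero(np.abs(residuals) < thresh))
```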
  • Learning to Detect Ground Control Points for Improving the Accuracy of Stereo Matching Authors: Aristotle Spyropoulos, Nikos Komodakis, Philippos Mordohai
    While machine learning has been instrumental to the ongoing progress in most areas of computer vision, it has not been applied to the problem of stereo matching with similar frequency or success. We present a supervised learning approach for predicting the correctness of stereo matches based on a random forest and a set of features that capture various forms of information about each pixel. We show highly competitive results in predicting the correctness of matches and in confidence estimation, which allows us to rank pixels according to the reliability of their assigned disparities. Moreover, we show how these confidence values can be used to improve the accuracy of disparity maps by integrating them with an MRF-based stereo algorithm. This is an important distinction from current literature that has mainly focused on sparsification by removing potentially erroneous disparities to generate quasi-dense disparity maps.
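    A minimal sketch of the confidence predictor, with schematic per-pixel features assumed to be computed elsewhere:
```python
from sklearn.ensemble import RandomForestClassifier

def train_match_confidence(features, is_correct):
    """Learn to predict whether a stereo match is correct.

    `features` holds per-pixel confidence cues (e.g. cost-curve shape,
    left-right consistency) and `is_correct` marks whether the assigned
    disparity was right on training data. The predicted probability then
    serves as a per-pixel confidence, of the kind the paper integrates
    with an MRF-based stereo algorithm.
    """
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    forest.fit(features, is_correct)
    # probability of the positive (correct-match) class, in [0, 1]
    return lambda f: forest.predict_proba(f)[:, 1]
```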
  • High Resolution 3D Shape Texture from Multiple Videos Authors: Vagia Tsiminaki, Jean-Sébastien Franco
    We examine the problem of retrieving high-resolution textures of objects observed in multiple videos under small object deformations. In the monocular case, the data redundancy necessary to reconstruct a high-resolution image stems from temporal accumulation. This has been vastly explored and is known as image super-resolution. On the other hand, a handful of methods have considered the texture of a static 3D object observed from several cameras, where the data redundancy is obtained through the different viewpoints. We introduce a unified framework that leverages both possibilities for the estimation of an object's high-resolution texture. This framework uniformly deals with any related geometric variability introduced by the acquisition chain or by the evolution over time. To this end we use 2D warps for all viewpoints and all temporal frames and a linear image formation model from texture to image space. Despite its simplicity, the method successfully handles different views over space and time. As shown experimentally, it demonstrates the value of temporal information in improving texture quality. Additionally, we show that our method outperforms existing state-of-the-art multi-view super-resolution methods for the static case.
  • A Procrustean Markov Process for Non-Rigid Structure Recovery Authors: Minsik Lee, Chong-Ho Choi, Songhwai Oh
    Recovering a non-rigid 3D structure from a series of 2D observations is still a difficult problem to solve accurately. Many constraints have been proposed to facilitate the recovery, and one of the most successful is smoothness, owing to the fact that most real-world objects change continuously. However, many existing methods require the degree of smoothness to be determined beforehand, which is not viable in practical situations. In this paper, we propose a new probabilistic model that incorporates the smoothness constraint without requiring any prior knowledge. Our approach regards the sequence of 3D shapes as a simple stationary Markov process with Procrustes alignment, whose parameters are learned during the fitting process. The Markov process is assumed to be stationary because deformation is in general finite and recurrent, and the 3D shapes are assumed to be Procrustes-aligned in order to discriminate deformation from motion. The proposed method outperforms the state-of-the-art methods, while its computation time remains moderate compared to other existing methods.
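    The Procrustes alignment step that underlies the model, in its classical single-pair form (the stationary Markov process over aligned shapes is the paper's contribution and is not reproduced):
```python
import numpy as np

def procrustes_align(X, Y):
    """Rotation-align shape X (3xN) to reference Y via orthogonal Procrustes.

    Centers both shapes and finds the rotation R minimizing
    ||Yc - R Xc||_F via SVD, factoring rigid motion out so that what
    remains between the aligned shapes is pure deformation.
    """
    Xc = X - X.mean(axis=1, keepdims=True)
    Yc = Y - Y.mean(axis=1, keepdims=True)
    U, _, Vt = np.linalg.svd(Yc @ Xc.T)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # no reflection
    R = U @ D @ Vt
    return R @ Xc  # aligned, centered shape
```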
  • Speeding Up Tracking by Ignoring Features Authors: Lu Zhang, Hamdi Dibeklioğlu, Laurens van der Maaten
    Most modern object trackers combine a motion prior with sliding-window detection, using binary classifiers that predict the presence of the target object based on histogram features. Although the accuracy of such trackers is generally very good, they are often impractical because of their high computational requirements. To resolve this problem, the paper presents a new approach that limits the computational costs of trackers by ignoring features in image regions that, after inspecting a few features, are unlikely to contain the target object. To this end, we derive an upper bound on the probability that a location is most likely to contain the target object, and we ignore (features in) locations for which this upper bound is small. We demonstrate the effectiveness of our new approach in experiments with model-free and model-based trackers that use linear models in combination with HOG features. The results of our experiments demonstrate that our approach allows us to reduce the average number of inspected features by up to 90% without affecting the accuracy of the tracker.
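    A toy version of score evaluation with early termination is sketched below; the crude optimistic bound used here (assuming features in [0, 1]) replaces the paper's probabilistic upper bound.
```python
import numpy as np

def score_with_early_stop(features, weights, n_probe, best_so_far):
    """Evaluate a linear detector but stop once a window cannot win.

    After inspecting the first `n_probe` feature entries, the remaining
    contribution is bounded optimistically by the sum of the positive
    unseen weights (features assumed in [0, 1]); windows whose bound
    falls below the current best score are discarded without reading
    the rest of their features.
    """
    partial = float(features[:n_probe] @ weights[:n_probe])
    optimistic_rest = float(np.sum(np.maximum(weights[n_probe:], 0.0)))
    if partial + optimistic_rest < best_so_far:
        return None  # pruned: cannot beat the best window seen so far
    return partial + float(features[n_probe:] @ weights[n_probe:])
```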