TechTalks from event: CVPR 2014 Oral Talks

Orals 3B : Video: Events, Activities & Surveillance

  • Socially-aware Large-scale Crowd Forecasting Authors: Alexandre Alahi, Vignesh Ramanathan, Li Fei-Fei
    In crowded spaces such as city centers or train stations, human mobility looks complex, but is often influenced only by a few causes. We propose to quantitatively study crowded environments by introducing a dataset of 42 million trajectories collected in train stations. Given this dataset, we address the problem of forecasting pedestrians' destinations, a central problem in understanding large-scale crowd mobility. We need to overcome the challenges posed by a limited number of observations (e.g. sparse cameras) and changes in pedestrian appearance cues across different cameras. In addition, we often have restrictions on the way pedestrians can move in a scene, encoded as priors over origin and destination (OD) preferences. We propose a new descriptor, coined Social Affinity Maps (SAM), to link broken or unobserved trajectories of individuals in the crowd, while using the OD prior in our framework. Our experiments show improvement in performance through the use of SAM features and the OD prior. To the best of our knowledge, our work is one of the first studies to provide encouraging results towards a better understanding of crowd behavior at the scale of millions of pedestrians.
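    A minimal Python sketch of a Social-Affinity-Map-style descriptor: the polar bin layout (two radial rings, eight angular sectors) and the bin radii are illustrative assumptions, not the paper's exact grid.

      import numpy as np

      def social_affinity_map(subject_xy, neighbor_xy, n_sectors=8, radii=(1.0, 3.0)):
          # Histogram the positions of nearby pedestrians relative to the subject.
          # Hypothetical polar layout: len(radii) rings x n_sectors angular bins.
          sam = np.zeros((len(radii), n_sectors))
          for nx, ny in neighbor_xy:
              dx, dy = nx - subject_xy[0], ny - subject_xy[1]
              ring = np.searchsorted(radii, np.hypot(dx, dy))
              if ring >= len(radii):            # neighbor too far away to count
                  continue
              sector = int((np.arctan2(dy, dx) % (2 * np.pi)) / (2 * np.pi) * n_sectors)
              sam[ring, min(sector, n_sectors - 1)] += 1
          return sam.ravel()                    # feature vector used alongside the OD prior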
  • L0 Regularized Stationary Time Estimation for Crowd Group Analysis Authors: Shuai Yi, Xiaogang Wang, Cewu Lu, Jiaya Jia
    We tackle stationary crowd analysis in this paper, which is as important as modeling mobile groups in crowd scenes and finds many applications in surveillance. Our key contribution is a robust algorithm for estimating how long a foreground pixel remains stationary. This is much more challenging than background subtraction alone, because a failure at a single frame due to local object movement, lighting variation, or occlusion can lead to large errors in the estimated stationary time. To achieve reliable results, sparse constraints along the spatial and temporal dimensions are jointly added via mixed partials to shape a 3D stationary time map, formulated as an L0 optimization problem. Beyond background subtraction, the method distinguishes among different foreground objects that are close or overlapping in spatio-temporal space by using a locally shared foreground codebook. The proposed techniques are used to detect four types of stationary group activities and to analyze crowd scene structures. We provide the first public benchmark dataset for stationary time estimation and stationary group analysis.
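    For intuition, a naive per-pixel stationary-time counter in Python; foreground_masks is assumed to be a list of boolean masks from any background subtractor. This is the fragile frame-wise baseline that the paper's joint spatio-temporal L0 optimization is designed to robustify, not the proposed algorithm itself.

      import numpy as np

      def stationary_time(foreground_masks):
          # Count consecutive frames during which each pixel stays foreground.
          # A single missed frame resets the count to zero, which is exactly
          # the failure mode the paper's L0 formulation guards against.
          t = np.zeros(foreground_masks[0].shape, dtype=int)
          maps = []
          for mask in foreground_masks:
              t = np.where(mask, t + 1, 0)      # reset when a pixel leaves the foreground
              maps.append(t.copy())
          return maps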
  • Scene-Independent Group Profiling in Crowd Authors: Jing Shao, Chen Change Loy, Xiaogang Wang
    Groups are the primary entities that make up a crowd. Understanding group-level dynamics and properties is thus scientifically important and practically useful in a wide range of applications, especially for crowd understanding. In this study we show that fundamental group-level properties, such as intra-group stability and inter-group conflict, can be systematically quantified by visual descriptors. This is made possible through learning a novel Collective Transition prior, which leads to a robust approach for group segregation in public spaces. From the prior, we further devise a rich set of group property visual descriptors. These descriptors are scene-independent, and can be effectively applied to public scenes with a variety of crowd densities and distributions. Extensive experiments on hundreds of public scene video clips demonstrate that such property descriptors are not only useful but also necessary for group state analysis and crowd scene understanding.
  • Temporal Sequence Modeling for Video Event Detection Authors: Yu Cheng, Quanfu Fan, Sharath Pankanti, Alok Choudhary
    We present a novel approach for event detection in video by temporal sequence modeling. Exploiting temporal information lies at the core of many approaches to video analysis (e.g., action, activity, and event recognition). Unlike previous work, which performs temporal modeling at the semantic event level, we propose to model temporal dependencies in the data at the sub-event level without using event annotations. This frees our model from event-level ground truth and addresses several limitations of previous work on temporal modeling. Based on this idea, we represent a video by a sequence of visual words learned from the video, and apply the Sequence Memoizer [21] to capture long-range dependencies in a temporal context in the visual sequence. This data-driven temporal model is further integrated with event classification to jointly perform segmentation and classification of events in a video. We demonstrate the efficacy of our approach on two challenging datasets for visual recognition.
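    A rough sketch of the pipeline's front end in Python, with k-means standing in for the visual-word learning and a Laplace-smoothed bigram model standing in for the Sequence Memoizer (which captures far longer-range dependencies than any fixed-order model).

      import numpy as np
      from sklearn.cluster import KMeans

      def video_to_words(frame_descriptors, n_words=64, seed=0):
          # Quantize per-frame descriptors (T x D array) into a word sequence.
          return KMeans(n_clusters=n_words, random_state=seed).fit(frame_descriptors).labels_

      def bigram_model(word_seq, n_words, alpha=1.0):
          # Smoothed transition matrix: a crude fixed-order stand-in for the
          # Sequence Memoizer's unbounded-context predictions.
          counts = np.full((n_words, n_words), alpha)
          for a, b in zip(word_seq[:-1], word_seq[1:]):
              counts[a, b] += 1
          return counts / counts.sum(axis=1, keepdims=True)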
  • Recognition of Complex Events: Exploiting Temporal Dynamics between Underlying Concepts Authors: Subhabrata Bhattacharya, Mahdi M. Kalayeh, Rahul Sukthankar, Mubarak Shah
    While approaches based on bags of features excel at low-level action classification, they are ill-suited for recognizing complex events in video, where concept-based temporal representations currently dominate. This paper proposes a novel representation that captures the temporal dynamics of windowed mid-level concept detectors in order to improve complex event recognition. We first express each video as an ordered vector time series, where each time step consists of the vector formed from the concatenated confidences of the pre-trained concept detectors. We hypothesize that the dynamics of time series for different instances from the same event class, as captured by simple linear dynamical system (LDS) models, are likely to be similar even if the instances differ in terms of low-level visual features. We propose a two-part representation formed by fusing (1) a singular value decomposition of block Hankel matrices (SSID-S) and (2) a harmonic signature (HS) computed from the corresponding eigen-dynamics matrix. The proposed method offers several benefits over alternate approaches: our approach is straightforward to implement, directly employs existing concept detectors and can be plugged into linear classification frameworks. Results on standard datasets such as NIST's TRECVID Multimedia Event Detection task demonstrate the improved accuracy of the proposed method.
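    The Hankel-matrix half of the representation is concrete enough to sketch in Python; assumed conventions: concept_series is a (T, d) array of detector confidences, and the window length k and number of singular values kept are illustrative choices.

      import numpy as np

      def block_hankel(concept_series, k=3):
          # Each column stacks k successive concept-confidence vectors.
          T, d = concept_series.shape
          cols = [concept_series[t:t + k].ravel() for t in range(T - k + 1)]
          return np.stack(cols, axis=1)         # shape (k*d, T-k+1)

      def ssid_signature(concept_series, k=3, n_sv=5):
          # SSID-S-style descriptor: leading singular values of the block
          # Hankel matrix, which summarize the LDS generating the series.
          H = block_hankel(concept_series, k)
          return np.linalg.svd(H, compute_uv=False)[:n_sv]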
  • Video Event Detection by Inferring Temporal Instance Labels Authors: Kuan-Ting Lai, Felix X. Yu, Ming-Syan Chen, Shih-Fu Chang
    Video event detection allows intelligent indexing of video content based on events. Traditional approaches extract features from video frames or shots, then quantize and pool the features to form a single vector representation for the entire video. Though simple and efficient, the final pooling step may lead to loss of temporally local information, which is important in indicating which part of a long video signifies presence of the event. In this work, we propose a novel instance-based video event detection approach. We represent each video as multiple "instances", defined as video segments of different temporal intervals. The objective is to learn an instance-level event detection model based on only video-level labels. To solve this problem, we propose a large-margin formulation which treats the instance labels as hidden latent variables, and simultaneously infers the instance labels as well as the instance-level classification model. Our framework infers optimal solutions under the assumption that positive videos contain a large number of positive instances while negative videos contain as few as possible. Extensive experiments on large-scale video event datasets demonstrate significant performance gains. The proposed method is also useful for explaining the detection results, by localizing the temporal segments in a video that are responsible for a positive detection.
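    A simplified alternating scheme in the spirit of the paper's latent-variable idea (closer to classic mi-SVM than to the exact large-margin objective): instance labels in positive bags are latent and re-inferred from the current model each round. Bag labels are assumed to be +1/-1.

      import numpy as np
      from sklearn.svm import LinearSVC

      def latent_instance_svm(bags, bag_labels, n_iters=10, C=1.0):
          # bags: list of (n_i, d) arrays of segment features per video.
          X = np.vstack(bags)
          y = np.concatenate([np.full(len(b), l) for b, l in zip(bags, bag_labels)])
          clf = LinearSVC(C=C).fit(X, y)
          for _ in range(n_iters):
              start = 0
              for b, l in zip(bags, bag_labels):
                  if l == 1:                            # re-infer latent labels
                      scores = clf.decision_function(b)
                      y[start:start + len(b)] = np.where(scores > 0, 1, -1)
                      y[start + np.argmax(scores)] = 1  # keep >= 1 positive instance
                  start += len(b)
              clf = LinearSVC(C=C).fit(X, y)
          return clf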

Orals 3C : Medical & Biological Image Analysis

  • Joint Coupled-Feature Representation and Coupled Boosting for AD Diagnosis Authors: Yinghuan Shi, Heung-Il Suk, Yang Gao, Dinggang Shen
    Recently, there has been great interest in computer-aided Alzheimer's Disease (AD) and Mild Cognitive Impairment (MCI) diagnosis. Previous learning-based methods defined the diagnosis process as a classification task and directly used low-level features extracted from neuroimaging data without considering the relations among them. However, from a neuroscience point of view, it is well known that the human brain is a complex system in which multiple brain regions are anatomically connected and functionally interact with each other. Therefore, it is natural to hypothesize that the low-level features extracted from neuroimaging data are related to each other in some way. To this end, in this paper, we first devise a coupled feature representation by utilizing intra-coupled and inter-coupled interaction relationships. Regarding multi-modal data fusion, we propose a novel coupled boosting algorithm that analyzes the pairwise coupled-diversity correlation between modalities. Specifically, we formulate a new weight updating function, which considers both incorrectly and inconsistently classified samples. In our experiments on the ADNI dataset, the proposed method achieved the best performance, with accuracies of 94.7% and 80.1% for AD vs. Normal Control (NC) and MCI vs. NC classification, respectively, outperforming competing and state-of-the-art methods.
  • Deformable Registration of Feature-Endowed Point Sets Based on Tensor Fields Authors: Demian Wassermann, James Ross, George Washko, William M. Wells III, Raul San Jose-Estepar
    The main contribution of this work is a framework to register anatomical structures characterized as a point set where each point has an associated symmetric matrix. These matrices can represent problem-dependent characteristics of the registered structure. For example, in airways, matrices can represent the orientation and thickness of the structure. Our framework relies on a dense tensor field representation which we implement sparsely as a kernel mixture of tensor fields. We equip the space of tensor fields with a norm that serves as a similarity measure. To calculate the optimal transformation between two structures we minimize this measure using an analytical gradient for the similarity measure and the deformation field, which we restrict to be a diffeomorphism. We illustrate the value of our tensor field model by comparing our results with scalar and vector field based models. Finally, we evaluate our registration algorithm on synthetic data sets and validate our approach on manually annotated airway trees.
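    A minimal sketch of the similarity machinery under assumed conventions: each structure is a Gaussian-kernel mixture of tensors attached to points, and the inner product pairs every two components with a spatial weight times a Frobenius product. The bandwidth sigma and the closed-form weight are illustrative, not the paper's exact construction.

      import numpy as np

      def tensor_field_inner(points_a, tensors_a, points_b, tensors_b, sigma=1.0):
          # <sum_i k(x-a_i) A_i, sum_j k(x-b_j) B_j> with Gaussian kernels:
          # each (A_i, B_j) pair contributes a spatial weight times <A_i, B_j>_F.
          total = 0.0
          for a, A in zip(points_a, tensors_a):
              for b, B in zip(points_b, tensors_b):
                  w = np.exp(-np.sum((a - b) ** 2) / (4 * sigma ** 2))
                  total += w * np.sum(A * B)    # Frobenius inner product
          return total

      def tensor_field_distance(pa, ta, pb, tb, sigma=1.0):
          # Norm induced by the inner product, usable as a similarity measure.
          d2 = (tensor_field_inner(pa, ta, pa, ta, sigma)
                - 2 * tensor_field_inner(pa, ta, pb, tb, sigma)
                + tensor_field_inner(pb, tb, pb, tb, sigma))
          return max(d2, 0.0) ** 0.5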
  • Tracking Indistinguishable Translucent Objects over Time using Weakly Supervised Structured Learning Authors: Luca Fiaschi, Ferran Diego, Konstantin Gregor, Martin Schiegg, Ullrich Koethe, Marta Zlatic, Fred A. Hamprecht
    We use weakly supervised structured learning to track and disambiguate the identity of multiple indistinguishable, translucent and deformable objects that can overlap for many frames. For this challenging problem, we propose a novel model which handles occlusions, complex motions and non-rigid deformations by jointly optimizing the flows of multiple latent intensities across frames. These flows are latent variables for which the user cannot directly provide labels. Instead, we leverage a structured learning formulation that uses weak user annotations to find the best hyperparameters of this model. The approach is evaluated on a challenging dataset for the tracking of multiple Drosophila larvae which we make publicly available. Our method tracks multiple larvae in spite of their poor distinguishability and minimizes the number of identity switches during prolonged mutual occlusion.
  • Multiscale Centerline Detection by Learning a Scale-Space Distance Transform Authors: Amos Sironi, Vincent Lepetit, Pascal Fua
    We propose a robust and accurate method to extract the centerlines and scale of tubular structures in 2D images and 3D volumes. Existing techniques rely either on filters designed to respond to ideal cylindrical structures, which lose accuracy when the linear structures become very irregular, or on classification, which is inaccurate because locations on centerlines and locations immediately next to them are extremely difficult to distinguish. We solve this problem by reformulating centerline detection in terms of a regression problem. We first train regressors to return the distances to the closest centerline in scale-space, and we apply them to the input images or volumes. The centerlines and the corresponding scale then correspond to the regressors' local maxima, which can be easily identified. We show that our method outperforms state-of-the-art techniques for various 2D and 3D datasets.
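    A toy version of the regression idea in Python: raw patches and a gradient-boosted regressor stand in for the paper's features and regressor, and exp(-distance) is one monotone transform (an assumption) that puts the maxima on the centerlines so they can be read off with a local-maximum filter.

      import numpy as np
      from scipy.ndimage import maximum_filter
      from sklearn.ensemble import GradientBoostingRegressor

      def train_centerline_regressor(patches, dist_to_centerline):
          # Regress a value that peaks on centerlines instead of classifying
          # centerline vs. near-centerline pixels, which are hard to separate.
          X = patches.reshape(len(patches), -1)
          return GradientBoostingRegressor().fit(X, np.exp(-dist_to_centerline))

      def centerlines_from_response(response, thresh=0.5, size=3):
          # Centerline pixels are local maxima of the regressed response map.
          return (response == maximum_filter(response, size=size)) & (response > thresh)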
  • Multivariate General Linear Models (MGLM) on Riemannian Manifolds with Applications to Statistical Analysis of Diffusion Weighted Images Authors: Hyunwoo J. Kim, Nagesh Adluru, Maxwell D. Collins, Moo K. Chung, Barbara B. Bendlin, Sterling C. Johnson, Richard J. Davidson, Vikas Singh
    Linear regression is a parametric model that is ubiquitous in scientific analysis. The classical setup, where the observations and responses, i.e., (x_i, y_i) pairs, are Euclidean, is well studied. The setting where y_i is manifold-valued is a topic of much interest, motivated by applications in shape analysis, topic modeling, and medical imaging. Recent work gives strategies for max-margin classifiers, principal components analysis, and dictionary learning on certain types of manifolds. For parametric regression specifically, results within the last year provide mechanisms to regress one real-valued parameter, x_i in R, against a manifold-valued variable, y_i in M. We seek to substantially extend the operating range of such methods by deriving schemes for multivariate multiple linear regression: a manifold-valued dependent variable regressed against multiple independent variables, i.e., f : R^n -> M. Our variational algorithm efficiently solves for multiple geodesic bases on the manifold concurrently via gradient updates. This allows us to answer questions such as: what is the relationship of the measurement at voxel y to disease when conditioned on age and gender? We show applications to statistical analysis of diffusion weighted images, which give rise to regression tasks on the manifold GL(n)/O(n) for diffusion tensor images (DTI) and on the Hilbert unit sphere for orientation distribution functions (ODF) from high angular resolution acquisitions. The companion open-source code is available online.
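    To make the geometry concrete, a crude gradient-descent sketch of MGLM on the unit sphere (one of the simplest manifolds the framework covers). The proper parallel transport of residual gradients, which the paper's variational algorithm handles, is deliberately omitted here, so this is a rough approximation under that assumption.

      import numpy as np

      def sphere_exp(p, v):
          # Exponential map at p: walk along tangent vector v on the unit sphere.
          nv = np.linalg.norm(v)
          return p if nv < 1e-12 else np.cos(nv) * p + np.sin(nv) * v / nv

      def sphere_log(p, q):
          # Log map: tangent vector at p pointing toward q.
          u = q - np.dot(p, q) * p
          nu = np.linalg.norm(u)
          return np.zeros_like(p) if nu < 1e-12 else np.arccos(np.clip(np.dot(p, q), -1, 1)) * u / nu

      def mglm_sphere(X, Y, n_steps=200, lr=0.05):
          # Fit base point p and tangent directions V so Exp_p(X @ V) ~ Y,
          # where X is (n, d) Euclidean covariates and Y is (n, m) unit vectors.
          n, d = X.shape
          p = Y.mean(axis=0); p /= np.linalg.norm(p)
          V = np.zeros((d, Y.shape[1]))
          for _ in range(n_steps):
              Gp, GV = np.zeros_like(p), np.zeros_like(V)
              for x, y in zip(X, Y):
                  r = sphere_log(sphere_exp(p, x @ V), y)   # residual tangent vector
                  Gp += r                                   # transport back to p omitted
                  GV += np.outer(x, r)
              Gp -= np.dot(Gp, p) * p                       # project to tangent space at p
              p = sphere_exp(p, lr * Gp / n)
              V = V + lr * GV / n
              V -= np.outer(V @ p, p)                       # keep rows tangent at the new p
          return p, V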
  • Preconditioning for Accelerated Iteratively Reweighted Least Squares in Structured Sparsity Reconstruction Authors: Chen Chen, Junzhou Huang, Lei He, Hongsheng Li
    In this paper, we propose a novel algorithm for structured sparsity reconstruction. This algorithm is based on the iterative reweighted least squares (IRLS) framework, and accelerated by the preconditioned conjugate gradient method. The convergence rate of the proposed algorithm is almost the same as that of traditional IRLS algorithms, that is, exponentially fast. Moreover, with the devised preconditioner, the computational cost for each iteration is significantly lower than that of traditional IRLS algorithms, which makes it feasible for large scale problems. Besides the fast convergence, this algorithm can be flexibly applied to standard sparsity, group sparsity, and overlapping group sparsity problems. Experiments are conducted on a practical application: compressive sensing magnetic resonance imaging. Results demonstrate that the proposed algorithm achieves superior performance over 9 state-of-the-art algorithms in terms of both accuracy and computational cost.
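    A compact sketch of the idea for the standard-sparsity case, with a plain Jacobi (diagonal) preconditioner standing in for the paper's devised one:

      import numpy as np
      from scipy.sparse.linalg import cg, LinearOperator

      def irls_l1(A, b, lam=0.1, n_iters=20, eps=1e-6):
          # IRLS for min ||Ax - b||^2 + lam * ||x||_1: each outer iteration
          # solves the reweighted system (A^T A + diag(w)) x = A^T b with
          # preconditioned conjugate gradients.
          n = A.shape[1]
          x, Atb = np.zeros(n), A.T @ b
          diag_AtA = np.sum(A * A, axis=0)
          for _ in range(n_iters):
              w = lam / (np.abs(x) + eps)               # reweighting of the L1 term
              op = LinearOperator((n, n), matvec=lambda v: A.T @ (A @ v) + w * v)
              M = LinearOperator((n, n), matvec=lambda v: v / (diag_AtA + w))
              x, _ = cg(op, Atb, x0=x, M=M)
          return x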

Orals 3D : Low-Level Vision & Image Processing

  • Segmentation-Free Dynamic Scene Deblurring Authors: Tae Hyun Kim, Kyoung Mu Lee
    Most state-of-the-art dynamic scene deblurring methods, based on accurate motion segmentation, assume that the motion blur is small or that the specific type of motion causing the blur is known. In this paper, unlike these conventional methods, we study a motion-segmentation-free dynamic scene deblurring method. By approximating the motion as linear motion that varies locally (pixel-wise), we can handle various types of blur caused by camera shake, including out-of-plane motion, depth variation, radial distortion, and so on. Accordingly, we propose a new energy model that simultaneously estimates the motion flow and the latent image, based on a robust total variation (TV)-L1 model. This approach is necessary to handle abrupt changes in motion without segmentation. Furthermore, we address a problem of the traditional coarse-to-fine deblurring framework, which gives rise to artifacts when restoring small structures with distinct motion. We thus propose a novel kernel re-initialization method that reduces the error of the motion flow propagated from a coarser level. Moreover, a highly effective convex optimization-based solution mitigating the computational difficulties of the TV-L1 model is established. Comparative experimental results on challenging real blurry images demonstrate the efficiency of the proposed method.
  • Shrinkage Fields for Effective Image Restoration Authors: Uwe Schmidt, Stefan Roth
    Many state-of-the-art image restoration approaches do not scale well to larger images, such as megapixel images common in the consumer segment. Computationally expensive optimization is often the culprit. While efficient alternatives exist, they have not reached the same level of image quality. The goal of this paper is to develop an effective approach to image restoration that offers both computational efficiency and high restoration quality. To that end we propose shrinkage fields, a random field-based architecture that combines the image model and the optimization algorithm in a single unit. The underlying shrinkage operation bears connections to wavelet approaches, but is used here in a random field context. Computational efficiency is achieved by construction through the use of convolution and DFT as the core components; high restoration quality is attained through loss-based training of all model parameters and the use of a cascade architecture. Unlike heavily engineered solutions, our learning approach can be adapted easily to different trade-offs between efficiency and image quality. We demonstrate state-of-the-art restoration results with high levels of computational efficiency, and significant speedup potential through inherent parallelism.
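    A single-filter, single-stage sketch of the mechanism in Python: the quadratic subproblem of each stage has a closed-form DFT solution, which is where the speed comes from. The paper learns the filters and per-stage shrinkage functions; plain soft-shrinkage is used here only as a placeholder.

      import numpy as np
      from numpy.fft import fft2, ifft2

      def shrink(v, theta=0.05):
          # Placeholder pointwise shrinkage; the paper learns these functions.
          return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

      def shrinkage_field_stage(x, y, k, f, lam=0.1):
          # One prediction stage: given current estimate x, observation y,
          # blur kernel k, and one model filter f, solve the stage's quadratic
          # problem in closed form in the Fourier domain.
          K, F = fft2(k, s=x.shape), fft2(f, s=x.shape)
          fx = np.real(ifft2(F * fft2(x)))              # filter response f * x
          num = np.conj(K) * fft2(y) + lam * np.conj(F) * fft2(shrink(fx))
          den = np.abs(K) ** 2 + lam * np.abs(F) ** 2
          return np.real(ifft2(num / den))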
  • Camouflaging an Object from Many Viewpoints Authors: Andrew Owens, Connelly Barnes, Alex Flint, Hanumant Singh, William Freeman
    We address the problem of camouflaging a 3D object from the many viewpoints that one might see it from. Given photographs of an object's surroundings, we produce a surface texture that will make the object difficult for a human to detect. To do this, we introduce several background matching algorithms that attempt to make the object look like whatever is behind it. Of course, it is impossible to exactly match the background from every possible viewpoint. Thus our models are forced to make trade-offs between different perceptual factors, such as the conspicuousness of the occlusion boundaries and the amount of texture distortion. We use experiments with human subjects to evaluate the effectiveness of these models for the task of camouflaging a cube, finding that they significantly outperform naïve strategies.
  • Scale-space Processing Using Polynomial Representations Authors: Gou Koutaki, Keiichi Uchimura
    In this study, we propose the application of principal components analysis (PCA) to scale-spaces. PCA is a standard method used in computer vision. The translation of an input image into scale-space is a continuous operation, which requires the extension of conventional finite matrix-based PCA to an infinite number of dimensions. In this study, we use spectral decomposition to resolve this infinite eigenproblem by integration and we propose an approximate solution based on polynomial equations. To clarify its eigensolutions, we apply spectral decomposition to the Gaussian scale-space and scale-normalized Laplacian of Gaussian (LoG) space. As an application of this proposed method, we introduce a method for generating Gaussian blur images and scale-normalized LoG images, where we demonstrate that the accuracy of these images can be very high when calculating an arbitrary scale using a simple linear combination. We also propose a new Scale Invariant Feature Transform (SIFT) detector as a more practical example.
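    A discrete Python stand-in for the continuous formulation: PCA over a sampled Gaussian scale-space stack, with linear interpolation of the per-scale coefficients where the paper fits polynomials in scale. The sampled sigmas are assumed to be sorted in increasing order.

      import numpy as np
      from scipy.ndimage import gaussian_filter

      def scale_space_pca(image, sigmas, n_components=8):
          # Build the (n_scales, n_pixels) scale-space stack and PCA it.
          stack = np.stack([gaussian_filter(image, s).ravel() for s in sigmas])
          mean = stack.mean(axis=0)
          U, sv, Vt = np.linalg.svd(stack - mean, full_matrices=False)
          coeffs = U[:, :n_components] * sv[:n_components]  # per-scale coefficients
          return mean, Vt[:n_components], coeffs

      def blur_at(sigma, sigmas, mean, basis, coeffs, shape):
          # Blur at an arbitrary scale as a linear combination of eigenimages.
          c = np.array([np.interp(sigma, sigmas, coeffs[:, j])
                        for j in range(coeffs.shape[1])])
          return (mean + c @ basis).reshape(shape)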
  • Single Image Layer Separation using Relative Smoothness Authors: Yu Li, Michael S. Brown
    This paper addresses extracting two layers from an image where one layer is smoother than the other. This problem arises most notably in intrinsic image decomposition and reflection interference removal. Layer decomposition from a single image is inherently ill-posed, and solutions require additional constraints to be enforced. We introduce a novel strategy that regularizes the gradients of the two layers such that one has a long tail distribution and the other a short tail distribution. While imposing the long tail distribution is a common practice, our introduction of the short tail distribution on the second layer is unique. We formulate our problem in a probabilistic framework and describe an optimization scheme to solve this regularization with only a few iterations. We apply our approach to the intrinsic image and reflection removal problems and demonstrate high-quality layer separation on par with other techniques, while being significantly faster than prevailing methods.
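    A toy gradient-descent version of the regularization in Python: a long-tailed (|.| + eps)^p penalty on one layer's gradients and a quadratic, short-tailed penalty on the other's. The paper optimizes an equivalent probabilistic objective far more efficiently (a few iterations); the penalties, exponent p, and step size here are illustrative assumptions.

      import numpy as np

      def D(u, ax):
          return np.roll(u, -1, ax) - u          # forward difference (periodic)

      def DT(v, ax):
          return np.roll(v, 1, ax) - v           # adjoint of the forward difference

      def separate_layers(I, lam=10.0, p=0.5, eps=1e-3, n_steps=500, lr=0.05):
          # Descend on  sum (|grad L1| + eps)^p + lam * sum (grad (I - L1))^2,
          # so L1 ends up with sparse gradients and I - L1 ends up smooth.
          L1 = I.copy()
          for _ in range(n_steps):
              g = np.zeros_like(L1)
              for ax in (0, 1):
                  gl = D(L1, ax)
                  g += DT(p * np.sign(gl) * (np.abs(gl) + eps) ** (p - 1), ax)
                  g -= DT(2.0 * lam * D(I - L1, ax), ax)
              L1 -= lr * g
          return L1, I - L1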
  • Image Fusion with Local Spectral Consistency and Dynamic Gradient Sparsity Authors: Chen Chen, Yeqing Li, Wei Liu, Junzhou Huang
    In this paper, we propose a novel method for image fusion from a high resolution panchromatic image and a low resolution multispectral image at the same geographical location. Different from previous methods, we do not make any assumption about the upsampled multispectral image, but only assume that the fused image after downsampling should be close to the original multispectral image. This is a severely ill-posed problem, and a dynamic gradient sparsity penalty is thus proposed for regularization. Incorporating the intra-correlations of different bands, this penalty can effectively exploit the prior information (e.g. sharp boundaries) from the panchromatic image. A new convex optimization algorithm is proposed to efficiently solve this problem. Extensive experiments on four multispectral datasets demonstrate that the proposed method significantly outperforms state-of-the-art methods in terms of both spatial and spectral quality.