## TechTalks from event: IEEE CVPR 2011

Note: Award talks and user-uploaded content are accessible for free. Other oral sessions are accessible only to those who registered for the main conference or for the webcast/video proceedings. You can register to view the video proceedings by visiting the CVPR 2011 website and following the virtual-registration link.

## CVPR Poster Session

• CVPR Poster Upload Authors: You!
URL: http://webcast.weyond.com:8080/upload/56/newtalk/ You must be logged in to upload. You can upload videos of up to 50 MB, in mp4/mov/f4v/flv format, with mpeg4/h.264 or VP6/Sorenson encoding. Once you have added your poster video, you will also be able to upload a pdf/jpg version of your poster slides.
• Topology-adaptive Multi-view Photometric Stereo Authors: Yusuke Yoshiyasu and Nobutoshi Yamazaki
In this paper, we present a novel technique that enables capturing of detailed 3D models from flash photographs integrating shading and silhouette cues. Our main contribution is an optimization framework which not only captures subtle surface details but also handles changes in topology. To incorporate normals estimated from shading, we employ a mesh-based deformable model using deformation gradient. This method is capable of manipulating precise geometry and, in fact, it outperforms previous methods in terms of both accuracy and efficiency. To adapt the topology of the mesh, we convert the mesh into an implicit surface representation and then back to a mesh representation. This simple procedure removes self-intersecting regions of the mesh and solves the topology problem effectively. In addition to the algorithm, we introduce a hand-held setup to achieve multi-view photometric stereo. The key idea is to acquire flash photographs from a wide range of positions in order to obtain a sufficient lighting variation even with a standard flash unit attached to the camera. Experimental results showed that our method can capture detailed shapes of various objects and cope with topology changes well.
• 3D Motion Reconstruction for Real-World Camera Motion Authors: Yingying Zhu, Mark Cox, Simon Lucey
This paper addresses the problem of 3D motion reconstruction from a series of 2D projections under low reconstructibility. Reconstructibility defines the accuracy of a 3D reconstruction from 2D projections given a particular trajectory basis, 3D point trajectory, and 3D camera center trajectory. Reconstructibility accuracy is inherently related to the correlation between point and camera trajectories: poor correlation leads to good reconstruction, and high correlation leads to poor reconstruction. Unfortunately, in most real-world situations involving non-rigid objects (e.g. bodies), camera and point motions are highly correlated (i.e., slow and smooth), resulting in poor reconstructibility. In this paper, we propose a novel approach for 3D motion reconstruction of non-rigid body motion in the presence of real-world camera motion. Specifically we: (i) propose the inclusion of a small number of keyframes in the video sequence from which 3D coordinates are inferred/estimated to circumvent ambiguities between point and camera motion, and (ii) employ an L1 penalty term to enforce a sparsity constraint on the trajectory basis coefficients so as to ensure our reconstructions are consistent with the natural compressibility of human motion. We demonstrate impressive 3D motion reconstruction for 2D projection sequences with hitherto low reconstructibility.
• Graph Matching through Entropic Manifold Alignment Authors: Francisco Escolano; Edwin Hancock; Miguel Lozano
In this paper we cast the problem of graph matching as one of non-rigid manifold alignment. The low dimensional manifolds are from the commute time embedding and are matched through coherent point drift. Although there have been a number of attempts to realise graph matching in this way, in this paper we propose a novel information-theoretic measure of alignment, the so-called symmetrized normalized-entropy-square variation. We successfully test this dissimilarity measure between manifolds on a challenging database. The measure is estimated by means of the bypass Leonenko entropy functional. In addition we prove that the proposed measure induces a positive definite kernel between the probability density functions associated with the manifolds and hence between graphs after deformation. In our experiments we find that the optimal embedding is associated with the commute time distance and we also find that our approach, which is purely topological, outperforms several state-of-the-art graph-based algorithms for point matching.
• Scenario-Based Video Event Recognition by Constraint Flow Authors: Suha Kwak, Bohyung Han, and Joon Hee Han
We present a novel approach to representing and recognizing composite video events. A composite event is specified by a scenario, which is based on primitive events and their temporal-logical relations, to constrain the arrangements of the primitive events in the composite event. We propose a new scenario description method to represent composite events fluently and efficiently. A composite event is recognized by a constrained optimization algorithm whose constraints are defined by the scenario. The dynamic configuration of the scenario constraints is represented with constraint flow, which is generated from scenario automatically by our scenario parsing algorithm. The constraint flow reduces the search space dramatically, alleviates the effect of preprocessing errors, and guarantees the globally optimal solution for recognition. We validate our method to describe scenario and construct constraint flow for real videos and illustrate the effectiveness of our composite event recognition algorithm for natural video events.
• Max-margin Clustering: Detecting Margins from Projections of Points on Lines Authors: Raghuraman Gopalan and Jagan Sankaranarayanan
Given an unlabelled set of points X \in R^N belonging to k groups, we propose a method to identify cluster assignments that provides maximum separating margin among the clusters. We address this problem by exploiting sparsity in data points inherent to margin regions, which a max-margin classifier would produce under a supervised setting to separate points belonging to different groups. By analyzing the projections of X on the set of all possible lines L in R^N, we first establish some basic results that are satisfied only by those line intervals lying outside a cluster, under assumptions of linear separability of clusters and absence of outliers. We then encode these results into a pair-wise similarity measure to determine cluster assignments, where we accommodate non-linearly separable clusters using the kernel trick. We validate our method on several UCI datasets and on some computer vision problems, and empirically show its robustness to outliers, and in cases where the exact number of clusters is not available. The proposed approach offers an improvement in clustering accuracy of about 6% on average, and up to 15%, when compared with several existing methods.
• High-quality shape from multi-view stereo and shading under general illumination Authors: Chenglei Wu, Bennett Wilburn, Yasuyuki Matsushita, Christian Theobalt
Multi-view stereo methods reconstruct 3D geometry from images well for sufficiently textured scenes, but often fail to recover high-frequency surface detail, particularly for smoothly shaded surfaces. On the other hand, shape-from-shading methods can recover fine detail from shading variations. Unfortunately, it is non-trivial to apply shape-from-shading alone to multi-view data, and most shading-based estimation methods only succeed under very restricted or controlled illumination. We present a new algorithm that combines multi-view stereo and shading-based refinement for high-quality reconstruction of 3D geometry models from images taken under constant but otherwise arbitrary illumination. We have tested our algorithm on several scenes that were captured under several general and unknown lighting conditions, and we show that our final reconstructions rival laser range scans.
• Graph Embedding Discriminant Analysis on Grassmannian Manifolds for Improved Image Set Matching Authors: Mehrtash T. Harandi, Conrad Sanderson, Sareh Shirazi, Brian C. Lovell
A convenient way of dealing with image sets is to represent them as points on Grassmannian manifolds. While several recent studies explored the applicability of discriminant analysis on such manifolds, the conventional formalism of discriminant analysis suffers from not considering the local structure of the data. We propose a discriminant analysis approach on Grassmannian manifolds, based on a graph-embedding framework. We show that by introducing within-class and between-class similarity graphs to characterise intra-class compactness and inter-class separability, the geometrical structure of data can be exploited. Experiments on several image datasets (PIE, BANCA, MoBo, ETH-80) show that the proposed algorithm obtains considerable improvements in discrimination accuracy, in comparison to three recent methods: Grassmann Discriminant Analysis (GDA), Kernel GDA, and the kernel version of Affine Hull Image Set Distance. We further propose a Grassmannian kernel, based on canonical correlation between subspaces, which can increase discrimination accuracy when used in combination with previous Grassmannian kernels.
• Learning Better Image Representations Using 'Flobject' Analysis Authors: Patrick S. Li, Inmar Givoni, Brendan Frey
Unsupervised learning can be used to extract image representations that are useful for various and diverse vision tasks. After noticing that most biological vision systems for interpreting static images are trained using disparity information, we developed an analogous framework for unsupervised learning. The output of our method is a model that can generate a vector representation or descriptor from any static image. However, the model is trained using pairs of consecutive video frames, which are used to find representations that are consistent with optical flow-derived objects, or 'flobjects'. To demonstrate the flobject analysis framework, we extend the latent Dirichlet allocation bag-of-words model to account for real-valued word-specific flow vectors and image-specific probabilistic associations between flow clusters and topics. We show that the static image representations extracted using our method can be used to achieve higher classification rates and better generalization than standard topic models, spatial pyramid matching and gist descriptors.
• Support Tucker Machines Authors: Irene Kotsia and Ioannis Patras
In this paper we address the two-class classification problem within the tensor-based framework, by formulating the Support Tucker Machines (STuMs). More precisely, in the proposed STuMs the weight parameters are regarded to be a tensor, calculated according to the Tucker tensor decomposition as the multiplication of a core tensor with a set of matrices, one along each mode. We further extend the proposed STuMs to the Σ/Σw STuMs, in order to fully exploit the information offered by the total or the within-class covariance matrix and whiten the data, thus providing invariance to affine transformations in the feature space. We formulate the two above-mentioned problems in such a way that they can be solved in an iterative manner, where at each iteration the parameters corresponding to the projections along a single tensor mode are estimated by solving a typical Support Vector Machine-type problem. The superiority of the proposed methods in terms of classification accuracy is illustrated on the problems of gait and action recognition.
• Efficient Euclidean Distance Transform Using Perpendicular Bisector Segmentation Authors: Jun Wang and Ying Tan
In this paper, we propose an efficient algorithm for computing the Euclidean distance transform of a two-dimensional binary image, called PBEDT (Perpendicular Bisector Euclidean Distance Transform). PBEDT is a two-stage independent scan algorithm. In the first stage, PBEDT computes the distance from each point to its closest feature point in the same column using a one-time column-wise scan. In the second stage, PBEDT computes the distance transform for each point by row with intermediate results of the previous stage. By using the geometric properties of the perpendicular bisector, PBEDT directly computes the segmentation by feature points for each row, with each segment corresponding to one feature point. Furthermore, by using integer arithmetic to avoid time-consuming float operations, PBEDT still achieves exact results. All these methods reduce the computational complexity significantly. Consequently, an efficient and exact linear time Euclidean distance transform algorithm is implemented. Detailed comparison with state-of-the-art linear time Euclidean distance transform algorithms shows that PBEDT is the fastest in most cases, and also the most stable one with respect to image contents.
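The two-stage structure described in this abstract can be illustrated with a minimal sketch. This is not PBEDT itself: the second stage below is a brute-force search over each row rather than the perpendicular-bisector segmentation, so it is not linear time. The function name `squared_edt` and the list-of-rows image layout are assumptions made for illustration. Working with squared distances keeps everything in integer arithmetic while remaining exact.

```python
def squared_edt(image):
    """Exact squared Euclidean distance transform of a binary image.

    image: list of rows; 1 marks a feature point, 0 marks background.
    Returns a grid of integer squared distances to the nearest feature
    point (None where the image contains no feature point at all).
    """
    rows, cols = len(image), len(image[0])
    INF = rows + cols  # larger than any real vertical distance

    # Stage 1: vertical distance from each pixel to the closest feature
    # point in the same column (one downward and one upward pass).
    g = [[INF] * cols for _ in range(rows)]
    for x in range(cols):
        for y in range(rows):
            if image[y][x]:
                g[y][x] = 0
            elif y > 0:
                g[y][x] = min(INF, g[y - 1][x] + 1)
        for y in range(rows - 2, -1, -1):
            g[y][x] = min(g[y][x], g[y + 1][x] + 1)

    # Stage 2: combine the column results along each row.
    # Brute force here; the abstract's method replaces this search with a
    # segmentation of the row based on perpendicular bisectors.
    result = []
    for y in range(rows):
        out_row = []
        for x in range(cols):
            best = None
            for xp in range(cols):
                if g[y][xp] < INF:
                    d = (x - xp) ** 2 + g[y][xp] ** 2
                    if best is None or d < best:
                        best = d
            out_row.append(best)
        result.append(out_row)
    return result


center = [[0, 0, 0],
          [0, 1, 0],
          [0, 0, 0]]
print(squared_edt(center))
```

Taking the square root of each entry afterwards gives the Euclidean distances, if they are needed.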
• Global temporal registration of multiple non-rigid surface sequences Authors: Peng Huang, Chris Budd, Adrian Hilton
In this paper we consider the problem of aligning multiple non-rigid surface mesh sequences into a single temporally consistent representation of the shape and motion. A global alignment graph structure is introduced which uses shape similarity to identify frames for inter-sequence registration. Graph optimisation is performed to minimise the total non-rigid deformation required to register the input sequences into a common structure. The resulting global alignment ensures that all input sequences are resampled with a common mesh structure which preserves the shape and temporal correspondence. Results demonstrate temporally consistent representation of several public databases of mesh sequences for multiple people performing a variety of motions with loose clothing and hair.
• Optimal Spatio-Temporal Path Discovery for Video Event Detection and Localization Authors: Du Tran, Junsong Yuan
We propose a novel algorithm for video event detection and localization as the optimal path discovery problem in spatio-temporal video space. By finding the optimal spatio-temporal path, our method not only detects the starting and ending points of the event, but also accurately locates it in each video frame. Moreover, our method is robust to the scale and intra-class variations of the event, as well as false and missed local detections, and therefore improves the overall detection and localization accuracy. The proposed search algorithm obtains the globally optimal solution with proven lowest computational complexity. Experiments on realistic video datasets demonstrate that our proposed method can be applied to different types of event detection tasks, such as abnormal event detection and walking pedestrian detection.
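The idea of finding a best-scoring path through spatio-temporal space can be sketched with a generic dynamic program. This is an illustration under simplifying assumptions, not the authors' algorithm: positions are one-dimensional, a path may move at most `reach` positions between consecutive frames, and the path may start and end at any frame. The names `best_path`, `scores`, and `reach` are hypothetical.

```python
def best_path(scores, reach=1):
    """Find the highest-scoring connected path through a score grid.

    scores[t][x]: local detection score at frame t, position x
                  (negative where no event evidence is present).
    Returns (total_score, path), where path is a list of (t, x) pairs.
    """
    best_total, best_end = float("-inf"), None
    prev_row = None
    back = []  # back[t][x]: predecessor position in frame t-1, or None
    for t, row in enumerate(scores):
        cur, ptr = [], []
        for x, s in enumerate(row):
            # Either start a new path here (value 0) or extend the best
            # positive-valued path ending near x in the previous frame.
            carry, origin = 0, None
            if prev_row is not None:
                lo = max(0, x - reach)
                hi = min(len(prev_row), x + reach + 1)
                for px in range(lo, hi):
                    if prev_row[px] > carry:
                        carry, origin = prev_row[px], px
            total = carry + s
            cur.append(total)
            ptr.append(origin)
            if total > best_total:
                best_total, best_end = total, (t, x)
        back.append(ptr)
        prev_row = cur

    if best_end is None:  # empty input
        return 0, []

    # Follow back-pointers from the best end point to the path start.
    path = []
    t, x = best_end
    while x is not None:
        path.append((t, x))
        x = back[t][x]
        t -= 1
    path.reverse()
    return best_total, path


scores = [[-1, -1, -1],
          [2, -1, -1],
          [-1, 3, -1],
          [-1, -1, 1],
          [-1, -1, -1]]
total, path = best_path(scores)
print(total, path)
```

The first and last entries of the returned path give the start and end frames of the event, and each `(t, x)` pair gives its location within a frame, which mirrors the detection-plus-localization output the abstract describes.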
• Optimal Spatio-Temporal Path Discovery for Video Event Detection Authors: Du Tran and Junsong Yuan
We propose a novel algorithm for video event detection and localization as the optimal path discovery problem in spatio-temporal video space. By finding the optimal spatio-temporal path, our method not only detects the starting and ending points of the event, but also accurately locates it in each video frame. Moreover, our method is robust to the scale and intra-class variations of the event, as well as false and missed local detections, and therefore improves the overall detection and localization accuracy. The proposed search algorithm obtains the globally optimal solution with proven lowest computational complexity. Experiments on realistic video datasets demonstrate that our proposed method can be applied to different types of event detection tasks, such as abnormal event detection and walking pedestrian detection.
• Face Image Retrieval by Shape Manipulation Authors: Brandon M. Smith, Shengqi Zhu, Li Zhang
Current face image retrieval methods achieve impressive results, but lack efficient ways to refine the search, particularly for geometric face attributes. Users cannot easily find faces with slightly more furrowed brows or specific leftward pose shifts, for example. To address this problem, we propose a new face search technique based on shape manipulation that is complementary to current search engines. Users drag one or a small number of contour points, like the bottom of the chin or the corner of an eyebrow, to search for faces similar in shape to the current face, but with updated geometric attributes specific to their edits. For example, the user can drag a mouth corner to find faces with wider smiles, or the tip of the nose to find faces with a specific pose. As part of our system, we propose (1) a novel confidence score for face alignment results that automatically constructs a contour-aligned face database with reasonable alignment accuracy, (2) a simple and straightforward extension of PCA with missing data to tensor analysis, and (3) a new regularized tensor model to compute shape feature vectors for each aligned face, all built upon previous work. To the best of our knowledge, our system demonstrates the first face retrieval approach based chiefly on shape manipulation. We show compelling results on a sizable database of over 10,000 face images captured in uncontrolled environments.
• Dynamic Batch Mode Active Learning Authors: Shayok Chakraborty, Vineeth Balasubramanian and Sethuraman Panchanathan
Active learning techniques have gained popularity to reduce human effort in labeling data instances for inducing a classifier. When faced with large amounts of unlabeled data, such algorithms automatically identify the exemplar and representative instances to be selected for manual annotation. More recently, there have been attempts towards a batch mode form of active learning, where a batch of data points is simultaneously selected from an unlabeled set. Real-world applications require adaptive approaches for batch selection in active learning. However, existing work in this field has primarily been heuristic and static. In this work, we propose a novel optimization-based framework for dynamic batch mode active learning, where the batch size as well as the selection criteria are combined in a single formulation. The solution procedure has the same computational complexity as existing state-of-the-art static batch mode active learning techniques. Our results on four challenging biometric datasets portray the efficacy of the proposed framework and also certify the potential of this approach in being used for real world biometric recognition applications.
• A Large-scale Benchmark Dataset for Event Recognition in Surveillance Video Authors: Sangmin Oh, Anthony Hoogs, Amitha Perera, Naresh Cuntoor, Chia-Chih Chen, Jong Taek Lee, Saurajit Mukherjee, J. K. Aggarwal, H. Lee, L. Davis, E. Swears, X. Wang
We introduce a new large-scale video dataset designed to assess the performance of diverse visual event recognition algorithms with a focus on continuous visual event recognition (CVER) in outdoor areas with wide coverage. Previous datasets for action recognition are unrealistic for real-world surveillance because they consist of short clips showing one action by one individual. Datasets have been developed for movies and sports, but these actions and scene conditions do not apply effectively to surveillance videos. Our dataset consists of many outdoor scenes with actions occurring naturally by non-actors in continuously captured videos of the real world. The dataset includes large numbers of instances for 23 event types distributed throughout 29 hours of video. This data is accompanied by detailed annotations which include both moving object tracks and event examples, which will provide a solid basis for large-scale evaluation. Additionally, we propose different types of evaluation modes for visual recognition tasks and evaluation metrics along with our preliminary experimental results. We believe that this dataset will stimulate diverse aspects of computer vision research and help us to advance the CVER tasks in the years ahead.
• P2C2: Programmable Pixel Compressive Camera for High Speed Imaging Authors: Dikpal Reddy, Ashok Veeraraghavan, Rama Chellappa
We describe an imaging architecture for compressive video sensing termed programmable pixel compressive camera (P2C2). P2C2 allows us to capture fast phenomena at frame rates higher than the camera sensor. In P2C2, each pixel has an independent shutter that is modulated at a rate higher than the camera frame-rate. The observed intensity at a pixel is an integration of the incoming light modulated by its specific shutter. We propose a reconstruction algorithm that uses the data from P2C2 along with additional priors about videos to perform temporal super-resolution. We model the spatial redundancy of videos using sparse representations and the temporal redundancy using brightness constancy constraints inferred via optical flow. We show that by modeling such spatio-temporal redundancies in a video volume, one can faithfully recover the underlying high-speed video frames from the observed low speed coded video. The imaging architecture and the reconstruction algorithm allow us to achieve temporal super-resolution without loss in spatial resolution. We implement a prototype of P2C2 using an LCOS modulator and recover several videos at 200 fps using a 25 fps camera.
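The measurement model in this abstract (each pixel integrates incoming light modulated by its own shutter) can be sketched as a forward simulation. The sketch covers only the capture step, not the reconstruction algorithm. Binary random shutter codes, the function name `coded_exposure`, and the flat per-pixel lists are assumptions for illustration; the factor of 8 comes from the 200 fps and 25 fps figures quoted above.

```python
import random


def coded_exposure(subframes, shutter):
    """Simulate one coded low-speed frame.

    subframes[k][i]: light reaching pixel i during sub-interval k.
    shutter[k][i]:   1 if pixel i's shutter is open in sub-interval k, else 0.
    Each output pixel is the sum of the light it saw while its shutter
    was open.
    """
    n_pixels = len(subframes[0])
    return [
        sum(shutter[k][i] * subframes[k][i] for k in range(len(subframes)))
        for i in range(n_pixels)
    ]


random.seed(0)
n_sub, n_pixels = 8, 4  # 200 fps / 25 fps = 8 sub-intervals per captured frame
subframes = [[random.random() for _ in range(n_pixels)] for _ in range(n_sub)]
shutter = [[random.randint(0, 1) for _ in range(n_pixels)] for _ in range(n_sub)]
coded_frame = coded_exposure(subframes, shutter)
print(coded_frame)
```

Recovering the eight sub-interval frames from one coded frame is underdetermined on its own, which is why the abstract relies on additional priors (sparsity and optical-flow constraints) for reconstruction.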
• Non-Rigid Structure from Motion with Complementary Rank-3 Spaces Authors: Paulo F. U. Gotardo, Aleix M. Martinez
Non-rigid structure from motion (NR-SFM) is a difficult, underconstrained problem in computer vision. This paper proposes a new algorithm that revises the standard matrix factorization approach in NR-SFM. We consider two alternative representations for the linear space spanned by a small number K of 3D basis shapes. As compared to the standard approach using general rank-3K matrix factors, we show that improved results are obtained by explicitly modeling K complementary spaces of rank-3. Our new method compares favorably with the state of the art in NR-SFM, providing improved results on high-frequency deformations of both articulated and simpler deformable shapes. We also present an approach for NR-SFM with occlusion.
• Entropy Rate Superpixel Segmentation Authors: Ming-Yu Liu, Oncel Tuzel, Srikumar Ramalingam, Rama Chellappa
We propose a new objective function for superpixel segmentation. This objective function consists of two components: entropy rate of a random walk on a graph and a balancing term. The entropy rate favors formation of compact and homogeneous clusters, while the balancing function encourages clusters with similar sizes. We present a novel graph construction for images and show that this construction induces a matroid, a combinatorial structure that generalizes the concept of linear independence in vector spaces. The segmentation is then given by the graph topology that maximizes the objective function under the matroid constraint. By exploiting submodular and monotonic properties of the objective function, we develop an efficient greedy algorithm. Furthermore, we prove an approximation bound of $\frac{1}{2}$ for the optimality of the solution. Extensive experiments on the Berkeley segmentation benchmark show that the proposed algorithm outperforms the state of the art in all the standard evaluation metrics.
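For readers unfamiliar with the first component of this objective, the entropy rate of a random walk on an undirected weighted graph can be computed as in the sketch below. It uses the standard textbook definition (stationary probability proportional to node degree, transition probability proportional to edge weight) and does not reproduce the paper's graph construction, balancing term, or greedy optimization. The function name `entropy_rate` and the dense weight-matrix input are assumptions for illustration.

```python
import math


def entropy_rate(weights):
    """Entropy rate (in nats) of a random walk on an undirected graph.

    weights: symmetric matrix (list of lists) of non-negative edge weights.
    The walk moves from node i to node j with probability
    weights[i][j] / degree[i], and its stationary distribution is
    degree[i] / total_degree.
    """
    degree = [sum(row) for row in weights]
    total = sum(degree)
    h = 0.0
    for i, row in enumerate(weights):
        for w in row:
            if w > 0:
                p = w / degree[i]
                h -= (degree[i] / total) * p * math.log(p)
    return h


triangle = [[0, 1, 1],
            [1, 0, 1],
            [1, 1, 0]]
print(entropy_rate(triangle))
```

On the unit-weight triangle every step is a fair choice between two neighbours, so the value equals log 2; a graph with a single edge gives zero because every step is forced.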
• 2.5D Building Modeling with Topology Control Authors: Qian-Yi Zhou and Ulrich Neumann
2.5D building reconstruction aims at creating building models composed of complex roofs and vertical walls. In this paper, we define 2.5D building topology as a set of roof features, wall features, and point features, together with the associations between them. Based on this definition, we extend 2.5D dual contouring into a 2.5D modeling method with topology control. Compared with the previous method, we place fewer restrictions on the adaptive simplification process. We show results under intense geometry simplifications. Our results preserve significant topology structures while the number of triangles is comparable to that of manually created models or primitive-based models.
• Unsupervised Random Forest Indexing for Fast Action Search Authors: Gang Yu, Junsong Yuan, Zicheng Liu
Despite recent successes of small object search in images, the search and localization of actions in crowded videos remains a challenging problem because of (1) the large variations of human actions and (2) the intensive computational cost of searching the video space. To address these challenges, we propose a fast action search and localization method that supports relevance feedback from the user. By characterizing videos as spatio-temporal interest points and building a random forest to index and match these points, our query matching is robust and efficient. To enable efficient action localization, we propose a coarse-to-fine subvolume search scheme, which is several orders of magnitude faster than the existing video branch and bound search. The challenging cross-data search of several actions validates the effectiveness and efficiency of our method.
• Activity Recognition Using Dynamical Subspace Angles Authors: Binlong Li, Mustafa Ayazoglu, Teresa Mao, Octavia Camps and Mario Sznaier
Cameras are ubiquitous and hold the promise of significantly changing the way we live and interact with our environment. Human activity recognition is central to understanding dynamic scenes for applications ranging from security surveillance, to assisted living for the elderly, to video gaming without controllers. Most current approaches to solve this problem are based on the use of local temporal-spatial features that limit their ability to recognize long and complex actions. In this paper, we propose a new approach to exploit the temporal information encoded in the data. The main idea is to model activities as the output of unknown dynamic systems evolving from unknown initial conditions. Under this framework, we show that activity videos can be compared by computing the principal angles between subspaces representing activity types which are found by a simple SVD of the experimental data. The proposed approach outperforms state-of-the-art methods classifying activities in the KTH dataset as well as in much more complex scenarios involving interacting actors.
• High-resolution hyperspectral imaging via matrix factorization Authors: Rei Kawakami, John Wright, Yu-Wing Tai, Yasuyuki Matsushita, Moshe Ben-Ezra, Katsushi Ikeuchi
Hyperspectral imaging is a promising tool for applications in geosensing, cultural heritage and beyond. However, compared to current RGB cameras, existing hyperspectral cameras are severely limited in spatial resolution. In this paper, we introduce a simple new technique for reconstructing a very high-resolution hyperspectral image from two readily obtained measurements: a lower-resolution hyperspectral image and a high-resolution RGB image. Our approach is divided into two stages: we first apply an unmixing algorithm to the hyperspectral input, to estimate a basis representing reflectance spectra. We then use this representation in conjunction with the RGB input to produce the desired result. Our approach to unmixing is motivated by the spatial sparsity of the hyperspectral input, and casts the unmixing problem as the search for a factorization of the input into a basis and a set of maximally sparse coefficients. Experiments show that this simple approach performs reasonably well on both simulations and real data examples.
• City-scale landmark identification on mobile devices Authors: David Chen, Georges Baatz, Kevin Koeser, Sam Tsai, Ramakrishna Vedantham, Timo Pylvanainen, Kimmo Roimela, Xin Chen, Jeff Bach, M. Pollefeys, B. Girod, R. Gr
With recent advances in mobile computing, the demand for visual localization or landmark identification on mobile devices is gaining interest. We advance the state of the art in this area by fusing two popular representations of street-level image data (facade-aligned and viewpoint-aligned) and show that they contain complementary information that can be exploited to significantly improve the recall rates on the city scale. We also improve feature detection in low contrast parts of the street-level data, and discuss how to incorporate priors on a user's position (e.g. given by noisy GPS readings or network cells), which previous approaches often ignore. Finally, and maybe most importantly, we present our results according to a carefully designed, repeatable evaluation scheme and make publicly available a set of 1.7 million images with ground truth labels, geotags, and calibration data, as well as a difficult set of cell phone query images. We provide these resources as a benchmark to facilitate further research in the area.
• Probabilistic 3D Gaze Estimation Authors: Jixu Chen, Qiang Ji
Existing eye gaze tracking systems typically require an explicit personal calibration process in order to estimate certain person-specific eye parameters. For natural human computer interaction, such a personal calibration is often cumbersome and unnatural. In this paper, we propose a new probabilistic eye gaze tracking system without explicit personal calibration. Unlike the traditional eye gaze tracking methods, which estimate the eye parameter deterministically, our approach estimates the probability distributions of the eye parameter and the eye gaze, by combining image saliency with the 3D eye model. By using an incremental learning framework, the subject doesn't need personal calibration before using the system. His/her eye parameter and gaze estimation can be improved gradually when he/she is naturally viewing a sequence of images on the screen. The experimental result shows that the proposed system can achieve less than three degrees accuracy for different people without calibration.
• Improving Classifiers with Unlabeled Weakly-Related Videos Authors: Christian Leistner, Martin Godec, Samuel Schulter, Amir Saffari, Manuel Werlberger, and Horst Bischof
Current state-of-the-art object classification systems are trained using large amounts of hand-labeled images. In this paper, we present an approach that shows how to use unlabeled video sequences, comprising weakly-related object categories towards the target class, to learn better classifiers for tracking and detection. The underlying idea is to exploit the space-time consistency of moving objects to learn classifiers that are robust to local transformations. In particular, we use dense optical flow to find moving objects in videos in order to train part-based random forests that are insensitive to natural transformations. Our method, which is called Video Forests, can be used in two settings: first, labeled training data can be regularized to force the trained classifier to generalize better towards small local transformations. Second, as part of a tracking-by-detection approach, it can be used to train a general codebook solely on pair-wise data that can then be applied to tracking of instances of a priori unknown object categories. In the experimental part, we show on benchmark datasets for both tracking and detection that incorporating unlabeled videos into the learning of visual classifiers leads to improved results.
• Efficient Multi-Camera Detection, Tracking, and Identification using a Shared Set of Haar-Features Authors: Reyes Rios Cabrera, Tinne Tuytelaars, Luc Van Gool
This paper presents an integrated solution for the problem of detecting, tracking and identifying vehicles in a tunnel surveillance application, taking into account practical constraints including real-time operation, poor imaging conditions, and a decentralized architecture. Vehicles are followed through the tunnel by a network of non-overlapping cameras. They are detected and tracked in each camera and then identified, i.e. matched to any of the vehicles detected in the previous camera(s). To limit the computational load, we propose to reuse the same set of Haar-features for each of these steps. For the detection, we use an AdaBoost cascade. Here we introduce a composite confidence score, integrating information from all stages of the cascade. A subset of the features used for detection is then selected, optimizing for the identification problem. This results in a compact binary 'vehicle fingerprint', requiring very limited bandwidth. Finally, we show that the same set of features can also be used for tracking. This Haar-feature-based 'tracking-by-identification' yields surprisingly good results on standard datasets, without the need to update the model online.
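The abstract does not say how the binary fingerprint is built or compared. As a minimal sketch, assuming each selected feature response is thresholded into one bit and fingerprints are compared by Hamming distance, matching across cameras could look like the following. All names, thresholds, and the thresholding scheme itself are hypothetical illustrations, not the authors' method.

```python
from typing import Dict, Sequence


def fingerprint(responses: Sequence[float], thresholds: Sequence[float]) -> int:
    """Pack one bit per feature into an int: bit k is set if response k exceeds its threshold."""
    bits = 0
    for k, (response, threshold) in enumerate(zip(responses, thresholds)):
        if response > threshold:
            bits |= 1 << k
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def best_match(query: int, gallery: Dict[str, int]) -> str:
    """Return the id of the gallery fingerprint closest to the query."""
    return min(gallery, key=lambda vehicle_id: hamming(query, gallery[vehicle_id]))


# Hypothetical feature responses for a vehicle seen by the next camera.
query = fingerprint([0.9, -0.2, 0.4, 0.1], [0.0, 0.0, 0.0, 0.5])
gallery = {"car_a": 0b0111, "car_b": 0b1010}
match = best_match(query, gallery)
```

With a few dozen selected features, such a fingerprint fits in a handful of bytes, which is in the spirit of the limited-bandwidth claim above.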
• A General Method for the Point of Regard Estimation in 3D Space Authors: Fiora Pirri, Matia Pizzoli, Alessandro Rudi
A novel approach to 3D gaze estimation for wearable multi-camera devices is proposed and its effectiveness is demonstrated both theoretically and empirically. The proposed approach, firmly grounded on the geometry of the multiple views, introduces a calibration procedure that is efficient, accurate, highly innovative but also practical and easy. Thus, it can run online with little intervention from the user. The overall gaze estimation model is general, as no particular complex model of the human eye is assumed in this work. This is made possible by a novel approach that can be sketched as follows: each eye is imaged by a camera; two conics are fitted to the imaged pupils; and a calibration sequence, in which the subject gazes at a known 3D point while moving his/her head, provides information to 1) estimate the optical axis in the 3D world; 2) compute the geometry of the multi-camera system; 3) estimate the Point of Regard in the 3D world. The resultant model is being used effectively to study visual attention by means of gaze estimation experiments, involving people performing natural tasks in wide-field, unstructured scenarios.
• Gated Classifiers: Boosting under High Intra-Class Variation Authors: Oscar Danielsson, Babak Rasolzadeh and Stefan Carlsson
In this paper we address the problem of using boosting (e.g. AdaBoost [7]) to classify a target class with significant intra-class variation against a large background class. This situation occurs for example when we want to recognize a visual object class against all other image patches. The boosting algorithm produces a strong classifier, which is a linear combination of weak classifiers. We observe that we often have sets of weak classifiers that individually fire on many examples of the target class but never fire together on those examples (i.e. their outputs are anticorrelated on the target class). Motivated by this observation we suggest a family of derived weak classifiers, termed gated classifiers, that suppress such combinations of weak classifiers. Gated classifiers can be used on top of any original weak learner. We run experiments on two popular datasets, showing that our method reduces the required number of weak classifiers by almost an order of magnitude, which in turn yields faster detectors. We experiment on synthetic data showing that gated classifiers enable more complex distributions to be represented. We hope that gated classifiers will extend the usefulness of boosted classifier cascades [29].
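To illustrate the observation about weak classifiers that fire individually but not jointly on the target class, here is a minimal sketch of one possible gate. The gate form used here (fire when at least one member fires, but not all of them together) is an assumption for illustration; the paper's actual family of gated classifiers may be defined differently, and `threshold_stump` and `gate_suppress_joint` are hypothetical names.

```python
from typing import Callable, List, Sequence

WeakClassifier = Callable[[Sequence[float]], bool]


def threshold_stump(feature_index: int, threshold: float) -> WeakClassifier:
    """A simple weak classifier: fires when one feature exceeds a threshold."""
    def stump(x: Sequence[float]) -> bool:
        return x[feature_index] > threshold
    return stump


def gate_suppress_joint(members: List[WeakClassifier]) -> WeakClassifier:
    """Derived classifier: fires if some member fires, but is suppressed when all fire together."""
    def gated(x: Sequence[float]) -> bool:
        outputs = [h(x) for h in members]
        return any(outputs) and not all(outputs)
    return gated


h_a = threshold_stump(0, 0.5)
h_b = threshold_stump(1, 0.5)
gated = gate_suppress_joint([h_a, h_b])

# Toy data: each target example triggers exactly one stump; background triggers both or neither.
targets = [(0.9, 0.1), (0.1, 0.9)]
background = [(0.9, 0.9), (0.1, 0.1)]
target_outputs = [gated(x) for x in targets]
background_outputs = [gated(x) for x in background]
```

On this toy data no linear combination of `h_a` and `h_b` followed by a single threshold separates the two groups, while the single gated classifier does, which is the kind of saving in weak classifiers the abstract describes.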
• Structure-from-Motion Based Hand-Eye Calibration Using $L_{\infty}$ Minimization Authors: Jan Heller, Michal Havlena, Akihiro Sugimoto, Tomas Pajdla
This paper presents a novel method for so-called hand-eye calibration. Using a calibration target is not possible for many applications of hand-eye calibration. In such situations, a Structure-from-Motion approach to hand-eye calibration is commonly used to recover the camera poses up to scaling. The presented method takes advantage of recent results in $L_{\infty}$-norm optimization using Second-Order Cone Programming (SOCP) to recover the correct scale. Further, the correctly scaled displacement of the hand-eye transformation is recovered solely from the image correspondences and robot measurements, and is guaranteed to be globally optimal with respect to the $L_{\infty}$-norm. The method is experimentally validated using both synthetic and real-world datasets.
• Global temporal registration of multiple non-rigid surface sequences Authors: Peng Huang, Chris Budd, Adrian Hilton
In this paper we consider the problem of aligning multiple non-rigid surface mesh sequences into a single temporally consistent representation of the shape and motion. A global alignment graph structure is introduced which uses shape similarity to identify frames for inter-sequence registration. Graph optimisation is performed to minimise the total non-rigid deformation required to register the input sequences into a common structure. The resulting global alignment ensures that all input sequences are resampled with a common mesh structure which preserves the shape and temporal correspondence. Results demonstrate temporally consistent representation of several public databases of mesh sequences for multiple people performing a variety of motions with loose clothing and hair.
• GraphTrack: Fast and Globally Optimal Tracking in Videos Authors: Brian Amberg and Thomas Vetter
In video post-production it is often necessary to track interest points in the video. This is called off-line tracking, because the complete video is available to the algorithm and can be contrasted with on-line tracking, where an incoming stream is tracked in real time. Off-line tracking should be accurate and, if used interactively, needs to be fast, preferably faster than real-time. We describe a 50 to 100 frames per second off-line tracking algorithm, which globally maximizes the probability of the track given the complete video. The algorithm is more reliable than previous methods because it explains the complete frames, not only the patches of the final track, making as much use of the data as possible. It achieves efficiency by using a greedy search strategy with deferred cost evaluation, focusing the computational effort on the most promising track candidates while finding the globally optimal track.
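The idea of a best-first search with deferred cost evaluation can be sketched generically as a lazy shortest-path search over a trellis of per-frame candidates. This is not the GraphTrack algorithm itself: the trellis formulation, the non-negative additive costs, and the names `lazy_best_track`, `node_cost`, and `transition_cost` are all assumptions made for illustration.

```python
import heapq
from typing import Callable, Dict, List, Tuple


def lazy_best_track(
    num_frames: int,
    num_candidates: int,
    node_cost: Callable[[int, int], float],
    transition_cost: Callable[[int, int], float],
) -> Tuple[float, List[int]]:
    """Return (total cost, track) minimizing summed node and transition costs.

    node_cost(t, i) >= 0 is assumed expensive and is evaluated at most once,
    and only for nodes the search actually reaches. transition_cost(i, j) >= 0
    is assumed cheap. Parent index -1 marks the first frame.
    """
    cache: Dict[Tuple[int, int], float] = {}
    parent_of: Dict[Tuple[int, int], int] = {}
    # Heap entries: (priority, frame, candidate, parent, node_cost_included).
    heap = [(0.0, 0, i, -1, False) for i in range(num_candidates)]
    heapq.heapify(heap)
    while heap:
        priority, t, i, parent, included = heapq.heappop(heap)
        if (t, i) in parent_of:
            continue  # already finalized via a cheaper path
        if not included:
            # Deferred evaluation: pay for the node cost only now, then re-queue.
            if (t, i) not in cache:
                cache[(t, i)] = node_cost(t, i)
            heapq.heappush(heap, (priority + cache[(t, i)], t, i, parent, True))
            continue
        parent_of[(t, i)] = parent
        if t == num_frames - 1:
            track, current = [], i
            for frame in range(t, -1, -1):
                track.append(current)
                current = parent_of[(frame, current)]
            return priority, track[::-1]
        for j in range(num_candidates):
            if (t + 1, j) not in parent_of:
                bound = priority + transition_cost(i, j)  # lower bound, node cost deferred
                heapq.heappush(heap, (bound, t + 1, j, i, False))
    raise ValueError("no track found")


evaluated = []

def toy_node_cost(t: int, i: int) -> float:
    evaluated.append((t, i))
    return 0.0 if i == 0 else 10.0

total, track = lazy_best_track(4, 3, toy_node_cost, lambda i, j: float(abs(i - j)))
```

Because every queued priority is a lower bound on the true path cost, the first completed track popped from the heap is optimal, and candidates whose lower bound never becomes competitive are never evaluated.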
• The Magic Sigma Authors: Dirk Padfield
With the explosion in the usage of mobile devices and other smart electronics, embedded devices are becoming ubiquitous. Most such embedded architectures utilize fixed-point rather than floating-point computation to meet power, heat, and speed requirements, leading to the need for integer-based processing algorithms. Operations involving Gaussian kernels are common to such algorithms, but the standard methods of constructing such kernels result in approximations and lack a property that enables efficient bitwise shift operations. To overcome these limitations, we show how to precisely combine the power of integer arithmetic and bitwise shifts with intrinsically real-valued Gaussian kernels. We prove mathematically that there exists a set of what we call "magic sigmas" for which the integer kernels exactly represent the Gaussian function and whose values are all powers of two, and we discovered that the maximum sigma that leads to such properties is about 0.85. We also designed a simple and precise algorithm for designing kernels composed exclusively of integers given any arbitrary sigma and show how this can be exploited for Gaussian filter design. Considering the ubiquity of Gaussian filtering and the need for integer computation for increasing numbers of embedded devices, this is an important result for both theoretical and practical purposes.
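The "about 0.85" figure is consistent with a short calculation, given here as our own reading rather than the paper's derivation. The unnormalized Gaussian $e^{-x^2/(2\sigma^2)}$ equals $2^{-k x^2}$ at every integer $x$ when $1/(2\sigma^2) = k \ln 2$, that is $\sigma = 1/\sqrt{2 k \ln 2}$. Assuming $k$ is restricted to positive integers, the largest such value occurs at $k = 1$ and is roughly 0.849. The Python sketch below checks this; the helper names and the choice of scaling by $2^{k r^2}$ are hypothetical.

```python
import math
from typing import List


def magic_sigma(k: int) -> float:
    """Sigma for which exp(-x^2 / (2 sigma^2)) equals 2 ** (-k * x^2)."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    return 1.0 / math.sqrt(2.0 * k * math.log(2.0))


def integer_kernel(k: int, radius: int) -> List[int]:
    """Gaussian taps scaled by 2 ** (k * radius^2), so every tap is an integer power of two."""
    return [1 << (k * (radius * radius - x * x)) for x in range(-radius, radius + 1)]


sigma = magic_sigma(1)
kernel = integer_kernel(1, 2)

# Compare the integer taps against the real-valued Gaussian at the same scale.
scale = kernel[len(kernel) // 2]
gaussian = [scale * math.exp(-x * x / (2.0 * sigma * sigma)) for x in range(-2, 3)]
```

Since each tap is a power of two, multiplying a sample by a tap reduces to a left shift by the tap's exponent. This sketch does not address normalization of the filtered result.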