## Orals 4A : Computational Photography: Sensing and Display

• Diffuse Mirrors: 3D Reconstruction from Diffuse Indirect Illumination Using Inexpensive Time-of-Flight Sensors Authors: Felix Heide, Lei Xiao, Wolfgang Heidrich, Matthias B. Hullin
The functional difference between a diffuse wall and a mirror is well understood: one scatters back into all directions, and the other one preserves the directionality of reflected light. The temporal structure of the light, however, is left intact by both: assuming simple surface reflection, photons that arrive first are reflected first. In this paper, we exploit this insight to recover objects outside the line of sight from second-order diffuse reflections, effectively turning walls into mirrors. We formulate the reconstruction task as a linear inverse problem on the transient response of a scene, which we acquire using an affordable setup consisting of a modulated light source and a time-of-flight image sensor. By exploiting sparsity in the reconstruction domain, we achieve resolutions in the order of a few centimeters for object shape (depth and laterally) and albedo. Our method is robust to ambient light and works for large room-sized scenes. It is drastically faster and less expensive than previous approaches using femtosecond lasers and streak cameras, and does not require any moving parts.
• Fourier Analysis on Transient Imaging with a Multifrequency Time-of-Flight Camera Authors: Jingyu Lin, Yebin Liu, Matthias B. Hullin, Qionghai Dai
A transient image is the optical impulse response of a scene which visualizes light propagation during an ultra-short time interval. In this paper we discover that the data captured by a multifrequency time-of-flight (ToF) camera is the Fourier transform of a transient image, and identify the sources of systematic error. Based on the discovery we propose a novel framework of frequency-domain transient imaging, as well as algorithms to remove systematic error. The whole process of our approach is of much lower computational cost, especially lower memory usage, than Heide et al.'s approach using the same device. We evaluate our approach on both synthetic and real-datasets.
• Transparent Object Reconstruction via Coded Transport of Intensity Authors: Chenguang Ma, Xing Lin, Jinli Suo, Qionghai Dai, Gordon Wetzstein
Capturing and understanding visual signals is one of the core interests of computer vision. Much progress has been made w.r.t. many aspects of imaging, but the reconstruction of refractive phenomena, such as turbulence, gas and heat flows, liquids, or transparent solids, has remained a challenging problem. In this paper, we derive an intuitive formulation of light transport in refractive media using light fields and the transport of intensity equation. We show how coded illumination in combination with pairs of recorded images allow for robust computational reconstruction of dynamic two and three-dimensional refractive phenomena.
• 3D Shape and Indirect Appearance by Structured Light Transport Authors: Matthew O'Toole, John Mather, Kiriakos N. Kutulakos
We consider the problem of deliberately manipulating the direct and indirect light flowing through a time-varying, fully-general scene in order to simplify its visual analysis. Our approach rests on a crucial link between stereo geometry and light transport: while direct light always obeys the epipolar geometry of a projector-camera pair, indirect light overwhelmingly does not. We show that it is possible to turn this observation into an imaging method that analyzes light transport in real time in the optical domain, prior to acquisition. This yields three key abilities that we demonstrate in an experimental camera prototype: (1) producing a live indirect-only video stream for any scene, regardless of geometric or photometric complexity; (2) capturing images that make existing structured-light shape recovery algorithms robust to indirect transport; and (3) turning them into one-shot methods for dynamic 3D shape capture.
• Shape-Preserving Half-Projective Warps for Image Stitching Authors: Che-Han Chang, Yoichi Sato, Yung-Yu Chuang
This paper proposes a novel parametric warp which is a spatial combination of a projective transformation and a similarity transformation. Given the projective transformation relating two input images, based on an analysis of the projective transformation, our method smoothly extrapolates the projective transformation of the overlapping regions into the non-overlapping regions and the resultant warp gradually changes from projective to similarity across the image. The proposed warp has the strengths of both projective and similarity warps. It provides good alignment accuracy as projective warps while preserving the perspective of individual image as similarity warps. It can also be combined with more advanced local-warp-based alignment methods such as the as-projective-as-possible warp for better alignment accuracy. With the proposed warp, the field of view can be extended by stitching images with less projective distortion (stretched shapes and enlarged sizes).
• Parallax-tolerant Image Stitching Authors: Fan Zhang, Feng Liu
Parallax handling is a challenging task for image stitching. This paper presents a local stitching method to handle parallax based on the observation that input images do not need to be perfectly aligned over the whole overlapping region for stitching. Instead, they only need to be aligned in a way that there exists a local region where they can be seamlessly blended together. We adopt a hybrid alignment model that combines homography and content-preserving warping to provide flexibility for handling parallax and avoiding objectionable local distortion. We then develop an efficient randomized algorithm to search for a homography, which, combined with content-preserving warping, allows for optimal stitching. We predict how well a homography enables plausible stitching by finding a plausible seam and using the seam cost as the quality metric. We develop a seam finding method that estimates a plausible seam from only roughly aligned images by considering both geometric alignment and image content. We then pre-align input images using the optimal homography and further use content-preserving warping to locally refine the alignment. We finally compose aligned images together using a standard seam-cutting algorithm and a multi-band blending algorithm. Our experiments show that our method can effectively stitch images with large parallax that are difficult for existing methods.

## Orals 4B : Recognition: Detection, Categorization, Classification

• Learning Everything about Anything: Webly-Supervised Visual Concept Learning Authors: Santosh K. Divvala, Ali Farhadi, Carlos Guestrin
Recognition is graduating from labs to real-world applications. While it is encouraging to see its potential being tapped, it brings forth a fundamental challenge to the vision researcher: scalability. How can we learn a model for any concept that exhaustively covers all its appearance variations, while requiring minimal or no human supervision for compiling the vocabulary of visual variance, gathering the training images and annotations, and learning the models? In this paper, we introduce a fully-automated approach for learning extensive models for a wide range of variations (e.g. actions, interactions, attributes and beyond) within any concept. Our approach leverages vast resources of online books to discover the vocabulary of variance, and intertwines the data collection and modeling steps to alleviate the need for explicit human supervision in training the models. Our approach organizes the visual knowledge about a concept in a convenient and useful way, enabling a variety of applications across vision and NLP. Our online system has been queried by users to learn models for several interesting concepts including breakfast, Gandhi, beautiful, etc. To date, our system has models available for over 50,000 variations within 150 concepts, and has annotated more than 10 million images with bounding boxes.
• Dirichlet-based Histogram Feature Transform for Image Classification Authors: Takumi Kobayashi
Histogram-based features have significantly contributed to recent development of image classifications, such as by SIFT local descriptors. In this paper, we propose a method to efficiently transform those histogram features for improving the classification performance. The (L1 -normalized) histogram feature is regarded as a probability mass function, which is modeled by Dirichlet distribution. Based on the probabilistic modeling, we induce the Dirichlet Fisher kernel for transforming the histogram feature vector. The method works on the individual histogram feature to enhance the discriminative power at a low computational cost. On the other hand, in the bag-of-feature (BoF) frame- work, the Dirichlet mixture model can be extended to Gaussian mixture by transforming histogram-based local descriptors, e.g., SIFT, and thereby we propose the method of Dirichlet-derived GMM Fisher kernel. In the experiments on diverse image classification tasks including recognition of subordinate objects and material textures, the pro- posed methods improve the performance of the histogram- based features and BoF-based Fisher kernel, being favor- ably competitive with the state-of-the-arts.
• BING: Binarized Normed Gradients for Objectness Estimation at 300fps Authors: Ming-Ming Cheng, Ziming Zhang, Wen-Yan Lin, Philip Torr
Training a generic objectness measure to produce a small set of candidate object windows, has been shown to speed up the classical sliding window object detection paradigm. We observe that generic objects with well-defined closed boundary can be discriminated by looking at the norm of gradients, with a suitable resizing of their corresponding image windows in to a small fixed size. Based on this observation and computational reasons, we propose to resize the window to 8 �� 8 and use the norm of the gradients as a simple 64D feature to describe it, for explicitly training a generic objectness measure. We further show how the binarized version of this feature, namely binarized normed gradients (BING), can be used for efficient objectness estimation, which requires only a few atomic operations (e.g. ADD, BITWISE SHIFT, etc.). Experiments on the challenging PASCAL VOC 2007 dataset show that our method efficiently (300fps on a single laptop CPU) generates a small set of category-independent, high quality object windows, yielding 96.2% object detection rate (DR) with 1,000 proposals. Increasing the numbers of proposals and color spaces for computing BING features, our performance can be further improved to 99.5% DR.
• Context Driven Scene Parsing with Attention to Rare Classes Authors: Jimei Yang, Brian Price, Scott Cohen, Ming-Hsuan Yang
This paper presents a scalable scene parsing algorithm based on image retrieval and superpixel matching. We focus on rare object classes, which play an important role in achieving richer semantic understanding of visual scenes, compared to common background classes. Towards this end, we make two novel contributions: rare class expansion and semantic context description. First, considering the long-tailed nature of the label distribution, we expand the retrieval set by rare class exemplars and thus achieve more balanced superpixel classification results. Second, we incorporate both global and local semantic context information through a feedback based mechanism to refine image retrieval and superpixel matching. Results on the SIFTflow and LMSun datasets show the superior performance of our algorithm, especially on the rare classes, without sacrificing overall labeling accuracy.
• Patch to the Future: Unsupervised Visual Prediction Authors: Jacob Walker, Abhinav Gupta, Martial Hebert
In this paper we present a conceptually simple but surprisingly powerful method for visual prediction which combines the effectiveness of mid-level visual elements with temporal modeling. Our framework can be learned in a completely unsupervised manner from a large collection of videos. However, more importantly, because our approach models the prediction framework on these mid-level elements, we can not only predict the possible motion in the scene but also predict visual appearances ��� how are appearances going to change with time. This yields a visual "hallucination" of probable events on top of the scene. We show that our method is able to accurately predict and visualize simple future events; we also show that our approach is comparable to supervised methods for event prediction.
• Triangulation Embedding and Democratic Aggregation for Image Search Authors: Hervé Jégou, Andrew Zisserman
We consider the design of a single vector representation for an image that embeds and aggregates a set of local patch descriptors such as SIFT. More specifically we aim to construct a dense representation, like the Fisher Vector or VLAD, though of small or intermediate size. We make two contributions, both aimed at regularizing the individual contributions of the local descriptors in the final representation. The first is a novel embedding method that avoids the dependency on absolute distances by encoding directions. The second contribution is a democratization" strategy that further limits the interaction of unrelated descriptors in the aggregation stage. These methods are complementary and give a substantial performance boost over the state of the art in image search with short or mid-size vectors, as demonstrated by our experiments on standard public image retrieval benchmarks.

## Orals 4C : 3D Geometry & Shape

• Local Regularity-driven City-scale Facade Detection from Aerial Images Authors: Jingchen Liu, Yanxi Liu
In this paper, we cast the problem of point cloud matching as a shape matching problem by transforming each of the given point clouds into a shape representation called the Schr\"{o}dinger distance transform (SDT) representation. This is achieved by solving a static Schr\"{o}dinger equation instead of the corresponding static Hamilton-Jacobi equation in this setting. The SDT representation is an analytic expression and following the theoretical physics literature, can be normalized to have unit $L_2$ norm---making it a \emph{square-root density}, which is identified with a point on a unit Hilbert sphere, whose intrinsic geometry is fully known. The Fisher-Rao metric, a natural metric for the space of densities leads to analytic expressions for the geodesic distance between points on this sphere. In this paper, we use the well known Riemannian framework never before used for point cloud matching, and present a novel matching algorithm. We pose point set matching under rigid and non-rigid transformations in this framework and solve for the transformations using standard nonlinear optimization techniques. Finally, to evaluate the performance of our algorithm---dubbed SDTM---we present several synthetic and real data examples along with extensive comparisons to state-of-the-art techniques. The experiments show that our algorithm outperforms state-of the-art point set registration algorithms on many quantitative metrics.