April 25th Talk: Location Recognition

April 23, 2012

This week I will present some recent work on location recognition problems.


Main paper:

Fast Image-Based Localization using Direct 2D-to-3D Matching (PDF, supplementary material, project). Torsten Sattler, Bastian Leibe, Leif Kobbelt. ICCV 2011.
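To give a feel for the paper's core idea before the talk: direct 2D-to-3D localization matches each descriptor from the query image against descriptors attached to points of the SfM model, accepting a match only if it passes Lowe's ratio test. The sketch below is just a brute-force illustration of that matching step under my own toy setup (the paper itself accelerates the search with visual-word quantization and prioritized matching); `match_2d_to_3d` and the example data are mine, not the authors'.

```python
import numpy as np

def match_2d_to_3d(query_desc, model_desc, ratio=0.7):
    """Match 2D query descriptors against 3D-point descriptors.

    For each query descriptor, find its two nearest neighbors among the
    model descriptors (Euclidean distance) and accept the match only if
    it passes Lowe's ratio test.  Returns (query_idx, model_idx) pairs.
    """
    matches = []
    for i, q in enumerate(query_desc):
        d = np.linalg.norm(model_desc - q, axis=1)   # distance to every 3D point
        nn = np.argsort(d)[:2]                       # two nearest neighbors
        if d[nn[0]] < ratio * d[nn[1]]:              # ratio test
            matches.append((i, int(nn[0])))
    return matches

# toy example: three "3D point" descriptors, two query descriptors
model = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
queries = np.array([[0.1, 0.0],   # clearly closest to point 0 -> accepted
                    [5.0, 0.0]])  # ambiguous between points 0 and 1 -> rejected
print(match_2d_to_3d(queries, model))  # [(0, 0)]
```

The accepted 2D-3D correspondences would then feed a RANSAC pose solver to localize the camera.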


Related works:

Location Recognition using Prioritized Feature Matching (PDF). Yunpeng Li, Noah Snavely, Dan Huttenlocher. ECCV 2010.

From Structure-From-Motion Point Clouds to Fast Location Recognition (PDF). Arnold Irschara, Christopher Zach, Jan-Michael Frahm, Horst Bischof. CVPR 2009.


See you there!

Categories: Talk Announcement

April 18th Talk: Latent Dirichlet Allocation And Its Application in CV

April 16, 2012

This week I will present several works that apply LDA (Latent Dirichlet Allocation) and its variants, such as gLDA and pLDA, to computer vision. Topic models and Bayesian nonparametric modeling are becoming more and more popular: Jordan, Blei, and Ghahramani published many of the works that became the foundation of Dirichlet processes, while Zisserman and Fei-Fei applied topic models to many different types of problems.

Main paper:

J. Philbin, J. Sivic, A. Zisserman, Geometric Latent Dirichlet Allocation on a Matching Graph for Large-scale Image Datasets (pdf, dataset). International Journal of Computer Vision, Volume 95, Number 2, pages 138–153, November 2011


Given a large-scale collection of images our aim is to efficiently associate images which contain the same entity, for example a building or object, and to discover the significant entities. To achieve this, we introduce the Geometric Latent Dirichlet Allocation (gLDA) model for unsupervised discovery of particular objects in unordered image collections. This explicitly represents images as mixtures of particular objects or facades, and builds rich latent topic models which incorporate the identity and locations of visual words specific to the topic in a geometrically consistent way. Applying standard inference techniques to this model enables images likely to contain the same object to be probabilistically grouped and ranked.

Additionally, to reduce the computational cost of applying the gLDA model to large datasets, we propose a scalable method that first computes a matching graph over all the images in a dataset. This matching graph connects images that contain the same object, and rough image groups can be mined from this graph using standard clustering techniques. The gLDA model can then be applied to generate a more nuanced representation of the data. We also discuss how “hub images” (images representative of an object or landmark) can easily be extracted from our matching graph representation.

We evaluate our techniques on the publicly available Oxford buildings dataset (5K images) and show examples of automatically mined objects. The methods are evaluated quantitatively on this dataset using a ground truth labeling for a number of Oxford landmarks. To demonstrate the scalability of the matching graph method, we show qualitative results on two larger datasets of images taken of the Statue of Liberty (37K images) and Rome (1M+ images).
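For those new to topic models, it may help to see plain LDA's generative story before we get to the geometric extension: each topic is a distribution over (visual) words, each document draws a topic mixture from a Dirichlet prior, and each word is generated by first sampling a topic and then a word from that topic. The numpy sketch below simulates exactly that process with made-up sizes and hyperparameters of my choosing; gLDA additionally attaches image locations and a per-topic geometric transformation to the words, which this sketch does not model.

```python
import numpy as np

rng = np.random.default_rng(0)

n_topics, vocab_size, n_docs, doc_len = 3, 20, 5, 50
alpha, beta = 0.5, 0.1  # Dirichlet hyperparameters (symmetric)

# per-topic word distributions (phi) and per-document topic mixtures (theta)
phi = rng.dirichlet(beta * np.ones(vocab_size), size=n_topics)   # topics x words
theta = rng.dirichlet(alpha * np.ones(n_topics), size=n_docs)    # docs x topics

docs = []
for d in range(n_docs):
    z = rng.choice(n_topics, size=doc_len, p=theta[d])                # topic per word
    w = np.array([rng.choice(vocab_size, p=phi[k]) for k in z])       # word per topic
    docs.append(w)
```

Inference (e.g. Gibbs sampling or variational Bayes) runs this story in reverse, recovering `phi` and `theta` from the observed words alone.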

Related work:

D. Blei, A. Ng, M. Jordan, Latent Dirichlet Allocation. Journal of Machine Learning Research 3:993–1022 (2003) pdf

L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR 2005 pdf

J. J. Kivinen, E. B. Sudderth, and M. I. Jordan. Learning multiscale representations of natural scenes using Dirichlet processes. ICCV 2007. pdf

S. Karayev, M. Fritz, S. Fidler, T. Darrell. A Probabilistic Model for Recursive Factorized Image Features. CVPR 2011. pdf

Categories: Uncategorized

April 11th Talk: Structure from Motion for ambiguous scenes

April 10, 2012

This week I will present several papers on solving structure from motion for ambiguous scenes: for example, a scene with multiple similar or identical objects, where traditional methods tend to "fold" the scene reconstruction.

The following is the main paper I am going to present:

N. Jiang, P. Tan, and L. F. Cheong, "Seeing Double Without Confusion: Structure-from-Motion in Highly Ambiguous Scenes," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012. [PDF]

Related works:

C. Zach, A. Irschara, and H. Bischof, “What can missing correspondences tell us about 3D structure and motion?,” in Computer Vision and Pattern Recognition (CVPR), 2008. [PDF]

C. Zach, M. Klopschitz, and M. Pollefeys, “Disambiguating visual relations using loop constraints,” in Computer Vision and Pattern Recognition (CVPR), 2010. [PDF]

R. Roberts, S. N. Sinha, R. Szeliski, and D. Steedly, “Structure from motion for scenes with large duplicate structures,” in Computer Vision and Pattern Recognition (CVPR), 2011. [PDF]

Categories: Talk Announcement

April 3rd Talk: Object affordance detection

April 3, 2012

“the meaning or value of a thing consists of what it affords.”

– J. J. Gibson (1979)

This week I will present several works related to object affordance detection. Affordance here is defined as the function/utility of an object.

Main paper:

Grabner H., Gall J., and van Gool L., What Makes a Chair a Chair? (PDF, Images/Data), IEEE Conference on Computer Vision and Pattern Recognition (CVPR’11)


Many object classes are primarily defined by their functions. However, this fact has been left largely unexploited by visual object categorization or detection systems. We propose a method to learn an affordance detector. It identifies locations in the 3d space which “support” the particular function. Our novel approach “imagines” an actor performing an action typical for the target object class, instead of relying purely on the visual object appearance. So, function is handled as a cue complementary to appearance, rather than being a consideration after appearance-based detection. Experimental results are given for the functional category “sitting”. Such affordance is tested on a 3d representation of the scene, as can be realistically obtained through SfM or depth cameras. In contrast to appearance-based object detectors, affordance detection requires only very few training examples and generalizes very well to other sittable objects like benches or sofas when trained on a few chairs.
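The abstract's point is that function can be tested geometrically on 3D data rather than inferred from appearance. The toy sketch below is nothing like the paper's actual method, which "imagines" a full articulated sitting pose; it only illustrates the flavor of the idea on a 2.5D height map: a cell is crudely marked "sittable" if its surface sits near typical seat height and is locally flat. The function name, thresholds, and scene are all mine.

```python
import numpy as np

def sittable_mask(height_map, seat_h=0.45, tol=0.10, max_slope=0.05):
    """Crude 'sittable' test on a 2.5D height map (meters per cell):
    a cell qualifies if its surface lies near typical seat height and
    is locally flat (small height gradient)."""
    gy, gx = np.gradient(height_map)
    flat = np.hypot(gx, gy) < max_slope          # locally flat surface
    near_seat = np.abs(height_map - seat_h) < tol  # plausible seat height
    return flat & near_seat

# toy scene: floor at 0 m with a 0.45 m "chair seat" block in the middle
scene = np.zeros((10, 10))
scene[4:7, 4:7] = 0.45
mask = sittable_mask(scene)
# only the flat interior of the block (cell 5,5) survives; the block's
# edges fail the flatness test and the floor fails the height test
```

A real system would of course also check clearance for legs and back, which is exactly the sort of constraint the paper's imagined-actor formulation captures naturally.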

Related work:

Abhinav Gupta, Scott Satkin, Alexei A. Efros and M. Hebert, From Scene Geometry to Human Workspace. In CVPR 2011 [pdf]

Gall J., Fossati A., and van Gool L., Functional Categorization of Objects using Real-time Markerless Motion Capture (PDF, Images/Video/Data). In CVPR 2011.

Categories: Talk Announcement

March 28th Talk: Overview of object co-segmentation

March 25, 2012

This week I will present an overview of object co-segmentation, a topic that has started receiving a lot of attention recently. I will use the following paper, which summarizes most approaches:

The following are the other papers I would touch upon in the talk:

Categories: Uncategorized

Mar. 14th Talk: New directions in object recognition

This week I will cover some new work on object recognition.

Main paper:

“The Truth About Cats and Dogs” by Omkar M. Parkhi et al., ICCV 2011 [pdf]

Template-based object detectors such as the deformable parts model of Felzenszwalb et al. [11] achieve state-of-the-art performance for a variety of object categories, but are still outperformed by simpler bag-of-words models for highly flexible objects such as cats and dogs. In these cases we propose to use the template-based model to detect a distinctive part for the class, followed by detecting the rest of the object via segmentation on image specific information learnt from that part. This approach is motivated by two observations: (i) many object classes contain distinctive parts that can be detected very reliably by template-based detectors, whilst the entire object cannot; (ii) many classes (e.g. animals) have fairly homogeneous coloring and texture that can be used to segment the object once a sample is provided in an image.

We show quantitatively that our method substantially outperforms whole-body template-based detectors for these highly deformable object categories, and indeed achieves accuracy comparable to the state-of-the-art on the PASCAL VOC competition, which includes other models such as bag-of-words.
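The second observation above (homogeneous coloring lets a detected part seed a segmentation of the whole animal) can be illustrated with a very simple color model. The sketch below is not the paper's segmentation machinery; it just fits a Gaussian to the colors inside a hypothetical detected-part region and keeps every pixel whose color is close under Mahalanobis distance. The function, thresholds, and toy image are my own.

```python
import numpy as np

def segment_from_seed(image, seed_mask, thresh=3.0):
    """Grow a rough object mask from a seed (detected-part) region.

    Fit a Gaussian to the colors inside seed_mask, then keep every pixel
    whose Mahalanobis distance to that color model is below `thresh`.
    image: (H, W, 3) float array; seed_mask: (H, W) bool array.
    """
    seed_pixels = image[seed_mask]                      # colors of the detected part
    mu = seed_pixels.mean(axis=0)
    cov = np.cov(seed_pixels, rowvar=False) + 1e-6 * np.eye(3)  # regularized
    inv = np.linalg.inv(cov)
    diff = image - mu
    maha2 = np.einsum('hwc,cd,hwd->hw', diff, inv, diff)  # squared Mahalanobis
    return maha2 < thresh ** 2

# toy image: left half "cat-colored" fur, right half blue background
img = np.zeros((10, 20, 3))
img[:, :10] = [0.8, 0.6, 0.4]
img[:, 10:] = [0.1, 0.2, 0.9]
img += np.random.default_rng(1).normal(0, 0.01, img.shape)  # mild noise
seed = np.zeros((10, 20), bool)
seed[2:8, 2:8] = True          # pretend this is the detected "head"
mask = segment_from_seed(img, seed)
# mask covers (almost all of) the fur half and none of the background
```

In practice one would use a richer model (e.g. GrabCut-style color GMMs with spatial smoothness) rather than a single Gaussian.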


  • “Cat head detection – how to effectively exploit shape and texture features” by W. Zhang et al., ECCV 2008. [pdf]
  • “Stationary features and cat detection” by François Fleuret & Donald Geman, Journal of Machine Learning Research, 2008. [pdf]
  • “Segmentation as Selective Search for Object Recognition” by Koen E. A. van de Sande et al., ICCV 2011. [pdf]
  • “Object Recognition as Ranking Holistic Figure-Ground Hypotheses” by Fuxin Li et al., CVPR 2010. [pdf]
Categories: Uncategorized

Mar. 7th Talk: Manhattan Scene Understanding

This week I will present the stream of work from Alex Flint about indoor scene understanding and 3D reconstruction.

Abstract: This paper (ICCV 2011, oral presentation) addresses scene understanding in the context of a moving camera, integrating semantic reasoning ideas from monocular vision with 3D information available through structure-from-motion. We combine geometric and photometric cues in a Bayesian framework, building on recent successes leveraging the indoor Manhattan assumption in monocular vision. We focus on indoor environments and show how to extract key boundaries while ignoring clutter and decorations. To achieve this we present a graphical model that relates photometric cues learned from labeled data, stereo photo-consistency across multiple views, and depth cues derived from structure-from-motion point clouds. We show how to solve MAP inference using dynamic programming, allowing exact, global inference in 100 ms (in addition to feature computation of under one second) without using specialized hardware. Experiments show our system outperforming the state-of-the-art.
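The "exact MAP inference by dynamic programming" claim rests on the indoor Manhattan structure giving the problem a chain-like form. The sketch below is not the paper's actual model (their DP is over wall-configuration states per column of the image); it only illustrates the generic Viterbi-style idea with a made-up cost model: pick one label per image column, trading off per-column unary costs against a penalty for label changes between adjacent columns.

```python
import numpy as np

def dp_column_labels(unary, pairwise):
    """Viterbi-style DP: choose one label per column minimizing the sum
    of unary costs plus a fixed penalty each time the label changes
    between adjacent columns.  unary: (n_cols, n_labels) array,
    pairwise: scalar change penalty.  Returns the optimal label sequence."""
    n_cols, n_labels = unary.shape
    change = pairwise * (np.arange(n_labels)[:, None] != np.arange(n_labels)[None, :])
    cost = unary[0].copy()                       # best cost ending in each label
    back = np.zeros((n_cols, n_labels), dtype=int)
    for c in range(1, n_cols):
        trans = cost[:, None] + change           # cost of (prev label, cur label)
        back[c] = trans.argmin(axis=0)           # best predecessor per label
        cost = trans.min(axis=0) + unary[c]
    labels = np.zeros(n_cols, dtype=int)         # backtrack the optimum
    labels[-1] = cost.argmin()
    for c in range(n_cols - 1, 0, -1):
        labels[c - 1] = back[c, labels[c]]
    return labels

# toy costs: first two columns prefer label 0, last two prefer label 1
unary = np.array([[0., 5.], [0., 5.], [5., 0.], [5., 0.]])
print(dp_column_labels(unary, pairwise=1.0))  # [0 0 1 1]
```

Exactness comes for free: each column depends only on its neighbor, so the DP enumerates all label sequences implicitly, which is what lets the paper do global inference in milliseconds.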

Paper list:

“Manhattan Scene Understanding Using Monocular, Stereo, and 3D Features”, Alex Flint, David Murray, and Ian Reid, ICCV 2011, oral [pdf]

“A Dynamic Programming Approach to Reconstructing Building Interiors”, Alex Flint, Christopher Mei, David Murray, and Ian Reid, ECCV 2010, [pdf]

“Growing Semantically Meaningful Models for Visual SLAM”, Alex Flint, Christopher Mei, Ian Reid, and David Murray, CVPR 2010, [pdf]

Categories: Uncategorized