[3/18] Nonlinear dimensionality reduction by locally linear embedding, Roweis & Saul


Topic: Manifold Learning


A. Summary

Many areas of science depend on exploratory data analysis and visualization. The need to analyze large amounts of multivariate data raises the fundamental problem of dimensionality reduction: how to discover compact representations of high-dimensional data. Here, we introduce locally linear embedding (LLE), an unsupervised learning algorithm that computes low-dimensional, neighborhood-preserving embeddings of high-dimensional inputs. Unlike clustering methods for local dimensionality reduction, LLE maps its inputs into a single global coordinate system of lower dimensionality, and its optimizations do not involve local minima. By exploiting the local symmetries of linear reconstructions, LLE is able to learn the global structure of nonlinear manifolds, such as those generated by images of faces or documents of text.
B. Note


Steps of locally linear embedding:
(1) Assign neighbors to each data point Xi (for example by using the K nearest neighbors).
(2) Compute the weights Wij that best linearly reconstruct Xi from its neighbors, solving the constrained least-squares problem in Eq. 1.
(3) Compute the low-dimensional embedding vectors Yi best reconstructed by Wij , minimizing Eq. 2 by finding the smallest eigenmodes of the sparse symmetric matrix in Eq. 3. Although the weights Wij and vectors Yi are computed by methods in linear algebra, the constraint that points are only reconstructed from neighbors can result in highly nonlinear embeddings.

*requires dense feature sets (otherwise it cannot find neighbors)
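
A minimal sketch of these three steps using scikit-learn's LLE implementation (the swiss-roll data and the neighbor count below are illustrative choices, not from the paper):

from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# a 3-D nonlinear manifold to embed
X, _ = make_swiss_roll(n_samples=1000, noise=0.05)

# steps (1)-(3): K nearest neighbors, reconstruction weights Wij,
# then the bottom eigenvectors of the sparse matrix (I - W)^T (I - W)
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2)
Y = lle.fit_transform(X)   # the embedding vectors Yi
print(Y.shape)             # (1000, 2)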

[3/18] Eigenfaces for recognition, M Turk, A Pentland


A. Summary
"We have developed a near-real-time computer system that can locate and track a subject's head, and then recognize the person by comparing characteristics of the face to those of known individuals. The computational approach taken in this system is motivated by both physiology and information theory, as well as by the practical requirements of near-real-time performance and accuracy.

Our approach treats the face recognition problem as an intrinsically two-dimensional (2-D) recognition problem rather than requiring recovery of three dimensional geometry, taking advantage of the fact that faces are normally upright and thus may be described by a small set of 2-D characteristic views.

The system functions by projecting face images onto a feature space that spans the significant variations among known face images. The significant features are known as "eigenfaces," because they are the eigenvectors (principal components) of the set of faces; they do not necessarily correspond to features such as eyes, ears, and noses.

The projection operation characterizes an individual face by a weighted sum of the eigenface features, and so to recognize a particular face it is necessary only to compare these weights to those of known individuals.

Some particular advantages of our approach are that it provides for the ability to learn and later recognize new faces in an unsupervised manner, and that it is easy to implement using a neural network architecture."

B. Note
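
A minimal sketch of the eigenface pipeline described in the summary, assuming the Olivetti faces dataset and 50 principal components as stand-ins (not the paper's data or settings): PCA of the face images gives the eigenfaces, each face is reduced to a weight vector, and recognition is a nearest-neighbour comparison of those weights.

import numpy as np
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA

faces = fetch_olivetti_faces()
X, y = faces.data, faces.target          # each row is a flattened 64x64 face image

pca = PCA(n_components=50, whiten=True)  # the 50 leading eigenfaces
weights = pca.fit_transform(X)           # each known face as a weight vector

def recognize(img_vec):
    # project the query face onto the eigenfaces and return the closest known identity
    w = pca.transform(img_vec.reshape(1, -1))
    return y[np.argmin(np.linalg.norm(weights - w, axis=1))]

print(recognize(X[0]))                   # should recover subject 0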

[3/11] Latent Dirichlet allocation, D. Blei, A. Ng, and M. Jordan.

A. Summary (mostly taken from the ABSTRACT)

This paper describes latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics.
Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document.
We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.

B. Note

Part 1: Corpus generative process model

LDA assumes the following generative process for each document w in a corpus D:
1.  Choose N ~ Poisson(ξ).
2.  Choose θ ~ Dir(α).
3.  For each of the N words wn:
  (a)  Choose a topic zn ~ Multinomial(θ).
  (b)  Choose a word wn from p(wn | zn,β), a multinomial probability conditioned on the topic zn.
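
A minimal sketch of this generative process in NumPy (vocabulary size, topic count, and hyperparameter values are illustrative, not from the paper):

import numpy as np

rng = np.random.default_rng(0)
K, V = 5, 1000                  # number of topics, vocabulary size
alpha, xi = 0.1, 50             # Dirichlet prior parameter, Poisson mean
beta = rng.dirichlet(np.ones(V) * 0.01, size=K)   # per-topic word distributions

def generate_document():
    N = rng.poisson(xi)                         # 1. N ~ Poisson(xi)
    theta = rng.dirichlet(np.ones(K) * alpha)   # 2. theta ~ Dir(alpha)
    words = []
    for _ in range(N):                          # 3. for each of the N words
        z = rng.choice(K, p=theta)              #    (a) topic z_n ~ Multinomial(theta)
        w = rng.choice(V, p=beta[z])            #    (b) word w_n ~ p(w | z_n, beta)
        words.append(w)
    return words

print(generate_document()[:10])                 # first 10 word ids of a sampled document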


Part 2: Parameter estimation

The parameters are fitted by maximizing the marginal likelihood p(D | α,β) rather than the posterior p(α,β | D); by Bayes' theorem the two differ only by the prior over (α,β), so under a flat prior they share the same maximizer (this is the empirical Bayes approach used in the paper).

But p(w | α,β) is intractable, so a variational distribution q(θ, z | γ, φ) is used to approximate the posterior over θ and z; the variational parameters γ and φ are fitted per document, and α, β are updated in the M-step of a variational EM procedure.
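
A minimal sketch of fitting LDA with variational inference, using scikit-learn's implementation as a stand-in for the paper's variational EM (the toy corpus and settings below are illustrative):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the cat sat on the mat", "dogs and cats are pets",
        "stocks fell as markets closed", "investors traded stocks all day"]
X = CountVectorizer().fit_transform(docs)           # bag-of-words counts

lda = LatentDirichletAllocation(n_components=2, max_iter=50, random_state=0)
doc_topics = lda.fit_transform(X)                   # normalized variational gamma per document
print(doc_topics.round(2))                          # per-document topic proportions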

[3/11] Probabilistic latent semantic indexing, T. Hofmann

A. Summary (mostly taken from the ABSTRACT)

Probabilistic Latent Semantic Indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data.

Fitted from a training corpus of text documents by a generalization of the Expectation Maximization algorithm, the utilized model is able to deal with domain-specific synonymy as well as with polysemous words.

In contrast to standard Latent Semantic Indexing (LSI) by Singular Value Decomposition, the probabilistic variant has a solid statistical foundation and defines a proper generative data model.

Retrieval experiments on a number of test collections indicate substantial performance gains over direct term matching methods as well as over LSI.

In particular, the combination of models with different dimensionalities has proven to be advantageous.

B. Note

LSA
  document->concept->word
  SVD (singular value decomposition)
pLSA
  EM algorithm
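
A minimal sketch of pLSA fitted by plain EM on a toy count matrix (random data; Hofmann's tempered EM and real corpora are not reproduced here):

import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(8, 20)).astype(float)  # toy document-word count matrix
D, W = X.shape
K = 3                                                # number of latent aspects z

# parameters P(z|d) and P(w|z), randomly initialized and normalized
p_z_d = rng.random((D, K))
p_z_d /= p_z_d.sum(axis=1, keepdims=True)
p_w_z = rng.random((K, W))
p_w_z /= p_w_z.sum(axis=1, keepdims=True)

for _ in range(100):
    # E-step: posterior P(z|d,w) proportional to P(z|d) * P(w|z)
    post = p_z_d[:, :, None] * p_w_z[None, :, :]     # shape (D, K, W)
    post /= post.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from expected counts n(d,w) * P(z|d,w)
    counts = X[:, None, :] * post                    # shape (D, K, W)
    p_z_d = counts.sum(axis=2)
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = counts.sum(axis=0)
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)

print(p_w_z.round(2))   # learned aspect-word distributions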

[3/4] Efficient visual search of videos cast as text retrieval, J. Sivic, and A. Zisserman'09

A. Summary (mostly taken from the ABSTRACT)

We describe an approach to object retrieval which searches for and localizes all the occurrences of an object in a video, given a query image of the object.

The object is represented by a set of viewpoint invariant region descriptors so that recognition can proceed successfully despite changes in viewpoint, illumination and partial occlusion.

The temporal continuity of the video within a shot is used to track the regions in order to reject those that are unstable.

Efficient retrieval is achieved by employing methods from statistical text retrieval, including inverted file systems, and text and document frequency weightings.

This requires a visual analogy of a word which is provided here by vector quantizing the region descriptors.

The final ranking also depends on the spatial layout of the regions. The result is that retrieval is immediate, returning a ranked list of shots in the manner of Google [6].

We report results for object retrieval on the full length feature films ‘Groundhog Day’, ‘Casablanca’ and ‘Run Lola Run’, including searches from within the movie and specified by
external images downloaded from the Internet.

We investigate retrieval performance with respect to different quantizations of region descriptors and compare the performance of several ranking measures. Performance is also compared to a baseline method implementing standard frame to frame matching.

B. Conclusion (rewritten from the CONCLUSION, plus some comments)

1. Object retrieval, based on vector-quantized viewpoint invariant descriptors.

2. Vector quantization does not appear to introduce a significant loss in retrieval performance (precision or recall) compared to nearest neighbour matching.

Currently, descriptors are assigned to the nearest cluster centre using linear search. Recently, however, efficient search methods have been used: hierarchical tree-structured vocabularies [24], vocabularies indexed by randomized trees [28], and descriptor indexing by decision trees [14], [26].
Hierarchical vocabularies [11], [24] can also reduce descriptor quantization effects and can, to some extent, overcome the difficulty of choosing the number of cluster centres.
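
A minimal sketch of the visual-word step this discussion refers to: descriptors are quantized by k-means into a vocabulary, each frame is described by a word histogram, and histograms are weighted tf-idf style. Random placeholder descriptors stand in for real SIFT-like features; the vocabulary size and frame counts are illustrative.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfTransformer

rng = np.random.default_rng(0)
descriptors = rng.random((5000, 128))        # pooled 128-D descriptors from training frames

vocab = KMeans(n_clusters=100, n_init=10, random_state=0).fit(descriptors)   # visual vocabulary

def bow_histogram(frame_desc):
    # hard-assign each descriptor to its nearest cluster centre (linear search)
    words = vocab.predict(frame_desc)
    return np.bincount(words, minlength=vocab.n_clusters)

frames = [rng.random((200, 128)) for _ in range(10)]   # descriptors of 10 frames
counts = np.vstack([bow_histogram(f) for f in frames])
tfidf = TfidfTransformer().fit_transform(counts)       # term/document frequency weighting
print(tfidf.shape)                                     # (10, 100)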

The spatial consistency re-ranking was shown to be very effective in improving the precision and removing false positive matches. However, the precision could be further improved by a more thorough (and more expensive) verification, based on a stricter measure of spatial similarity (e.g. angular ordering of regions [35], region overlap [10], deformable mesh matching [29], or common affine geometric transformation [18], [28]).

Unless the system is being designed solely to retrieve rigid objects, care must be taken not to remove true positive matches on deformable objects, such as people’s clothing, by using measures that apply only to rigid geometry. To reduce the computational cost, verification can be implemented as a sequence of progressively more thorough (and more expensive) filtering stages.

Spatially verified returns can be used to automatically expand the initial user-given query
with additional visual words leading to a significantly improved retrieval performance [8].

[3/4] Distinctive Image Features from Scale-Invariant Keypoints, Lowe'04

A. Summary (mostly taken from the ABSTRACT)

This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene.

Major stages of computation used to generate the set of image features:
1. Scale-space extrema detection:
searches over all scales and image locations using a difference-of-Gaussian (DoG) function
2. Keypoint localization
3. Orientation assignment (based on local image gradient directions)
4. Keypoint descriptor

The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination.

The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images.

This paper also describes an approach to using these features for object recognition.
The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters.

This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.
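
A minimal sketch of the detection-and-matching step using OpenCV's SIFT (the image file names are placeholders; the 0.8 distance-ratio threshold follows the paper's ratio test):

import cv2

img1 = cv2.imread("query.png", cv2.IMREAD_GRAYSCALE)    # placeholder file names
img2 = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)            # keypoints and 128-D descriptors
kp2, des2 = sift.detectAndCompute(img2, None)

# match each query descriptor to its two nearest neighbors and keep only
# matches that pass the distance-ratio test (rejects ambiguous matches)
matcher = cv2.BFMatcher(cv2.NORM_L2)
matches = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.8 * n.distance]
print(len(good), "putative matches")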


B. Conclusion (rewritten from the CONCLUSION, plus some comments)

1. Distinctiveness, which enables the correct match for a keypoint to be selected from a large database of other keypoints.

2. Invariant to image rotation and scale.
Scale invariance is achieved by applying Gaussian kernels at various scales (DoG extrema), and rotation invariance by assigning each keypoint a dominant gradient orientation.

3. Robust across a substantial range of affine distortion, addition of noise, and change in
illumination.

4. Large numbers of keypoints lead to robustness in extracting small objects among clutter.
However, using bags of feature points raises efficiency issues when matching keypoints against a large-scale database.

5. The research of Weber, Welling, and Perona (2000) and Fergus, Perona, and Zisserman (2003) has shown the potential of this approach by learning small sets of local features that are suited to recognizing generic classes of objects.

[3/4] Lecture 02: Local Features and Visual Words

What is the most important characteristic of a feature?
efficiency? yes
dimensionality? (yes, but it is usually high for images and video)
invariance? !! scale, rotation

*Local features began to be widely used around 2001.

bag of feature points
problem: efficiency; thousands of them need to be matched

Local Feature = Detector + Descriptor (where they are, how to represent them)

Detector:
point-based
region-based
scale invariant: LoG, DoG(for low quality), Harris-Laplace
affine invariant: Harris-Affine, Hessian-Affine, ..., etc

*DoG (Difference of Gaussians) detector:
apply various kernels and scales
handles scale invariance

Descriptor:
*SIFT: empirically 4x4 (windows) x 8 (bins) = 128 dimensions
handles rotation invariance


Visual Word
  quantized local descriptor (by clustering) e.g. SIFT
*some issues
  the more, the better?
    yes, but gains saturate in object retrieval and can even degrade classification
  efficiency issues
    algorithmic improvements or parallelization
  sampling method
    sparse, dense, or random
  feature selection and reduction
    pLSA and LDA
  low recall
    soft assignment, query expansion (a sketch of soft assignment follows below)
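
A minimal sketch of the soft-assignment idea mentioned above: each descriptor votes for its k nearest visual words with distance-based weights instead of a single hard assignment (the toy vocabulary, k, and sigma are illustrative):

import numpy as np

def soft_assign(descriptor, centres, k=3, sigma=1.0):
    # weight vector over the visual vocabulary for one descriptor
    d2 = np.sum((centres - descriptor) ** 2, axis=1)        # squared distance to every centre
    nearest = np.argsort(d2)[:k]                            # the k closest visual words
    # shift by the minimum distance for numerical stability; the shift cancels after normalization
    w = np.exp(-(d2[nearest] - d2[nearest].min()) / (2 * sigma ** 2))
    hist = np.zeros(len(centres))
    hist[nearest] = w / w.sum()                             # normalized soft votes
    return hist

rng = np.random.default_rng(0)
centres = rng.random((100, 128))                            # toy vocabulary of 100 visual words
print(soft_assign(rng.random(128), centres).max())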