A. Summary (mostly taken from the ABSTRACT)
We describe an approach to object retrieval which searches for and localizes all the occurrences of an object in a video, given a query image of the object.
The object is represented by a set of viewpoint invariant region descriptors so that recognition can proceed successfully despite changes in viewpoint, illumination and partial occlusion.
The temporal continuity of the video within a shot is used to track the regions in order to reject those that are unstable.
Efficient retrieval is achieved by employing methods from statistical text retrieval, including inverted file systems, and text and document frequency weightings.
This requires a visual analogy of a word which is provided here by vector quantizing the region descriptors.
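The "visual word" idea can be sketched as follows: each region descriptor is assigned to its nearest cluster centre, and the index of that centre serves as the word id. This is a hypothetical toy sketch using 2-D points in place of real 128-D SIFT descriptors and a vocabulary of only three centres.

```python
import math

def quantize(descriptor, centres):
    """Assign a region descriptor to its nearest cluster centre by
    linear search; the centre's index is the 'visual word' id.
    (Toy sketch: real descriptors are ~128-D, vocabularies ~10^4 words.)"""
    best_id, best_dist = -1, float("inf")
    for word_id, centre in enumerate(centres):
        d = math.dist(descriptor, centre)
        if d < best_dist:
            best_id, best_dist = word_id, d
    return best_id

# toy 2-D example: three cluster centres, one descriptor
centres = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
print(quantize((4.2, 5.1), centres))  # -> 1 (nearest centre is index 1)
```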
The final ranking also depends on the spatial layout of the regions. The result is that retrieval is immediate, returning a ranked list of shots in the manner of Google [6].
We report results for object retrieval on the full-length feature films ‘Groundhog Day’, ‘Casablanca’ and ‘Run Lola Run’, including searches specified from within the movie and by external images downloaded from the Internet.
We investigate retrieval performance with respect to different quantizations of region descriptors and compare the performance of several ranking measures. Performance is also compared to a baseline method implementing standard frame-to-frame matching.
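The text-retrieval machinery mentioned above (inverted files plus term/document frequency weighting) can be sketched in a few lines. This is a minimal, hypothetical illustration: shots are bags of visual-word ids, and each shot is scored by a tf-idf dot product with the query.

```python
import math
from collections import defaultdict, Counter

def build_index(docs):
    """docs: {shot_id: [visual word ids]} -> inverted file + idf weights."""
    inverted = defaultdict(set)
    for shot, words in docs.items():
        for w in words:
            inverted[w].add(shot)
    n = len(docs)
    idf = {w: math.log(n / len(shots)) for w, shots in inverted.items()}
    return inverted, idf

def rank(query_words, docs, inverted, idf):
    """Score each shot by a tf-idf dot product with the query: only shots
    sharing at least one visual word with the query are ever touched,
    which is what makes the inverted file efficient."""
    scores = Counter()
    for w, qcount in Counter(query_words).items():
        for shot in inverted.get(w, ()):
            tf = docs[shot].count(w) / len(docs[shot])
            scores[shot] += qcount * tf * idf.get(w, 0.0)
    return [s for s, _ in scores.most_common()]

docs = {"shot1": [1, 2, 2, 3], "shot2": [3, 4], "shot3": [1, 5]}
inverted, idf = build_index(docs)
print(rank([2, 3], docs, inverted, idf))  # -> ['shot1', 'shot2']
```

Shot 3 shares no word with the query, so it is never scored at all.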
B. Conclusion (rewritten from the CONCLUSION, plus some comments)
1. Object retrieval based on vector-quantized viewpoint invariant descriptors.
2. Vector quantization does not appear to introduce a significant loss in retrieval performance (precision or recall) compared to nearest neighbour matching.
Currently, descriptors are assigned to the nearest cluster centre by linear search. Recently, however, efficient search methods have been used:
- hierarchical tree-structured vocabularies [24]
- vocabularies indexed by randomized trees [28]
- descriptor indexing by decision trees [14], [26]
Hierarchical vocabularies [11], [24] can also reduce descriptor quantization effects and can, to some extent, overcome the difficulty with choosing the number of cluster centres.
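A hierarchical vocabulary replaces the flat linear search with a descent through a tree of cluster centres, in the spirit of [24]. The sketch below is a hypothetical toy (2-D points, a 2-level tree with branching factor 2); its point is the cost structure: O(branching × depth) distance computations instead of a linear scan over all leaves.

```python
import math

class VocabNode:
    """One node of a tree-structured vocabulary (a minimal sketch)."""
    def __init__(self, centre, children=None, word_id=None):
        self.centre = centre
        self.children = children or []
        self.word_id = word_id        # set on leaves only

def assign(node, descriptor):
    """Descend the tree, choosing the nearest child centre at each level;
    the leaf reached gives the visual word id."""
    while node.children:
        node = min(node.children,
                   key=lambda c: math.dist(c.centre, descriptor))
    return node.word_id

# toy 2-level tree over 2-D 'descriptors': 2 branches x 2 leaves = 4 words
tree = VocabNode((0, 0), [
    VocabNode((0.0, 0.0), [VocabNode((0, 0), word_id=0),
                           VocabNode((0, 3), word_id=1)]),
    VocabNode((9.0, 9.0), [VocabNode((9, 6), word_id=2),
                           VocabNode((9, 9), word_id=3)]),
])
print(assign(tree, (8.5, 7.0)))  # -> 2
```

Because intermediate nodes are themselves coarser words, scoring can also mix several vocabulary levels, which softens hard quantization boundaries.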
The spatial consistency re-ranking was shown to be very effective in improving the precision and removing false positive matches. However, the precision could be further improved by a more thorough (and more expensive) verification, based on a stricter measure of spatial similarity (e.g. angular ordering of regions [35], region overlap [10], deformable mesh matching [29], or common affine geometric transformation [18], [28]).
Unless the system is being designed solely to retrieve rigid objects, care must be taken not to remove true positive matches on deformable objects, such as people’s clothing, by using measures that apply only to rigid geometry. To reduce the computational cost, verification can be implemented as a sequence of progressively more thorough (and more expensive) filtering stages.
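As a first, cheap stage of such a cascade, the loose spatial consistency check can be sketched as pairwise voting: each pair of matches scores if the two regions are spatial neighbours in both the query and the retrieved frame. This is a hypothetical simplification of the paper's scheme (positions and radius are made up for illustration); it deliberately tolerates non-rigid deformation, unlike a strict affine verification.

```python
def spatial_consistency(matches, radius=50.0):
    """Loose spatial verification score: count pairs of matches whose
    regions lie within `radius` of each other in BOTH frames.
    `matches` is a list of ((qx, qy), (rx, ry)) matched region positions."""
    def near(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2 <= radius ** 2
    score = 0
    for i, (q1, r1) in enumerate(matches):
        for q2, r2 in matches[i + 1:]:
            if near(q1, q2) and near(r1, r2):
                score += 1    # this pair is spatially consistent
    return score

# three mutually consistent matches plus one spatial outlier
matches = [((10, 10), (100, 100)),
           ((20, 15), (110, 108)),
           ((15, 25), (105, 115)),
           ((200, 200), (30, 400))]   # outlier contributes nothing
print(spatial_consistency(matches))   # -> 3
```

Only the shots surviving this cheap filter would then be passed to a stricter (and more expensive) stage such as affine geometric verification.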
Spatially verified returns can be used to automatically expand the initial user-given query with additional visual words, leading to significantly improved retrieval performance [8].