The effectiveness of a video retrieval system largely depends on the choice of underlying text and image retrieval components.
The unique properties of video collections (e.g., multiple sources, noisy features and temporal relations) suggest we examine
the performance of these retrieval methods in such a multimodal environment, and identify the relative importance of the underlying
retrieval components. In this paper, we review a variety of text/image retrieval approaches as well as their individual components
in the context of broadcast news video. We discuss numerous components of text/image retrieval in detail, including
retrieval models, text sources, temporal expansion methods, query expansion methods, image features, and similarity measures.
For each component, we conduct a series of retrieval experiments on TRECVID video collections to identify its advantages
and disadvantages. To provide more complete coverage of video retrieval, we also briefly discuss an emerging approach called
concept-based video retrieval, and review strategies for combining multiple retrieval outputs.
Keywords Video retrieval · Text retrieval · Image retrieval · Concept-based retrieval · Fusion · Review