In this paper, an online Bayesian formulation is presented to detect and describe the most significant key-frames and shot
boundaries of a video sequence. Visual information is encoded in terms of a reduced number of degrees of freedom in order
to provide robustness to noise, gradual transitions, flashes, camera motion and illumination changes. We present an online
algorithm where images are classified according to their appearance content (pixel values plus shape information) in order
to obtain a structured representation from sequential information. This structured representation is presented on a grid where
nodes correspond to the location of the representative image for each cluster. Since the estimation process simultaneously takes
into account both the clustering and the nodes’ locations in the representation space, key-frames are placed considering visual similarities
among neighbors. This not only provides a powerful tool for video navigation but also offers an organization for subsequent
higher-level analysis such as identifying news items, interviews, etc.
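As a rough illustration of the idea of clustering frames online on a grid whose nodes hold representative images, the following minimal sketch may help. It is not the Bayesian formulation of this paper: it substitutes a simple self-organizing-map style update, and the names init_grid, update and summarize_stream, the deterministic initialization, the learning rate and the neighborhood width are all assumptions made for the example. Each frame is assumed to be already reduced to a short feature vector.

```python
import math

def init_grid(rows, cols, dim):
    """Deterministic prototypes spread evenly over [0, 1] (an assumption)."""
    n = rows * cols
    return {(r, c): [(r * cols + c + 0.5) / n] * dim
            for r in range(rows) for c in range(cols)}

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def update(grid, feature, rate=0.3, sigma=1.0):
    """Assign one frame to its closest node, then pull every prototype
    toward the frame, weighted by grid distance to the winning node."""
    winner = min(grid, key=lambda node: sq_dist(grid[node], feature))
    for node, proto in grid.items():
        grid_d2 = (node[0] - winner[0]) ** 2 + (node[1] - winner[1]) ** 2
        h = rate * math.exp(-grid_d2 / (2.0 * sigma ** 2))
        for i, x in enumerate(feature):
            proto[i] += h * (x - proto[i])
    return winner

def summarize_stream(features, rows=2, cols=2, rate=0.3, sigma=1.0):
    """Process frames one at a time; return per-frame node labels,
    candidate shot boundaries, and one key-frame index per used node."""
    grid = init_grid(rows, cols, len(features[0]))
    labels = [update(grid, f, rate, sigma) for f in features]
    # A boundary is flagged wherever the winning node changes.
    boundaries = [t for t in range(1, len(labels))
                  if labels[t] != labels[t - 1]]
    # Simplification: key-frames are picked after the stream ends, as the
    # member frame closest to the final prototype of its node.
    keyframes = {}
    for node in set(labels):
        members = [t for t, lab in enumerate(labels) if lab == node]
        keyframes[node] = min(
            members, key=lambda t: sq_dist(grid[node], features[t]))
    return labels, boundaries, keyframes
```

For two synthetic shots of five identical frames each, with feature vectors [0.1, 0.1, 0.1] followed by [0.9, 0.9, 0.9], this sketch assigns the shots to opposite corners of the 2 x 2 grid and reports a single boundary at frame 5. Because neighboring nodes are updated together, visually similar clusters tend to end up in adjacent grid positions, which is the property the abstract relies on for navigation.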
This work was supported by CICYT grant TEL99-1206-C02-02 and CERTAP Generalitat de Catalunya.