This paper addresses real-time speaker segmentation and speaker tracking in audio content analysis when neither the number of
speakers nor their identities are known in advance. Speaker segmentation detects the speaker change boundaries in a speech
stream. It is performed by a two-step algorithm consisting of potential change
detection and refinement. Speaker tracking is then performed based on the results of speaker segmentation by identifying the
speaker of each segment. Our approach introduces incremental speaker model updating and segmental clustering, which make
unsupervised speaker segmentation and tracking feasible in real time. A Bayesian fusion method is also proposed
to fuse multiple audio features to obtain a more reliable result, and different noise levels are utilized to compensate for
background mismatch. Experiments show that the proposed algorithm can recall 89% of speaker change boundaries with 15% false
alarms, and 76% of speakers can be identified without supervision, with 20% false alarms. Compared with previous work, the
algorithm also has low computational complexity and runs in 15% of real time with a very limited analysis delay.
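
The recall and false-alarm figures quoted above depend on how detected boundaries are matched against reference boundaries. The scoring protocol is not stated in this abstract, so the following is only a minimal sketch under assumptions: the function name score_boundaries, the 1-second tolerance, the greedy nearest-match rule, and the definition of the false-alarm rate as the fraction of detected boundaries with no matching reference boundary are all hypothetical and may differ from what the experiments actually used.

```python
from typing import Sequence, Tuple


def score_boundaries(
    reference: Sequence[float],
    detected: Sequence[float],
    tolerance: float = 1.0,  # assumed matching window in seconds
) -> Tuple[float, float]:
    """Return (recall, false_alarm_rate) for speaker change boundaries.

    Assumed definitions (not taken from the paper):
      recall           = matched reference boundaries / all reference boundaries
      false_alarm_rate = unmatched detected boundaries / all detected boundaries
    """
    ref = sorted(reference)
    matched = set()  # indices of reference boundaries already claimed
    false_alarms = 0

    for d in sorted(detected):
        best = None
        for i, r in enumerate(ref):
            if i in matched or abs(r - d) > tolerance:
                continue
            # Keep the closest still-unmatched reference boundary.
            if best is None or abs(r - d) < abs(ref[best] - d):
                best = i
        if best is None:
            false_alarms += 1
        else:
            matched.add(best)

    recall = len(matched) / len(ref) if ref else 0.0
    false_alarm_rate = false_alarms / len(detected) if detected else 0.0
    return recall, false_alarm_rate


# Illustrative timestamps in seconds (made-up values, not experimental data).
reference = [10.0, 25.0, 40.0, 60.0]
detected = [10.4, 24.2, 33.0, 61.5]
recall, false_alarm_rate = score_boundaries(reference, detected)
```

Under these assumptions, each reference boundary can be claimed by at most one detection, so several detections clustered around one true change count as one hit plus false alarms rather than as multiple hits.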
Keywords: Audio content analysis - Audio indexing - Speaker segmentation - Speaker change detection - Speaker tracking
Published online: 12 January 2005
Part of the work presented in this paper was published in the 10th ACM International Conference on Multimedia, 1-6 December
2002