We present a text-dependent speaker verification system based on Hidden Markov Models (HMMs). A set of features based on the
durations of context-dependent phonemes (triphones) is used to distinguish among speakers. We tested our approach on the
YOHO corpus: the HMM-based system achieved an equal error rate (EER) of 0.68% using conventional (acoustic) features alone,
and an EER of 0.32% when the temporal features were combined with the acoustic features. This compares well with state-of-the-art
results on the same test, and shows the value of the temporal features for speaker verification. These features may also be
useful for other purposes, such as the detection of replay attacks, or for improving the robustness of speaker-verification
systems to channel or speaker variations. Our results confirm earlier findings on text-independent speaker recognition
[1] and text-dependent speaker verification [2] tasks; we also suggest a number of possible further improvements.
Keywords: Speaker verification, triphones, time durations, Hidden Markov Models
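
For readers unfamiliar with the metric, the equal error rate quoted above is the operating point at which the false acceptance rate (impostors accepted) equals the false rejection rate (genuine speakers rejected). The following is a minimal sketch of how an EER can be estimated from two lists of verification scores. The function name, the toy scores, and the convention that higher scores favour the claimed speaker are assumptions made for illustration; this is not the evaluation code used to obtain the YOHO results.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Estimate the EER and the threshold at which FAR and FRR are closest.

    Assumption: a higher score means "more likely the claimed speaker".
    """
    # Every observed score is a candidate decision threshold.
    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    best = None
    for t in thresholds:
        # False rejection: a genuine trial scoring below the threshold.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        # False acceptance: an impostor trial scoring at or above the threshold.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # Report the midpoint of FAR and FRR at the closest crossing.
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]


# Toy scores invented for illustration only (not YOHO data).
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.6, 0.3, 0.2, 0.1]
eer, threshold = equal_error_rate(genuine, impostor)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

With a finite set of trials the two error rates rarely cross exactly, so this sketch reports the midpoint of the two rates at the threshold where they are closest; other conventions, such as interpolating along the ROC curve, are also in common use.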