Multilevel Integration of Vision and Speech Understanding Using Bayesian Networks
Sven Wachsmuth (5), Hans Brandt-Pook (5), Gudrun Socher (5, 6), Franz Kummert (5) and Gerhard Sagerer (5)
(5) Technical Faculty, University of Bielefeld, P.O. Box 100131, 33501 Bielefeld, Germany
(6) Vidam Communications Inc., 2 N 1st St., San Jose, CA, 95113
Abstract
The interaction of image and speech processing is a crucial property of multimedia systems. Classical systems that draw inferences
on purely qualitative high-level descriptions lose much information when faced with erroneous, vague, or incomplete
data. We propose a new architecture that integrates various levels of processing by using multiple representations of the
visually observed scene. These representations are vertically connected by Bayesian networks in order to find the most plausible
interpretation of the scene.
The interpretation of a spoken utterance naming an object in the visually observed scene is modeled as another partial representation
of the scene. Under this concept, the key problem is identifying the verbally specified object instances in the
visually observed scene. To this end, a Bayesian network is generated dynamically from the spoken utterance and the visual scene
representation. This network encodes spatial knowledge as well as knowledge extracted from psycholinguistic experiments.
First results show the robustness of our approach.
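
The abstract does not give the structure of the generated network, so the following is only a minimal sketch of the general idea of identifying a verbally specified object under uncertainty. It assumes a naive-Bayes simplification in which the uttered type and color words are conditionally independent given the intended object. The scene contents, the names `P_TYPE`, `P_COLOR` and `identify`, and all probability values are hypothetical illustrations, not taken from the paper; spatial relations are left out entirely.

```python
# Illustrative sketch only: names and probabilities are hypothetical,
# not the network described by the authors.

# Objects reported by the vision component (labels may be erroneous).
scene = [
    {"id": 0, "type": "bolt", "color": "red"},
    {"id": 1, "type": "bolt", "color": "orange"},
    {"id": 2, "type": "cube", "color": "red"},
]

# Assumed P(uttered word | visually recognized label). Off-diagonal mass
# stands in for recognition errors and for vague naming by speakers.
P_TYPE = {
    "bolt": {"bolt": 0.9, "cube": 0.1},
    "cube": {"bolt": 0.1, "cube": 0.9},
}
P_COLOR = {
    "red": {"red": 0.7, "orange": 0.3},
    "orange": {"red": 0.3, "orange": 0.7},
}


def identify(scene, said_type=None, said_color=None):
    """Return P(intended object = id | utterance) for every scene object."""
    scores = []
    for obj in scene:
        score = 1.0 / len(scene)  # uniform prior over scene objects
        if said_type is not None:
            score *= P_TYPE[obj["type"]][said_type]
        if said_color is not None:
            score *= P_COLOR[obj["color"]][said_color]
        scores.append(score)
    total = sum(scores)  # normalize so the posterior sums to one
    return {obj["id"]: s / total for obj, s in zip(scene, scores)}


# Utterance: "the red bolt"
posterior = identify(scene, said_type="bolt", said_color="red")
best = max(posterior, key=posterior.get)
```

Because the evidence is soft, an object whose visual label only partly matches the utterance (here the orange bolt) keeps a nonzero probability instead of being ruled out, which is the kind of information a purely qualitative matching step would discard.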
The work of G. Socher has been supported by the German Research Foundation (DFG).