Human-Robot Interaction (HRI) invariably involves dialogue about objects in the environment in which the agents are situated.
The paper focuses on resolving discourse references to such visual objects. It addresses the problem using
strategies for intra-modal fusion (identifying that different occurrences concern the same object) and inter-modal fusion (relating object references across different modalities). Core to these strategies are sensorimotor coordination and
ontology-based mediation between content in different modalities. The approach has been fully implemented and is illustrated
with several working examples.