Conversational dialogue systems cannot be evaluated in a fully formal manner, because dialogue is heavily dependent on context
and current dialogue theory is not precise enough to specify a target output ahead of time. Instead, we evaluate dialogue
systems in a semi-formal manner, using human judges to rate the coherence of a conversational character and correlating these
judgments with measures extracted from within the system. We present a series of three evaluations of a single conversational
character over the course of a year, demonstrating how this kind of evaluation helps improve overall dialogue coherence.