Many high-level representations of time series have been proposed for data mining, including Fourier transforms, wavelets,
eigenwaves, piecewise polynomial models, etc. Many researchers have also considered symbolic representations of time series,
noting that such representations would potentially allow researchers to avail themselves of the wealth of data structures and algorithms
from the text processing and bioinformatics communities. While many symbolic representations of time series have been introduced
over the past decades, they all suffer from two fatal flaws. First, the dimensionality of the symbolic representation is the
same as that of the original data, and virtually all data mining algorithms scale poorly with dimensionality. Second, although distance
measures can be defined on the symbolic representations, these distance measures have little correlation with distance measures
defined on the original time series.
In this work we formulate a new symbolic representation of time series. Our representation is unique in that it allows dimensionality/numerosity
reduction, and it also allows distance measures to be defined on the symbolic representation that lower bound the corresponding distance
measures defined on the original series. As we shall demonstrate, this latter feature is particularly exciting because it
allows one to run certain data mining algorithms on the efficiently manipulated symbolic representation, while producing identical
results to the algorithms that operate on the original data. In particular, we will demonstrate the utility of our representation
on the data mining tasks of clustering, classification, query by content, anomaly detection, motif discovery, and visualization.
Keywords: Time series, Data mining, Symbolic representation, Discretize
Responsible editor: Johannes Gehrke.