The problem of finding a specified pattern in a time series database (i.e., query by content) has received much attention
and is now a relatively mature field. In contrast, the important problem of enumerating all surprising or interesting patterns
has received far less attention. This problem requires a meaningful definition of “surprise”, and an efficient search technique.
All previous attempts at finding surprising patterns in time series use a very limited notion of surprise, and/or do not scale
to massive datasets. To overcome these limitations we propose a novel technique that defines a pattern surprising if the frequency
of its occurrence differs substantially from that expected by chance, given some previously seen data. This notion has the
advantage of not requiring the user to explicitly define what is a surprising pattern, which may be hard, or perhaps impossible,
to elicit from a domain expert. Instead, the user gives the algorithm a collection of previously observed “normal” data. Our
algorithm uses a suffix tree to efficiently encode the frequency of all observed patterns and allows a Markov model to predict
the expected frequency of previously unobserved patterns. Once the suffix tree has been constructed, a measure of surprise
for all the patterns in a new database can be determined in time and space linear in the size of the database. We demonstrate
the utility of our approach with an extensive experimental evaluation.
Keywords Time Series - Suffix Tree - Novelty Detection - Anomaly Detection - Markov Model - Feature Extraction