In the
windowed data stream model, we observe items coming in over time. At any time
t, we consider the
window of the last
N observations
a
t-(N-1),
a
t-(N-2), . . . ,
a
t, each
a
i ε 1, . . . ,
u; we are required to support queries about the data in the window. A crucial restriction is that we are only allowed
o(
N) (often polylogarithmic in
N) storage space, so not all items within the window can be archived.
We study two basic problems in the windowed data stream model. The first is the estimation of the rarity of items in the window.
Our second problem is one of estimating similarity between two data stream windows using the Jacard’s coefficient. The problems
of estimating rarity and similarity have many applications in mining massive data sets. We present novel, simple algorithms
for estimating rarity and similarity on windowed data streams, accurate up to factor 1 ± ε using space only logarithmic in
the window size.
This work was done while the author was a DIMACS visitor.