In the frequent string mining problem, one is given
m databases
D1,...,Dm{\cal D}_1,...,{\cal D}_m
of strings and searches for strings that fulfill certain frequency constraints. The constraints consist of
m pairs of thresholds
(minf1,maxf1),(\mathit{minf}_1,\mathit{maxf}_1),
...,(minfm,maxfm)...,(\mathit{minf}_m,\mathit{maxf}_m)
and one wants to find all strings
φ that satisfy
minfi £ freq(f, Di) £ maxfi\mathit{minf}_i \le \mathit{freq}(\phi, {\cal D}_i) \le \mathit{maxf}_i
for all
i with 1 ≤
i ≤
m, where
freq(f,Di) = |{ y Î Di : f is a substring of y}|\mathit{freq}(\phi,\mathcal{D}_i) = |\{ \psi \in \mathcal{D}_i : \phi \mbox{ is a substring of } \psi \}|
.
Fischer et al. [2] presented an algorithm that solves the frequent string mining problem in linear time under the assumption
that the number of databases is treated as a constant. The space consumption of this algorithm, however, is proportional to
the total size of all databases. We improve this algorithm in such a way that its space consumption is proportional to the
size of the largest database, and it takes linear time regardless of the number of databases. Also, our algorithm is more
flexible in the sense that one of several databases can be replaced without having to recalculate everything, that is, intermediate
data can be stored on file and be reused.
This is an extended abstract of an article published in the Data Mining and Knowledge Discovery journal [1].