This paper addresses the problem of learning a statistical distribution of the data in a relational database. The data of
interest are represented as trees, which are a natural way to represent structured information. These trees are then used
to infer a stochastic tree automaton with a well-known grammatical inference algorithm. We propose two extensions
of this algorithm: the use of sorts and the generalization of the inferred automaton according to a local criterion. Our
experiments show that our approach scales to large databases and improves both the predictive power of the learned model and
the convergence of the learning algorithm.
Keywords: Stochastic tree automata - multi-relational data mining - generalization - sorts