Identifier attributes (very high-dimensional categorical attributes such as particular product ids or people's names) are rarely
incorporated in statistical modeling. However, they can play an important role in relational modeling: it may be informative
to have communicated with a particular set of people or to have purchased a particular set of products. A key limitation of
existing relational modeling techniques is how they aggregate bags (multisets) of values from related entities. The aggregations
used by existing methods are simple summaries of the distributions of features of related entities: e.g., MEAN, MODE, SUM,
or COUNT. This paper's main contribution is the introduction of aggregation operators that capture more information about
the value distributions, by storing meta-data about value distributions and referencing this meta-data when aggregating, for
example by computing class-conditional distributional distances. Such aggregations are particularly important for bags of
values from high-dimensional categorical attributes, for which the simple aggregates provide little information. In the first
half of the paper we provide general guidelines for designing aggregation operators, introduce the new aggregators in the
context of the relational learning system ACORA (Automated Construction of Relational Attributes), and provide theoretical
justification. We also conjecture special properties of identifier attributes, e.g., that they proxy for unobserved attributes
and for information deeper in the relationship network. In the second half of the paper we provide extensive empirical evidence
that the distribution-based aggregators do indeed facilitate modeling with high-dimensional categorical attributes, and evidence
in support of the aforementioned conjectures.
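
To make the contrast between simple and distribution-based aggregation concrete, the following is a minimal sketch and not ACORA's implementation. The toy data, the function names, and the choice of cosine distance are all assumptions made for illustration. It stores per-class reference counts as the meta-data and then represents a bag of identifiers by its distance to each class's reference distribution.

```python
from collections import Counter
import math

# Hypothetical toy data: each customer has a bag (multiset) of product ids
# and a binary class label. None of these values come from the paper.
bags = {
    "c1": ["p1", "p2", "p2"],
    "c2": ["p2", "p3"],
    "c3": ["p4", "p5"],
    "c4": ["p4", "p4", "p6"],
}
labels = {"c1": 1, "c2": 1, "c3": 0, "c4": 0}


def class_reference_vectors(bags, labels):
    """Store meta-data: summed value counts over all training bags of each class."""
    refs = {}
    for entity, bag in bags.items():
        refs.setdefault(labels[entity], Counter()).update(bag)
    return refs


def cosine_distance(a, b):
    """Cosine distance between two sparse count vectors (assumed distance measure)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty bag as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def aggregate(bag, refs):
    """Map a bag of identifiers to one numeric feature per class."""
    counts = Counter(bag)
    return {cls: cosine_distance(counts, ref) for cls, ref in refs.items()}


refs = class_reference_vectors(bags, labels)
new_bag = ["p2", "p3", "p3"]

simple_count = len(new_bag)          # COUNT: says nothing about which products
features = aggregate(new_bag, refs)  # distances to the class-conditional references
```

In this sketch COUNT returns 3 for the new bag regardless of which products it contains, whereas the distance features place it closer to class 1 than to class 0. In practice the reference vectors would be built from training data only, so that the labels of the entities being scored do not leak into their own features.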
Keywords: identifiers, relational learning, aggregation, networks
Editors: Hendrik Blockeel, David Jensen and Stefan Kramer
An erratum to this article is available at http://dx.doi.org/10.1007/s10994-006-8633-8.