Probabilistic classifiers are developed by assuming generative models which are product distributions over the original attribute
space (as in naive Bayes) or more involved spaces (as in general Bayesian networks). While this paradigm has been shown experimentally
successful on real world applications, despite vastly simplified probabilistic assumptions, the question of why these approaches
work is still open.
This paper resolves this question. We show that almost all joint distributions with a given set of marginals (i.e., all distributions that could have given rise to the classifier learned)
or, equivalently, almost all data sets that yield this set of marginals, are very close (in terms of distributional distance)
to the product distribution on the marginals; the number of these distributions goes down exponentially with their distance
from the product distribution. Consequently, as we show, for almost all joint distributions with this set of marginals, the
penalty incurred in using the marginal distribution rather than the true one is small. In addition to resolving the puzzle
surrounding the success of probabilistic classifiers our results contribute to understanding the tradeoffs in developing probabilistic
classifiers and will help in developing better classifiers.