Given a large data set and a classification learning algorithm, Progressive Sampling (PS) learns from increasingly larger
random samples until model accuracy no longer improves. The technique has been shown to be remarkably efficient compared
to learning on the entire data set. However, how to set the starting sample size for PS is still an open problem. We show that an improper
starting sample size can still make PS computationally expensive, for two reasons: the learning algorithm is run on a large
total number of instances (summed over the sequence of random samples drawn before convergence), and excessive database scans
are needed to fetch the sample data. A suitable starting sample size can further improve the efficiency of PS. In this paper, we present a statistical
approach that efficiently finds such a size. We call it the Statistical Optimal Sample Size (SOSS), in the sense that
a sample of this size sufficiently resembles the entire data set. We introduce an information-based measure
of this resemblance (Sample Quality) to define the SOSS and show that it can be efficiently obtained in one scan of the data.
We prove that learning on a sample of size SOSS will produce model accuracy that asymptotically approaches the highest achievable
accuracy on the entire data set. Empirical results on a number of large data sets from the UCI KDD repository show that SOSS is
a suitable starting size for Progressive Sampling.
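
To make the cost argument concrete, the following is a minimal sketch of a generic progressive sampling loop, not the implementation used in this paper. The function name `progressive_sample`, the geometric growth factor, the tolerance `tol`, and the `train_and_score` callback (which trains a model on a sample and returns its accuracy) are all assumptions introduced for illustration. The loop records every sample size it tries, so the total number of instances processed can be compared across starting sizes.

```python
import random


def progressive_sample(data, train_and_score, start_size,
                       growth=2.0, tol=0.005, seed=0):
    """Draw increasingly larger random samples until accuracy stops improving.

    Hypothetical helper: `train_and_score(sample)` must return an accuracy
    in [0, 1]. Returns the list of (sample_size, accuracy) pairs tried.
    """
    rng = random.Random(seed)
    n = min(start_size, len(data))
    best = None
    history = []
    while True:
        sample = rng.sample(data, n)       # fresh random sample of size n
        acc = train_and_score(sample)
        history.append((n, acc))
        # Converged: no improvement beyond the tolerance over the best so far.
        if best is not None and acc - best <= tol:
            break
        best = acc if best is None else max(best, acc)
        if n == len(data):                 # the whole data set has been used
            break
        # Geometric schedule (an assumption); always grow by at least one.
        n = min(max(n + 1, int(n * growth)), len(data))
    return history


def total_instances(history):
    """Total number of instances the learner was run on across all samples."""
    return sum(size for size, _ in history)
```

With a toy accuracy curve that saturates at a sample size of 900, starting at 100 instances requires six runs of the learner before convergence is detected, while starting at 1600 requires two. In this sketch, a sample size is accepted only after one further, larger sample confirms that accuracy has stopped improving, so a starting size that is too large also wastes computation.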