We consider a router on the Internet analyzing the statistical properties of a TCP/IP packet stream. A fundamental difficulty
with measuring trafic behavior on the Internet is that there is simply too much data to be recorded for later analysis, on
the order of gigabytes a second. As a result, network routers can collect only relatively few statistics about the data. The
central problem addressed here is to use the limited memory of routers to determine essential features of the network traffic
stream. A particularly difficult and representative subproblem is to determine the top
k categories to which the most packets belong, for a desired value of
k and for a given notion of categorization such as the destination IP address.
We present an algorithm that deterministically finds (in particular) all categories having a frequency above 1/(m+1) using
m counters, which we prove is best possible in the worst case. We also present a sampling-based algorithm for the case that
packet categories follow an arbitrary distribution, but their order over time is permuted uniformly at random. Under this
model, our algorithm identifies flows above a frequency threshold of roughly 1/√nm with high probability, where m is the number of counters and n is the number of packets observed. This guarantee is not far off from the ideal of identifying all flows (probability 1/n), and we prove that it is best possible up to a logarithmic factor. We show that the algorithm ranks the identified flows
according to frequency within any desired constant factor of accuracy.
This research is partially supported by the Natural Science and Engineering Research Council of Canada, by the Canada Research
Chairs Program, and by the Nippon Telegraph and Telephone Corporation through the NTT-MIT research collaboration.