When you see some data on the Web, do you ever wonder how it got there? The chances are that it is in no sense original, but
was copied from some other source, which in turn was copied from some other source, and so on. If you are a scientist using
a scientific database or some other kind of scholar using a digital library, you will probably be keenly interested in this
information because it is crucial to your assessment of the accuracy and timeliness of the data. Data provenance is the understanding
of the history of a piece of data: its origins and the process by which it travelled from database to database. Existing database
tools give us little or no help in recording provenance; indeed, database schemas make it difficult to record this kind of
information. I shall report on some recent work that characterizes data provenance. It is based on a model for data, both
structured and semistructured, which accounts for both the structure and location of data. Using this model, we can draw a
distinction between “why provenance” and “where provenance”. The former expresses all the data in the source databases that
contributed to the existence of the data of interest; the latter specifies the locations from which it was drawn. In particular,
we can take a query in a generic semistructured query language and use it to provide a formal derivation of both forms of
provenance and to derive a number of useful properties of these forms. The work generalizes existing work on relational databases
that is limited to why provenance. This is a report of joint work with Sanjeev Khanna and Wang-Chiew Tan.
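
As a rough illustration of the distinction between why provenance and where provenance, the following Python sketch annotates the result of a small relational join. The tables, the `Location` type, and the `join_with_provenance` function are hypothetical names invented for this example; they are not taken from the work reported here, which uses a more general model that also covers semistructured data. The sketch also simplifies why provenance to a single set of contributing source rows per output row.

```python
from typing import NamedTuple


class Location(NamedTuple):
    """Hypothetical address of one field in a source table."""
    table: str
    row: int
    field: str


# Toy source data, invented for illustration only.
employees = [
    {"name": "Ada", "dept": "R&D"},
    {"name": "Bob", "dept": "Ops"},
]
departments = [
    {"dept": "R&D", "city": "Leeds"},
    {"dept": "Ops", "city": "Bath"},
]


def join_with_provenance(emps, depts):
    """Join employees to departments on 'dept', annotating each output row.

    'why'   : the source rows whose presence made this output row exist.
    'where' : for each output field, the source location it was copied from.
    """
    results = []
    for i, e in enumerate(emps):
        for j, d in enumerate(depts):
            if e["dept"] == d["dept"]:
                results.append({
                    "value": {"name": e["name"], "city": d["city"]},
                    "why": {("employees", i), ("departments", j)},
                    "where": {
                        "name": Location("employees", i, "name"),
                        "city": Location("departments", j, "city"),
                    },
                })
    return results


rows = join_with_provenance(employees, departments)
```

In this sketch the output row for Ada depends on both source rows (its why provenance), whereas the value of its `city` field was copied from exactly one place, the `city` field of the first department row (its where provenance).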