While a number of scientific workflow systems support data provenance, they primarily focus on collecting and querying provenance
for single workflow runs. Scientific research projects, however, typically involve (1) many interrelated workflows, in which
data from one or more workflow runs are selected and used as input to subsequent runs, and (2) tasks between workflow runs
that cannot be fully automated. This paper addresses the need for recording data dependencies across multiple workflow runs
and accommodating data management activities performed between runs. We define a new conceptual model for representing project-level
provenance based on the notion of project histories and folders, and describe mechanisms to support this model in the collection-oriented
modeling and design framework of Kepler. Our approach allows users to conveniently organize their projects and data using the familiar folder-hierarchy metaphor,
while at the same time integrating this information with detailed provenance of data products generated via automated scientific
workflows.
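
As a rough illustration of the idea of a project history, the following minimal Python sketch logs both automated workflow runs and a manual copy performed between runs, then traces the dependencies of a data product across runs. All names here (`ProjectHistory`, `record_run`, `record_copy`, `lineage`, and the file paths) are hypothetical and chosen for illustration only; this is not the Kepler API or the model as defined in the paper.

```python
from dataclasses import dataclass, field


@dataclass
class ProjectHistory:
    """Hypothetical append-only log of events that create data in a project."""
    # Each event is a tuple: (kind, label, input paths, output paths).
    events: list = field(default_factory=list)

    def record_run(self, workflow, inputs, outputs):
        """Log an automated workflow run that read `inputs` and wrote `outputs`."""
        self.events.append(("run", workflow, tuple(inputs), tuple(outputs)))

    def record_copy(self, src, dst):
        """Log a manual step between runs: the user copied `src` to `dst`."""
        self.events.append(("copy", "manual", (src,), (dst,)))

    def lineage(self, path):
        """Return every path that `path` depends on, directly or transitively."""
        # Map each produced path to the inputs of the events that produced it.
        parents = {}
        for _kind, _label, inputs, outputs in self.events:
            for out in outputs:
                parents.setdefault(out, set()).update(inputs)
        # Walk the dependency graph backwards from `path`.
        seen, stack = set(), [path]
        while stack:
            for parent in parents.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


history = ProjectHistory()
history.record_run("align", ["raw/seqs.fasta"], ["run1/alignment.nex"])
history.record_copy("run1/alignment.nex", "selected/alignment.nex")
history.record_run("infer_tree", ["selected/alignment.nex"], ["run2/tree.nwk"])
print(sorted(history.lineage("run2/tree.nwk")))
# prints ['raw/seqs.fasta', 'run1/alignment.nex', 'selected/alignment.nex']
```

Because the manual copy is logged alongside the two runs, the lineage of the final tree reaches back through the second run, the folder operation, and the first run to the raw input.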
This work was supported in part by NSF grants DBI-053368, EAR-0225673, IIS-0630033, IIS-0612326, and EF-0228651, and by DOE grant
DE-FC02-01ER25486.