Managing large datasets has become a major application of Grids. Life science applications usually manage large databases
that must be replicated for the applications to scale. The growing number of users and the ease of access to Internet-based applications
have put Grid middleware under stress. Such environments must therefore manage data and schedule computation tasks at the same time.
These two important operations have to be tightly coupled. This paper presents an algorithm (Scheduling and Replication Algorithm,
SRA) that combines data management and scheduling using a steady-state approach. Given a model of the platform, the number
of requests and their distribution, and the number and size of the databases, we define a linear program that satisfies all the
constraints at every level of the platform in steady state. The solution of this linear program gives a placement of
the databases on the servers and, for each kind of job, the server on which it should be executed. Our
theoretical results are validated using simulation and logs from a large life science application.
Key words: bioinformatics applications - data management - Grid computing - scheduling
This work was supported in part by the ACI GRID and Grid5000 projects of the French Department of Research.
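
The kind of steady-state constraints mentioned in the abstract can be illustrated with a minimal sketch. It is not the SRA linear program itself: it only checks whether a candidate database placement and a set of per-server request rates respect storage, compute, and data-locality limits. The platform model, all names (`Server`, `check_steady_state`, `throughput`), and all numbers are assumptions made for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    compute: float  # work units the server can process per second (assumed unit)
    storage: float  # storage available for databases (assumed unit)


def check_steady_state(servers, db_size, job_db, job_cost, placement, rates):
    """Return the list of violated constraints (empty if the plan is feasible).

    placement: set of (server, database) pairs
    rates: dict mapping (server, job kind) to requests served per second
    """
    violations = []
    for name, srv in servers.items():
        # Storage: the databases placed on a server must fit on it.
        stored = sum(db_size[d] for (s, d) in placement if s == name)
        if stored > srv.storage:
            violations.append(f"storage exceeded on {name}")
        # Compute: the work arriving per second must not exceed capacity.
        load = sum(r * job_cost[k] for (s, k), r in rates.items() if s == name)
        if load > srv.compute:
            violations.append(f"compute exceeded on {name}")
    # Locality: a job kind may only run where its database is replicated.
    for (s, k), r in rates.items():
        if r > 0 and (s, job_db[k]) not in placement:
            violations.append(f"{k} runs on {s} without {job_db[k]}")
    return violations


def throughput(rates):
    """Total requests per second served for each job kind."""
    total = {}
    for (_, k), r in rates.items():
        total[k] = total.get(k, 0.0) + r
    return total


# Hypothetical two-server platform with two databases and two job kinds.
servers = {"s1": Server(compute=100.0, storage=10.0),
           "s2": Server(compute=50.0, storage=5.0)}
db_size = {"dbA": 6.0, "dbB": 4.0}
job_db = {"jobA": "dbA", "jobB": "dbB"}
job_cost = {"jobA": 10.0, "jobB": 5.0}
placement = {("s1", "dbA"), ("s1", "dbB"), ("s2", "dbB")}
rates = {("s1", "jobA"): 8.0, ("s1", "jobB"): 4.0, ("s2", "jobB"): 10.0}

feasible_violations = check_steady_state(
    servers, db_size, job_db, job_cost, placement, rates)
served = throughput(rates)
```

A linear program would search over `placement` and `rates` to maximize the served throughput; this sketch only verifies one candidate, which is the simpler half of the problem.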