The RAIN project is a research collaboration between Caltech and NASA-JPL on distributed computing and data storage systems
for future spaceborne missions. The goal of the project is to identify and develop key building blocks for reliable distributed
systems built with inexpensive off-the-shelf components. The RAIN platform consists of a heterogeneous cluster of computing
and/or storage nodes connected via multiple interfaces to networks configured in fault-tolerant topologies. The RAIN software
components run in conjunction with operating system services and standard network protocols. Through software-implemented
fault tolerance, the system tolerates multiple node, link, and switch failures, with no single point of failure. The RAIN technology
has been transferred to Rainfinity, a start-up company focusing on creating clustered solutions for improving the performance
and availability of Internet data centers. In this paper we describe the following contributions: 1) fault-tolerant interconnect
topologies and communication protocols providing consistent error reporting of link failures; 2) fault management techniques
based on group membership; and 3) data storage schemes based on computationally efficient error-control codes. We present
several proof-of-concept applications: highly available video and web servers, and a distributed checkpointing system.
Supported in part by an NSF Young Investigator Award (CCR-9457811), by a Sloan Research Fellowship, by an IBM Partnership
Award and by DARPA through an agreement with NASA/OSAT.