When massive volumes of data must always be available on the Internet, how is that availability engineered?
A one-in-a-million hardware fluke would happen every day at Google's scale, so reliability must be handled by smart software rather than by better hardware. In this talk from the 2008 Velocity Conference, Sean Quinlan of Google describes the tools that Google uses to manage terabytes of data spread over millions of machines.
Because their needs are too big for any single machine, Google does not optimize for single-machine performance. Instead, they look for the most performance per dollar and buy lots of it. They then layer software on top that replicates data across multiple machines to compensate for the failure of any one of them. The two major systems they use are the Google File System (GFS) and BigTable. GFS is a cluster file system layered over a data center that stores chunks of data in one of two restricted forms: append-only sequences similar to log files, or read-only sorted tables of key/value pairs. These restrictions allow seamless and reliable storage at Google's immense scale.
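To make the two restricted forms concrete, here is a minimal sketch (not Google's code; the class names and structure are illustrative assumptions): an append-only log that only ever grows, and a read-only sorted key/value table that is built once and then only queried. Both restrictions simplify replication and recovery, since data is never modified in place.

```python
import bisect


class AppendOnlyLog:
    """Records can only be appended, never modified in place.

    Hypothetical sketch of an append-only sequence like the
    log-file form described above.
    """

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)

    def replay(self):
        # Recovery re-reads the log from the beginning.
        yield from self._records


class SortedTable:
    """Read-only sorted key/value table, built once, then only queried.

    Hypothetical sketch of the second form; because the table is
    immutable and sorted, lookups are a simple binary search.
    """

    def __init__(self, pairs):
        self._pairs = sorted(pairs)                 # sort once at build time
        self._keys = [k for k, _ in self._pairs]    # parallel key list

    def get(self, key):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._pairs[i][1]
        return None


# Usage: append updates to the log, then compact them into an
# immutable sorted table for fast reads.
log = AppendOnlyLog()
log.append(("banana", 2))
log.append(("apple", 1))
table = SortedTable(log.replay())
print(table.get("apple"))   # 1
print(table.get("cherry"))  # None
```

The design choice this illustrates: by forbidding in-place updates, a system never has to coordinate concurrent writers modifying the same bytes, which makes replicating the data across many unreliable machines far simpler.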