‘The Google File System’ by Ghemawat et al. was published at the ACM Symposium on Operating Systems Principles (SOSP) in 2003. The paper describes the file system developed by Google (in case the paper title didn’t give that away) to support their increasing storage needs. It is a seminal paper and a staple reading for almost every graduate class on operating systems, cloud computing, distributed systems, or storage systems. The paper is densely packed with great systems design insights, and I found it hard to “summarize” because it really deserves a thorough reading. So instead of trying to explain the system design in detail, I am going to highlight some of my personal favourite insights from the paper that I believe are applicable to systems design in general.
Understand your workload and optimize for it. GFS explicitly optimizes for large files that are mostly appended to and read sequentially, rather than for small files or random writes.
Commodity hardware with software-based fault tolerance. Instead of using high-end hardware, GFS runs on off-the-shelf hardware, which fails frequently, and handles fault tolerance in software. This design is a good example of the ‘end-to-end principle in system design’ as well. Not surprisingly, this is almost a given in most systems today.
Simplicity is good, as long as it meets your requirements. GFS uses a single master, which greatly simplifies its design and implementation. However, a single master limits how large a GFS cluster can grow. Because that limit was higher than what Google needed at the time, the designers of GFS chose the simple design.
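To get a feel for why a single master was enough, here is a back-of-the-envelope sketch. It assumes the figures reported in the paper (64 MB chunks and less than 64 bytes of master metadata per chunk). The function name is my own, and the estimate ignores per-file namespace metadata, so treat it as a rough upper bound on chunk metadata only.

```python
CHUNK_SIZE = 64 * 2**20      # 64 MB chunks, as reported in the paper
METADATA_PER_CHUNK = 64      # bytes; the paper says "less than 64 bytes" per chunk


def master_chunk_metadata_bytes(stored_bytes: int) -> int:
    """Rough upper bound on master memory needed for chunk metadata."""
    chunks = -(-stored_bytes // CHUNK_SIZE)  # ceiling division
    return chunks * METADATA_PER_CHUNK


one_pib = 2**50
print(master_chunk_metadata_bytes(one_pib) / 2**30, "GiB")  # 1.0 GiB
```

Under these assumptions, a petabyte of file data costs the master roughly a gigabyte of memory for chunk metadata, which helps explain why the simple design was good enough for a long time.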
Invest in debugging infrastructure. The authors note that detailed diagnostic logs were very useful for debugging problems in the distributed system.
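The paper mentions that GFS servers log significant events as well as RPC requests and replies. As a minimal sketch of the same idea (not how GFS does it), the decorator below logs every call and its reply using Python's standard `logging` module. The logger name, `log_rpc`, and `get_chunk_location` are all hypothetical names made up for illustration.

```python
import functools
import logging

logger = logging.getLogger("rpc")  # hypothetical logger name


def log_rpc(func):
    """Log each call's arguments and its reply (or exception)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.info("request %s args=%r kwargs=%r", func.__name__, args, kwargs)
        try:
            reply = func(*args, **kwargs)
        except Exception:
            # Record the traceback, then let the caller see the error.
            logger.exception("failed %s", func.__name__)
            raise
        logger.info("reply %s -> %r", func.__name__, reply)
        return reply
    return wrapper


@log_rpc
def get_chunk_location(filename: str, chunk_index: int) -> str:
    # Stand-in for a real RPC handler; returns a made-up server name.
    return f"chunkserver-{chunk_index % 3}"
```

Calling `logging.basicConfig(level=logging.INFO)` before invoking `get_chunk_location` prints one request line and one reply line per call, which is often enough to reconstruct the order of events across machines after the fact.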
That’s a wrap. There are obviously other interesting design ideas and insights from the GFS paper and many of them are widely used now. The above is just a collection of my favourite takeaways from reading the paper after a long time.