Site Reliability Engineering

How Google Runs Production Systems

by: Niall Richard Murphy, Betsy Beyer, Chris Jones, Jennifer Petoff

in: Networking & Cloud Computing

Summary:

The book describes how Google keeps its large-scale systems reliable and scalable, covering principles and practices such as service level objectives (SLOs), automation, and incident management. It also explains the role of the Site Reliability Engineer (SRE) and how SREs balance system stability against the demand for new features and growth.

Key points:

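1. Risk Acceptance: The book argues for weighing the risk of failure against the cost of preventing it, and introduces the "error budget": the amount of unreliability a service is allowed, derived from its SLO. While budget remains, teams can spend it on launches and changes; once it is used up, reliability work takes priority over feature development.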
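The arithmetic behind an error budget can be sketched in a few lines of Python. This is an illustrative sketch only, not code from the book: it assumes a request-based availability SLO, and the function names, the 99.9% target, and the request counts are all made up for the example.

```python
def error_budget(slo_target: float, total_requests: int) -> int:
    """Number of failed requests the SLO tolerates in the measurement window.

    Assumes a request-based availability SLO, e.g. 0.999 means 99.9% of
    requests must succeed. The budget is simply (1 - SLO) * total requests.
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    return round((1.0 - slo_target) * total_requests)


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        # A 100% target leaves no budget at all to spend.
        return 0.0
    return 1.0 - failed_requests / budget


def should_prioritize_reliability(slo_target: float, total_requests: int, failed_requests: int) -> bool:
    """Hypothetical policy: pause feature work once the budget is used up."""
    return budget_remaining(slo_target, total_requests, failed_requests) <= 0.0


if __name__ == "__main__":
    # Hypothetical numbers: a 99.9% SLO over one million requests.
    print(error_budget(0.999, 1_000_000))
    print(budget_remaining(0.999, 1_000_000, 250))
    print(should_prioritize_reliability(0.999, 1_000_000, 1_200))
```

With these made-up numbers, a 99.9% target over one million requests leaves room for 1,000 failures; 250 failures would leave three quarters of the budget unspent, while 1,200 would overspend it and tip the decision toward reliability work.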
