Site Reliability Engineering

How Google Runs Production Systems
Summary:

The book describes how Google keeps its large-scale production systems reliable and scalable, covering principles and practices such as service level objectives (SLOs), automation, and incident management. It also explains the role of a Site Reliability Engineer (SRE) and how SREs balance system stability against the demand for new features and growth.

Key points:

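1. Risk Acceptance: The book advocates weighing the cost of failures against the cost of preventing them, rather than aiming for perfect reliability. It introduces the "error budget", the amount of unreliability a service may accumulate while still meeting its SLO, which helps teams decide when to prioritize reliability work over feature development.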
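
As a rough illustration of the error budget idea, the sketch below treats the budget as one minus the SLO target, applied to the number of requests in a measurement window. The function names, the request-based measure, and the "pause feature work once the budget is spent" rule are assumptions made for this example, not code or policy taken from the book.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Return how many failed requests the SLO tolerates in the window.

    slo_target is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent.

    The result is negative when more failures occurred than the budget allows.
    """
    budget = error_budget(slo_target, total_requests)
    return 1.0 - failed_requests / budget


def should_prioritize_reliability(slo_target: float, total_requests: int,
                                  failed_requests: int) -> bool:
    """Assumed policy: shift focus to reliability once the budget is used up."""
    return budget_remaining(slo_target, total_requests, failed_requests) <= 0.0


if __name__ == "__main__":
    # A 99.9% SLO over one million requests tolerates about 1,000 failures.
    print(error_budget(0.999, 1_000_000))
    print(should_prioritize_reliability(0.999, 1_000_000, 400))
    print(should_prioritize_reliability(0.999, 1_000_000, 1_500))
```

With these example numbers, 400 failures leave part of the budget unspent, so feature work could continue, while 1,500 failures overspend it and would tip the decision toward reliability work.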

Books similar to "Site Reliability Engineering":